
Algorithms and Library Software for

Periodic and Parallel Eigenvalue Reordering and

Sylvester-Type Matrix Equations with

Condition Estimation

Robert Granat

PHD THESIS, NOVEMBER 2007

DEPARTMENT OF COMPUTING SCIENCE AND HPC2N, UMEÅ UNIVERSITY


Department of Computing Science
Umeå University
SE-901 87 Umeå, Sweden

[email protected]

Copyright © 2007 by Robert Granat
Except Paper I, © SIAM Journal on Matrix Analysis and Applications, 2006

Paper II, © Springer Netherlands, 2007

Papers III–IV, © Robert Granat and Bo Kågström, 2007

Paper V, © Robert Granat, Bo Kågström and Daniel Kressner, 2007

Cover, © Björn Granat, 2007

ISBN 978-91-7264-410-6
ISSN 0348-0542
UMINF-07.21

Printed by Print & Media, Umeå University, 2007:2003806


Abstract

This Thesis contains contributions in two different but closely related subfields of Scientific and Parallel Computing which arise in the context of various eigenvalue problems: periodic and parallel eigenvalue reordering and parallel algorithms for Sylvester-type matrix equations with applications in condition estimation.

Many real world phenomena behave periodically, e.g., helicopter rotors, revolving satellites and dynamic systems corresponding to natural processes, like the water flow in a system of connected lakes, and can be described in terms of periodic eigenvalue problems. Typically, eigenvalues and invariant subspaces (or, specifically, eigenvectors) of certain periodic matrix products are of interest and have direct physical interpretations. The eigenvalues of a matrix product can be computed without forming the product explicitly via variants of the periodic Schur decomposition. In the first part of the Thesis, we propose direct methods for eigenvalue reordering in the periodic standard and generalized real Schur forms which extend earlier work on the standard and generalized eigenvalue problems. The core step of the methods consists of solving periodic Sylvester-type equations to high accuracy. Periodic eigenvalue reordering is vital in the computation of periodic eigenspaces corresponding to specified spectra. The proposed direct reordering methods rely on orthogonal transformations and can be generalized to more general periodic matrix products where the factors have varying dimensions and ±1 exponents of arbitrary order.

In the second part, we consider Sylvester-type matrix equations, like the continuous-time Sylvester equation AX − XB = C, where A of size m × m, B of size n × n, and C of size m × n are general matrices with real entries, which have applications in many areas. Examples include eigenvalue problems and condition estimation, and several problems in control system design and analysis. The parallel algorithms presented are based on the well-known Bartels–Stewart method and extend earlier work on triangular Sylvester-type matrix equations, resulting in a novel software library SCASY. The parallel library provides robust and scalable software for solving 44 sign and transpose variants of eight common Sylvester-type matrix equations. SCASY also includes a parallel condition estimator associated with each matrix equation.

In the last part of the Thesis, we propose parallel variants of the direct eigenvalue


reordering method for the standard and generalized real Schur forms. Together with the existing and future parallel implementations of the non-symmetric QR/QZ algorithms and the parallel Sylvester solvers presented in the Thesis, the developed software can be used for parallel computation of invariant and deflating subspaces corresponding to specified spectra and associated reciprocal condition number estimates.


Preface

This PhD thesis consists of the following five papers and an introduction including a summary of the papers.

Paper I R. Granat and B. Kågström. Direct Eigenvalue Reordering in a Product of Matrices in Periodic Schur Form.1 From Technical Report UMINF-05.05, Department of Computing Science, Umeå University. Published in SIAM J. Matrix Anal. Appl. 28(1), 285–300, 2006.

Paper II R. Granat, B. Kågström, and D. Kressner. Computing Periodic Deflating Subspaces Associated with a Specified Set of Eigenvalues.2 From Technical Report UMINF-06.29, Department of Computing Science, Umeå University. Accepted for publication in BIT Numerical Mathematics, June 2007.

Paper III R. Granat and B. Kågström. Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part I: Theory and Algorithms. From Technical Report UMINF-07.15, Department of Computing Science, Umeå University. Submitted to ACM Transactions on Mathematical Software, July 2007.

Paper IV R. Granat and B. Kågström. Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part II: The SCASY Software Library. From Technical Report UMINF-07.16, Department of Computing Science, Umeå University. Submitted to ACM Transactions on Mathematical Software, July 2007.

Paper V R. Granat, B. Kågström, and D. Kressner. Parallel Eigenvalue Reordering in Real Schur Forms. From Technical Report UMINF-07.20, Department of Computing Science, Umeå University. Submitted to Concurrency and Computation: Practice and Experience, September 2007. Also published as LAPACK Working Note #192.

1 Reprinted by permission of SIAM Journal on Matrix Analysis and Applications.
2 Reprinted by permission of Springer Netherlands.


The papers concern periodic and parallel algorithms and software for eigenvalue reordering, Sylvester-type matrix equations and condition estimation.


Acknowledgements

First of all, I thank my supervisor Professor Bo Kågström, co-author of all papers in this contribution, for the past 5+ years of cooperation. Thanks for your guidance, your true commitment to our common projects and your care about me as a student. Bo also read earlier versions of this manuscript and gave many constructive comments, for which I am grateful.

Next, I want to send big thanks to Dr Isak Jonsson, my assistant supervisor, especially for the assistance with the linear algebra kernel implementations.

Thanks to Dr Daniel Kressner, co-author of two of the papers in this Thesis, for fruitful collaboration. I am looking forward to visiting you, Ana and your daughter Mara in Switzerland. Also, thanks for constructive comments on earlier versions of this manuscript.

Thanks to Dr.-Ing. András Varga for valuable discussions on periodic systems and related applications.

For the last years I have enjoyed the company of Lars Karlsson who started the PhD programme last year. Thank you and good luck in your future thesis work. And watch out for antique programming models!

Very special thanks to Pedher Johansson for the superior LaTeX templates you provided and many thanks to my younger brother Björn ("I don't do photos, only pictures...") for all assistance with the cover.

Thanks to all other friends and colleagues at the Department of Computing Science, especially the members of the Numerical Linear Algebra and Parallel and High-Performance Computing Groups.

Thanks to the staff at HPC2N (High Performance Computing Center North) for providing a great computing environment and superior technical support. Sorry guys, there will be no more cakes after today. By the way, it was not my fault!

I wish to thank my family Eva, Elias and Anna for enriching my life in so many ways. I love you with all of my heart!

Also many thanks to my parents Ann and Ulf, my two other younger brothers Arvid ("Thanks, but no thanks!") and Lars ("Coolt!"), and my grandparents Margareta and Sigvard for all your encouragement. I also want to thank my mother-in-law, Gun-Brith,


for the greatest moose stew in the world!

Financial support was provided jointly by the Faculty of Science and Technology, Umeå University, by the Swedish Research Council under grant VR 621-2001-3284, and by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.

Finally, without faith in the Lord Jesus Christ, I should never have been able to complete this Thesis in the first place. I owe Him everything and dedicate my work to the glory of His name. The Lord is my Shepherd; I shall not want. (Psalms 23:1)

Umeå, November 2007

Robert Granat


Contents

1 Introduction
    1.1 Motivation for this work
    1.2 Parallel computations, computers and programming models
    1.3 Matrix computations in CACSD and periodic systems
    1.4 Standard and periodic eigenvalue problems
    1.5 Sylvester-type matrix equations
    1.6 High performance linear algebra software libraries
    1.7 The need for improved algorithms and library software

2 Contributions in this Thesis
    2.1 Paper I
    2.2 Paper II
    2.3 Paper III
    2.4 Paper IV
    2.5 Paper V

3 Ongoing and future work
    3.1 Software tools for periodic eigenvalue problems
    3.2 Frobenius norm-based parallel estimators
    3.3 Parallel periodic Sylvester-solvers
    3.4 New parallel versions of real Schur decomposition algorithms
    3.5 Parallel periodic eigenvalue reordering

4 Related papers and project websites

Paper I

Paper II

Paper III

Paper IV


Paper V


CHAPTER 1

Introduction

This chapter motivates and introduces the work presented in this Thesis and gives some background to the topics considered.

1.1 Motivation for this work

The growing demand for high-quality and high-performance library software is driven by the technical development of new computer architectures as well as the increasing need from industry, research facilities and communities to be able to solve larger and more complex problems faster than ever before. Often the problems considered are so complex that they cannot be solved by ordinary desktop computers in a reasonable amount of time. Scientists and engineers are more and more forced to utilize high-end computational devices, like specialized high performance computational units from vendor-constructed shared or distributed memory parallel computers or self-made so-called cluster-based systems consisting of high-end commodity PC processors connected with high-speed networks. High performance computing clusters, even with a relatively small number of processor nodes, are getting very common due to their good scalability properties and high cost-effectiveness, i.e., a relatively low cost per performance unit compared to the more specialized and highly complex supercomputer systems. In fact, any local area network (LAN) connecting a number of workstations can be considered to be a cluster-based system. However, the latter systems are mainly used for high-throughput computing applications (see, e.g., the Condor project [29]), while clusters with specialized high-speed networks are used for challenging parallel computations.

To solve a problem in a reliable and efficient way, a lot of out-of-application considerations must be made regarding solution method, discretization, data distribution and granularity, expected and achieved accuracy of computed results, and how to utilize the available computer power in the best way, e.g., would it be beneficial to use a parallel computer or not, does the state-of-the-art algorithm at hand match the memory


hierarchy of the target computer system well enough, etc. Typically, an appropriate and efficient usage of high performance computing (HPC) systems, like parallel computers, calls for non-trivial reformulations of the problem settings and development of novel and improved, highly efficient and scalable algorithms which are able to match the properties of a wide range of target computer platforms. Therefore, scientists and researchers, engineers and programmers can save a lot of time and effort by utilizing extensively tested, highly efficient, robust and reliable high-quality software libraries as basic building blocks in solving their computational problems. By this procedure, a lot more attention can be focused on the applications and related theory.

Most problems in the real world are non-linear, i.e., the output of a phenomenon or a process does not depend linearly on the input. Moreover, most problems are also continuous, i.e., the output depends continuously on the input. Roughly speaking, the graph of a continuous function can be drawn without ever lifting the pen from the paper. Very few real world problems can be solved analytically, that is, by finding an exact mathematical expression that fully describes the relation between the input and the output of the process or phenomenon. Typically, we are forced to linearize and discretize the problems to make them solvable in a finite amount of time using a finite amount of computational resources (such as computing devices, data storage, network bandwidth, etc.). This means that the computed solution will always be a more or less valid approximation. The good thing is that, by linearization and discretization, many problems can be solved effectively by standard linear algebra methods.

In numerical linear algebra, systems of linear equations, eigenvalue problems, optimization problems and related solution methods, i.e., matrix computations, are studied. The focus is on reliable and efficient algorithms for large-scale matrix computational problems from the point of view of finite-precision arithmetic. Developments in the area of numerical linear algebra often result in widely available public domain software libraries, like LAPACK, SLICOT and ScaLAPACK (see Section 1.6).

1.2 Parallel computations, computers and programming models

The basic idea behind parallel computations is to solve larger computational problems faster: larger, in the sense that the main memory in an ordinary workstation or laptop is not enough to store all data structures associated with a very large computational problem; to have enough storage space we might need to use p times as much main memory. Faster, by using more than one (say p) processing units (each usually corresponding to a single processor or core) concurrently. Consequently, by using p processing units equipped with (common or individual) p storage modules, we may


solve a problem that is up to p times larger (in the sense of the storage requirements) up to p times faster. This is the good news.

The bad news is that it is often required that the problem at hand and existing serial solution methods or algorithms are reformulated in terms of parallel tasks that can be computed concurrently, each making up a small piece of the total solution. This can be very hard, especially when the problem at hand has a very strong sequential nature; one often mentioned simple example is the computation of the Fibonacci numbers

         ⎧ 0,                 if n = 0,
F(n) :=  ⎨ 1,                 if n = 1,          (1.1)
         ⎩ F(n−1) + F(n−2),   if n > 1.

As can be seen from the recursive formula above, no numbers in the sequence can be computed in parallel because of their strong internal dependency. Fortunately, most real world computational problems can be reformulated such that they can be solved by parallel computations. Some problems are indeed embarrassingly parallel and require no synchronization or cooperation between the involved processing units. Other problems can be parallelized at the cost of some explicit cooperation and synchronization among the processing units. To be able to use computer resources effectively, any parallel algorithm should strive to minimize synchronization costs in relation to the amount of arithmetical work performed.
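The sequential dependency in Equation (1.1) can be made concrete in code. The following minimal Python sketch (illustrative only, not from the Thesis) computes F(n) iteratively; each loop iteration must wait for the two previously computed values:

```python
def fib(n: int) -> int:
    """Compute F(n) from Equation (1.1) iteratively.

    Each iteration needs the two previously computed values, so the
    loop body forms a chain of dependent steps that cannot be
    executed concurrently.
    """
    if n < 2:
        return n
    prev, curr = 0, 1  # F(0), F(1)
    for _ in range(2, n + 1):
        prev, curr = curr, prev + curr  # F(k) depends on F(k-1) and F(k-2)
    return curr
```

For example, fib(10) returns 55; every value in the chain is consumed by the next step, which is exactly the dependency that prevents a naive parallel evaluation.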

FIGURE 1: A concept view of a shared memory architecture [101].

For many decades, the two main streams of parallel computers have been shared memory machines (or multiprocessors) and distributed memory machines (or multicomputers). Shared memory machines are typically made up of several high-end processors


having their own respective local cache memories and sharing a number of large main memory modules accessed through a common bus, a multi-stage network or a crossbar network, see Figure 1. Distributed memory machines are in principle a number of stand-alone computers with their own local caches and main memory modules, connected to each other via a network of a certain topology, see Figure 2. The most common topology today is the ring-based k-dimensional torus. For example, for k = 3 this corresponds to a three-dimensional mesh with wrap-around connections, see Figure 3.

FIGURE 2: A concept view of a distributed memory architecture [89].

These two types of parallel computers are (mainly) programmed using two distinct classes of programming models, quite naturally called the shared memory (SM) model and the distributed memory (DM) model. In the SM model, the focus is more on fine-grained parallelism, e.g., parallelizing loops by assigning loop indices to specific threads executing in the shared memory environment. This programming model is very useful in applications of massive data parallelism. In the DM model, the focus is set more towards parallel execution of disjoint (and, mostly, partly connected) tasks and explicit synchronization through communication operations taking the underlying network topology into account. Notice that the latter model can be simulated in the SM model. For example, DM point-to-point communication can be simulated by copying a specific area of memory which contains a message (using, e.g., the C/C++ memcpy function) into an area reserved for the receiving process. Such an approach may be beneficial when porting existing, or simplifying new, implementations of well-known DM algorithm variants for SM environments.
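The idea of simulating DM-style message passing on top of shared memory can be sketched in a few lines. In the following toy Python example (illustrative only; the text refers to memcpy in C/C++, and all names here are made up), a bytearray slice-copy plays the role of memcpy and a threading.Event stands in for the synchronization a real message-passing layer would provide:

```python
import threading

# Toy model of DM point-to-point communication on shared memory:
# "send" copies the sender's buffer into an area reserved for the
# receiver (the role played by memcpy in the text).
MSG_SIZE = 16
send_buf = bytearray(b"hello, process 1")
recv_buf = bytearray(MSG_SIZE)   # receiver's reserved memory area
delivered = threading.Event()    # synchronization: "message arrived"

def sender() -> None:
    recv_buf[:] = send_buf       # the memcpy-like copy
    delivered.set()              # signal that the message is in place

def receiver(out: list) -> None:
    delivered.wait()             # block until the copy has completed
    out.append(bytes(recv_buf))

result: list = []
t_recv = threading.Thread(target=receiver, args=(result,))
t_send = threading.Thread(target=sender)
t_recv.start(); t_send.start()
t_send.join(); t_recv.join()
```

A real implementation would of course add message queues, tags and flow control; the point is only that a "send" in shared memory reduces to a synchronized copy into the receiver's area.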

Hybrid variants of the mentioned mainstream parallel computers also exist, in which programming models from both paradigms can be combined. For an example, see the 64-bit Linux cluster sarek in Paper IV of this Thesis.

One of the challenges today is how to incorporate multicore processors, e.g., dual-cores and quadcores, and the future manycore processors in the parallel programming


FIGURE 3: A square mesh of processing units (nodes) with wrap-around connections [21].

models. For now, many multicore processors can be programmed using programming techniques that have evolved in the SM model, e.g., heavy processes (using fork in Unix/Linux) and lightweight processes (OpenMP [93] or POSIX threads, see, e.g., [97]). The natural motivation is that the cores in most cases share all or parts of the processor's memory hierarchy (from L1 cache to main memory). With the upcoming evolution of manycores [7], which raise the need for local memory modules connected to each individual core and an on-chip inter-core network (e.g., the local storage (LS) modules and the ring network in IBM's Cell processor), the picture becomes more complex and unclear. However, our feeling is that future programming techniques for manycore environments will resemble much of DM programming, with explicit point-to-point communication between the cores over the local network, global communication operations and execution of standard, highly mature and well-known distributed memory algorithms. Notice that some or all of these DM-like features may be hidden from ordinary users behind software libraries or directly in the hardware. Nevertheless, a future scenario like this is very interesting from the point of view that users may execute real parallel algorithms and get true supercomputer performance and speedup on their manycore-based laptops! Therefore, we expect parallel computing to become an even hotter area in the future.

For a good introduction to parallel models, computers and algorithms see, e.g., the standard textbook [48].

1.3 Matrix computations in CACSD and periodic systems

Matrix computations are fundamental to many areas of science and engineering and occur frequently in a variety of applications, for example in Computer-Aided Control System Design (CACSD). In CACSD, various linear control systems are considered, like the following linear continuous-time descriptor system


Eẋ(t) = Ax(t) + Bu(t),
 y(t) = Cx(t) + Du(t),          (1.2)

or a similar discrete-time system of the form

E x_{k+1} = A x_k + B u_k,
      y_k = C x_k + D u_k,      (1.3)

where x(t), x_k ∈ R^n are state vectors, u(t), u_k ∈ R^m are the vectors of inputs (or controls) and y(t), y_k ∈ R^r are the vectors of output. The systems are described by the state matrix pair (A,E) ∈ R^{(n×n)×2}, the input matrix B ∈ R^{n×m}, the output matrix C ∈ R^{r×n}

and the feed-forward matrix D ∈ R^{r×m}. The matrix E is possibly singular. With E = I, where I is the identity matrix of order n, standard state-space systems are considered. Other subsystems described by the tuples (E,A,B) and (E,A,C), which arise from the system pencil corresponding to Equations (1.2)–(1.3):

S(λ) = A − λB = ⎡ A  B ⎤ − λ ⎡ E  0 ⎤
                ⎣ C  D ⎦     ⎣ 0  0 ⎦ ,

are studied when investigating the controllability and observability characteristics of a system (see, e.g., [36, 64]).

Applications with periodic behavior, e.g., rotating helicopter blades and revolving satellites, can be described by discrete-time periodic descriptor systems of the form

E_k x_{k+1} = A_k x_k + B_k u_k,
        y_k = C_k x_k + D_k u_k,          (1.4)

where the matrices A_k, E_k ∈ R^{n×n}, B_k ∈ R^{n×m}, C_k ∈ R^{r×n} and D_k ∈ R^{r×m} are periodic with periodicity K ≥ 1. For example, this means that A_K = A_0, A_{K+1} = A_1, etc.

Important problems studied in CACSD include state-space realization, minimal realization, linear-quadratic (LQ) optimal control, pole assignment, distance to controllability and observability considerations, etc. For details see, e.g., [91].

The systems (1.2)–(1.4) can be studied by various matrix computational approaches, e.g., by solving related eigenvalue and subspace problems. In this area, improved algorithms and software are developed for computing and investigating different subspaces, e.g., condition estimation of invariant or deflating subspaces [73, 75, 74], solving various important matrix equations, like (periodic) Sylvester-type and (periodic) Riccati matrix equations, and computing canonical structure information [32, 33, 37, 38, 64, 63]. One common step in computing such structure information is the need to separate the stable1 and unstable eigenvalues by an eigenvalue reordering technique (see, e.g., [9, 103, 57, 109, 108]).


1.4 Standard and periodic eigenvalue problems

Given a general matrix A ∈ R^{n×n}, the standard eigenvalue problem (see, e.g., [83]) consists of finding n eigenvalue–eigenvector pairs (λ_i, x_i) such that

A x_i = λ_i x_i,   i = 1, ..., n.          (1.5)

Notice that Equation (1.5) only concerns right eigenvectors. Left eigenvectors are defined by y_i^T A = λ_i y_i^T [46], i.e., they are right eigenvectors of the transposed matrix A^T.

The standard method for the general standard eigenvalue problem is the unsymmetric QR algorithm (see, e.g., [81, 44]), which is a backward stable algorithm belonging to a large family of bulge-chasing algorithms [111] that by iteration reduces the matrix A to real Schur form via an orthogonal similarity transformation Q ∈ R^{n×n} such that

Q^T A Q = S,          (1.6)

where all eigenvalues of A appear as 1×1 and 2×2 blocks on the main diagonal of the quasi-triangular matrix S. The column vectors q_i, i = 1, 2, ..., n, of Q are called the Schur vectors of the decomposition (1.6). If S(2,1) = 0 in (1.6), then q_1 = x_1 is an eigenvector associated with the eigenvalue λ_1 = S(1,1). More importantly, given k ≤ n such that no 2×2 block resides in S(k : k+1, k : k+1), the k first Schur vectors q_i, i = 1, 2, ..., k, form an orthonormal basis for an invariant subspace of A associated with the k first eigenvalues λ_1, λ_2, ..., λ_k.

In most practical applications, the retrieved information from the Schur decomposition (eigenvalues and invariant subspaces) is sufficient and the eigenvectors need not be computed explicitly. By definition, the eigenvector x_i belongs to the null space of the shifted matrix A − λ_i I (see, e.g., [83, 63]). In case the matrix A is diagonalizable (see, e.g., [46]), the eigenvectors can be computed by successively reordering each of the eigenvalues in the Schur form to the top-left corner of the matrix (see, e.g., [9, 35, 23]) and reading off the first Schur vector q_1. However, the latter approach is not utilized in practice for the standard eigenvalue problem, but the basic idea can be useful in other contexts (see below).

Example 1 Consider the closed system of lakes2 in Figure 4. The system was polluted when a close-by factory closed down and dumped w_0 cubic meters of toxic waste into

1 The definition of stable and unstable eigenvalues depends on the system considered. However, the common definitions of a stable eigenvalue λ for discrete-time and continuous-time systems are |λ| ≤ 1 and Re(λ) ≤ 0, respectively.

2 The original reference for this problem is unknown, but several variants of it can be found in, e.g., [85, 24].


FIGURE 4: The considered system of three (polluted) connected lakes

one of the lakes (marked polluted lake). Every year there is a certain flow of the toxic waste between the three lakes in the system such that f_ij percent of the waste in lake i is transferred to lake j. Given w_0 and f_ij, i, j = 1, 2, 3, the task is to formulate this as a linear algebra problem and to predict the long-time behavior of the system.

Assume w_0 = 1000 and f_12 = 2, f_13 = 5, f_21 = 3, f_23 = 5, f_31 = 3, f_32 = 2. Then the concentration c = [c_1 c_2 c_3]^T at year k+1 can be computed as

c_{k+1} = A · c_k = A^k · c_0,          (1.7)

with

        ⎡ 0.93  0.03  0.03 ⎤
    A = ⎢ 0.02  0.92  0.02 ⎥
        ⎣ 0.05  0.05  0.95 ⎦

and c_0 = [1000, 0, 0]^T.

To predict the long-time behavior, denote the (right) eigenvalue–eigenvector pairs of A by (λ_i, x_i) and notice that c_0 can be written as

c_0 = β_1 x_1 + β_2 x_2 + β_3 x_3          (1.8)

where β_i, i = 1, 2, 3, are the change-of-basis coordinates of c_0 computed from the linear system [x_1, x_2, x_3][β_1, β_2, β_3]^T = c_0.

By combining (1.7) and (1.8), we get

c_{k+1} = A^k (β_1 x_1 + β_2 x_2 + β_3 x_3) = β_1 λ_1^k x_1 + β_2 λ_2^k x_2 + β_3 λ_3^k x_3,          (1.9)

i.e., the long-time behavior of the system can be predicted by solving an eigenvalue problem of the form (1.5). Actually, the values β_i, i = 1, 2, 3, do not need to be computed explicitly for this example (see below).


By invoking the MATLAB Schur decomposition command schur, we get the following output matrices3

        ⎡ 0.9000  0.0149  −0.0050 ⎤        ⎡ −0.7926  0.6097   0      ⎤
    S = ⎢ 0       1.0000  −0.0340 ⎥,   Q = ⎢  0.2265  0.2944  −0.9285 ⎥,
        ⎣ 0       0        0.9000 ⎦        ⎣  0.5661  0.7359   0.3714 ⎦

i.e., the eigenvalues of A are λ_1 = λ_3 = 0.9000 and λ_2 = 1.0000. This implies that

c_∞ = β_2 x_2,          (1.10)

since all other terms of (1.9) converge to zero as k → ∞. The remaining task is now to compute x_2, which we do (and illustrate) from a reordered Schur form. By invoking the MATLAB command ordschur, the eigenvalue λ_2 is reordered to the top-left corner of the updated Schur decomposition

        ⎡ 1.0000  0.0149  −0.0343 ⎤        ⎡ 0.4867   0.8736   0      ⎤
    S = ⎢ 0       0.9000   0.0000 ⎥,   Q = ⎢ 0.3244  −0.1807  −0.9285 ⎥,
        ⎣ 0       0        0.9000 ⎦        ⎣ 0.8111  −0.4519   0.3714 ⎦

and x_2 = Q(:,1) can be read off. By scaling down each element in x_2 by ‖x_2‖_1 = ∑_i (x_2)_i (such that each element in x_2 represents the fraction of w_0 residing in the corresponding lake), we arrive at

c_∞ = w_0 [0.3000, 0.2000, 0.5000]^T = [300.0, 200.0, 500.0]^T,

i.e., in the long run the system arrives at a stable equilibrium where lake 1 contains 300 cubic meters of toxic waste, lake 2 contains 200 cubic meters and lake 3 contains 500 cubic meters. Furthermore, this equilibrium is reached after 50 to 60 years, see Figure 5.
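The equilibrium can also be verified without any eigenvalue machinery by iterating Equation (1.7) directly, as done for Figure 5. A minimal pure-Python sketch (illustrative only, not part of the Thesis software):

```python
# Iterate c_{k+1} = A * c_k from Example 1; the iterates approach the
# equilibrium spanned by the eigenvector x_2 of the eigenvalue 1.
A = [[0.93, 0.03, 0.03],
     [0.02, 0.92, 0.02],
     [0.05, 0.05, 0.95]]
c = [1000.0, 0.0, 0.0]  # w_0 cubic meters dumped into lake 1

for _ in range(300):  # well past the 50-60 years visible in Figure 5
    c = [sum(A[i][j] * c[j] for j in range(3)) for i in range(3)]

equilibrium = [round(ci, 1) for ci in c]
# equilibrium is [300.0, 200.0, 500.0], matching c_inf above
```

The transient terms of (1.9) decay like 0.9^k, so after 300 steps the iterate agrees with c_∞ = [300, 200, 500] to well below rounding precision; note also that the column sums of A equal 1, so the total amount of waste stays at w_0 throughout.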

The periodic (or product) eigenvalue problem (see, e.g., [79, 111]) consists, in its simplest form, of computing eigenvalues and invariant subspaces of the matrix product

A = A_{K−1} A_{K−2} ··· A_0,          (1.11)

where A_0, A_1, ..., A_{K−1}, with A_{K+i} = A_i, i = 0, 1, ..., is a K-cyclic matrix sequence. Such problems can arise from forming the monodromy matrix [110] of discrete-time periodic descriptor systems of the form (1.4) with E = I. Another well-known product eigenvalue problem is the singular value decomposition which, implicitly, consists of computing eigenvalues and eigenvectors of the matrix product AA^T [111].

³ All computed numbers in this introduction are rounded and presented to only four-decimal accuracy, while all computations are performed in full double precision arithmetic.



FIGURE 5: The level of pollution in each lake as a function of time illustrated by explicit application of Equation (1.7). A stable equilibrium is reached after 50 to 60 years.

For cost and accuracy reasons, it is necessary to work with the individual factors in (1.11) rather than forming the product A explicitly [22, 111].

In general, the eigenvalues of a K-cyclic matrix product are obtained by computing the periodic real Schur form (PRSF) [22, 57]

Z_{k⊕1}^T A_k Z_k = S_k,   k = 0, 1, . . . , K−1,   (1.12)

where k ⊕ 1 = (k + 1) mod K, the matrices Z_0, Z_1, . . . , Z_{K−1} are orthogonal, and the sequence S_k consists of K−1 upper triangular matrices and one upper quasi-triangular matrix. This is done by applying the periodic QR algorithm [22, 78] to the matrix sequence A_0, A_1, . . . , A_{K−1}. The periodic QR algorithm is essentially analogous to the standard QR algorithm applied to a (block) cyclic matrix [80]. The placement of the quasi-triangular matrix may be specified to fit the actual application. Sometimes, for example in pole assignment, the resulting PRSF should be ordered [107], i.e., the eigenvalues must appear in a specified order along the diagonals of the individual S_k factors.
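The analogy with a (block) cyclic matrix can be checked numerically: embedding A_0, . . . , A_{K−1} into a block cyclic matrix C makes C^K block diagonal with cyclic permutations of the product on its diagonal, so the eigenvalues of C are K-th roots of the eigenvalues of the product. A small NumPy sketch (illustrative only; the periodic QR algorithm never forms C or the product explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 2
A = [rng.standard_normal((n, n)) for _ in range(K)]

# Block cyclic embedding: A_{K-1} in the top-right corner, the factors
# A_0, ..., A_{K-2} on the block subdiagonal.
C = np.zeros((K * n, K * n))
C[:n, -n:] = A[K - 1]
for k in range(K - 1):
    C[(k + 1) * n:(k + 2) * n, k * n:(k + 1) * n] = A[k]

# C^K is block diagonal with cyclic permutations of A_{K-1}...A_0 on its
# diagonal, so eig(C)^K reproduces eig(A_{K-1}...A_0), each K times.
prod = np.linalg.multi_dot(A[::-1])
lhs = np.linalg.eigvals(C) ** K            # K*n values
rhs = np.tile(np.linalg.eigvals(prod), K)  # n values, repeated K times
```

The two multisets lhs and rhs agree to rounding error, which is exactly the eigenvalue correspondence exploited by the periodic QR algorithm.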

Periodic descriptor systems of the form (1.4) are conceptually studied by computing eigenvalues and eigenspaces of matrix products of the form

E_{K−1}^{−1} A_{K−1} E_{K−2}^{−1} A_{K−2} · · · E_1^{−1} A_1 E_0^{−1} A_0,   (1.13)

which can be accomplished via the generalized periodic Schur decomposition [22, 57]: there exists a K-cyclic orthogonal matrix pair sequence (Q_k, Z_k) with Q_k, Z_k ∈ R^{n×n}


such that

S_k = Q_k^T A_k Z_k,
T_k = Q_k^T E_k Z_{k⊕1},   (1.14)

where all matrices S_k, except for some fixed index j with 0 ≤ j ≤ K−1, and all matrices T_k are upper triangular. The matrix S_j is upper quasi-triangular; typically j is chosen to be 0 or K−1 but can, in principle, be any index in the sequence. The sequence (S_k, T_k) is called the generalized periodic real Schur form (GPRSF) of (A_k, E_k), k = 0, 1, . . . , K−1, and the generalized eigenvalues λ_i = (α_i, β_i) of the corresponding product (1.13) can be computed from the diagonals of the triangular factors. Notice that calculating the eigenvalues corresponding to a 2×2 diagonal block in S_j, which signals a complex conjugate pair of eigenvalues, requires some postprocessing, including forming explicit 2×2 matrix products of length K, which may cause poor accuracy⁴ or even numerical over- or underflow; scaling techniques could offer a remedy for such ill-conditioned cases. It is sometimes better to keep the generalized eigenvalues in factored form, which allows, e.g., computing the logarithms of eigenvalue magnitudes without forming the product explicitly.
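The advantage of the factored form is easy to see already for scalar data: the magnitude ∏ |α_k|/|β_k| of a generalized eigenvalue of a long product may overflow in double precision even though its logarithm is perfectly representable. A small sketch (the factor values are illustrative):

```python
import numpy as np

# Diagonal entries (alpha_k, beta_k) of the triangular factors for one
# eigenvalue of a product of length K = 200 (illustrative values)
K = 200
alphas = np.full(K, 1.0e30)
betas = np.full(K, 1.0e25)

# Forming |lambda| = prod(alpha_k / beta_k) = (1e5)^200 = 1e1000 overflows:
with np.errstate(over='ignore'):
    explicit = np.prod(alphas / betas)   # inf in double precision

# ...while the factored form yields the log-magnitude without trouble:
log10_mag = np.sum(np.log10(np.abs(alphas)) - np.log10(np.abs(betas)))
print(explicit, log10_mag)   # inf versus 1000.0
```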

The GPRSF provides the necessary tools to avoid computing the matrix product (1.13) explicitly (which can cause numerical instabilities) and makes it possible to handle possibly singular factors E_k. Moreover, by ordering the eigenvalues in the GPRSF, periodic deflating subspaces associated with a specified set of eigenvalues can be computed. This is important in certain applications, e.g., solving periodic Riccati-type matrix equations [57, 65, 52].

Each formal K-cyclic matrix product of the form (1.11) is associated with a matrix tuple

A = (A_{K−1}, A_{K−2}, . . . , A_1, A_0)

and an associated vector tuple x = (x_{K−1}, x_{K−2}, . . . , x_1, x_0), with x_k ≠ 0, is called a right eigenvector of the tuple A corresponding to the eigenvalue λ if there exist scalars µ_k, possibly complex, such that the relations

A_k x_k = µ_k x_{k+1},   k = 0, 1, . . . , K−1,
λ := µ_{K−1} µ_{K−2} · · · µ_1 µ_0,   (1.15)

hold with x_K = x_0 [14]. A left eigenvector y of the tuple A corresponding to λ is defined similarly. In this context, a direct eigenvalue reordering method may be utilized to compute the eigenvector corresponding to each eigenvalue of the periodic Schur form by reordering that eigenvalue to the top-left corner of the periodic Schur form, similarly to the non-periodic case. Corresponding definitions of eigenvectors exist for more general matrix products of the form (1.13), see, e.g., [14].

4 Matrix multiplication is not backward stable in general for n > 1 [61].


Example 2 Consider again the system of lakes in Example 1. A more rigorous investigation of the system revealed that the water flow between the lakes could not be captured by the simple full-year model above. Furthermore, it had a clear periodic behavior and differed depending on the seasons, as follows:

c_{k+1} = A_4 A_3 A_2 A_1 · c_k,   (1.16)

or more specifically:

c_{k,2} = A_1 · c_{k,1},
c_{k,3} = A_2 · c_{k,2},
c_{k,4} = A_3 · c_{k,3},
c_{k+1,1} = A_4 · c_{k,4},   (1.17)

where

A1 = [ 0.9400  0.0300  0.0200
       0.0100  0.9300  0.0200
       0.0500  0.0400  0.9600 ],

A2 = [ 0.9200  0.0500  0.0300
       0.0100  0.9400  0.0100
       0.0700  0.0100  0.9600 ],

A3 = [ 0.9300  0.0100  0.0100
       0.0400  0.9100  0.0400
       0.0300  0.0800  0.9500 ],

A4 = [ 0.9300  0.0300  0.0600
       0.0200  0.9000  0.0100
       0.0500  0.0700  0.9300 ],

and the subindices 1, 2, 3, and 4 correspond to the seasons summer, autumn, winter and spring, respectively. Notice that A = (A1 + A2 + A3 + A4)/4, where A is the transition matrix from Example 1, i.e., the model in Example 1 was based on average measurements.

To avoid losing accuracy or information, we keep the matrix product above in factored form and compute the eigenvalues via the periodic Schur form, which results in the eigenvalues λ_{1,2} = 0.6555 ± 0.0004i, λ_3 = 1.0000. Notice that |λ_{1,2}| < 1. After reordering the diagonal block sequence corresponding to the eigenvalue 1.0000 to the top-left corner of the periodic Schur form, we end up with the sequence S_k as

S1 = [ −1.0020  −0.0023  −0.0271
             0   0.9150   0.0007
             0  −0.0007   0.9132 ],

S2 = [ −1.0028  −0.0150   0.0191
             0   0.8864   0.0350
             0        0   0.9313 ],

S3 = [ 0.9960  0.0468   0.0514
            0  0.9100  −0.0372
            0       0   0.8831 ],

S4 = [ 0.9993  −0.0251  0.0495
            0   0.8883  0.0028
            0        0  0.8727 ],


and the sequence Qk of transformation matrices as

Q1 = [ −0.4964   0.7844  −0.3718
       −0.3226  −0.5643  −0.7599
       −0.8059  −0.2572   0.5332 ],

Q2 = [ 0.4914   0.7827  −0.3820
       0.3205  −0.5704  −0.7563
       0.8098  −0.2492   0.5311 ],

Q3 = [ −0.4911   0.7635  −0.4195
       −0.3134  −0.6042  −0.7327
       −0.8128  −0.2283   0.5360 ],

Q4 = [ −0.4698   0.7953  −0.3831
       −0.3387  −0.5632  −0.7537
       −0.8152  −0.2244   0.5340 ].

Moreover, the periodic eigenvector corresponding to the eigenvalue λ = 1.0000 can be read off as

(x1,x2,x3,x4) = (Q1(:,1), Q2(:,1), Q3(:,1), Q4(:,1))

corresponding to

(µ1,µ2,µ3,µ4) = (S1(1,1), S2(1,1), S3(1,1), S4(1,1)).

This eigenvector corresponds to the stable equilibria for each season, which are reached after about 20 years, see Figures 6-7. Even though the differences between the equilibria corresponding to the different seasons are not very large, the periodic model does not mislead us to believe that the level of pollution in each lake stays fixed when arriving at the steady state, as is the case with the non-periodic model in Example 1. Moreover, we get a more accurate prediction of the long-time behavior by working with a periodic Schur form of a matrix product of period K = 4. Working directly with the explicitly formed product A = A4A3A2A1 in the spirit of Example 1 would only result in one stable equilibrium, corresponding to the x_1 part of the periodic eigenvector above.
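The relations (1.15) can be verified for this example with plain NumPy (a sketch: at this tiny scale it is acceptable to obtain x_1 from the explicitly formed product and then propagate it through the factors; the resulting µ_k agree with the S_k(1,1) values above up to signs absorbed in the orthogonal transformations):

```python
import numpy as np

A1 = np.array([[0.94, 0.03, 0.02], [0.01, 0.93, 0.02], [0.05, 0.04, 0.96]])
A2 = np.array([[0.92, 0.05, 0.03], [0.01, 0.94, 0.01], [0.07, 0.01, 0.96]])
A3 = np.array([[0.93, 0.01, 0.01], [0.04, 0.91, 0.04], [0.03, 0.08, 0.95]])
A4 = np.array([[0.93, 0.03, 0.06], [0.02, 0.90, 0.01], [0.05, 0.07, 0.93]])

# x_1: unit eigenvector of A4 A3 A2 A1 for the eigenvalue lambda = 1
P = A4 @ A3 @ A2 @ A1
w, V = np.linalg.eig(P)
x = np.real(V[:, np.argmin(np.abs(w - 1.0))])
x /= np.linalg.norm(x)
if x[0] < 0:       # fix the sign: this eigenvector is nonnegative
    x = -x

# Propagate through the factors: A_k x_k = mu_k x_{k+1}
mus, xs = [], [x]
for Ak in (A1, A2, A3, A4):
    y = Ak @ xs[-1]
    mus.append(np.linalg.norm(y))
    xs.append(y / np.linalg.norm(y))

# lambda = mu_1 mu_2 mu_3 mu_4 = 1 and x_5 = x_1 (K-periodicity)
print(np.prod(mus), np.allclose(xs[4], xs[0]))
```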

For other applications with periodic behavior, see, e.g., the modelling of the growth of citrus trees in [27].

There exist a number of generalizations of the presented periodic Schur forms. For example, the extended periodic real Schur form (EPRSF) [106] generalizes PRSF to handle square products where the involved matrices are rectangular. EPRSF can be computed by a slightly modified periodic QR algorithm. We also remark that the GPRSF can be modified to cover matrix products with rectangular factors and/or ±1 exponents in arbitrary order (see Section 3).

1.5 Sylvester-type matrix equations

Matrix equations have been in the focus of the numerical community for quite some time. Applications include eigenvalue problems and condition estimation (e.g., see [73,



FIGURE 6: The level of pollution in each lake as a function of time illustrated by explicit application of Equation (1.17). The stable equilibria corresponding to summer, autumn, winter and spring are reached after about 20 years.

60, 96]) as well as various control problems (e.g., see [36, 77]).

Already in 1972, R. H. Bartels and G. W. Stewart published the paper Algorithm 432: Solution of the Matrix Equation AX + XB = C [11], which presented a Schur-based method for solving the continuous-time Sylvester (SYCT) equation

AX +XB = C, (1.18)

where A of size m×m, B of size n×n, and C of size m×n are arbitrary matrices with real entries. Equation (1.18) has a unique solution if and only if A and −B have no eigenvalues in common.
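For modest dimensions, Equation (1.18) can be solved directly from Python with SciPy's solve_sylvester, which implements a Bartels–Stewart-type method; a quick sketch with a residual check (the spectra are shifted apart so that the solvability condition holds):

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(1)
m, n = 5, 3
# Shift the spectra apart so that A and -B have no eigenvalue in common
A = rng.standard_normal((m, m)) + 3.0 * np.eye(m)
B = rng.standard_normal((n, n)) + 3.0 * np.eye(n)
C = rng.standard_normal((m, n))

X = solve_sylvester(A, B, C)               # solves A X + X B = C
print(np.linalg.norm(A @ X + X @ B - C))   # residual near machine precision
```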

The solution method in [11] follows the general idea from mathematics of problem solving via reformulations and coordinate transformations: first transform the problem to a form where it is (more easily) solvable, then solve the transformed problem



FIGURE 7: The stable equilibria corresponding to summer, autumn, winter and spring, which are reached after about 20 years.

and finally transform the solution back to the original coordinate system. Other examples include computing derivatives by a spectral method, using the forward and backward Fourier transform as transformation method [39], and computing explicit inverses of general square matrices, using LU factorization and matrix multiplication as transformation methods [61].

Bartels–Stewart's method for Equation (1.18) proceeds as follows:

1. Transform the matrices A and B to real Schur form:

S_A = Q^T A Q,   (1.19)

S_B = P^T B P,   (1.20)

where Q and P are orthogonal matrices and S_A and S_B are in real Schur form.

2. Update the matrix C with respect to the two Schur decompositions:

C = Q^T C P.   (1.21)

3. Solve the resulting reduced triangular matrix equation:

S_A X + X S_B = C.   (1.22)


4. Transform the obtained solution back to the original coordinate system:

X = Q X P^T.   (1.23)
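The four steps can be sketched compactly in NumPy/SciPy if one uses the complex Schur form, so that step 3 becomes a column-by-column substitution (a simplification of the real-arithmetic blocked algorithms used in the actual codes):

```python
import numpy as np
from scipy.linalg import schur, solve_triangular

def bartels_stewart(A, B, C):
    """Solve A X + X B = C (sketch; complex Schur form for simplicity)."""
    SA, Q = schur(A, output='complex')   # step 1: S_A = Q^H A Q
    SB, P = schur(B, output='complex')   #         S_B = P^H B P
    Ct = Q.conj().T @ C @ P              # step 2: update right-hand side
    m, n = Ct.shape                      # step 3: triangular, column-wise
    X = np.zeros((m, n), dtype=complex)
    for j in range(n):
        rhs = Ct[:, j] - X[:, :j] @ SB[:j, j]
        X[:, j] = solve_triangular(SA + SB[j, j] * np.eye(m), rhs)
    return Q @ X @ P.conj().T            # step 4: transform back

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5)) + 3.0 * np.eye(5)
B = rng.standard_normal((4, 4)) + 3.0 * np.eye(4)
C = rng.standard_normal((5, 4))
X = bartels_stewart(A, B, C)
print(np.linalg.norm(A @ X + X @ B - C))   # residual near machine precision
```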

The first step, which is performed by reducing the left-hand-side coefficient matrices to Hessenberg form and applying the QR algorithm to compute their real Schur forms, is also known to be the dominating part in terms of floating point operations [46] and execution time (see Paper III). With recent developments towards level 3 performance in the bulge-chasing [25] and advanced deflation techniques for the QR algorithm [26], this might change in the future.

The classic paper of Bartels and Stewart [11] has served as a foundation for later developments of direct solution methods for related problems, see, e.g., Hammarling's method [56] and the Hessenberg–Schur approach by Golub, Nash and Van Loan [45]. However, these methods were developed before matrix blocking became vital for handling the increasing performance gap between processors and memory modules.

SYCT can be formulated in terms of computing a block diagonalization of a matrix in standard real Schur form

[ I  X ]⁻¹ [ S11  S12 ] [ I  X ]   [ S11  S12 + S11X − XS22 ]
[ 0  I ]   [  0   S22 ] [ 0  I ] = [  0          S22        ].   (1.24)

This leads to the idea of computing reciprocal condition number estimates of selected eigenvalue clusters [10] and the corresponding eigenspaces [73], which are, roughly speaking, based on how "hard" it is to solve the corresponding triangular Sylvester equation. A similar reasoning can be applied to the generalized real Schur form (see, e.g., [75, 74]) in terms of solving triangular generalized coupled Sylvester (GCSY) equations

(AX − Y B, DX − Y E) = (C, F).   (1.25)

Condition estimates can be formulated for other Sylvester-type matrix equations as well (see, e.g., [66, 67] and Papers III-IV below).
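The "hardness" referred to above is quantified by the separation sep(S11, S22), the smallest singular value of the Sylvester operator X ↦ S11X − XS22. For small blocks it can be computed explicitly from the Kronecker-product matrix of the operator (the cited estimators only estimate this quantity, since the explicit matrix has dimension mn):

```python
import numpy as np

def sep(S11, S22):
    """sep(S11, S22): smallest singular value of Z = I (x) S11 - S22^T (x) I,
    where Z vec(X) = vec(S11 X - X S22). Explicit, so small blocks only."""
    m, n = S11.shape[0], S22.shape[0]
    Z = np.kron(np.eye(n), S11) - np.kron(S22.T, np.eye(m))
    return np.linalg.svd(Z, compute_uv=False)[-1]

# Nearby eigenvalue clusters give a small sep, i.e., sensitive eigenspaces:
print(sep(np.array([[1.0]]), np.array([[2.0]])))     # 1.0
print(sep(np.array([[1.0]]), np.array([[1.001]])))   # approx. 0.001
```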

Bartels–Stewart-style methods can be formulated for all matrix equations in Table 2.1 (see Paper III). For example, for the generalized matrix equations such methods require efficient and reliable implementations of the QZ algorithm. For parallel DM environments, the lack of such an implementation has driven the development of fully iterative methods for Sylvester-type matrix equations based on Newton iteration of the matrix sign function [15, 16, 17]. A comparison of such a fully iterative method and Bartels–Stewart's method was presented in [50]. We remark that a preliminary version of a highly efficient parallel QZ algorithm is included in the software package SCASY, see Papers III-IV of this Thesis.


Periodic counterparts of similar problems, e.g., condition estimation of eigenvalue clusters or of certain periodic eigenspaces, can be formulated in terms of periodic Sylvester-like equations (see Section 2 and Papers I-II below). Also, periodic variants of Bartels–Stewart's method follow straightforwardly by considering periodic matrices and a periodic Schur decomposition, performing K-periodic updates of the right-hand side, and solving triangular periodic Sylvester-type matrix equations, see, e.g., [105, 49].

1.6 High performance linear algebra software libraries

Software libraries have long been a fundamental tool for problem solving in Computational Science and Engineering (CSE). Many computational problems arising from real-world applications or from discretizations and linearizations of mathematical models can be formulated in terms of common matrix and vector operations. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms that efficiently utilize and adapt to new architectural features. Moreover, as we gain new insight from CSE research and development, new demands and challenges emerge that require the solution of new and more complex matrix computational problems.

These facts have been driving forces behind library software packages such as BLAS [20], LAPACK [82, 5], SLICOT [41, 102] and their parallel counterparts ScaLAPACK [18] and PSLICOT [90]. Below, we give some insight into the functionality of these libraries. For other efforts to provide high performance software libraries in matrix computations, see, e.g., FLAME [54], where the triangular SYCT equation (see Section 1.5) was used as a model problem for formal derivation and automatic generation of numerical algorithms [98], and PLAPACK [104]. Moreover, the Umeå HPC research group, in collaboration with the IBM T.J. Watson Research Center, recently presented novel recursive blocked algorithms and hybrid data formats for dense linear algebra library software targeting deep memory hierarchies of processor nodes (see the review paper [40] and further references therein).

The BLAS (Basic Linear Algebra Subprograms) are structured in three levels. The level 1 BLAS are concerned with vector-vector operations, e.g., scalar products, rotations, etc., and were developed during the seventies. The level 2 BLAS perform matrix-vector operations and were originally motivated by the increasing number of vector machines during the eighties. The level 3 BLAS concern matrix-matrix operations, such as the well-known GEMM (GEneral Matrix Multiply and add) operation

C = βC + αAB,


where α and β are scalars, A is an m×k matrix, B is a k×n matrix, and C is an m×n matrix. In general, the level 3 BLAS perform O(n³) arithmetic operations while moving O(n²) data elements through the memory hierarchy of the computer, leading to a volume-to-surface effect on the performance. If the level 3 BLAS are properly tuned for the cache memory hierarchy of the target computer system, and the computations in the actual program are organized into level 3 operations, the execution may run at close to practical peak performance. In fact, the whole level 3 BLAS may be organized in terms of GEMM operations [71, 72], which means that the performance will depend mainly on how well tuned the GEMM operation is. Computer vendors often supply their own high performance implementations of the BLAS, optimized for their specific architectures. Automatically tuned libraries also exist, see, e.g., ATLAS [8]. See also the GOTO-BLAS [47], which makes use of data streaming to efficiently utilize the memory hierarchy of the target computer.
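From Python, the GEMM operation is reachable through SciPy's low-level BLAS wrappers, which dispatch to whichever (possibly vendor-tuned) BLAS the installation is linked against; a minimal sketch:

```python
import numpy as np
from scipy.linalg.blas import dgemm

rng = np.random.default_rng(4)
m, k, n = 6, 5, 4
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = rng.standard_normal((m, n))
alpha, beta = 2.0, 0.5

# C <- beta*C + alpha*A*B, computed by the double precision GEMM routine
D = dgemm(alpha, A, B, beta=beta, c=C)
print(np.allclose(D, beta * C + alpha * A @ B))   # True
```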

The LAPACK (Linear Algebra Package) combines the functionality of the former LINPACK and EISPACK libraries and performs all kinds of matrix computations, from solving linear systems of equations to calculating all eigenvalues of a general matrix. The computations in LAPACK are organized to perform as much work as possible in level 3 operations for optimal performance. The LAPACK project [5, 34, 31] has been extremely successful and now forms the underlying "computational layer" of the interactive MATLAB [86] environment, which is perhaps the most popular tool for solving computational problems in science and engineering and for educational purposes.

ScaLAPACK (Scalable LAPACK) [18] implements a subset of the algorithms in LAPACK for distributed memory environments (see also Section 1.2). Basic building blocks are a two-dimensional (2D) block cyclic data distribution (see, e.g., [48]) over a logical rectangular processor mesh, in combination with a Fortran 77 object-oriented approach for handling the involved global distributed matrices. In connection with ScaLAPACK, parallel versions of the BLAS exist, the PBLAS (Parallel BLAS) [94]. Explicit communication in ScaLAPACK is performed using the BLACS (Basic Linear Algebra Communication Subprograms) library [19], which provides processor mesh setup tools and basic point-to-point, collective and reduction communication routines. The BLACS are usually implemented using MPI (Message Passing Interface) [88], the de facto standard for message passing communication.

We remark that the basic building blocks of LAPACK and ScaLAPACK are under reconsideration [31].

SLICOT (Subroutine Library in Systems and Control Theory) provides Fortran 77 implementations of numerical algorithms for computations in systems and control applications. Based on numerical linear algebra routines from the BLAS and LAPACK libraries, SLICOT provides methods for the design and analysis of control systems.


Similarly to LAPACK and ScaLAPACK, a parallel version of SLICOT, called PSLICOT, is under development. The goal is to include most of the functionality of SLICOT in the parallel version. PSLICOT also builds on the existing functionality of ScaLAPACK, PBLAS and BLACS.

We refer to [28] for more discussion of the impact of multicore architectures on mathematical software.

1.7 The need for improved algorithms and library software

As computational science evolves together with the technical development of new and complex computational platforms, the need for reliable and efficient numerical software continues to grow. Typically, new and complex, computationally intensive problems arise in the context of real applications ranging from bridge construction, automobile design and satellite control to particle simulations and quantum computations in physics and chemistry.

In the rest of this Thesis, we present and discuss our recent contributions regarding new and improved algorithms and library software for computational problems in the following topics of numerical linear algebra:

1. Periodic eigenvalue reordering, which arises in the context of computing periodic eigenspaces of certain matrix products; these contributions also include new and improved algorithms for solving periodic Sylvester-type matrix equations.

2. Parallel solvers and condition estimators for Sylvester-type matrix equations for distributed memory and hybrid distributed and shared memory environments.

3. Parallel eigenvalue reordering in the standard Schur form of a matrix (see Section 1.4), which fills one of the gaps in the functionality of ScaLAPACK (see Section 1.6), providing novel algorithms and software for parallel computation of invariant subspaces.

In Chapter 2 of this introduction, we give brief summaries of the contributions of the five papers included in this Thesis. Chapter 3 outlines some ongoing and future work, and the final Chapter 4 presents some of our related publications and project websites.


CHAPTER 2

Contributions in this Thesis

This chapter gives a brief summary of the contributions of the five papers in this Thesis. The first two papers are concerned with periodic eigenvalue reordering in periodic (cyclic) matrix and matrix pair sequences. The third and fourth papers concern parallel solvers and condition estimators for Sylvester-type matrix equations. The last paper concerns a parallel version of the standard algorithm for eigenvalue reordering in the real Schur form of a general matrix.

The library software deliverables from this Thesis are novel contributions that open up new possibilities for solving challenging computational problems regarding periodic systems as well as other large-scale control systems applications. Examples include optimal linear-quadratic control problems, which involve the solution of various algebraic and differential Riccati equations [87, 57, 4, 65, 52].

In Chapter 3, we discuss how the contributions from this Thesis can be further extended and refined to cover new challenging computational problems in the considered areas.

2.1 Paper I

The first contribution concerns periodic eigenvalue problems. The paper presents the derivation and analysis of a direct method (see, e.g., [9, 69]) for eigenvalue reordering in a K-cyclic matrix product A_{K−1}A_{K−2} · · · A_1A_0 without evaluating any part of the matrix product.

The method relies on orthogonal transformations only and performs the reordering tentatively to guarantee backward stability. By applying a bubble-sort-like procedure of adjacent swaps of diagonal block sequences in the corresponding periodic real Schur form, an ordered periodic real Schur form is computed robustly. One important step in the method is computing the numerical solution of an associated triangular periodic Sylvester equation (PSE)


A_{11}^{(k)} X_k − X_{k+1} A_{22}^{(k)} = −A_{12}^{(k)},   k = 0, 1, . . . , K−1,   (2.1)

where X_K = X_0. Methods for solving Equation (2.1) are discussed, including Gaussian elimination with partial or complete pivoting (GEPP/GECP) and iterative refinement (see, e.g., [61]).
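One of the discussed elimination-type alternatives amounts to assembling the K coupled equations (2.1) into a single linear system in vec(X_0), . . . , vec(X_{K−1}) and applying Gaussian elimination. A dense sketch, only sensible at tiny scale (block sizes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
K, p, q = 3, 2, 2   # period and block sizes: A11 is p x p, A22 is q x q
A11 = [2.0 * np.eye(p) + 0.1 * rng.standard_normal((p, p)) for _ in range(K)]
A22 = [0.3 * rng.standard_normal((q, q)) for _ in range(K)]
A12 = [rng.standard_normal((p, q)) for _ in range(K)]

# Equation k: A11[k] X_k - X_{k+1} A22[k] = -A12[k], with X_K = X_0.
# vec() turns this into
#   (I (x) A11[k]) vec(X_k) - (A22[k]^T (x) I) vec(X_{k+1 mod K}) = -vec(A12[k]).
N = p * q
M = np.zeros((K * N, K * N))
rhs = np.zeros(K * N)
for k in range(K):
    kk = (k + 1) % K
    M[k*N:(k+1)*N, k*N:(k+1)*N] = np.kron(np.eye(q), A11[k])
    M[k*N:(k+1)*N, kk*N:(kk+1)*N] -= np.kron(A22[k].T, np.eye(p))
    rhs[k*N:(k+1)*N] = -A12[k].flatten(order='F')   # column-major vec()

z = np.linalg.solve(M, rhs)
X = [z[k*N:(k+1)*N].reshape((p, q), order='F') for k in range(K)]
res = max(np.linalg.norm(A11[k] @ X[k] - X[(k+1) % K] @ A22[k] + A12[k])
          for k in range(K))
print(res)   # residual near machine precision
```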

An error analysis of the direct reordering method is presented, revealing that the accuracy of the reordered eigenvalues is essentially governed by the accuracy of the computed solution to the associated PSE. The theoretical results are also backtracked to the standard case K = 1, delivering an even sharper error bound than the previously known result in [9].
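For intuition, in the standard case K = 1 with two 1×1 diagonal blocks, a swap of this kind reduces to one scalar Sylvester solve followed by one orthogonal similarity; a sketch of this textbook special case (the values a, b, c are illustrative):

```python
import numpy as np

a, b, c = 0.9, 1.0, 0.5
T = np.array([[a, c],
              [0.0, b]])

# Solve the scalar Sylvester equation a*x - x*b = -c (cf. (2.1) with K = 1)...
x = -c / (a - b)
# ...and build an orthogonal Q whose first column is the (normalized)
# eigenvector [x, 1]^T of T for the eigenvalue b:
s = np.hypot(x, 1.0)
Q = np.array([[x, -1.0],
              [1.0, x]]) / s

T_swapped = Q.T @ T @ Q   # eigenvalues now appear in the order b, a
print(T_swapped)
```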

Experimental results are presented that illustrate the reliability and robustness of the direct reordering method for a selected number of problems, including well- and ill-conditioned artificial problems with short and long periods, and an application with a long period from satellite control.

2.2 Paper II

The second contribution concerns periodic eigenvalue reordering in periodic matrix pairs, which arise in the study of generalized periodic eigenvalue problems, i.e., computing eigenvalues and eigenspaces of matrix products of the form

E_{K−1}^{−1} A_{K−1} E_{K−2}^{−1} A_{K−2} · · · E_1^{−1} A_1 E_0^{−1} A_0,   (2.2)

where K is the period. Such matrix products were studied, e.g., in [12, 14, 84, 111].

Extending and generalizing the results from Paper I and [69, 74], the swapping of the diagonal block sequences in the associated generalized periodic real Schur form by orthogonal transformations is based on computing the numerical solution to an associated periodic generalized coupled Sylvester equation (PGCSY) of the form

A_{11}^{(k)} R_k − L_k A_{22}^{(k)} = −A_{12}^{(k)},
E_{11}^{(k)} R_{k⊕1} − L_k E_{22}^{(k)} = −E_{12}^{(k)},   (2.3)

and the computation of K pairs of orthogonal matrices from the solution to (2.3).

In this paper, the discussion of solution methods for periodic Sylvester equations from Paper I is broadened to also cover structured variants of an orthogonal QR factorization of a matrix representation of the periodic generalized Sylvester operator of (2.3). This QR-based procedure is known to be numerically stable even when Gaussian elimination with partial pivoting fails because of successive pivot growth (see, e.g., [112]).


The presented error analysis generalizes the existing non-periodic result [69] and is connected to the expected accuracy of the QR-based solution method for the associated periodic generalized coupled Sylvester equation.

The paper ends with some experimental results that demonstrate the reliability and robustness of the developed reordering method, including an example with infinite eigenvalues and an example related to periodic LQ-optimal control.

2.3 Paper III

The third contribution presents the theory and algorithms which serve as the foundation of the parallel SCASY software library presented in Paper IV. The aim is to provide robust and efficient parallel algorithms and software for computing the numerical solution of the eight common Sylvester-type matrix equations displayed in Table 2.1, by extending and parallelizing the different steps of the well-known Schur decomposition-based Bartels–Stewart method (see Section 1.5). The algorithms presented are novel and complete ScaLAPACK-style implementations.

TABLE 2.1: Considered standard and generalized matrix equations. CT and DT denote the continuous-time and discrete-time variants, respectively.

Name                            Matrix Equation                               Acronym
Standard CT Sylvester           AX − XB = C ∈ R^{m×n}                         SYCT
Standard CT Lyapunov            AX + XA^T = C ∈ R^{m×m}                       LYCT
Standard DT Sylvester           AXB − X = C ∈ R^{m×n}                         SYDT
Standard DT Lyapunov            AXA^T − X = C ∈ R^{m×m}                       LYDT
Generalized Coupled Sylvester   (AX − Y B, DX − Y E) = (C, F) ∈ R^{(m×n)×2}   GCSY
Generalized Sylvester           AXB^T − CXD^T = E ∈ R^{m×n}                   GSYL
Generalized CT Lyapunov         AXE^T + EXA^T = C ∈ R^{m×m}                   GLYCT
Generalized DT Lyapunov         AXA^T − EXE^T = C ∈ R^{m×m}                   GLYDT
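Two of the equations in the table have small-scale counterparts in SciPy, which is convenient for sanity-checking; a sketch (note that SciPy's discrete-time solver uses the sign convention AXA^T − X + Q = 0, so C enters negated):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_discrete_lyapunov

rng = np.random.default_rng(6)
m = 4
C = rng.standard_normal((m, m))
C = C + C.T   # symmetric right-hand side

# LYCT: A X + X A^T = C (A shifted so that lambda_i + lambda_j != 0)
A = rng.standard_normal((m, m)) - 3.0 * np.eye(m)
X = solve_continuous_lyapunov(A, C)

# LYDT: A X A^T - X = C (SciPy solves A X A^T - X + Q = 0, hence Q = -C;
# Ad is scaled down so that lambda_i * lambda_j != 1)
Ad = 0.3 * rng.standard_normal((m, m))
Xd = solve_discrete_lyapunov(Ad, -C)

print(np.linalg.norm(A @ X + X @ A.T - C),
      np.linalg.norm(Ad @ Xd @ Ad.T - Xd - C))   # both residuals tiny
```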

The work by Kågström and Poromaa [73] and Poromaa [96] on blocked and parallel algorithms for the triangular continuous-time Sylvester (SYCT) matrix equation (1.18) and the generalized coupled Sylvester (GCSY) matrix equation (1.25) is extended and refined to cover other types of Sylvester-like matrix equations, such as the discrete-time and/or generalized variants of the Sylvester and Lyapunov matrix equations. For the discrete-time matrix equations, a novel technique for reducing the arithmetic complexity needed in the explicit blocking is proposed. Two different communication schemes used in the triangular solvers are presented, and a new algorithm for performing global scaling to prevent numerical overflow is discussed.

One of the shortcomings of the previous algorithms was the lack of support for handling quasi-triangular matrices where some 2×2 blocks, corresponding to complex conjugate pairs of eigenvalues, were shared between different blocks (and processors) in the explicit blocking of the algorithms. This problem was resolved by proposing an implicit redistribution of the matrices in the initial stage of step 3 of Bartels–Stewart's method. Notice that a second possible solution is outlined below in Section 2.5.

The paper includes a generic scalability analysis of the developed parallel algorithms, which provides theoretical insight into their expected behavior, including performance models that predict a certain level of scaled parallel speedup.

The developed algorithms are also combined with the (Sca)LAPACK-style condition estimation functionality [5] to produce scalable and robust parallel condition estimators.

Experimental results from three parallel platforms with different characteristics are presented and analyzed using several performance and accuracy metrics.

We remark that the work presented in Papers III-IV of this Thesis was based on preliminary results from [51].

2.4 Paper IV

In this fourth contribution, the parallel ScaLAPACK-style software package SCASY is presented. SCASY builds on the theory and algorithms presented in Paper III and extends the functionality by providing parallel solvers and condition estimators for 44 different sign and transpose variants of the eight Sylvester-type matrix equations considered in Paper III. By using and extending existing functionality from the BLAS, LAPACK, ScaLAPACK and RECSY [68] libraries, SCASY delivers high-performance routines for general and triangular Sylvester-type matrix equations and is designed and implemented for DM and hybrid DM/SM platforms.

FORTRAN interfaces to the different core subroutines and test utilities, SCASY's internal design and routine dependency hierarchy (see Figure 8), the software documentation, installation instructions, and library usage are discussed in some detail.

Some experimental results demonstrating SCASY's novel capacity of handling two different levels of parallelism and programming models concurrently are also presented. This is conducted by combining the distributed memory model of the ScaLAPACK-style core routines in SCASY with the shared memory model of the OpenMP version of the node solver library RECSY and a shared memory implementation of the BLAS (SMP-BLAS) on a target distributed memory machine with SMP-aware NUMA¹ multiprocessor nodes.

1 Non-Uniform Memory Access


FIGURE 8: Software hierarchy of SCASY.

SCASY is currently used by researchers at the ICPS Group [62] in the context of a CFD application for solving the linearized Navier-Stokes equation with stochastic forcing, see, e.g., [42]. The application is implemented in Python with f2py-generated [100] ScaLAPACK wrappers, and one core step is to use the general Lyapunov solver PGELYCTD from SCASY to solve general Lyapunov equations of size 20000×20000 which are so ill-conditioned that fully iterative methods (see Section 1.5) fail to compute the solution.

2.5 Paper V

The final contribution in this Thesis presents novel parallel variants of the standard (blocked) algorithm for eigenvalue reordering in the standard (and generalized) real Schur form of a square matrix (regular matrix pair).

High serial performance is accomplished by adopting the recently developed idea of using computational windows and delaying the outside-window updates until each window has been completely reordered locally. By using multiple concurrent windows, the parallel algorithm attains a high level of concurrency, and most work is performed as level 3 BLAS operations, which in turn leads to high parallel performance. The basic ideas are illustrated in Figure 9, where a real Schur form is distributed over a 2×3 mesh of processors and the eigenvalues are reordered using two concurrent computational windows.

FIGURE 9: Using multiple concurrent computational windows in parallel eigenvalue reordering in the real Schur form. The computational windows (red) are local and the update regions (green and blue) are shared.
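The delay-and-accumulate idea can be illustrated with a small serial sketch for the simplest case of 1×1 diagonal blocks (distinct real eigenvalues): inside the window, selected eigenvalues are bubbled to the top with Givens swaps that touch only the window, the rotations are accumulated in an orthogonal factor, and the off-window updates are applied afterwards as two matrix–matrix (GEMM) products. The names and structure below are our own illustration; the actual parallel algorithm additionally handles 2×2 blocks and multiple concurrent windows.

```python
import numpy as np

def reorder_window(T, select, lo, hi):
    """Move the selected eigenvalues (assumed 1x1 blocks) to the top of
    the window T[lo:hi, lo:hi]. All rotations touch only the window and
    are accumulated in Qw; the off-window row/column updates are delayed
    and applied at the end as two level 3 (GEMM) operations."""
    m = hi - lo
    Tw = T[lo:hi, lo:hi].copy()
    sel = list(select[lo:hi])
    Qw = np.eye(m)
    for _ in range(m):                       # bubble selected entries up
        for i in range(m - 1):
            if not sel[i] and sel[i + 1]:
                a, b, c = Tw[i, i], Tw[i, i + 1], Tw[i + 1, i + 1]
                r = np.hypot(b, c - a)       # [b, c-a] spans the
                if r > 0.0:                  # eigenvector for c
                    G = np.array([[b / r, -(c - a) / r],
                                  [(c - a) / r, b / r]])
                    Tw[i:i + 2, :] = G.T @ Tw[i:i + 2, :]
                    Tw[:, i:i + 2] = Tw[:, i:i + 2] @ G
                    Tw[i + 1, i] = 0.0       # exact zero by construction
                    Qw[:, i:i + 2] = Qw[:, i:i + 2] @ G
                sel[i], sel[i + 1] = sel[i + 1], sel[i]
    T[lo:hi, lo:hi] = Tw
    T[:lo, lo:hi] = T[:lo, lo:hi] @ Qw       # delayed updates, GEMM
    T[lo:hi, hi:] = Qw.T @ T[lo:hi, hi:]
    return Qw
```

With several non-overlapping windows, the window loops are independent and can run concurrently; only the delayed GEMM updates touch shared regions.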

The parallel Sylvester solvers and the associated condition estimators from Papers III-IV are applied to compute reciprocal condition number estimates for the selected cluster of eigenvalues and the associated eigenspaces (invariant and deflating subspaces). As an application example, stable invariant subspaces and associated condition number estimates of Hamiltonian matrices are computed with satisfactory results.

Experimental results for ScaLAPACK-style FORTRAN implementations on a Linux cluster confirm the efficiency and scalability of our algorithms: the serial speedup goes up to 14 times on one processor, and, in addition, more than 16 times of parallel speedup is delivered using 64 processors for large-scale problems.

Paper V also provides the software for an alternative solution of the problem with the shared 2×2 blocks considered in Papers III-IV, by solving the problem already at step 1 in Bartels–Stewart's method by means of computing an ordered (generalized) real Schur form that does not have any shared 2×2 blocks in the parallel data distribution. However, we do not expect this to have any obvious effect on the parallel performance of the parallel Sylvester solvers, since its complexity, although still negligible compared to the total cost of solving the actual Sylvester equations, is higher than the cost of performing the implicit redistribution.


CHAPTER 3

Ongoing and future work

In this chapter, we discuss how the algorithms and software presented in the Thesis can be used and further developed to solve new challenging and open problems. Work in these directions is already in progress in our research team.

3.1 Software tools for periodic eigenvalue problems

The reordering methods from Papers I-II can be extended to consider product eigenvalue problems in their most general form:

$$A_p^{s_p} A_{p-1}^{s_{p-1}} \cdots A_1^{s_1}, \qquad (3.1)$$

with exponents $s_1, \ldots, s_p \in \{1, -1\}$. The dimensions of $A_1, \ldots, A_p$ should match, i.e.,

$$A_k \in \begin{cases} \mathbb{R}^{n_{k+1} \times n_k}, & \text{if } s_k = 1, \\ \mathbb{R}^{n_k \times n_{k+1}}, & \text{if } s_k = -1, \end{cases}$$

where $n_{p+1} = n_1$. Additionally, the condition

$$\sum_{\substack{k=1 \\ s_k = 1}}^{p} n_k + \sum_{\substack{k=1 \\ s_k = -1}}^{p} n_{k+1} \;=\; \sum_{\substack{k=1 \\ s_k = -1}}^{p} n_k + \sum_{\substack{k=1 \\ s_k = 1}}^{p} n_{k+1}, \qquad (3.2)$$

ensures that the corresponding lifted block cyclic pencil is square (see, e.g., [43] for details) and the eigenvalues (and eigenvectors) are properly defined.
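As a concrete bookkeeping aid, the squareness condition (3.2) is easy to check programmatically. The following small helper is our own illustration (not part of any package discussed here): it compares the two sides of (3.2), i.e., the total row and column dimensions of the lifted block cyclic pencil.

```python
def lifted_pencil_is_square(n, s):
    """n = [n_1, ..., n_p], s = [s_1, ..., s_p] with s_k in {+1, -1}.
    Returns True iff condition (3.2) holds, with n_{p+1} = n_1."""
    p = len(n)
    nn = n + [n[0]]                       # cyclic wraparound n_{p+1} = n_1
    lhs = sum(nn[k] for k in range(p) if s[k] == 1) + \
          sum(nn[k + 1] for k in range(p) if s[k] == -1)
    rhs = sum(nn[k] for k in range(p) if s[k] == -1) + \
          sum(nn[k + 1] for k in range(p) if s[k] == 1)
    return lhs == rhs
```

Note that when all exponents equal +1, the two sides both reduce to the cyclic sum of all $n_k$, so (3.2) holds automatically; mixed signatures with unequal dimensions can violate it.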

The level of generality imposed by (3.1) makes it possible to cover other applications (see, e.g., [13]) that do not necessarily lead to a matrix product having the somewhat simpler form (1.13). It is worth mentioning that this would admit matrices $A_k$ corresponding to an index $s_k = -1$ to be singular or even rectangular, in which case the matrix product (3.1) should only be understood in a formal¹ sense.

¹ Formally, if $A \in \mathbb{R}^{m \times n}$, then $A^{-1} \in \mathbb{R}^{n \times m}$. Notice that this is consistent with the definition of the pseudo-inverse $A^{+}$, which can actually be computed via an SVD.


In this context, we rely on a different definition of the periodic Schur decomposition which allows for handling dimension-induced zero and infinite eigenvalues. The periodic Sylvester-like matrix equation associated with eigenvalue reordering and condition estimation of eigenvalues and periodic eigenspaces now takes the form

$$\begin{cases} A_{11}^{(k)} X_k - X_{k+1} A_{22}^{(k)} = -A_{12}^{(k)}, & \text{for } s_k = 1, \\[2pt] A_{11}^{(k)} X_{k+1} - X_k A_{22}^{(k)} = -A_{12}^{(k)}, & \text{for } s_k = -1, \end{cases} \qquad (3.3)$$

and can be addressed by extending and refining the methods presented in Papers I-II. We refer to [52, 53] for details.

3.2 Frobenius norm-based parallel estimators

The current release of the SCASY library (see the library homepage [99]) includes parallel condition estimators based on the LAPACK-style matrix norm estimation technique developed by Hager and Higham [55, 59], which is based on the matrix 1-norm and was successfully applied to SYCT in [73]. Kagstrom and Westin introduced a corresponding matrix norm estimation technique in [76] based on the Frobenius norm, which is needed in, e.g., computing reciprocal condition number estimates for a selected cluster of eigenvalues in the generalized real Schur form [75, 74]. We plan to include such Frobenius-norm based techniques in the next release of SCASY. This would also make the current implementation of the parallel eigenvalue reordering routines from Paper V complete in comparison to the functionality offered by the corresponding LAPACK routines [5].
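For illustration, the essence of the Hager–Higham 1-norm estimation technique can be sketched in a few lines: it produces a lower bound on ||A||₁ using only matrix–vector products with A and Aᵀ, which is what makes it attractive when A is available only implicitly (e.g., as the inverse of a Sylvester operator). This serial, dense sketch is our own and omits the extra-vector safeguards of the LAPACK implementation.

```python
import numpy as np

def one_norm_estimate(matvec, rmatvec, n, itmax=5):
    """Lower-bound estimate of ||A||_1 in the spirit of Hager's method,
    using only products with A (matvec) and A^T (rmatvec)."""
    x = np.full(n, 1.0 / n)
    est = 0.0
    for _ in range(itmax):
        y = matvec(x)
        est_new = np.abs(y).sum()          # ||A x||_1 with ||x||_1 = 1
        if est_new <= est:                 # no further improvement
            break
        est = est_new
        xi = np.where(y >= 0, 1.0, -1.0)   # sign vector (sign(0) := 1)
        z = rmatvec(xi)
        j = int(np.argmax(np.abs(z)))
        if np.abs(z[j]) <= float(z @ x):   # optimality test
            break
        x = np.zeros(n)                    # restart from the unit vector
        x[j] = 1.0                         # of the steepest column
    return est
```

In the parallel setting, each iteration amounts to one solve (or multiply) with the distributed operator plus a few global reductions, so the cost is a handful of applications of the operator.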

3.3 Parallel periodic Sylvester-solvers

The algorithms and techniques in SCASY can be extended to cover also K-periodic matrix equations (see, e.g., [49] for a summary of various periodic Sylvester-type matrix equations). For example, by assuming that the involved K-periodic matrices A_k, B_k and C_k in the PSE (2.1) are distributed over the process mesh such that they are aligned internally, and that A_k and B_k are in periodic Schur form with quasi-triangular factors A_r and B_s, the solution sequence X_k can be obtained by computing the implicit redistribution from A_r and B_s, applying it to all of A_k, B_k, and C_k, and traversing the block diagonals of the C_k sequence just as in the non-periodic case. Node subsystems can be solved by utilizing the periodic Sylvester solvers from Papers I-II or from the upcoming PEP library [52, 95]. Right-hand side updates can be performed as before using level 3 BLAS operations (now in a periodic sense). See [6] for some preliminary work in this direction.

Of course, such parallel periodic matrix equation solvers would in the unreduced case rely on the existence of a parallel distributed memory implementation of the periodic Schur decomposition. To the best of our knowledge, no such parallel algorithm exists today. So far, most underlying physical systems studied have had very small dimensions; very long periods are much more common (see, e.g., Example 2 in Paper I). For that reason, a distributed memory implementation along the lines of the current de facto standards, such as ScaLAPACK and PSLICOT, would have very little practical relevance. Very long sequences of periodic matrices with small dimensions naturally call for shared memory implementations which are able to repeat large amounts of small (and sometimes simple) operations with relatively low complexity in a scalable, high-throughput way. Nevertheless, when discretizations of 3D models of periodic behavior are considered, the demand for parallel periodic Schur decomposition algorithms and Sylvester solvers may arise.
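As a toy reference for the periodic case, a cyclic periodic Sylvester equation of the form A_k X_k − X_{k+1} B_k = C_k (our generic notation; the exact sign/transpose form of (2.1) may differ) can be solved for small dimensions by forming the lifted Kronecker system explicitly. This is useful only for validating a real solver: its O((Kmn)³) cost is nothing like the blocked, structure-exploiting approach outlined above.

```python
import numpy as np

def solve_periodic_sylvester(A, B, C):
    """Solve A_k X_k - X_{k+1} B_k = C_k for k = 0..K-1 (X_K := X_0)
    by assembling the lifted Kronecker system and solving it densely.
    Uses vec(M X) = (I kron M) vec(X) and vec(X N) = (N^T kron I) vec(X)
    with column-major vec."""
    K = len(A)
    m, n = C[0].shape
    N = m * n
    M = np.zeros((K * N, K * N))
    rhs = np.concatenate([C[k].reshape(-1, order="F") for k in range(K)])
    for k in range(K):
        M[k*N:(k+1)*N, k*N:(k+1)*N] = np.kron(np.eye(n), A[k])
        kp = (k + 1) % K                      # cyclic coupling to X_{k+1}
        M[k*N:(k+1)*N, kp*N:(kp+1)*N] -= np.kron(B[k].T, np.eye(m))
    x = np.linalg.solve(M, rhs)
    return [x[k*N:(k+1)*N].reshape(m, n, order="F") for k in range(K)]
```

A unique solution exists when the spectra of the two cyclic products formed from the A_k and B_k sequences are disjoint, the periodic analogue of the usual Sylvester condition.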

3.4 New parallel versions of real Schur decomposition algorithms

As demonstrated in Papers III-V, the performance bottleneck in many parallel Schur-based algorithms is the currently slow parallel Schur decomposition from ScaLAPACK. We plan to contribute to the development of new parallel implementations of the QR and QZ algorithms based on the work presented in [30, 1, 2, 3]. For example, the aggressive early deflation techniques developed and presented in [26, 70] can be implemented in parallel using the parallel eigenvalue reordering techniques presented in Paper V. Moreover, high serial performance should be accomplished by using the same delay-and-accumulate technique for the local orthogonal transformations as was used in the parallel reordering algorithms (see also [25]). A high level of concurrency and scalability can be realized by chasing long and separated chains of small to medium-sized bulges (see, e.g., [58]) over the diagonal blocks of the corresponding Hessenberg matrix, similarly to working with separated selected subclusters in the parallel reordering algorithms in Paper V.

3.5 Parallel periodic eigenvalue reordering

A straightforward extension of the parallel reordering algorithm is to include periodic eigenvalue reordering of a sequence of matrices in periodic (generalized) real Schur form, very much like the ideas of developing parallel periodic Sylvester solvers. We only have to require that the matrices are properly aligned and distributed in the same way over the process mesh to be able to follow the outline of the parallel eigenvalue reordering algorithms, substantially guided by the quasi-triangular factor.

In addition, parallel computation of periodic eigenspaces associated with selected spectra of a square matrix product would require a parallel implementation of the periodic QR and QZ algorithms, as in the case of the parallel periodic Sylvester solvers.


CHAPTER 4

Related papers and project websites

The following five peer-reviewed conference publications (papers I-II, IV-VI) and my Licentiate Thesis (paper III) are related to the contents of this Thesis.

I. R. Granat and B. Kagstrom. Evaluating Parallel Algorithms for Solving Sylvester-Type Matrix Equations: Direct Transformation-Based versus Iterative Matrix-Sign-Function-Based Methods. In: PARA 2004, J. Dongarra et al., Eds. Lecture Notes in Computer Science (LNCS), Vol. 3732, Springer Verlag, pp. 719–729, 2005.

II. R. Granat, I. Jonsson, and B. Kagstrom. Combining Explicit and Recursive Blocking for Solving Triangular Sylvester-Type Matrix Equations in Distributed Memory Platforms. In: Euro-Par 2004, M. Danelutto et al., Eds. Lecture Notes in Computer Science (LNCS), Vol. 3149, Springer Verlag, pp. 742–750, 2004.

III. R. Granat. Contributions to Parallel Algorithms for Sylvester-type Matrix Equations and Periodic Eigenvalue Reordering in Cyclic Matrix Products. Licentiate Thesis. From Technical Report UMINF-05.18, ISSN 0348-0542, ISBN 91-7305-903-X, Dept. of Computing Science, Umea University, May 2005.

IV. R. Granat, B. Kagstrom, and D. Kressner. Reordering the Eigenvalues of a Periodic Matrix Pair with Applications in Control. In: Proc. of 2006 IEEE Conference on Computer Aided Control Systems Design (CACSD), pp. 25–30. ISBN: 0-7803-9797-5, 2006.

V. R. Granat, I. Jonsson, and B. Kagstrom. Recursive Blocked Algorithms for Solving Periodic Triangular Sylvester-type Matrix Equations. In: Proc. of PARA'06: State-of-the-art in Scientific and Parallel Computing, B. Kagstrom et al., Eds. Lecture Notes in Computer Science (LNCS), Vol. 4699, Springer Verlag, pp. 531–539, 2007.


VI. R. Granat, B. Kagstrom, and D. Kressner. MATLAB Tools for Solving Periodic Eigenvalue Problems. In: Proc. of the 3rd IFAC Workshop PSYCO'07, Saint Petersburg, Russia, 2007.

Also, the following project websites are related to the contents of this Thesis.

I. The SCASY homepage [99] is related to the public release and maintenance of the parallel algorithms and software presented in Papers III-IV in this Thesis. Currently, the SCASY library is available for download as a β-version (beta0.10).

II. The PEP project [95] has the goal to provide a complete package of efficient and reliable software for computing eigenvalues and eigenspaces of matrix products of varying dimensions and signatures (the order of the ±1 exponents), including the special cases treated in Papers I-II of this Thesis. A prospectus of this package, including related MATLAB tools, was published as paper VI above. The final results will be presented in [53].

III. OpenFGG - Open Fortran Gateway Generator [92] was initiated as a project running in parallel with the LAPACK-style FORTRAN software development connected to Paper I. Writing numerical software is hard without proper debugging tools, and using the MATLAB environment to testdrive the developed software simplifies and speeds up implementation, maintenance and debugging. FORTRAN routines are accessed from MATLAB using MEX-interfaces which provide a so-called gateway to the routine. To construct a gateway, the user must provide a MEX-interface file together with the FORTRAN source code to the MATLAB MEX-compiler, which in turn constructs the gateway. In this context, OpenFGG can simplify the gateway construction by generating it automatically from the FORTRAN source and a few hints from the user passed through OpenFGG's Java-based GUI.

OpenFGG was designed and implemented in cooperation with three MSc students: Magnus Andersson, Johan Sejdhage and Joakim Hjertstedt. A User's Guide and some documentation and examples can be found on the OpenFGG website (see above). The latest version, OpenFGG 1.5, also supports a subset of the constructs in FORTRAN 90/95.

The OpenFGG website has also been posted on MATLAB Central, a worldwide open exchange forum for the MATLAB and Simulink user community.


References

[1] B. Adlerborn, K. Dackland, and B. Kagstrom. Parallel Two-Stage Reduction of a Regular Matrix Pair to Hessenberg-Triangular Form. In T. Sørvik et al., editor, Applied Parallel Computing: New Paradigms for HPC Industry and Academia, volume 1947 of Lecture Notes in Computer Science, pages 92–102. Springer-Verlag, 2001.

[2] B. Adlerborn, K. Dackland, and B. Kagstrom. Parallel and blocked algorithms for reduction of a regular matrix pair to Hessenberg-triangular and generalized Schur forms. In J. Fagerholm et al., editor, Applied Parallel Computing PARA 2002, volume 2367 of Lecture Notes in Computer Science, pages 319–328. Springer-Verlag, 2002.

[3] B. Adlerborn, D. Kressner, and B. Kagstrom. Parallel Variants of the Multishift QZ Algorithm with Advanced Deflation Techniques. In B. Kagstrom et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 117–126. Springer, 2007.

[4] G. S. Ammar, P. Benner, and V. Mehrmann. A multishift algorithm for the numerical solution of algebraic Riccati equations. Electr. Trans. Num. Anal., 1:33–48, 1993.

[5] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, third edition, 1999.

[6] P. Andersson. Parallella algoritmer for periodiska ekvationer av Sylvester-typ (Parallel algorithms for periodic Sylvester-type equations). Master's thesis, Department of Computing Science, Umea University, 2007. In Swedish (to appear).

[7] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[8] ATLAS - Automatically Tuned Linear Algebra Software. See http://math-atlas.sourceforge.net/.

[9] Z. Bai and J. W. Demmel. On swapping diagonal blocks in real Schur form. Linear Algebra Appl., 186:73–95, 1993.

[10] Z. Bai, J. W. Demmel, and A. McKenney. On computing condition numbers for the nonsymmetric eigenproblem. ACM Trans. Math. Software, 19(2):202–223, 1993.

[11] R. H. Bartels and G. W. Stewart. Algorithm 432: The Solution of the Matrix Equation AX + XB = C. Communications of the ACM, 15(9):820–826, 1972.

[12] P. Benner and R. Byers. Evaluating products of matrix pencils and collapsing matrix products. Numerical Linear Algebra with Applications, 8:357–380, 2001.

[13] P. Benner, R. Byers, V. Mehrmann, and H. Xu. Numerical computation of deflating subspaces of skew-Hamiltonian/Hamiltonian pencils. SIAM J. Matrix Anal. Appl., 24(1), 2002.

[14] P. Benner, V. Mehrmann, and H. Xu. Perturbation analysis for the eigenvalue problem of a formal product of matrices. BIT, 42(1):1–43, 2002.

[15] P. Benner and E. S. Quintana-Ortí. Solving Stable Generalized Lyapunov Equations with the matrix sign function. Numerical Algorithms, 20(1):75–100, 1999.

[16] P. Benner, E. S. Quintana-Ortí, and G. Quintana-Ortí. Numerical Solution of Discrete Stable Linear Matrix Equations on Multicomputers. Parallel Algorithms and Applications, 17(1):127–146, 2002.

[17] P. Benner, E. S. Quintana-Ortí, and G. Quintana-Ortí. Solving Stable Sylvester Equations via Rational Iterative Schemes. Preprint sfb393/04-08, TU Chemnitz, 2004.

[18] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. W. Demmel, I. Dhillon, J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.

[19] BLACS - Basic Linear Algebra Communication Subprograms. See http://www.netlib.org/blacs/index.html.


[20] BLAS - Basic Linear Algebra Subprograms. See http://www.netlib.org/blas/index.html.

[21] BlueGene/L - Advanced Simulation and Computing. See https://asc.llnl.gov/computing_resources/bluegenel/configuration.html.

[22] A. Bojanczyk, G. H. Golub, and P. Van Dooren. The periodic Schur decomposition; algorithm and applications. In Proc. SPIE Conference, volume 1770, pages 31–42, 1992.

[23] A. Bojanczyk and P. Van Dooren. Reordering diagonal blocks in the real Schur form. In NATO ASI on Linear Algebra for Large Scale and Real-Time Applications, volume 1770, pages 351–356, 1993.

[24] Biochemical and Biophysical Kinetics in Freshwater Lakes. See http://www.pitb.de/biophys/bp19/.

[25] K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm, I: Maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl., 23(4):929–947, 2002.

[26] K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm, II: Aggressive early deflation. SIAM J. Matrix Anal. Appl., 23(4):948–973, 2002.

[27] R. Bru, R. Canto, and B. Ricarte. Modelling nitrogen dynamics in citrus trees. Mathematical and Computer Modelling, 38:975–987, 2003.

[28] A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov. The Impact of Multicore on Math Software. In B. Kagstrom et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 1–10. Springer, 2007.

[29] Condor - High throughput computing. See http://www.cs.wisc.edu/condor/.

[30] K. Dackland and B. Kagstrom. Blocked Algorithms and Software for Reduction of a Regular Matrix Pair to Generalized Schur Form. ACM Trans. Math. Software, 25(4):425–454, 1999.

[31] J. Demmel, J. Dongarra, B. Parlett, W. Kahan, M. Gu, D. Bindel, Y. Hida, X. Li, O. Marques, J. Riedy, C. Vomel, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, J. Langou, and S. Tomov. Prospectus for the Next LAPACK and ScaLAPACK libraries. In B. Kagstrom et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 11–23. Springer, 2007.

[32] J. Demmel and B. Kagstrom. The Generalized Schur Decomposition of an Arbitrary Pencil A−λB: Robust Software with Error Bounds and Applications. Part I: Theory and Algorithms. ACM Trans. Math. Software, 19(2):160–174, June 1993.

[33] J. Demmel and B. Kagstrom. The Generalized Schur Decomposition of an Arbitrary Pencil A−λB: Robust Software with Error Bounds and Applications. Part II: Software and Applications. ACM Trans. Math. Software, 19(2):175–201, June 1993.

[34] J. W. Demmel and J. J. Dongarra. LAPACK 2005 prospectus: Reliable and scalable software for linear algebra computations on high end computers. LAPACK Working Note 164, University of California, Berkeley and University of Tennessee, Knoxville, 2005.

[35] J. J. Dongarra, S. Hammarling, and J. H. Wilkinson. Numerical considerations in computing invariant subspaces. SIAM J. Matrix Anal. Appl., 13(1):145–161, 1992.

[36] G. E. Dullerud and F. Paganini. A Course in Robust Control Theory - A Convex Approach. Springer-Verlag, New York, 2000.

[37] A. Edelman, E. Elmroth, and B. Kagstrom. A Geometric Approach to Perturbation Theory of Matrices and Matrix Pencils. Part I: Versal Deformations. SIAM J. Matrix Anal. Appl., 18(3):653–692, 1997. (Awarded the SIAM Linear Algebra Prize 2000 for the most outstanding paper published during 1997–99).

[38] A. Edelman, E. Elmroth, and B. Kagstrom. A Geometric Approach to Perturbation Theory of Matrices and Matrix Pencils. Part II: A Stratification-Enhanced Staircase Algorithm. SIAM J. Matrix Anal. Appl., 20(3):667–699, 1999.

[39] B. Eliasson. Numerical Vlasov-Maxwell Modelling of Space Plasma. PhD thesis, Uppsala University, Department of Information Technology, Scientific Computing, 2004.

[40] E. Elmroth, F. Gustavson, I. Jonsson, and B. Kagstrom. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software. SIAM Review, 46(1):3–45, 2004.


[41] E. Elmroth, P. Johansson, B. Kagstrom, and D. Kressner. A web computing environment for the SLICOT library. In The Third NICONET Workshop on Numerical Control Software, pages 53–61, 2001.

[42] B. F. Farrell and P. J. Ioannou. Stochastic forcing of the linearized Navier-Stokes equation. Phys. Fluids A, 5:2600–2609, 1993.

[43] D. S. Flamm. A new shift-invariant representation for periodic linear systems. Syst. Control Lett., 17(1):9–14, 1991.

[44] J. G. F. Francis. The QR Transformation, Parts I and II. Computer Journal, 4:265–271, 332–345, 1961, 1962.

[45] G. H. Golub, S. Nash, and C. F. Van Loan. A Hessenberg-Schur method for the problem AX + XB = C. IEEE Trans. Automat. Control, 24(6):909–913, 1979.

[46] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.

[47] GOTO-BLAS - High-Performance BLAS by Kazushige Goto. See http://www.cs.utexas.edu/users/flame/goto/.

[48] A. Grama, A. Gupta, G. Karypsis, and V. Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley, 2003.

[49] R. Granat, I. Jonsson, and B. Kagstrom. Recursive Blocked Algorithms for Solving Periodic Triangular Sylvester-type Matrix Equations. In B. Kagstrom et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 531–539. Springer, 2007.

[50] R. Granat and B. Kagstrom. Evaluating Parallel Algorithms for Solving Sylvester-Type Matrix Equations: Direct Transformation-Based versus Iterative Matrix-Sign-Function-Based Methods. In J. Dongarra et al., editor, PARA'04 - State-of-the-Art in Scientific Computing, volume 3732 of Lecture Notes in Computer Science (LNCS), pages 719–729. Springer Verlag, 2005.

[51] R. Granat and B. Kagstrom. Parallel Algorithms and Condition Estimators for Standard and Generalized Triangular Sylvester-Type Matrix Equations. In B. Kagstrom et al., editor, Applied Parallel Computing - State of the Art in Scientific Computing (PARA'06), volume 4699 of Lecture Notes in Computer Science, pages 127–136. Springer, 2007.


[52] R. Granat, B. Kagstrom, and D. Kressner. MATLAB Tools for Solving Periodic Eigenvalue Problems. In Proc. of the 3rd IFAC Workshop PSYCO'07, Saint Petersburg, Russia, 2007.

[53] R. Granat, B. Kagstrom, and D. Kressner. Algorithms and Software Tools for Product and Periodic Eigenvalue Problems. 2007. In preparation.

[54] J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Trans. Math. Softw., 27(4):422–455, 2001.

[55] W. W. Hager. Condition estimates. SIAM J. Sci. Statist. Comput., 5:311–316, 1984.

[56] S. J. Hammarling. Numerical Solution of the Stable, Non-negative Definite Lyapunov Equation. IMA Journal of Numerical Analysis, 2:303–323, 1982.

[57] J. J. Hench and A. J. Laub. Numerical solution of the discrete-time periodic Riccati equation. IEEE Trans. Automat. Control, 39(6):1197–1210, 1994.

[58] G. Henry, D. S. Watkins, and J. J. Dongarra. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. SIAM J. Sci. Comput., 24(1):284–311, 2002.

[59] N. J. Higham. Fortran codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation. ACM Trans. Math. Software, 14(4):381–396, 1988.

[60] N. J. Higham. Perturbation theory and backward error for AX − XB = C. BIT, 33(1):124–136, 1993.

[61] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, PA, second edition, 2002.

[62] The ICPS Group - Scientific Parallel Computing and Imaging. See http://icps.u-strasbg.fr/.

[63] P. Johansson. Software Tools for Matrix Canonical Computations and Web-Based Software Library Environments. PhD Thesis UMINF-06.30, Department of Computing Science, Umea University, SE-901 87 Umea, Sweden, June 2006.

[64] S. Johansson. Stratification of Matrix Pencils in Systems and Control: Theory and Algorithms. Licentiate Thesis, Report UMINF-05.17, Department of Computing Science, Umea University, SE-901 87 Umea, Sweden, 2005.


[65] S. Johansson, B. Kagstrom, A. Shiriaev, and A. Varga. Comparing One-shot and Multi-shot Methods for Solving Periodic Riccati Equations. 3rd IFAC Workshop PSYCO'07, Saint Petersburg, Russia, 2007.

[66] I. Jonsson and B. Kagstrom. Recursive blocked algorithms for solving triangular systems. I. One-sided and coupled Sylvester-type matrix equations. ACM Trans. Math. Software, 28(4):392–415, 2002.

[67] I. Jonsson and B. Kagstrom. Recursive blocked algorithms for solving triangular systems. II. Two-sided and generalized Sylvester and Lyapunov matrix equations. ACM Trans. Math. Software, 28(4):416–435, 2002.

[68] I. Jonsson and B. Kagstrom. RECSY - A High Performance Library for Solving Sylvester-Type Matrix Equations. In H. Kosch et al., editor, Euro-Par 2003 Parallel Processing, volume 2790 of Lecture Notes in Computer Science, pages 810–819. Springer-Verlag, 2003.

[69] B. Kagstrom. A direct method for reordering eigenvalues in the generalized real Schur form of a regular matrix pair (A,B). In Linear algebra for large scale and real-time applications (Leuven, 1992), volume 232 of NATO Adv. Sci. Inst. Ser. E Appl. Sci., pages 195–218. Kluwer Acad. Publ., Dordrecht, 1993.

[70] B. Kagstrom and D. Kressner. Multishift Variants of the QZ Algorithm with Aggressive Early Deflation. SIAM J. Matrix Anal. Appl., 29(1):199–227, 2006.

[71] B. Kagstrom, P. Ling, and C. Van Loan. GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark. ACM Trans. Math. Software, 24(3):268–302, 1998.

[72] B. Kagstrom, P. Ling, and C. Van Loan. Algorithm 784: GEMM-Based Level 3 BLAS: Portability and Optimization Issues. ACM Trans. Math. Software, 24(3):303–316, 1998.

[73] B. Kagstrom and P. Poromaa. Distributed and shared memory block algorithms for the triangular Sylvester equation with sep^{-1} estimators. SIAM J. Matrix Anal. Appl., 13(1):90–101, 1992.

[74] B. Kagstrom and P. Poromaa. Computing eigenspaces with specified eigenvalues of a regular matrix pair (A,B) and condition estimation: theory, algorithms and software. Numer. Algorithms, 12(3-4):369–407, 1996.

[75] B. Kagstrom and P. Poromaa. LAPACK-style algorithms and software for solving the generalized Sylvester equation and estimating the separation between regular matrix pairs. ACM Trans. Math. Software, 22(1):78–103, 1996.


[76] B. Kagstrom and L. Westin. Generalized Schur methods with condition estimators for solving the generalized Sylvester equation. IEEE Trans. Autom. Contr., 34(4):745–751, 1989.

[77] M. Konstantinov, D. Gu, V. Mehrmann, and P. Petkov. Perturbation Theory for Matrix Equations. Number 9 in Studies in Computational Mathematics. Elsevier, North Holland, 2003.

[78] D. Kressner. An efficient and reliable implementation of the periodic QZ algorithm. In IFAC Workshop on Periodic Control Systems, 2001.

[79] D. Kressner. Numerical Methods and Software for General and Structured Eigenvalue Problems. PhD thesis, TU Berlin, Institut fur Mathematik, Berlin, Germany, 2004.

[80] D. Kressner. The periodic QR algorithm is a disguised QR algorithm. Linear Algebra Appl., 417(2–3):423–433, 2006.

[81] V. N. Kublanovskaya. On some algorithms for the solution of the complete eigenvalue problem. USSR Comp. Math. Phys., 3:637–657, 1961.

[82] LAPACK - Linear Algebra Package. See http://www.netlib.org/lapack/.

[83] D. C. Lay. Linear Algebra and its Applications, 2nd edition. Addison-Wesley, 1997.

[84] W.-W. Lin and J.-G. Sun. Perturbation analysis for the eigenproblem of periodic matrix pairs. Linear Algebra Appl., 337:157–187, 2001.

[85] Lecture series in Linear Algebra by Jorgen Lofstrom. University of Gothenburg, 1998–1999.

[86] The MathWorks, Inc., Cochituate Place, 24 Prime Park Way, Natick, Mass. 01760. MATLAB Version 6.5, 2002.

[87] V. Mehrmann. The Autonomous Linear Quadratic Control Problem, Theory and Numerical Solution. Number 163 in Lecture Notes in Control and Information Sciences. Springer-Verlag, Heidelberg, 1991.

[88] MPI - Message Passing Interface. See http://www-unix.mcs.anl.gov/mpi/.

[89] Multiprocessors. See http://www.csee.umbc.edu/~plusquel/611/slides/chap8_1.html.


[90] Niconet Task II: Model Reduction. See http://www.win.tue.nl/niconet/NIC2/NICtask2.html.

[91] N. S. Nise. Control Systems Engineering. Wiley, 2003. Fourth International Edition.

[92] OpenFGG - Open Fortran Gateway Generator. See http://www.cs.umu.se/research/parallel/openfgg.

[93] OpenMP - Simple, Portable, Scalable SMP Programming. See http://www.openmp.org/.

[94] PBLAS - Parallel Basic Linear Algebra Subprograms. See http://www.netlib.org/scalapack/html/pblas_qref.html.

[95] PEP - Matlab tools for solving periodic eigenvalue problems. See http://www.cs.umu.se/research/nla/pep.

[96] P. Poromaa. Parallel Algorithms for Triangular Sylvester Equations: Design, Scheduling and Scalability Issues. In B. Kagstrom et al., editor, Applied Parallel Computing. Large Scale Scientific and Industrial Problems, volume 1541, pages 438–446. Springer-Verlag, Lecture Notes in Computer Science, 1998.

[97] POSIX threads programming. See http://www.llnl.gov/computing/tutorials/pthreads/.

[98] E. S. Quintana-Orti and R. A. van de Geijn. Formal derivation of algorithms: The triangular Sylvester equation. ACM Trans. Math. Software, 29(2):218–243, June 2003.

[99] SCASY - ScaLAPACK-style solvers for Sylvester-type matrix equations. See http://www.cs.umu.se/~granat/scasy.html.

[100] Scientific Tools for Python. See http://www.scipy.org/.

[101] Shared-memory MIMD computers. See http://www.netlib.org/utk/papers/advanced-computers/sm-mimd.html.

[102] SLICOT Library In The Numerics In Control Network (Niconet). See http://www.win.tue.nl/niconet/index.html.

[103] J. Sreedhar and P. Van Dooren. Pole placement via the periodic Schur decomposition. In Proceedings Amer. Contr. Conf., pages 1563–1567, 1993.


[104] R. A. van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. The MIT Press, 1997.

[105] A. Varga. Periodic Lyapunov equations: some applications and new algorithms. Internat. J. Control, 67(1):69–87, 1997.

[106] A. Varga. Balancing related methods for minimal realization of periodic systems. Systems Control Lett., 36(5):339–349, 1999.

[107] A. Varga. Robust and minimum norm pole assignment with periodic state feedback. IEEE Trans. Automat. Control, 45(5):1017–1022, 2000.

[108] A. Varga. On solving discrete-time periodic Riccati equations. In Proc. of 16th IFAC World Congress, Prague, Czech Republic, 2005.

[109] A. Varga. On solving periodic differential matrix equations with applications to periodic system norms computation. In Proc. of CDC'05, Seville, Spain, 2005.

[110] A. Varga and P. Van Dooren. Computational methods for periodic systems - an overview. In Proc. of IFAC Workshop on Periodic Control Systems, Como, Italy, pages 171–176, 2001.

[111] D. S. Watkins. Product Eigenvalue Problems. SIAM Review, 47:3–40, 2005.

[112] S. J. Wright. A collection of problems for which Gaussian elimination with partial pivoting is unstable. SIAM J. Sci. Comput., 14(1):231–238, 1993.


I


Paper I

Direct Eigenvalue Reordering in a Product of Matrices in Periodic Schur Form∗

Robert Granat1 and Bo Kagstrom1

1 Department of Computing Science and HPC2N, Umea University, SE-901 87 Umea, Sweden
{granat, bokg}@cs.umu.se

Abstract: A direct method for eigenvalue reordering in a product of a K-periodic matrix sequence in periodic or extended periodic real Schur form is presented and analyzed. Each reordering of two adjacent sequences of diagonal blocks is performed tentatively to guarantee backward stability and involves solving a K-periodic Sylvester equation (PSE) and constructing a K-periodic sequence of orthogonal transformation matrices. An error analysis of the direct reordering method is presented, and results from computational experiments confirm the stability and accuracy of the method for well-conditioned as well as ill-conditioned problems. These include matrix sequences with fixed and time-varying dimensions, and sequences of small and large periodicity.

Key words: Product of K-periodic matrix sequence, extended periodic real Schur form, eigenvalue reordering, K-periodic Sylvester equation, periodic eigenvalue problem.

∗ Reprinted by permission of SIAM Journal on Matrix Analysis and Applications.


SIAM J. MATRIX ANAL. APPL., Vol. 28, No. 1, pp. 285–300. © 2006 Society for Industrial and Applied Mathematics.

DIRECT EIGENVALUE REORDERING IN A PRODUCT OF MATRICES IN PERIODIC SCHUR FORM∗

ROBERT GRANAT† AND BO KAGSTROM†

Abstract. A direct method for eigenvalue reordering in a product of a K-periodic matrix sequence in periodic or extended periodic real Schur form is presented and analyzed. Each reordering of two adjacent sequences of diagonal blocks is performed tentatively to guarantee backward stability and involves solving a K-periodic Sylvester equation (PSE) and constructing a K-periodic sequence of orthogonal transformation matrices. An error analysis of the direct reordering method is presented, and results from computational experiments confirm the stability and accuracy of the method for well-conditioned as well as ill-conditioned problems. These include matrix sequences with fixed and time-varying dimensions, and sequences of small and large periodicity.

Key words. product of K-periodic matrix sequence, extended periodic real Schur form, eigenvalue reordering, K-periodic Sylvester equation, periodic eigenvalue problem

AMS subject classifications. 65F15, 15A18, 93B60

DOI. 10.1137/05062490X

1. Introduction. Given a K-periodic real matrix sequence A_0, A_1, ..., A_{K-1} with A_{i+K} = A_i, the periodic real Schur form (PRSF) is defined as follows [5, 13]: given the real matrix sequence A_k ∈ R^{n×n}, for k = 0, 1, ..., K-1, there exists an orthogonal matrix sequence Z_k ∈ R^{n×n} such that the real sequence

    Z_{k+1}^T A_k Z_k = T_k,  k = 0, 1, ..., K-1,    (1.1)

with Z_K = Z_0, consists of K-1 upper triangular matrices and one upper quasi-triangular matrix. The products of conforming 1 × 1 and 2 × 2 diagonal blocks of the matrix sequence T_k contain the real and complex conjugate pairs of eigenvalues of the matrix product A_{K-1} ··· A_1 A_0. Similar to the standard case (K = 1; e.g., see [10, 25]), the periodic real Schur form is computed by means of a reduction to periodic Hessenberg form followed by applying a periodic QR-algorithm to the resulting sequence [5, 13]. The PRSF is an important tool in several applications, including solving periodic Sylvester-type and Riccati matrix equations [13, 22, 27, 30]. The quasi-triangular matrix in the PRSF can occur anywhere in the sequence but is usually chosen to be T_0 or T_{K-1}.
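The defining property of (1.1), namely that the eigenvalues of the sequence product are read off as products of conforming diagonal blocks, can be checked numerically. The following is a minimal illustrative sketch (hypothetical data, NumPy assumed, not the paper's software); all factors are taken upper triangular so that only 1 × 1 blocks occur.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 4
# Hypothetical sequence already in (P)RSF with only 1x1 blocks:
# all K factors upper triangular.
T = [np.triu(rng.standard_normal((n, n))) for _ in range(K)]

# Eigenvalues of the product T_{K-1} ... T_1 T_0 ...
prod = np.linalg.multi_dot(T[::-1])
eigs = np.sort(np.linalg.eigvals(prod))

# ... equal the products of the conforming diagonal entries.
diag_prods = np.sort(np.prod([np.diag(Tk) for Tk in T], axis=0))
assert np.allclose(eigs, diag_prods)
```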

The extended periodic real Schur form (EPRSF) generalizes PRSF to the case when the dimensions of the matrices are time-variant [28]: given the real matrix sequence A_k ∈ R^{n_{k+1}×n_k}, k = 0, 1, ..., K-1, with n_K = n_0, there exists an orthogonal matrix sequence Z_k ∈ R^{n_k×n_k}, k = 0, 1, ..., K-1, such that the real sequence

    Z_{k+1}^T A_k Z_k = T_k \equiv \begin{bmatrix} T_{11}^{(k)} & T_{12}^{(k)} \\ 0 & T_{22}^{(k)} \end{bmatrix} \in \mathbb{R}^{n_{k+1} \times n_k},    (1.2)

∗Received by the editors February 21, 2005; accepted for publication (in revised form) by P. Van Dooren January 12, 2006; published electronically April 7, 2006. This research was conducted using the resources of the High Performance Computing Center North (HPC2N). Financial support was provided by the Swedish Research Council under grant VR 621-2001-3284 and by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.
http://www.siam.org/journals/simax/28-1/62490.html
†Department of Computing Science and HPC2N, Umea University, SE-901 87 Umea, Sweden ([email protected], [email protected]).


for k = 0, 1, ..., K-1, with Z_K = Z_0, is block upper triangular with T_{11}^{(k)} ∈ R^{min_k(n_k)×min_k(n_k)} and T_{22}^{(k)} ∈ R^{(n_{k+1}-min_k(n_k))×(n_k-min_k(n_k))}. Moreover, the subsequence T_{11}^{(k)}, k = 0, 1, ..., K-1, is in PRSF (1.1) with eigenvalues called the core characteristic values of the sequence A_k, and the matrices in the subsequence T_{22}^{(k)}, k = 0, 1, ..., K-1, are upper trapezoidal. For EPRSF, the quasi-triangular matrix can occur at any position in the sequence T_k. However, to simplify the reduction to extended periodic Hessenberg form it is normally placed at position j, where n_{j+1} = min_k(n_k), i.e., in the matrix T_j which has the smallest row dimension in the sequence [28]. For T_j, j ∈ [0, K-1], to have a trapezoidal block T_{22}^{(j)}, it must hold that n_j, n_{j+1} > min_k(n_k). The EPRSF is motivated by the increasing interest in discrete-time periodic systems of the form

    x_{k+1} = A_k x_k + B_k u_k,
    y_k = C_k x_k + D_k u_k,    (1.3)

where the matrices A_k ∈ R^{n_{k+1}×n_k}, B_k ∈ R^{n_{k+1}×m}, C_k ∈ R^{r×n_k}, and D_k ∈ R^{r×m} are periodic with periodicity K ≥ 1. The state transition matrix of the system (1.3) is defined as the n_j × n_i matrix Φ_A(j, i) = A_{j-1} A_{j-2} \cdots A_i, where Φ_A(i, i) = I_{n_i}. The state transition matrix over one whole period, Φ_A(j + K, j) ∈ R^{n_j×n_j}, is called the monodromy matrix of (1.3) at time j, and its eigenvalues are called the characteristic multipliers at time j. All t nonzero together with (min_k(n_k) - t) zero characteristic multipliers belong to the set of core characteristic values. One important issue is how to reorder the eigenvalues of the monodromy matrix without evaluating the corresponding product. Evaluating the product is costly and may lead to a significant loss of accuracy [5], especially when computing eigenvalues of small magnitude.
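The time-shift behavior of the monodromy matrix can be illustrated directly: the nonzero characteristic multipliers, i.e., the core characteristic values, are independent of the time index j, while larger monodromy matrices carry extra zero multipliers. A minimal sketch with hypothetical data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical K = 2 time-varying system with n_0 = 3, n_1 = 2:
# A_k maps R^{n_k} -> R^{n_{k+1}}.
A0 = rng.standard_normal((2, 3))   # A_0 in R^{n_1 x n_0}
A1 = rng.standard_normal((3, 2))   # A_1 in R^{n_0 x n_1}

M0 = A1 @ A0   # monodromy matrix Phi_A(2, 0), size 3 x 3
M1 = A0 @ A1   # monodromy matrix Phi_A(3, 1), size 2 x 2

e0 = np.linalg.eigvals(M0)
e1 = np.linalg.eigvals(M1)

# The min_k(n_k) = 2 core characteristic values are shared between the two
# monodromy matrices; the larger one picks up one extra zero multiplier.
nz = np.sort_complex(e0[np.abs(e0) > 1e-10])
assert np.allclose(nz, np.sort_complex(e1))
assert np.sum(np.abs(e0) <= 1e-10) == 1
```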

Direct eigenvalue reordering in the real Schur form was investigated in [2, 8, 7] and in the generalized Schur form of a regular matrix pencil A - λB in [16, 18]. Iterative QR-based reordering methods have also been proposed [23, 26], but they may fail to converge (e.g., see [16, 18]). Reordering of eigenvalues in PRSF and related problems has also been considered; see, e.g., [5], where the approach is based on applying Givens rotations on explicitly formed products of small (2 × 2, 3 × 3, or 4 × 4) matrix sequences, and [6] for a discussion on swapping 1 × 1 blocks by propagating orthogonal transformations through 2 × 2 sequences. In this paper, we present a direct swapping algorithm for performing eigenvalue reordering in a product of a K-periodic matrix sequence in (E)PRSF for K ≥ 2 without evaluating any part of the matrix product. Our direct algorithm relies on orthogonal transformations only and extends earlier work on direct eigenvalue reordering of matrices to products of matrices [11, 19].

The rest of this paper is organized as follows. In section 2, we settle some important notation and definitions. In section 3, we discuss reordering of two diagonal blocks (leaving the eigenvalues invariant) by cyclic orthogonal transformations, and in section 4, we present our direct periodic reordering algorithm. Next, we discuss the numerical solution of the associated periodic Sylvester equation (PSE) in section 5. An error analysis of the direct periodic swapping algorithm is presented in section 6. Some numerical examples are presented and discussed in section 7, and finally, we outline some future work in section 8.

2. Notation and definitions. We introduce some notation to simplify the presentation that follows. Let I_n denote the identity matrix of order n. Let M^+ denote the pseudoinverse (see, e.g., [10]) of a matrix M. Let σ(M) and λ(M) denote the sets of the singular values and the eigenvalues of the matrix M, respectively. Let A ⊗ B denote the Kronecker product of two matrices, defined as the matrix with its (i, j)-block element as a_{ij}B. Let vec(A) denote a vector representation of an m × n matrix A with the columns of A stacked on top of each other in the order 1, 2, ..., n. Let ‖A‖_F denote the Frobenius matrix norm, defined as ‖A‖_F = \sqrt{\mathrm{trace}(A^T A)}. We define the periodic addition operator ⊕ such that a ⊕ b = (a + b) mod K, where K denotes the periodicity. We use the product operator \prod_{k=i}^{j} b_k to denote a product b_i b_{i-1} \cdots b_{j+1} b_j of scalars, with the convention that \prod_{k=i}^{j} b_k = 1 for i < j.

Each K-periodic matrix sequence A_k is associated with a matrix tuple A = (A_{K-1}, A_{K-2}, ..., A_1, A_0) [4]. The vector tuple u = (u_{K-1}, u_{K-2}, ..., u_1, u_0), with u_k ≠ 0, is called a right eigenvector of the tuple A corresponding to the eigenvalue λ if there exist scalars α_k, possibly complex, such that the relations

    A_k u_k = \alpha_k u_{k \oplus 1},  k = 0, 1, ..., K-1,    \lambda := \prod_{k=K-1}^{0} \alpha_k    (2.1)

hold with u_K = u_0. A left eigenvector v of the tuple A corresponding to λ is defined similarly:

    v_{k \oplus 1}^H A_k = \beta_k v_k^H,  k = 0, 1, ..., K-1,    \lambda := \prod_{k=K-1}^{0} \beta_k,    (2.2)

where v_k ≠ 0, and β_k are (possibly complex) scalars for k = 0, 1, ..., K-1.

Without loss of generality, we assume that p < min_k(n_k) is specified such that no 2 × 2 block corresponding to a complex conjugate pair of eigenvalues is positioned at rows (and columns) p and p + 1 of Φ_T(K, 0). Given such a p and with Z_k and T_k from (1.2), the leading p columns of each Z_k span an invariant subspace for Φ_T(K + k, k) for k = 0, 1, ..., K-1. As a whole, the space spanned by the first p columns of each matrix in the matrix tuple Z is called a periodic invariant subspace of the tuple A corresponding to the p eigenvalues located in the upper-leftmost part of Φ_T(K, 0). In general, Φ_T(K, 0)_{ij} denotes the (i, j) block of the matrix product Φ_T(K, 0).
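The eigenvector relations (2.1) can be illustrated on an upper triangular sequence, for which u_k = e_1 works with α_k equal to the leading diagonal entries. A minimal sketch with hypothetical data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
K, n = 3, 4
T = [np.triu(rng.standard_normal((n, n))) for _ in range(K)]

# For upper triangular factors, u_k = e_1 for all k satisfies (2.1):
# T_k u_k = alpha_k u_{k (+) 1} with alpha_k = (T_k)[0, 0].
e1 = np.eye(n)[:, 0]
alphas = [Tk[0, 0] for Tk in T]
for k, Tk in enumerate(T):
    assert np.allclose(Tk @ e1, alphas[k] * e1)

# The associated eigenvalue of the product is the product of the alphas.
lam = np.prod(alphas)
prod_eigs = np.linalg.eigvals(np.linalg.multi_dot(T[::-1]))
assert np.any(np.isclose(prod_eigs, lam))
```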

3. Reordering diagonal blocks in a product of matrices in EPRSF by orthogonal transformations. Consider the K-periodic (or K-cyclic) matrix sequences A_k ∈ R^{n_{k⊕1}×n_k}, T_k ∈ R^{n_{k⊕1}×n_k}, and Z_k ∈ R^{n_k×n_k}, k = 0, 1, ..., K-1, such that A_k is general, T_k is in EPRSF, and Z_k is the corresponding orthogonal transformation, as in (1.2). The eigenvalues of the product Φ_T(K, 0) = T_{K-1} T_{K-2} \cdots T_1 T_0 ∈ R^{n_0×n_0} are contained in the diagonal blocks of size 1 × 1 (real) and 2 × 2 (complex conjugate pairs) of Φ_T(K, 0).

Assume that each T_k, k = 0, 1, ..., K-1, is partitioned as

    T_k = \begin{bmatrix} T_{11}^{(k)} & \ast & \ast & \ast \\ 0 & T_{22}^{(k)} & \ast & \ast \\ 0 & 0 & T_{33}^{(k)} & \ast \\ 0 & 0 & 0 & T_{44}^{(k)} \end{bmatrix},    (3.1)

where T_{11}^{(k)} ∈ R^{p_1×p_1}, T_{22}^{(k)} ∈ R^{p_2×p_2}, T_{33}^{(k)} ∈ R^{p_3×p_3}, T_{44}^{(k)} ∈ R^{(n_{k⊕1}-p)×(n_k-p)}, k = 0, 1, ..., K-1, and p = p_1 + p_2 + p_3. Assume there exists a K-cyclic orthogonal matrix


sequence Q_k, k = 0, 1, ..., K-1, such that the cyclic transformation

    Q_{k \oplus 1}^T \begin{bmatrix} T_{22}^{(k)} & \ast \\ 0 & T_{33}^{(k)} \end{bmatrix} Q_k = \begin{bmatrix} \tilde T_{22}^{(k)} & \ast \\ 0 & \tilde T_{33}^{(k)} \end{bmatrix}    (3.2)

results in λ(\tilde Φ_T(K, 0)_{22}) = λ(Φ_T(K, 0)_{33}) and λ(\tilde Φ_T(K, 0)_{33}) = λ(Φ_T(K, 0)_{22}). Then the reordered EPRSF of the sequence A_k is the sequence \tilde T_k, where

    \tilde T_k = \underbrace{\begin{bmatrix} I_{p_1} & 0 & 0 \\ 0 & Q_{k \oplus 1}^T & 0 \\ 0 & 0 & I_{p_4} \end{bmatrix}}_{\tilde Q_{k \oplus 1}^T}
    \begin{bmatrix} T_{11}^{(k)} & \ast & \ast & \ast \\ 0 & T_{22}^{(k)} & \ast & \ast \\ 0 & 0 & T_{33}^{(k)} & \ast \\ 0 & 0 & 0 & T_{44}^{(k)} \end{bmatrix}
    \underbrace{\begin{bmatrix} I_{p_1} & 0 & 0 \\ 0 & Q_k & 0 \\ 0 & 0 & I_{p_4} \end{bmatrix}}_{\tilde Q_k}    (3.3)

    = \tilde Q_{k \oplus 1}^T T_k \tilde Q_k = \tilde Q_{k \oplus 1}^T Z_{k \oplus 1}^T A_k Z_k \tilde Q_k = \tilde Z_{k \oplus 1}^T A_k \tilde Z_k,

with the associated K-cyclic orthogonal sequence \tilde Z_k = Z_k \tilde Q_k, k = 0, 1, ..., K-1. The first p_1 + p_3 columns of \tilde Z_0 span an orthonormal basis for the invariant subspace of Φ_A(K, 0) associated with the first p_1 + p_3 eigenvalues in the upper left part of the product \tilde Φ_T(K, 0). In addition, the first p_1 + p_3 columns of each transformation matrix \tilde Z_k in the tuple (\tilde Z_{K-1}, \tilde Z_{K-2}, ..., \tilde Z_1, \tilde Z_0) span an orthonormal basis for the periodic invariant subspace of the tuple A associated with the same p_1 + p_3 eigenvalues in \tilde Φ_T(K, 0).

4. A direct algorithm for periodic diagonal block reordering. In this section, we focus on the K-cyclic swapping in (3.3). Without loss of generality, we assume that T_k in (3.1) is square, i.e., the sequence T_k is in PRSF, and partitioned as

    T_k = \begin{bmatrix} T_{11}^{(k)} & T_{12}^{(k)} \\ 0 & T_{22}^{(k)} \end{bmatrix},  k = 0, 1, ..., K-1,    (4.1)

and that we want to swap the blocks T_{11}^{(k)} ∈ R^{p_1×p_1} and T_{22}^{(k)} ∈ R^{p_2×p_2}. Throughout the paper we assume that Φ_T(K, 0)_{11} and Φ_T(K, 0)_{22} are of size 2 × 2 or 1 × 1 and have no eigenvalues in common; otherwise, the diagonal blocks need not be swapped. Define the K-cyclic matrix sequence \hat X_k as

    \hat X_k \equiv \begin{bmatrix} I_{p_1} & X_k \\ 0 & I_{p_2} \end{bmatrix},    (4.2)

where X_k ∈ R^{p_1×p_2}, k = 0, 1, ..., K-1. The key observation is that the cyclic transformation

    \hat X_{k \oplus 1}^{-1} \begin{bmatrix} T_{11}^{(k)} & T_{12}^{(k)} \\ 0 & T_{22}^{(k)} \end{bmatrix} \hat X_k = \begin{bmatrix} T_{11}^{(k)} & T_{12}^{(k)} + T_{11}^{(k)} X_k - X_{k \oplus 1} T_{22}^{(k)} \\ 0 & T_{22}^{(k)} \end{bmatrix}    (4.3)

block-diagonalizes T_k, k = 0, 1, ..., K-1, if and only if the sequence X_k satisfies the periodic Sylvester equation (PSE)

    T_{11}^{(k)} X_k - X_{k \oplus 1} T_{22}^{(k)} = -T_{12}^{(k)},  k = 0, 1, ..., K-1.    (4.4)
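The equivalence between (4.3) and (4.4) rests on the (1, 2) block of the transformed matrix being exactly T_{12}^{(k)} + T_{11}^{(k)} X_k - X_{k⊕1} T_{22}^{(k)}. The sketch below checks this identity for an arbitrary, not yet PSE-solving, sequence X_k, assuming NumPy and hypothetical data.

```python
import numpy as np

rng = np.random.default_rng(3)
K, p1, p2 = 3, 2, 2
n = p1 + p2
T = [np.triu(rng.standard_normal((n, n))) for _ in range(K)]
X = [rng.standard_normal((p1, p2)) for _ in range(K)]

def embed(Xk):
    # The block matrix of (4.2): [[I, X_k], [0, I]].
    M = np.eye(p1 + p2)
    M[:p1, p1:] = Xk
    return M

for k in range(K):
    kk = (k + 1) % K
    # Cyclic transformation (4.3) applied to T_k.
    Y = np.linalg.inv(embed(X[kk])) @ T[k] @ embed(X[k])
    # Its (1,2) block equals T12 + T11 X_k - X_{k(+)1} T22 exactly,
    # so it vanishes precisely when the PSE (4.4) holds.
    expect = T[k][:p1, p1:] + T[k][:p1, :p1] @ X[k] - X[kk] @ T[k][p1:, p1:]
    assert np.allclose(Y[:p1, p1:], expect)
```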


Replacing I_{p_2} in \hat X_0 (4.2) by a p_2 × p_2 zero block results in a spectral projector (e.g., see [25]) associated with the matrix product Φ_T(K, 0) that projects onto the spectrum of Φ_T(K, 0)_{11}. We refer to the matrix X_0 as the generator matrix for the periodic reordering of the product Φ_T(K, 0).

The similarity transformation

    S_0^{-1} T_{K-1} S_{K-1} S_{K-1}^{-1} T_{K-2} S_{K-2} \cdots S_2^{-1} T_1 S_1 S_1^{-1} T_0 S_0
    = \begin{bmatrix} T_{22}^{(K-1)} & 0 \\ 0 & T_{11}^{(K-1)} \end{bmatrix} \cdots \begin{bmatrix} T_{22}^{(1)} & 0 \\ 0 & T_{11}^{(1)} \end{bmatrix} \begin{bmatrix} T_{22}^{(0)} & 0 \\ 0 & T_{11}^{(0)} \end{bmatrix}

performs the wanted swapping of the diagonal blocks by the nonorthogonal sequence

    S_k = \hat X_k \begin{bmatrix} 0 & I_{p_1} \\ I_{p_2} & 0 \end{bmatrix} = \begin{bmatrix} X_k & I_{p_1} \\ I_{p_2} & 0 \end{bmatrix},  k = 0, 1, ..., K-1.

Since the first p_2 columns of each S_k are linearly independent, there exist orthogonal matrices Q_k of order p_1 + p_2 such that

    D_k \equiv \begin{bmatrix} X_k \\ I_{p_2} \end{bmatrix} = Q_k \begin{bmatrix} R_k \\ 0 \end{bmatrix},    (4.5)

where R_k of size p_2 × p_2 is upper triangular and nonsingular, k = 0, 1, ..., K-1. By partitioning Q_k conformally with S_k, we observe that

    Q_k^T S_k = \begin{bmatrix} R_k & Q_{11}^{(k)T} \\ 0 & Q_{12}^{(k)T} \end{bmatrix},
    S_k^{-1} Q_k = \begin{bmatrix} R_k^{-1} & -R_k^{-1} Q_{11}^{(k)T} Q_{12}^{(k)-T} \\ 0 & Q_{12}^{(k)-T} \end{bmatrix}.

An orthonormal similarity transformation of Φ_T(K, 0) can now be written as

    Q_0^T (T_{K-1} T_{K-2} \cdots T_1 T_0) Q_0 = Q_0^T T_{K-1} Q_{K-1} Q_{K-1}^T T_{K-2} Q_{K-2} \cdots Q_2^T T_1 Q_1 Q_1^T T_0 Q_0

    = Q_0^T S_0 \begin{bmatrix} T_{22}^{(K-1)} & 0 \\ 0 & T_{11}^{(K-1)} \end{bmatrix} S_{K-1}^{-1} Q_{K-1} \cdot Q_{K-1}^T S_{K-1} \begin{bmatrix} T_{22}^{(K-2)} & 0 \\ 0 & T_{11}^{(K-2)} \end{bmatrix} S_{K-2}^{-1} Q_{K-2}

    \cdots Q_2^T S_2 \begin{bmatrix} T_{22}^{(1)} & 0 \\ 0 & T_{11}^{(1)} \end{bmatrix} S_1^{-1} Q_1 \cdot Q_1^T S_1 \begin{bmatrix} T_{22}^{(0)} & 0 \\ 0 & T_{11}^{(0)} \end{bmatrix} S_0^{-1} Q_0 = \tilde T_{K-1} \tilde T_{K-2} \cdots \tilde T_1 \tilde T_0,

where

    \tilde T_k = \begin{bmatrix} \tilde T_{11}^{(k)} & \tilde T_{12}^{(k)} \\ 0 & \tilde T_{22}^{(k)} \end{bmatrix}

and

    \tilde T_{11}^{(k)} = R_{k \oplus 1} T_{22}^{(k)} R_k^{-1},
    \tilde T_{22}^{(k)} = Q_{12}^{(k \oplus 1)T} T_{11}^{(k)} Q_{12}^{(k)-T},
    \tilde T_{12}^{(k)} = -R_{k \oplus 1} T_{22}^{(k)} R_k^{-1} Q_{11}^{(k)T} Q_{12}^{(k)-T} + Q_{11}^{(k \oplus 1)T} T_{11}^{(k)} Q_{12}^{(k)-T}    (4.6)

for k = 0, 1, ..., K-1. Thus, the orthogonal sequence Q_k from (4.5) performs the required reordering of the diagonal blocks. Observe that the sequences \tilde T_{11}^{(k)} and \tilde T_{22}^{(k)} in (4.6) may not be in PRSF and might have to be further transformed after periodic reordering by additional orthogonal transformations to get the sequence \tilde T_k in PRSF.

We summarize our direct algorithm for periodic eigenvalue reordering as follows:

Step 1. Solve for the sequence X_k, k = 0, 1, ..., K-1, in the PSE

    T_{11}^{(k)} X_k - X_{k \oplus 1} T_{22}^{(k)} = -T_{12}^{(k)},  k = 0, 1, ..., K-1.

Step 2. Compute K orthogonal matrices Q_k such that

    \begin{bmatrix} X_k \\ I_{p_2} \end{bmatrix} = Q_k \begin{bmatrix} R_k \\ 0 \end{bmatrix},  k = 0, 1, ..., K-1.

Step 3. Perform reordering by the cyclic transformations

    \tilde T_k = Q_{k \oplus 1}^T T_k Q_k,  k = 0, 1, ..., K-1.    (4.7)

Step 4. Restore the subsequences \tilde T_{11}^{(k)} and \tilde T_{22}^{(k)} to PRSF using K-cyclic orthogonal transformations.

Step 4 is conducted by computing PRSFs of the two K-periodic subsequences \tilde T_{11}^{(k)} and \tilde T_{22}^{(k)}. Care must be taken to assure that each of the two quasi-triangular matrices in the PRSFs appears in the same position of the \tilde T_k sequence, say \tilde T_i. However, for a K-periodic 2 × 2 sequence it is sufficient to compute a periodic Hessenberg form [5] specifying the position of the 2 × 2 Hessenberg matrix, given that the complex conjugate pair has not collapsed into two real eigenvalues because of round-off errors.

In the presence of rounding errors, the most critical step in the reordering process is to solve the PSE. In analogy to eigenvalue swapping in the real (generalized) Schur form, a small sep-function (defined in equation (5.3)) may ruin backward stability and thus forces us to perform the swapping tentatively to guarantee backward stability [2, 16, 18]. See also Kressner [19] for a brief discussion on direct swapping methods for PRSF.

The direct algorithm extends directly to EPRSF by considering reordering of the core characteristic values (see section 2) of the sequence T_k.
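Steps 1 to 3 can be prototyped for the simplest case of two 1 × 1 blocks (p_1 = p_2 = 1, K = 2); Step 4 is vacuous here since the swapped blocks are scalars. This is an illustrative sketch assuming NumPy, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(6)
K = 2
T = [np.triu(rng.standard_normal((2, 2))) for _ in range(K)]
t11 = [Tk[0, 0] for Tk in T]
t22 = [Tk[1, 1] for Tk in T]
t12 = [Tk[0, 1] for Tk in T]

# Step 1: scalar PSE  t11_k x_k - x_{k(+)1} t22_k = -t12_k  (a 2x2 system).
A = np.array([[t11[0], -t22[0]], [-t22[1], t11[1]]])
x = np.linalg.solve(A, [-t12[0], -t12[1]])

# Step 2: QR-factorize (x_k, 1)^T to get the 2x2 orthogonal Q_k of (4.5).
Q = [np.linalg.qr(np.array([[x[k]], [1.0]]), mode="complete")[0] for k in range(K)]

# Step 3: cyclic transformations (4.7).
Tt = [Q[(k + 1) % K].T @ T[k] @ Q[k] for k in range(K)]

# The transformed sequence stays triangular and the two eigenvalues of the
# product (the diagonal-entry products) have traded places.
for Tk in Tt:
    assert abs(Tk[1, 0]) < 1e-10
assert np.isclose(Tt[0][0, 0] * Tt[1][0, 0], t22[0] * t22[1])
assert np.isclose(Tt[0][1, 1] * Tt[1][1, 1], t11[0] * t11[1])
```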

5. The periodic Sylvester equation. In analogy with solving the standard Sylvester equation (e.g., see [3]), we construct a matrix representation Z_PSE of the periodic Sylvester operator defined by the PSE (4.4) in terms of Kronecker products, where

    Z_{PSE} =
    \begin{bmatrix}
    -T_{22}^{(K-1)T} \otimes I_{p_1} & & & I_{p_2} \otimes T_{11}^{(K-1)} \\
    I_{p_2} \otimes T_{11}^{(0)} & -T_{22}^{(0)T} \otimes I_{p_1} & & \\
    & \ddots & \ddots & \\
    & & I_{p_2} \otimes T_{11}^{(K-2)} & -T_{22}^{(K-2)T} \otimes I_{p_1}
    \end{bmatrix}.    (5.1)

Only the nonzero blocks of Z_PSE are displayed explicitly in (5.1). Then we solve the resulting linear system of equations Z_PSE x = c, with x and c as stacked vector


representations of the matrix sequences X_k, for k = 0, 1, ..., K-1, and -T_{12}^{(k)}, k = K-1, 0, 1, ..., K-2, respectively:

    x = \begin{bmatrix} \mathrm{vec}(X_0) \\ \mathrm{vec}(X_1) \\ \vdots \\ \mathrm{vec}(X_{K-1}) \end{bmatrix},
    c = \begin{bmatrix} \mathrm{vec}(-T_{12}^{(K-1)}) \\ \mathrm{vec}(-T_{12}^{(0)}) \\ \vdots \\ \mathrm{vec}(-T_{12}^{(K-2)}) \end{bmatrix}.    (5.2)

To exploit the structure of the matrix Z_PSE, Gaussian elimination with partial pivoting (GEPP) is used at the cost of O(K(p_1^2 p_2 + p_1 p_2^2)) flops, possibly combined with fixed precision iterative refinement for improved accuracy on badly scaled problems. By storing only the block main diagonal, the block subdiagonal, and the rightmost block column vector, the storage requirement for Z_PSE can be kept at 3K p_1^2 p_2^2.

Linear systems with this kind of sparsity structure, bordered almost block diagonal (BABD) linear systems, were studied extensively in [9, 32]. It appears that there exists no general-purpose numerically stable method designed specifically for BABD systems, and it is not clear under what conditions (if any) GEPP is stable for solving PSEs of the form (5.1). As an alternative, it is possible to consider QR-factorizations [9] for solving (5.1). However, by introducing explicit stability tests (see section 7) the resulting periodic reordering algorithm is conditionally backward stable by rejecting swaps that appear unstable by some given criterion.

One could employ Gaussian elimination with complete pivoting (GECP) to solve this linear system (see, e.g., LAPACK's DTGSYL [18]), but that would make it difficult, if not impossible, to exploit the sparsity structure of the problem. The complete pivoting process causes fill-in elements, requires explicit storage of the whole matrix Z_PSE, and increases the number of flops to O((K p_1 p_2)^3).
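The assembly of (5.1) and the solve Z_PSE x = c can be prototyped directly with dense Kronecker products. The sketch below (hypothetical data, NumPy assumed) ignores the sparsity that the GEPP solver exploits and simply verifies that the recovered X_k satisfy the PSE (4.4).

```python
import numpy as np

rng = np.random.default_rng(4)
K, p1, p2 = 3, 2, 2
T11 = [np.triu(rng.standard_normal((p1, p1))) for _ in range(K)]
T22 = [np.triu(rng.standard_normal((p2, p2))) for _ in range(K)]
T12 = [rng.standard_normal((p1, p2)) for _ in range(K)]

# Assemble Z_PSE as in (5.1): row block r holds equation k = (r - 1) mod K,
# column block j holds the unknown vec(X_j).
m = p1 * p2
Z = np.zeros((K * m, K * m))
for r in range(K):
    k = (r - 1) % K
    kk = (k + 1) % K
    Z[r*m:(r+1)*m, kk*m:(kk+1)*m] += -np.kron(T22[k].T, np.eye(p1))
    Z[r*m:(r+1)*m, k*m:(k+1)*m] += np.kron(np.eye(p2), T11[k])
# Right-hand side stacked as in (5.2), with column-major vec.
c = np.concatenate([(-T12[(r - 1) % K]).reshape(-1, order="F") for r in range(K)])

x = np.linalg.solve(Z, c)
X = [x[k*m:(k+1)*m].reshape((p1, p2), order="F") for k in range(K)]
for k in range(K):
    assert np.allclose(T11[k] @ X[k] - X[(k + 1) % K] @ T22[k], -T12[k])
```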

Also in analogy with the standard Sylvester equation (e.g., see [14, 17]), the conditioning of the PSE is related to the sep-function

    \mathrm{sep}[\mathrm{PSE}] = \inf_{\|x\|_2 = 1} \|Z_{PSE}\, x\|_2 = \|Z_{PSE}^{-1}\|_2^{-1} = \sigma_{\min}(Z_{PSE})    (5.3)

    = \inf_{(\sum_{k=0}^{K-1} \|X_k\|_F^2)^{1/2} = 1} \left( \sum_{k=0}^{K-1} \|T_{11}^{(k)} X_k - X_{k \oplus 1} T_{22}^{(k)}\|_F^2 \right)^{1/2}.

The quantity sep[PSE] can be estimated at the cost of solving a few PSEs by exploiting the estimation technique for the 1-norm of the inverse of a matrix [12, 14, 17, 18].
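The identity sep[PSE] = σ_min(Z_PSE) in (5.3), and the resulting lower bound ‖Z_PSE x‖_2 ≥ sep[PSE] for unit x, can be illustrated on any nonsingular matrix standing in for the operator. The sketch below assumes NumPy and uses a hypothetical random stand-in for Z_PSE.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical small stand-in for the operator matrix of (5.1).
Z = rng.standard_normal((8, 8))

# sep = smallest singular value = 1 / ||Z^{-1}||_2.
sep = np.linalg.svd(Z, compute_uv=False)[-1]
assert np.isclose(sep, 1.0 / np.linalg.norm(np.linalg.inv(Z), 2))

# Every unit vector is mapped to something of length at least sep.
for _ in range(100):
    v = rng.standard_normal(8)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(Z @ v) >= sep - 1e-12
```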

6. Error analysis. In this section, we present an error analysis of the direct reordering method presented in section 4, where we extend the analysis from [2, 16] to the periodic case. For K = 1 we also get sharper error bounds compared to [2].

6.1. Perturbation of individual matrices under periodic reordering. If Householder reflections are used to compute the orthogonal sequence Q_k, k = 0, 1, ..., K-1, each matrix Q_k is orthogonal up to machine precision [31], and the stability of the direct reordering method is mainly affected by the conditioning and accuracy of the solution to the associated PSE.

Without loss of generality, we assume that p_1 = p_2 = 2. Let \tilde X_k be the computed solution sequence to the PSE (4.4), where \tilde X_k = X_k + ΔX_k, X_k is the exact and unique solution sequence, and ΔX_k is the corresponding error matrix for k = 0, 1, ..., K-1. We let

    Y_k \equiv T_{11}^{(k)} \tilde X_k - \tilde X_{k \oplus 1} T_{22}^{(k)} + T_{12}^{(k)} = T_{11}^{(k)} \Delta X_k - \Delta X_{k \oplus 1} T_{22}^{(k)}    (6.1)


denote the residual sequence associated with the computed PSE solution sequence. Under mild conditions (such as ‖D_k^+‖_2 ‖ΔX_k‖_F < 1, where D_k is defined in (4.5)), the K QR-factorizations of (\tilde X_k, I)^T can be written as

    \begin{bmatrix} X_k + \Delta X_k \\ I \end{bmatrix} = D_k + \begin{bmatrix} \Delta X_k \\ 0 \end{bmatrix} = \tilde Q_k \begin{bmatrix} \tilde R_k \\ 0 \end{bmatrix} = (Q_k + \Delta Q_k) \begin{bmatrix} R_k + \Delta R_k \\ 0 \end{bmatrix},

where ΔQ_k and ΔR_k are perturbations of the orthogonal matrices Q_k and the triangular matrices R_k, and \tilde Q_k = Q_k + ΔQ_k is orthogonal [24]. Here ‖ΔQ_k‖_F and ‖ΔR_k‖_F are essentially bounded by ‖D_k^+‖_2 ‖ΔX_k‖_F, k = 0, 1, ..., K-1 [24, 2]. We do not assume anything about the structure of these perturbation matrices.

Given the computed sequences \tilde X_k and \tilde Q_k, the following theorem shows how the errors in these quantities propagate to the results of the direct method for reordering two adjacent sequences of diagonal blocks in the periodic Schur form.

Theorem 6.1. Let \tilde X_k = X_k + ΔX_k with ΔX_k ≠ 0 nonsingular, \tilde Q_k, and the residual sequence Y_k (6.1) be given for k = 0, 1, ..., K-1. By applying the computed sequence of transformations \tilde Q_k from a periodic reordering of the (1, 1) and (2, 2) blocks of T_k (4.1) in a cyclic transformation, we get

    \hat T_k \equiv \tilde Q_{k \oplus 1}^T \begin{bmatrix} T_{11}^{(k)} & T_{12}^{(k)} \\ 0 & T_{22}^{(k)} \end{bmatrix} \tilde Q_k = \tilde T_k + E_k,    (6.2)

where

    \tilde T_k = \begin{bmatrix} \tilde T_{11}^{(k)} & \tilde T_{12}^{(k)} \\ 0 & \tilde T_{22}^{(k)} \end{bmatrix},
    E_k = \begin{bmatrix} E_{11}^{(k)} & E_{12}^{(k)} \\ E_{21}^{(k)} & E_{22}^{(k)} \end{bmatrix}    (6.3)

for k = 0, 1, ..., K-1. Then the error matrices E_k satisfy the following norm bounds up to first order perturbations:

    \|E_{11}^{(k)}\|_2 \le \frac{\sigma_{\max}(X_{k \oplus 1})}{(1 + \sigma_{\max}^2(X_{k \oplus 1}))^{1/2}} \cdot \frac{1}{(1 + \sigma_{\min}^2(X_k))^{1/2}} \|Y_k\|_F
        + 2\|T_{11}^{(k)}\|_2 \left( \|D_k^+\|_2 \|\Delta X_k\|_F + \|D_{k \oplus 1}^+\|_2 \|\Delta X_{k \oplus 1}\|_F \right),    (6.4)

    \|E_{21}^{(k)}\|_2 \le \frac{1}{(1 + \sigma_{\min}^2(X_{k \oplus 1}))^{1/2}} \cdot \frac{1}{(1 + \sigma_{\min}^2(X_k))^{1/2}} \|Y_k\|_F,    (6.5)

    \|E_{22}^{(k)}\|_2 \le \frac{1}{(1 + \sigma_{\min}^2(X_{k \oplus 1}))^{1/2}} \cdot \frac{\sigma_{\max}(X_k)}{(1 + \sigma_{\max}^2(X_k))^{1/2}} \|Y_k\|_F.    (6.6)

Proof. Transform the sequence T_k with \tilde Q_k in a cyclic transformation:

    \tilde Q_{k \oplus 1}^T T_k \tilde Q_k = \underbrace{Q_{k \oplus 1}^T T_k Q_k}_{\tilde T_k} + \Delta Q_{k \oplus 1}^T T_k Q_k + Q_{k \oplus 1}^T T_k \Delta Q_k + \Delta Q_{k \oplus 1}^T T_k \Delta Q_k.

Let Z_k = Q_k^T ΔQ_k. From (Q_k + ΔQ_k)^T (Q_k + ΔQ_k) = I we have that Q_k^T ΔQ_k = -ΔQ_k^T Q_k up to first order, and by dropping the second order term, we get

    \tilde Q_{k \oplus 1}^T T_k \tilde Q_k = \tilde T_k + \tilde T_k Z_k - Z_{k \oplus 1} \tilde T_k

for k = 0, 1, ..., K-1.


Let Ek denote the error matrix corresponding to the kth cyclic transformation(4.7), i.e., Tk = Tk + Ek. Partition Zk, k = 0, 1, . . . , K − 1 conformally with Tk andobserve that

QTk⊕1TkQk = Tk + Ek =

[T

(k)11 T

(k)12

0 T(k)22

]+

[E

(k)11 E

(k)12

E(k)21 E

(k)22

],

where

Ek =

[E

(k)11 E

(k)12

E(k)21 E

(k)22

]= TkZk − Zk⊕1Tk,

i.e.,⎧⎪⎪⎪⎨⎪⎪⎪⎩

E(k)11 = T

(k)11 Z

(k)11 + T

(k)12 Z

(k)21 − Z

(k⊕1)11 T

(k)11 ,

E(k)12 = T

(k)11 Z

(k)12 + T

(k)12 Z

(k)22 − Z

(k⊕1)11 T

(k)12 − Z

(k⊕1)12 T

(k)22 ,

E(k)21 = T

(k)22 Z

(k)21 − Z

(k⊕1)21 T

(k)11 ,

E(k)22 = T

(k)22 Z

(k)22 − Z

(k⊕1)22 T

(k)22 − Z

(k⊕1)21 T

(k)12 .

(6.7)

As we will show below, $E_{22}^{(k)}$ and $E_{11}^{(k)}$ perturb the eigenvalues of the matrix product $\Phi_A(K, 0)$ directly but do not affect stability. $E_{21}^{(k)}$ is critical since it affects both the stability of the reordering and the eigenvalues. $E_{12}^{(k)}$ is of minor interest since it neither perturbs the eigenvalues explicitly nor affects the stability. The task is now to derive norm bounds for the error matrix blocks $E_{11}^{(k)}$, $E_{21}^{(k)}$, and $E_{22}^{(k)}$.
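Since (6.7) is exact block algebra (no first-order truncation is involved at this step), it can be checked directly. The sketch below uses hypothetical random data (NumPy assumed); it forms the full error matrix $E_k = \tilde T_k Z_k - Z_{k\oplus 1} \tilde T_k$ and compares its blocks with the right-hand sides of (6.7):

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, K = 2, 2, 3          # hypothetical block sizes and period
n = p1 + p2

# Upper (block) triangular reordered matrices T_k and skew-symmetric Z_k
# (Z_k = Q_k^T dQ_k is skew up to first order; (6.7) is exact for any Z_k).
T = [np.triu(rng.standard_normal((n, n))) for _ in range(K)]
Z = []
for _ in range(K):
    W = rng.standard_normal((n, n))
    Z.append(W - W.T)

for k in range(K):
    kp = (k + 1) % K                       # k (+) 1
    E = T[k] @ Z[k] - Z[kp] @ T[k]         # full error matrix
    T11, T12, T22 = T[k][:p1, :p1], T[k][:p1, p1:], T[k][p1:, p1:]
    Z11, Z12 = Z[k][:p1, :p1], Z[k][:p1, p1:]
    Z21, Z22 = Z[k][p1:, :p1], Z[k][p1:, p1:]
    W11, W12 = Z[kp][:p1, :p1], Z[kp][:p1, p1:]
    W21, W22 = Z[kp][p1:, :p1], Z[kp][p1:, p1:]
    # right-hand sides of (6.7)
    E11 = T11 @ Z11 + T12 @ Z21 - W11 @ T11
    E12 = T11 @ Z12 + T12 @ Z22 - W11 @ T12 - W12 @ T22
    E21 = T22 @ Z21 - W21 @ T11
    E22 = T22 @ Z22 - W22 @ T22 - W21 @ T12
    assert np.allclose(E[:p1, :p1], E11)
    assert np.allclose(E[:p1, p1:], E12)
    assert np.allclose(E[p1:, :p1], E21)
    assert np.allclose(E[p1:, p1:], E22)
```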

By assuming that $\Delta X_k$, $k = 0, 1, \ldots, K-1$, are nonsingular and applying the analysis of the QR-factorization from [2] to each of our $K$ independent QR-factorizations, we get

\[
Z_{11}^{(k)} = Q_{11}^{(k)T} \Delta X_k R_k^{-1} - \Delta R_k R_k^{-1}, \tag{6.8}
\]
\[
Z_{21}^{(k)} = Q_{12}^{(k)T} \Delta X_k R_k^{-1}, \tag{6.9}
\]
\[
Z_{22}^{(k)} = -Q_{12}^{(k)T} \Delta X_k R_k^{-1} Q_{11}^{(k)T} Q_{12}^{(k)-T}. \tag{6.10}
\]

Using (6.8), (6.9), (6.10), (4.6), and (6.1), the error matrix blocks $E_{11}^{(k)}$, $E_{21}^{(k)}$, and $E_{22}^{(k)}$ in (6.7) boil down to

\[
\begin{cases}
E_{11}^{(k)} = Q_{11}^{(k\oplus 1)T} Y_k R_k^{-1} - T_{11}^{(k)} \Delta R_k R_k^{-1} + \Delta R_{k\oplus 1} R_{k\oplus 1}^{-1} T_{11}^{(k)}, \\
E_{21}^{(k)} = Q_{12}^{(k\oplus 1)T} Y_k R_k^{-1}, \\
E_{22}^{(k)} = -Q_{12}^{(k\oplus 1)T} Y_k R_k^{-1} Q_{11}^{(k)T} Q_{12}^{(k)-T}
\end{cases} \tag{6.11}
\]

as first order results. We see that $E_{22}^{(k)}$, $E_{21}^{(k)}$, and $E_{11}^{(k)}$ are essentially related to the $K$ residual matrices $Y_k$ of the associated PSE and the blocks $R_k$, $Q_{11}^{(k)}$, and $Q_{12}^{(k)}$ from the $K$ QR-factorizations. From (4.5) we have that

\[
Q_{21}^{(k)} = R_k^{-1}, \qquad R_k^T R_k = I + X_k^T X_k,
\]

which gives

\[
\sigma^2(R_k) = \lambda(R_k^T R_k) = \lambda(I + X_k^T X_k) = 1 + \lambda(X_k^T X_k) = 1 + \sigma^2(X_k).
\]


294 ROBERT GRANAT AND BO KAGSTROM

By the above argument we get

\[
\|Q_{21}^{(k)}\|_2 = \|R_k^{-1}\|_2 = \frac{1}{\sigma_{\min}(R_k)} = \frac{1}{(1+\sigma_{\min}^2(X_k))^{1/2}}.
\]

Further, from [24] we have

\[
\|\Delta R_k R_k^{-1}\|_F \le 2 \|D_k^{+}\|_2 \|\Delta X_k\|_F,
\]

and by the CS decomposition of $Q$ (see, e.g., [10, 25]) we get the following norm relations:

\[
\|Q_{21}^{(k)}\|_2 = \|Q_{12}^{(k)}\|_2, \qquad \|Q_{11}^{(k)}\|_2 = \|Q_{22}^{(k)}\|_2.
\]

Now, by combining these facts with (6.11) and applying the product and triangle inequalities for norms, we obtain the bounds of the theorem.
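The norm relations used in this last step can be reproduced numerically from a single QR factorization. A minimal sketch (NumPy assumed; $X$ is a hypothetical $2 \times 2$ generator matrix, so $p_1 = p_2$):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 2                                    # p1 = p2 = 2
X = rng.standard_normal((p, p))          # hypothetical generator matrix

# Full QR factorization of [X; I] underlying the swapping transformation
M = np.vstack([X, np.eye(p)])
Q, R = np.linalg.qr(M, mode="complete")
Rk = R[:p, :p]
Q11, Q12 = Q[:p, :p], Q[:p, p:]
Q21, Q22 = Q[p:, :p], Q[p:, p:]

sX = np.linalg.svd(X, compute_uv=False)  # singular values, descending
sR = np.linalg.svd(Rk, compute_uv=False)

assert np.allclose(Q21 @ Rk, np.eye(p))            # Q21 = Rk^{-1}
assert np.allclose(Q11, X @ Q21)                   # Q11 = X Q21 (weak test quantity)
assert np.allclose(sR**2, 1.0 + sX**2)             # sigma^2(Rk) = 1 + sigma^2(X)
assert np.isclose(np.linalg.norm(Q21, 2),
                  1.0 / np.sqrt(1.0 + sX[-1]**2))  # ||Q21||_2 = (1+sigma_min^2)^{-1/2}
# CS-decomposition norm relations for square blocks
assert np.isclose(np.linalg.norm(Q21, 2), np.linalg.norm(Q12, 2))
assert np.isclose(np.linalg.norm(Q11, 2), np.linalg.norm(Q22, 2))
```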

Remark 1. For $K = 1$, using the inequality $(1+\sigma_{\min}^2(X))^{-1/2} \ge (1+\sigma_{\max}^2(X))^{-1/2}$, the norm bounds of Theorem 6.1 can be further bounded from above to give

\[
\|E_{11}\|_2 \le \frac{\sigma_{\max}(X)}{1+\sigma_{\min}^2(X)} \|Y\|_F + 4 \|T_{11}\|_2 \|D^{+}\|_2 \|\Delta X\|_F, \tag{6.12}
\]
\[
\|E_{21}\|_2 \le \frac{1}{1+\sigma_{\min}^2(X)} \|Y\|_F, \tag{6.13}
\]
\[
\|E_{22}\|_2 \le \frac{\sigma_{\max}(X)}{1+\sigma_{\min}^2(X)} \|Y\|_F, \tag{6.14}
\]

which are the norm bounds from the main theorem of [2] on the perturbation of the eigenvalues under standard eigenvalue reordering in the real Schur form.

Remark 2. Numerical experiments show that iterative refinement may improve on the computed solution $X_k$, especially for badly scaled problems, but may not improve on the residual sequence $Y_k$ or on the computed eigenvalues. See also [2] for a similar observation.

6.2. Perturbation of matrix products under periodic reordering. In this section, we investigate how the errors in the individual matrices after a periodic reordering of two adjacent sequences of diagonal blocks in $T_k$ propagate into the matrix product $\Phi_T(K, 0) = T_{K-1} T_{K-2} \cdots T_1 T_0$.

We present a general result in the following theorem.

Theorem 6.2. Let $T_k$ be a matrix sequence in PRSF with periodicity $K$ and partitioned as

\[
T_k = \begin{bmatrix} T_{11}^{(k)} & T_{12}^{(k)} \\ 0 & T_{22}^{(k)} \end{bmatrix}.
\]

Let the sequence $\hat Q_k$, $k = 0, 1, \ldots, K-1$, be the computed orthogonal cyclic transformation matrices defining the periodic eigenvalue reordering of the product $\Phi_T(K, 0)$ as in (6.2). In addition, let the sequences $\tilde T_k$, $\hat T_k$, and $E_k$ be defined as in (6.2)–(6.3) of Theorem 6.1. Then we have

\[
\hat\Phi_T(K, 0) = \prod_{k=K-1}^{0} \hat Q_{k\oplus 1}^T T_k \hat Q_k = \tilde\Phi_T(K, 0) + E, \tag{6.15}
\]

where $\tilde\Phi_T(K, 0) = Q_0^T \Phi_T(K, 0) Q_0$ is the exact product of the reordered matrices and $E$ is the corresponding error matrix. Assuming that $E$ is partitioned conformally with $T_k$, we have the bounds

\[
\begin{cases}
\|E_{11}\|_2 \le \sum_{k=0}^{K-1} \Big( \big( \prod_{j=K-1}^{k+1} \|T_{11}^{(j)}\|_2 \big) \|E_{11}^{(k)}\|_2 + \big( \sum_{j=K-1}^{k+1} \|\varphi_1^{(k,j)}\|_2 \big) \|E_{21}^{(k)}\|_2 \Big) \prod_{j=k-1}^{0} \|T_{11}^{(j)}\|_2, \\[1ex]
\|E_{21}\|_2 \le \sum_{k=0}^{K-1} \big( \prod_{j=K-1}^{k+1} \|T_{22}^{(j)}\|_2 \big) \|E_{21}^{(k)}\|_2 \prod_{j=k-1}^{0} \|T_{11}^{(j)}\|_2, \\[1ex]
\|E_{22}\|_2 \le \sum_{k=0}^{K-1} \big( \prod_{j=K-1}^{k+1} \|T_{22}^{(j)}\|_2 \big) \Big( \|E_{21}^{(k)}\|_2 \sum_{j=k-1}^{0} \|\varphi_2^{(k,j)}\|_2 + \|E_{22}^{(k)}\|_2 \prod_{j=k-1}^{0} \|T_{22}^{(j)}\|_2 \Big),
\end{cases} \tag{6.16}
\]

where

\[
\|\varphi_1^{(k,j)}\|_2 \le \|T_{12}^{(j)}\|_2 \prod_{l=K-1}^{j+1} \|T_{11}^{(l)}\|_2 \prod_{l=j-1}^{k+1} \|T_{22}^{(l)}\|_2, \tag{6.17}
\]
\[
\|\varphi_2^{(k,j)}\|_2 \le \|T_{12}^{(j)}\|_2 \prod_{l=k-1}^{j+1} \|T_{11}^{(l)}\|_2 \prod_{l=j-1}^{0} \|T_{22}^{(l)}\|_2 \tag{6.18}
\]

up to first order perturbations.

Proof. Up to first order perturbations, we have

\[
\hat\Phi_T(K, 0) = \prod_{k=K-1}^{0} \hat Q_{k\oplus 1}^T T_k \hat Q_k
= \tilde\Phi_T(K, 0) + \sum_{k=0}^{K-1} \tilde\Phi_T(K, k+1)\, E_k\, \tilde\Phi_T(k, 0)
= \tilde\Phi_T(K, 0) + E. \tag{6.19}
\]

The bounds follow by applying the triangle inequality and the submultiplicativity of norms to the error matrix $E$ in block partitioned form. For details, see [11].

For illustration, we display the explicit results of Theorem 6.2 for two simple cases in the following corollary.

Corollary 6.3. Under the assumptions of Theorem 6.2 and the periodicity $K = 2$, norm bounds for the blocks of the error matrix $E$ (6.15) can, up to first order perturbations, be expressed as

\[
\begin{aligned}
\|E_{11}\|_2 &\le \|T_{11}^{(1)}\|_2 \|E_{11}^{(0)}\|_2 + \|T_{12}^{(1)}\|_2 \|E_{21}^{(0)}\|_2 + \|T_{11}^{(0)}\|_2 \|E_{11}^{(1)}\|_2, \\
\|E_{21}\|_2 &\le \|T_{22}^{(1)}\|_2 \|E_{21}^{(0)}\|_2 + \|T_{11}^{(0)}\|_2 \|E_{21}^{(1)}\|_2, \\
\|E_{22}\|_2 &\le \|T_{22}^{(1)}\|_2 \|E_{22}^{(0)}\|_2 + \|T_{12}^{(0)}\|_2 \|E_{21}^{(1)}\|_2 + \|T_{22}^{(0)}\|_2 \|E_{22}^{(1)}\|_2.
\end{aligned}
\]

For periodicity $K = 3$, we have the bounds

\[
\begin{aligned}
\|E_{11}\|_2 \le{}& \|T_{11}^{(2)}\|_2 \|T_{11}^{(1)}\|_2 \|E_{11}^{(0)}\|_2
+ \big( \|T_{11}^{(2)}\|_2 \|T_{12}^{(1)}\|_2 + \|T_{12}^{(2)}\|_2 \|T_{22}^{(1)}\|_2 \big) \|E_{21}^{(0)}\|_2 \\
&+ \|T_{11}^{(2)}\|_2 \|T_{11}^{(0)}\|_2 \|E_{11}^{(1)}\|_2
+ \|T_{12}^{(2)}\|_2 \|T_{11}^{(0)}\|_2 \|E_{21}^{(1)}\|_2
+ \|T_{11}^{(1)}\|_2 \|T_{11}^{(0)}\|_2 \|E_{11}^{(2)}\|_2, \\
\|E_{21}\|_2 \le{}& \|T_{22}^{(2)}\|_2 \|T_{22}^{(1)}\|_2 \|E_{21}^{(0)}\|_2
+ \|T_{22}^{(2)}\|_2 \|T_{11}^{(0)}\|_2 \|E_{21}^{(1)}\|_2
+ \|T_{11}^{(1)}\|_2 \|T_{11}^{(0)}\|_2 \|E_{21}^{(2)}\|_2, \\
\|E_{22}\|_2 \le{}& \|T_{22}^{(2)}\|_2 \|T_{22}^{(1)}\|_2 \|E_{22}^{(0)}\|_2
+ \|T_{22}^{(2)}\|_2 \|T_{12}^{(0)}\|_2 \|E_{21}^{(1)}\|_2
+ \|T_{22}^{(2)}\|_2 \|T_{22}^{(0)}\|_2 \|E_{22}^{(1)}\|_2 \\
&+ \big( \|T_{11}^{(1)}\|_2 \|T_{12}^{(0)}\|_2 + \|T_{12}^{(1)}\|_2 \|T_{22}^{(0)}\|_2 \big) \|E_{21}^{(2)}\|_2
+ \|T_{22}^{(1)}\|_2 \|T_{22}^{(0)}\|_2 \|E_{22}^{(2)}\|_2
\end{aligned}
\]


up to first order perturbations.

We remark that the analysis in Theorem 6.2 and Corollary 6.3 assumes that the involved matrix products and sums are computed exactly. For a rounding error analysis regarding matrix products and sums, see, e.g., [15].

Theorems 6.1 and 6.2 can be combined to produce computable bounds for the perturbations of the diagonal blocks of $\Phi_T(K, 0)$ under periodic eigenvalue reordering. We can also apply known perturbation results for the standard eigenvalue problem [25] and the periodic eigenvalue problem [20, 4] to the submatrix products $\Phi_T(K, 0)_{11}$ and $\Phi_T(K, 0)_{22}$. This is a matter for further investigation.

7. Computational experiments. We demonstrate the stability and reliability of the direct reordering method by considering some numerical examples. The test examples range from well-conditioned to ill-conditioned problems, including matrix sequences with fixed and time-varying dimensions, and sequences of small and large periodicity. In the following, we present results for a representative selection of problems where, except for one example, two complex conjugate eigenvalue pairs of a periodic real sequence $A_k$ are reordered ($p_1 = p_2 = 2$). The associated PSEs of our direct periodic reordering method are solved by applying GEPP to $Z_{\mathrm{PSE}} x = c$, utilizing the structure of $Z_{\mathrm{PSE}}$ in (5.1). All experiments are carried out in double precision ($\varepsilon_{\mathrm{mach}} \approx 2.2 \times 10^{-16}$) on an UltraSparc II (450 MHz) workstation.

Examples 1 and 3 below are constructed as follows. First, we specify $K$, $n_k$, $k = 0, 1, \ldots, K-1$, and $\min_k(n_k)$ eigenvalues, or $K \cdot \min_k(n_k)$ diagonal and $\min_k(n_k) - 1$ subdiagonal elements. Then a random sequence $T_k$ as in (1.2) is generated with $1 \times 1$ and $2 \times 2$ diagonal blocks corresponding to the specified eigenvalues or diagonal, subdiagonal, and superdiagonal entries. Finally, orthogonal matrices $Z_k$, $k = 0, 1, \ldots, K-1$, are constructed by QR-factorizing $K$ uniformly distributed random $n_k \times n_k$ matrices, and these are applied in a $K$-cyclic orthogonal transformation of $T_k$ to get $A_k$. Examples 4 and 5 illustrate reordering of two periodic sequences already in PRSF. Finally, Example 2 is from a real application.
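This construction is straightforward to reproduce. A simplified sketch (NumPy assumed; hypothetical eigenvalues; $1 \times 1$ diagonal blocks only, i.e., real eigenvalues) generates a random triangular sequence $T_k$ with prescribed eigenvalues and applies a random $K$-cyclic orthogonal transformation to obtain $A_k$:

```python
import numpy as np

rng = np.random.default_rng(2)
K, n = 3, 4
eigs = np.array([2.0, -0.5, 0.25, 10.0])   # hypothetical target eigenvalues

# Random upper triangular T_k; the specified eigenvalues are placed on the
# diagonal of T_0 and ones elsewhere, so the product of the i-th diagonal
# entries over the period equals the i-th eigenvalue of Phi_T(K, 0).
T = []
for k in range(K):
    Tk = np.triu(rng.standard_normal((n, n)), 1)
    Tk += np.diag(eigs if k == 0 else np.ones(n))
    T.append(Tk)

# Random K-cyclic orthogonal transformation: A_k = Z_{k(+)1} T_k Z_k^T
Z = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(K)]
A = [Z[(k + 1) % K] @ T[k] @ Z[k].T for k in range(K)]

# The eigenvalues of Phi_A(K, 0) = A_{K-1} ... A_1 A_0 are the specified ones
Phi = np.eye(n)
for k in range(K):
    Phi = A[k] @ Phi
lam = np.sort(np.linalg.eigvals(Phi).real)
assert np.allclose(lam, np.sort(eigs))
```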

In Table 7.1, we display the periodicity $K$, the problem dimensions $n_k$ for $k = 0, 1, \ldots, K-1$, the computed value of sep[PSE], and a reciprocal condition number $s$ for the eigenvalues of $\Phi_T(K, 0)_{11}$,

\[
s = 1 / \sqrt{1 + \|X_0\|_F^2},
\]

where $X_0$ is the generator matrix for the periodic reordering of $\Phi_T(K, 0)$ (see Section 4). The last two quantities signal the conditioning of the problems considered.

Results from periodic reordering using our direct method are presented in Table 7.2. We display the maximum relative change of the eigenvalues under the periodic reordering,

\[
e_\lambda = \max_k \frac{|\lambda_k - \tilde\lambda_k|}{|\lambda_k|}, \qquad \lambda_k \in \lambda(\Phi_T(K, 0)).
\]

In addition, we display five residual quantities for the computed results. These include two stability tests used in our method, namely a weak stability test,

\[
R_{\mathrm{weak}} = \max_k \|Q_{11}^{(k)} - X_k Q_{21}^{(k)}\|_F,
\]

and a strong stability test,

\[
R_{\mathrm{strong}} = \max_k \big( \|T_k - \hat Q_{k\oplus 1} \tilde T_k \hat Q_k^T\|_F,\; \|\tilde T_k - \hat Q_{k\oplus 1}^T T_k \hat Q_k\|_F \big),
\]


Table 7.1
Problem characteristics for the examples considered. 4a and 4b refer to Example 4 with period 2 and 100, respectively.

Example     K    n_k    sep[PSE]    s
1           3    4+k    6.9E-01     7.2E-01
2         120    4      4.7E-03     5.5E-01
3          10    2      9.9E+00     1.0E+00
4a          2    4      4.5E-15     1.1E-14
4b        100    4      1.3E-16     1.3E-16
5           2    4      6.2E+03     6.6E-01

Table 7.2
Computational results for periodic reordering. 4a and 4b refer to Example 4 with period K = 2 and 100, respectively. 5a and 5b refer to Example 5 without scaling and with scaling.

Example    e_lambda    R_weak     R_strong    R_eprsf    R_reord    R_orth
1          4.6E-16     2.2E-16    1.6E-15     4.7E-15    5.6E-15    1.3E+01
2          1.6E-15     2.9E-16    1.8E-15     9.0E-15    9.8E-15    2.0E+01
3          1.4E-15     1.9E-16    8.4E-15     7.3E-15    1.0E-14    4.1E+00
4a         3.6E-16     2.5E-16    1.4E-15     0          1.2E-15    2.1E+00
4b         3.7E-16     2.3E-16    3.2E-18     0          1.9E-15    3.6E+00
5a         2.2E-01     1.2E-16    6.6E-12     0          5.8E-12    3.3E+00
5b         2.0E-09     2.3E-16    4.3E-12     0          5.6E-12    3.3E+00

which is the maximum residual norm associated with the cyclic transformations $\hat Q_k$ used in the reordering. Tolerances for these tests can optionally be specified. Depending on the outcome of our stability test (weak or strong), we either reject the swap or perform a swapping with guaranteed backward stability. Rejecting a swap means that we avoid the risk that errors induced during the reordering computations change the eigenvalues drastically. It is the sensitivity of the associated eigenspaces that matters most (see [18]). Since the extra cost for the strong stability test is marginal, it is recommended. The last three columns in Table 7.2 display the maximum residual norms of the (extended) periodic Schur decomposition (1.2) before and after reordering, computed as

\[
R_{\mathrm{eprsf}} = \max_k \big( \|A_k - Z_{k\oplus 1} T_k Z_k^T\|_F,\; \|T_k - Z_{k\oplus 1}^T A_k Z_k\|_F \big)
\]

and

\[
R_{\mathrm{reord}} = \max_k \big( \|A_k - \tilde Z_{k\oplus 1} \tilde T_k \tilde Z_k^T\|_F,\; \|\tilde T_k - \tilde Z_{k\oplus 1}^T A_k \tilde Z_k\|_F \big),
\]

and a relative orthogonality check over the whole period $K$ after the periodic reordering:

\[
R_{\mathrm{orth}} = \frac{\max_k \big( \|I_{n_k} - \tilde Z_k^T \tilde Z_k\|_F,\; \|I_{n_k} - \tilde Z_k \tilde Z_k^T\|_F \big)}{\varepsilon_{\mathrm{mach}}}.
\]

For these three residual norms, the $K$-cyclic transformations $Z_k$ and $\tilde Z_k$ correspond to the transformations $Z_k$ and $\tilde Z_k$ in (3.3), respectively.
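These residual norms are cheap to evaluate. A minimal sketch (NumPy assumed) builds a synthetic periodic Schur decomposition, so that $A_k = Z_{k\oplus 1} T_k Z_k^T$ holds by construction, and evaluates $R_{\mathrm{eprsf}}$ and $R_{\mathrm{orth}}$; for a genuinely computed decomposition the same formulas apply unchanged:

```python
import numpy as np

eps = np.finfo(float).eps
rng = np.random.default_rng(3)
K, n = 4, 5

# Synthetic decomposition: triangular T_k, orthogonal Z_k, A_k defined from them
T = [np.triu(rng.standard_normal((n, n))) for _ in range(K)]
Z = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(K)]
A = [Z[(k + 1) % K] @ T[k] @ Z[k].T for k in range(K)]

R_eprsf = max(
    max(np.linalg.norm(A[k] - Z[(k + 1) % K] @ T[k] @ Z[k].T),
        np.linalg.norm(T[k] - Z[(k + 1) % K].T @ A[k] @ Z[k]))
    for k in range(K))
R_orth = max(
    max(np.linalg.norm(np.eye(n) - Z[k].T @ Z[k]),
        np.linalg.norm(np.eye(n) - Z[k] @ Z[k].T))
    for k in range(K)) / eps

assert R_eprsf < 1e-12   # residuals at the rounding level
assert R_orth < 100      # orthogonality within a modest multiple of eps_mach
```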

The computed eigenvalues before and after the periodic reordering are presented to full machine accuracy under each example.

Example 1. We consider a time-varying sequence with $K = 3$ and $n_k = 4+k$, $k = 0, 1, 2$, and eigenvalues $1.0 \pm 2.0i$, $-7.0 \pm 5.0i$. The computed eigenvalues of the matrix product $\Phi_T(K, 0) = T_2 T_1 T_0$ are

λ_1 = 1.000000000000000 ± 2.000000000000000i,
λ_2 = -7.000000000000001 ± 5.000000000000001i.


The spectrum is well separated. After the periodic reordering of the blocks, the computed eigenvalues were exchanged (the new λ_1 equals the old λ_2 and vice versa) to full accuracy.

Example 2 (satellite control [29]). We consider reordering in a $4 \times 4$ periodic matrix sequence that describes a control system of a satellite in orbit around the Earth. The periodicity is $K = 120$. The computed eigenvalues of the sequence are

λ_1 = 0.9941836588706161 ± 0.1076979685723037i,
λ_2 = 0.7625695885261465 ± 0.6469061930874623i.

The reordered eigenvalues are

λ_1 = 0.7625695885261450 ± 0.6469061930874582i,
λ_2 = 0.9941836588706161 ± 0.1076979685723021i.

This application example shows that periodic reordering also works well for well-conditioned problems with large periods.

Example 3. We consider reordering a sequence with $K = 10$ and $p_1 = p_2 = 1$, where the computed sequence in PRSF is

\[
T_k = \begin{bmatrix} 10^{1} & t_{12}^{(k)} \\ 0 & 10^{-1} \end{bmatrix}, \qquad k = 0, 1, \ldots, K-1,
\]

with $|t_{12}^{(k)}| \le 1$. The computed eigenvalues of the product $\Phi_T(K, 0)$ are

λ_1 = 9.999999999999987 × 10^9,
λ_2 = 1.000000000000013 × 10^-10.

After the periodic reordering we obtain

λ_1 = 1.000000000000015 × 10^-10,
λ_2 = 9.999999999999989 × 10^9.

Reordering of $1 \times 1$ blocks in PRSF can also be carried out by propagating a Givens rotation through the matrix product [5], but this process is not forward stable. For this example, the rotation approach does not deliver a single correct digit in the reordered eigenvalues, whereas the direct reordering method delivers an acceptable error in the eigenvalues.

Example 4. We consider a sequence already in PRSF with $K = 2$ and $n_k = 4$, $k = 0, 1$, and eigenvalues $0.2 \pm (1.2 + 10^{-14})i$, $0.2 \pm 1.2i$. The computed eigenvalues of the matrix product $\Phi_T(K, 0) = T_1 T_0$ are

λ_1 = 0.200000000000000 ± 1.200000000000001i,
λ_2 = 0.200000000000000 ± 1.200000000000000i.

The spectrum is not well separated. After the periodic reordering we obtained

λ_1 = 0.200000000000000 ± 1.200000000000000i,
λ_2 = 0.200000000000000 ± 1.200000000000001i,

so the periodic reordering was perfect, even though the problem has very close eigenvalues. Indeed, we obtain reordered eigenvalues to full machine accuracy for periods up to 100.


Example 5. First, we consider a problem already in PRSF with large separation and $K = 2$, $n_k = 4$, $k = 0, 1$, and the eigenvalues $\varepsilon_{\mathrm{mach}}^{1/2} \pm \varepsilon_{\mathrm{mach}}^{1/2} i$, $\varepsilon_{\mathrm{mach}}^{-1/2} \pm \varepsilon_{\mathrm{mach}}^{-1/2} i$. Moreover, the involved matrices have almost the same Frobenius norm ($\approx 1.8 \times 10^4$), but the matrices in the subsequences $T_{11}^{(k)}$ and $T_{22}^{(k)}$ have very different norms: $\|T_{11}^{(0)}\|_F \approx 1.4 \times 10^4$, $\|T_{11}^{(1)}\|_F \approx 1.4 \times 10^4$, $\|T_{22}^{(0)}\|_F \approx 7.0 \times 10^{-12}$, $\|T_{22}^{(1)}\|_F \approx 8.6 \times 10^3$. The computed eigenvalues of the product $\Phi_T(K, 0)$ are

λ_1 = 6.710886400000000 × 10^7 ± 6.710886400000003 × 10^7 i,
λ_2 = 1.490116119384766 × 10^-8 ± 1.490116119384766 × 10^-8 i.

After the periodic reordering without diagonal scaling we obtain

λ_1 = 1.168840447839719 × 10^-8 ± 9.309493732240201 × 10^-9 i,
λ_2 = 6.710886400000001 × 10^7 ± 6.710886400000000 × 10^7 i.

The problem is well-conditioned in the sense of sep[PSE] and the norm of the generator matrix (see $s$ in Table 7.1), and the reordering passes the stability tests, but since the eigenvalues differ by almost 16 orders of magnitude, the relative error in the smallest eigenvalues becomes very large due to the finite precision arithmetic.

Next, we consider the same problem as above, but now we perform a diagonal scaling $T_1 T_0 = (T_1 D_1)(D_1^{-1} T_0)$ before the periodic reordering such that the blocks $T_{22}^{(0)}$ and $T_{22}^{(1)}$ have about the same norm. Now the periodic reordering gives

λ_1 = 1.490116120748016 × 10^-8 ± 1.490116125160257 × 10^-8 i,
λ_2 = 6.710886400000000 × 10^7 ± 6.710886400000001 × 10^7 i,

which is quite an improvement (8 orders of magnitude) compared to the results without scaling. Not surprisingly, periodic reordering is sensitive to large differences in the norms within the subsequences $T_{11}^{(k)}$ and $T_{22}^{(k)}$.
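The diagonal scaling used here relies on the fact that, for $K = 2$, any nonsingular diagonal $D_1$ leaves the product, and hence its eigenvalues, unchanged: $T_1 T_0 = (T_1 D_1)(D_1^{-1} T_0)$, while the norms of the individual factors (and of their diagonal blocks) can be rebalanced. A minimal sketch of this identity (NumPy assumed; hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
T0 = np.triu(rng.standard_normal((n, n)))
T1 = np.triu(rng.standard_normal((n, n)))

# A (badly) scaled diagonal transformation; the product is invariant:
# T1 T0 = (T1 D1) (D1^{-1} T0)
D1 = np.diag(10.0 ** rng.integers(-6, 7, size=n).astype(float))
S1 = T1 @ D1
S0 = np.diag(1.0 / np.diag(D1)) @ T0

assert np.allclose(T1 @ T0, S1 @ S0)
```

Choosing $D_1$ to equalize the norms of the relevant diagonal blocks is what improves the accuracy of the smallest reordered eigenvalues in this example.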

8. Future work. Next, we will focus on computing periodic eigenspaces with specified eigenvalues and associated error bounds based on condition estimation (see, e.g., [18]), as well as on producing library-standard (LAPACK [1], SLICOT [21]) software for the eigenvalue reordering algorithm presented in this paper.

Acknowledgments. The authors are grateful to Daniel Kressner for constructive comments on the subject and earlier versions of this manuscript, and to Andras Varga for valuable comments on the subject and for providing us with software for computing the extended periodic Schur decomposition and data for Example 2.

REFERENCES

[1] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. W. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed., SIAM, Philadelphia, 1999.
[2] Z. Bai and J. W. Demmel, On swapping diagonal blocks in real Schur form, Linear Algebra Appl., 186 (1993), pp. 73–95.
[3] R. H. Bartels and G. W. Stewart, Algorithm 432: Solution of the matrix equation AX + XB = C, Comm. ACM, 15 (1972), pp. 820–826.
[4] P. Benner, V. Mehrmann, and H. Xu, Perturbation analysis for the eigenvalue problem of a formal product of matrices, BIT, 42 (2002), pp. 1–43.
[5] A. Bojanczyk, G. H. Golub, and P. Van Dooren, The periodic Schur decomposition: Algorithm and applications, in Advanced Signal Processing Algorithms, Architectures, and Implementations III, Proc. SPIE Conference 1770, SPIE, Bellingham, WA, 1992, pp. 31–42.
[6] A. Bojanczyk and P. Van Dooren, On propagating orthogonal transformations in a product of 2 × 2 triangular matrices, in Numerical Linear Algebra, de Gruyter, New York, 1993, pp. 1–9.
[7] A. Bojanczyk and P. Van Dooren, Reordering diagonal blocks in the real Schur form, in Linear Algebra for Large Scale and Real-Time Applications, M. S. Moonen, G. H. Golub, and B. L. R. De Moor, eds., Kluwer Academic Publishers, Amsterdam, 1993, pp. 351–352.
[8] J. J. Dongarra, S. Hammarling, and J. H. Wilkinson, Numerical considerations in computing invariant subspaces, SIAM J. Matrix Anal. Appl., 13 (1992), pp. 145–161.
[9] G. Fairweather and I. Gladwell, Algorithms for almost block diagonal linear systems, SIAM Rev., 46 (2004), pp. 49–58.
[10] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., The Johns Hopkins University Press, Baltimore, MD, 1996.
[11] R. Granat and B. Kagstrom, Direct Eigenvalue Reordering in a Product of Matrices in Extended Periodic Real Schur Form, Report UMINF 05.05, Umea University, Umea, Sweden, 2005.
[12] W. W. Hager, Condition estimates, SIAM J. Sci. Statist. Comput., 5 (1984), pp. 311–316.
[13] J. J. Hench and A. J. Laub, Numerical solution of the discrete-time periodic Riccati equation, IEEE Trans. Automat. Control, 39 (1994), pp. 1197–1210.
[14] N. J. Higham, Fortran codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation, ACM Trans. Math. Software, 14 (1988), pp. 381–396.
[15] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, Philadelphia, 2002.
[16] B. Kagstrom, A direct method for reordering eigenvalues in the generalized real Schur form of a regular matrix pair (A, B), in Linear Algebra for Large Scale and Real-Time Applications, M. S. Moonen, G. H. Golub, and B. L. R. De Moor, eds., Kluwer Academic Publishers, Amsterdam, 1993, pp. 195–218.
[17] B. Kagstrom and P. Poromaa, Distributed and shared memory block algorithms for the triangular Sylvester equation with sep^{-1} estimators, SIAM J. Matrix Anal. Appl., 13 (1992), pp. 90–101.
[18] B. Kagstrom and P. Poromaa, Computing eigenspaces with specified eigenvalues of a regular matrix pair (A, B) and condition estimation: Theory, algorithms, and software, Numer. Algorithms, 12 (1996), pp. 369–407.
[19] D. Kressner, Numerical Methods and Software for General and Structured Eigenvalue Problems, Ph.D. thesis, TU Berlin, Institut fur Mathematik, Berlin, Germany, 2004.
[20] W.-W. Lin and J.-G. Sun, Perturbation analysis for the eigenproblem of periodic matrix pairs, Linear Algebra Appl., 337 (2001), pp. 157–187.
[21] SLICOT Library, The Numerics in Control Network (Niconet), http://www.win.tue.nl/niconet/index.html.
[22] J. Sreedhar and P. Van Dooren, A Schur approach for solving some periodic matrix equations, in Systems and Networks: Mathematical Theory and Applications, U. Helmke, R. Mennicken, and J. Saurer, eds., Akademie Verlag, Berlin, 1994, pp. 339–362.
[23] G. W. Stewart, Algorithm 506: HQR3 and EXCHNG: FORTRAN programs for calculating the eigenvalues of a real upper Hessenberg matrix in a prescribed order, ACM Trans. Math. Software, 2 (1976), pp. 275–280.
[24] G. W. Stewart, Perturbation bounds for the QR factorization of a matrix, SIAM J. Numer. Anal., 14 (1977), pp. 509–518.
[25] G. W. Stewart and J.-G. Sun, Matrix Perturbation Theory, Academic Press, New York, 1990.
[26] P. Van Dooren, Algorithm 590: DSUBSP and EXCHQZ: Fortran subroutines for computing deflating subspaces with specified spectrum, ACM Trans. Math. Software, 8 (1982), pp. 376–382.
[27] A. Varga, Periodic Lyapunov equations: Some applications and new algorithms, Internat. J. Control, 67 (1997), pp. 69–87.
[28] A. Varga, Balancing related methods for minimal realization of periodic systems, Systems Control Lett., 36 (1999), pp. 339–349.
[29] A. Varga and S. Pieters, Gradient-based approach to solve optimal periodic output feedback control problems, Automatica, 34 (1998), pp. 477–481.
[30] A. Varga and P. Van Dooren, Computational methods for periodic systems - an overview, in Proceedings of the IFAC Workshop on Periodic Control Systems, Como, Italy, 2001, pp. 171–176.
[31] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, UK, 1965.
[32] S. J. Wright, A collection of problems for which Gaussian elimination with partial pivoting is unstable, SIAM J. Sci. Comput., 14 (1993), pp. 231–238.


II


Paper II

Computing Periodic Deflating Subspaces Associated with a Specified Set of Eigenvalues∗

Robert Granat1, Bo Kagstrom1, and Daniel Kressner2

1 Department of Computing Science and HPC2N, Umea University, SE-901 87 Umea, Sweden. granat, [email protected]

2 Department of Mathematics, University of Zagreb, [email protected]

Abstract: We present a direct method for reordering eigenvalues in the generalized periodic real Schur form of a regular K-cyclic matrix pair sequence (Ak, Ek). Following and generalizing existing approaches, reordering consists of consecutively computing the solution to an associated Sylvester-like equation and constructing K pairs of orthogonal matrices. These pairs define an orthogonal K-cyclic equivalence transformation that swaps adjacent diagonal blocks in the Schur form. An error analysis of this swapping procedure is presented, which extends existing results for reordering eigenvalues in the generalized real Schur form of a regular pair (A, E). Our direct reordering method is used to compute periodic deflating subspace pairs corresponding to a specified set of eigenvalues. This computational task arises in various applications related to discrete-time periodic descriptor systems. Computational experiments confirm the stability and reliability of the presented eigenvalue reordering method.

Key words: Generalized product of a K-cyclic matrix pair sequence, generalized periodic real Schur form, eigenvalue reordering, periodic generalized coupled Sylvester equation, K-cyclic equivalence transformation, generalized periodic eigenvalue problem.

∗ Reprinted by permission of Springer Netherlands.





BIT Numerical Mathematics (2007)

© Springer 2007. DOI: 10.1007/s10543-007-0143-y

COMPUTING PERIODIC DEFLATING SUBSPACES ASSOCIATED WITH A SPECIFIED SET OF EIGENVALUES

R. GRANAT1, B. KAGSTROM1 and D. KRESSNER1,2

1Department of Computing Science and HPC2N, Umea University,SE-901 87 UMEA, Sweden. email: granat, bokg, [email protected]

2Department of Mathematics, University of Zagreb, Croatia

Abstract.

We present a direct method for reordering eigenvalues in the generalized periodic real Schur form of a regular K-cyclic matrix pair sequence (Ak, Ek). Following and generalizing existing approaches, reordering consists of consecutively computing the solution to an associated Sylvester-like equation and constructing K pairs of orthogonal matrices. These pairs define an orthogonal K-cyclic equivalence transformation that swaps adjacent diagonal blocks in the Schur form. An error analysis of this swapping procedure is presented, which extends existing results for reordering eigenvalues in the generalized real Schur form of a regular pair (A, E). Our direct reordering method is used to compute periodic deflating subspace pairs corresponding to a specified set of eigenvalues. This computational task arises in various applications related to discrete-time periodic descriptor systems. Computational experiments confirm the stability and reliability of the presented eigenvalue reordering method.

AMS subject classification (2000): 65F15, 15A18, 93B60.

Key words: generalized product of a K-cyclic matrix pair sequence, generalized periodic real Schur form, eigenvalue reordering, periodic generalized coupled Sylvester equation, K-cyclic equivalence transformation, generalized periodic eigenvalue problem.

1 Introduction.

Discrete-time periodic descriptor systems of the form

\[
\begin{aligned}
E_k x_{k+1} &= A_k x_k + B_k u_k, \\
y_k &= C_k x_k + D_k u_k,
\end{aligned} \tag{1.1}
\]

Received December 22, 2006. Accepted June 25, 2007. Communicated by Axel Ruhe. This research was conducted using the resources of the High Performance Computing Center North (HPC2N). Financial support has been provided by the Swedish Research Council under grant VR 621-2001-3284 and by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128. The third author was also supported by the DFG Emmy Noether Fellowship KR 2950/1-1.


R. GRANAT ET AL.

with $A_k = A_{k+K}$, $E_k = E_{k+K} \in \mathbb{R}^{n \times n}$, $B_k = B_{k+K} \in \mathbb{R}^{n \times m}$, $C_k = C_{k+K} \in \mathbb{R}^{r \times n}$, and $D_k = D_{k+K} \in \mathbb{R}^{r \times m}$ for some period $K \ge 1$ arise naturally from processes that exhibit seasonal or periodic behavior [6]. Design and analysis problems for such systems (see, e.g., [31, 32, 39]) are conceptually studied in terms of the state transition matrices [39]

\[
\Phi_{E^{-1}A}(j, i) = E_{j-1}^{-1} A_{j-1} E_{j-2}^{-1} A_{j-2} \cdots E_i^{-1} A_i \in \mathbb{R}^{n \times n},
\]

with the convention $\Phi_{E^{-1}A}(i, i) = I_n$. A state transition matrix over a complete period, $\Phi_{E^{-1}A}(j+K, j)$, is the monodromy matrix of (1.1) at time $j$. Its eigenvalues are called the characteristic multipliers and are independent of the time $j$. Specifically, the monodromy matrix at time $j = 0$ corresponds to the matrix product

\[
E_{K-1}^{-1} A_{K-1} E_{K-2}^{-1} A_{K-2} \cdots E_1^{-1} A_1 E_0^{-1} A_0. \tag{1.2}
\]

Matrix products of the general form (1.2) are studied, e.g., in [3, 5, 26, 40].
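For small, well-conditioned factors $E_k$, the monodromy matrix and the time-independence of its characteristic multipliers can be illustrated by explicit products (illustration only; as discussed below, library codes avoid forming (1.2) explicitly and instead use the periodic QZ algorithm). A sketch with hypothetical random data, NumPy assumed:

```python
import numpy as np

rng = np.random.default_rng(5)
K, n = 4, 3
A = [rng.standard_normal((n, n)) for _ in range(K)]
E = [np.eye(n) + 0.1 * rng.standard_normal((n, n)) for _ in range(K)]

def monodromy(j):
    """Phi_{E^-1 A}(j + K, j), formed by explicit products (illustration only)."""
    Phi = np.eye(n)
    for i in range(j, j + K):
        k = i % K
        Phi = np.linalg.solve(E[k], A[k]) @ Phi
    return Phi

# Characteristic multipliers are independent of the time j:
# monodromy(1) is similar to monodromy(0) via a cyclic shift of the factors.
m0 = np.linalg.eigvals(monodromy(0))
m1 = np.linalg.eigvals(monodromy(1))
assert np.allclose(np.sort(np.abs(m0)), np.sort(np.abs(m1)))
```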

We study the $K$-cyclic matrix pair sequence $(A_k, E_k)$ with $A_k, E_k \in \mathbb{R}^{n \times n}$ from (1.1) via the generalized periodic Schur decomposition [8, 18]: there exists a $K$-cyclic orthogonal matrix pair sequence $(Q_k, Z_k)$ with $Q_k, Z_k \in \mathbb{R}^{n \times n}$ such that, with $k \oplus 1 = (k+1) \bmod K$, we have

\[
\begin{aligned}
S_k &= Q_k^T A_k Z_k, \\
T_k &= Q_k^T E_k Z_{k\oplus 1},
\end{aligned} \tag{1.3}
\]

where all matrices $S_k$, except for some fixed index $j$ with $0 \le j \le K-1$, and all matrices $T_k$ are upper triangular. The matrix $S_j$ is upper quasi-triangular; typically $j$ is chosen to be $0$ or $K-1$. The sequence $(S_k, T_k)$ is the generalized periodic real Schur form (GPRSF) of $(A_k, E_k)$, $k = 0, 1, \ldots, K-1$. The decomposition (1.3) is a $K$-cyclic equivalence transformation of the matrix pair sequence $(A_k, E_k)$.

Computing the GPRSF is the standard method for solving the generalized periodic (product) eigenvalue problem (GPEVP)

\[
E_{K-1}^{-1} A_{K-1} E_{K-2}^{-1} A_{K-2} \cdots E_1^{-1} A_1 E_0^{-1} A_0 \, x = \lambda x, \tag{1.4}
\]

where all matrices in the pairs $(A_k, E_k)$ are general and dense. For $K = 1$, (1.4) corresponds to a generalized eigenvalue problem $Ax = \lambda E x$ with $(A, E)$ regular (see, e.g., [12]). Using the GPRSF to solve a GPEVP for $K \ge 1$ means that we do not need to form any of the matrix products in (1.4) explicitly, which avoids numerical instabilities and allows us to handle singular factors $E_k$.

The $1 \times 1$ and $2 \times 2$ blocks on the diagonal of a GPRSF define $t \le n$ $K$-cyclic diagonal block pairs $(S_{ii}^{(k)}, T_{ii}^{(k)})$, corresponding to real eigenvalues and complex conjugate pairs of eigenvalues, respectively.

A real eigenvalue is simply given by

\[
\lambda_i = \frac{S_{ii}^{(K-1)}}{T_{ii}^{(K-1)}} \cdot \frac{S_{ii}^{(K-2)}}{T_{ii}^{(K-2)}} \cdots \frac{S_{ii}^{(0)}}{T_{ii}^{(0)}}.
\]


PERIODIC DEFLATING SUBSPACES OF SPECIFIED SPECTRUM

This eigenvalue is called infinite if $\prod_{k=0}^{K-1} T_{ii}^{(k)} = 0$ but $\prod_{k=0}^{K-1} S_{ii}^{(k)} \ne 0$. If there are $1 \times 1$ blocks for which both $\prod_{k=0}^{K-1} S_{ii}^{(k)} = 0$ and $\prod_{k=0}^{K-1} T_{ii}^{(k)} = 0$, then the $K$-cyclic matrix pair sequence $(A_k, E_k)$ is called singular; otherwise the sequence $(A_k, E_k)$ is called regular. In the degenerate singular case the eigenvalues become ill-defined, and other tools [28, 37] need to be used to study the periodic eigenvalue problem. For the rest of the paper it is therefore assumed that $(A_k, E_k)$ is regular.

For two complex conjugate eigenvalues $\lambda_i, \bar\lambda_i$, all matrices $T_{ii}^{(k)}$ are nonsingular and

\[
\lambda_i, \bar\lambda_i \in \lambda\big( T_{ii}^{(K-1)-1} S_{ii}^{(K-1)} T_{ii}^{(K-2)-1} S_{ii}^{(K-2)} \cdots T_{ii}^{(0)-1} S_{ii}^{(0)} \big),
\]

where $\lambda(M)$ denotes the set of eigenvalues of a matrix $M$. In finite precision arithmetic, great care has to be exercised to avoid underflow and overflow in the explicit eigenvalue computation, especially when it involves $2 \times 2$ blocks [35].

For every $l$ with $1 \le l \le n$ such that no $2 \times 2$ block resides in $S_j(l : l+1, l : l+1)$, the first $l$ pairs of columns of $(Q_0, Z_0)$ span a deflating subspace pair corresponding to the first $l$ eigenvalues of the matrix product (1.2). More generally, the first $l$ pairs of columns of $(Q_k, Z_k)$ span a left and right periodic (or cyclic) deflating subspace pair sequence associated with the first $l$ eigenvalues of the matrix product (1.2) [5].

The decomposition (1.3) is computed via the periodic QZ algorithm (see, e.g., [8, 18, 24, 25]), which consists of an initial reduction to generalized periodic Hessenberg form and a subsequent iterative process to generalized periodic Schur form. In [38], the generalized periodic Schur form is extended to periodic matrix pairs with time-varying and possibly rectangular dimensions. This includes a preprocessing step that truncates parts corresponding to spurious characteristic values, which then yields square system matrices of constant dimensions.
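For the real-eigenvalue case, one standard way to avoid the underflow and overflow hazards just mentioned is to accumulate the quotient product in logarithmic form. A sketch (plain Python; `real_multiplier` is a hypothetical helper, not a routine from the paper), assuming all $T_{ii}^{(k)} \ne 0$:

```python
import math

def real_multiplier(s_diag, t_diag):
    """prod_k s_k / t_k with the sign and the log-magnitude tracked
    separately, so very long periods neither overflow nor underflow."""
    sign, log_abs = 1.0, 0.0
    for s, t in zip(s_diag, t_diag):
        if s == 0.0:
            return 0.0
        sign *= math.copysign(1.0, s) * math.copysign(1.0, t)
        log_abs += math.log(abs(s)) - math.log(abs(t))
    return sign * math.exp(log_abs)

# 4000 factors whose naive running product would overflow near 10^308,
# although the final value is simply 1
K = 2000
s = [10.0] * K + [0.1] * K
t = [1.0] * (2 * K)
assert math.isclose(real_multiplier(s, t), 1.0, rel_tol=1e-9)
```

Library routines use a similar idea (separate mantissa and exponent accumulation) rather than logarithms, but the principle is the same.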

1.1 Ordered GPRSF and periodic deflating subspaces.

In many applications, it is desirable to have the eigenvalues along the diagonal of the GPRSF in a certain order. If the generalized periodic Schur form has its eigenvalues ordered in a certain way, as in (1.5), it is called an ordered GPRSF. For example, if we have

\[
S_k = \begin{bmatrix} S_{11}^{(k)} & S_{12}^{(k)} \\ 0 & S_{22}^{(k)} \end{bmatrix}, \qquad
T_k = \begin{bmatrix} T_{11}^{(k)} & T_{12}^{(k)} \\ 0 & T_{22}^{(k)} \end{bmatrix}, \tag{1.5}
\]

with $S_{11}^{(k)}, T_{11}^{(k)} \in \mathbb{R}^{l \times l}$ such that the upper left part sequence $(S_{11}^{(k)}, T_{11}^{(k)})$ contains all eigenvalues in the open unit disc, then $(S_k, T_k)$ is an ordered GPRSF and the first $l$ columns of the sequence $Z_k$ span stable right periodic deflating subspaces. For initial states $x_0 \in \mathrm{span}(Z_0 e_1, \ldots, Z_0 e_l)$ with $e_i$ being the


R. GRANAT ET AL.

ith unit vector, the states of the homogeneous system E_k x_{k+1} = A_k x_k satisfy x_k ∈ span(Z_k e_1, . . . , Z_k e_l) and 0 is an asymptotically stable equilibrium.

Other important applications relating to periodic discrete-time systems include the stable-unstable spectral separation for computing the numerical solution of the discrete-time periodic Riccati equation [38] in LQ-optimal control, which we illustrate in Section 2, and pole placement, where the goal is to move some or all of the poles to desired locations in the complex plane [29, 15]. In [4], ordered Schur forms are used for solving generalized Hamiltonian eigenvalue problems.

In this paper, we extend the work in [2, 14, 21, 25, 15] to perform eigenvalue reordering in a regular periodic matrix pair sequence in GPRSF.

The rest of the paper is organized as follows. In Section 2, we illustrate how an ordered GPRSF can be used to solve the discrete-time periodic Riccati equation that arises in an LQ-optimal control problem. Section 3 presents our direct method for reordering eigenvalues of a periodic (cyclic) matrix pair sequence (A_k, E_k) in GPRSF. To compute an ordered GPRSF, a method for reordering adjacent K-cyclic diagonal block pairs is combined with a bubble-sort like procedure in an LAPACK-style [1, 2, 23] fashion. The proposed method for swapping adjacent diagonal block pair sequences relies on orthogonal K-cyclic equivalence transformations, and the core step consists of computing the solution to an associated periodic generalized coupled Sylvester equation, which is discussed in Section 3.4. An error analysis of the direct reordering method is presented in Section 4, which extends and generalizes results from [21, 14]. In Section 5, we discuss some implementation issues regarding the solution of small-sized periodic generalized coupled Sylvester equations and how we control and guarantee stability of the reordering. Some examples and computational results are presented and discussed in Section 6. Finally, in Section 7 we discuss some extensions of the reordering method.

2 LQ-optimal control and periodic deflating subspaces.

Given the system (1.1), the aim of linear quadratic (LQ) optimal control is to find a feedback sequence u_k which stabilizes the system and minimizes the functional

  (1/2) Σ_{k=0}^{∞} ( x_k^T H_k x_k + u_k^T N_k u_k ),

with H_k ∈ R^{n×n} symmetric positive semidefinite and N_k ∈ R^{m×m} symmetric positive definite. Moreover, we suppose that the weighting matrices are K-periodic, i.e., H_{k+K} = H_k and N_{k+K} = N_k. Under mild assumptions [7], the optimal feedback is linear and unique. For each k, it can be expressed as

  u_k = −(N_k + B_k^T X_{k+1} B_k)^{−1} B_k^T X_{k+1} A_k x_k,


PERIODIC DEFLATING SUBSPACES OF SPECIFIED SPECTRUM

where X_k = X_{k+K} is the unique symmetric positive semidefinite solution of the discrete-time periodic Riccati equation (DPRE) [18]

  0 = C_k^T H_k C_k − E_{k−1}^T X_k E_{k−1} + A_k^T X_{k+1} A_k
      − A_k^T X_{k+1} B_k (N_k + B_k^T X_{k+1} B_k)^{−1} B_k^T X_{k+1} A_k,    (2.1)

provided that all E_k are invertible. The 2n × 2n periodic matrix pair

  (L_k, M_k) = ( [ A_k  0 ; −C_k^T H_k C_k  E_{k−1}^T ],  [ E_{k−1}  B_k N_k^{−1} B_k^T ; 0  A_k^T ] )

is closely associated with (2.1). Similarly as for the case E_k = I_n [18], it can be shown that this pair has exactly n eigenvalues inside the unit disk under the assumption that (1.1) is d-stabilizable and d-detectable. By reordering the GPRSF of (L_k, M_k) we can compute a periodic deflating subspace defined by the orthogonal matrices U_k, V_k ∈ R^{2n×2n} with U_{k+K} = U_k, V_{k+K} = V_k such that

  U_k^T L_k V_k = [ S_{11}^{(k)}  S_{12}^{(k)} ; 0  S_{22}^{(k)} ],   U_k^T M_k V_{k+1} = [ T_{11}^{(k)}  T_{12}^{(k)} ; 0  T_{22}^{(k)} ],

where the n × n periodic matrix pair (S_{11}^{(k)}, T_{11}^{(k)}) contains all eigenvalues inside the unit disk. If we partition

  U_k = [ U_{11}^{(k)}  U_{12}^{(k)} ; U_{21}^{(k)}  U_{22}^{(k)} ]

with U_{ij}^{(k)} ∈ R^{n×n}, then

  U_{21}^{(k)} [U_{11}^{(k)}]^{−1} = X_k E_{k−1},

from which X_k can be computed. The proof of this relation is similar as for the case K = 1; see, e.g., [27]. We note that if N_k is not well-conditioned then it is better to work with (2n+m) × (2n+m) matrix pairs, as described in [27].

3 Direct method for eigenvalue reordering in GPRSF.

Given a regular K-cyclic matrix pair sequence (A_k, E_k) in GPRSF, our method to compute an ordered GPRSF (1.5) with respect to a set of specified eigenvalues reorders 1 × 1 and 2 × 2 diagonal blocks in the GPRSF such that the selected set of eigenvalues appears in the matrix pair sequence (S_{11}^{(k)}, T_{11}^{(k)}). Following LAPACK, we assume that the set of specified eigenvalues is provided as an index vector for the blocks of eigenvalue pairs that should appear in (S_{11}^{(k)}, T_{11}^{(k)}). The procedure is now to swap adjacent diagonal blocks in the GPRSF in a bubble-sort fashion such that the specified eigenvalue ordering is satisfied [1, 2, 23]. In the following, we focus on the K-cyclic swapping of diagonal blocks using orthogonal transformations.


3.1 Swapping of K-cyclic diagonal block matrix pairs.

Consider a regular K-cyclic matrix pair sequence (A_k, E_k) in GPRSF

  (A_k, E_k) = ( [ A_{11}^{(k)}  A_{12}^{(k)} ; 0  A_{22}^{(k)} ],  [ E_{11}^{(k)}  E_{12}^{(k)} ; 0  E_{22}^{(k)} ] )    (3.1)

with A_{11}^{(k)}, E_{11}^{(k)} ∈ R^{p1×p1} and A_{22}^{(k)}, E_{22}^{(k)} ∈ R^{p2×p2}, for k = 0, 1, . . . , K − 1. Swapping consists of computing orthogonal matrices U_k, V_k such that

  [ Ã_{11}^{(k)}  Ã_{12}^{(k)} ; 0  Ã_{22}^{(k)} ] = U_k^T [ A_{11}^{(k)}  A_{12}^{(k)} ; 0  A_{22}^{(k)} ] V_k,    (3.2)
  [ Ẽ_{11}^{(k)}  Ẽ_{12}^{(k)} ; 0  Ẽ_{22}^{(k)} ] = U_k^T [ E_{11}^{(k)}  E_{12}^{(k)} ; 0  E_{22}^{(k)} ] V_{k⊕1},    (3.3)

for k = 0, . . . , K − 1, and

  λ(Π̃_{11}) = λ(Π_{22}),  λ(Π̃_{22}) = λ(Π_{11}),    (3.4)

where

  Π_{ii} = [E_{ii}^{(K−1)}]^{−1} A_{ii}^{(K−1)} · · · [E_{ii}^{(0)}]^{−1} A_{ii}^{(0)},    (3.5)
  Π̃_{ii} = [Ẽ_{ii}^{(K−1)}]^{−1} Ã_{ii}^{(K−1)} · · · [Ẽ_{ii}^{(0)}]^{−1} Ã_{ii}^{(0)}.    (3.6)

If some of the E_{ii}^{(k)} are singular then the products (3.5) and (3.6) should only be understood in a formal sense, with their finite and infinite eigenvalues defined via the GPRSF. The relation (3.4) means that all eigenvalues of Π_{22} are transferred to Π̃_{11} and all eigenvalues of Π_{11} to Π̃_{22}. For our purpose, A_{ii}^{(k)}, E_{ii}^{(k)} ∈ R^{p_i×p_i} are the diagonal blocks of a GPRSF and it can thus be assumed that p_i ∈ {1, 2}.

The K-cyclic swapping is performed in two main steps. First, the sequence (A_k, E_k) in (3.1) is block diagonalized by a nonorthogonal K-cyclic equivalence transformation. Second, orthogonal transformation matrices are computed from this matrix pair sequence that perform the required K-cyclic swapping.

3.2 Swapping by block diagonalization and permutation.

Let us consider a K-cyclic matrix pair sequence (L_k, R_k), with L_k, R_k ∈ R^{p1×p2}, which solves the periodic generalized coupled Sylvester equation (PGCSY)

  A_{11}^{(k)} R_k − L_k A_{22}^{(k)} = −A_{12}^{(k)},
  E_{11}^{(k)} R_{k⊕1} − L_k E_{22}^{(k)} = −E_{12}^{(k)}.    (3.7)


Then (L_k, R_k) defines an equivalence transformation that block diagonalizes the K-cyclic matrix pair sequence (A_k, E_k) in (3.1):

  [ A_{11}^{(k)}  A_{12}^{(k)} ; 0  A_{22}^{(k)} ] = [ I_{p1}  L_k ; 0  I_{p2} ] [ A_{11}^{(k)}  0 ; 0  A_{22}^{(k)} ] [ I_{p1}  −R_k ; 0  I_{p2} ],
  [ E_{11}^{(k)}  E_{12}^{(k)} ; 0  E_{22}^{(k)} ] = [ I_{p1}  L_k ; 0  I_{p2} ] [ E_{11}^{(k)}  0 ; 0  E_{22}^{(k)} ] [ I_{p1}  −R_{k⊕1} ; 0  I_{p2} ],    (3.8)

for k = 0, 1, . . . , K − 1.

The diagonal blocks of the block diagonal matrices in (3.8) are swapped by a simple equivalence permutation:

  [ 0  I_{p2} ; I_{p1}  0 ] ( [ A_{11}^{(k)}  0 ; 0  A_{22}^{(k)} ], [ E_{11}^{(k)}  0 ; 0  E_{22}^{(k)} ] ) [ 0  I_{p1} ; I_{p2}  0 ]
    = ( [ A_{22}^{(k)}  0 ; 0  A_{11}^{(k)} ], [ E_{22}^{(k)}  0 ; 0  E_{11}^{(k)} ] ).    (3.9)

Altogether, by defining the matrices

  X_k = [ L_k  I_{p1} ; I_{p2}  0 ],  Y_k = [ 0  I_{p2} ; I_{p1}  −R_k ],  k = 0, . . . , K − 1,    (3.10)

we obtain a non-orthogonal K-cyclic equivalence transformation such that

  [ A_{11}^{(k)}  A_{12}^{(k)} ; 0  A_{22}^{(k)} ] = X_k [ A_{22}^{(k)}  0 ; 0  A_{11}^{(k)} ] Y_k,
  [ E_{11}^{(k)}  E_{12}^{(k)} ; 0  E_{22}^{(k)} ] = X_k [ E_{22}^{(k)}  0 ; 0  E_{11}^{(k)} ] Y_{k⊕1}.    (3.11)

It remains to show the existence of a solution to (3.7).

Lemma 3.1. Let the K-cyclic matrix pair sequences (A_{11}^{(k)}, E_{11}^{(k)}) and (A_{22}^{(k)}, E_{22}^{(k)}) be regular. Then the PGCSY (3.7) has a unique solution if and only if

  λ(Π_{11}) ∩ λ(Π_{22}) = ∅,    (3.12)

where Π_{ii} is the formal matrix product defined in (3.5).

Proof. Since (3.7) is a system of 2p1p2K linear equations in 2p1p2K variables, it suffices to show that the corresponding linear operator L : (R^{p1×p2})^{2K} → (R^{p1×p2})^{2K}, defined by

  L : (L_k, R_k)_{k=0}^{K−1} ↦ ( A_{11}^{(k)} R_k − L_k A_{22}^{(k)},  E_{11}^{(k)} R_{k⊕1} − L_k E_{22}^{(k)} )_{k=0}^{K−1},    (3.13)

has a trivial kernel if and only if (3.12) is satisfied.


1. Let λ ∈ λ(Π_{11}) ∩ λ(Π_{22}) and assume λ ≠ ∞ (the case λ = ∞ can be treated analogously by switching the roles of E and A, and reversing the index k). By the complex periodic Schur decomposition, there are sequences of nonzero right and left eigenvectors x_0, . . . , x_{K−1} ∈ C^{p1} and y_0, . . . , y_{K−1} ∈ C^{p2} satisfying

  λ_k E_{11}^{(k)} x_{k⊕1} = A_{11}^{(k)} x_k,   μ_k y_k^H E_{22}^{(k⊖1)} = y_{k⊕1}^H A_{22}^{(k)},    (3.14)

for k = 0, . . . , K − 1, where

  λ = λ_0 · · · λ_{K−1} = μ_0 · · · μ_{K−1}    (3.15)

and E_{11}^{(k)} x_{k⊕1} ≠ 0, y_k^H E_{22}^{(k⊖1)} ≠ 0. Here, k⊖1 denotes (k − 1) mod K. The relation (3.15) implies the existence of a sequence γ_0, . . . , γ_{K−1} ∈ C such that

  γ_k λ_k = γ_{k⊕1} μ_k,  k = 0, . . . , K − 1,    (3.16)

with at least one γ_k being nonzero. Defining

  R_k = γ_k x_k y_k^H E_{22}^{(k⊖1)},   L_k = γ_{k⊕1} E_{11}^{(k)} x_{k⊕1} y_{k⊕1}^H,

this guarantees that at least one of the matrices R_k and L_k is nonzero. Moreover, (3.14) and (3.16) yield

  A_{11}^{(k)} R_k − L_k A_{22}^{(k)} = γ_k A_{11}^{(k)} x_k y_k^H E_{22}^{(k⊖1)} − γ_{k⊕1} E_{11}^{(k)} x_{k⊕1} y_{k⊕1}^H A_{22}^{(k)}
    = (γ_k λ_k − γ_{k⊕1} μ_k) E_{11}^{(k)} x_{k⊕1} y_k^H E_{22}^{(k⊖1)} = 0,
  E_{11}^{(k)} R_{k⊕1} − L_k E_{22}^{(k)} = γ_{k⊕1} E_{11}^{(k)} x_{k⊕1} y_{k⊕1}^H E_{22}^{(k)} − γ_{k⊕1} E_{11}^{(k)} x_{k⊕1} y_{k⊕1}^H E_{22}^{(k)} = 0.

Hence, the kernel of L is nonzero if (3.12) is not satisfied.

2. For the other direction of the proof, assume that (3.12) is satisfied. We first treat the case when all coefficient matrices are of order 1, i.e., we consider

  α_1^{(k)} r_k − l_k α_2^{(k)} = 0,
  β_1^{(k)} r_{k⊕1} − l_k β_2^{(k)} = 0,    (3.17)

with scalars α_j^{(k)} and β_j^{(k)}. Because of (3.12), one of the products β_1^{(0)} · · · β_1^{(K−1)} or β_2^{(0)} · · · β_2^{(K−1)} must be nonzero. Without loss of generality, we may assume that β_2^{(0)} · · · β_2^{(K−1)} ≠ 0. Then (3.17) implies

  α_1^{(k)} r_k = ( α_2^{(k)} β_1^{(k)} / β_2^{(k)} ) r_{k⊕1},  k = 0, . . . , K − 1.    (3.18)


Recursively substituting r_k and r_{k⊕1} yields

  ( α_1^{(0)} · · · α_1^{(K−1)} ) r_0 = [ ( α_2^{(0)} · · · α_2^{(K−1)} ) ( β_1^{(0)} · · · β_1^{(K−1)} ) / ( β_2^{(0)} · · · β_2^{(K−1)} ) ] r_0.

The regularity assumption implies that one of α_1^{(0)} · · · α_1^{(K−1)} or β_1^{(0)} · · · β_1^{(K−1)} is nonzero. Together with (3.12), this implies r_0 = 0, and in combination with (3.18) we get r_k = 0 for all k = 0, . . . , K − 1. In addition, from (3.17) we have

  l_k = ( β_1^{(k)} / β_2^{(k)} ) r_{k⊕1},  k = 0, . . . , K − 1,

which in turn results in l_k = 0.

For coefficient matrices of larger order, we proceed by induction. By the complex Schur decomposition, we may assume that A_{jj}^{(k)} and E_{jj}^{(k)} are upper triangular. Conformably partition

  L^{(k)} = [ L_{11}^{(k)}  L_{12}^{(k)} ; L_{21}^{(k)}  L_{22}^{(k)} ],  R^{(k)} = [ R_{11}^{(k)}  R_{12}^{(k)} ; R_{21}^{(k)}  R_{22}^{(k)} ],
  A_{11}^{(k)} = [ A_{11}^{(k)}  A_{12}^{(k)} ; 0  A_{22}^{(k)} ],  A_{22}^{(k)} = [ A_{33}^{(k)}  A_{34}^{(k)} ; 0  A_{44}^{(k)} ],

and in an analogous manner E_{jj}^{(k)}. Then (3.7) with the right hand sides replaced by zero yields

  A_{22}^{(k)} R_{21}^{(k)} − L_{21}^{(k)} A_{33}^{(k)} = 0,   E_{22}^{(k)} R_{21}^{(k⊕1)} − L_{21}^{(k)} E_{33}^{(k)} = 0.

By the induction assumption, we have L_{21}^{(k)} = R_{21}^{(k)} = 0 for all k. Subsequently, analogous PGCSYs of smaller order can be found for (L_{11}^{(k)}, R_{11}^{(k)}), (L_{22}^{(k)}, R_{22}^{(k)}), and (L_{12}^{(k)}, R_{12}^{(k)}), see also [13], eventually showing that L^{(k)} = R^{(k)} = 0. This completes the proof.

Related periodic Sylvester equations were also studied in, e.g., [30, 36] and an overview was given in [39]. For a recursive solution method based on the last part of the proof of Lemma 3.1, see [13].

3.3 Swapping by orthogonal transformation matrices.

From the definition (3.10), it can be observed that the first block column of X_k and the last block row of Y_k have full column and row rank, respectively. Hence, if we choose orthogonal matrices Q_k and Z_k from QR and RQ factorizations such that

  [ L_k ; I_{p2} ] = Q_k [ T_L^{(k)} ; 0 ],   [ I_{p1}  −R_k ] = [ 0  T_R^{(k)} ] Z_k^T,    (3.19)


then T_L^{(k)} ∈ R^{p2×p2} and T_R^{(k)} ∈ R^{p1×p1} are not only upper triangular but also nonsingular for k = 0, 1, . . . , K − 1.

Partitioning Q_k and Z_k in conformity with X_k and Y_k as

  Q_k = [ Q_{11}^{(k)}  Q_{12}^{(k)} ; Q_{21}^{(k)}  Q_{22}^{(k)} ],   Z_k = [ Z_{11}^{(k)}  Z_{12}^{(k)} ; Z_{21}^{(k)}  Z_{22}^{(k)} ],

we obtain

  Q_k^T X_k = [ T_L^{(k)}  [Q_{11}^{(k)}]^T ; 0  [Q_{12}^{(k)}]^T ],   Y_k Z_k = [ Z_{21}^{(k)}  Z_{22}^{(k)} ; 0  T_R^{(k)} ].    (3.20)

By applying (Q_k, Z_k) as an orthogonal K-cyclic equivalence transformation to (A_k, E_k) we obtain

  ( Q_k^T A_k Z_k, Q_k^T E_k Z_{k⊕1} )
    = ( Q_k^T [ A_{11}^{(k)}  A_{12}^{(k)} ; 0  A_{22}^{(k)} ] Z_k,  Q_k^T [ E_{11}^{(k)}  E_{12}^{(k)} ; 0  E_{22}^{(k)} ] Z_{k⊕1} )
    = ( Q_k^T X_k [ A_{22}^{(k)}  0 ; 0  A_{11}^{(k)} ] Y_k Z_k,  Q_k^T X_k [ E_{22}^{(k)}  0 ; 0  E_{11}^{(k)} ] Y_{k⊕1} Z_{k⊕1} )
    = ( [ Ã_{11}^{(k)}  Ã_{12}^{(k)} ; 0  Ã_{22}^{(k)} ],  [ Ẽ_{11}^{(k)}  Ẽ_{12}^{(k)} ; 0  Ẽ_{22}^{(k)} ] ),

where

  Ã_{11}^{(k)} = T_L^{(k)} A_{22}^{(k)} Z_{21}^{(k)},
  Ã_{12}^{(k)} = T_L^{(k)} A_{22}^{(k)} Z_{22}^{(k)} + [Q_{11}^{(k)}]^T A_{11}^{(k)} T_R^{(k)},
  Ã_{22}^{(k)} = [Q_{12}^{(k)}]^T A_{11}^{(k)} T_R^{(k)},    (3.21)

and

  Ẽ_{11}^{(k)} = T_L^{(k)} E_{22}^{(k)} Z_{21}^{(k⊕1)},
  Ẽ_{12}^{(k)} = T_L^{(k)} E_{22}^{(k)} Z_{22}^{(k⊕1)} + [Q_{11}^{(k)}]^T E_{11}^{(k)} T_R^{(k⊕1)},
  Ẽ_{22}^{(k)} = [Q_{12}^{(k)}]^T E_{11}^{(k)} T_R^{(k⊕1)}.    (3.22)

Note that (3.19) implies the nonsingularity of Q_{12}^{(k)} and Z_{21}^{(k)}. Hence, from the equations above, we see that (Ã_{11}^{(k)}, Ẽ_{11}^{(k)}) and (Ã_{22}^{(k)}, Ẽ_{22}^{(k)}) are K-cyclic equivalent to (A_{22}^{(k)}, E_{22}^{(k)}) and (A_{11}^{(k)}, E_{11}^{(k)}), respectively. In other words, the eigenvalues of the K-cyclic matrix pair sequence (A_k, E_k) have been reordered as desired.

We remark that (Ã_{11}^{(k)}, Ẽ_{11}^{(k)}) and (Ã_{22}^{(k)}, Ẽ_{22}^{(k)}) are generally not in GPRSF after the K-cyclic swapping and have to be further transformed by orthogonal transformations to restore the GPRSF of the matrix pair sequence (A_k, E_k) (see Section 5.2).


3.4 Matrix representation of the PGCSY.

The key step of the reordering method is to solve the associated PGCSY (3.7). Using Kronecker products this problem can be rewritten as a linear system of equations

  Z_PGCSY x = c,    (3.23)

where Z_PGCSY is a 2Kp1p2 × 2Kp1p2 matrix representation of the associated linear operator (3.13):

  Z_PGCSY =
  [ −A_{22}^{(0)T} ⊗ I_{p1}                                                                        I_{p2} ⊗ A_{11}^{(0)} ]
  [ −E_{22}^{(0)T} ⊗ I_{p1}   I_{p2} ⊗ E_{11}^{(0)}                                                                      ]
  [                           I_{p2} ⊗ A_{11}^{(1)}   −A_{22}^{(1)T} ⊗ I_{p1}                                            ]
  [                                                   −E_{22}^{(1)T} ⊗ I_{p1}   ⋱                                        ]
  [                                                             ⋱               ⋱                                        ]
  [                                        I_{p2} ⊗ A_{11}^{(K−1)}   −A_{22}^{(K−1)T} ⊗ I_{p1}                           ]
  [                                                   −E_{22}^{(K−1)T} ⊗ I_{p1}   I_{p2} ⊗ E_{11}^{(K−1)}                ],

and x and c are 2Kp1p2 × 1 vector representations of the assembled unknowns and right hand sides, respectively:

  x = [ vec(L_0); vec(R_1); vec(L_1); vec(R_2); . . . ; vec(R_{K−1}); vec(L_{K−1}); vec(R_0) ],
  c = [ vec(−A_{12}^{(0)}); vec(−E_{12}^{(0)}); vec(−A_{12}^{(1)}); vec(−E_{12}^{(1)}); . . . ; vec(−A_{12}^{(K−1)}); vec(−E_{12}^{(K−1)}) ].

Here, the operator vec(M) stacks the columns of a matrix M on top of each other into one long vector. Note also that only the nonzero blocks of Z_PGCSY are displayed explicitly above. The sparsity structure of Z_PGCSY can be exploited when using Gaussian elimination with partial pivoting (GEPP) or a QR factorization to solve (3.23); see Section 5 for more details.

By Lemma 3.1, the matrix Z_PGCSY is invertible if and only if the eigenvalue condition (3.12) is fulfilled. Throughout the rest of this paper we assume that this condition holds. If the condition is violated then, since (A_k, E_k) is in GPRSF, the eigenvalues of Π_{11} and Π_{22} are actually equal and there is in principle no need for swapping. The invertibility of Z_PGCSY is equivalent to

  sep[PGCSY] = σ_min(Z_PGCSY) ≠ 0.    (3.24)


As for deflating subspaces of regular matrix pairs (see, e.g., [33, 23]), the quantity sep[PGCSY] measures the sensitivity of the periodic deflating subspace pair of the GPRSF [5, 26, 34]. If K, p1 or p2 become large, this quantity is very expensive to compute explicitly. By using the well-known estimation technique described in [17, 19, 22, 23], reliable sep[PGCSY]-estimates can be computed at the cost of solving a few PGCSYs.
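For small K, p1, p2 the matrix Z_PGCSY can be assembled explicitly. The following numpy sketch, using random illustrative data and the unknown ordering given above, builds Z_PGCSY, solves (3.23), checks the PGCSY residuals (3.7), and evaluates sep[PGCSY] = σ_min(Z_PGCSY) by a full SVD, i.e., the explicit computation that becomes too expensive for large problems:

```python
import numpy as np

rng = np.random.default_rng(2)
K, p1, p2 = 4, 2, 2
m = p1 * p2

A11 = [rng.standard_normal((p1, p1)) for _ in range(K)]
A22 = [rng.standard_normal((p2, p2)) for _ in range(K)]
E11 = [rng.standard_normal((p1, p1)) + 2 * np.eye(p1) for _ in range(K)]
E22 = [rng.standard_normal((p2, p2)) + 2 * np.eye(p2) for _ in range(K)]
A12 = [rng.standard_normal((p1, p2)) for _ in range(K)]
E12 = [rng.standard_normal((p1, p2)) for _ in range(K)]

# Block columns: [vec(L0), vec(R1), vec(L1), ..., vec(L_{K-1}), vec(R0)],
# so vec(L_k) has block index 2k and vec(R_k) has block index (2k - 1) mod 2K.
colL = lambda k: 2 * k
colR = lambda k: (2 * k - 1) % (2 * K)
Zbig = np.zeros((2 * K * m, 2 * K * m))
c = np.zeros(2 * K * m)
for k in range(K):
    rA, rE = 2 * k, 2 * k + 1  # block rows: A- and E-equation for period k
    Zbig[rA*m:(rA+1)*m, colL(k)*m:(colL(k)+1)*m] = -np.kron(A22[k].T, np.eye(p1))
    Zbig[rA*m:(rA+1)*m, colR(k)*m:(colR(k)+1)*m] = np.kron(np.eye(p2), A11[k])
    Zbig[rE*m:(rE+1)*m, colL(k)*m:(colL(k)+1)*m] = -np.kron(E22[k].T, np.eye(p1))
    kp = (k + 1) % K
    Zbig[rE*m:(rE+1)*m, colR(kp)*m:(colR(kp)+1)*m] = np.kron(np.eye(p2), E11[k])
    c[rA*m:(rA+1)*m] = -A12[k].flatten(order='F')
    c[rE*m:(rE+1)*m] = -E12[k].flatten(order='F')

x = np.linalg.solve(Zbig, c)
Lm = [x[colL(k)*m:(colL(k)+1)*m].reshape((p1, p2), order='F') for k in range(K)]
Rm = [x[colR(k)*m:(colR(k)+1)*m].reshape((p1, p2), order='F') for k in range(K)]

# Residuals of (3.7) and the quantity sep[PGCSY] = sigma_min(Z_PGCSY).
res = max(np.linalg.norm(A11[k] @ Rm[k] - Lm[k] @ A22[k] + A12[k]) +
          np.linalg.norm(E11[k] @ Rm[(k+1) % K] - Lm[k] @ E22[k] + E12[k])
          for k in range(K))
sep = np.linalg.svd(Zbig, compute_uv=False)[-1]
print(res, sep > 0)
```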

4 Error analysis of K-cyclic equivalence swapping of diagonal blocks.

In this section, we present an error analysis of the direct method described in Section 3 by extending the results in [21] to the case of periodic matrix pairs. We sometimes omit the index range k = 0, 1, . . . , K − 1, assuming that it is implicitly understood.

In finite precision arithmetic, the transformed matrix pair sequence will be affected by roundoff errors, resulting in a computed sequence (Â_k, Ê_k). We express the computed transformed matrix pairs as

  (Â_k, Ê_k) = (Ã_k + ∆A_k, Ẽ_k + ∆E_k),

where (Ã_k, Ẽ_k) for k = 0, . . . , K − 1 correspond to the exact matrix pairs in the reordered GPRSF of (A_k, E_k). Our task is to derive explicit expressions and upper bounds for the error matrices ∆A_k and ∆E_k. Most critical are of course the subdiagonal blocks of a 2 × 2 block partitioned sequence (∆A_k, ∆E_k). These must be negligible in order to guarantee numerical backward stability for the swapping of diagonal blocks.

Let (L̂_k, R̂_k) = (L_k + ∆L_k, R_k + ∆R_k) denote the computed solution to the associated PGCSY. The residual pair sequence of the computed solution is then given by (Y_1^{(k)}, Y_2^{(k)}), where

  Y_1^{(k)} ≡ A_{11}^{(k)} R̂_k − L̂_k A_{22}^{(k)} + A_{12}^{(k)},
  Y_2^{(k)} ≡ E_{11}^{(k)} R̂_{k⊕1} − L̂_k E_{22}^{(k)} + E_{12}^{(k)}.    (4.1)

In addition, let Q̂_k, T̂_L^{(k)} denote the computed factors of the kth QR factorization

  G_L^{(k)} ≡ [ L̂_k ; I_{p2} ] = Q̂_k [ T̂_L^{(k)} ; 0 ],    (4.2)

where Q̂_k = Q_k + ∆Q_k, T̂_L^{(k)} = T_L^{(k)} + ∆T_L^{(k)}, and Q_k, T_L^{(k)} are the exact factors. Similarly, let Ẑ_k, T̂_R^{(k)} denote the computed factors of the kth RQ factorization

  G_R^{(k)} ≡ [ I_{p1}  −R̂_k ] = [ 0  T̂_R^{(k)} ] Ẑ_k^T,    (4.3)

where Ẑ_k = Z_k + ∆Z_k, T̂_R^{(k)} = T_R^{(k)} + ∆T_R^{(k)}, and Z_k, T_R^{(k)} are the exact factors. If Householder transformations are used to compute the factorizations (4.2)–(4.3), Q̂_k and Ẑ_k are orthogonal to machine precision [41]. The error matrices ∆Q_k and ∆Z_k are essentially bounded by the condition numbers of G_L^{(k)} and G_R^{(k)}, respectively, times the relative errors in these matrices (e.g., see [33, 20]).


We transform (A_k, E_k) using the computed (Q̂_k, Ẑ_k) in a K-cyclic equivalence transformation, giving

  (Q̂_k^T A_k Ẑ_k, Q̂_k^T E_k Ẑ_{k⊕1}) = (Ã_k + ∆A_k, Ẽ_k + ∆E_k),    (4.4)

where (Ã_k, Ẽ_k) is the exact reordered GPRSF of the periodic (A_k, E_k) sequence. Our aim is to derive explicit expressions and norm bounds for blocks of (∆A_k, ∆E_k). First,

  Q̂_k^T A_k Ẑ_k = (Q_k + ∆Q_k)^T A_k (Z_k + ∆Z_k)
    = Q_k^T A_k Z_k + ∆Q_k^T A_k Z_k + Q_k^T A_k ∆Z_k + ∆Q_k^T A_k ∆Z_k,    (4.5)

and by dropping the second order term and using Ã_k = Q_k^T A_k Z_k and ∆Q_k^T Q_k = −Q_k^T ∆Q_k up to first order, we get

  Q̂_k^T A_k Ẑ_k = Ã_k + Ã_k (Z_k^T ∆Z_k) + (−Q_k^T ∆Q_k) Ã_k = Ã_k + ∆A_k,
  with ∆A_k ≡ Ã_k U_k + W_k Ã_k, where U_k = Z_k^T ∆Z_k and W_k = −Q_k^T ∆Q_k.    (4.6)

Similarly, we get

  Q̂_k^T E_k Ẑ_{k⊕1} = Ẽ_k + ∆E_k, with ∆E_k ≡ Ẽ_k U_{k⊕1} + W_k Ẽ_k.    (4.7)

After partitioning U_k, U_{k⊕1}, W_k and (∆A_k, ∆E_k) in conformity with (Ã_k, Ẽ_k) and performing straightforward block matrix multiplications, we get

  ∆A_{11}^{(k)} = Ã_{11}^{(k)} U_{11}^{(k)} + W_{11}^{(k)} Ã_{11}^{(k)} + Ã_{12}^{(k)} U_{21}^{(k)},
  ∆A_{12}^{(k)} = Ã_{11}^{(k)} U_{12}^{(k)} + Ã_{12}^{(k)} U_{22}^{(k)} + W_{11}^{(k)} Ã_{12}^{(k)} + W_{12}^{(k)} Ã_{22}^{(k)},
  ∆A_{21}^{(k)} = Ã_{22}^{(k)} U_{21}^{(k)} + W_{21}^{(k)} Ã_{11}^{(k)},
  ∆A_{22}^{(k)} = Ã_{22}^{(k)} U_{22}^{(k)} + W_{22}^{(k)} Ã_{22}^{(k)} + W_{21}^{(k)} Ã_{12}^{(k)},

and

  ∆E_{11}^{(k)} = Ẽ_{11}^{(k)} U_{11}^{(k⊕1)} + W_{11}^{(k)} Ẽ_{11}^{(k)} + Ẽ_{12}^{(k)} U_{21}^{(k⊕1)},
  ∆E_{12}^{(k)} = Ẽ_{11}^{(k)} U_{12}^{(k⊕1)} + Ẽ_{12}^{(k)} U_{22}^{(k⊕1)} + W_{11}^{(k)} Ẽ_{12}^{(k)} + W_{12}^{(k)} Ẽ_{22}^{(k)},
  ∆E_{21}^{(k)} = Ẽ_{22}^{(k)} U_{21}^{(k⊕1)} + W_{21}^{(k)} Ẽ_{11}^{(k)},
  ∆E_{22}^{(k)} = Ẽ_{22}^{(k)} U_{22}^{(k⊕1)} + W_{22}^{(k)} Ẽ_{22}^{(k)} + W_{21}^{(k)} Ẽ_{12}^{(k)}.

Observe that ∆A_{11}^{(k)}, ∆A_{22}^{(k)}, ∆E_{11}^{(k)}, ∆E_{22}^{(k)} affect the reordered K-cyclic diagonal block pairs and possibly the eigenvalues, while ∆A_{21}^{(k)} and ∆E_{21}^{(k)} are even more critical since they affect the eigenvalues as well as the stability of the reordering; these are the perturbations of interest that we investigate further. The analysis in [21] applied to (4.2)–(4.3) results in the following expressions for


blocks of U_k and W_k:

  U_{11}^{(k)} = −[Z_{21}^{(k)}]^{−1} Z_{22}^{(k)} [T_R^{(k)}]^{−1} ∆R_k Z_{21}^{(k)},
  U_{21}^{(k)} = [T_R^{(k)}]^{−1} ∆R_k Z_{21}^{(k)},
  U_{22}^{(k)} = [T_R^{(k)}]^{−1} ∆R_k Z_{22}^{(k)},

and

  W_{11}^{(k)} = −[Q_{11}^{(k)}]^T ∆L_k [T_L^{(k)}]^{−1},
  W_{21}^{(k)} = −[Q_{12}^{(k)}]^T ∆L_k [T_L^{(k)}]^{−1},
  W_{22}^{(k)} = [Q_{12}^{(k)}]^T ∆L_k [T_L^{(k)}]^{−1} [Q_{11}^{(k)}]^T [Q_{12}^{(k)}]^{−T},

up to first order perturbations. By substituting the expressions for U_{ij}^{(k)} and W_{ij}^{(k)} in ∆A_{ij}^{(k)}, ∆E_{ij}^{(k)}, we obtain

  ∆A_{11}^{(k)} = [Q_{11}^{(k)}]^T Y_1^{(k)} Z_{21}^{(k)},    (4.8)
  ∆A_{21}^{(k)} = [Q_{12}^{(k)}]^T Y_1^{(k)} Z_{21}^{(k)},    (4.9)
  ∆A_{22}^{(k)} = [Q_{12}^{(k)}]^T Y_1^{(k)} Z_{22}^{(k)},    (4.10)

and

  ∆E_{11}^{(k)} = [Q_{11}^{(k)}]^T Y_2^{(k)} Z_{21}^{(k⊕1)},    (4.11)
  ∆E_{21}^{(k)} = [Q_{12}^{(k)}]^T Y_2^{(k)} Z_{21}^{(k⊕1)},    (4.12)
  ∆E_{22}^{(k)} = [Q_{12}^{(k)}]^T Y_2^{(k)} Z_{22}^{(k⊕1)},    (4.13)

with the residuals (Y_1^{(k)}, Y_2^{(k)}) as in (4.1). From the QR and RQ factorizations (3.19) we have

  Q_{21}^{(k)} = [T_L^{(k)}]^{−1},   [T_L^{(k)}]^T T_L^{(k)} = I_{p2} + L_k^T L_k,    (4.14)

and

  [Z_{12}^{(k)}]^T = [T_R^{(k)}]^{−1},   T_R^{(k)} [T_R^{(k)}]^T = I_{p1} + R_k R_k^T.    (4.15)

From (4.14)–(4.15) we obtain the following relations between the singular values of T_L^{(k)}, T_R^{(k)}, L_k and R_k:

  σ²(T_L^{(k)}) = 1 + σ²(L_k),   σ²(T_R^{(k)}) = 1 + σ²(R_k).    (4.16)

Further, from the CS decomposition (see, e.g., [12]) of Q_k and Z_k, respectively, we obtain the relations

  ‖[Q_{12}^{(k)}]^T‖_2 = ‖Q_{21}^{(k)}‖_2,  ‖Q_{22}^{(k)}‖_2 = ‖Q_{11}^{(k)}‖_2,
  ‖[Z_{12}^{(k)}]^T‖_2 = ‖Z_{21}^{(k)}‖_2,  ‖Z_{22}^{(k)}‖_2 = ‖Z_{11}^{(k)}‖_2.


Combining these results, we get

  ‖[Q_{12}^{(k)}]^T‖_2 = ‖[T_L^{(k)}]^{−1}‖_2 = 1/σ_min(T_L^{(k)}) = 1/(1 + σ²_min(L_k))^{1/2},
  ‖Q_{11}^{(k)}‖_2 = σ_max(L_k)/(1 + σ²_max(L_k))^{1/2},

and

  ‖Z_{21}^{(k)}‖_2 = ‖[T_R^{(k)}]^{−1}‖_2 = 1/σ_min(T_R^{(k)}) = 1/(1 + σ²_min(R_k))^{1/2},
  ‖Z_{22}^{(k)}‖_2 = σ_max(R_k)/(1 + σ²_max(R_k))^{1/2},

and we have proved the following theorem by applying the submultiplicativity of matrix norms to (4.8)–(4.13).

Theorem 4.1. After applying the computed transformation matrices Q̂_k, Ẑ_k from (4.2)–(4.3) in a K-cyclic equivalence transformation of (A_k, E_k) defined in (3.1), we get

  Q̂_k^T A_k Ẑ_k = Â_k, where Â_k ≡ Ã_k + ∆A_k = [ Ã_{11}^{(k)}  Ã_{12}^{(k)} ; 0  Ã_{22}^{(k)} ] + [ ∆A_{11}^{(k)}  ∆A_{12}^{(k)} ; ∆A_{21}^{(k)}  ∆A_{22}^{(k)} ],
  Q̂_k^T E_k Ẑ_{k⊕1} = Ê_k, where Ê_k ≡ Ẽ_k + ∆E_k = [ Ẽ_{11}^{(k)}  Ẽ_{12}^{(k)} ; 0  Ẽ_{22}^{(k)} ] + [ ∆E_{11}^{(k)}  ∆E_{12}^{(k)} ; ∆E_{21}^{(k)}  ∆E_{22}^{(k)} ].

The critical blocks of the error matrix pair (∆A_k, ∆E_k) satisfy the following error bounds, up to first order perturbations:

  ‖∆A_{11}^{(k)}‖_2 ≤ σ_max(L_k)/(1 + σ²_max(L_k))^{1/2} · 1/(1 + σ²_min(R_k))^{1/2} · ‖Y_1^{(k)}‖_F,
  ‖∆A_{21}^{(k)}‖_2 ≤ 1/(1 + σ²_min(L_k))^{1/2} · 1/(1 + σ²_min(R_k))^{1/2} · ‖Y_1^{(k)}‖_F,
  ‖∆A_{22}^{(k)}‖_2 ≤ 1/(1 + σ²_min(L_k))^{1/2} · σ_max(R_k)/(1 + σ²_max(R_k))^{1/2} · ‖Y_1^{(k)}‖_F,

and

  ‖∆E_{11}^{(k)}‖_2 ≤ σ_max(L_k)/(1 + σ²_max(L_k))^{1/2} · 1/(1 + σ²_min(R_{k⊕1}))^{1/2} · ‖Y_2^{(k)}‖_F,
  ‖∆E_{21}^{(k)}‖_2 ≤ 1/(1 + σ²_min(L_k))^{1/2} · 1/(1 + σ²_min(R_{k⊕1}))^{1/2} · ‖Y_2^{(k)}‖_F,
  ‖∆E_{22}^{(k)}‖_2 ≤ 1/(1 + σ²_min(L_k))^{1/2} · σ_max(R_{k⊕1})/(1 + σ²_max(R_{k⊕1}))^{1/2} · ‖Y_2^{(k)}‖_F,


for k = 0, 1, . . . , K − 1. Moreover, the matrix pair sequences (Ã_{11}^{(k)}, Ẽ_{11}^{(k)}) and (A_{22}^{(k)}, E_{22}^{(k)}), as well as (Ã_{22}^{(k)}, Ẽ_{22}^{(k)}) and (A_{11}^{(k)}, E_{11}^{(k)}), are K-cyclic equivalent and have the same generalized eigenvalues, respectively.

Remark 4.1. Theorem 4.1 shows that the stability and accuracy of the reordering method is governed mainly by the conditioning and accuracy of the solution to the associated PGCSY. The errors ‖∆A_{ij}^{(k)}‖_2 and ‖∆E_{ij}^{(k)}‖_2 can be as large as the norms of the residuals ‖Y_1^{(k)}‖_F and ‖Y_2^{(k)}‖_F, respectively. Indeed, this happens when the smallest singular values of the exact sequences L_k and R_k are tiny, indicating an ill-conditioned underlying PGCSY equation. We have experimental evidence that ‖Y_1^{(k)}‖_F and ‖Y_2^{(k)}‖_F can be large for large-normed (ill-conditioned) solutions of the associated PGCSY. In the next section, we show how we handle such situations and guarantee backward stability of the periodic reordering method.

Remark 4.2. For period K = 1, Theorem 4.1 reduces to the main theorem of [21] on the perturbation of the generalized eigenvalues under eigenvalue reordering in the generalized real Schur form of a regular matrix pencil.

5 Algorithms and implementation aspects.

In this section, we address some implementation issues of the direct method for reordering eigenvalues in a generalized periodic real Schur form described and analyzed in the previous sections.

5.1 Algorithms for solving the PGCSY.

The linear system (3.23) that arises from the PGCSY (3.7) has a particular structure that needs to be exploited in order to keep the cost of the overall algorithm linear in K. The matrix Z_PGCSY in (3.23) belongs to the class of bordered almost block diagonal (BABD) matrices, which take the more general form

  Z = [ Z_{0,0}                                                            Z_{0,2K−1} ]
      [ Z_{1,0}  Z_{1,1}                                                              ]
      [          Z_{2,1}  Z_{2,2}                                                     ]
      [                   Z_{3,2}  Z_{3,3}                                            ]
      [                            Z_{4,3}  ⋱                                         ]
      [                                     ⋱  Z_{2K−2,2K−2}                          ]
      [                                        Z_{2K−1,2K−2}  Z_{2K−1,2K−1} ],    (5.1)

where each nonzero block Z_{i,j} is m × m (note that m = p1p2 for the matrix Z_PGCSY of size 2Km × 2Km). An overview of numerical methods that address linear systems with such BABD structure is given in [10]. Gaussian elimination with partial pivoting, for example, preserves much of the structure of Z and can be implemented very efficiently. Unfortunately, matrices with BABD structure


happen to be among the rare examples of practical relevance which may lead to numerical instabilities because of excessive pivot growth [43]. Gaussian elimination with complete pivoting avoids this phenomenon but is too expensive both in terms of cost and storage space. In contrast, structured variants of the QR factorization are both numerically stable and efficient [11, 42]. In the following, we describe such a structured QR factorization in more detail.

To solve a linear system Zx = y, we first reduce the matrix Z in (5.1) to upper triangular form. For this purpose, we successively apply Householder transformations to reduce each block [Z_{k,k}^T, Z_{k+1,k}^T]^T, k = 0, 1, . . . , 2K − 2, to upper trapezoidal form, and the block Z_{2K−1,2K−1} to upper triangular form. Each computed Householder transformation is applied to the corresponding block rows (as well as the right hand side y of the equation, which is blocked in conformity with Z) before the next transformation is computed. The factorization procedure is outlined in Algorithm 5.1, where for simplicity of presentation the Householder transformations are accumulated into orthogonal transformation matrices Q_k.

Algorithm 5.1 Overlapping QR factorization of the BABD system Zx = y

Input: Matrix Z ∈ R^{2Km×2Km}, right hand side vector y ∈ R^{2Km}.
Output: Orthogonal transformations Q_k ∈ R^{2m×2m}, k = 0, 1, . . . , 2K − 2, Q_{2K−1} ∈ R^{m×m}, triangular factor R ∈ R^{2Km×2Km} with structure as in Equation (5.2), and updated vector y ∈ R^{2Km} such that Rx = y.

for k = 0 up to 2K − 2 do
  QR factorize: Q_k R_k = [Z_{k,k}^T, Z_{k+1,k}^T]^T
  Update: [Z_{k,k+1}^T, Z_{k+1,k+1}^T]^T = Q_k^T [Z_{k,k+1}^T, Z_{k+1,k+1}^T]^T
  Update: [Z_{k,2K−1}^T, Z_{k+1,2K−1}^T]^T = Q_k^T [Z_{k,2K−1}^T, Z_{k+1,2K−1}^T]^T
  Update right hand side: [y_k^T, y_{k+1}^T]^T = Q_k^T [y_k^T, y_{k+1}^T]^T
end for
QR factorize: Q_{2K−1} R_{2K−1} = Z_{2K−1,2K−1}
Update right hand side: y_{2K−1} = Q_{2K−1}^T y_{2K−1}

It is straightforward to see that this procedure of computing overlapping orthogonal factorizations produces the same amount of fill-in elements in the rightmost block columns of Z as GEPP would produce in the worst case, see also Figure 5.1. More formally, the QR factorization reduces the matrix Z to the following form:

  [ R_0  G_0                                                        F_0       ]
  [      R_1  L_0                                                   F_1       ]
  [           R_2  G_1                                              F_2       ]
  [                R_3  L_1                                         F_3       ]
  [                     ⋱        ⋱                                  ⋮         ]
  [                          R_{2K−4}  G_{K−2}                      F_{2K−4}  ]
  [                                    R_{2K−3}  L_{K−2}            F_{2K−3}  ]
  [                                              R_{2K−2}           G_{K−1}   ]
  [                                                                 R_{2K−1}  ],    (5.2)


Figure 5.1: The resulting R-factor from applying overlapping QR factorizations to the matrix Z_PGCSY for K = 10, p1 = p2 = 2, visualized by the Matlab spy command. The "sawtooth" above the main block diagonal is typical for the PGCSY and does not occur in the case of periodic matrix reordering [14].

with R_k, L_k, F_k, G_k ∈ R^{m×m}: the R_k (k = 0, 1, . . . , 2K − 1) are upper triangular, whereas the L_k (k = 0, 1, . . . , K − 2), G_k (k = 0, 1, . . . , K − 1), and F_k (k = 0, 1, . . . , 2K − 3) are dense matrices. Moreover, the blocks L_k are lower triangular provided that Z_{2,2}, Z_{4,4}, . . . , Z_{2K−2,2K−2} and Z_{2,1}, Z_{4,3}, . . . , Z_{2K−2,2K−3} in (5.1) are lower and upper triangular, respectively, which is the case if the quasi-triangular factor is placed at position k = K. To compute x we employ backward substitution on this structure, as outlined in Algorithm 5.2. All updates of the right hand side vector y in Algorithm 5.2 are general matrix-vector

Algorithm 5.2 Backward substitution for solving Rx = y

Input: Matrix R ∈ R^{2Km×2Km} with the upper triangular BABD structure of (5.2), right hand side vector y ∈ R^{2Km} partitioned in conformity with the structure of R.
Output: Solution vector x ∈ R^{2Km} such that Rx = y.

Solve: R_{2K−1} x_{2K−1} = y_{2K−1}
Update and solve: R_{2K−2} x_{2K−2} = y_{2K−2} − G_{K−1} x_{2K−1}
for i = 0 to 2K − 3 do
  Update: y_i = y_i − F_i x_{2K−1}
end for
for i = K − 2 down to 0 do
  Update and solve: R_{2i+1} x_{2i+1} = y_{2i+1} − L_i x_{2i+2}
  Update and solve: R_{2i} x_{2i} = y_{2i} − G_i x_{2i+1}
end for


multiply and add (GEMV) operations, except the updates involving Li, which are triangular matrix-vector multiply (TRMV) operations. All triangular solves are level-2 TRSV operations.
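In NumPy terms, the backward substitution of Algorithm 5.2 can be sketched as follows. This is a simplified dense-block illustration (an assumption of this presentation, not the paper's kernel code): it uses general solves where an implementation would call TRSV, and dense matrix-vector products where TRMV/GEMV would be used.

```python
import numpy as np

def babd_back_substitute(R, L, G, F, y, m, K):
    """Solve R x = y for the upper triangular BABD structure (5.2).

    R: 2K diagonal blocks R_k (m x m, upper triangular)
    L: K-1 blocks L_k, G: K blocks G_k (first block superdiagonal)
    F: 2K-2 blocks F_k (last block column), y: right hand side (length 2Km)
    """
    x = [None] * (2 * K)
    yb = [y[i * m:(i + 1) * m].copy() for i in range(2 * K)]
    # Solve the last two block rows first
    x[2 * K - 1] = np.linalg.solve(R[2 * K - 1], yb[2 * K - 1])
    x[2 * K - 2] = np.linalg.solve(R[2 * K - 2],
                                   yb[2 * K - 2] - G[K - 1] @ x[2 * K - 1])
    # Eliminate the last block column (the F_i updates)
    for i in range(2 * K - 2):
        yb[i] -= F[i] @ x[2 * K - 1]
    # Sweep upwards, two block rows per period index
    for i in range(K - 2, -1, -1):
        x[2 * i + 1] = np.linalg.solve(R[2 * i + 1],
                                       yb[2 * i + 1] - L[i] @ x[2 * i + 2])
        x[2 * i] = np.linalg.solve(R[2 * i], yb[2 * i] - G[i] @ x[2 * i + 1])
    return np.concatenate(x)
```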

We remark that the new algorithms described here for solving small block-sized PGCSY equations can be used as kernel solvers in recursive blocked algorithms [13] for solving large-scale problems.

Remark 5.1. Solving a linear system with QR factorization yields a small norm-wise backward error [20], i.e., the computed solution x is the exact solution of a slightly perturbed system (Z + ΔZ)x = y, where ‖ΔZ‖_F = O(u‖Z‖_F) with u denoting the unit roundoff. However, the standard implementation of the QR factorization is not row-wise backward stable, i.e., the norm of a row in ΔZ may not be negligible compared to the norm of the corresponding row in Z. This may cause instabilities if the norms of the coefficient matrices Ak, Ek differ significantly. To avoid this effect, we scale each Ak and Ek to Frobenius norm 1 before solving (3.7). Then each block row in Z_PGCSY has Frobenius norm at most √2 and ‖Z_PGCSY‖_F ≤ 2√K. The resulting swapping transformation is applied to the original unscaled K-cyclic matrix pair sequence. The corresponding residuals satisfy
$$\|Y_1^{(k)}\|_F = O\big(u\,\|A_k\|_F\,\|(L_k, R_k)\|_F\big), \qquad \|Y_2^{(k)}\|_F = O\big(u\,\|E_k\|_F\,\|(L_k, R_{k\oplus 1})\|_F\big).$$
Combined with Theorem 4.1, this shows that the backward error of the developed reordering method is norm-wise small for each coefficient Ak and Ek, unless (3.7) is too ill-conditioned.

5.2 K-cyclic equivalence swapping algorithm with stability tests.

Considering the error analysis in Section 4 and in the spirit of [23, 14], we formulate stability test criteria for deciding whether a K-cyclic equivalence swap should be accepted or not.

From Equation (3.19) and the following partition of the transformation matrix sequences Qk and Zk, we obtain the relations
$$L_k Q_{21}^{(k)} - Q_{11}^{(k)} = 0, \qquad Z_{12}^{(k)T} R_k + Z_{22}^{(k)T} = 0, \tag{5.3}$$

which can be computed before the swapping is performed. We use computed quantities of these relations to define the weak stability criterion:
$$R_{\mathrm{weak}} = \max_{0 \le k \le K-1} \max\left( \frac{\|L_k \hat{Q}_{21}^{(k)} - \hat{Q}_{11}^{(k)}\|_F}{\|L_k\|_F},\ \frac{\|\hat{Z}_{12}^{(k)T} R_k + \hat{Z}_{22}^{(k)T}\|_F}{\|R_k\|_F} \right). \tag{5.4}$$

We remark that the relative criterion Rweak should be small even for ill-conditioned PGCSY equations with solutions Lk and Rk of large norm (see also Remarks 5.1 and 6.1). After the swap has been performed, the maximum residual


over the whole K-period defines a strong stability criterion:

$$R_{\mathrm{strong}} = \max_{0 \le k \le K-1} \max\left( \frac{\|A_k - \hat{Q}_k \tilde{A}_k \hat{Z}_k^T\|_F}{\|A_k\|_F},\ \frac{\|E_k - \hat{Q}_k \tilde{E}_k \hat{Z}_{k\oplus 1}^T\|_F}{\|E_k\|_F} \right). \tag{5.5}$$
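As an illustration, the strong criterion is straightforward to evaluate from the computed sequences. The sketch below is a NumPy illustration of this presentation (not code from the paper); the hatted/tilded quantities are passed in as the computed arrays, and k ⊕ 1 is taken modulo K.

```python
import numpy as np

def r_strong(A, E, A_t, E_t, Q, Z):
    """Strong stability criterion (5.5) for a K-cyclic equivalence swap.

    A, E     : lists of the K original coefficient matrices A_k, E_k
    A_t, E_t : lists of the K reordered (computed) matrices
    Q, Z     : lists of the K computed orthogonal transformations
    Indices k+1 are taken modulo K (the circular shift k "oplus" 1).
    """
    K = len(A)
    r = 0.0
    for k in range(K):
        rA = (np.linalg.norm(A[k] - Q[k] @ A_t[k] @ Z[k].T, 'fro')
              / np.linalg.norm(A[k], 'fro'))
        rE = (np.linalg.norm(E[k] - Q[k] @ E_t[k] @ Z[(k + 1) % K].T, 'fro')
              / np.linalg.norm(E[k], 'fro'))
        r = max(r, rA, rE)
    return r
```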

If both Rweak and Rstrong are less than a specified tolerance εu (a small constant times the machine precision), the swap is accepted; otherwise it is rejected. In this way, backward stability is guaranteed for the K-cyclic equivalence swapping. In summary, we have the following algorithm for swapping two matrix pair sequences of diagonal blocks in the GPRSF of a regular K-cyclic matrix pair (Ak, Ek) of size (p1 + p2) × (p1 + p2):

1. Compute the K-cyclic matrix pair sequence (Lk, Rk) by solving the scaled PGCSY (3.7) using Algorithm 5.1 and Algorithm 5.2.

2. Compute the K-cyclic orthogonal matrix sequence Qk using QR factorizations:
$$\begin{bmatrix} L_k \\ I_{p_2} \end{bmatrix} = Q_k \begin{bmatrix} T_L^{(k)} \\ 0 \end{bmatrix}, \quad k = 0, 1, \ldots, K-1.$$

3. Compute the K-cyclic orthogonal matrix sequence Zk using RQ factorizations:
$$\begin{bmatrix} I_{p_1} & -R_k \end{bmatrix} = \begin{bmatrix} 0 & T_R^{(k)} \end{bmatrix} Z_k^T, \quad k = 0, 1, \ldots, K-1.$$

4. Compute (Ãk, Ẽk) = (Qk^T Ak Zk, Qk^T Ek Zk⊕1) for k = 0, 1, . . . , K − 1, i.e., an orthogonal K-cyclic equivalence transformation of (Ak, Ek):
$$\tilde{A}_k \equiv \begin{bmatrix} \tilde{A}_{11}^{(k)} & \tilde{A}_{12}^{(k)} \\ \tilde{A}_{21}^{(k)} & \tilde{A}_{22}^{(k)} \end{bmatrix} = Q_k^T \begin{bmatrix} A_{11}^{(k)} & A_{12}^{(k)} \\ 0 & A_{22}^{(k)} \end{bmatrix} Z_k, \qquad \tilde{E}_k \equiv \begin{bmatrix} \tilde{E}_{11}^{(k)} & \tilde{E}_{12}^{(k)} \\ \tilde{E}_{21}^{(k)} & \tilde{E}_{22}^{(k)} \end{bmatrix} = Q_k^T \begin{bmatrix} E_{11}^{(k)} & E_{12}^{(k)} \\ 0 & E_{22}^{(k)} \end{bmatrix} Z_{k\oplus 1}.$$

5. If Rweak < εu ∧ Rstrong < εu, accept the swap and

5a. set Ã21^(k) = Ẽ21^(k) = 0,

5b. restore the GPRSF of (Ã11^(k), Ẽ11^(k)) and (Ã22^(k), Ẽ22^(k)) by applying the periodic QZ algorithm to the two diagonal block matrix pair sequences;

otherwise reject the swap.

The stability tests in step 5 for accepting a K-cyclic swap guarantee that the subdiagonal blocks Ã21^(k) and Ẽ21^(k) are negligible compared to the rest of the matrices. Step 5b can be performed by a fixed number of operations for adjacent diagonal blocks in the GPRSF, i.e., for pi ∈ {1, 2} (see [14] for the standard periodic matrix case).

Properly implemented, this algorithm requires O(K) floating point operations (flops), where K is the period. When it is used to reorder two adjacent diagonal blocks in a larger n × n periodic matrix pair in GPRSF, the off-diagonal parts are updated by the transformation matrices Qk and Zk, which additionally requires O(Kn) flops.


There are several other important implementation issues to be considered for a completely reliable implementation. For example, iterative refinement in extended precision arithmetic can be used to improve the accuracy of the PGCSY solution and avoid the possibility of rejection (see, e.g., [20]). Our experiences so far concern iterative refinement in standard precision arithmetic, and (as expected) the results show no substantial improvements.

6 Computational experiments.

The direct reordering algorithm described in the previous sections has been implemented in MATLAB. A more robust and efficient Fortran implementation will be included in a forthcoming software toolbox for periodic eigenvalue problems [16]. In this section, we present some numerical results using our prototype implementation. All experiments were carried out in double precision (εmach ≈ 2.2 × 10−16).

The test examples range from well-conditioned to ill-conditioned problems, including matrix pair sequences of small and large period. In Table 6.1, we display some problem characteristics¹: the problem dimension n (2, 3 or 4, corresponding to swapping a mix of 1 × 1 and 2 × 2 blocks), the period K, the computed value of sep[PGCSY] = σmin(Z_PGCSY) (see Section 3.4), and
$$s = 1/\sqrt{1 + \|(L_0, R_0)\|_F^2},$$
where (L0, R0) are the first solution components of the associated PGCSY (3.7). The quantities s and sep[PGCSY] partly govern the sensitivity of the selected eigenvalues and associated periodic deflating subspaces; see [5, 26, 34].

Table 6.1: Problem characteristics.

Example   n    K     sep[PGCSY]   s
I         2    2     1.1E−8       1.4E−4
II        4    10    3.3E−2       4.9E−1
III       4    100   1.4E−3       1.9E−1
IV        4    100   1.4E−14      6.1E−7
V         3    5     7.1E−2       6.2E−1
VI        2    50    1.6E−2       5.8E−1

The results from the periodic reordering are presented in Table 6.2. These include the weak (Rweak) and strong (Rstrong) stability tests, the residual norms for the GPRSF before (Rgprsf) and after (Rreord) the reordering, computed as in Equation (5.5), and a relative orthogonality check of the accumulated transformations after the reordering (Rorth), computed as
$$R_{\mathrm{orth}} = \frac{\max_k\big(\|I_{n_k} - W_k^T W_k\|_F,\ \|I_{n_k} - W_k W_k^T\|_F\big)}{\epsilon_{\mathrm{mach}}},$$

1 The test examples used are available at http://www.cs.umu.se/~granat/gpreord/examples.m.


Table 6.2: Reordering results using QR factorization to solve the associated PGCSY.

Example   Rweak      Rstrong    Rgprsf     Rreord     Rorth   Reig
I         6.3E−17    5.0E−16    0          5.0E−16    2.0     3.2E−9
II        1.6E−16    9.0E−16    4.8E−15    5.6E−15    7.5     4.6E−15
III       1.8E−16    1.3E−15    2.2E−16    3.2E−15    8.3     3.3E−14
IV        8.3E−17    1.0E−15    2.2E−16    2.4E−15    7.6     3.8E−14
V         1.3E−16    7.0E−16    8.3E−17    9.1E−16    2.8     1.8E−15
VI        3.8E−16    8.2E−16    0          9.8E−16    2.0     1.1E−16

where the maximum is taken over the period K for all transformation matrices Qk and Zk. The last column displays the maximum relative change of the eigenvalues after the periodic reordering:
$$R_{\mathrm{eig}} = \max_k \frac{|\lambda_k - \tilde{\lambda}_k|}{|\lambda_k|}, \qquad \lambda_k \in \lambda(\Phi_{E^{-1}A}(K, 0)).$$

Notice that we normally do not compute λi explicitly but keep it as an eigenvalue pair (αi, βi) to avoid losing information because of roundoff errors. This is especially important for tiny and large values of αi and/or βi. The eigenvalues before and after reordering are shown in full precision under each example. For 2 × 2 matrix sequences, we compute the generalized eigenvalues via unitary transformations in the GPRSF, as is done in LAPACK's DTGSEN [1].
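The value of keeping (αi, βi) factor-wise is easy to see already for a product of 1 × 1 factors: accumulating numerators and denominators separately can underflow even when the eigenvalue itself is representable, while accumulating the per-factor ratios is safe. A toy sketch with illustrative numbers (not data from the paper):

```python
import numpy as np

# K scalar factors e_k^{-1} a_k with a_k = 1e-2, e_k = 5e-3: each ratio is 2.0
K = 400
a = np.full(K, 1e-2)
e = np.full(K, 5e-3)

with np.errstate(invalid='ignore', under='ignore'):
    naive = np.prod(a) / np.prod(e)   # both products underflow to 0: 0/0 = nan
factorwise = np.prod(a / e)           # per-factor ratios give 2.0**400, finite
```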

Example I. Consider the following sequence with n = 2, K = 2:
$$A_1 = \begin{bmatrix} 2\varepsilon^{1/2} & -1 \\ 0 & -2\varepsilon^{1/2} \end{bmatrix}, \qquad A_2 = E_1 = E_2 = \begin{bmatrix} \varepsilon^{1/2} & 1 \\ 0 & \varepsilon^{1/2} \end{bmatrix}.$$

This product has the (α, β)-pairs

(α1, β1) = (4.4408920985006, 2.2204460492503)× 10−16,

(α2, β2) = (−4.4408920985006,−2.2204460492503)× 10−16,

which correspond to well-defined eigenvalues λ1 = 2.0 and λ2 = −2.0. But all αi and βi are at the machine precision level, which signals an obvious risk of losing accuracy after the reordering:

(α1, β1) = (9.5161972853921,−4.7580986273341)× 10−16,

(α2, β2) = (−2.0724163126336,−1.0362081563168)× 10−16,

which define the eigenvalues

λ1 = −2.00000000645717 and λ2 = 2.00000000000000.

Example II. Consider reordering the eigenvalues λ1,2 = 2 ± 2i and λ3,4 = 1 ± i in a matrix pair sequence with dimension n = 4 and period K = 10. The computed eigenvalues from the GPRSF are correct to full machine precision.


After reordering we get the following (α, β)-pairs:

(α1, β1) = (−6.69743899940721− 6.69743899940718i,−6.69743899940718),

(α2, β2) = (1.03550511685258− 1.03550511685258i, 1.03550511685258),

(α3, β3) = (1.93142454580911+ 1.93142454580911i, 0.96571227290455),

(α4, β4) = (0.29862160747967− 0.29862160747967i, 0.14931080373983).

A quick check reveals that these pairs correspond to a reordering at almost full machine precision.

Example III. The eigenvalue pair cos(π/4) ± sin(π/4)i is located on the unit circle. In LQ-optimal control (see Section 2) we want to compute a periodic deflating subspace corresponding to the stable eigenvalues, i.e., the eigenvalues inside the unit disc.

For illustration, consider reordering the eigenvalues λ1,2 = (cos(π/4) + δ) ± (sin(π/4) + δ)i and λ3,4 = (cos(π/4) − δ) ± (sin(π/4) − δ)i, where δ ∈ [0, 1], in a matrix pair sequence of period K = 100 arising, for example, from performing multirate sampling of a continuous-time system. At first, let δ = 10−1. The matrix product has the computed (α, β)-pairs

(α1, β1) = (0.80710678118654+ 0.80710678118654i, 1.00000000000002),

(α2, β2) = (0.80710678118654− 0.80710678118654i, 1.00000000000002),

(α3, β3) = (−0.60710678118655− 0.60710678118655i,−0.99999999999999),

(α4, β4) = (−0.60710678118655+ 0.60710678118655i,−1.00000000000000),

which correspond to the eigenvalues

λ1,2 = 0.80710678118652± 0.80710678118652i,

λ3,4 = 0.60710678118655± 0.60710678118655i.

After reordering we have

(α1, β1) = (−1.53524924293502− 1.53524924293503i,−2.52879607098851),

(α2, β2) = (−6.49961741950939+ 6.49961741950943i,−10.70588835592705),

(α3, β3) = (−0.07538905267396− 0.07538905267396i,−0.09340654103182),

(α4, β4) = (0.31916641695471− 0.31916641695471i, 0.39544509400044),

which define the eigenvalues λ1,2 = 0.60710678118654± 0.60710678118655i and

λ3,4 = 0.80710678118654± 0.80710678118654i.

Example IV. We consider Example III again, now with δ = 10−12 and K = 100 as before. The matrix product has the computed (α, β)-pairs

(α1, β1) = (−0.70710678118754− 0.70710678118754i,−1.00000000000002),

(α2, β2) = (−0.70710678118755+ 0.70710678118755i,−0.99999999999999),


(α3, β3) = (0.70710678118555+ 0.70710678118555i, 1.00000000000000),

(α4, β4) = (−0.70710678118555+ 0.70710678118555i,−1.00000000000000),

which define the eigenvalues λ1,2 = 0.70710678118755 ± 0.70710678118754i and λ3,4 = 0.70710678118555 ± 0.70710678118555i. After reordering we have

(α1, β1) = (−0.70710678121274− 0.70710678121274i,−1.00000000003845),

(α2, β2) = (0.70710678121274− 0.70710678121274i, 1.00000000003845),

(α3, β3) = (−0.70710678116035− 0.70710678116036i,−0.99999999996155),

(α4, β4) = (−0.70710678116036+ 0.70710678116036i,−0.99999999996155),

which correspond to the eigenvalues

λ1,2 = 0.70710678118555± 0.70710678118555i,

λ3,4 = 0.70710678118754± 0.70710678118755i.

The eigenvalues outside and inside the unit disc come closer and closer with decreasing δ, and the problem becomes more ill-conditioned, but we are still able to reorder the eigenvalues with satisfactory accuracy. We illustrate the situation in Figure 6.1.

Figure 6.1: Results from reordering the eigenvalues of Examples III and IV with δ ∈ [0, 1]. The displayed quantities are the same as in Tables 6.1–6.2. The horizontal axis shows the logarithm of the parameter δ and the vertical axis displays the logarithm of the computed quantities.

Example V. Consider reordering the single eigenvalue λ1 = √3 with the eigenvalue pair λ2,3 = (√3)/2 ± (1/√7)i and period K = 5. The original (α, β)-pairs


are

(α1, β1) = (1.73205080756888, 1.00000000000000),

(α2, β2) = (−0.86602540378444− 0.37796447300923i,−1.00000000000000),

(α3, β3) = (0.86602540378444− 0.37796447300923i, 1.00000000000000).

After reordering we have

(α1, β1) = (2.97791477286351+ 1.29966855807374i, 3.43859979147302),

(α2, β2) = (−1.43573050214952+ 0.62660416225306i,−1.65783878379957),

(α3, β3) = (0.30383422966230, 0.17541877428455),

which define eigenvalues λ1,2 = 0.86602540378444 ± 0.37796447300923i and

λ3 = 1.73205080756888.

Example VI. Consider reordering the eigenvalues λ1 = 1 and λ2 = ∞ and period K = 6. The original (α, β)-pairs are

(α1, β1) = (−0.9999999999999986, 1.000000000000000),

(α2, β2) = (1.000000000000000, 0.000000000000000).

After reordering we have

(α1, β1) = (−1.564941642946474E−5, 0.000000000000000),

(α2, β2) = (6.390014634138052E+4, 6.390014634138062E+4),

which correspond to the eigenvalues λ1 = −∞ and λ2 = 0.9999999999999985.

7 Remarks.

In this section, we give some closing remarks on the developed reordering method by presenting a comparison with existing methods and describing an extension to more general matrix products.

7.1 Comparison with existing methods.

Hench and Laub [18, Sec. II.F] proposed to swap the diagonal blocks in (3.1) by first explicitly computing the (n1 + n2) × (n1 + n2) matrix product
$$E_{K-1}^{-1} A_{K-1} \cdots E_0^{-1} A_0.$$
Then, in exact arithmetic, the standard swapping technique [2] applied to this product yields the outer orthogonal transformation matrix Z0. The inner orthogonal matrices Q0, . . . , QK, Z1, . . . , ZK are obtained by propagating Z0 through the triangular factors, using QR and RQ factorizations. In finite-precision arithmetic, however, such an approach can be expected to perform poorly if any of


the matrices Ek is nearly singular; see [21] for the case K = 1. Also for very well-conditioned Ek (e.g., identity matrices), there are serious numerical difficulties to be expected for long products, as the computed entries become prone to under- and overflow. Further numerical instabilities arise from the fact that triangular matrix-matrix multiplication is in general not a numerically backward stable operation, unless n1 = n2 = 1 [12].

Benner et al. [3] developed collapsing techniques that can be used to improve the above approach by avoiding all explicit inversions of Ek. Instead of a single product, two n × n matrices E and A are computed such that E−1A has the same eigenvalues. The generalized swapping technique [21, 23] applied to the pair (E, A) yields Z0. Again, the other orthogonal matrices are successively computed from QR and RQ factorizations. Although this approach avoids difficulties associated with (nearly) singular matrices Ek, it may still become numerically unstable; see [15] for an example.

Bojanczyk and Van Dooren [9] carefully modified the approach by Hench and Laub for the case n1 = n2 = 1 to avoid underflow, overflow, and numerical instabilities. This variant has been observed to perform remarkably well in finite-precision arithmetic. Unfortunately, its extension to n1 = 2 and/or n2 = 2 is not clear. Thus, only real matrix products having real eigenvalues can be addressed. For complex eigenvalues one could in principle work with the complex periodic Schur decomposition, which has no 2 × 2 blocks. Both the swapping technique described in [9] and the one proposed in this paper extend to the complex case in a straightforward manner. The obvious drawback of using complex arithmetic for real input data is the increased computational complexity. Moreover, real eigenvalues and complex conjugate eigenvalue pairs will not be preserved in finite-precision arithmetic. For example, if we apply [9] to Example I we obtain the following swapped eigenvalues:

λ1 = −1.87282049572853+ 0.58861866785157i,

λ2 = 1.74709648107590+ 0.47770138864644i.

The realness of the original eigenvalues is completely lost. Somewhat unexpectedly, our algorithm also achieves significantly higher accuracy for this particular example.

7.2 Reordering in even more general matrix products.

Reordering can also be considered in matrix products of the form

$$A_{K-1}^{s_{K-1}} A_{K-2}^{s_{K-2}} \cdots A_0^{s_0}, \qquad s_0, \ldots, s_{K-1} \in \{1, -1\}, \tag{7.1}$$

which is needed, e.g., in [4]. This could be accomplished by the method described in this paper after inserting identity matrices into the matrix sequence so that the exponents have the same structure as in Equation (1.2), i.e., every second matrix is an inverse. It turns out that this trick is actually not needed: all techniques developed in this paper can be extended to work directly with (7.1).


For example, the associated periodic Sylvester-like matrix equation takes the form
$$A_{11}^{(k)} X_k - X_{k+1} A_{22}^{(k)} = -A_{12}^{(k)} \quad \text{for } s_k = 1,$$
$$A_{11}^{(k)} X_{k+1} - X_k A_{22}^{(k)} = -A_{12}^{(k)} \quad \text{for } s_k = -1,$$
which can be addressed by the methods in Section 5. See [16] for more details.

Acknowledgements.

The authors are grateful to Peter Benner, Isak Jonsson, and Andras Varga for valuable discussions related to this work. The authors thank the referees for their valuable comments.

REFERENCES

1. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen, LAPACK Users' Guide, 3rd edn., SIAM, Philadelphia, PA, 1999.

2. Z. Bai and J. W. Demmel, On swapping diagonal blocks in real Schur form, Linear Algebra Appl., 186 (1993), pp. 73–95.

3. P. Benner and R. Byers, Evaluating products of matrix pencils and collapsing matrix products, Numer. Linear Algebra Appl., 8 (2001), pp. 357–380.

4. P. Benner, R. Byers, V. Mehrmann, and H. Xu, Numerical computation of deflating subspaces of skew-Hamiltonian/Hamiltonian pencils, SIAM J. Matrix Anal. Appl., 24(1) (2002), pp. 165–190.

5. P. Benner, V. Mehrmann, and H. Xu, Perturbation analysis for the eigenvalue problem of a formal product of matrices, BIT, 42(1) (2002), pp. 1–43.

6. S. Bittanti and P. Colaneri, eds., Periodic Control Systems 2001, IFAC Proceedings Volumes, Elsevier Science & Technology, Amsterdam, NL, 2002.

7. S. Bittanti, P. Colaneri, and G. De Nicolao, The periodic Riccati equation, in The Riccati Equation, S. Bittanti, A. J. Laub, and J. C. Willems, eds., Springer, Berlin, Heidelberg, Germany, 1991, pp. 127–162.

8. A. Bojanczyk, G. H. Golub, and P. Van Dooren, The periodic Schur decomposition; algorithm and applications, in Proc. SPIE Conference, vol. 1770 (1992), pp. 31–42.

9. A. Bojanczyk and P. Van Dooren, On propagating orthogonal transformations in a product of 2x2 triangular matrices, in Numerical Linear Algebra, L. Reichel, A. Ruttan, and R. S. Varga, eds., Walter de Gruyter, 1993, pp. 1–9.

10. G. Fairweather and I. Gladwell, Algorithms for Almost Block Diagonal Linear Systems, SIAM Rev., 44(1) (2004), pp. 49–58.

11. B. Garrett and I. Gladwell, Solving bordered almost block diagonal systems stably and efficiently, J. Comput. Methods Sci. Eng., 1 (2001), pp. 75–98.

12. G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd edn., Johns Hopkins University Press, Baltimore, MD, 1996.

13. R. Granat, I. Jonsson, and B. Kagstrom, Recursive blocked algorithms for solving periodic triangular Sylvester-type matrix equations, in PARA'06 – State of the Art in Scientific and Parallel Computing, 2006, B. Kagstrom, ed., Lect. Notes Comput. Sci., vol. 4699, Springer, 2007, pp. 531–539.


14. R. Granat and B. Kagstrom, Direct eigenvalue reordering in a product of matrices in periodic Schur form, SIAM J. Matrix Anal. Appl., 28(1) (2006), pp. 285–300.

15. R. Granat, B. Kagstrom, and D. Kressner, Reordering the eigenvalues of a periodic matrix pair with applications in control, in Proc. of 2006 IEEE Conference on Computer Aided Control Systems Design (CACSD) (2006), pp. 25–30. (ISBN: 0-7803-9797-5).

16. R. Granat, B. Kagstrom, and D. Kressner, MATLAB tools for solving periodic eigenvalue problems, Accepted for 3rd IFAC Workshop PSYCO'07, Saint Petersburg, Russia, 2007.

17. W. W. Hager, Condition estimates, SIAM J. Sci. Stat. Comput., 5 (1984), pp. 311–316.

18. J. J. Hench and A. J. Laub, Numerical solution of the discrete-time periodic Riccati equation, IEEE Trans. Automat. Control, 39(6) (1994), pp. 1197–1210.

19. N. J. Higham, Fortran codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation, ACM Trans. Math. Software, 14(4) (1988), pp. 381–396.

20. N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd edn., SIAM, Philadelphia, PA, 2002.

21. B. Kagstrom, A direct method for reordering eigenvalues in the generalized real Schur form of a regular matrix pair (A,B), in Linear Algebra for Large Scale and Real-Time Applications, M. S. Moonen, G. H. Golub, and B. L. R. De Moor, eds., Kluwer Academic Publishers, Amsterdam, 1993, pp. 195–218.

22. B. Kagstrom and P. Poromaa, Distributed and shared memory block algorithms for the triangular Sylvester equation with sep−1 estimators, SIAM J. Matrix Anal. Appl., 13(1) (1992), pp. 90–101.

23. B. Kagstrom and P. Poromaa, Computing eigenspaces with specified eigenvalues of a regular matrix pair (A,B) and condition estimation: theory, algorithms and software, Numer. Algorithms, 12(3–4) (1996), pp. 369–407.

24. D. Kressner, An efficient and reliable implementation of the periodic QZ algorithm, in IFAC Workshop on Periodic Control Systems, S. Bittanti and P. Colaneri, eds., Cernobbio-Como, Italy, 2001, pp. 187–192.

25. D. Kressner, Numerical Methods and Software for General and Structured Eigenvalue Problems, PhD thesis, TU Berlin, Institut fur Mathematik, Berlin, Germany, 2004.

26. W.-W. Lin and J.-G. Sun, Perturbation analysis for the eigenproblem of periodic matrix pairs, Linear Algebra Appl., 337 (2001), pp. 157–187.

27. V. Mehrmann, The Autonomous Linear Quadratic Control Problem, Theory and Numerical Solution, Lect. Notes Control Inf. Sci., No. 163, Springer, Heidelberg, 1991.

28. V. V. Sergeichuk, Computation of canonical matrices for chains and cycles of linear mappings, Linear Algebra Appl., 376 (2004), pp. 235–263.

29. J. Sreedhar and P. Van Dooren, Pole placement via the periodic Schur decomposition, in Proc. 1993 Amer. Contr. Conf., San Francisco, CA, 1993, pp. 1563–1567.

30. J. Sreedhar and P. Van Dooren, A Schur approach for solving some periodic matrix equations, in Systems and Networks: Mathematical Theory and Applications, U. Helmke, R. Mennicken, and J. Saurer, eds., vol. 77, Akademie Verlag, Berlin, 1994, pp. 339–362.

31. J. Sreedhar and P. Van Dooren, Forward/backward decomposition of periodic descriptor systems and two-point boundary value problems, in European Control Conf. ECC 97, Brussels, Belgium, July 1–4, 1997.

32. J. Sreedhar and P. Van Dooren, Periodic descriptor systems: solvability and conditionability, IEEE Trans. Automat. Control, 44(2) (1999), pp. 311–313.

33. G. W. Stewart and J.-G. Sun, Matrix Perturbation Theory, Academic Press, New York, 1990.

34. J.-G. Sun, Perturbation bounds for subspaces associated with periodic eigenproblems, Taiwanese J. Math., 9(1) (2005), pp. 17–38.

35. C. F. Van Loan, Generalized Singular Values with Algorithms and Applications, PhD thesis, The University of Michigan, 1973.


36. A. Varga, Periodic Lyapunov equations: some applications and new algorithms, Int. J. Control, 67(1) (1997), pp. 69–87.

37. A. Varga, Computation of Kronecker-like forms of periodic matrix pairs, in Proc. of Sixteenth International Symposium on Mathematical Theory of Networks and Systems (MTNS 2004), Leuven, Belgium, 2004.

38. A. Varga, On solving discrete-time periodic Riccati equations, in Proc. of IFAC 2005 World Congress, Prague, Czech Republic, 2005.

39. A. Varga and P. Van Dooren, Computational methods for periodic systems – an overview, in Proc. of IFAC Workshop on Periodic Control Systems, Como, Italy, 2001, pp. 171–176.

40. D. S. Watkins, Product eigenvalue problems, SIAM Rev., 47 (2005), pp. 3–40.

41. J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.

42. S. J. Wright, Stable parallel algorithms for two-point boundary value problems, SIAM J. Sci. Stat. Comput., 13(3) (1992), pp. 742–764.

43. S. J. Wright, A collection of problems for which Gaussian elimination with partial pivoting is unstable, SIAM J. Sci. Comput., 14(1) (1993), pp. 231–238.



III


Paper III

Parallel Solvers for Sylvester-Type Matrix Equations with Applications in Condition Estimation,

Part I: Theory and Algorithms∗

Robert Granat1 and Bo Kagstrom1

1 Department of Computing Science and HPC2N, Umea University, SE-901 87 Umea, Sweden. {granat, bokg}@cs.umu.se

Abstract: Parallel ScaLAPACK-style algorithms for solving eight common standard and generalized Sylvester-type matrix equations and various sign and transposed variants are presented. All algorithms are blocked variants based on the Bartels–Stewart method and involve four major steps: reduction to triangular form, updating the right hand side with respect to the reduction, computing the solution to the reduced triangular problem, and transforming the solution back to the original coordinate system. Novel parallel algorithms for solving reduced triangular matrix equations based on wavefront-like traversal of the right hand side matrices are presented together with a generic scalability analysis. These algorithms are used in condition estimation, and new robust parallel sep−1-estimators are developed. Experimental results from three parallel platforms are presented and analyzed using several performance and accuracy metrics. The analysis includes results regarding general and triangular parallel solvers as well as parallel condition estimators.

Key words: Parallel Computing, Parallel Algorithms, Eigenvalue problems, Condition estimation, Sylvester matrix equations.

∗ Submitted to ACM Transactions on Mathematical Software, July 2007.


Page 112: Algorithms and Library Software for Periodic and Parallel ... · 1.1 Motivation for this work 1 1.2 Parallel computations, computers and programming models 2 1.3 Matrix computations


Parallel Solvers for Sylvester-Type Matrix Equations with Applications in Condition Estimation, Part I: Theory and Algorithms

R. GRANAT and B. KAGSTROM

Umea University, Sweden


Categories and Subject Descriptors: F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems—Computation on matrices; G.1.3 [Numerical Analysis]: Numerical Linear Algebra—Conditioning, Linear Systems; G.4 [Mathematical Software]: Algorithm Design and Analysis, Reliability and robustness

General Terms: Parallel Computing, Parallel Algorithms

Additional Key Words and Phrases: Eigenvalue problems, Condition estimation, Sylvester matrix equations

1. INTRODUCTION

We develop and analyze parallel ScaLAPACK-style algorithms for eight common standard and generalized Sylvester-type matrix equations listed in Table I. Perhaps the most common are the continuous-time Sylvester equation AX − XB = C with A ∈ R^(m×m), B ∈ R^(n×n), C ∈ R^(m×n), and the continuous-time Lyapunov equation AX + XA^T = C, with A, C ∈ R^(m×m). Such matrix equations occur frequently in various eigenvalue problems, condition estimation of computed eigenvalues and eigenspaces, and in control applications like model reduction, signal processing and subspace computations (see, e.g., [Higham 2002], [Konstantinov et al. 2003], [Datta 2004] and the references therein). Following [Jonsson and Kagstrom 2002a; 2002b], we distinguish between one-sided and two-sided matrix equations. In the one-sided matrix equations (SYCT, LYCT and GCSY), the solution is multiplied by a coefficient matrix from one side only, and in two-sided matrix equations (SYDT,

Technical Report UMINF-07.15. Author’s addresses: Department of Computing Science and

HPC2N, Umea University, SE-901 87, UMEA. E-mail: granat,[email protected]. The research

was conducted using the resources of the High Performance Computing Center North (HPC2N).

Financial support was provided by the Swedish Research Council under grant VR 621-2001-3284

and by the Swedish Foundation for Strategic Research under grant A3 02:128.


104 · R. Granat and B. Kagstrom

Table I. Considered standard and generalized matrix equations. CT and DT denote the continuous-time and discrete-time variants, respectively.

Name                           Matrix Equation                              Acronym
Standard CT Sylvester          AX − XB = C ∈ R^{m×n}                        SYCT
Standard CT Lyapunov           AX + XA^T = C ∈ R^{m×m}                      LYCT
Standard DT Sylvester          AXB − X = C ∈ R^{m×n}                        SYDT
Standard DT Lyapunov           AXA^T − X = C ∈ R^{m×m}                      LYDT
Generalized Coupled Sylvester  (AX − Y B, DX − Y E) = (C, F) ∈ R^{(m×n)×2}  GCSY
Generalized Sylvester          AXB^T − CXD^T = E ∈ R^{m×n}                  GSYL
Generalized CT Lyapunov        AXE^T + EXA^T = C ∈ R^{m×m}                  GLYCT
Generalized DT Lyapunov        AXA^T − EXE^T = C ∈ R^{m×m}                  GLYDT

LYDT, GSYL, GLYCT and GLYDT) it is multiplied from both sides. Two-sided matrix equations have a more complex data dependency in the solution process than the one-sided ones. We denote the coefficient matrices that multiply the solution from the left and right by left multiplying and right multiplying matrices, respectively.

Solvability conditions for the matrix equations in Table I are formulated in terms of non-intersecting spectra of standard or generalized eigenvalues of the involved coefficient matrices and matrix pairs, respectively, or equivalently by nonzero associated sep-functions (see Section 5 and, e.g., [Higham 1993; Hammarling 1982; Jonsson and Kagstrom 2002b] and the references therein).

Moreover, for (G)LYCT and (G)LYDT a symmetric right hand side C implies a symmetric solution X.

1.1 Notation

In this paper, I_m denotes an identity matrix of order m and [0]_{m×n} denotes a zero matrix with m rows and n columns. A ⊗ B denotes the Kronecker product of two matrices, defined as a matrix with its (i, j)th block element equal to a_{ij}B. The notation vec(X) denotes a vector with elements consisting of an ordered stack of the columns of the matrix X, going from left to right. By op(A) we denote the matrix A or its transpose A^T. Block (i, j) of a block partitioned matrix is denoted A_{ij}. Some other notation is introduced in its context.
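The vec and Kronecker conventions above can be checked numerically. The following small sketch (using numpy; not part of the paper's software) verifies the identity vec(AXB) = (B^T ⊗ A) vec(X), which underlies the Kronecker product representations used later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 2))

def vec(M):
    # ordered stack of the columns of M, from left to right
    return M.flatten(order='F')

# vec(A X B) = (B^T (x) A) vec(X): the workhorse identity behind the
# Kronecker product representations of the Sylvester-type operators
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```

Every row of Table II follows by applying this identity term by term to the corresponding equation in Table I.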

1.2 Basic solution methods for matrix equations

The standard Schur method for solving SYCT proposed in [Bartels and Stewart 1972] can be applied to any standard equation in Table I:

S1: Transform A and, in case of SYCT/SYDT, B to real Schur form.
S2: Update C with respect to the Schur decompositions.
S3: Solve the resulting reduced triangular matrix equation.
S4: Transform the obtained solution back to the original coordinate system.

Step S1 results in the factorizations T_A = Q^T A Q and T_B = P^T B P, where Q ∈ R^{m×m} and P ∈ R^{n×n} are orthogonal and T_A and T_B are upper quasi-triangular, i.e., having 1×1 and 2×2 diagonal blocks corresponding to real and complex conjugate pairs of eigenvalues, respectively. Reliable and efficient algorithms for this reduction


Parallel Algorithms for Sylvester-Type Matrix Equations · 105

step can be found in LAPACK [Anderson et al. 1999], and in ScaLAPACK [Henry et al. 2002; Blackford et al. 1997] for distributed memory (DM) environments.

Steps S2 and S4 are typically conducted by two consecutive GEMM-operations [Dongarra et al. 1990; Kagstrom et al. 1998a; 1998b], each having complexity O(2m^3) for square matrices, as C = Q^T C P in case of SYCT/SYDT, and as C = Q^T C Q in case of LYCT/LYDT. The following reduced matrix equations corresponding to SYCT, SYDT, LYCT and LYDT

    T_A X − X T_B = C,        (1)
    T_A X T_B^T − X = C,      (2)
    T_A X + X T_A^T = C,      (3)
    T_A X T_A^T − X = C,      (4)

are solved in step S3. The solution X is obtained in step S4 by computing X = Q X P^T in case of SYCT/SYDT, and X = Q X Q^T in case of LYCT/LYDT.

We formulate a Bartels–Stewart-style method for the generalized matrix equations GCSY, GSYL, GLYCT, and GLYDT:

G1: Reduce the involved left hand side matrix pairs to generalized Schur form.
G2: Update the right hand side(s) with respect to the generalized Schur decompositions.
G3: Solve the resulting triangular generalized matrix equation.
G4: Transform the computed solution back to the original coordinate system.
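As an illustration of steps S1–S4 for SYCT, the following sketch strings together off-the-shelf building blocks (scipy's `schur` and LAPACK's DTRSYL kernel). It is a toy model of the serial method, not the parallel software described in this paper:

```python
import numpy as np
from scipy.linalg import schur
from scipy.linalg.lapack import dtrsyl

def bartels_stewart_syct(A, B, C):
    """Solve AX - XB = scale*C by the four steps S1-S4."""
    TA, Q = schur(A, output='real')   # S1: T_A = Q^T A Q
    TB, P = schur(B, output='real')   #     T_B = P^T B P
    Ct = Q.T @ C @ P                  # S2: update the right hand side
    # S3: solve the reduced triangular equation T_A X - X T_B = scale*C
    Xt, scale, info = dtrsyl(TA, TB, Ct, isgn=-1)
    assert info == 0
    return Q @ Xt @ P.T, scale        # S4: transform back

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((5, 3))
X, scale = bartels_stewart_syct(A, B, C)
assert np.allclose(A @ X - X @ B, scale * C, atol=1e-8)
```

DTRSYL returns a scaling factor scale ≤ 1, chosen to avoid overflow in the triangular solve; for benign data it equals 1.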

For illustration we consider GCSY. In step G1, we transform (A, D) and (B, E) via Hessenberg-triangular reductions and the QZ algorithm into the factorizations T_A = Q_1^T A Z_1, T_D = Q_1^T D Z_1, T_B = Q_2^T B Z_2 and T_E = Q_2^T E Z_2, where T_A, T_B are upper quasi-triangular, T_D, T_E are upper triangular, and Q_1, Z_1 ∈ R^{m×m}, Q_2, Z_2 ∈ R^{n×n} are orthogonal. The QZ algorithm was first presented in [Moler and Stewart 1973] and reliable software for this reduction step can be found in LAPACK [Anderson et al. 1999]. Important contributions regarding block forms and parallelization of the generalized Schur reduction were reported in [Dackland and Kagstrom 1999; Adlerborn et al. 2001; Adlerborn et al. 2002; Adlerborn et al. 2006]. See also [Kagstrom and Kressner 2006] for multishift variants with aggressive early deflation [Braman et al. 2002].

In step G2, the right hand side is updated as (C, F) = (Q_1^T C Z_2, Q_1^T F Z_2), which forms the reduced triangular GCSY

    (T_A X − Y T_B, T_D X − Y T_E) = (C, F),        (5)

which is solved in step G3. The solution is obtained in step G4 by the transformation (X, Y) = (Z_1 X Z_2^T, Q_1 Y Q_2^T).

Notice that the symmetry of the right hand side C in the Lyapunov equations LYCT, LYDT, GLYCT, and GLYDT (see Table I) is preserved.

All linear matrix equations considered can be rewritten as an equivalent large linear system of equations Zx = y, where Z is the Kronecker product representation of the corresponding Sylvester-type operator. Using x ∈ {vec(X), vec(X, Y)} and y ∈ {vec(C), vec(C, F), vec(E)}, we introduce the Z-matrices in Table II. These


Table II. The Kronecker product representations Z_ACRO of the corresponding Sylvester operators in Table I. The last column displays the complexity of solving Z_ACRO x = y and Z_ACRO^T x = y, which both appear in condition estimation of the standard and generalized matrix equations.

Acronym  Z_ACRO                          Z_ACRO x = y, Z_ACRO^T x = y
SYCT     I_n ⊗ A − B^T ⊗ I_m             O(m^2 n + mn^2)
LYCT     I_m ⊗ A + A ⊗ I_m               O(m^3)
SYDT     B ⊗ A − I_{m·n}                 O(m^2 n + mn^2)
LYDT     A ⊗ A − I_{m^2}                 O(m^3)
GCSY     [ I_n ⊗ A   −B^T ⊗ I_m ]        O(2m^2 n + 2mn^2)
         [ I_n ⊗ D   −E^T ⊗ I_m ]
GSYL     B ⊗ A − D ⊗ C                   O(4m^2 n + 2mn^2), m ≤ n
                                         O(2m^2 n + 4mn^2), m > n
GLYCT    E ⊗ A + A ⊗ E                   O(16/3 m^3)
GLYDT    A ⊗ A − E ⊗ E                   O(16/3 m^3)

formulations are only efficient to use explicitly when solving small-sized problems for kernel solvers, see, e.g., LAPACK's DLASY2 and DTGSY2 for solving SYCT and GCSY using Gaussian elimination with complete pivoting (GECP), and the superscalar kernels of the RECSY library [Jonsson and Kagstrom 2002a; 2002b; Jonsson and Kagstrom 2003], but are also important in the derivation of condition estimation algorithms, see Section 5.
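To make the block structure of the Z-matrices concrete, the sketch below (numpy-based and purely illustrative; forming Z explicitly is practical only for small problems, exactly as stated above) assembles Z_GCSY from the GCSY row of Table II and solves a tiny coupled system:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 2
A, D = rng.standard_normal((2, m, m))
B, E = rng.standard_normal((2, n, n))
C, F = rng.standard_normal((2, m, n))
Im, In = np.eye(m), np.eye(n)

# Z_GCSY acting on [vec(X); vec(Y)], cf. the GCSY row of Table II
Z = np.block([[np.kron(In, A), -np.kron(B.T, Im)],
              [np.kron(In, D), -np.kron(E.T, Im)]])
rhs = np.concatenate([C.flatten(order='F'), F.flatten(order='F')])
xy = np.linalg.solve(Z, rhs)
X = xy[:m * n].reshape((m, n), order='F')
Y = xy[m * n:].reshape((m, n), order='F')

# the recovered pair satisfies (AX - YB, DX - YE) = (C, F)
assert np.allclose(A @ X - Y @ B, C) and np.allclose(D @ X - Y @ E, F)
```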

Matrix equations have been in the focus of the numerical community for quite some time. The classic paper of R. H. Bartels and G. W. Stewart [Bartels and Stewart 1972] has served as a foundation for developments of direct solution methods for related problems (see, e.g., [Hammarling 1982; Golub et al. 1979]). O'Leary and Stewart presented parallel dataflow algorithms for triangular Sylvester equations [O'Leary and Stewart 1985]. Level-3 block algorithms for Sylvester-type matrix equations were developed in [Kagstrom and Poromaa 1992; 1996]. In [Benner and Quintana-Orti 1999; Benner et al. 2002; 2004], fully iterative methods for solving matrix equations are considered. Parallel wavefront algorithms for solving Lyapunov equations using Hammarling's method were presented in [Claver 1999]. Jonsson and Kagstrom presented fast recursive blocked algorithms for solving matrix equations in [Jonsson and Kagstrom 2002a; 2002b] and developed the software library RECSY [Jonsson and Kagstrom 2003; RECSY]. Automatic generation of algorithms for the continuous-time Sylvester equation was developed in [Quintana-Orti and van de Geijn 2003]. Kressner developed stable block algorithms based on Hammarling's method [Hammarling 1982] for Lyapunov-type equations in [Kressner 2006]. In this paper, we complete our work in [Granat et al. 2003; Granat et al. 2004; Granat and Kagstrom 2006a; 2006b] by developing high performance parallel algorithms for Sylvester-type matrix equations. The developed algorithms are implemented in the software package SCASY, which is presented in Part II [Granat and Kagstrom 2007].

The rest of this paper is outlined as follows. Section 2 reviews serial blocked Bartels–Stewart-style algorithms for solving the matrix equations in Table I. Parallel variants of these blocked algorithms for distributed memory environments are discussed in Section 3. In Section 4, we present a generic scalability analysis of the parallel algorithms, and Section 5 discusses condition estimation and parallel algorithms for estimation of associated sep-functions. Section 6 points to some implementation issues, while details are presented in our Part II paper [Granat and Kagstrom 2007]. Finally, results from computational experiments on three parallel platforms are presented and analyzed using several performance and accuracy metrics. The analysis includes results regarding general and triangular parallel solvers as well as parallel condition estimators.

2. SERIAL BLOCK ALGORITHMS

In this section, we review explicitly blocked algorithms for solving the triangular reduced matrix equations discussed in the previous section.

Let mb be the block size used in a partitioning of any left multiplying m × m matrix that appears in a matrix equation of Table I (e.g., A in SYCT). Similarly, let nb be the block size used in a partitioning of any right multiplying n × n matrix (B in SYCT). Then mb is the row block size and nb is the column block size of the solution (X in SYCT), which overwrites the right hand side matrix (C in SYCT). We let D_l = ⌈m/mb⌉ and D_r = ⌈n/nb⌉ denote the number of diagonal blocks of the left and right multiplying matrices, respectively. If m = n and mb = nb, then D_l = D_r.

2.1 Standard Sylvester-type matrix equations

After a block partitioning of A, B, and C as illustrated above, the triangular SYCT can be rewritten as

    A_ii X_ij − X_ij B_jj = C_ij − ( Σ_{k=i+1}^{D_l} A_ik X_kj − Σ_{k=1}^{j−1} X_ik B_kj ),        (6)

for i = 1, 2, ..., D_l and j = 1, 2, ..., D_r, which can be implemented as a blocked level 3 algorithm using a couple of nested loops (see, e.g., [Kagstrom and Poromaa 1992; Poromaa 1998; Granat et al. 2003]).
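A serial sketch of such a blocked level 3 solver built directly on (6) is given below. It assumes, for simplicity, that no 2×2 diagonal block of the quasi-triangular A or B straddles a partition boundary, and it uses LAPACK's DTRSYL (via scipy) as the subsystem kernel; the function name is illustrative:

```python
import numpy as np
from scipy.linalg.lapack import dtrsyl

def blocked_triangular_syct(A, B, C, mb, nb):
    """Solve A X - X B = C for upper (quasi-)triangular A, B per (6)."""
    m, n = C.shape
    rows = [slice(s, min(s + mb, m)) for s in range(0, m, mb)]
    cols = [slice(s, min(s + nb, n)) for s in range(0, n, nb)]
    X = C.copy()
    for i in reversed(range(len(rows))):      # bottom block row first
        for j in range(len(cols)):            # leftmost block column first
            # X[i,j] currently holds C_ij minus all earlier level 3 updates
            Xij, scale, info = dtrsyl(A[rows[i], rows[i]], B[cols[j], cols[j]],
                                      X[rows[i], cols[j]], isgn=-1)
            assert info == 0 and scale == 1.0
            X[rows[i], cols[j]] = Xij
            for k in range(i):                # update blocks above in column j
                X[rows[k], cols[j]] -= A[rows[k], rows[i]] @ Xij
            for l in range(j + 1, len(cols)): # update blocks right in row i
                X[rows[i], cols[l]] += Xij @ B[cols[j], cols[l]]
    return X

# triangular test matrices with well separated spectra (diagonal entries)
rng = np.random.default_rng(1)
m, n = 11, 7
A = np.triu(rng.standard_normal((m, m)))
np.fill_diagonal(A, 2.0 + rng.random(m))
B = np.triu(rng.standard_normal((n, n)))
np.fill_diagonal(B, -2.0 - rng.random(n))
C = rng.standard_normal((m, n))
X = blocked_triangular_syct(A, B, C, mb=4, nb=3)
assert np.allclose(A @ X - X @ B, C, atol=1e-8)
```

The two update loops realize the two sums in (6); in a production solver they would be GEMM calls on the full remaining block row and column.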

For LYCT we partition A and C by rows and columns using a single block size mb as

    A_ii X_ij + X_ij A_jj^T = C_ij − ( Σ_{k=i+1}^{D_l} A_ik X_kj + Σ_{k=j+1}^{D_l} X_ik A_jk^T ),        (7)

where i, j = 1, 2, ..., D_l. Thereby, we reformulate LYCT into several smaller SYCT (i ≠ j) and LYCT (i = j) problems and level 3 updates of the right hand side. Since C = C^T implies X_ij = X_ji^T, we can rewrite (7) for the main diagonal blocks C_ii as

    A_ii X_ii + X_ii A_ii^T = C_ii − Σ_{k=i+1}^{D_l} ( A_ik X_ik^T + X_ik A_ik^T ).        (8)

The updates of C_ii in (8) define a sequence of symmetric rank-2k (SYR2K) operations, each having the performance of a regular GEMM operation when implemented as a GEMM-based level-3 BLAS [Kagstrom et al. 1998a; 1998b]. Notice, however, that the number of regular GEMM operations in the block algorithm widely exceeds the number of SYR2K operations. Moreover, for a symmetric right hand side C = C^T, a blocked level 3 solver should only solve subsystems and perform updates explicitly in the lower (or upper) triangular part of C.

We block the two-sided standard equations SYDT and LYDT similarly as

    A_ii X_ij B_jj^T − X_ij = C_ij − Σ_{k=i}^{D_l} Σ_{l=j}^{D_r} A_ik X_kl B_jl^T,  (i, j) ≠ (k, l),        (9)

and

    A_ii X_ij A_jj^T − X_ij = C_ij − Σ_{k=i}^{D_l} Σ_{l=j}^{D_l} A_ik X_kl A_jl^T,  (i, j) ≠ (k, l).        (10)

Observe that the blocking of LYDT (10) decomposes the problem into several smaller SYDT (i ≠ j) and LYDT (i = j) equations. Comparing equations (9)-(10) with (6)-(7) reveals a more complex data dependency for the two-sided matrix equations, and a straightforward implementation would cause redundant work implying a total complexity of O(m^2 n^2). To find a remedy, we consider SYDT after a 2 × 2 explicit blocking:

    A_11 X_11 B_11^T − X_11 = C_11 − A_11 X_12 B_12^T − A_12 (X_21 B_11^T + X_22 B_12^T)
    A_11 X_12 B_22^T − X_12 = C_12 − A_12 X_22 B_22^T
    A_22 X_21 B_11^T − X_21 = C_21 − A_22 X_22 B_12^T
    A_22 X_22 B_22^T − X_22 = C_22.        (11)

By computing X_21 B_11^T + X_22 B_12^T before multiplying with A_12, and X_22 B_12^T only once, we avoid performing any redundant computations. Implicitly, this corresponds to forming the product XB^T before multiplying with A. To perform updates in the right hand side with respect to X_ij, we sum up products of the form X_ij B_kj^T in a temporary submatrix E_k of size mb × nb, where k corresponds to the kth block column of C (and X). The workspace E_k is an intermediate sum of matrix products. In Algorithm 1, we illustrate an O(m^2 n + mn^2) level 3 block algorithm with a storage requirement of m^2 + n^2 + mn + D_r · mb · nb. In general, blocked algorithms for two-sided matrix equations impose a trade-off between memory consumption and flop counts caused by the more complex data dependency in solving for X blockwise. This complexity reduction was not explicitly performed in RECSY [Jonsson and Kagstrom 2003], but the recursive blocking, which follows (11) at each stage, results only in slightly higher complexity constants for the two-sided recursive blocked algorithms [Jonsson and Kagstrom 2002b].
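The role of the intermediate sums E_k can be illustrated with a serial sketch in the spirit of Algorithm 1. This is a toy model: the subsystem kernel uses the SYDT row of Table II, A and B are taken upper triangular so that no 2×2 diagonal blocks are split, the jend bookkeeping is omitted, and all names are illustrative:

```python
import numpy as np

def kron_sydt_kernel(Aii, Bjj, Cij):
    """Solve A X B^T - X = C via the Kronecker form (B (x) A - I), cf. Table II."""
    mb, nb = Cij.shape
    Z = np.kron(Bjj, Aii) - np.eye(mb * nb)
    x = np.linalg.solve(Z, Cij.flatten(order='F'))
    return x.reshape((mb, nb), order='F')

def blocked_sydt(A, B, C, mb, nb):
    """Solve A X B^T - X = C for upper triangular A, B, keeping the
    intermediate sums E[k] = sum_{j >= k} X_ij B_kj^T per block row i."""
    m, n = C.shape
    rows = [slice(s, min(s + mb, m)) for s in range(0, m, mb)]
    cols = [slice(s, min(s + nb, n)) for s in range(0, n, nb)]
    X = C.copy()
    for i in reversed(range(len(rows))):
        mi = rows[i].stop - rows[i].start
        E = [np.zeros((mi, c.stop - c.start)) for c in cols]
        for j in reversed(range(len(cols))):
            Xij = kron_sydt_kernel(A[rows[i], rows[i]], B[cols[j], cols[j]],
                                   X[rows[i], cols[j]])
            X[rows[i], cols[j]] = Xij
            for k in range(j + 1):                 # accumulate E_k += X_ij B_kj^T
                E[k] += Xij @ B[cols[k], cols[j]].T
            if j > 0:                              # prepare the next solve in row i
                X[rows[i], cols[j - 1]] -= A[rows[i], rows[i]] @ E[j - 1]
        for l in range(i):                         # end of block row: update rows above
            for p in range(len(cols)):
                X[rows[l], cols[p]] -= A[rows[l], rows[i]] @ E[p]
    return X

rng = np.random.default_rng(2)
m, n = 9, 8
A = np.triu(rng.standard_normal((m, m)))
np.fill_diagonal(A, 1.5 + 0.5 * rng.random(m))
B = np.triu(rng.standard_normal((n, n)))
np.fill_diagonal(B, 0.1 + 0.1 * rng.random(n))   # so that lambda(A)*mu(B) != 1
C = rng.standard_normal((m, n))
X = blocked_sydt(A, B, C, mb=4, nb=3)
assert np.allclose(A @ X @ B.T - X, C, atol=1e-8)
```

Each product X_ij B_kj^T is formed exactly once, which is precisely the O(m^2 n + mn^2) complexity reduction described above.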

For LYDT, a serial block algorithm that combines the ideas from SYDT and LYCT can be formulated.

2.2 Generalized Sylvester-type matrix equations

The explicit blocking for SYCT extends naturally to GCSY (see, e.g., [Kagstrom and Poromaa 1996; Poromaa 1998]): Let mb and nb be block sizes used in a partitioning of (A, D) and (B, E), respectively. This allows GCSY to be rewritten in block-partitioned form as

    A_ii X_ij − Y_ij B_jj = C_ij − ( Σ_{k=i+1}^{D_l} A_ik X_kj − Σ_{k=1}^{j−1} Y_ik B_kj ),
    D_ii X_ij − Y_ij E_jj = F_ij − ( Σ_{k=i+1}^{D_l} D_ik X_kj − Σ_{k=1}^{j−1} Y_ik E_kj ),        (12)

where i = 1, 2, ..., D_l and j = 1, 2, ..., D_r. A resulting serial level 3 algorithm is implemented as a couple of nested loops over the matrix operations defined by (12).

Algorithm 1 Explicitly blocked algorithm for SYDT.
Input: Matrices A, B and C. A and B in real Schur form. Block sizes mb and nb.
Output: Solution matrix X.

for i = D_l, 1, −1 do
  for j = D_r, 1, −1 do
    % Solve the (i, j)th subsystem
    A_ii X_ij B_jj^T − X_ij = C_ij
    % Update all blocks E_k, k = 1, ..., jend
    jend = j
    if (i = 1) then
      jend = j − 1
    end if
    for k = 1, jend, 1 do
      % If going for a new block row, set E_k to zero
      if (j = D_r) then
        E_k = [0]_{mb×nb}
      end if
      E_k = E_k + X_ij B_kj^T
    end for
    % If we are at the end of a block row, update C_lp, l = 1, ..., i−1, p = 1, ..., D_r
    if (j = 1) then
      for l = 1, i−1, 1 do
        for p = 1, D_r, 1 do
          C_lp = C_lp − A_li E_p
        end for
      end for
    % If we are not at the end of a block row, prepare for next solve
    else
      C_{i,j−1} = C_{i,j−1} − A_ii E_{j−1}
    end if
  end for
end for

We rewrite the reduced triangular GSYL equation in block-partitioned form as

    A_ii X_ij B_jj^T − C_ii X_ij D_jj^T = E_ij − R_ij − S_ij,
    R_ij = Σ_{k=i}^{D_l} Σ_{l=j}^{D_r} A_ik X_kl B_jl^T,  (i, j) ≠ (k, l),
    S_ij = Σ_{k=i}^{D_l} Σ_{l=j}^{D_r} C_ik X_kl D_jl^T,  (i, j) ≠ (k, l).        (13)

Notice that when extending the explicit blocking from SYDT to GSYL we need two workspace arrays F and G of size D_r · mb · nb for storing intermediate matrix products.

Before closing this section, we remark that an explicitly blocked algorithm for LYDT extends straightforwardly to GLYCT/GLYDT. As for LYCT, we may reformulate the updates of the main block diagonal of C in terms of SYR2K operations. However, this is not possible when we use the intermediate sums of matrix products to reduce the arithmetic cost in the two-sided block algorithms.

3. PARALLEL BLOCK ALGORITHMS FOR TRIANGULAR MATRIX EQUATIONS

The block algorithms presented in the previous section can be implemented straightforwardly on single CPU machines as well as parallelized for shared memory (SM) and/or multicore environments. Our objective is to parallelize the block algorithms


for distributed memory (DM) environments, and our approach extends the basic ideas from [Poromaa 1998; Granat et al. 2003]: Assume that the matrices and matrix pairs involved are block partitioned as described in Section 2 and distributed over a rectangular Pr × Pc processor mesh using 2-dimensional (2D) block cyclic distribution following the ScaLAPACK convention [Blackford et al. 1997]. The subsolutions X_ij (and Y_ij) in the blocked algorithms for the reduced matrix equations are now obtained by a wavefront traversal of the block diagonals or block anti-diagonals of the right hand side matrix (or matrices). Indeed, the wavefront block approach can be seen as a generalization of the dataflow approach illustrated in [O'Leary and Stewart 1985] and applied to a triangular standard Sylvester equation. We remark that the blocked formulations of the triangular matrix equations in Section 2 reveal the data dependencies which in turn control the dataflow at a block level of our distributed algorithms.

3.1 Basic parallelization techniques

We introduce some of the parallelization techniques used by considering the triangular SYCT equation (6). To solve SYCT in parallel, we start the wavefront in the South-West corner of C. Since all subsolutions X_ij on each block diagonal of X are independent, we solve for as many of them as possible in parallel. Each subsolution X_ij is broadcasted in block row i and block column j, i.e., in the process rows and columns corresponding to block row i and block column j, respectively, and used in level 3 updates. Theoretically, we may utilize at least min(Pr, Pc) processors concurrently in each phase of the subsystem solves and Pr · Pc processors concurrently in the updates.
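The independence structure behind the wavefront can be sketched as follows (a hypothetical helper, not part of SCASY): for SYCT, blocks with equal j − i lie on the same block diagonal and can be solved concurrently, since X_ij needs only X_kj with k > i and X_ik with k < j, both of which lie on earlier fronts when sweeping from the South-West corner:

```python
def syct_wavefronts(Dl, Dr):
    """Group the (i, j) block indices (0-based) of X into wavefronts.
    Front d holds all blocks with j - i == d; fronts are returned in the
    order they become ready, starting at the South-West corner block."""
    fronts = {}
    for i in range(Dl):
        for j in range(Dr):
            fronts.setdefault(j - i, []).append((i, j))
    return [fronts[d] for d in sorted(fronts)]

# a 3 x 3 block partitioning yields 5 wavefronts of sizes 1, 2, 3, 2, 1
sizes = [len(f) for f in syct_wavefronts(3, 3)]
assert sizes == [1, 2, 3, 2, 1]
```

The maximal front size, min(D_l, D_r), matches the min(Pr, Pc) concurrency bound quoted above for the solve phase.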

To compute X_ij for certain values of i and j, we need A_ii and B_jj to be held by the same processor that holds C_ij. We also need the blocks used in the updates to be in the right place at the right time, forcing us to communicate for some blocks during the solves and updates. This is done on demand [Granat et al. 2003]: whenever a processor misses any block that it needs for solving a node subsystem or doing a level 3 update, it is received from the owner in a single point-to-point communication. Because of the global view of data in the ScaLAPACK environment, all processors know exactly which blocks to send and receive in each step of the algorithm.

Another, perhaps more elegant, communication method is to shift the involved matrices one block across the process mesh for every block diagonal that is solved (e.g., see [Poromaa 1998]). This brings all the blocks needed for the solves and updates associated with the current block diagonal into the right place in one single global communication operation¹. This matrix block shift method puts restrictions on the dimensions of the processor grid and the data distribution: Pr must be an integer multiple of Pc or vice versa, and the last rows/columns of A and B must be mapped onto the last process row/column [Poromaa 1998]. In addition, the matrix block shift strategy can only be applied to matrix equations where the left and right multiplying matrices have the same transpose mode [Granat et al. 2003].

¹ The matrices are shifted such that as little data as possible is communicated.

The parallel block algorithms for SYCT extend to GCSY by the following obvious changes (see [Poromaa 1998] for a more detailed description using the matrix block shift communication):

—Solve GCSY node subsystems instead of SYCT node subsystems.
—Broadcast X_ij in block column j and Y_ij in block row i.
—Perform pairs of level 3 updates in the matrix pair (C, F).

For GCSY it is beneficial to perform on demand communication of pairs of submatrices (where applicable) to minimize the number of communication operations. Similarly, if matrix block shift communication is used, we shift blocks of matrix pairs together in one message, if possible.

To solve LYCT in parallel, the matrix C/X is traversed along its block anti-diagonals, starting in the South-East corner going North-West. If C = C^T, we can save half of the computational work compared to SYCT at the cost of some more communication: each X_ij from the lower triangular part of C is sent to the transposed (j, i)th position, transposed locally, saved as X_ji and broadcasted for updates in block row j. In this way, the lower triangular part of C is transposed on-the-fly into the upper triangular part at almost no extra cost.

For SYDT, we introduce a matrix E ∈ R^{m×n} that is aligned with C and parallelize the block algorithm for SYDT in terms of four phases. Each phase is illustrated by an example with reference to Figure 1.

Phase I: Solving of subsystems on the current block anti-diagonal (i, j). Traverse the block anti-diagonals of the matrix C/X from South-East to North-West, and solve the SYDT subsystems encountered in parallel. Once X_ij has been computed, it is broadcasted in block row i.
  Figure 1: Solve for X_66, X_57, X_48 on the current anti-diagonal (bold bordered blocks) and broadcast them in block rows 6, 5, and 4, respectively.

Phase II: Updating of block row i of E. For each subsolution X_ij from Phase I, compute E_ik = E_ik + X_ij B_kj^T, k = 1, ..., j, in parallel. Once E_ij has been computed, it is broadcasted in block column j.
  Figure 1: Use X_66, X_57, X_48 to update block rows 6, 5, and 4 of E. Update E_66, E_57, E_48 first and broadcast them in block columns 6, 7, and 8.

Phase III: Updating of block column j of C. For each E_ij in the current block anti-diagonal, compute C_lj = C_lj − A_li E_ij, l = 1, ..., i − 1, which (partly) overlaps with the computations of Phase II.
  Figure 1: Use E_66, E_57, E_48 to update block columns 8, 7, and 6 of C.

Phase IV: Preparing for the next block anti-diagonal (i, j − 1) of C. Finally, update² C_{i,j−1} = C_{i,j−1} − A_ii E_{i,j−1}. Notice that this update does not cover the topmost block in the next anti-diagonal, C_{i−1,j}, which was prepared in the previous phase.
  Figure 1: Use E_65, E_56, E_47 to update C_65, C_56, C_47 in the next anti-diagonal (striped bold bordered blocks). So far, E_38 = 0 and C_38 was already completed in Phase III.

² Since A_ii is quasi-triangular, this can be implemented as a TRMM operation in combination with k DAXPY operations, where k is equal to the number of 2 × 2 blocks in A_ii.


Fig. 1. The SYDT wavefront: standard, two-sided, non-symmetric. Yellow blocks contain already computed subsolutions. Bold bordered blocks are subsolutions on the current anti-diagonal. Blocks with the same color are used together in subsystem solves, GEMM-updates or preparations for the next anti-diagonal. Striped bold bordered blocks are used in several operations.

The total storage requirement of this algorithm is m^2 + n^2 + 2mn, which includes the m × n workspace E. A brief pseudo-code description of the parallel SYDT algorithm is given in Algorithm 2, with the four phases above marked as comments.

The algorithm for LYDT is parallelized in a similar way. With C = C^T, we only compute the lower triangular part of X explicitly, as for LYCT.

We can apply Algorithm 2 to GSYL with the following modifications:

—Solve GSYL subsystems (13) instead of SYDT subsystems (9).
—Use two matrices F and G for storing intermediate sums of matrix products.
—Broadcast X_ij in block row i and broadcast the submatrix pair (F_ij, G_ij) in block column j.
—Do all updates in the right hand side with respect to both F_ij and G_ij.

Without going into details, we remark that the parallel algorithm for LYDT can, by a similar reasoning, be applied straightforwardly to GLYCT/GLYDT.

3.2 Remarks on adapting matrix block shifting for two-sided matrix equations

When adapting the block shift approach for two-sided matrix equations, buffers with intermediate sums of matrix products must be kept aligned and therefore shifted along with the right hand side. Then the least costly shift strategy depends not only on the matrix dimensions (i.e., m and n), but also on the number of elements to be sent. Moreover, by storing the local pieces of the GSYL buffer pair (F, G) in consecutive memory, each shift of this pair is performed in one communication step.


Algorithm 2 Parallel algorithm with on demand communication for SYDT.
Input: Matrices A, B and C. A and B in real Schur form. Block sizes mb and nb.
Output: Solution matrix X (overwrites C).

for k = 1, # block diagonals in C do
  % Phase I: Solve subsystems on current block anti-diagonal of C in parallel
  if (mynode holds C_ij) then
    if (mynode does not hold A_ii and/or B_jj) then
      Communicate for receiving A_ii and/or B_jj
    end if
    Solve A_ii X_ij B_jj^T − X_ij = C_ij
    Broadcast X_ij to processors holding blocks in block row i of C
  else if (mynode needs X_ij) then
    Receive X_ij
    % Phase II: Update all E_ik, k = 1, ..., jend, in parallel
    jend = j
    if (i = 1) then
      jend = j − 1
    end if
    for k = jend, 1, −1 do
      if (j = D_r) then
        Set E_ik = [0]_{mb×nb}
      end if
      if (mynode does not hold B_kj) then
        Communicate for receiving B_kj
      end if
      Compute E_ik = E_ik + X_ij B_kj^T
      if (mynode holds E_ij) then
        Broadcast E_ij to processors holding blocks in block column j of C
      end if
    end for
  end if
  if (mynode needs E_ij) then
    Receive E_ij
    % Phase III: Update block column j of C in parallel
    for l = 1, i − 1, 1 do
      if (mynode does not hold A_li) then
        Communicate for receiving A_li
      end if
      Compute C_lj = C_lj − A_li E_ij
    end for
  end if
  % Phase IV: Prepare to solve for next block anti-diagonal of C in parallel
  if (j > 1) then
    if (mynode holds C_{i,j−1}) then
      if (mynode does not hold A_ii) then
        Communicate for receiving A_ii
      end if
      Compute C_{i,j−1} = C_{i,j−1} − A_ii E_{i,j−1}
    end if
  end if
end for

3.3 Handling 2 × 2 diagonal blocks in left hand side Schur forms

The parallel algorithms must assure that no 2×2 diagonal blocks are shared between different blocks (and processors) in the explicit blocking. In [Granat et al. 2003], a technique for handling this problem via an implicit redistribution of the elements in the matrices in SYCT was presented. The same technique is adapted to all triangular matrix equations considered. In general, all left multiplying matrices are redistributed in the same way and all right multiplying matrices are redistributed in the same way to keep all block sizes in the right hand side and the solution matrices consistent with the corresponding serial and parallel block algorithms. For details, we refer to Part II [Granat and Kagstrom 2007].


3.4 Improving scalability of the triangular solvers

By introducing a second level of blocking at the nodes, we can utilize a pipeline approach in our parallel algorithms that improves the scalability of the triangular solvers. For the one-sided matrix equations (see Table I), define mb2 ≤ mb and nb2 ≤ nb and solve each local node subsystem by explicit blocking and a serial block algorithm (e.g., see Algorithm 1), return temporarily after each small mb2 × nb2 system has been solved, broadcast this partial subsolution as before, and return for local level 3 updates. The same procedure is repeated for all mb2 × nb2 systems, leading to:

—a pipelining effect that causes less idle time among updating processors,
—data reuse from the on demand communications, and
—preserved cost-optimality for larger processor meshes, since larger values of mb and nb are beneficial from the data reuse point above (see Section 4).

However, the drawbacks are:

—a higher influence on the parallel runtime from the network latency due to more row and column oriented broadcasts (see Section 4), although the total amount of data communicated is the same,
—tuning mb2, nb2 in relation to mb, nb is a non-trivial architecture-dependent task, and
—the local explicit blocking algorithm has to take care of the risk of splitting any local 2 × 2 blocks on the diagonal which were not removed by the implicit redistribution, leading to a lot of bookkeeping of indices and dimensions.

We call this process multiple pipelining. The optimal relation between the values of mb, nb, mb2 and nb2 is platform dependent and must be properly balanced, since bad choices may give suboptimal cache or network performance. Pipelining is turned off for mb2 = mb and nb2 = nb.

For the two-sided matrix equations, multiple pipelining is implemented for nb2 = nb, i.e., we do local explicit blocking in the row dimension only, which is necessary to enhance pipelining of partial results of the intermediate sums of matrix products as well as the local subsolutions.

3.5 Global scaling strategies

The parallel algorithms avoid overflow in the right hand side computations byperforming global scaling using a scaling factor sglobal ∈ (0, 1]. A simple techniquewould be to give the smallest local scaling factor as output from the actual solver sothat if overflow has occurred, the user may scale the right hand side and reinvokethe solving routine. In contrast to its simplicity, this strategy may more thandouble the execution time and is therefore not preferred. We instead perform globalscaling on-the-fly, as follows. After each phase in the diagonal block traversal, aglobal scaling factor is computed and sent to all processors. If the factor sglobal isless than 1.0, each processor scales its local parts of the right hand side data andthe computational process can continue. In practice, the calculation of the globalscaling factor involves a special case of the k-to-all reduction operation with thefollowing restrictions: given a Pr × Pc processor mesh we have k = min(Pr, Pc)

Page 125: Algorithms and Library Software for Periodic and Parallel ... · 1.1 Motivation for this work 1 1.2 Parallel computations, computers and programming models 2 1.3 Matrix computations

Parallel Algorithms for Sylvester-Type Matrix Equations · 115

and the k different local scaling factors s1, s2, . . . , sk are all located in processors labelled with different process rows and columns, i.e., no processor pair among these k processors shares a row or column index in the mesh. Under these circumstances, we propose the following efficient three-phase implementation of the k-to-all reduction operation for computing sglobal:

—Perform one-to-all broadcasts of the local value in the 1D scope corresponding to the largest mesh dimension, where the holders of the values are the roots of the operations. Since k = min(Pr, Pc), the k broadcasts are independent.

—Perform all-to-one reductions of the local values in the 1D scope corresponding to the smallest mesh dimension and save the minimum in one root. As in the previous step, all reductions are independent.

—Perform once again one-to-all broadcasts, but now in the same scope and with the same root processor as in phase (2), delivering the global minimum sglobal to all processors in the mesh. The broadcasts in this final step are also independent.

Notice that the last two steps above together perform an all-to-all reduction, which can be implemented using different strategies, including the one described above.

It is recommended that operations of this type are carried out with broadcast and reduction algorithms originally developed for the hypercube network, by mapping the underlying scope onto a logical hypercube topology, i.e., in O(log P⋆) steps, where P⋆ is the number of processors in the actual scope (see, e.g., [Grama et al. 2003] and Section 4).
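The three phases can be simulated on a small mesh; the following Python sketch (our illustration, not SCASY code) assumes Pr ≥ Pc, so that k = Pc and every process column contains exactly one holder:

```python
def k_to_all_min(Pr, Pc, holders):
    """Simulate the three-phase k-to-all reduction on a Pr x Pc mesh
    (assuming Pr >= Pc, so k = Pc).  holders maps (row, col) -> local
    scaling factor, with no two holders sharing a row or a column.
    Returns the mesh; afterwards every processor holds the global minimum."""
    assert Pr >= Pc and len(holders) == min(Pr, Pc)
    mesh = [[None] * Pc for _ in range(Pr)]
    # Phase 1: one-to-all broadcasts in the largest 1D scope (the process
    # columns, of length Pr); the k holders are the roots and are independent.
    for (_, c), s in holders.items():
        for i in range(Pr):
            mesh[i][c] = s
    # Phase 2: all-to-one min-reductions in the smallest scope (the process
    # rows, of length Pc); since k = Pc, every row now holds all k values.
    row_min = [min(row) for row in mesh]
    # Phase 3: one-to-all broadcasts in the same (row) scope deliver s_global.
    return [[row_min[i]] * Pc for i in range(Pr)]

mesh = k_to_all_min(3, 2, {(0, 0): 0.5, (1, 1): 0.25})
# Every processor now holds s_global = 0.25.
```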

4. GENERIC SCALABILITY ANALYSIS OF THE PARALLEL BLOCK ALGORITHMS

The parallel block algorithm for solving triangular SYCT equations with the matrix block shift approach (see Section 3.1) was analyzed in [Poromaa 1998]. In this section, we present a generic scalability analysis of the triangular solvers using the on demand communication scheme. We start by defining some important concepts used in the performance models.

4.1 Definitions

Let W denote the work, which is the number of floating point operations (flops) required to solve the triangular matrix equation using the (best) sequential algorithm. We define Tp = Ta + Tc as the parallel runtime using p = Pr · Pc processors, where Ta and Tc denote the arithmetic and communication (synchronization) runtimes, respectively. Usually, T1 = W. An algorithm is said to be cost-optimal if the parallel overhead is of the same order as the serial complexity (the work), i.e., pTp − W = Θ(W). Cost-optimal algorithms are regarded as being scalable, i.e., capable of using computational resources efficiently as the problem size (the number of flops to perform) and the number of processors are increased simultaneously.

Let ta, ts, and tb be the arithmetic time to perform a floating point operation, the start-up time for sending a message, and the per-byte transfer time, i.e., the amount of time it takes to send one byte through one link of the interconnection network of the parallel computer system. Usually, ts and tb are constants, while ta is a function of the data locality. For modern parallel computers, the communication cost model for a single point-to-point communication is usually approximated by


116 · R. Granat and B. Kagstrom

ts + tb · l, where l denotes the message size in bytes, regardless of the number of links traversed [Grama et al. 2003]. For a one-to-all broadcast, or its dual operation all-to-one reduction, in a certain scope (e.g., a process row or a process column), we assume that such an operation is performed using recursive doubling, i.e., in O(log2 P⋆) steps, where P⋆ is the number of processors in the actual scope.
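A hedged sketch of this cost model in Python (illustrative only; the sample values of ts and tb below are the measured sarek parameters from Table V):

```python
import math

# Point-to-point message of l bytes: ts + tb*l, independent of the number
# of links traversed.  A one-to-all broadcast (or all-to-one reduction)
# over P processors via recursive doubling takes ceil(log2 P) such steps.
def ptp_cost(l, ts, tb):
    return ts + tb * l

def bcast_cost(l, P, ts, tb):
    return math.ceil(math.log2(P)) * ptp_cost(l, ts, tb)

# Broadcasting a 1 kB message in a 16-processor scope on sarek (Table V):
print(bcast_cost(1024, 16, ts=4.1e-7, tb=2.3e-9))  # 4 steps of ts + tb*l
```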

Let Ts and Tu denote the complexity (measured in flops) of solving a node subsystem and a (pair of) level 3 update(s) in the matrix equation, respectively. Expressions for Ts (and Tu) for each matrix equation are presented in Section 1. As before, Dl and Dr denote the number of main diagonal blocks of the left multiplying and right multiplying matrices, respectively, of the matrix equation considered.

In the following analysis, we consider double precision arithmetic, i.e., each floating point number requires eight bytes of storage. In addition, we denote the message size in double precision numbers by li, where i corresponds to a specific communication operation in the parallel algorithms, all of which are explained below and summarized in Table III.

4.2 Analysis

The number of subsystems to solve for general problems is S = DlDr, and half as many in the case of symmetry (exploiting local symmetry for i = j). The number of (pairs of) level 3 updates (mostly GEMM operations) U is approximated by U = (Dl²Dr + Dr²Dl)/2 for general problems and half as many for symmetric problems. The level of parallelism is min(Pr, Pc) for the subsystem solves and Pr · Pc for the updates. Ignoring the O(n²) cost of any performed scaling, we model the arithmetic cost as

Ta = ( S/min(Pr, Pc) · Ts + U/(Pr · Pc) · Tu ) · ta.    (14)
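The arithmetic model (14), including the halving of S and U for symmetric problems, can be sketched as follows (parameter names follow the text):

```python
def arithmetic_time(Dl, Dr, Pr, Pc, Ts, Tu, ta, symmetric=False):
    """Arithmetic runtime model (14): the S subsystem solves proceed with
    parallelism min(Pr, Pc), the U level 3 updates with parallelism Pr*Pc;
    symmetric problems halve both S and U."""
    S = Dl * Dr / (2.0 if symmetric else 1.0)
    U = (Dl**2 * Dr + Dr**2 * Dl) / (4.0 if symmetric else 2.0)
    return (S / min(Pr, Pc) * Ts + U / (Pr * Pc) * Tu) * ta

# With Dl = Dr = 4 on a 2 x 2 mesh and unit Ts = Tu = ta:
# S = 16, U = 64, so Ta = 16/2 + 64/4 = 24 (12 in the symmetric case).
print(arithmetic_time(4, 4, 2, 2, 1.0, 1.0, 1.0))  # 24.0
```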

The overall dominating part of (14) consists of level 3 updates. Tc is modelled by six contributions which appear in most, but not all, parallel algorithms for the different matrix equations:

C1 Communication overhead from the implicit redistribution (see Section 3.3).
C2 Communication associated with subsystems and preparations for the next solve.
C3 Communication cost from forming the scaling factor sglobal (see Section 3.5).
C4 Broadcasts in block row i and block column j corresponding to Xij.
C5 On-the-fly transposition of the lower triangular part of the solution and the associated broadcasts.
C6 Communication associated with the updates.

C1: The algorithm for the implicit redistribution in SYCT [Granat et al. 2003] starts with Dl + Dr broadcasts of single integers defining the redistribution, followed by exchanges of rows, columns and single elements of blocks in A, B and C between the processors holding parts of the split 2 × 2 blocks of A and B and the corresponding block rows and columns. Generically, this results in

btot (ts + 4tb) log2(PrPc) + (4/(PrPc)) (3ts + 8tb · l1) bext,    (15)


where btot, bext and l1 are the total number of diagonal blocks of the left hand side coefficient matrices, the maximum number of blocks that can be affected by the redistribution, and the number of elements sent and received in extending one block, respectively. The values of these constants are displayed in Table III (as for all constants introduced below). The contribution (15) to Tc is in general very small, if not negligible. For a discussion of the implementation of these communications, see Part II [Granat and Kagstrom 2007].

Table III. Model parameters used in the generic scalability analysis of the parallel algorithms. This table assumes that C = Cᵀ for (G)LYCT and (G)LYDT.

Parameter   SYCT                    SYDT                    LY[C,D]T
S           Dl·Dr                   Dl·Dr                   Dl·Dr/2
U           (Dl²Dr + Dr²Dl)/2       (Dl²Dr + Dr²Dl)/2       (Dl²Dr + Dr²Dl)/4
btot        Dl + Dr                 Dl + Dr                 Dl
bext        Dl² + Dr² + 2Dl·Dr      Dl² + Dr² + 2Dl·Dr      3Dl²
br          (mb·nb)/(mb2·nb2)       mb/mb2                  nb²/nb2², nb/nb2
bc          (mb·nb)/(mb2·nb2)       1                       nb²/nb2², 1
l1          mb + nb + 1             mb + nb + 1             2mb + 1
l2          mb2                     mb2                     mb2
l3          nb2                     nb2                     mb2
l4          mb2·nb2                 mb2·nb2                 mb2²
l5          mb2·nb2                 mb2·nb2                 mb2²
l6          -                       -                       mb2²
l7          mb2                     mb2                     mb2
l8          nb2                     nb2                     mb2

Parameter   GCSY                    GSYL                    GLY[C,D]T
S           Dl·Dr                   Dl·Dr                   Dl·Dr/2
U           (Dl²Dr + Dr²Dl)/2       (Dl²Dr + Dr²Dl)/2       (Dl²Dr + Dr²Dl)/4
btot        Dl + Dr                 Dl + Dr                 Dl + Dr
bext        2Dl² + 2Dr² + 4Dl·Dr    2Dl² + 2Dr² + 2Dl·Dr    4Dl²
br          (mb·nb)/(mb2·nb2)       mb/mb2                  nb/nb2
bc          (mb·nb)/(mb2·nb2)       1                       nb/nb2
l1          mb + nb + 1             mb + nb + 1             2mb + 1
l2          2mb2                    2mb2                    2mb2
l3          2nb2                    2nb2                    2nb2
l4          mb2·nb2                 mb2·nb2                 mb2²
l5          mb2·nb2                 2mb2·nb2                2mb2²
l6          -                       -                       mb2²
l7          2mb2                    2mb2                    2mb2
l8          2nb2                    2nb2                    2mb2

C2: Each process is involved in (DlDr)/min(Pr, Pc) subsystems and preparations for the next solve. This gives the second contribution to Tc as

DlDr/min(Pr, Pc) · (ts + 8tb · l2) + DlDr/min(Pr, Pc) · (ts + 8tb · l3) + DlDr/min(Pr, Pc) · (ts + 8tb · l2),    (16)

where the third term appears only for two-sided equations.

C3: Ignoring the arithmetic cost, the total cost of forming the global scaling factor sglobal within each phase of the (anti-)diagonal block traversal is

t_{k-to-all} = (ts + 8tb)(log2(max(Pr, Pc)) + 2 log2(min(Pr, Pc))).

The cost is higher if the three parts of the k-to-all reduction are carried out in opposite mesh directions for a non-square (Pr ≠ Pc) process mesh. An upper


bound for this contribution to the parallel execution time is

S · t_{k-to-all} / min(Pr, Pc),    (17)

which in most cases is negligible.

C4: The broadcast operations are performed in groups of min(Pr, Pc) broadcasts each, to the cost

br · DlDr/min(Pr, Pc) · (ts + 8tb · l4) log2(Pc) + bc · DlDr/min(Pr, Pc) · (ts + 8tb · l5) log2(Pr),    (18)

where the first term corresponds to the row oriented broadcasts and the second corresponds to the column oriented broadcasts, and br and bc are the number of broadcasts per block (which depends on the pipelining configuration, see Section 3.4) in the row and column directions, respectively.

C5: The on-the-fly transposition of the solution and the associated broadcasts are modelled by

br · Dl²/(2 min(Pr, Pc)) · (ts + 8tb · l6)(1 + log2(Pc)).    (19)

C6: In theory, the (Dl²Dr + Dr²Dl)/2 updates can be carried out in parallel for p = Pr · Pc processors. However, assuming O(p/2) bisection bandwidth in the network, half of the processors must first send a requested submatrix to a processor in the other half before they may receive their own requested data and compute their own updates. Therefore, the upper limit of the number of concurrent messages prior to the updates is (PrPc)/2. In practice, the level of concurrency can be lower. Consequently, we use the model

Dr²Dl/(Pr · Pc) · (ts + 8tb · l7) + Dl²Dr/(Pr · Pc) · (ts + 8tb · l8),    (20)

where l7 and l8 denote the message sizes involved.

To sum up, Tc is modelled as the sum of (15)–(20). The parallel runtime for a generic algorithm solving a triangular matrix equation is then Tp = Ta + Tc. Using the values in Table III, we may formulate specific performance models for each matrix equation solver considered.

In general, assuming m = n, mb = nb, no pipelining, and p processors, each processor will perform O(n³/p) flops and send and receive in total O(n³/(nb · p)) matrix elements communicated in O(n³/(nb³ · p)) distinct messages, which gives an expression for Tp of a simplified model algorithm as

Tp = (n³/p) ta + (n³/(nb³ · p)) ts + (n³/(nb · p)) tw,    (21)

where tw = 8tb. By regarding ta, ts and tb as functions, (21) implies cost-optimality if ts/nb³ = O(ta) and tw/nb = O(ta). Moreover, the parallel speedup can be bounded from above by

Sp ≤ p/k,  with  k = 1 + (1/nb³) · (ts/ta) + (1/nb) · (tw/ta).    (22)


Notice that (22) is independent of n if ta, which depends on both n and nb, is kept fixed. If ts/nb³ = O(ta) and tw/nb = O(ta), we get an approximate upper bound

Sp ≲ p/(1 + 1 + 1) = p/3.    (23)

This predicts an upper bound of the speedup of O(p/k), where the constant k > 1 depends on the blocking factors used in the data distribution and the performance balance of the processors and the network of the target parallel computer system. For example, with nb = 64 and scaled values ta = 1, ts = 100, . . . , 1000 and tb = 10, . . . , 100 (see also Table V), we get k ≈ 2.25, . . . , 13.50. According to the model (22), k is also more sensitive to the bandwidth than to the node latency of the underlying system. In Section 7, we illustrate these results with experimental data where ta is kept fixed by scaling the problem size with the number of processors such that the memory load per processor is maintained at a specified level. In such a scenario, (22) can be regarded as expressing the possibility of scaled parallel speedup of the model algorithm.
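The bound (22) is easy to evaluate; the following sketch reproduces the quoted range of k:

```python
def speedup_bound_k(nb, ta, ts, tw):
    """The constant k in the speedup bound Sp <= p/k of (22)."""
    return 1.0 + ts / (nb**3 * ta) + tw / (nb * ta)

# Scaled values from the text: ta = 1, ts = 100..1000, tb = 10..100, tw = 8*tb.
k_lo = speedup_bound_k(64, ta=1.0, ts=100.0, tw=8 * 10.0)
k_hi = speedup_bound_k(64, ta=1.0, ts=1000.0, tw=8 * 100.0)
print(round(k_lo, 2), round(k_hi, 2))  # 2.25 13.5
```

The latency term ts/nb³ is three orders of magnitude smaller than the bandwidth term tw/nb here, which is why k is dominated by the bandwidth, as noted above.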

5. CONDITION ESTIMATION USING PARALLEL SEP⁻¹ ESTIMATORS

In several papers, starting with [Hager 1984; Higham 1988], a general method is discussed for estimating ‖A⁻¹‖1 for a square matrix A using reverse communication of A⁻¹x and A⁻ᵀx, where ‖x‖2 = 1. In [Kagstrom and Poromaa 1992] this approach was successfully applied to the triangular Sylvester equation AX − XB = C, based on the observation that the triangular Sylvester equation is equivalent to the linear system ZSYCT x = c. Since ‖·‖1 and ‖·‖2 differ at most by a constant factor, it is possible to compute a lower bound estimate of the inverse of the separation between the matrices A and B [Stewart and Sun 1990; Kagstrom and Poromaa 1992]

sep(A, B) = inf_{‖X‖F = 1} ‖AX − XB‖F = σmin(ZSYCT) = ‖ZSYCT⁻¹‖2⁻¹,    (24)

‖x‖2/‖c‖2 = ‖X‖F/‖C‖F ≤ ‖ZSYCT⁻¹‖2 = 1/σmin(ZSYCT) = sep⁻¹.    (25)

Sep-functions like (24) appear in perturbation theory and error bounds for matrix equations (e.g., see [Higham 1993; Kagstrom 1994]) and different eigenspace computations (e.g., see [Stewart and Sun 1990; Kagstrom and Poromaa 1996]). However, the cost of computing these quantities is typically O(m³n³) flops, for example, using the singular value decomposition (SVD) of ZSYCT. Already for moderate values of m and n this is a huge cost. Luckily, sep(A, B)⁻¹ can be estimated in O(m²n + mn²) flops by solving a few triangular SYCT equations.
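The equivalence between SYCT and the linear system ZSYCT x = c can be checked directly on a tiny example. With column-major vec, ZSYCT = In ⊗ A − Bᵀ ⊗ Im; the pure-Python helpers below (kron, vec, solve) are our own illustrations, not library routines:

```python
def kron(A, B):                      # Kronecker product, lists of lists
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def vec(X):                          # stack the columns of X
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def solve(M, b):                     # Gaussian elimination, partial pivoting
    n = len(M)
    M = [row[:] + [b[i]] for i, row in enumerate(M)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# Upper triangular A and B with disjoint spectra, and a known solution X.
A = [[2.0, 1.0], [0.0, 3.0]]
B = [[-1.0, 0.0], [0.0, -2.0]]
X = [[1.0, 2.0], [3.0, 4.0]]
AX = [[sum(A[i][k] * X[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
XB = [[sum(X[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
C = [[AX[i][j] - XB[i][j] for j in range(2)] for i in range(2)]

I2 = [[1.0, 0.0], [0.0, 1.0]]
Bt = [[B[j][i] for j in range(2)] for i in range(2)]
KA, KB = kron(I2, A), kron(Bt, I2)
Z = [[KA[i][j] - KB[i][j] for j in range(4)] for i in range(4)]

x = solve(Z, vec(C))                 # solving Z x = vec(C) recovers vec(X)
assert max(abs(x[i] - vec(X)[i]) for i in range(4)) < 1e-12
```

Since Z has dimension mn, the O(m³n³) cost of an SVD of Z is apparent already from this construction, which is exactly why the estimation approach above is attractive.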

Notice that the condition sep[ACRO] = σmin(ZACRO) ≠ 0 guarantees a unique solution of each of the matrix equations in Table I. Assuming sep[ACRO] ≠ 0, we can apply the technique for SYCT described above and perform condition estimation for any matrix equation by computing a lower bound on σmin(ZACRO)⁻¹, see Table II. However, since transposing ZGCSY is not only a matter of transposing all involved left hand side matrices (A, B, D, and E), as for the uncoupled matrix equations, but also of transposing the (1,2) and (2,1) blocks of ZGCSY, the condition estimation of GCSY requires a special algorithm (e.g., see [Kagstrom and Poromaa


1996]). Moreover, symmetry in the right hand side for Lyapunov equations (see Table I) cannot be assumed for this general estimation approach.

The 1-norm-based condition estimator in [Kagstrom and Poromaa 1992] was based on the serial LAPACK routine DLACON [Anderson et al. 1999], and a block version is used in MATLAB [Higham and Tisseur 2000]. The parallel version we use is implemented as the auxiliary routine PDLACON in ScaLAPACK [Blackford et al. 1997].

To compute an estimate, we make use of the fact that PDLACON requires a distributed column vector as right hand side and compute Pc different estimates concurrently, one for each process column and column vector, and produce the global maximum by performing an all-to-all reduction [Grama et al. 2003] in each process row in parallel. The column vector y in each process column is formed by performing an all-to-all broadcast [Grama et al. 2003] of the local pieces of the right hand side matrix (or matrices) in each process row and stacking them on top of each other in the order they arrive.

We outline the parallel condition estimation process in Algorithm 3, where the while-loop is normally iterated around five times. The current implementations pick the new global value of kase for the next iteration depending on the local output from the Pc calls to PDLACON: 0 if all processors have kase = 0 locally; otherwise, the algorithm picks the nonzero value of kase (1 or 2) found at the majority of the processors in the entire process mesh.

Algorithm 3. Parallel condition estimation algorithm for equations in Table I.

Input: Reduced left hand side matrices for ACRO equation.
Output: Lower bound estimate est of sep⁻¹[ACRO], number of iterations iter.

kase = iter = 0
while (kase ≠ 0 or iter = 0) do
    if (iter ≠ 0) then
        Perform all-to-all broadcast of the local part of y in each process row
    end if
    Each process column: call PDLACON(y, est, kase) to compute new estimate
    Find new global value of kase for the next search direction
    Take global maximum of est by an all-to-all reduction in each process row
    if (kase ≠ 0) then
        iter = iter + 1
        if (kase = 1) then
            Compute y ← ZACRO⁻¹ y
        else if (kase = 2) then
            Compute y ← ZACRO⁻ᵀ y
        end if
    end if
end while
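The reverse communication protocol behind DLACON/PDLACON follows Hager's method: the estimator only ever asks the caller for the products y ← Zx (kase = 1) and y ← Zᵀx (kase = 2). The following serial Python routine is a simplified stand-in for illustration, not the LAPACK algorithm itself; applied with triangular solves as the matrix-vector products, it yields the lower bound on sep⁻¹[ACRO]:

```python
def onenorm_est(n, matvec, rmatvec, itmax=5):
    """Estimate ||Z||_1 from matrix-vector products only (Hager-style).
    Applied to Z^{-1}, i.e., with matvec/rmatvec implemented as solves,
    this gives a lower bound estimate of ||Z^{-1}||_1."""
    x = [1.0 / n] * n
    est = 0.0
    for _ in range(itmax):
        y = matvec(x)                              # kase = 1
        new_est = sum(abs(v) for v in y)
        if new_est <= est:                         # no progress: kase = 0
            break
        est = new_est
        xi = [1.0 if v >= 0.0 else -1.0 for v in y]
        z = rmatvec(xi)                            # kase = 2
        j = max(range(n), key=lambda i: abs(z[i]))
        if abs(z[j]) <= sum(z[i] * x[i] for i in range(n)):
            break                                  # estimate cannot improve
        x = [0.0] * n                              # new search direction e_j
        x[j] = 1.0
    return est

A = [[1.0, -2.0], [0.0, 3.0]]
mv = lambda x: [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
rmv = lambda x: [sum(A[j][i] * x[j] for j in range(2)) for i in range(2)]
print(onenorm_est(2, mv, rmv))  # 5.0, the exact 1-norm of A in this case
```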

6. IMPLEMENTATION ISSUES

We have implemented Bartels–Stewart's method for the matrix equations in Table I as ScaLAPACK-style parallel solvers in eight routines PGE[ACRO]D, where [ACRO] is replaced by the corresponding acronym for the actual matrix equation. These routines invoke their corresponding triangular solvers PTR[ACRO]D, which are based on the parallel algorithms described in Section 3. The condition estimators described in Section 5 were also implemented as the routines P[ACRO]CON, each based on the associated matrix equation solver.


The implementation details are presented in Part II [Granat and Kagstrom 2007]. Here we only remark that our implementations use functionality and routines from ScaLAPACK, PBLAS, BLACS, LAPACK, BLAS and RECSY (which uses a few SLICOT routines). Furthermore, the parallel reduction to generalized Schur form is based on the new ScaLAPACK-style implementations of the Hessenberg-triangular reduction and the parallel QZ algorithm presented in [Adlerborn et al. 2002; Adlerborn et al. 2006].

Altogether, these routines make up SCASY [SCASY], a parallel HPC library for solving matrix equations.

7. COMPUTATIONAL EXPERIMENTS

We present a selection of representative results from computational experiments with our implementations of the presented DM algorithms on three different parallel platforms hosted by HPC2N. The results are evaluated using several performance and accuracy metrics. The analysis includes results regarding general and triangular parallel solvers as well as parallel condition estimators.

7.1 Parallel machine characteristics

The IBM SP System knut consists of 64 thin P2SC nodes (120 MHz, with peak 480 Mflops/s), where one node has 1 GB memory and all others have 128 MB memory each. The nodes are connected with a multi-stage network having a peak bandwidth of 150 MB/s. The system peak performance is 33.6 Gflops/s.

The Super Linux Cluster seth consists of 120 dual AMD Athlon MP2000+ nodes (1.667 GHz), where 12 nodes have 2 GB memory each and all others have 1 GB memory each. The cluster is connected with a Wolfkit3 SCI high speed interconnect having a peak bandwidth of 667 MB/s. The network connects the nodes in a 3-dimensional torus organized as a 6 × 4 × 5 grid. Each sub-ring in the grid is "one-way" directed. In total, the system has a theoretical peak performance of 800 Gflops/s. For more information, see [Edmundsson et al. 2004].

The 64-bit Opteron Linux Cluster sarek consists of 192 dual AMD Opteron nodes (2.2 GHz), where each node is a NUMA machine with 8 GB memory. The switch network is a Myrinet-2000 high-performance interconnect with 250 MB/s peak bandwidth. In total, the system has a peak performance of 1690 Gflops/s and has performed 1329 Gflops/s on the HP-Linpack benchmark.

Experimentally measured machine characteristics for the target parallel platforms are summarized in Table V. Notice that in practice tb (the inverse of the bandwidth) may be larger (e.g., up to a factor of 2 on sarek according to our experiences, see also Figure 3) because of network sharing and the fact that we communicate using MPI through the BLACS layer, which adds some extra overhead to each communication.

The compilers and software used in the computational experiments are listed in Table IV. Notice that to enhance overlapping of communications with computations we tune MPICH on sarek to use non-blocking point-to-point communications for message sizes up to 100 kB.


Table IV. Compilers, with flag settings, and library software used for IBM SP System knut, Super Linux Cluster seth and 64-bit Opteron Linux Cluster sarek. The preprocessing option -DINTEGER8 is used when the memory in bytes for each node in the machine can not be expressed with a standard 4-byte INTEGER.

Software        knut                  seth                    sarek
Compiler        mpxlf 6.1.0.0         pgf77 6.0-5 32-bit      mpif77 1.2.5 64-bit
Compiler flags  -O3 -qstrict          -O2 -tp athlonxp -fast  -fast
                -qarch=pwr2 -qintlog
Preprocessing   -qsuffix=cpp=f        -Mpreprocess            -Mpreprocess
flags                                 -DDYNAMIC               -DDYNAMIC -DINTEGER8
MPI             IBM MPI-1.2           ScaMPI (MPICH 1.2)      MPICH-GM 1.5.2
LAPACK          LAPACK 3.0            LAPACK 3.0              LAPACK 3.0
BLAS            ESSL 3.3.0            ATLAS 3.5.9             GOTO-BLAS r0.94
PBLAS           PESSL 2.3.0           ScaLAPACK 1.7.0         ScaLAPACK 1.7.0
BLACS           PESSL 2.3.0           BLACS 1.1               BLACS 1.1patch3
ScaLAPACK       PESSL 2.3.0           ScaLAPACK 1.7.0         ScaLAPACK 1.7.0
RECSY           0.01alpha             0.01alpha               0.01alpha
SLICOT          SLICOT 4.0            SLICOT 4.0              SLICOT 4.0

Table V. Experimentally measured machine characteristics for IBM SP System knut, Super Linux Cluster seth and 64-bit Opteron Linux Cluster sarek. The parameter ta denotes the time in seconds for performing an arithmetic operation in solving a moderate-sized matrix equation, ts denotes the startup (node latency) time in seconds for message passing and tb denotes the "per-byte" transfer time in seconds. The network parameters are measured by executing MPI-based ping-pong communication in the networks. The ratios ta/ts and ta/tb measure the balance between the processor speed and the network latency and network bandwidth of each machine, respectively.

Parameter   knut        seth        sarek
ta          8.5×10⁻⁹    3.1×10⁻¹⁰   3.2×10⁻¹⁰
ts          4.0×10⁻⁵    3.7×10⁻⁶    4.1×10⁻⁷
tb          1.0×10⁻⁸    1.4×10⁻⁹    2.3×10⁻⁹
ta/ts       2.1×10⁻⁴    8.4×10⁻⁵    7.8×10⁻⁴
ta/tb       0.85        0.22        0.14

7.2 Performance and accuracy metrics

To investigate the benefits of using more than one processor to solve a matrix equation, we compute the parallel speedup as

Sp = T_pmin / Tp,    (26)

where pmin ≥ 1 is the smallest number of processors used to solve the problem and for which the aggregate memory is large enough to store the problem data structures. For pmin = 1, this corresponds to the standard definition (e.g., see [Grama et al. 2003]).
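The definition (26) can be sketched directly; the sample timings below are the PGESYDTD results for m = n = 4096 on seth from Table VII:

```python
def parallel_speedup(times):
    """Speedup (26) relative to the smallest processor count pmin for
    which the problem data fits in the aggregate memory: Sp = T_pmin / Tp.
    times maps processor count p to measured runtime Tp."""
    pmin = min(times)
    return {p: times[pmin] / t for p, t in times.items()}

# PGESYDTD timings (seconds) for m = n = 4096 on seth, from Table VII.
sp = parallel_speedup({1: 17927.0, 4: 6154.0, 16: 2021.0, 30: 1343.0})
print({p: round(s, 1) for p, s in sp.items()})
# {1: 1.0, 4: 2.9, 16: 8.9, 30: 13.3}, matching the Sp column of Table VII
```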

In some of our results, we include Ra, the absolute residual norm, and Rr, the relative residual norm, which gives information about the backward error of the parallel algorithms. For example, if computed using the 1-norm, the relative residual norm should ideally be a small constant (of size O(1)) [Kagstrom and Poromaa 1996]. In Table VI, we display the residual norms associated with the matrix equations in


Table I, where X̃ denotes the computed solution. Notice that the true error in X̃

Table VI. Residual norms for matrix equations. Here, s is a scaling factor for the right hand side to avoid overflow in the solution.

Acronym  Ra                                   nrm in Rr = (ε_mach⁻¹ · Ra)/nrm
SYCT     ‖sC − AX̃ + X̃B‖                       (‖A‖ + ‖B‖)‖X̃‖ + s‖C‖
LYCT     ‖sC − AX̃ − X̃Aᵀ‖                      2‖A‖‖X̃‖ + s‖C‖
SYDT     ‖sC − AX̃Bᵀ + X̃‖                      ‖A‖‖B‖‖X̃‖ + s‖C‖
LYDT     ‖sC − AX̃Aᵀ + X̃‖                      ‖A‖²‖X̃‖ + s‖C‖
GCSY     ‖(sC − AX̃ + ỸB, sF − DX̃ + ỸE)‖      (‖(A, D)‖ + ‖(B, E)‖)‖(X̃, Ỹ)‖ + s‖(C, F)‖
GSYL     ‖sE − AX̃Bᵀ + CX̃Dᵀ‖                   (‖A‖‖B‖ + ‖C‖‖D‖)‖X̃‖ + s‖E‖
GLYCT    ‖sC − AX̃Eᵀ − EX̃Aᵀ‖                   2‖A‖‖E‖‖X̃‖ + s‖C‖
GLYDT    ‖sC − AX̃Aᵀ − EX̃Eᵀ‖                   (‖A‖² + ‖E‖²)‖X̃‖ + s‖C‖

may be much larger than indicated by the residual norms [Higham 1993; Kagstrom 1994].

To investigate the accuracy when the exact solution X is known a priori, we compute the absolute error norm Ea = ‖sX − X̃‖, where X̃ is the computed solution and s is a scaling factor for the right hand side to avoid overflow in the solution³. For example, if X is chosen as a random matrix with uniformly distributed entries in the interval [−1, 1], the right hand side is computed with respect to this exact solution. The relative error norm is computed as Er = Ea/‖X‖⁴.

Below we use the Frobenius norm to compute the absolute residual norms and the error norms, and the 1-norm to compute the relative residual norms⁵.
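The SYCT row of Table VI can be sketched as follows; for simplicity the Frobenius norm is used throughout (whereas the text computes Rr in the 1-norm), and the helper names are ours:

```python
def frob(M):
    """Frobenius norm of a matrix given as a list of lists."""
    return sum(v * v for row in M for v in row) ** 0.5

def syct_residual(A, B, C, X, s=1.0, eps=2.0**-52):
    """Absolute and relative residual norms for SYCT (Table VI):
    Ra = ||sC - AX + XB||, Rr = Ra / (eps * ((||A|| + ||B||)||X|| + s||C||))."""
    m, n = len(A), len(B)
    R = [[s * C[i][j]
          - sum(A[i][k] * X[k][j] for k in range(m))
          + sum(X[i][k] * B[k][j] for k in range(n))
          for j in range(n)] for i in range(m)]
    Ra = frob(R)
    nrm = (frob(A) + frob(B)) * frob(X) + s * frob(C)
    return Ra, Ra / (eps * nrm)

# An exact solution gives a zero residual:
A, B = [[2.0, 0.0], [0.0, 3.0]], [[-1.0]]
X = [[1.0], [1.0]]
C = [[3.0], [4.0]]                   # C = AX - XB for this X
print(syct_residual(A, B, C, X))     # (0.0, 0.0)
```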

In general, for well-conditioned problems the approximate error bound

Ea ≲ est · Ra,    (27)

should hold [Higham 2002], where est is a lower bound estimate of sep⁻¹[ACRO] (see Section 5).

Below, the differently conditioned test problems we consider are generated using the functionality of the matrix (pair) generator routines described in Part II [Granat and Kagstrom 2007].

7.3 Performance and accuracy of general solvers

The implemented general solvers show good scalability and performance, see, e.g., Table VII. In our extensive tests, the relative residuals in Table VI have been as expected (of size O(1)) for a wide range of dimensions and blocking factors, including well-conditioned as well as ill-conditioned problems. However, as displayed in Figure 2, the execution time of the general solver(s) is highly dominated by the reduction to real Schur form. On the other hand, in the 1-norm based condition estimators introduced in Section 5, the triangular solver is called several times, which motivates producing fast and scalable software for the triangular solvers.

³For GCSY, Ea = ‖s(X, Y) − (X̃, Ỹ)‖.
⁴For GCSY, Er = Ea/‖(X, Y)‖.
⁵For GCSY we always use the Frobenius norm.


Table VII. Performance results of PGESYDTD on seth using the block sizes mb = nb = 32. All timings are in seconds.

m = n  Pr × Pc  Time   Sp    |  m = n  Pr × Pc  Time   Sp
1024   1 × 1    293    1.0   |  4096   1 × 1    17927  1.0
1024   2 × 1    307    1.4   |  4096   2 × 1    12127  1.5
1024   2 × 2    119    2.5   |  4096   2 × 2    6154   2.9
1024   3 × 2    85.9   3.4   |  4096   3 × 2    4042   4.4
1024   3 × 3    51.1   5.7   |  4096   3 × 3    2767   6.5
1024   4 × 4    52.4   5.6   |  4096   4 × 4    2021   8.9
1024   5 × 4    44.4   6.6   |  4096   5 × 4    1669   10.7
1024   5 × 5    40.3   7.3   |  4096   5 × 5    1544   11.6
1024   6 × 5    42.4   6.9   |  4096   6 × 5    1343   13.3
1024   6 × 6    36.6   8.0   |  4096   6 × 6    1365   13.1
2048   1 × 1    2241   1.0   |  6144   3 × 2    19328  1.0
2048   2 × 1    1432   1.6   |  6144   3 × 3    10158  1.90
2048   2 × 2    831    2.7   |  6144   4 × 4    5823   3.33
2048   3 × 2    566    4.0   |  6144   5 × 4    5107   3.82
2048   3 × 3    360    6.0   |  6144   5 × 5    4548   4.25
2048   4 × 4    307    7.3   |  6144   6 × 5    4866   3.97
2048   5 × 4    259    8.7   |  6144   6 × 6    3596   5.38
2048   5 × 5    192    11.7  |  8192   5 × 5    8750   1.0
2048   6 × 5    226    9.9   |  8192   6 × 5    8502   1.03
2048   6 × 6    164    13.7  |  8192   6 × 6    6732   1.30

Fig. 2. Left: Execution time profile of PGELYCTD (C = Cᵀ) on knut using the block size nb = 64. Right: Execution time profile of PGESYDTD on seth using the block sizes mb = nb = 32. The results are typical for what we found using multiple processors to solve the problem. [Both panels plot log2(m) against percent of execution time, broken down into Step 1, Step 2+4 and Step 3.]

7.4 Performance of triangular solvers

To illustrate the performance of the triangular solvers, we present performance results for solving well-conditioned SYCT and GSYL equations in Table VIII. In general, about the same performance behavior is observed for all triangular solvers. For large-scale problems, the scaled parallel speedup approaches O(p/k), as expected from the analysis in Section 4. See also Figure 3, where the theoretical model for k = 3.18 is compared with experimental data when the memory load is maintained at 1.5 GB per processor.

It is our experience that the positive effect of multiple pipelining is significant for the one-sided matrix equations. For the two-sided variants, the pipelining does not seem to have any obvious effect on the performance.

The triangular solvers for the standard and generalized Lyapunov equations save about half of the execution time when the symmetry of the right hand side is


Table VIII. Performance of PTRSYCTD using mb = nb = 128, mb2 = nb2 = 64 and PTRGSYLD using mb = nb = 64 on sarek. All timings are in seconds.

                      PTRSYCTD                    PTRGSYLD
m = n   Pr × Pc    Time    Gflops/s  Rr        Time    Gflops/s  Rr
5000    1 × 1      135     1.86      0.2E01    217     3.45      0.2E01
5000    2 × 2      45.3    5.52      0.1E01    148     5.07      0.2E01
5000    4 × 4      19.9    12.60     0.2E01    60.5    12.39     0.2E01
5000    6 × 6      12.6    19.91     0.1E01    34.4    21.80     0.1E01
5000    8 × 8      9.0     27.80     0.6E01    23.4    32.07     0.2E01
5000    10 × 10    6.7     38.45     0.1E01    17.2    43.64     0.2E01
10000   2 × 2      377     5.31      0.2E01    1127    5.06      0.2E01
10000   4 × 4      137     14.63     0.2E01    438     13.71     0.2E01
10000   6 × 6      80.4    24.88     0.2E01    239     25.13     0.1E02
10000   8 × 8      58.1    34.45     0.3E01    161     37.36     0.2E01
10000   10 × 10    40.2    49.77     0.2E01    115     52.07     0.2E01
10000   12 × 12    31.9    62.70     0.2E01    85.7    70.05     0.2E01
10000   14 × 14    26.3    75.92     0.2E01    70.7    84.96     0.2E01
10000   16 × 16    22.8    87.61     0.3E01    56.9    105.45    0.2E01
15000   4 × 4      450     14.99     0.2E01    1437    14.09     0.2E01
15000   6 × 6      259     26.07     0.2E01    771     26.27     0.2E01
15000   8 × 8      182     37.14     0.3E01    514     39.43     0.2E01
15000   10 × 10    127     53.03     0.2E01    361     56.08     0.2E01
15000   12 × 12    96.2    70.17     0.2E01    270     75.14     0.2E01
15000   14 × 14    79.6    84.77     0.2E01    190.9   106.30    0.2E01
15000   16 × 16    68.0    99.30     0.3E01    172.5   117.40    0.2E02
20000   4 × 4      1146    13.96     0.3E01    3351    14.33     0.2E01
20000   6 × 6      592     27.03     0.2E01    1786    26.88     0.2E01
20000   8 × 8      422     37.91     0.3E01    1172    40.97     0.2E01
20000   10 × 10    283     56.52     0.2E01    828     57.99     0.2E01
20000   12 × 12    217     73.59     0.3E01    608     78.95     0.2E01
20000   14 × 14    177.3   90.24     0.2E01    496     96.63     0.2E01
20000   16 × 16    149     107.61    0.4E01    387     124.08    0.2E02
30000   6 × 6      2037    26.51     0.2E01    6057    26.75     0.2E01
30000   8 × 8      1368    39.48     0.4E01    3900    41.54     0.2E01
30000   10 × 10    916     58.97     0.3E01    2631    61.57     0.2E01
30000   12 × 12    691     78.20     0.3E01    1924    84.19     0.2E02
30000   14 × 14    558.5   96.67     0.3E01    1555    104.16    0.1E02
30000   16 × 16    469     115.24    0.5E01    1234    131.24    0.1E02
40000   8 × 8      3142    40.74     0.5E01    8912    43.09     0.3E01
40000   10 × 10    2152    59.47     0.3E01    6151    62.43     0.2E01
40000   12 × 12    1606    79.70     0.3E01    4481    85.69     0.2E01
40000   14 × 14    1303    98.20     0.3E01    3636    105.62    0.2E01
40000   16 × 16    1067    119.97    0.6E01    2835    135.46    0.9E01
50000   10 × 10    4142    60.36     0.3E01    11677   64.23     0.2E01
50000   12 × 12    3095    80.78     0.4E01    8771    85.51     0.2E01
50000   14 × 14    2517    99.34     0.3E01    7133    105.15    0.2E01
50000   16 × 16    2059    122.03    0.6E01    5536    135.49    0.9E01
60000   12 × 12    5356    80.66     0.4E01    14716   88.07     0.3E01
60000   14 × 14    4274    101.08    0.3E01    11567   112.05    0.3E01
60000   16 × 16    3505    123.26    0.8E01    9160    141.48    0.3E01
70000   14 × 14    6631    103.5     0.6E01    18012   114.26    0.2E01
70000   16 × 16    5507    124.57    0.7E01    14376   143.16    0.3E01
80000   16 × 16    8105    126.34    0.9E01    21101   145.58    0.4E01

considered. Their accuracy is also illustrated in Table X for PTRLYCTD solving differently conditioned problems (see also the next section).

7.5 Reliability and performance of condition estimators

We discuss the reliability and performance of the condition estimators described in Section 5. For illustration, we consider LYCT and GSYL. Below, est denotes the lower bound estimate of sep⁻¹[ACRO] computed by the corresponding condition estimator, such that a large value of est signals ill-conditioning.


126 · R. Granat and B. Kagstrom

Fig. 3. Experimental and modelled performance of PTRSYCTD using mb = nb = 64, a constant memory load per processor of 1.5GB on sarek, and the model parameter estimates ta = 2.3E-10, ts = 4.1E-7 and tb = 4.3E-9.

[Figure: plot titled "Performance of SCASY with 2.5 Gbyte workload per cpu", Gflops/sec. versus #cpus (0-300), showing the measured PTRSYCTD curve against the model with k = 3.18.]

7.5.1 Reliability. Our estimator PLYCTCON produces correct estimates est = sep^{-1}[LYCT] if 0 < q = est · σ_min(Z_LYCT) ≤ 1. Notice that the exact value should agree for both transpose modes of the general LYCT problem op(A)X + Xop(A^T) = C, since the corresponding Kronecker product representations of both Lyapunov operators have the same smallest singular value (at least in exact arithmetic).

In Table IX, we present uniprocessor results for PLYCTCON where the computed estimate est is compared to the exact value of σ_min^{-1}(Z_LYCT). Even though the matrices Z_LYCT and Z_LYCT^T have the same smallest singular value, our estimator computes different values of est depending on op(A). This depends on (at least) two things:

—est is based on an estimation of the 1-norm of Z_LYCT, and in general ‖A‖_1 ≠ ‖A^T‖_1, and

—the initial search direction set in PDLACON may suit different transpose modes differently well.

To remedy such situations, the user could make a second call to the condition estimator and compute the estimate as est = max(est_{op(A)=A}, est_{op(A)=A^T}).

Notice that PLYCTCON sometimes computes an overestimate of the lower bound on ‖Z_LYCT^{-1}‖, for two of the problems in Table X (Ex. 4, m = 4096) and for one in Table IX (Ex. 4, m = 64). Such overestimates can occur when σ_min(Z_LYCT) is close to zero in machine precision, which gives solutions of some subsystems with very large components and/or cancellation of terms in the update and solve phases of the triangular solver [Kagstrom and Poromaa 1992]. However, the severe ill-conditioning is signaled by the condition estimates, which illustrates a good qualitative behaviour of our parallel estimators.
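The 1-norm estimation scheme underlying PDLACON can be illustrated in miniature. The following Python/NumPy sketch is our own illustration, not the Fortran implementation: it applies a Hager-style iteration (the idea behind LAPACK's DLACON and ScaLAPACK's PDLACON) to the inverse of the explicit Kronecker matrix Z of a small SYCT operator, producing a lower bound on ‖Z^{-1}‖_1 using only solves with Z and Z^T. The function name, matrix sizes, and diagonal shifts are assumptions made for the example.

```python
import numpy as np

def hager_inverse_onenorm(solve, n, maxiter=5):
    """Hager-style lower bound on ||Z^{-1}||_1 using only solves with Z and
    Z^T (the same idea as LAPACK's DLACON / ScaLAPACK's PDLACON).
    `solve(v, trans)` must return Z^{-1} v (trans=False) or Z^{-T} v."""
    x = np.full(n, 1.0 / n)              # initial search direction, ||x||_1 = 1
    est = 0.0
    for _ in range(maxiter):
        y = solve(x, trans=False)        # y = Z^{-1} x
        est = max(est, np.abs(y).sum())  # ||y||_1 <= ||Z^{-1}||_1: a valid lower bound
        z = solve(np.sign(y), trans=True)
        j = int(np.argmax(np.abs(z)))
        if np.abs(z[j]) <= z @ x:        # no more promising unit vector found
            break
        x = np.zeros(n)
        x[j] = 1.0                       # next search direction: e_j
    return est

# Small SYCT operator A X + X B = C with Kronecker matrix Z = I (x) A + B^T (x) I
rng = np.random.default_rng(1)
m, n = 5, 4
A = rng.standard_normal((m, m)) + 3.0 * np.eye(m)  # shifts keep Z safely nonsingular
B = rng.standard_normal((n, n)) + 3.0 * np.eye(n)
Z = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(m))

est = hager_inverse_onenorm(lambda v, trans: np.linalg.solve(Z.T if trans else Z, v),
                            m * n)
```

Since every trial vector has unit 1-norm, est never exceeds ‖Z^{-1}‖_1, which mirrors the lower-bound character of the estimates discussed above.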

In Table X, we present results for est, iter, Ea and Er invoking PTRLYCTD and


Parallel Algorithms for Sylvester-Type Matrix Equations · 127

Table IX. Reliability of the LYCT condition estimator on knut with Pr = Pc = 1. Examples referred to are the same as in Table X with mb = 1. For this table, Z = Z_LYCT. The quantity σ_min^{-1}(Z_LYCT) is computed explicitly using the LAPACK routine DGESVD.

(Left four columns: op(A) = A. Right four columns: op(A) = A^T.)

Ex.  m    est       iter  q         σ_min^{-1}(Z)   est       iter  q         σ_min^{-1}(Z)
1    1    0.500     1     1.00      0.500           0.500     1     1.00      0.500
1    2    0.149     4     0.578     0.258           0.149     4     0.577     0.258
1    4    0.344E-1  4     0.273     0.126           0.428E-1  4     0.340     0.126
1    8    0.820E-2  4     0.130     0.627E-1        0.967E-2  5     0.154     0.627E-1
1    16   0.200E-2  4     0.639E-1  0.313E-1        0.220E-2  5     0.703E-1  0.313E-1
1    32   0.494E-3  4     0.377E-1  0.156E-1        0.532E-3  5     0.341E-1  0.156E-1
1    64   0.122E-3  4     0.156E-1  0.781E-2        0.128E-3  5     0.164E-2  0.781E-1
2    1    0.500     1     1.00      0.500           0.500     1     1.00      0.500
2    2    0.250     4     0.448     0.562           0.324     5     0.578     0.562
2    4    0.125     4     0.220     0.568           0.122     4     0.215     0.568
2    8    0.625E-1  4     0.109     0.573           0.230     5     0.401     0.573
2    16   0.313E-1  4     0.526E-1  0.595           0.154     5     0.260     0.595
2    32   0.156E-1  4     0.260E-1  0.601           0.940E-1  5     0.156     0.601
2    64   0.781E-2  4     0.129E-1  0.606           0.680E-1  5     0.112     0.606
3    1    0.500     1     1.00      0.500           0.500     1     1.00      0.500
3    2    0.250     4     0.444     0.563           0.324     5     0.575     0.563
3    4    0.125     4     0.170     0.734           0.220     4     0.300     0.734
3    8    0.293     5     0.117     2.50            2.05      5     0.820     2.50
3    16   0.523     5     0.220E-2  2.38E2          14.0      5     0.588E-1  2.38E2
3    32   3.87E3    5     0.834E-2  4.64E5          1.83E5    5     0.394     4.64E5
3    64   8.88E9    5     0.722E-3  1.23E13         7.59E9    5     0.562E-3  1.35E13
4    1    0.500     1     1.00      0.500           0.500     1     1.00      0.500
4    2    0.250     4     0.342     0.732           0.411     5     0.561     0.732
4    4    0.548     4     0.301     1.82            0.578     4     0.318     1.82
4    8    4.16      5     0.654E-2  63.6            1.23      5     0.193E-1  63.6
4    16   69.8      5     0.703E-3  9.93E4          1.47E3    5     0.147E-1  9.97E4
4    32   2.99E11   5     0.43      6.93E11         5.45E11   4     0.786     6.93E11
4    64   6.62E23   5     3.72E8    1.78E15         2.44E25   4     1.10E10   2.21E15

PLYCTCON for differently conditioned problems and a random solution matrix X0 with uniformly distributed entries in the interval [−1, 1]. It is obvious that the ability of the solver to compute a reliable solution depends very much on the conditioning of the underlying matrix equation. When going from Example 1 to 4, the ill-conditioning is gradually increased by manipulating the eigenvalues and the distance from normality of the matrix A, up to a point where est signals that no reliable solution can be computed.
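This effect can be reproduced in miniature. The Python/NumPy sketch below is our own illustration, following the A = αDA + βMA construction used in Table X (the size n = 8 and the value β = 32 are assumptions for the example): it computes sep[LYCT] = σ_min(Z_LYCT) explicitly via the Kronecker representation and shows it collapsing as β, and hence the distance from normality, grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
D = np.diag(np.arange(1.0, n + 1.0))         # diagonal part with well-separated eigenvalues
M = np.triu(rng.standard_normal((n, n)), 1)  # random strictly upper triangular part

def sep_lyct(A):
    # sep[LYCT] = sigma_min(Z_LYCT), where Z_LYCT = I (x) A + A (x) I is the
    # Kronecker matrix of the Lyapunov operator X -> A X + X A^T
    Z = np.kron(np.eye(n), A) + np.kron(A, np.eye(n))
    return np.linalg.svd(Z, compute_uv=False)[-1]

sep_mild = sep_lyct(D + 1.0 * M)     # alpha = beta = 1 (Example 2 flavour)
sep_severe = sep_lyct(D + 32.0 * M)  # larger beta: A pushed far from normal
```

Note that both matrices have exactly the same eigenvalues; only the non-normal part grows, yet sep[LYCT] shrinks dramatically.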

7.5.2 Performance. In Table XI, we present performance results for the parallel GSYL condition estimator PGSYLCON solving well-conditioned problems using the corresponding parallel triangular GSYL solver PTRGSYLD. For this table, iter is the number of iterations and calls to the triangular solver PTRGSYLD, est is the lower bound estimate of sep^{-1}[GSYL], and Ra, Rr, Ea and Er correspond to the absolute and relative residual and error norms (see Table VI).

The condition estimators consist of three major parts (see also Algorithm 3): building the right hand sides in each process column by row-wise all-to-all broadcasts, computing the next estimate (i.e., calling PDLACON), and solving the triangular


Table X. Accuracy of PTRLYCTD on knut solving the general LYCT problem op(A)X + Xop(A^T) = C using the block size nb = 64 and Pr = Pc = 4. Here A = αDA + βMA, where DA is a random diagonal matrix (2×2 blocks are allowed), MA is a random strictly upper triangular matrix, and α and β are scalars, as follows: Example 1: α = n, β = 1. Example 2: α = 1, β = 1. Example 3: α = 1, β = n/(2nb). Example 4: α = 1, β = n/nb.

Ex.  m     op(A)  est      iter  Ea       Er
1    1024  A      2.5E-07  5     0.5E-12  0.9E-14
1    2048  A      4.9E-08  5     0.1E-11  0.1E-14
1    4096  A      1.4E-08  4     0.4E-11  0.2E-14
1    1024  A^T    2.5E-07  5     0.1E-11  0.2E-14
1    2048  A^T    4.9E-08  5     0.3E-11  0.4E-14
1    4096  A^T    1.4E-08  5     0.9E-11  0.4E-14
2    1024  A      3.5E-04  4     0.9E-12  0.2E-14
2    2048  A      1.6E-04  4     0.2E-11  0.2E-14
2    4096  A      7.9E-05  4     0.7E-11  0.3E-14
2    1024  A^T    4.1E-04  5     0.2E-11  0.3E-14
2    2048  A^T    2.3E-04  5     0.8E-11  0.7E-14
2    4096  A^T    1.5E-04  5     0.3E-10  0.1E-13
3    1024  A      1.1E-02  5     0.2E-9   0.4E-12
3    2048  A      1.0      5     0.3E-6   0.2E-9
3    4096  A      8.2E04   5     0.47     0.2E-3
3    1024  A^T    7.9      5     0.5E-11  0.8E-14
3    2048  A^T    3.8E03   5     0.3E-10  0.2E-13
3    4096  A^T    2.4E09   5     0.5E-5   0.2E-8
4    1024  A      3.0      5     0.7E-6   0.1E-8
4    2048  A      1.5E04   5     0.4E-1   0.4E-4
4    4096  A      6.9E14   5     5.0E10   2.1E6
4    1024  A^T    1.7E04   5     0.1E-10  0.2E-13
4    2048  A^T    3.8E08   5     0.4E-6   0.4E-9
4    4096  A^T    1.2E20   5     0.3E5    53.8

Table XI. Condition estimation of GSYL invoking PGSYLCON on sarek using the block size mb = nb = 64. All timings are in seconds. For this table, (A, C) and (B, D) are chosen as random upper triangular matrices with specified eigenvalues λ^(i)(A,C) = i and λ^(i)(B,D) = −i, respectively. The known solution X is a random matrix with uniformly distributed entries in the interval [−1, 1].

m = n  Pr×Pc  Time    iter  est      Ra       Rr       Ea       Er
1024   1×1    12.7    5     0.6E-03  0.2E-06  0.1E+01  0.1E-11  0.2E-14
1024   2×2    7.4     5     0.6E-03  0.1E-06  0.1E+01  0.1E-11  0.2E-14
1024   4×4    3.9     5     0.6E-03  0.1E-06  0.1E+01  0.1E-11  0.2E-14
1024   8×8    2.1     5     0.6E-03  0.1E-06  0.1E+01  0.1E-11  0.2E-14
1024   16×16  2.8     5     0.5E-03  0.1E-06  0.1E+01  0.9E-12  0.1E-14
2048   1×1    86.6    5     0.3E-03  0.1E-05  0.1E+01  0.3E-11  0.2E-14
2048   2×2    48.3    5     0.3E-03  0.1E-05  0.1E+01  0.2E-11  0.2E-14
2048   4×4    21.9    5     0.3E-03  0.1E-05  0.1E+01  0.3E-11  0.2E-14
2048   8×8    9.7     5     0.3E-03  0.1E-05  0.1E+01  0.2E-11  0.2E-14
2048   16×16  6.2     5     0.3E-03  0.1E-05  0.1E+01  0.2E-11  0.2E-14
4096   1×1    923.9   7     0.2E-03  0.1E-04  0.1E+01  0.7E-11  0.3E-14
4096   2×2    503.3   7     0.2E-03  0.1E-04  0.1E+01  0.6E-11  0.2E-14
4096   4×4    193.8   7     0.2E-03  0.1E-04  0.1E+01  0.6E-11  0.2E-14
4096   8×8    77.5    7     0.2E-03  0.1E-04  0.1E+01  0.8E-11  0.3E-14
4096   16×16  28.5    5     0.1E-03  0.9E-05  0.1E+01  0.6E-11  0.3E-14
8192   1×1    5302.4  5     0.8E-04  0.1E-03  0.1E+01  0.1E-10  0.3E-14
8192   2×2    2625.9  5     0.8E-04  0.1E-03  0.1E+01  0.1E-10  0.3E-14
8192   4×4    904.5   5     0.8E-04  0.1E-03  0.1E+01  0.1E-10  0.3E-14
8192   8×8    331.4   5     0.8E-04  0.1E-03  0.1E+01  0.1E-10  0.3E-14
8192   16×16  165.4   5     0.7E-04  0.9E-04  0.1E+01  0.1E-10  0.2E-14
16384  8×8    4611.0  7     0.4E-04  0.9E-03  0.2E+01  0.4E-10  0.4E-14
16384  16×16  1560.1  7     0.4E-04  0.8E-03  0.2E+01  0.3E-10  0.3E-14

matrix equation. The work is dominated by the third step, and this dominance even increases with the problem size [Granat and Kagstrom 2006b], which confirms that


extra communication in the row-wise all-to-all broadcasts is not a bottleneck of the estimator.

Finally, we see that the approximate error bound (27) is fulfilled for all cases in Table XI, which indicates reliable condition estimates.

8. ACKNOWLEDGEMENTS

The authors are grateful to Bjorn Adlerborn, Isak Jonsson, Lars Karlsson and Daniel Kressner for fruitful discussions on the subject.

REFERENCES

Adlerborn, B., Dackland, K., and Kagstrom, B. 2001. Parallel Two-Stage Reduction of a Regular Matrix Pair to Hessenberg-Triangular Form. In Applied Parallel Computing: New Paradigms for HPC Industry and Academia, T. Sørvik et al., Eds. Lecture Notes in Computer Science, vol. 1947. Springer, 92–102.

Adlerborn, B., Dackland, K., and Kagstrom, B. 2002. Parallel and blocked algorithms for reduction of a regular matrix pair to Hessenberg-triangular and generalized Schur forms. In Applied Parallel Computing PARA 2002, J. Fagerholm et al., Eds. Lecture Notes in Computer Science, vol. 2367. Springer-Verlag, 319–328.

Adlerborn, B., Kressner, D., and Kagstrom, B. 2006. Parallel Variants of the Multishift QZ Algorithm with Advanced Deflation Techniques. In PARA'06 - State of the Art in Scientific and Parallel Computing. Lecture Notes in Computer Science, vol. 4699. Springer, 2007 (to appear).

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J. W., Dongarra, J. J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. C. 1999. LAPACK Users' Guide, Third ed. SIAM, Philadelphia, PA.

Bartels, R. H. and Stewart, G. W. 1972. Algorithm 432: Solution of the Matrix Equation AX + XB = C. Communications of the ACM 15, 9, 820–826.

Benner, P., Quintana-Ortí, E., and Quintana-Ortí, G. 2002. Numerical Solution of Discrete Stable Linear Matrix Equations on Multicomputers. Parallel Algorithms and Applications 17, 1, 127–146.

Benner, P., Quintana-Ortí, E., and Quintana-Ortí, G. 2004. Solving Stable Sylvester Equations via Rational Iterative Schemes. Preprint SFB393/04-08, TU Chemnitz.

Benner, P. and Quintana-Ortí, E. S. 1999. Solving Stable Generalized Lyapunov Equations with the Matrix Sign Function. Numerical Algorithms 20, 1, 75–100.

Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J. W., Dhillon, I., Dongarra, J. J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. C. 1997. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA.

Braman, K., Byers, R., and Mathias, R. 2002. The multishift QR algorithm, II: Aggressive early deflation. SIAM J. Matrix Anal. Appl. 23, 4, 948–973.

Claver, J. M. 1999. Parallel Wavefront Algorithms Solving Lyapunov Equations for the Cholesky Factor on Message Passing Multiprocessors. The Journal of Supercomputing 13, 2, 171–189.

Dackland, K. and Kagstrom, B. 1999. Blocked Algorithms and Software for Reduction of a Regular Matrix Pair to Generalized Schur Form. ACM Trans. Math. Software 25, 4, 425–454.

Datta, B. N. 2004. Numerical Methods for Linear Control Systems Design and Analysis. Elsevier Academic Press, New York.

Dongarra, J. J., Du Croz, J., Duff, I. S., and Hammarling, S. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Software 16, 1–17.

Edmundsson, N., Elmroth, E., Kagstrom, B., M. M., Nylen, M., Sandgren, A., and Wadenstein, M. 2004. Design and Evaluation of a TOP100 Linux Super Cluster System. Concurrency and Computation: Practice and Experience (in press).

Golub, G. H., Nash, S., and Van Loan, C. F. 1979. A Hessenberg-Schur method for the problem AX + XB = C. IEEE Trans. Automat. Control 24, 6, 909–913.

Grama, A., Gupta, A., Karypis, G., and Kumar, V. 2003. Introduction to Parallel Computing, Second Edition. Addison-Wesley.

Granat, R., Jonsson, I., and Kagstrom, B. 2004. Combining Explicit and Recursive Blocking for Solving Triangular Sylvester-Type Matrix Equations on Distributed Memory Platforms. In Euro-Par 2004 Parallel Processing, M. Danelutto, D. Laforenza, and M. Vanneschi, Eds. Lecture Notes in Computer Science, vol. 3149. Springer, 742–750.

Granat, R. and Kagstrom, B. 2006a. Evaluating Parallel Algorithms for Solving Sylvester-Type Matrix Equations: Direct Transformation-Based versus Iterative Matrix-Sign-Function-Based Methods. In PARA 2004 - Applied Parallel Computing. State of the Art in Scientific Computing, J. Dongarra, K. Madsen, and J. Wasniewski, Eds. Lecture Notes in Computer Science, vol. 3732. Springer, 719–729.

Granat, R. and Kagstrom, B. 2006b. Parallel Algorithms and Condition Estimators for Standard and Generalized Triangular Sylvester-type Matrix Equations. In PARA'06 - State of the Art in Scientific and Parallel Computing. Lecture Notes in Computer Science, vol. 4699. Springer, 2007 (to appear).

Granat, R. and Kagstrom, B. 2007. Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part II: The SCASY Software Library. ACM Transactions on Mathematical Software (submitted 2007).

Granat, R., Kagstrom, B., and Poromaa, P. 2003. Parallel ScaLAPACK-style Algorithms for Solving Continuous-Time Sylvester Equations. In Euro-Par 2003 Parallel Processing, H. Kosch et al., Eds. Lecture Notes in Computer Science, vol. 2790. Springer, 800–809.

Hager, W. 1984. Condition estimates. SIAM J. Sci. Statist. Comput. 5, 311–316.

Hammarling, S. J. 1982. Numerical Solution of the Stable, Non-negative Definite Lyapunov Equation. IMA Journal of Numerical Analysis 2, 303–323.

Henry, G., Watkins, D. S., and Dongarra, J. J. 2002. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. SIAM J. Sci. Comput. 24, 1, 284–311.

Higham, N. J. 1988. Fortran codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation. ACM Trans. Math. Software 14, 4, 381–396.

Higham, N. J. 1993. Perturbation theory and backward error for AX − XB = C. BIT 33, 1, 124–136.

Higham, N. J. 2002. Accuracy and Stability of Numerical Algorithms, Second ed. SIAM, Philadelphia, PA.

Higham, N. J. and Tisseur, F. 2000. A block algorithm for matrix 1-norm estimation, with an application to 1-norm pseudospectra. SIAM J. Matrix Anal. Appl. 21, 4, 1185–1201.

Jonsson, I. and Kagstrom, B. 2002a. Recursive blocked algorithms for solving triangular systems. Part I. One-sided and coupled Sylvester-type matrix equations. ACM Trans. Math. Software 28, 4, 392–415.

Jonsson, I. and Kagstrom, B. 2002b. Recursive blocked algorithms for solving triangular systems. Part II. Two-sided and generalized Sylvester and Lyapunov matrix equations. ACM Trans. Math. Software 28, 4, 416–435.

Jonsson, I. and Kagstrom, B. 2003. RECSY - A High Performance Library for Solving Sylvester-Type Matrix Equations. In Euro-Par 2003 Parallel Processing, H. Kosch et al., Eds. Lecture Notes in Computer Science, vol. 2790. Springer, 810–819.

Kagstrom, B. 1994. A perturbation analysis of the generalized Sylvester equation (AR − LB, DR − LE) = (C, F). SIAM J. Matrix Anal. Appl. 15, 4, 1045–1060.

Kagstrom, B. and Kressner, D. 2006. Multishift Variants of the QZ Algorithm with Aggressive Early Deflation. SIAM Journal on Matrix Analysis and Applications 29, 1, 199–227.

Kagstrom, B., Ling, P., and Van Loan, C. 1998a. GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark. ACM Trans. Math. Software 24, 3, 268–302.

Kagstrom, B., Ling, P., and Van Loan, C. 1998b. Algorithm 784: GEMM-Based Level 3 BLAS: Portability and Optimization Issues. ACM Trans. Math. Software 24, 3, 303–316.

Kagstrom, B. and Poromaa, P. 1992. Distributed and shared memory block algorithms for the triangular Sylvester equation with sep^{-1} estimators. SIAM J. Matrix Anal. Appl. 13, 1, 90–101.

Kagstrom, B. and Poromaa, P. 1996. Computing eigenspaces with specified eigenvalues of a regular matrix pair (A, B) and condition estimation: theory, algorithms and software. Numer. Algorithms 12, 3-4, 369–407.

Konstantinov, M., Gu, D.-W., Mehrmann, V., and Petkov, P. 2003. Perturbation Theory for Matrix Equations. Studies in Computational Mathematics, 9. Elsevier Academic Press.

Kressner, D. 2006. Block variants of Hammarling's method for solving Lyapunov equations. UMINF report, Department of Computing Science, Umea University, Sweden. To appear in ACM Trans. Math. Software.

Moler, C. B. and Stewart, G. W. 1973. An algorithm for generalized matrix eigenvalue problems. SIAM J. Numer. Anal. 10, 241–256.

O'Leary, D. P. and Stewart, G. W. 1985. Data-flow algorithms for parallel matrix computations. Communications of the ACM 28, 840–853.

Poromaa, P. 1998. Parallel Algorithms for Triangular Sylvester Equations: Design, Scheduling and Scalability Issues. In Applied Parallel Computing. Large Scale Scientific and Industrial Problems, B. Kagstrom et al., Eds. Lecture Notes in Computer Science, vol. 1541. Springer, 438–446.

Quintana-Ortí, E. S. and van de Geijn, R. A. 2003. Formal derivation of algorithms: The triangular Sylvester equation. ACM Transactions on Mathematical Software 29, 2 (June), 218–243.

RECSY. RECSY - High Performance library for Sylvester-type matrix equations. See http://www.cs.umu.se/research/parallel/recsy.

SCASY. SCASY - ScaLAPACK-style solvers for Sylvester-type matrix equations. See http://www.cs.umu.se/research/parallel/scasy.

Stewart, G. W. and Sun, J.-G. 1990. Matrix Perturbation Theory. Academic Press, New York.


IV


Paper IV

Parallel Solvers for Sylvester-Type Matrix Equations with Applications in Condition Estimation, Part II: The SCASY Software Library∗

Robert Granat1 and Bo Kagstrom1

1 Department of Computing Science and HPC2N, Umea University, SE-901 87 Umea, Sweden. E-mail: {granat, bokg}@cs.umu.se

Abstract: We continue our presentation of parallel ScaLAPACK-style algorithms for solving Sylvester-type matrix equations. In Part II, we present SCASY, a state-of-the-art HPC software library for solving 44 sign and transpose variants of eight common standard and generalized Sylvester-type matrix equations. The internal design of the library, Fortran interfaces and implementation issues are discussed in some detail. In addition, experimental results from a distributed memory platform with NUMA nodes illustrate the performance of SCASY when linked with the OpenMP version of the node solver library RECSY and a threaded BLAS implementation. This demonstrates SCASY's novel capacity and functionality in being able to concurrently handle both the message passing model and the multiple threading model for parallel computing.

Key words: Parallel Computing, Parallel Algorithms, Eigenvalue problems, Condition estimation, Sylvester matrix equations.

∗ Submitted to ACM Transactions on Mathematical Software, July 2007.



Parallel Solvers for Sylvester-Type Matrix Equations with Applications in Condition Estimation, Part II: The SCASY Software Library

R. GRANAT and B. KAGSTROM

Umea University, Sweden

We continue our presentation of parallel ScaLAPACK-style algorithms for solving Sylvester-type matrix equations. In Part II, we present SCASY, a state-of-the-art HPC software library for solving 44 sign and transpose variants of eight common standard and generalized Sylvester-type matrix equations. The internal design of the library, Fortran interfaces and implementation issues are discussed in some detail. In addition, experimental results from a distributed memory platform with NUMA nodes illustrate the performance of SCASY when linked with the OpenMP version of the node solver library RECSY and a threaded BLAS implementation. This demonstrates SCASY's novel capacity and functionality in being able to concurrently handle both the message passing model and the multiple threading model for parallel computing.

Categories and Subject Descriptors: F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems—Computation on matrices; G.1.3 [Numerical Analysis]: Numerical Linear Algebra—Conditioning, Linear Systems; G.4 [Mathematical Software]: Algorithm Design and Analysis, Reliability and Robustness

General Terms: Parallel Computing, Parallel Algorithms

Additional Key Words and Phrases: Eigenvalue problems, Condition estimation, Sylvester matrix equations

1. INTRODUCTION

In Part I [Granat and Kagstrom 2007a], we derived and analyzed ScaLAPACK-style algorithms for solving eight common standard and generalized Sylvester-type matrix equations. In this contribution, we continue our presentation from Part I by introducing SCASY, a high performance computing (HPC) software library that reliably solves 44 sign and transpose variants of the matrix equations listed in Table I. Indeed, only 40 equations are possible to deduce from Table I; the four remaining concern GCSY (see Section 2.2.7). The operator op(·) denotes a possible (implicit) transpose operation on the matrix argument. For example, op(A) denotes A or A^T. For the generalized matrix equations, the same transpose mode applies to both matrices in the pairs: (op(A), op(D)) and (op(B), op(E)) for GCSY; (op(A), op(C)) and (op(B), op(D)) for GSYL; and (op(A), op(E)) for GLYCT and GLYDT.

SCASY is also designed to work on submatrices of globally distributed matrices. In the following, sub(A) denotes a submatrix of the matrix A.

All algorithms are blocked variants based on the Bartels–Stewart method [Bartels and Stewart 1972] and involve four major steps: reduction to triangular Schur form, updating the right hand side with respect to the reduction, computing the solution to the reduced triangular problem, and transforming the solution back to the original coordinate system. Below, we briefly overview our parallel algorithms. For details and more references, we refer to Part I [Granat and Kagstrom 2007a].

Technical Report UMINF-07.16. Authors' addresses: Department of Computing Science, Umea University, SE-901 87, Umea. E-mail: {granat, bokg}@cs.umu.se. The research was conducted using the resources of the High Performance Computing Center North (HPC2N). Financial support was provided by the Swedish Research Council under grant VR 621-2001-3284 and by the Swedish Foundation for Strategic Research under grant A3 02:128.

Table I. Sign and transpose variants of the Sylvester-type matrix equations considered in SCASY. CT and DT denote the continuous-time and discrete-time variants, respectively. The scalar s, where 0 < s ≤ 1.0, is a scaling factor used to prevent overflow in the solution process.

Name                           Matrix Equation                                   Acronym
Standard CT Sylvester          op(A)X ± Xop(B) = sC                              SYCT
Standard CT Lyapunov           op(A)X + Xop(A^T) = sC                            LYCT
Standard DT Sylvester          op(A)Xop(B) ± X = sC                              SYDT
Standard DT Lyapunov           op(A)Xop(A^T) − X = sC                            LYDT
Generalized Coupled Sylvester  op(A)X ± Y op(B) = sC, op(D)X ± Y op(E) = sF      GCSY
Generalized Sylvester          op(A)Xop(B) ± op(C)Xop(D) = sE                    GSYL
Generalized CT Lyapunov        op(A)Xop(E^T) + op(E)Xop(A^T) = sC                GLYCT
Generalized DT Lyapunov        op(A)Xop(A^T) − op(E)Xop(E^T) = sC                GLYDT

Reduction to triangular form is performed by separate Hessenberg reductions of each involved matrix (Hessenberg-triangular reductions for each matrix pair in the generalized matrix equations), followed by applying the QR algorithm, which produces a real Schur form (the QZ algorithm is used for a matrix pair, producing a generalized real Schur form). The updates of the right hand side(s) before and after solving a reduced triangular matrix equation are performed as matrix-matrix multiplications (GEMM operations).
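The four major steps can be made concrete in miniature. The Python/SciPy sketch below is an illustration only, not the SCASY implementation: SciPy's serial schur and solve_sylvester stand in for the parallel reduction and triangular solve, and the matrix sizes and diagonal shift are assumptions for the example.

```python
import numpy as np
from scipy.linalg import schur, solve_sylvester

rng = np.random.default_rng(0)
m, n = 6, 5
A = rng.standard_normal((m, m))
B = rng.standard_normal((n, n)) + 4.0 * np.eye(n)  # keep lambda(A) + lambda(B) away from 0
C = rng.standard_normal((m, n))

# Step 1: reduce A and B to real Schur form, A = QA TA QA^T and B = QB TB QB^T
TA, QA = schur(A)
TB, QB = schur(B)
# Step 2: update the right hand side with respect to the reduction
Ct = QA.T @ C @ QB
# Step 3: solve the reduced (quasi-)triangular problem TA Xt + Xt TB = Ct
Xt = solve_sylvester(TA, TB, Ct)
# Step 4: transform the solution back to the original coordinate system
X = QA @ Xt @ QB.T
```

The back-transformed X then satisfies the original SYCT equation AX + XB = C up to rounding error.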

By using explicit blocking and two-dimensional (2D) cyclic distribution of the matrices over a rectangular Pr × Pc grid, the solution to the triangular systems can be computed block (pair) by block (pair) using a wavefront-like traversal of the block (anti-)diagonals of the right hand side matrix (pair). Each computed subsolution block is then used in level-3 updates of the currently unsolved part of the right hand side.
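The wavefront traversal can be sketched serially. The code below is our own illustration of the blocked scheme for triangular SYCT, not the SCASY routine PTRSYCTD: for simplicity it assumes strictly upper triangular inputs (so no 2×2 Schur bump straddles a block boundary), and the block size and matrices are assumptions for the example.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def trsyct_wavefront(A, B, C, nb):
    """Serial sketch of the blocked wavefront scheme for the triangular SYCT
    equation A X + X B = C, with A (m x m) and B (n x n) upper triangular.
    Blocks on the same anti-diagonal are mutually independent, which is what
    the parallel algorithm exploits."""
    m, n = C.shape
    rows = [slice(i, min(i + nb, m)) for i in range(0, m, nb)]
    cols = [slice(j, min(j + nb, n)) for j in range(0, n, nb)]
    p, q = len(rows), len(cols)
    X = C.copy()
    # anti-diagonal d holds the blocks (i, j) with (p - 1 - i) + j == d;
    # the wavefront starts at the south-west corner block (p - 1, 0)
    for d in range(p + q - 1):
        for i in range(p):
            j = d - (p - 1 - i)
            if not (0 <= j < q):
                continue
            ri, cj = rows[i], cols[j]
            # solve the small triangular subsystem on the current wavefront
            X[ri, cj] = solve_sylvester(A[ri, ri], B[cj, cj], X[ri, cj])
            # level-3 updates of the still-unsolved right hand side blocks
            for k in range(i):                 # block rows above
                X[rows[k], cj] -= A[rows[k], ri] @ X[ri, cj]
            for k in range(j + 1, q):          # block columns to the right
                X[ri, cols[k]] -= X[ri, cj] @ B[cj, cols[k]]
    return X

rng = np.random.default_rng(2)
TA = np.triu(rng.standard_normal((9, 9)))
np.fill_diagonal(TA, np.arange(1.0, 10.0))  # positive eigenvalues, so that
TB = np.triu(rng.standard_normal((7, 7)))
np.fill_diagonal(TB, np.arange(1.0, 8.0))   # lambda(TA) + lambda(TB) > 0
C = rng.standard_normal((9, 7))
X = trsyct_wavefront(TA, TB, C, nb=3)
```

Note that every block on a given anti-diagonal could be solved concurrently; the serial loop only fixes an order among independent subproblems.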

Parallelism is extracted for the one-sided matrix equations by observing that all subsystems associated with any block (anti-)diagonal are independent. This is also true for the two-sided matrix equations if the technique of rearranging the updates of the right hand side using intermediate sums of matrix products [Granat and Kagstrom 2006; 2007a] is utilized. The level-3 updates of the right hand side are also independent, except for the need to communicate the subsolution and some other blocks from the left hand side.

Data locality is accomplished by broadcasting the subsolutions in the corresponding processor row and column and by using two different communication schemes: on demand and matrix block shifts [Granat et al. 2003]. The broadcasts may also be pipelined to improve the scalability of the triangular solvers [Granat and Kagstrom 2006] for the one-sided matrix equations.

The bulk of this paper is Section 2, where we discuss implementation issues, present the Fortran interfaces to SCASY, and describe the internal design and the usage of the library. In addition, Section 3 presents novel computational results from a distributed memory platform with dual shared memory NUMA (non-uniform memory access) nodes which demonstrate the performance of SCASY when combined with multi-threaded RECSY node solvers and a threaded version of the BLAS.

2. THE SCASY SOFTWARE LIBRARY

SCASY includes ScaLAPACK-style general matrix equation solvers implemented as eight basic routines called PGE[ACRO]D, where 'P' stands for parallel, 'GE' stands for general, 'D' denotes double precision and '[ACRO]' is replaced by the acronym in Table I for the matrix equation to be solved. All parallel algorithms implemented are blocked variants based on the Bartels–Stewart method (see [Granat and Kagstrom 2007a]). These routines invoke the corresponding triangular solvers PTR[ACRO]D, where 'TR' stands for triangular. Condition estimators P[ACRO]CON associated with each matrix equation are built on top of the triangular solvers, accessed through the general solvers using a parameter setting that avoids the reduction part of the general algorithm (see below).
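The naming scheme can be summarized programmatically. The small Python sketch below is our own illustration (the list comprehensions are not part of SCASY); it generates the 24 routine names of the three families from the Table I acronyms:

```python
# Acronyms from Table I
acronyms = ["SYCT", "LYCT", "SYDT", "LYDT", "GCSY", "GSYL", "GLYCT", "GLYDT"]

general    = ["PGE" + a + "D" for a in acronyms]   # general solvers, e.g. PGESYCTD
triangular = ["PTR" + a + "D" for a in acronyms]   # triangular solvers, e.g. PTRSYCTD
estimators = ["P" + a + "CON" for a in acronyms]   # condition estimators, e.g. PLYCTCON

routines = general + triangular + estimators
```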

2.1 Software documentation

The software package contains additional information regarding installation instructions and the functionality of the software in a README file and in the documentation provided in the beginning of the source code of each included subroutine. In what follows, we refer to this documentation for details.

2.2 Fortran interfaces

All programs and subroutines are designed and documented to be integrated in "state-of-the-art" software libraries like ScaLAPACK [Blackford et al. 1997] and PSLICOT [SLICOT; Blanquer et al. 1998]. Therefore, SCASY and the test program provided with the library are implemented in ANSI Fortran 77 with the following exceptions:

—We allow mixing of integers and logicals in statements and expressions in the subroutines.

—We provide the possibility to utilize dynamic memory allocation in the test program by defining the preprocessor variable DYNAMIC.

—We provide the possibility to allocate large arrays of memory in the test program by specifying the node memory size using an integer of 8 bytes defined in the preprocessor variable INTEGER8.

If any of the latter exceptions are used, the test program must be both compiled and linked using a Fortran 90 compiler.

2.2.1 Standard matrix equations. Below, the subroutine headings of our four general solver routines for the unreduced standard Sylvester-type matrix equations are listed.

—SUBROUTINE PGESYCTD( JOB, ASCHUR, BSCHUR, TRANSA, TRANSB, ISGN, COMM, M, N, A, IA, JA, DESCA, B, IB, JB, DESCB, C, IC, JC, DESCC, MBNB2, DWORK, LDWORK, IWORK, LIWORK, NOEXSY, SCALE, INFO )


Solves the general continuous-time Sylvester (SYCT) equation, where sub(A) is M × M, sub(B) is N × N, and sub(C) and sub(X) (which overwrites sub(C)) are M × N.

—SUBROUTINE PGELYCTD( JOB, SYMM, OP, ASCHUR, M, A, IA, JA, DESCA, C, IC, JC, DESCC, MB2, DWORK, LDWORK, IWORK, LIWORK, NOEXSY, SCALE, INFO )

Solves the general continuous-time Lyapunov (LYCT) equation with general or symmetric right hand side C, where sub(A) is M × M, and sub(C) and sub(X) (which overwrites sub(C)) are M × M.

—SUBROUTINE PGESYDTD( JOB, ASCHUR, BSCHUR, TRANSA, TRANSB, ISGN, COMM, M, N, A, IA, JA, DESCA, B, IB, JB, DESCB, C, IC, JC, DESCC, MB2, DWORK, LDWORK, IWORK, LIWORK, NOEXSY, SCALE, INFO )

Solves the general discrete-time Sylvester (SYDT) equation, where sub(A) is M × M, sub(B) is N × N, and sub(C) and sub(X) (which overwrites sub(C)) are M × N.

—SUBROUTINE PGELYDTD( JOB, SYMM, OP, ASCHUR, M, A, IA, JA, DESCA, C, IC, JC, DESCC, MB2, DWORK, LDWORK, IWORK, LIWORK, NOEXSY, SCALE, INFO )

Solves the general discrete-time Lyapunov (LYDT) equation with general or symmetric right hand side C, where sub(A) is M × M, and sub(C) and sub(X) (which overwrites sub(C)) are M × M.

The interfaces to the corresponding triangular solvers are not discussed here since they are accessed through the corresponding general solvers by default (see, e.g., ASCHUR and BSCHUR below). In the following, we briefly describe the different interface arguments associated with the routines above. A summary of these arguments is listed in Table II.

2.2.2 Mode parameters. The character mode parameter ’JOB’, which takes the value ’R’ or ’S’, chooses between reduction mode and solving mode. In the latter mode, the actual equation is both reduced to triangular form and solved; in the former mode, only the reduction step is performed. Such a stand-alone reduction is motivated by the fact that it simplifies a subsequent call to the corresponding condition estimator in cases when the user does not provide a right hand side for solving the equation.

The character mode parameters ’ASCHUR’ and ’BSCHUR’, which take the values ’N’ or ’S’, specify what parts (if any) of the reduction to carry out. For example, ASCHUR=’S’ and BSCHUR=’S’ correspond to a fully reduced triangular problem and no work related to the reduction will be performed. Expert users may instead call the triangular solver directly, but will then have to do more error checking in their calling program since SCASY has almost all error checking in the general routines.

SCASY–Parallel Solvers for Sylvester-Type Matrix Equations · 141

The character mode parameters ’TRANSA’ and ’TRANSB’ (SYCT and SYDT) or ’OP’ (LYCT and LYDT), which take the values ’N’ or ’T’, switch between the transpose cases of the actual equation (see also Table I).

For LYCT and LYDT, the character mode parameter ’SYMM’, which takes the values ’N’ or ’S’, switches between the symmetric and nonsymmetric cases. SYMM=’S’ corresponds to symmetric right hand side and solution matrices and halves the number of flops needed for computing the solution of the reduced triangular problem.

2.2.3 Input/output parameters. The integer input parameter ’ISGN’ signals the sign (+1 or −1) in the actual equation (see Table I).

The character input/output argument ’COMM’ gives the user the opportunity to choose one out of two communication schemes in the triangular solver, the default on demand scheme (COMM=’D’) or the non-default matrix block shift scheme (COMM=’S’). Matrix block shifting cannot be employed for all problems (see, e.g., [Granat and Kagstrom 2006] and Part I [Granat and Kagstrom 2007a]), so the corresponding triangular solver may switch to the on demand scheme if necessary. For this reason, ’COMM’ is also output from the routines, showing which scheme was actually used in the routine.

The integer input parameters ’M’ and ’N’ specify the dimensions of the involved submatrices.

The input/output two-dimensional double precision arrays ’A’, ’B’ and ’C’ correspond to the globally distributed matrices A, B and C stored using column-major layout. On return from the routines, ’A’ and ’B’ are in real Schur form and in the solving mode (see above) ’C’ is overwritten with the solution matrix X. Notice that SCASY treats all matrices as one-dimensional arrays internally in the subroutines.
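The column-major one-dimensional storage mentioned above amounts to a simple index mapping; the following Python sketch (illustrative only, not SCASY code, all names hypothetical) shows where a 1-based local entry (i, j) lands in the flat array:

```python
# Illustrative sketch (not SCASY code): flat offset of the 1-based
# local entry (i, j) in a column-major array with leading dimension lld.

def local_index(i, j, lld):
    return (j - 1) * lld + (i - 1)

# a 3 x 2 local block stored with lld = 8 (lld >= local row count)
print(local_index(1, 1, 8))   # 0: first entry of column 1
print(local_index(3, 2, 8))   # 10: third entry of column 2
```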

The integer input arguments ’IA’, ’JA’, ’IB’, ’JB’, ’IC’ and ’JC’ specify start rows and columns for the corresponding submatrices to operate on. These parameters must obey some alignment requirements to not violate the ScaLAPACK conventions. We remark that the current release of SCASY cannot work on submatrices for the unreduced matrix equations because of missing functionality in the routines used in the reduction to triangular form, e.g., PDLAHQR from ScaLAPACK.

The input integer descriptor arrays ’DESC_’ correspond to the ScaLAPACK distributed matrix descriptors. These arrays consist of nine elements each and describe how the corresponding matrix is distributed over the process mesh, as follows:

(1) DESC_(1) contains the descriptor type. For dense matrices, DESC_(1) = 1.

(2) DESC_(2) contains the identity of the corresponding BLACS context, which is very much like an MPI communicator, over which the corresponding matrix is distributed.

(3) DESC_(3) and DESC_(4) contain the total number of rows and columns of the corresponding globally distributed matrix, respectively. These dimensions should not be confused with M or N above.

(4) DESC_(5) and DESC_(6) contain the blocking factors MB and NB used in the block-cyclic data layout of the corresponding globally distributed matrix in the row and column dimensions, respectively.

(5) DESC_(7) and DESC_(8) contain the process row and column over which the first row and column of the corresponding globally distributed matrix are distributed, respectively.



(6) DESC_(9) contains the leading dimension of the local part of the corresponding globally distributed matrix.

The first eight elements of a descriptor must be globally consistent for the corresponding context. For more information on the descriptor arrays, see, e.g., [Blackford et al. 1997; SLUG ].
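As an illustration of the descriptor conventions above, the following Python sketch (our own, not SCASY or ScaLAPACK code; all names are hypothetical) builds such a nine-element descriptor and derives the process that owns a given global entry under the block-cyclic layout it encodes:

```python
# Sketch (not SCASY code): a ScaLAPACK-style descriptor as a list of
# nine integers, plus the block-cyclic ownership mapping implied by
# elements (5)-(8). Names are illustrative only.

def make_descriptor(ctxt, m, n, mb, nb, rsrc, csrc, lld):
    """DESC_(1..9): type, context, global rows/cols, row/col blocking
    factors, first process row/col, local leading dimension."""
    return [1, ctxt, m, n, mb, nb, rsrc, csrc, lld]

def owning_process(i, j, desc, p_rows, p_cols):
    """Process (pr, pc) holding global entry (i, j), 0-based, on a
    Pr x Pc mesh under the block-cyclic layout the descriptor encodes."""
    mb, nb, rsrc, csrc = desc[4], desc[5], desc[6], desc[7]
    pr = (i // mb + rsrc) % p_rows
    pc = (j // nb + csrc) % p_cols
    return pr, pc

desc = make_descriptor(ctxt=0, m=10, n=10, mb=4, nb=4, rsrc=0, csrc=0, lld=8)
# rows 0-3 -> process row 0, rows 4-7 -> row 1, rows 8-9 -> row 0 again
print(owning_process(0, 0, desc, 2, 2))  # (0, 0)
print(owning_process(4, 8, desc, 2, 2))  # (1, 0)
```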

The input arrays ’DWORK’ and ’IWORK’ are double precision and integer workspaces with the lengths specified by the integer input arguments ’LDWORK’ and ’LIWORK’, respectively.

The input arguments ’MBNB2’, which is an integer array of size 2, and ’MB2’, which is an integer scalar, contain the internal blocking factors used in the multiple pipelining approach [Granat and Kagstrom 2007a] utilized in the corresponding triangular solvers. Multiple pipelining is turned off by using the same blocking factors for the pipelining as in the data distribution (i.e., in the matrix descriptors, see above).

2.2.4 Output parameters. The triangular solvers handle 2 × 2 blocks shared by multiple data layout blocks by an implicit redistribution [Granat et al. 2003] of the involved matrices (matrix pairs). This causes some subsystems to be of slightly different size, and the integer output argument ’NOEXSY’ counts the number of such systems solved during the execution of the corresponding triangular solver.

The output double precision argument ’SCALE’ is a global scaling factor in the interval (0, 1] for the right hand side used in the parallel solver to avoid overflow in the solution. ’SCALE’ corresponds to the scalar s in Table I.

2.2.5 Error handling. The output integer argument ’INFO’ gives error messages, including overflow warnings, on output from the calling routine, as follows:

—If INFO < 0, one of the arguments passed to the routine had an illegal value and INFO is set to point at that particular argument.

—If INFO = 0, the routine was invoked successfully and returned without any error messages.

—If INFO = 1, there was no valid BLACS context [BLACS ] in the call and the call was aborted.

—If INFO = 2, the problem was very ill-conditioned and a perturbed nearly singular system was used to solve the corresponding matrix equation (see, e.g., [Kagstrom and Poromaa 1996]).

—If INFO = 3, the problem was badly scaled and the right hand side(s) was scaled by a factor SCALE to avoid overflow in the solution.
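A caller-side sketch of acting on these INFO codes (a hypothetical Python helper, not part of the library):

```python
# Hypothetical helper (not part of SCASY): map the INFO codes
# described above to caller-side error handling. INFO = 2 and 3
# are warnings, not failures.

def check_info(info):
    """Return True on clean success, False on a warning; raise on errors."""
    if info < 0:
        raise ValueError("argument %d had an illegal value" % -info)
    if info == 1:
        raise RuntimeError("no valid BLACS context in the call")
    if info == 2:
        print("warning: ill-conditioned; a perturbed system was solved")
    elif info == 3:
        print("warning: right hand side scaled by SCALE to avoid overflow")
    return info == 0

assert check_info(0)       # clean return
check_info(3)              # prints the overflow warning, returns False
```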

2.2.6 Generalized matrix equations. Below, the subroutine headings of our four general solver routines for the unreduced generalized Sylvester-type matrix equations are listed.

—SUBROUTINE PGEGCSYD( JOB, TRANZ, ADSCHR, BESCHR, TRANAD, TRANBE,
     ISGN, COMM, M, N, A, IA, JA, DESCA, B, IB, JB, DESCB, C, IC, JC,
     DESCC, D, ID, JD, DESCD, E, IE, JE, DESCE, F, IF, JF, DESCF, MBNB2,
     DWORK, LDWORK, IWORK, LIWORK, NOEXSY, SCALE, INFO )

Solves the unreduced generalized coupled Sylvester (GCSY) equation, where sub(A) and sub(D) are M × M, sub(B) and sub(E) are N × N, and sub(C), sub(X), sub(F) and sub(Y) are M × N. Notice that sub(X) and sub(Y) overwrite sub(C) and sub(F) on output.

—SUBROUTINE PGEGSYLD( JOB, ACSCHR, BDSCHR, TRANAC, TRANBD, ISGN,
     COMM, M, N, A, IA, JA, DESCA, B, IB, JB, DESCB, C, IC, JC, DESCC,
     D, ID, JD, DESCD, E, IE, JE, DESCE, MBNB2, DWORK, LDWORK, IWORK,
     LIWORK, NOEXSY, SCALE, INFO )

Solves the unreduced generalized Sylvester (GSYL) equation, where sub(A) and sub(C) are M × M, sub(B) and sub(D) are N × N, and sub(E) and sub(X) (which overwrites sub(E)) are M × N.

—SUBROUTINE PGEGLYCTD( JOB, SYMM, OP, AESCHR, M, A, IA, JA, DESCA,
     E, IE, JE, DESCE, C, IC, JC, DESCC, MB2, DWORK, LDWORK, IWORK,
     LIWORK, NOEXSY, SCALE, INFO )

Solves the unreduced generalized continuous-time Lyapunov (GLYCT) equation, with general or symmetric right hand side C, where sub(A) and sub(E) are M × M, and sub(C) and sub(X) (which overwrites sub(C)) are M × M.

—SUBROUTINE PGEGLYDTD( JOB, SYMM, OP, AESCHUR, M, A, IA, JA, DESCA,
     E, IE, JE, DESCE, C, IC, JC, DESCC, MB2, DWORK, LDWORK, IWORK,
     LIWORK, NOEXSY, SCALE, INFO )

Solves the unreduced generalized discrete-time Lyapunov (GLYDT) equation, with general or symmetric right hand side C, where sub(A) and sub(E) are M × M, and sub(C) and sub(X) (which overwrites sub(C)) are M × M.

Below, we give a description of the new arguments that do not exist for the standard matrix equations. A summary of these is listed in Table II.

2.2.7 Mode parameters. The character mode parameter ’_SCHR’, which takes the values ’N’ or ’S’, specifies if the corresponding matrix pair should be reduced to generalized Schur form or not.

The character mode parameters ’TRAN_’ and ’OP’, which take the values ’N’ or ’T’, give the transpose mode for the corresponding matrix pair in the actual equation (see Table I). The algorithms in SCASY are limited to the cases where the corresponding matrix pairs formed by the left and right multiplying matrices (see Part I [Granat and Kagstrom 2007a]) have the same transpose mode.

In condition estimation of GCSY, we need to solve transpose variants of the Kronecker product matrix representation ZGCSY of the generalized coupled Sylvester operator which cannot be expressed by changing the transpose modes of the involved left hand side coefficient matrices (see Part I [Granat and Kagstrom 2007a], and [Kagstrom and Poromaa 1996] and LAPACK’s DTGSYL for details). Therefore, the corresponding transpose mode is specified by the character mode argument ’TRANZ’, which takes the values ’N’ or ’T’, in the routine PGEGCSYD.

2.2.8 Input/output parameters. The input/output two-dimensional double precision arrays ’A’, ’B’, ’C’, ’D’, ’E’ and ’F’ correspond to the matrices A, B, C, D, E and F. On return from the routines, the involved matrix pairs (excluding the right hand side pair (E, F) in GCSY) are in generalized real Schur form. In solving mode (see above), the right hand side matrix (or matrix pair) is overwritten with the solution matrix (pair). As mentioned before, SCASY treats all matrices as one-dimensional arrays internally in the subroutines.

The input integer arguments ’IA’, ’JA’, ’IB’, ’JB’, ’IC’, ’JC’, ’ID’, ’JD’, ’IE’, ’JE’, ’IF’ and ’JF’ specify start rows and columns for the corresponding submatrices of A, B, C, D, E, and F to operate on. These parameters must obey some alignment requirements to not violate ScaLAPACK conventions.

Table II. Parameters to Fortran interfaces.

Mode parameters
  JOB       CHARACTER*1   Specifies solving mode (’S’) or reduction mode (’R’).
  SYMM      CHARACTER*1   Specifies symmetric right hand side (’S’) or not (’N’).
  OP        CHARACTER*1   Specifies transpose mode (’N’ or ’T’). Only for
                          Lyapunov equations.
  TRANZ     CHARACTER*1   Specifies transpose mode for Kronecker product matrix
                          representation (’N’ or ’T’). Only used in condition
                          estimation of GCSY.
  _SCHUR    CHARACTER*1   Specifies if the matrix is in real Schur form (’S’)
                          or not (’N’). Only for standard matrix equations.
  _SCHR     CHARACTER*1   Specifies if the matrix pair is in generalized real
                          Schur form (’S’) or not (’N’). Only for generalized
                          matrix equations.
  TRANS_    CHARACTER*1   Specifies transpose mode for specific matrix
                          (’N’ or ’T’).
  TRAN_     CHARACTER*1   Specifies transpose mode for specific matrix pair
                          (’N’ or ’T’).
  ISGN      INTEGER       Specifies the sign variant of the equation (-1 or 1).

Input/output parameters
  COMM      CHARACTER*1   Sets communication scheme used in PTR[ACRO]D.
  M, N      INTEGER       (Sub)matrix dimensions.
  A,B,C,D,E,F  DOUBLE PRECISION(*)
                          Two-dimensional arrays corresponding to the local
                          parts of the globally distributed matrices. On output,
                          each left hand side matrix (pair) is returned in
                          Schur (generalized Schur) form. On output and in
                          solving mode, the right hand side is overwritten with
                          the solution. In reduction mode, the right hand side
                          is not referenced.
  IA,IB,IC,ID,IE,IF  INTEGER
                          Row starting indices for submatrices to operate on.
  JA,JB,JC,JD,JE,JF  INTEGER
                          Column starting indices for submatrices to operate on.
  DESC_     INTEGER(*)    ScaLAPACK matrix descriptor arrays.
  MBNB2     INTEGER(2)    Blocking factors for multiple pipelining in one-sided
                          Sylvester equations.
  MB2       INTEGER       Blocking factor for multiple pipelining in two-sided
                          Sylvester and one-sided Lyapunov equations.

Workspace
  DWORK     DOUBLE PRECISION(*)  Double precision workspace.
  LDWORK    INTEGER       Length of DWORK.
  IWORK     INTEGER(*)    Integer workspace.
  LIWORK    INTEGER       Length of IWORK.

Output information
  NOEXSY    INTEGER       Counts the number of extended/diminished subsystems
                          solved in the call to the triangular solver.
  SCALE     DOUBLE PRECISION  Right hand side scaling factor, 0 < SCALE ≤ 1.0.

Error handling
  INFO      INTEGER       Returns error information to the calling program.



2.3 Condition estimators

Parallel implementations of the condition estimators presented in Part I [Granat and Kagstrom 2007a] are available in SCASY as the routines P[ACRO]CON. The Fortran interfaces to the different estimators are derived from the corresponding solvers. For example, the SYCT condition estimator has the following Fortran interface:

—SUBROUTINE PSYCTCON( TRANSA, TRANSB, ISGN, COMM, M, N, A, IA, JA,
     DESCA, B, IB, JB, DESCB, MBNB2, DWORK, LDWORK, IWORK, LIWORK, EST,
     NOITER, INFO )

Computes a 1-norm based lower bound estimate EST of sep^{-1}(sub(A), sub(B)), where sub(A) is M × M and sub(B) is N × N, using NOITER iterations and calls to PTRSYCTD (via PGESYCTD with ASCHUR=’S’ and BSCHUR=’S’).

The differences from the general SYCT interface are that the matrix equation is assumed to be in reduced form, that there is no argument for the right hand side ’C’, since it is generated internally by the estimator (ScaLAPACK’s PDLACON), and that there are two additional output arguments, ’EST’ and ’NOITER’, which give the computed estimate and the number of iterations (the number of triangular matrix equations solved) needed to compute it.

The implemented condition estimators assume that the corresponding matrix equations are in reduced (triangular) form. If this is not the case, the user must perform the reduction step by one single call to the corresponding general solver with the mode parameter ’JOB’ set to ’R’. Future releases of SCASY are planned to also include Frobenius-norm based estimators (see, e.g., [Kagstrom and Westin 1989; Kagstrom and Poromaa 1996]).
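The technique underlying (P)DLACON is Hager-style 1-norm estimation, which needs only matrix-vector products with the operator and its transpose; in the estimators above the operator is the inverse of the Sylvester-type operator, applied by solving a triangular matrix equation. A simplified pure-Python sketch of the idea, applied here to an explicit small matrix for illustration only (not the SCASY/ScaLAPACK code):

```python
# Simplified Hager-style 1-norm estimation: a lower bound on ||A||_1
# computed from products with A and A^T only. Illustrative sketch.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def one_norm_estimate(A, iters=5):
    n = len(A)
    x = [1.0 / n] * n                    # start with the mean vector
    est = 0.0
    for _ in range(iters):
        y = matvec(A, x)
        est = sum(abs(v) for v in y)
        s = [1.0 if v >= 0 else -1.0 for v in y]     # sign vector
        z = matvec(transpose(A), s)
        j = max(range(n), key=lambda k: abs(z[k]))
        if abs(z[j]) <= sum(zi * xi for zi, xi in zip(z, x)):
            break                        # converged: no better column found
        x = [0.0] * n
        x[j] = 1.0                       # next probe: unit vector e_j
    return est

A = [[1.0, -2.0], [3.0, 4.0]]           # true ||A||_1 = 6 (second column)
print(one_norm_estimate(A))             # 6.0: the bound is exact here
```

The estimate is always a lower bound on the true 1-norm, and in practice it is usually exact or very close after a handful of iterations.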

2.4 Test example generators

SCASY includes two problem generator routines which generate matrices or matrix pairs with specified standard or generalized eigenvalues, with the following interfaces:

—SUBROUTINE P1SQMATGD( DIAG, SDIAG, UPPER, M, A, DESCA, DW, MW,
     DDIAG, SUBDIAG, NQTRBL, ASEED, DWORK, LDWORK, INFO )

Generates the M × M matrix A with eigenvalues specified by DDIAG and SUBDIAG.

—SUBROUTINE P2SQMATGD( ADIAG, ASDIAG, BDIAG, UPPER, M, A, DESCA,
     ADW, AMW, ADDIAG, ASUBDIAG, ANQTRBL, ASEED, B, DESCB, BDW, BMW,
     BDDIAG, BSEED, DWORK, LDWORK, INFO )

Generates the M × M matrix pair (A, B) with generalized eigenvalues specified by ADDIAG, ASUBDIAG and BDDIAG.

P1SQMATGD generates coefficient matrices for the standard matrix equations as follows. Consider a matrix A ∈ R^{m×m} of the form A = Q(αA DA + βA MA)Q^T, where DA is block diagonal with 1 × 1 and 2 × 2 blocks, MA is strictly upper triangular with zeros on the first superdiagonal where DA has 2 × 2 blocks, Q is a random orthogonal matrix, and αA and βA are real scalars. We choose MA as a random matrix with uniformly distributed elements in the interval [0, 1] and prescribe the eigenvalues of A by specifying the elements of DA, where the 2 × 2 blocks correspond to complex conjugate pairs of eigenvalues.

P2SQMATGD generates test matrix pairs for the generalized matrix equations in a similar way by specifying the generalized eigenvalues for a given diagonal matrix pair (DA, DB) ∈ R^{(m×m)×2} (2 × 2 blocks are only allowed in DA) and performing the equivalence transformation (A, B) = X^T(DA, DB)Y, where X and Y are invertible matrices constructed as follows: we specify their singular values ΣX = diag(σ1, . . . , σm) and ΣY = diag(ρ1, . . . , ρm) and generate four random orthogonal matrices U1, U2, V1 and V2 such that X = U1 ΣX V1^T and Y = U2 ΣY V2^T. In practice, the matrices ΣX and ΣY are not generated explicitly; instead, the singular values are used to scale the corresponding rows and columns in the transformations above. The conditioning of X and Y is controlled by ΣX and ΣY. If these are chosen so that X and Y are well-conditioned, the conditioning of the resulting matrix pair (A, B) will mainly depend on the specified eigenvalues.
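The P1SQMATGD construction can be illustrated in a few lines of Python (a toy 3 × 3 sketch with a single rotation playing the role of the random orthogonal Q; not the actual routine):

```python
import math

# Toy sketch of the P1SQMATGD construction: A = Q (alpha*D + beta*M) Q^T,
# where D is block diagonal with a 2x2 block encoding a complex conjugate
# eigenpair, M is strictly upper triangular with a zero on the
# superdiagonal where D has its 2x2 block, and Q is orthogonal.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

# prescribed eigenvalues: 1 +/- 2i (from the 2x2 block) and 5
D = [[1.0, 2.0, 0.0],
     [-2.0, 1.0, 0.0],
     [0.0, 0.0, 5.0]]
M = [[0.0, 0.0, 0.7],     # M[0][1] = 0: superdiagonal zeroed at the block
     [0.0, 0.0, 0.3],
     [0.0, 0.0, 0.0]]
alpha, beta = 1.0, 1.0

c, s = math.cos(0.3), math.sin(0.3)      # one rotation stands in for Q
Q = [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]

T = [[alpha * D[i][j] + beta * M[i][j] for j in range(3)] for i in range(3)]
A = matmul(matmul(Q, T), transpose(Q))

# similarity preserves the spectrum; the trace is a cheap witness:
print(sum(A[i][i] for i in range(3)))    # ~ 7.0 = 1 + 1 + 5, up to rounding
```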

In the current release, the matrix generation routines only support globally distributed test matrices (and not submatrices of a global matrix).

2.5 SCASY internals

In total, SCASY consists of 47 routines whose design depends on the functionality of a number of external libraries. The call graph in Figure 1 shows the subroutine hierarchy in SCASY. In the following, we briefly describe the six types of software included in the call graph.

2.5.1 Libraries. The following external libraries are used in SCASY:

—ScaLAPACK [Blackford et al. 1997; SLUG ], including the PBLAS [PBLAS ] and BLACS [BLACS ],

—LAPACK and BLAS [Anderson et al. 1999],

—RECSY [Jonsson and Kagstrom 2003], which provides almost all node solvers except for one transpose case of the GCSY equation. Notice that RECSY in turn calls a small set of subroutines from SLICOT (Software Library in Control) [SLICOT ; Elmroth et al. 2001].

For example, the routines for the standard matrix equations utilize the ScaLAPACK routines PDGEHRD, which performs a parallel Hessenberg reduction, PDLAHQR, which is the parallel unsymmetric QR algorithm presented in [Henry et al. 2002], and PDGEMM, the PBLAS parallel implementation of the level-3 BLAS GEMM operation. The triangular solvers employ the RECSY node solvers [Jonsson and Kagstrom 2002a; 2002b; Jonsson and Kagstrom 2003] and LAPACK’s DTGSYL (see also [Kagstrom and Poromaa 1996]) for solving (small) matrix equations on the nodes, and the BLAS for the level 3 updates (DGEMM, DTRMM and DSYR2K operations). To perform explicit communication and coordination in the triangular solvers we use the BLACS library.

SCASY may be compiled including node solvers from the OpenMP version of RECSY by defining the preprocessor variable OMP. By linking with a multi-threaded version of the BLAS, SCASY supports parallelization on both a global and a node level on distributed memory platforms with SMP-aware nodes (see Section 3). The number of threads to use in the RECSY solvers and the threaded version of the BLAS is controlled by the user via environment variables, e.g., via OMP_NUM_THREADS for OpenMP and GOTO_NUM_THREADS for the threaded version of the GOTO-BLAS [GOTOBLAS ] (see also Section 3).

Fig. 1. Subroutine call graph of SCASY.

2.5.2 Test utilities. The test utilities of SCASY consist of the test program TESTSCASY, the matrix (pair) generators described in Section 2.4, and the matrix printing routines DLAPRNT and PDLAPRNT (the latter a modified version of the corresponding ScaLAPACK routine), which are convenient to use in debugging. TESTSCASY tests some or all routines for a user-specified range of matrix dimensions, blocking factors, processor grid configurations, etc. The user may also construct differently conditioned problems by specifying the eigenvalues of the generated problems.

2.5.3 SCASY core. The SCASY core consists of the general and triangular solvers described in Section 2.2 and the condition estimators described in Section 2.3.

2.5.4 Implicit redistribution. The handling of the implicit redistribution, caused by 2 × 2 diagonal blocks (corresponding to complex conjugate pairs of eigenvalues) shared by multiple data layout blocks (and processors) in the left hand side quasi-triangular matrices of the matrix equations, is concentrated in a few routines, as follows:



—PDEXTCHK searches the diagonal blocks of a given matrix, say A, looking for any 2 × 2 blocks shared by several data layout blocks. The routine returns the following redistribution information, which decides which mbA × nbA diagonal blocks Aii should be extended or diminished by one row and column to include all elements of a shared 2 × 2 block:

    EXT_INFO_A(i) = 0 if Aii is unchanged,
                    1 if Aii is extended,
                    2 if Aii is diminished,
                    3 if Aii is extended and diminished.

This information is broadcast to all processors and is used in

—PDIMPRED, in which data is exchanged between the processors via message passing to build up local arrays of extra elements which are used while constructing and decomposing the "correct" submatrices locally on the nodes by invoking the serial routines DBEXMAT and DUBEXMA.

—PDBCKRD is called right before returning from the corresponding triangular solver and sends back the redistributed parts of the solution matrix (pair) to their original owner processes such that the solution matrix (pair) is correctly distributed over the process mesh on output.

These routines are implemented in a general manner such that they are easy to reuse in other applications involving distributed quasi-triangular matrices.
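The effect of PDEXTCHK can be sketched as follows (hypothetical Python, not the actual routine): given the starting rows of the 2 × 2 diagonal blocks of a quasi-triangular matrix and the layout block size nb, mark which diagonal blocks must be extended and/or diminished, using the codes listed above:

```python
# Hypothetical sketch (not PDEXTCHK itself): compute the extend/diminish
# codes (0 unchanged, 1 extended, 2 diminished, 3 both) for the nb x nb
# diagonal blocks of an n x n quasi-triangular matrix, given the starting
# rows of its 2x2 diagonal blocks (0-based).

def ext_info(two_by_two_starts, n, nb):
    nblocks = (n + nb - 1) // nb
    info = [0] * nblocks
    for r in two_by_two_starts:          # a 2x2 block occupies rows r, r+1
        if (r + 1) % nb == 0 and r + 1 < n:
            b = r // nb                  # split across layout blocks b, b+1
            info[b] |= 1                 # block b is extended ...
            info[b + 1] |= 2             # ... and block b+1 is diminished
    return info

# n = 8, nb = 4: a 2x2 block starting at row 3 straddles rows 3 and 4
print(ext_info([3], 8, 4))        # [1, 2]
# a 2x2 block inside one layout block changes nothing
print(ext_info([0], 8, 4))        # [0, 0]
# nb = 2: the middle block is both extended and diminished (code 3)
print(ext_info([3, 5], 8, 2))     # [0, 1, 3, 2]
```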

2.5.5 Multiple pipelining. For all matrix equations except LYDT, GLYCT and GLYDT, a locally blocked node solver DTR[ACRO] is used on top of the corresponding RECSY solver to enhance pipelining of local subsolutions (see Section 3.4 in Part I [Granat and Kagstrom 2007a] for details).

2.5.6 Parallel QZ prototype. The reduction of the involved matrix pairs to generalized real Schur form is performed using the new prototype ScaLAPACK-style implementations of the Hessenberg-triangular reduction and the parallel multishift QZ algorithm presented in [Adlerborn et al. 2002; Adlerborn et al. 2006]. These algorithms are still under development and improvements will be included in future releases of SCASY.

2.6 Library usage

The SCASY library is available from the library webpage [SCASY ]. The library is easily installed using the Make include file supplied by the package. Necessary modifications include choice of compilers, the associated flags, linking options, paths to external libraries, etc.

The library is copyrighted and freely available for academic (non-commercial) use, and is provided on an "as is" basis. Any use of the SCASY library should be acknowledged by citing the Part I and Part II papers [Granat and Kagstrom 2007a; 2007b] and the SCASY homepage [SCASY ].

For bug reports and other issues, we refer to the software documentation (see Section 2.1). We also welcome input and comments from users.



3. SCASY WITH OPENMP RECSY AND SMP-BLAS

In this section, we present experimental results that illustrate the performance of the SCASY library when linked with the OpenMP version of the RECSY node solvers and a shared memory version of the BLAS on a distributed memory platform with SMP-type nodes. Such a platform consists of a number of distributed nodes with local memory, where each node is a stand-alone shared memory machine. These parallel platforms are becoming increasingly common and benefit in many ways from the possibility of performing parallel processing in several programming paradigms (including the message-passing model and the multiple threads model) concurrently.

For example, according to the analysis (see Part I [Granat and Kagstrom 2007a]), the level of parallelism in our triangular algorithms is favored by square process meshes, i.e., Pr = Pc, since the number of concurrent subsystems to solve and, to some extent, the related GEMM-updates are limited by min(Pr, Pc). This means that the number of concurrent tasks will not increase significantly going from a 4 × 4 to an 8 × 4 (or 4 × 8) mesh, since the number of concurrent subsystems to solve will not increase at all. However, using the benefits of the shared memory versions of the node solvers and the BLAS, we may increase the number of processors from 4 × 4 to 4 × 4 × 2, where the new third dimension of the mesh consists of the extra threads added by including both processors of the shared memory nodes. In this way, the ScaLAPACK process mesh is kept squarish while the number of processors is doubled.
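The concurrency bound referred to above is simply min(Pr, Pc); as a trivial sketch:

```python
# The number of subsystems that can be solved simultaneously on a
# Pr x Pc process mesh is bounded by min(Pr, Pc), so widening only
# one mesh dimension does not add concurrent subsystems.

def concurrent_subsystems(pr, pc):
    return min(pr, pc)

print(concurrent_subsystems(4, 4))   # 4
print(concurrent_subsystems(8, 4))   # still 4
```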

Our target machine is the 64-bit Opteron Linux cluster sarek with 192 dual AMD Opteron nodes (2.2 GHz), 1.5TB RAM per node and a Myrinet-2000 high-performance interconnect with 250 MB/sec bandwidth. Each node on sarek is a NUMA machine. All experiments were conducted using the pgf77 1.2.5 64-bit compiler, the compiler flag -fast and the following software: MPICH-GM 1.5.2 [MPI ], LAPACK 3.0 [Anderson et al. 1999], GOTO-BLAS r0.94 [GOTOBLAS ], ScaLAPACK 1.7.0 [SLUG ], BLACS 1.1patch3 [BLACS ] and RECSY 0.01alpha [RECSY ]. All experiments are conducted in double precision arithmetic. For more information about the compilers and software used, see Part I [Granat and Kagstrom 2007a].

As a case study, we consider the triangular GCSY equation (see Table I), for which the RECSY node solver RECGCSY_P is known to give good parallel performance. The threaded versions of RECSY and BLAS give the best performance when blocks (tasks) as large as possible are used in the calls. By definition, multiple pipelining creates smaller blocks, so there is a trade-off between using multi-threading and multiple pipelining (or both). For the results presented below, only implicit multi-threading via RECSY and BLAS was utilized.

Some initial tests were performed on equations with matrices of size 2000 × 2000 to investigate which blocking factors were beneficial on one node of sarek with multiple (in this case two) threads. Already for mb = nb = 256, we experienced a cut in the execution time from 12.0 to 8.3 seconds going from one to two threads, which is a speedup of 1.45. For increasing problem sizes, larger blocking factors are required for the same amount of speedup; see also [Granat et al. 2004], where it was demonstrated that RECSY performs best in this context when large and square data layout blocks are utilized.

We present performance results of PTRGCSYD in Table III, where the right-most column S2 displays the speedup of the parallel algorithms going from one to two threads on each compute node.

Table III. Performance of PTRGCSYD compiled with the OpenMP node solver RECGCSY_P and a threaded version of the GOTO-BLAS on sarek. All timings, T1 and T2, are in seconds.

                        [OMP/GOTO]_NUM_THREADS=1   [OMP/GOTO]_NUM_THREADS=2
  m = n  mb=nb  Pr × Pc   T1    Gflops/s   Rr        T2    Gflops/s   Rr      S2 = T1/T2
   5000   256    2 × 2   76.3     6.55   0.8E-01    54.1     9.24   0.8E-01     1.41
   5000   256    4 × 4   35.8    15.03   0.8E-01    24.0    20.98   0.8E-01     1.49
  10000   256    3 × 3    364    10.98   0.8E-01     252    15.89   0.8E-01     1.45
  10000   256    5 × 5    174    23.07   0.9E-01     126    31.79   0.9E-01     1.38
  20000   512    6 × 6    845    37.87   0.1E00      538    59.46   0.1E00      1.57
  20000   512    7 × 7    702    45.60   0.1E00      445    71.12   0.1E00      1.57
  30000   512    7 × 7   2264    47.71   0.1E00     1427    75.68   0.9E-01     1.59
  30000   512    8 × 8   1801    59.99   0.1E00     1134    95.28   0.1E00      1.59
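The S2 column is simply T1/T2; as a quick sanity check on two of the reported rows (values transcribed from Table III):

```python
# Recompute the two-thread speedup S2 = T1/T2 for two rows of Table III.

runs = [
    (76.3, 54.1),    # m = n = 5000,  2 x 2 mesh
    (845.0, 538.0),  # m = n = 20000, 6 x 6 mesh
]
for t1, t2 in runs:
    print(round(t1 / t2, 2))   # 1.41 and 1.57, matching the table
```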

We remark that multi-threading in both RECSY and the BLAS at the same time can cause difficulties, since the current standard of the threaded BLAS does not allow the user to alter the number of threads dynamically. In fact, this can cause overthreading (i.e., using more than one thread per processor), which may degrade the performance and even cause slow-downs. We avoid this by making sure that the relation OMP_NUM_THREADS · GOTO_NUM_THREADS ≤ MAX_NUM_THREADS always holds. For sarek, MAX_NUM_THREADS = 2.
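The guard against overthreading can be expressed as a simple check on the environment variables (a sketch under the assumption that both libraries read their thread counts from the environment, as described above):

```python
import os

# Sketch of the overthreading guard: the product of RECSY (OpenMP)
# threads and BLAS threads must not exceed the processors per node
# (2 on sarek). The helper name is ours, not part of SCASY.

def threads_ok(max_threads=2):
    omp = int(os.environ.get("OMP_NUM_THREADS", "1"))
    goto = int(os.environ.get("GOTO_NUM_THREADS", "1"))
    return omp * goto <= max_threads

os.environ["OMP_NUM_THREADS"] = "2"
os.environ["GOTO_NUM_THREADS"] = "1"
print(threads_ok())    # True: 2 * 1 <= 2
```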

4. SUMMARY AND FUTURE WORK

We have presented the high performance parallel library SCASY for solving eight common matrix equations. Including the different transpose and sign variants, SCASY provides 44 parallel solvers. In addition, we have demonstrated its performance on a distributed memory machine with shared memory nodes, which shows SCASY's capacity to concurrently exploit both the message passing model and the multiple threading model for parallel computing.

Future work will include improving the performance of the parallel reduction algorithms used in the framework of SCASY, like the parallel QR and QZ algorithms. Periodic variants of the algorithms in SCASY are also on the agenda.

5. ACKNOWLEDGEMENTS

The authors are grateful to Bjorn Adlerborn, Isak Jonsson, Lars Karlsson and Daniel Kressner for fruitful discussions on the subject. Thanks also to Per Andersson and Romaric David, who spent some time testing and debugging codes in SCASY.

REFERENCES

Adlerborn, B., Dackland, K., and Kagstrom, B. 2002. Parallel and blocked algorithms for reduction of a regular matrix pair to Hessenberg-triangular and generalized Schur forms. In Applied Parallel Computing PARA 2002, J. Fagerholm et al., Eds. Lecture Notes in Computer Science, vol. 2367. Springer-Verlag, 319–328.

Adlerborn, B., Kressner, D., and Kagstrom, B. 2006. Parallel Variants of the Multishift QZ Algorithm with Advanced Deflation Techniques. In PARA'06 - State of the Art in Scientific and Parallel Computing. Lecture Notes in Computer Science, vol. 4699. Springer, 2007 (to appear).

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J. W., Dongarra, J. J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. C. 1999. LAPACK Users' Guide, Third ed. SIAM, Philadelphia, PA.

Bartels, R. H. and Stewart, G. W. 1972. Algorithm 432: The Solution of the Matrix Equation AX + XB = C. Communications of the ACM 15, 820–826.

Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J. W., Dhillon, I., Dongarra, J. J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. C. 1997. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA.

BLACS - Basic Linear Algebra Communication Subprograms. See http://www.netlib.org/blacs/index.html.

Blanquer, I., Guerrero, D., Hernandez, V., Quintana-Orti, E., and Ruiz, P. 1998. Parallel SLICOT implementation and documentation standards.

Elmroth, E., Johansson, P., Kagstrom, B., and Kressner, D. 2001. A web computing environment for the SLICOT library. In The Third NICONET Workshop on Numerical Control Software. 53–61.

GOTO-BLAS - High-Performance BLAS by Kazushige Goto. See http://www.cs.utexas.edu/users/flame/goto/.

Granat, R., Jonsson, I., and Kagstrom, B. 2004. Combining Explicit and Recursive Blocking for Solving Triangular Sylvester-Type Matrix Equations on Distributed Memory Platforms. In Euro-Par 2004 Parallel Processing, M. Danelutto, D. Laforenza, and M. Vanneschi, Eds. Lecture Notes in Computer Science, vol. 3149. Springer, 742–750.

Granat, R. and Kagstrom, B. 2006. Parallel Algorithms and Condition Estimators for Standard and Generalized Triangular Sylvester-type Matrix Equations. In PARA'06 - State of the Art in Scientific and Parallel Computing. Lecture Notes in Computer Science, vol. 4699. Springer, 2007 (to appear).

Granat, R. and Kagstrom, B. 2007a. Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part I: Theory and Algorithms. ACM Transactions on Mathematical Software (submitted 2007).

Granat, R. and Kagstrom, B. 2007b. Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part II: The SCASY Software Library. ACM Transactions on Mathematical Software (submitted 2007).

Granat, R., Kagstrom, B., and Poromaa, P. 2003. Parallel ScaLAPACK-style Algorithms for Solving Continuous-Time Sylvester Equations. In Euro-Par 2003 Parallel Processing, H. Kosch et al., Eds. Lecture Notes in Computer Science, vol. 2790. Springer, 800–809.

Henry, G., Watkins, D. S., and Dongarra, J. J. 2002. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. SIAM J. Sci. Comput. 24, 1, 284–311.

Jonsson, I. and Kagstrom, B. 2002a. Recursive blocked algorithms for solving triangular systems. Part I. One-sided and coupled Sylvester-type matrix equations. ACM Trans. Math. Software 28, 4, 392–415.

Jonsson, I. and Kagstrom, B. 2002b. Recursive blocked algorithms for solving triangular systems. Part II. Two-sided and generalized Sylvester and Lyapunov matrix equations. ACM Trans. Math. Software 28, 4, 416–435.

Jonsson, I. and Kagstrom, B. 2003. RECSY - A High Performance Library for Solving Sylvester-Type Matrix Equations. In Euro-Par 2003 Parallel Processing, H. Kosch et al., Eds. Lecture Notes in Computer Science, vol. 2790. Springer, 810–819.

Kagstrom, B. and Poromaa, P. 1996. Computing eigenspaces with specified eigenvalues of a regular matrix pair (A, B) and condition estimation: theory, algorithms and software. Numer. Algorithms 12, 3-4, 369–407.

Kagstrom, B. and Westin, L. 1989. Generalized Schur methods with condition estimators for solving the generalized Sylvester equation. IEEE Trans. Autom. Contr. 34, 4, 745–751.

MPI - Message Passing Interface. See http://www-unix.mcs.anl.gov/mpi/.

PBLAS - Parallel Basic Linear Algebra Subprograms. See http://www.netlib.org/scalapack/pblas.

RECSY - High Performance library for Sylvester-type matrix equations. See http://www.cs.umu.se/research/parallel/recsy.

SCASY - ScaLAPACK-style solvers for Sylvester-type matrix equations. See http://www.cs.umu.se/research/parallel/scasy.

SLICOT Library in the Numerics In Control Network (NICONET). See http://www.win.tue.nl/niconet/index.html.

SLUG - ScaLAPACK Users' Guide. See http://www.netlib.org/scalapack/slug/.


V


Paper V

Parallel Eigenvalue Reordering in Real Schur Forms∗

Robert Granat1, Bo Kagstrom1, and Daniel Kressner2

1 Department of Computing Science and HPC2N, Umea University, SE-901 87 Umea, Sweden. granat, [email protected]

2 Seminar fur angewandte Mathematik, ETH Zurich, [email protected]

Abstract: A parallel variant of the standard eigenvalue reordering method for the real Schur form is presented and discussed. The novel parallel algorithm adopts computational windows and delays multiple outside-window updates until each window has been completely reordered locally. By using multiple concurrent windows, the parallel algorithm has a high level of concurrency, and most work consists of level 3 BLAS operations. The presented algorithm is also extended to the generalized real Schur form. Experimental results for ScaLAPACK-style Fortran 77 implementations on a Linux cluster confirm the efficiency and scalability of our algorithms, with more than 16 times parallel speedup using 64 processors for large scale problems. Even on a single processor, our implementation is demonstrated to perform significantly better than the state-of-the-art serial implementation.

Key words: Parallel algorithms, eigenvalue problems, invariant subspaces, direct re-ordering, Sylvester matrix equations, condition number estimates.

∗ Submitted to Concurrency and Computation: Practice and Experience, September 2007. Also published as LAPACK Working Note #192.


Parallel eigenvalue reordering in real Schur forms‡

R. Granat1, B. Kagstrom1,∗, and D. Kressner1,2†

1 Department of Computing Science and HPC2N, Umea University, SE-901 87 Umea, Sweden
2 Seminar fur Angewandte Mathematik, ETH Zurich, Switzerland

SUMMARY

A parallel variant of the standard eigenvalue reordering method for the real Schur form is presented and discussed. The novel parallel algorithm adopts computational windows and delays multiple outside-window updates until each window has been completely reordered locally. By using multiple concurrent windows, the parallel algorithm has a high level of concurrency, and most work consists of level 3 BLAS operations. The presented algorithm is also extended to the generalized real Schur form. Experimental results for ScaLAPACK-style Fortran 77 implementations on a Linux cluster confirm the efficiency and scalability of our algorithms, with more than 16 times parallel speedup using 64 processors for large scale problems. Even on a single processor, our implementation is demonstrated to perform significantly better than the state-of-the-art serial implementation.

key words: Parallel algorithms, eigenvalue problems, invariant subspaces, direct reordering, Sylvester matrix equations, condition number estimates

1. Introduction

The solution of large-scale matrix eigenvalue problems represents a frequent task in scientific computing. For example, the asymptotic behavior of a linear or linearized dynamical system is determined by the right-most eigenvalue of the system matrix. Despite the advance of iterative methods – such as Arnoldi and Jacobi-Davidson algorithms [3] – there are problems

∗ Correspondence to: Department of Computing Science and HPC2N, Umea University, SE-901 87 Umea, Sweden.
† E-mail: granat, [email protected], [email protected]
‡ Technical Report UMINF-07.20, Department of Computing Science, Umea University, Sweden. Also published as LAPACK Working Note #192.
Contract/grant sponsor: The Swedish Research Council/The Swedish Foundation for Strategic Research; contract/grant number: VR 621-2001-3284/SSF A3 02:128


where a direct method – usually the QR algorithm [14] – is preferred, even in a large-scale setting. In the example quoted above, an iterative method may fail to detect the right-most eigenvalue and, in the worst case, misleadingly predict stability even though the system is unstable [29]. The QR algorithm typically avoids this problem, simply because all and not only selected eigenvalues are computed. Also, iterative methods are usually not well suited for simultaneously computing a large portion of eigenvalues along with the associated invariant subspace. For example, an invariant subspace belonging to typically half of the eigenvalues needs to be computed in problems arising from linear-quadratic optimal control [31], see also Section 7.

Parallelization of the QR algorithm is indispensable for large matrices. So far, only its two most important steps have been addressed in the literature: Hessenberg reduction and QR iterations, see [5, 10, 19], and the resulting software is implemented in ScaLAPACK [6, 32]. The (optional) third step of reordering the eigenvalues, needed for computing eigenvectors and invariant subspaces, has not undergone parallelization yet. The purpose of this paper is to fill this gap, aiming at a complete and highly performing parallel implementation of the QR algorithm.

1.1. Mathematical problem description

Given a general square matrix A ∈ R^(n×n), computing the Schur decomposition (see, e.g., [14])

    Q^T A Q = T    (1)

is the standard approach to solving non-symmetric eigenvalue problems (NEVPs), that is, computing eigenvalues and invariant subspaces (or eigenvectors) of a general dense matrix. In Equation (1), T ∈ R^(n×n) is quasi-triangular with diagonal blocks of size 1 × 1 and 2 × 2 corresponding to real and complex conjugate pairs of eigenvalues, respectively, and Q ∈ R^(n×n) is orthogonal. The matrix T is called the real Schur form of A and its diagonal blocks (i.e., its eigenvalues) can occur in any order along the diagonal.

For any decomposition of (1) of the form

    Q^T A Q = T ≡ [ T11  T12 ]
                  [  0   T22 ]    (2)

with T11 ∈ R^(p×p) for some integer p and provided that T(p+1, p) is zero, the first p columns of the matrix Q span an invariant subspace of A corresponding to the p eigenvalues of T11 (see, e.g., [13]). Invariant subspaces are important since they often eliminate the need of explicit eigenvector calculations in applications. However, notice that if T(2, 1) = 0, v1 = Q(1 : n, 1) is an (orthonormal) eigenvector of A corresponding to the eigenvalue λ1 = T(1, 1). If T(2, 1) ≠ 0, the leading 2 × 2 block corresponds to a complex conjugate pair of eigenvalues and, in order to stay in real arithmetic, we have to compute their eigenvectors simultaneously.

Computing the ordered real Schur form (ORSF) of a general n × n matrix is vital in many applications, for example, in stable-unstable separation for solving Riccati matrix equations [34]. It can also be used for computing explicit eigenvectors of A by reordering each eigenvalue (block) of interest to the leading position in T and reading off the corresponding first (two) column(s) in Q, as illustrated above.


In [2], a direct algorithm for reordering adjacent eigenvalues in the real Schur form (2) is proposed. For the special case of a tiny matrix T where p, n − p ∈ {1, 2}, the method is as follows:

• Solve the continuous-time Sylvester (SYCT) equation

      T11 X − X T22 = γ T12,    (3)

  where γ is a scaling factor to avoid overflow in the right hand side.
• Compute the QR factorization

      [ −X ]
      [ γI ] = QR    (4)

  using Householder transformations (elementary reflectors).
• Apply Q in the similarity transformation of T:

      T̃ = Q^T T Q    (5)

• Standardize 2 × 2 block(s) if any exists.

In the method above, each swap is performed tentatively to guarantee backward stability, rejecting each swap that appears unstable with respect to a stability criterion. By applying a bubble-sort procedure based on the adjacent swap method, where all selected eigenvalues (usually pointed to by a select vector) are moved step by step towards the top-left corner of T by swapping adjacent blocks, ordered Schur forms can be computed, see Algorithm 1. We remark that for n = 2 the adjacent swaps are performed using Givens rotations.
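For the smallest case p = n − p = 1, the three steps (3)–(5) amount to only a few lines. The following NumPy sketch (not the LAPACK code; the scaling γ is fixed to 1 and the stability test is omitted) swaps two 1 × 1 blocks:

```python
import numpy as np

def swap_1x1(T):
    """Swap the eigenvalues of a 2x2 upper triangular T via (3)-(5), gamma = 1."""
    x = T[0, 1] / (T[0, 0] - T[1, 1])           # solve SYCT: t11*x - x*t22 = t12
    Q, _ = np.linalg.qr(np.array([[-x], [1.0]]), mode="complete")  # QR of [-X; I]
    return Q.T @ T @ Q, Q                       # similarity transformation

T = np.array([[2.0, 5.0],
              [0.0, 7.0]])
T2, Q = swap_1x1(T)
# the diagonal of T2 is now (7, 2) and T2[1, 0] is numerically zero
```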

This paper presents parallel algorithms for blocked variants [27] of the method in [2]. Furthermore, we apply our techniques to parallel eigenvalue reordering in the generalized real Schur form [21].

2. Serial blocked algorithms for eigenvalue reordering

The algorithm in [2] is implemented in LAPACK [1, 28] in the following hierarchy of routines: DTRSEN, which computes an ordered real Schur form and (optionally) associated condition numbers for the selected cluster of eigenvalues and the resulting associated invariant subspace; DTREXC, which moves a specified eigenvalue from a specific start position IFST to a specific end position ILST; and DLAEXC, which performs direct swaps of adjacent eigenvalues and updates the matrices T and Q with respect to each swap (see Section 1).

An analogous software hierarchy exists for reordering in the generalized real Schur form [21, 25].

2.1. Working with computational windows and update regions

The LAPACK-style reordering [2] leads to matrix multiply updates in T and Q by small (2 × 2, 3 × 3, 4 × 4) orthogonal transformations. Essentially, the performance is in the range of level 1 and 2 BLAS [12, 7].


Algorithm 1 Reordering a real Schur form (LAPACK’s DTRSEN)

Input: A matrix T ∈ R^(n×n) in real Schur form with m diagonal blocks of size 1 × 1 and 2 × 2, an orthogonal matrix Q ∈ R^(n×n) and a subset of eigenvalues Λs, closed under complex conjugation.

Output: A matrix T̃ ∈ R^(n×n) in ordered real Schur form and an orthogonal matrix Q̃ ∈ R^(n×n) such that T̃ = Q̃^T T Q̃. For some integer j, the set Λs is the union of eigenvalues belonging to the j upper-left-most diagonal blocks of T̃. The matrices T and Q are overwritten by T̃ and QQ̃, respectively.

j ← 0
for i ← 1, . . . , m do
  if λ(Tii) ⊂ Λs then
    j ← j + 1, select(j) ← i
  end if
end for
top ← 0
for l ← 1, . . . , j do
  for i ← select(l), select(l) − 1, . . . , top + 1 do
    Swap Ti−1,i−1 and Tii by an orthogonal similarity transformation and apply this
    transformation to the rest of the columns and rows of T, and the columns of Q.
  end for
  top ← top + 1
end for
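A serial, unblocked NumPy sketch of this bubble-sort procedure for the 1 × 1 case (direct adjacent swaps via a local Sylvester solve; no stability test, no 2 × 2 blocks, helper names are ours):

```python
import numpy as np

def reorder_schur(T, select):
    """Move the selected 1x1 eigenvalues of an upper triangular T to the top
    by adjacent swaps, accumulating the orthogonal transformations in Q."""
    T, n = T.copy(), T.shape[0]
    Q = np.eye(n)
    top = 0
    for i in [k for k in range(n) if select[k]]:
        for j in range(i, top, -1):                    # bubble T[j,j] up to top
            x = T[j-1, j] / (T[j-1, j-1] - T[j, j])    # local SYCT solution
            G, _ = np.linalg.qr(np.array([[-x], [1.0]]), mode="complete")
            T[j-1:j+1, :] = G.T @ T[j-1:j+1, :]        # row update
            T[:, j-1:j+1] = T[:, j-1:j+1] @ G          # column update
            Q[:, j-1:j+1] = Q[:, j-1:j+1] @ G          # accumulate Q
            T[j, j-1] = 0.0                            # clean up roundoff
        top += 1
    return T, Q

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 4.0, 5.0],
              [0.0, 0.0, 6.0]])
Tn, Q = reorder_schur(A, [False, False, True])
# eigenvalue 6 now leads the diagonal, and Q.T @ A @ Q reproduces Tn
```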

Following the ideas from [26, 27], we may delay and accumulate all updates from inside a computational window until the reordering is completed locally. Then the update regions of T (and the corresponding regions of Q) are updated, preferably in level 3 operations, see Figure 1. The window is then moved towards the top-left part of the matrix, such that the cluster now ends up in the bottom-right part of the current window, and the reordering process is repeated.

In general, this strategy leads to a significant performance gain by improving the memory access pattern and diminishing the number of cache misses. Notice that a similar technique was employed for the variants of the QR algorithm presented in [8, 9]. The idea of delaying updates is however not new (see, e.g., [11, 26] and the references therein).

The performance of the block algorithm is controlled by a few parameters, e.g., the size of the computational window nwin, and the number of eigenvalues to move inside each window, neig. A recommended choice is neig = nwin/2 [27]. Other parameters to tune are rmmult, which defines the threshold for when to use the orthogonal transformations in their factorized form (Householder transformations and Givens rotations) or their accumulated form by matrix multiplication, and nslab, which is used to divide the updates of the rows of T into blocks of columns for an improved memory access pattern in case the orthogonal transformations are applied in their factorized form.


Figure 1. Working with computational windows (red) and delaying the updates for level 3 BLAS operations on update regions (green) in the matrix T.

If only 1 × 1 blocks are reordered in the current position of the window and matrix multiplication is used, the orthogonal transformation matrix has the structure

    U = [ U11  U12 ]
        [ U21  U22 ],    (6)

i.e., the submatrices U12 ∈ R^((nwin−k)×(nwin−k)) and U21 ∈ R^(k×k) are lower and upper triangular, respectively, and k ≤ neig is the number of eigenvalues moved from the bottom to the top inside the window. This structure can be exploited by replacing a single call to GEMM by two calls to GEMM and TRMM, which sometimes can lead to reduced execution times (see [27] for details). Here, GEMM is an acronym for the general matrix multiply and add operation C ← βC + α · op(A) · op(B), where op(·) denotes a matrix or its transpose. TRMM is an acronym for the triangular matrix multiply operation in level 3 BLAS [1].
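The effect of structure (6) can be illustrated in plain NumPy. The U below only mimics the sparsity pattern of an accumulated transformation (it is not orthogonal), and the triangular-block products stand in for TRMM calls:

```python
import numpy as np

rng = np.random.default_rng(0)
nwin, k = 6, 2                       # window size and #eigenvalues moved to the top
m = nwin - k

U = np.zeros((nwin, nwin))
U[:m, :k] = rng.standard_normal((m, k))            # U11: full
U[:m, k:] = np.tril(rng.standard_normal((m, m)))   # U12: lower triangular
U[m:, :k] = np.triu(rng.standard_normal((k, k)))   # U21: upper triangular
U[m:, k:] = rng.standard_normal((k, m))            # U22: full

A = rng.standard_normal((5, nwin))   # a slab of T to the right of the window
A1, A2 = A[:, :m], A[:, m:]

C_one_gemm = A @ U                   # one GEMM over the whole window
C_split = np.hstack([A1 @ U[:m, :k] + A2 @ U[m:, :k],    # GEMM + TRMM (U21)
                     A1 @ U[:m, k:] + A2 @ U[m:, k:]])   # TRMM (U12) + GEMM
assert np.allclose(C_one_gemm, C_split)
```

The split costs fewer flops than the single GEMM because the triangular factors carry only half the nonzeros.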

We present the overall block reordering method in Algorithm 2.

3. Parallel blocked algorithms for eigenvalue reordering

Going parallel, we adopt the ScaLAPACK (see, e.g., [6, 32]) conventions of the parallel distributed memory (DM) environment, as follows:

• The parallel processes are organized into a rectangular Pr × Pc mesh labeled from (0, 0) to (Pr − 1, Pc − 1) according to their specific position indices in the mesh.


Algorithm 2 Blocked algorithm for reordering a real Schur form
Input and Output: See Algorithm 1. Additional input: block parameters neig (max #eigenvalues in each window) and nwin (window size).

iord ← 0  % iord = #eigenvalues already in order
while iord < #Λs do
  % Find first nev ≤ neig disordered eigenvalues from top.
  nev ← 0, ihi ← iord + 1
  while nev ≤ neig and ihi ≤ n do
    if Tii ∈ Λs then nev ← nev + 1 end if
    ihi ← ihi + 1
  end while
  % Reorder these eigenvalues window-by-window to top.
  while ihi > iord + nev do
    ilow ← max{iord + 1, ihi − nwin + 1}
    Apply Algorithm 1 to the active window T(ilow : ihi, ilow : ihi) in order to reorder the
    k ≤ nev selected eigenvalues that reside in this window to the top of the window. Let the
    corresponding orthogonal transformation matrix be denoted by U.
    Update T(ilow : ihi, ihi + 1 : n) ← U^T T(ilow : ihi, ihi + 1 : n).
    Update T(1 : ilow − 1, ilow : ihi) ← T(1 : ilow − 1, ilow : ihi)U.
    Update Q(1 : n, ilow : ihi) ← Q(1 : n, ilow : ihi)U.
    ihi ← ilow + k − 1
  end while
  iord ← iord + nev
end while

• The matrices are distributed over the mesh using 2-dimensional (2D) block cyclic mapping with the block sizes mb and nb in the row and column dimensions, respectively.

Since the matrices T and Q are square, we assume throughout this paper that nb = mb, i.e., the matrices are partitioned in square blocks. To simplify the reordering in the presence of 2 × 2 blocks, we also assume that nb ≥ 3 to avoid the situation of having two adjacent 2 × 2 blocks spanning over three different diagonal blocks in T. Moreover, we require T and Q to be aligned such that blocks Tij and Qij are held by the same process, for all combinations of i and j, 1 ≤ i, j ≤ ⌈n/nb⌉. Otherwise, shifting T and/or Q across the process mesh before (and optionally after) the reordering is necessary.
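Under this 2D block cyclic mapping with square nb × nb blocks, the process holding any matrix entry is simple bookkeeping; a minimal sketch (function name is ours):

```python
def owner(i, j, nb, Pr, Pc):
    """Mesh coordinates of the process holding entry (i, j) (0-based) of a
    matrix distributed in nb x nb blocks over a Pr x Pc process mesh."""
    return ((i // nb) % Pr, (j // nb) % Pc)

# With nb = 4 on a 2 x 3 mesh: entry (5, 13) lies in block (1, 3),
# which block cyclic mapping places on process (1 % 2, 3 % 3) = (1, 0).
assert owner(0, 0, 4, 2, 3) == (0, 0)
assert owner(5, 13, 4, 2, 3) == (1, 0)
```

When T and Q use the same distribution, `owner` returns the same coordinates for Tij and Qij, which is exactly the alignment requirement stated above.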

Below, we refer to an active process as a process that holds all or some part of a computational window in T.

3.1. The computational windows in the parallel environment

We restrict the size of the computational window to the block size used in the data layout. This means that a computational window can be in two states: either it is completely held by one single block, or it is shared by four data layout blocks: two neighboring diagonal blocks, one subdiagonal block (in presence of a 2 × 2 block residing on the block borders) and one superdiagonal block.


Figure 2. The Schur form T distributed over a process mesh of 2 × 3 processors. The computational window (red) is completely local and the update regions (green) are shared by the corresponding process row and column.

Figure 3. Broadcasting of the orthogonal transformations along the current process row and column of a 2 × 3 process mesh.

3.2. Moving eigenvalues inside a data layout block

Depending on the values of nb and nwin, each diagonal block of T is locally reordered by a certain number of partly overlapping positions of the window, moving k ≤ neig selected eigenvalues from the bottom towards the top of the block, see Figure 2. Before the window is moved to its next position inside the block, the resulting orthogonal transformations from the reordering in the current position are broadcasted in the process rows and columns corresponding to the current block row and column in T (and Q), see Figure 3. The subsequent updates are performed independently and in parallel. In principle, no other communication operations are required beside these broadcasts.

Given one active process working alone on one computational window, the possible parallel speedup during the update phase is limited by Pr + Pc.

3.3. Moving eigenvalues across the process borders

When a computational window reaches the top-left border in the diagonal block, the current eigenvalue cluster must be reordered across the border into the next diagonal block of T. This forces the computational window to be shared by more than one data layout block and (optionally) more than one process. The restrictions in Section 3.1 make sure that a computational window cannot be shared by more than four data layout blocks and by at most four different processes which together form a submesh of maximum size 2 × 2, see Figure 4.

Figure 4. The Schur form T distributed over a process mesh of 2 × 3 processors. The computational window (red) is shared by four distinct processes. The update regions (green) are shared by the corresponding two process rows and process columns.

To be able to maximize the work performed inside each diagonal block before crossing the border, and to minimize the required communication for the cross border reordering, it is beneficial to be able to control the size of the shared windows by an optional parameter ncrb ≤ nwin, that can be adjusted to match the properties of the target architecture.

The processes holding the different parts of the shared ncrb × ncrb window now cooperate to bring the window across the border, as follows:

• The on-diagonal active processes start by exchanging their parts of the window and receiving the off-diagonal parts from the two other processes. The cross border window causes updates in T and Q that span over parts of two block rows or columns. Therefore, the processes in the corresponding process rows and columns exchange blocks with their neighbors as preparation for the (level 3) updates to come, see Figure 5. The total amount of matrix elements from T and Q exchanged over the border in both directions is ncrb · (2n − ncrb − 1).

• The on-diagonal active processes compute the reordering for the current window. This requires some replication of the computations on both sides of the border. Since the total work is dominated by the off-diagonal updates, the overhead caused by the replicated work is negligible.

• Finally, the delayed and accumulated orthogonal transformations are broadcasted along the corresponding process rows and columns, and used in level 3 updates, see Figure 6.
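The stated communication volume is easy to tabulate for different window sizes; the helper below simply evaluates the expression from the first item (name is ours):

```python
def cross_border_volume(n, ncrb):
    """Matrix elements of T and Q exchanged over the border (both directions)
    for one shared ncrb x ncrb window, as given in the text."""
    return ncrb * (2 * n - ncrb - 1)

# For n = 1000, a 32 x 32 cross border window moves 32 * 1967 = 62944 elements,
# i.e., the volume grows only linearly in both n and ncrb.
assert cross_border_volume(1000, 32) == 62944
```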

Figure 5. Exchanges of data in adjacent process rows and columns for updates associated with cross border reordering.

Figure 6. Broadcasts of the computed orthogonal transformations in the current processor rows and columns for cross border reordering.

We remark that in principle the updates are independent and can be performed in parallel without any additional communication or redundant work. But if the orthogonal transformations are applied in their factorized form, each processor computing an update will also compute parts of T and/or Q that are supposed to be computed by another processor at the other side of the border. In case of more than one process in the corresponding mesh dimension, this causes duplicated work in the cross border updates, as well. Our remedy is to use a significantly lower value for rmmult, favoring matrix multiplication for cross border reordering.

In case matrix multiplication is used for the cross border reordering updates, we point out that the special structure (6) of the accumulated orthogonal transformations sometimes can be utilized. This happens when all the chosen eigenvalues have reached the other side of the border and the last eigenvalue to cross the border ends up in the position closest to the border. In such a case, the two pairs of GEMMs and TRMMs are split into two pairs of one GEMM and one TRMM, each pair being computed on each side of the border, independently and in parallel. Therefore, the size of the cross border window should be chosen as twice the number of eigenvalues to reorder across the border whenever possible.

Figure 7. Using multiple (here two) concurrent computational windows. The computational windows (red) are local but some parts of the update regions (green and blue) are shared.

3.4. Introducing multiple concurrent computational windows

By using multiple concurrent computational windows, we can work on at least kwin ≤ min(Pr, Pc) adjacent windows at the same time, computing local reordering and broadcasting orthogonal transformations for updates in parallel. With kwin = min(Pr, Pc) and a square process mesh (Pr = Pc), the degree of concurrency in the updates becomes Pr · Pc, see Figure 7.

When all kwin windows reach the process borders, they are moved into the next diagonal blocks as described in the previous section, but in two phases. Since each window requires cooperation between two adjacent process rows and columns, we number the windows by the order in which they appear on the block diagonal of T and start by moving all windows with an odd label across the border, directly followed by all windows with an even label. Care has to be taken to assure that no processor is involved in more than one cross border window at the same time. For example, if kwin = min(Pr, Pc) > 1 is an odd number, the last labelled window will involve processors which are also involved in the first window. In such a case, the last window is reordered across the border after the second (even) phase has finished.

This two-phase approach gives an upper limit of the concurrency of the local reordering and data exchange phases of the cross border reordering as kwin/2, which is half of the concurrency of the pre-cross border part of the algorithm.
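The odd/even scheduling rule can be sketched as a toy scheduler (not the Fortran implementation; windows are labelled 1, . . . , kwin along the block diagonal):

```python
def cross_border_phases(kwin):
    """Group window labels into phases so that no process participates in two
    cross border windows at once: odd labels first, then even labels; an odd
    kwin > 1 defers the last window to a third phase, since it would reuse
    processes of the first window."""
    odd = [w for w in range(1, kwin + 1) if w % 2 == 1]
    even = [w for w in range(1, kwin + 1) if w % 2 == 0]
    if kwin > 1 and kwin % 2 == 1:
        odd.remove(kwin)
        return [odd, even, [kwin]]
    return [odd, even]

assert cross_border_phases(4) == [[1, 3], [2, 4]]
assert cross_border_phases(5) == [[1, 3], [2, 4], [5]]
```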

We present a high-level version of our parallel multi-window method in Algorithms 3–4. Notice that in the presence of 2 × 2 blocks in the Schur form, the computed indices in the algorithm are subject to slight adjustments (see Section 5).


Algorithm 3 Parallel blocked algorithm for reordering a real Schur form (main part)
Input and Output: See Algorithm 2. Additional input: data layout block size nb, process mesh sizes Pr and Pc, process mesh indices myrow and mycol, maximum cross border window size nwcb, maximum number of concurrent windows kwin ≤ min(Pr, Pc).

Let W = the set of computational windows = ∅
iord ← 0   % iord = number of eigenvalues already in order
while iord < #Λs do
    for j ← 0, …, kwin − 1 do
        swin ← (⌊iord/nb⌋ + j) · nb + 1
        Add a window to W for T(swin : swin + nb − 1, swin : swin + nb − 1)
    end for
    % Reorder each window to top-left corner of corresponding diagonal block.
    for each window w ∈ W in parallel do
        if (myrow, mycol) owns w then
            % Find first nev ≤ neig disordered eigenvalues from top of my diagonal block.
            nev ← 0, ihi ← max{iord + 1, swin}
            while nev ≤ neig and ihi ≤ n do
                if Tii ∈ Λs then
                    nev ← nev + 1
                end if
                ihi ← ihi + 1
            end while
        end if
        % Reorder these eigenvalues window-by-window to top of diagonal block.
        while ihi > iord + nev do
            ilow ← max{iord + 1, ihi − nwin + 1, swin}
            if (myrow, mycol) owns w then
                Apply Algorithm 1 to the active window T(ilow : ihi, ilow : ihi) in order to
                reorder the k ≤ nev selected eigenvalues that reside in this window to the top
                of the window. Let the corresponding orthogonal transformation matrix be
                denoted by U.
                Broadcast U in process row myrow.
                Broadcast U in process column mycol.
            else if (myrow, mycol) needs U for updates then
                Receive U
            end if
            Update T(ilow : ihi, ihi + 1 : n) ← U^T T(ilow : ihi, ihi + 1 : n) in parallel.
            Update T(1 : ilow − 1, ilow : ihi) ← T(1 : ilow − 1, ilow : ihi) U in parallel.
            Update Q(1 : n, ilow : ihi) ← Q(1 : n, ilow : ihi) U in parallel.
            ihi ← ilow + k − 1
        end while
    end for
    iord ← iord + nev[top-left window in W]
    % Reorder selected clusters across block (process) border.
    Apply Algorithm 4 to W in order to reorder each computational window across the next
    block (process) border
end while
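To make the window placement in Algorithm 3 concrete, the following Python sketch computes the starting indices swin = (⌊iord/nb⌋ + j) · nb + 1 of the kwin concurrent windows. The helper name and the out-of-range check are our own illustration, not part of the actual Fortran implementation:

```python
def window_starts(iord, nb, kwin, n):
    """Starting (1-based) diagonal indices of the kwin concurrent
    computational windows used by Algorithm 3, one per nb-by-nb diagonal
    block, beginning at the block containing entry iord + 1."""
    starts = []
    for j in range(kwin):
        swin = (iord // nb + j) * nb + 1
        if swin <= n:            # skip windows falling outside the matrix
            starts.append(swin)
    return starts

# Example: n = 720, block size nb = 180, 4 concurrent windows,
# 90 eigenvalues already in order.
print(window_starts(90, 180, 4, 720))   # [1, 181, 361, 541]
```

Each window sits at the top-left corner of a distinct diagonal block, so with kwin ≤ min(Pr, Pc) the windows land on different diagonal processors and can be reordered concurrently.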


168 R. GRANAT, B. KAGSTROM AND D. KRESSNER

Algorithm 4 Parallel blocked algorithm for reordering a real Schur form (cross border part)
Input and Output: See Algorithm 3. Additional input: set of computational windows W.

for each odd window w ∈ W in parallel do
    Form the 2 × 2 process submesh G = {(0,0), (0,1), (1,0), (1,1)} corresponding to w
    iihi ← min{iilo − 1, swin + nwcb/2}, iilo ← max{iihi − nwcb + 1, iord}
    if (myrow, mycol) ∈ G then
        Exchange data in G to build w at (0,0) and (1,1)
        if (myrow, mycol) ∈ {(0,0), (1,1)} then
            Apply Algorithm 1 to w to compute U and k.
            Broadcast U in process row myrow.
            Broadcast U in process column mycol.
        end if
    end if
    if myrow ∈ prows(G) or mycol ∈ pcols(G) then
        Exchange local parts of T(ilow : ihi, ihi + 1 : n), T(1 : ilow − 1, ilow : ihi) and
        Q(1 : n, ilow : ihi) for updates with neighboring processes in parallel.
        Receive U.
        Update T(ilow : ihi, ihi + 1 : n) ← U^T T(ilow : ihi, ihi + 1 : n) in parallel.
        Update T(1 : ilow − 1, ilow : ihi) ← T(1 : ilow − 1, ilow : ihi) U in parallel.
        Update Q(1 : n, ilow : ihi) ← Q(1 : n, ilow : ihi) U in parallel.
    end if
end for
if iilo = iord then
    iord ← iord + k
end if
for each even window w ∈ W in parallel do
    % Similar algorithm as for the odd case above.
end for

4. Performance analysis

In this section, we analyze the parallel performance of Algorithms 3–4 and derive a model of the parallel runtime using p processes,

Tp = Ta + Tc,   (7)

where Ta and Tc denote the arithmetic and communication (synchronization) runtime, respectively. We assume block cyclic data distribution of T ∈ R^{n×n} over a square √p × √p process mesh using the square block factor nb (see Section 3). We define ta as the arithmetic time to perform a floating point operation (flop), ts as the start-up time (or node latency) for sending a message in our parallel computer system, and tw as the per-word transfer time, i.e., the inverse of the bandwidth: the time it takes to send one data word (e.g., a double precision number) through one link of the interconnection network. Usually, ts and tw are assumed to be constants while ta is a function of the data locality. The communication cost model for a single point-to-point communication can be approximated by tp2p = ts + tw·l, where l denotes the message length in words, regardless of the number of links traversed [15]. However, for a one-to-all broadcast or its dual operation, all-to-one reduction, within a certain scope (e.g., a process row or process column), we assume that such an operation is performed using


a hypercube-based algorithm like recursive doubling, i.e., in O(log2 p⋆) steps, where p⋆ is the number of processors in the actual scope.
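A one-to-all broadcast by recursive doubling can be sketched as follows; this is a generic illustration of the O(log2 p⋆) step count, not the actual BLACS implementation:

```python
def recursive_doubling_broadcast(p, root=0):
    """Simulate a one-to-all broadcast among p processes (p a power of
    two) by recursive doubling; return the number of communication steps."""
    has_data = [False] * p
    has_data[root] = True
    steps = 0
    d = 1
    while d < p:
        # In each step, every process already holding the data sends it to
        # the partner at XOR-distance d; the number of holders doubles.
        for r in range(p):
            if has_data[r] and not has_data[r ^ d]:
                has_data[r ^ d] = True
        d *= 2
        steps += 1
    assert all(has_data)          # everyone received the message
    return steps

print(recursive_doubling_broadcast(8))   # 3, i.e., log2(8) steps
```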

Algorithms 1 and 2 have an arithmetic complexity of Ts = O(kn²)·ta, where 1 ≤ k < n is the number of selected eigenvalues to reorder to the top-left corner of the matrix T. The exact cost depends on the distribution of the eigenvalues over the diagonal of the Schur form. In the following, we assume 1 ≪ k < n, which implies that a majority of the updates are performed using matrix multiplication.

Given that the selected eigenvalues are uniformly distributed over the diagonal of T, the arithmetic cost of executing Algorithms 3–4 can be modelled as

Ta = ((kn² − 3kn·nwin)/p + 3kn·nwin/√p)·ta,   (8)

where the first term is the cost of the GEMM updates, which is divided evenly between the p involved processors, and the second term describes the cost for computing the local and cross border reordering in the computational windows, where the diagonal processors are working and the off-diagonal processors are idle waiting for the broadcasts to start.
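A direct transcription of model (8) into Python makes the scaling behavior easy to check; the function name and the sample numbers below are ours, not from the paper:

```python
import math

def T_a(n, k, p, nwin, ta):
    """Arithmetic runtime model (8): GEMM updates shared by all p
    processors plus window reordering done by the sqrt(p) diagonal
    processors."""
    gemm = (k * n**2 - 3 * k * n * nwin) / p
    windows = 3 * k * n * nwin / math.sqrt(p)
    return (gemm + windows) * ta

# With ta fixed, quadrupling p cuts the GEMM term by 4 and the window
# term by 2, so the modelled total strictly decreases.
t16 = T_a(6000, 3000, 16, 60, 1e-9)
t64 = T_a(6000, 3000, 64, 60, 1e-9)
print(t16 > t64 > 0)   # True
```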

In principle, bubble-sorting the eigenvalues in the unblocked Schur form can be seen as bubble-sorting the diagonal nb × nb blocks of the blocked Schur form in O((n/nb)²) steps. In each step, each local window is reordered to the top-left corner of the corresponding diagonal block and reordered across the border. By working with √p concurrent windows in parallel (which is the upper limit for a square grid), the communication cost Tc can be modelled as

Tc = (DT²/√p)·(tlbcast + tcwin + tcbcast + trcexch),   (9)

where DT = ⌈n/nb⌉ is the number of diagonal blocks in T, and tlbcast, tcwin, tcbcast, and trcexch model the cost of the broadcasts local to the corresponding process row and column, the point-to-point communications associated with the construction of the cross border window, the cross border broadcasts in the two corresponding process rows and columns, and the data exchange between neighboring processes in the corresponding process rows and columns for the cross border updates, respectively.

The number of elements representing the factorized form of the orthogonal transformation can be approximated by nwin², and the broadcasts are initiated once for each time a window is moved to the next position in the corresponding block. Based on these observations, we model tlbcast as

tlbcast = (2nb/nwin)·(ts + nwin²·tw)·log2 √p.   (10)

By a similar reasoning and taking into account the lower degree of concurrency in the cross border operations, see Section 3.1, we model tcwin + tcbcast by

tcwin + tcbcast = 4kcr·(ts + ncrb²·tw)·(log2 √p + 1),   (11)

where kcr is the number of passes necessary for bringing the whole locally collected eigenvalue cluster across the border (kcr = 1 if ncrb = nwin).

The exchanges of row and column elements in T and column elements in Q between neighboring processes suffer from a lower degree of concurrency. Moreover, we cannot guarantee that the send and receive take place concurrently rather than in sequence. The average length of each matrix row or column to exchange is n/2, and about half of the cross border window is located on each side of the border. To sum up, we model trcexch as

trcexch = 12kcr·(ts + (n·ncrb/(4√p))·tw).   (12)

Using the derived expressions and assuming kcr = 1, i.e., ncrb = nwin, Tc can be boiled down to the approximation

Tc ≈ ((2n²/(nb·nwin) + 4n²/nb² + 12n²/(log2 √p · nb²))·ts
     + (2n²·nwin/nb + 4n²·nwin²/nb² + 3n³·nwin/(log2 √p · nb²))·tw)·(log2 √p + 1)/√p.

The dominating term is the last fraction of the part associated with tw and comes from the data exchange associated with the updates (see Figure 5); it is of order O(n³/(nb·√p)) assuming nwin = O(nb), and in the general case the communication cost in the algorithm will be dominated by the size of this contribution. The influence of this term is diminished by choosing nb·√p as close to n as possible, thereby reducing the term closer to O(n²), which may be necessary when the arithmetic cost is not dominated by GEMM updates. For example, for n = 1500, nb = 180 and p = 64, we have nb·√p = 1440 ≈ n (see also Section 6). In general and in practice, we will have nb·√p = n/l, where l < n is the average number of row or column blocks of T distributed to each processor in the cyclic distribution. The scenario to strive for is l ≪ k, where k is the number of selected eigenvalues. Then we perform significantly more arithmetic than communication, which is a rule of thumb in all types of parallel computing. Whether this is possible depends on the problem, i.e., the number of selected eigenvalues and their distribution over the diagonal of T. Our derived model is compared with real measured data in Section 6.
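To see how the modelled terms trade off numerically, the following Python sketch (our own transcription of the approximation above, with machine constants taken from the seth estimates in Table II) evaluates Tc and the nb·√p ≈ n rule of thumb:

```python
import math

def T_c(n, nb, nwin, p, ts, tw):
    """Approximation of the communication model Tc for k_cr = 1
    (n_crb = n_win), as derived in Section 4."""
    L = math.log2(math.sqrt(p))
    ts_part = (2 * n**2 / (nb * nwin)
               + 4 * n**2 / nb**2
               + 12 * n**2 / (L * nb**2)) * ts
    tw_part = (2 * n**2 * nwin / nb
               + 4 * n**2 * nwin**2 / nb**2
               + 3 * n**3 * nwin / (L * nb**2)) * tw
    return (ts_part + tw_part) * (L + 1) / math.sqrt(p)

n, nb, p = 1500, 180, 64
print(nb * int(math.sqrt(p)))                    # 1440, close to n = 1500
tc_small = T_c(1500, 180, 60, 64, 3.7e-6, 1.1e-8)
tc_large = T_c(3000, 180, 60, 64, 3.7e-6, 1.1e-8)
print(tc_large > tc_small > 0)                   # True: Tc grows with n
```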

5. Implementation issues

In this section, we address some important implementation issues.

5.1. Very few eigenvalues per diagonal block

In the extreme case, there will be only one eigenvalue per diagonal block to reorder. In sucha situation, we would like to gather larger clusters of eigenvalues over the borders as quicklyas possible to increase the serial performance of the local reordering of the diagonal blocks.However, this can be a bad idea since the scalability and the overall efficiency in hardwareutilization will be poor when the number of concurrent windows and the concurrently workingprocessors are kept low. Our parallel multi-window method is consequently self-tuning in thesense that its serial performance will vary with the number of selected eigenvalues and, in somesense, their distribution across the main block diagonal of T .

5.2. Splitting clusters in cross border reordering

Sometimes when we cross the process borders, not all the eigenvalues of the cluster can bemoved across the border because of lack of reserved storage at the receiver, e.g., when there


are other selected eigenvalues on the other side which occupy entries in T close to the border.Then the algorithm splits the cluster in two parts, roughly half on each side of the border,to keep the cross border region as small as necessary. In such a way, more work is performedwith a higher rate of concurrency in the pre-cross border phase: the eigenvalues left behind arepushed closer to the border and reordered across the border in the next cross border sweepover the diagonal blocks.

5.3. Size of cross border window and shared 2× 2 blocks

To simplify the cross border reordering in the presence of any 2 × 2 block being shared on the border, we keep the top-left part of the cross border window at a minimum dimension of 2. This ensures that the block does not stay in the same position, causing an infinite loop. For a similar reason we keep the bottom-right part of the window at a minimum dimension of 3 if a non-selected eigenvalue lies between the border and a selected 2 × 2 block.

In our implementation, the size of the cross border window is determined by the parameter nceig, which controls the number of eigenvalues that cross the border. In practice, ncrb ≈ 2·nceig, except for the case when it is not possible to bring all eigenvalues in the current cluster across the border for the reasons discussed above.

5.4. Detection of reordering failures and special cases

Due to the distributed nature of the parallel algorithm, failures in reordering within anycomputational window and special cases, like when no selected eigenvalue was found or movedin a certain computational window, must be detected at runtime and signaled to all processorsin the affected scope (e.g., the corresponding processor row(s), processor column(s) or thewhole mesh) at specific synchronization points. In the current implementation, all processorsare synchronized in this way right before and right after each computational window is movedacross the next block border.

5.5. Organization of communications and updates

In a practical implementation of Algorithms 3–4, the broadcasts of the orthogonal transformations, the data exchanges for cross border row and column updates, and the associated updates of the matrices T and Q should be organized to minimize any risk of sequential bottlenecks. For example, all broadcasts in the row direction should be started before any column oriented broadcast starts, to ensure that no pair of broadcasts has intersecting scopes (see Figure 7). In practice, this approach also encourages an implementation that performs all row oriented operations, computations and communications, before any column oriented operations take place. For such a variant of the algorithm, all conclusions from Section 4 are still valid. This technique also paves the way for a greater chance of overlapping communications with computations, possibly leading to better parallel performance.


5.6. Condition estimation of invariant subspaces

Following the methodology of LAPACK, our implementation also computes condition numbersfor the invariant (deflating) subspaces and the selected cluster of eigenvalues using the recentlydeveloped software package SCASY [16, 17, 30] and adopting a well-known matrix normestimation technique [18, 20, 23] in combination with parallel high performance software forsolving different transpose variants of the triangular (generalized) Sylvester equations.

6. Experimental results

In this section, we demonstrate the performance of a ScaLAPACK-style parallel Fortran 77 implementation of Algorithms 3–4 called PBDTRSEN. All experiments were carried out in double precision real arithmetic (εmach ≈ 2.2 × 10⁻¹⁶).

Our target parallel platform is the Linux cluster seth, which consists of 120 dual AMD Athlon MP2000+ nodes (1.667 GHz, 384KB L1 cache), where most nodes have 1GB memory and a small number of nodes have 2GB memory. The cluster is connected with a Wolfkit3 SCI high speed interconnect having a peak bandwidth of 667 MB/sec. The network connects the nodes in a 3-dimensional torus organized as a 6 × 4 × 5 grid, where each link is "one-way" directed. In total, the system has a theoretical peak performance of 800 Gflops/sec. Moreover, seth is a platform that really puts any parallel algorithm to a tough test regarding its utilization of the narrow memory hierarchy of the dual nodes.

All subroutines and programs were compiled using the Portland Group's pgf90 6.0-5 compiler with the recommended compiler flags -O2 -tp athlonxp -fast and the following software libraries: ScaMPI (MPICH 1.2), LAPACK 3.0, ATLAS 3.5.9, ScaLAPACK / PBLAS 1.7.0, BLACS 1.1, RECSY 0.01alpha and SLICOT 4.0. All presented timings in this section are in seconds. Parallel speedup, Sp, is computed as

Sp = Tpmin/Tp, (13)

where Tp is the parallel execution time using p processors and pmin is the smallest number ofprocessors utilized for the current problem size.
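Definition (13) is trivially computable; for example, with the Table I timings for n = 6000, 5% selected eigenvalues, Random distribution (T1 = 97.6 s on one processor, T64 = 6.30 s on an 8 × 8 mesh), the small helper below (our own) reproduces the tabulated speedup:

```python
def speedup(t_pmin, t_p):
    """Parallel speedup Sp = T_pmin / T_p, Equation (13)."""
    return t_pmin / t_p

print(round(speedup(97.6, 6.30), 1))   # 15.5, as reported in Table I
```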

Some initial tests were performed by reordering 50% of the eigenvalues uniformly distributed over the diagonal of 1500 × 1500 random matrices† already in Schur form and updating the corresponding Schur vectors in Q. The purpose of these tests was to find close to optimal configuration parameters for the parallel algorithm executed with one computational window on one processor. By testing all feasible combinations of nb, nwin and neig = nceig within the integer search space {10, 20, …, 200}, for rmmult ∈ {5, 10, …, 100} and nslab = 32, we found that nb = 180, nwin = 60, neig = nceig = 30 and rmmult = 40 are optimal, with a runtime of 5.66 seconds. This is a bit slower than the result in [27] (4.74 seconds) but still a great improvement over the current LAPACK algorithm (DTRSEN), which takes over 20 seconds! The difference may

†The strictly upper part of T is a random matrix, but we construct T such that 50% of its eigenvalues are incomplex conjugate pairs.


[Figure 8. Uniprocessor performance results for the standard LAPACK algorithm DTRSEN and the parallel block algorithm PBDTRSEN on 1 cpu, reordering 50% of uniformly distributed eigenvalues of random matrices; execution time (sec.) versus matrix dimension n, for n up to 6000.]

[Figure 9. Parallel execution time (sec.) versus nceig for reordering 50% of the eigenvalues of the Schur form of 6000 × 6000 random matrices using 4 concurrent computational windows on a 4 × 4 processor mesh (n = 6000, nb = 180, nwin = 60, neig = 30, kwin = 4).]

be partly explained by the additional data copying that is performed in the parallel algorithmduring the cross border phase.

We display uniprocessor performance results for DTRSEN and PBDTRSEN in Figure 8, using theoptimal configuration parameters listed above. The speedup is remarkable for large problems;n = 5700 shows a speedup of 14.0 for the parallel block algorithm executed on one processorcompared to the standard algorithm, which is mainly caused by an improved memory referencepattern due to the rich usage of high performance level 3 BLAS.

Next, using the close to optimal parameter settings, we made experiments on a 4 × 4 processor mesh using 4 computational windows on 6000 × 6000 matrices of the same form as above (which puts the same memory load per processor as one 1500 × 1500 matrix does on one processor) to find a close to optimal value of nceig in the integer search space {5, 10, …, 30}. In this case, we ended up with nceig = 30 as the optimum, with a corresponding execution time of 71.53 seconds, see Figure 9. It turned out that the execution time increased with a decreasing nceig. For instance, the same problem took 119.77 seconds using nceig = 5. With our analysis in mind, this is not surprising since using neig = nceig brings each subcluster of eigenvalues across the process border in one single step, thereby minimizing the number of messages sent across the border for each subcluster, i.e., the influence on the parallel runtime caused by the node latency is minimized.

In Table I, we present representative performance results for PBDTRSEN for different variations of the number of selected eigenvalues, their distribution over the diagonal of the Schur form and the number of utilized processors. All presented results are obtained executing the parallel reordering algorithm with the close to optimal parameter settings, using as many computational windows as possible (kwin = √p). For PBDTRSEN, the parallel speedup goes up to 16.6 for n = 6000 using an 8 × 8 processor mesh when selecting 5% of the eigenvalues from


Table I. Performance of PBDTRSEN on seth.

Sel.  Pr×Pc    n    Random        Bottom         n    Random        Bottom
                    time    Sp    time    Sp         time    Sp    time    Sp
5%    1×1   1500    1.98  1.00    2.16  1.00   4500   42.7  1.00    42.4  1.00
      2×2   1500    0.90  2.20    1.02  2.12   4500   15.9  2.69    17.6  2.41
      3×3   1500    1.01  1.96    1.04  2.08   4500   10.8  3.95    12.4  3.42
      4×4   1500    0.80  2.48    0.92  2.35   4500   6.65  6.42    7.74  5.48
      5×5   1500    0.34  5.82    0.76  2.84   4500   4.70  9.09    6.25  6.78
      6×6   1500    0.32  6.19    0.35  6.17   4500   4.26  10.0    4.51  9.40
      7×7   1500    0.32  6.19    0.32  6.75   4500   3.48  12.3    3.50  12.1
      8×8   1500    0.24  8.25    0.27  8.00   4500   3.13  13.6    3.50  12.1
35%   1×1   1500    5.78  1.00    9.87  1.00   4500    127  1.00     217  1.00
      2×2   1500    2.96  1.95    5.58  1.77   4500   55.3  2.30    96.7  2.24
      3×3   1500    2.05  2.82    4.03  2.45   4500   40.4  3.14    67.9  3.20
      4×4   1500    1.74  3.32    2.72  3.63   4500   25.5  4.98    43.2  5.02
      5×5   1500    1.51  3.83    2.28  4.33   4500   22.4  5.67    39.0  5.56
      6×6   1500    1.17  4.94    1.82  5.42   4500   16.7  7.60    26.4  8.22
      7×7   1500    1.13  5.12    1.49  6.62   4500   15.8  8.04    21.6  10.0
      8×8   1500    1.02  5.67    1.12  8.81   4500   11.8  10.8    18.7  11.6
50%   1×1   1500    5.66  1.00    12.6  1.00   4500    140  1.00     263  1.00
      2×2   1500    3.50  1.62    6.83  1.84   4500   62.7  2.23     121  2.17
      3×3   1500    1.89  2.99    4.97  2.53   4500   36.1  3.88    84.3  3.12
      4×4   1500    1.87  3.03    4.48  2.81   4500   29.2  4.79    61.5  4.28
      5×5   1500    1.54  3.68    3.20  3.93   4500   21.6  6.48    49.6  5.30
      6×6   1500    1.22  4.64    2.03  6.21   4500   18.8  7.45    34.2  7.69
      7×7   1500    1.30  4.35    1.90  6.63   4500   16.5  8.48    30.3  8.68
      8×8   1500    1.11  5.10    1.76  7.16   4500   12.2  11.5    24.2  10.9
5%    1×1   3000    12.8  1.00    19.8  1.00   6000   97.6  1.00     109  1.00
      2×2   3000    4.93  2.60    6.63  2.99   6000   34.3  2.85    39.4  2.77
      3×3   3000    3.69  3.47    4.00  4.95   6000   23.8  4.10    27.4  3.98
      4×4   3000    3.01  4.26    2.90  6.83   6000   14.0  6.97    16.3  6.69
      5×5   3000    2.97  4.31    2.54  7.80   6000   11.2  8.71    14.4  7.57
      6×6   3000    1.72  7.44    2.12  9.34   6000   8.60  11.3    8.94  12.2
      7×7   3000    1.82  7.03    2.33  8.50   6000   6.20  15.8    7.52  14.5
      8×8   3000    1.45  8.83    1.59  12.5   6000   6.30  15.5    6.58  16.6
35%   1×1   3000    40.1  1.00    72.4  1.00   6000    304  1.00     509  1.00
      2×2   3000    19.0  2.11    38.0  1.91   6000    123  2.47     220  2.31
      3×3   3000    13.8  2.91    23.2  3.12   6000   89.9  3.38     152  3.35
      4×4   3000    8.83  4.54    14.2  5.10   6000   58.8  5.17    98.4  5.17
      5×5   3000    7.81  5.13    14.8  4.89   6000   51.6  5.89    81.8  6.22
      6×6   3000    6.90  5.81    9.26  7.82   6000   37.4  8.13    57.2  8.90
      7×7   3000    5.81  6.90    9.07  7.98   6000   31.7  9.59    49.0  10.4
      8×8   3000    5.30  7.57    7.17  10.1   6000   25.2  12.1    38.7  13.2
50%   1×1   3000    50.2  1.00    91.4  1.00   6000    324  1.00     623  1.00
      2×2   3000    24.2  2.07    44.5  2.05   6000    133  2.44     275  2.27
      3×3   3000    18.9  2.66    28.1  3.25   6000   94.5  3.43     197  3.16
      4×4   3000    11.6  4.33    18.2  5.02   6000   71.5  4.53     135  4.61
      5×5   3000    9.57  5.25    17.2  5.31   6000   50.2  6.45     111  5.61
      6×6   3000    7.17  7.00    12.6  7.25   6000   36.4  8.90    71.5  8.71
      7×7   3000    6.89  7.29    11.4  8.02   6000   33.4  9.70    61.8  10.1
      8×8   3000    5.27  9.53    9.81  9.32   6000   26.4  12.3    50.1  12.4


the lower part of T ; in Table I, Random and Bottom refer to the parts of T where the selectedeigenvalues reside before the reordering starts.

Some remarks regarding the results in Table I are in order.

• In general, the parallel algorithm scales better (delivers a higher parallel speedup) for a smaller value of k, the number of chosen eigenvalues, for a given problem size. The main reason is that less computational work per processor in general leads to less efficient usage of processor resources, i.e., a lower Mflops-rate, which in turn makes the communication overhead smaller in relation to arithmetic. We also see improved processor utilization when increasing the number of selected eigenvalues to reorder from 5% to 50%. The amount of work is increased by a factor of 10, but the uniprocessor execution times increase roughly by a factor of 6 (n = 3000).

• Since only the processor rows and columns corresponding to the selected groups of four adjacent processors can be efficiently utilized during the cross border phase (see the formation of the processor groups in Algorithm 4), the performance gain going from a 2 × 2 to a 3 × 3 processor mesh is sometimes poor (even negative, see the problem n = 1500 with 5% selected eigenvalues). In principle, some performance degradation will occur when going from 2q × 2q to (2q + 1) × (2q + 1) processors, for q ≥ 1, since the level of concurrency in the cross border phase will not increase. This effect is not as visible for larger meshes, since the relative amount of possibly idle processors in the cross border part decreases with an increasing processor mesh.

To confirm the validity of the derived performance model (see Section 4), we compare theexperimental parallel speedup with parallel speedup computed from combining Equations (7),(8) and (9). The results are displayed in Figure 10. The machine specific constants ta, ts andtw are estimated as follows:

• Since the exact number of flops needed for computing a reordered Schur form is not known a priori, we approximate ta for p = 1 by ta^(1)(n, k), which is computed from Equation (8) by replacing Ta by T1, the serial runtime of the parallel algorithm on one processor. For p > 1, we model ta by

ta^(p)(n, k) = α0 + α1·n + α2·k + α3·√p,   (14)

where αi ∈ R, i = 0, …, 3. The model captures that the processor speed is expected to be proportional to the matrix data load at each processor. Since the amount of matrix data per processor and the number of selected eigenvalues per on-diagonal processor are kept fixed going from a problem instance (n, k, p) to (2n, 2k, 4p), we assume

ta^(p)(n, k) = ta^(4p)(2n, 2k).   (15)

From this assumption and the available data for ta^(1) derived from Table I, we compute the αi parameters by a least squares fit, see Table II. With these parameters and fixed values of p and/or n, the model predicts that ta^(p)(n, k) decreases for an increasing value of k. Moreover, for a fixed value of k, ta^(p)(n, k) increases with p and/or n. For such cases, the decrease of the modelled processor performance is marginal, except for large values


Table II. Experimentally determined machine parameters for seth.

Parameter   Value
α0          6.73 × 10^−9
α1          6.34 × 10^−13
α2          −3.10 × 10^−12
α3          3.08 × 10^−10
ts          3.7 × 10^−6
tw          ω · 1.1 × 10^−8

[Figure 10. Comparison of modelled and experimental parallel speedup Sp versus p for problems with Random distribution of the selected eigenvalues: experimental curves for n = 6000 with k = 5%, 35%, 50%, and modelled curves for n = 6000 with k = 5% (ω = 1), k = 35% (ω = 4) and k = 50% (ω = 4).]

of p and/or n. With an increasing size of the processor mesh √p × √p and fixed values of n and k, it is expected that individual processors perform less efficiently due to less local arithmetic work.

• The interconnection network parameters ts and tw are estimated by performing MPI-based ping-pong communication in the network, which also includes solving an overdetermined linear system based on the experimental data, see Table II. In Figure 10, we allow tw to vary by a factor ω ≥ 1, which models unavoidable network sharing, overhead from the BLACS layer and the potentially (much) lower memory bandwidth inside the dual nodes. For ω = 1, tw represents the inverse of the practical peak bandwidth for double precision data in the network of seth.
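Both parameter estimations above are least squares fits. As a self-contained illustration (entirely our own, using synthetic rather than measured data), the model (14) can be fitted via the normal equations in plain Python:

```python
import math

def fit_alphas(samples):
    """Least squares fit of model (14): t = a0 + a1*n + a2*k + a3*sqrt(p),
    via the normal equations. `samples` is a list of (n, k, p, t) tuples.
    Illustrative only; the paper fits measured data from Table I."""
    rows = [[1.0, n, k, math.sqrt(p)] for n, k, p, _ in samples]
    rhs = [t for *_, t in samples]
    m = 4
    # Form A^T A and A^T b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(m)]
    # Solve the 4x4 system by Gaussian elimination with partial pivoting.
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(ata[r][c]))
        ata[c], ata[piv] = ata[piv], ata[c]
        atb[c], atb[piv] = atb[piv], atb[c]
        for r in range(c + 1, m):
            f = ata[r][c] / ata[c][c]
            for j in range(c, m):
                ata[r][j] -= f * ata[c][j]
            atb[r] -= f * atb[c]
    alphas = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(ata[r][j] * alphas[j] for j in range(r + 1, m))
        alphas[r] = (atb[r] - s) / ata[r][r]
    return alphas

# Synthetic check: data generated from known coefficients is reproduced.
true = [2.0, 1e-3, -5e-4, 0.1]
data = [(n, k, p, true[0] + true[1]*n + true[2]*k + true[3]*math.sqrt(p))
        for n in (1500, 3000, 6000) for k in (75, 1050) for p in (1, 16, 64)]
fit = fit_alphas(data)
pred = [fit[0] + fit[1]*n + fit[2]*k + fit[3]*math.sqrt(p) for n, k, p, _ in data]
print(all(abs(a - b) <= 1e-5 * abs(b)
          for a, b in zip(pred, [t for *_, t in data])))   # True
```

Note that the normal equations square the conditioning of the problem; a QR-based fit would be preferable for real, noisy data, but the tiny 4-parameter system here keeps the sketch short.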


The comparison presented in Figure 10 shows that the performance model is not perfect, butis able to predict a parallel speedup in the same range as the actually observed results.

We close this section by remarking that the accuracy of the parallel reordering algorithm issimilar to the results presented in [2, 26, 27] since it is essentially numerically equivalent tothose algorithms. Some additional accuracy results are also presented below in Section 7.

7. Application example: parallel computation of stable invariant subspaces ofHamiltonian matrices

As an application example, we consider the parallel computation of c-stable invariant subspaces corresponding to the eigenvalues {λ : Re(λ) < 0}, and condition estimation of the selected cluster of eigenvalues and invariant subspaces, of the Hamiltonian matrix

H = [  A     bb^T
      cc^T  −A^T ],

where A ∈ R^{n/2 × n/2} is a random diagonal matrix and b, c ∈ R^{n/2 × 1} are random vectors with real entries of uniform or normal distribution. For Hamiltonian matrices, m = n/2 − kimag of the eigenvalues are c-stable, where kimag is the number of eigenvalues that lie strictly on the imaginary axis. Solving this Hamiltonian eigenvalue problem for the stable invariant subspaces can be very hard since the eigenvalues tend to cluster closely around the imaginary axis, especially when n gets large [26], leading to a very ill-conditioned separation problem.
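The block structure of H is easy to reproduce and check; the Python toy below (our own construction, with n = 4) builds H and verifies the defining Hamiltonian property that JH is symmetric, where J = [0 I; −I 0]:

```python
import random

def build_H(n):
    """Build H = [[A, b b^T], [c c^T, -A^T]] with A a random diagonal
    (n/2 x n/2) matrix and random vectors b, c, as in the example."""
    m = n // 2
    a = [random.uniform(-1, 1) for _ in range(m)]     # diagonal of A
    b = [random.uniform(-1, 1) for _ in range(m)]
    c = [random.uniform(-1, 1) for _ in range(m)]
    H = [[0.0] * n for _ in range(n)]
    for i in range(m):
        H[i][i] = a[i]                                # A (diagonal)
        H[m + i][m + i] = -a[i]                       # -A^T (diagonal)
        for j in range(m):
            H[i][m + j] = b[i] * b[j]                 # b b^T
            H[m + i][j] = c[i] * c[j]                 # c c^T
    return H

def JH(H):
    """Compute J*H with J = [[0, I], [-I, 0]]: the lower block row of H
    moves to the top, the upper block row is negated and moves down."""
    n = len(H)
    m = n // 2
    return [row[:] for row in H[m:]] + [[-x for x in row] for row in H[:m]]

random.seed(0)
H = build_H(4)
P = JH(H)
print(all(abs(P[i][j] - P[j][i]) < 1e-12
          for i in range(4) for j in range(4)))   # True: H is Hamiltonian
```

The symmetry of JH holds here because bb^T and cc^T are symmetric by construction, which is exactly why this b, c parametrization yields a Hamiltonian matrix.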

The stable subspace computation includes two major steps and one optional third step:

1. Compute the real Schur form T = Q^T HQ of H.
2. Reorder the m stable eigenvalues to the top left corner of T, i.e., compute the ordered Schur decomposition T = Q^T HQ, such that the m first columns of Q span the stable invariant subspace of H; the m computed stable eigenvalues may be distributed over the diagonal of T in any order before the reordering starts, see below.
3. Compute condition number estimates for (a) the selected cluster of eigenvalues and (b) the invariant subspaces, see below.

For the first step, we utilize the parallel Hessenberg reduction routine PDGEHRD and a slightlymodified‡ version of the existing parallel QR algorithm PDLAHQR from ScaLAPACK [6]. For thesecond step, we use the parallel algorithm described in this paper for reordering the eigenvalues.For the last step we utilize the condition estimation functionality and the corresponding parallelSylvester-type matrix equation solvers provided by SCASY [16, 17, 30].

We present experimental results in Tables III and IV, where the following performance,output and accuracy quantities are presented:

‡PDLAHQR often delivers a Schur form T with some 2 × 2 blocks corresponding to two real eigenvalues. Ourmodification consists of a postprocessing step where all such ’false’ 2 × 2 blocks are removed via a novelimplementation of PDROT which applies a Givens rotation to a distributed matrix.


• In Table III, we present the parallel runtime measures t1, t2, t3a, t3b, ttot and Sp, corresponding to the individual steps in the stable subspace computation described above, the total execution time and the corresponding parallel speedup.

• In Table IV, we present the outputs m, s and sep, corresponding to the dimension of the invariant subspace and the reciprocal condition numbers of the selected cluster of eigenvalues and the stable invariant subspace, respectively.

• In Table IV, we also present the accuracy measures Ro1, Rr1, Ro2, Rr2 and Reig, corresponding to the orthogonality check

Ro⋆ = max(‖Q⋆^T Q⋆ − I‖F, ‖Q⋆ Q⋆^T − I‖F)/(εmach · n),

the relative residual

Rr⋆ = max(‖T⋆ − Q⋆^T A Q⋆‖F, ‖A − Q⋆ T⋆ Q⋆^T‖F)/‖A‖F,

and the relative change in the eigenvalues

Reig = max_{k=1,…,n} |λk − λ̃k| / |λk|,

where λ̃k denotes the kth eigenvalue after reordering. Here, ⋆ = 1 corresponds to the unordered real Schur form and ⋆ = 2 represents the ordered Schur form corresponding to the stable invariant subspace.

The condition numbers s and sep are computed as follows:

• Solve the Sylvester equation

T11X − XT22 = −γT12,

compute the norm of X in parallel, and set s = 1/√(1 + ‖X‖F²) [4].
• Compute an estimate of sep⁻¹(T11, T22) in parallel using a well-known matrix norm estimation technique [18, 20, 23] and compute sep = 1/sep⁻¹, taking care of any risk of numerical overflow.

s signals ill-conditioning if the norm of the spectral projector X on the invariant subspaceassociated with T11 becomes large, i.e., s becomes small. A large estimate sep−1 signals thatthe invariant subspace associated with T11 is ill-conditioned since the separation between theeigenvalues in T11 and T22 (and the exact value of sep(T11, T22)) is small, provided that theestimator computes a reliable estimate. Hence, a small value of sep signals an ill-conditionedinvariant subspace (see, e.g., [33] and the references therein).
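Once ‖X‖F and the estimate of sep⁻¹ are available, both reciprocal condition numbers are cheap to form; a minimal sketch follows (our own helper functions, with a crude overflow guard in place of the careful scaling used in the actual software):

```python
import math
import sys

def cluster_condition(x_fro):
    """Reciprocal condition number of the selected eigenvalue cluster:
    s = 1 / sqrt(1 + ||X||_F^2), where X solves T11*X - X*T22 = -gamma*T12."""
    return 1.0 / math.sqrt(1.0 + x_fro**2)

def subspace_condition(sep_inv_estimate):
    """sep = 1 / sep^{-1}, guarding against numerical overflow when the
    estimate is tiny (an ill-conditioned invariant subspace)."""
    if sep_inv_estimate <= 1.0 / sys.float_info.max:
        return sys.float_info.max
    return 1.0 / sep_inv_estimate

print(cluster_condition(0.0))          # 1.0: perfectly conditioned cluster
print(cluster_condition(1e6) < 1e-5)   # True: large ||X||_F, ill-conditioned
```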

The results in Tables III and IV illustrate the potential of the parallel reordering methodin real applications by delivering good serial and parallel performance. Also, the eigenvaluesare reordered with perfect accuracy, even though the corresponding invariant subspace may bevery ill-conditioned. This ill-conditioning is likely an effect of the clustering of the eigenvaluesvery close around the imaginary axis, which also, because of round-off errors, affects the abilityof the QR algorithm to deliver a correct number of stable eigenvalues, which determines thedimension (m) of the invariant subspace computed by the reordering routine.

We end this section by remarking that the condition number computations also scale well, even though they are applied to relatively small problem sizes (n/2 = 1500, 3000) in this case.


PARALLEL EIGENVALUE REORDERING IN REAL SCHUR FORMS 179

Table III. Experimental parallel performance results from computing stable invariant subspaces of Hamiltonian matrices on seth.

                           Timings
   n    Pr × Pc     t1      t2    t3a    t3b    ttot     Sp
3000     1 × 1    2211     165   13.8    107    2497   1.00
3000     2 × 2     849    95.8   6.79   70.2    1022   2.44
3000     4 × 4     256    30.5   3.16   18.3     308   8.11
3000     8 × 8     158    22.6   2.53   14.8     198   12.6
6000     2 × 2    7057     777   47.4    568    8449   1.00
6000     4 × 4    2327     360   19.7    267    2974   2.84
6000     8 × 8     693     135   8.16   99.2     935   9.04

Table IV. Experimental output and accuracy results from computing stable invariant subspaces of Hamiltonian matrices on seth.

                      Output                                    Accuracy
   n    Pr × Pc    m        s        sep        Ro1       Rr1       Ro2       Rr2      Reig
3000     1 × 1   1500  0.19E-03  0.14E-04   0.75E+00  0.98E-14  0.88E+00  0.12E-13  0.0E+00
3000     2 × 2   1503  0.27E-04  0.96E-06   0.75E+00  0.11E-13  0.86E+00  0.11E-13  0.0E+00
3000     4 × 4   1477  0.30E-04  0.40E-06   0.71E+00  0.89E-14  0.83E+00  0.11E-13  0.0E+00
3000     8 × 8   1481  0.48E-03  0.57E-06   0.73E+00  0.83E-14  0.86E+00  0.11E-13  0.0E+00
6000     2 × 2   2988  0.52E-03  0.32E-13   0.74E+00  0.13E-13  0.88E+00  0.16E-13  0.0E+00
6000     4 × 4   3015  0.57E-04  0.11E-17   0.72E+00  0.12E-13  0.89E+00  0.15E-13  0.0E+00
6000     8 × 8   2996  0.32E-04  0.25E-12   0.73E+00  0.63E-13  0.88E+00  0.64E-13  0.0E+00

8. Extension to the generalized real Schur form

We have extended the presented parallel algorithm for reordering the standard Schur form to eigenvalue reordering in the generalized Schur form

    (S, T) = Qᵀ(A, B)Z,    (16)

where (A, B) is a regular matrix pair, Q and Z are orthogonal matrices and (S, T) is the generalized real Schur form (see, e.g., [14]). Besides the fact that the parallel block algorithm now works on pairs of matrices, the generalized case does not differ in any substantial way from the standard case, except that the individual orthogonal transformation matrices from each eigenvalue swap are stored slightly differently (see [27] for details). For our prototype Fortran 77 implementation PBDTGSEN, the following close-to-optimal parameters were found by extensive tests: nb = 180, nwin = 60, neig = nceig = 30 and rmmult = 10, which resulted in a uniprocessor runtime of 18.94 seconds. This is less than 1 second slower than the results in [27], but much faster than LAPACK.
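A small serial analogue of this computation is available through SciPy, whose `ordqz` wraps the LAPACK reordering routines on which the parallel algorithm builds. The sketch below (setup ours) reorders a random regular pair so that eigenvalues inside the unit circle come first, and verifies factorizations of the form (16):

```python
import numpy as np
from scipy.linalg import qz, ordqz

rng = np.random.default_rng(2)
n = 30
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Generalized real Schur form: A = Q S Z^T, B = Q T Z^T.
S, T, Q, Z = qz(A, B, output='real')
res_qz = np.linalg.norm(A - Q @ S @ Z.T, 'fro') / np.linalg.norm(A, 'fro')

# Reorder so eigenvalues inside the unit circle ('iuc') come first,
# playing the role of PBDTGSEN's eigenvalue selection.
SS, TT, alpha, beta, QQ, ZZ = ordqz(A, B, sort='iuc', output='real')

# The reordered factors still reproduce the original pair (A, B).
res_A = np.linalg.norm(A - QQ @ SS @ ZZ.T, 'fro') / np.linalg.norm(A, 'fro')
res_B = np.linalg.norm(B - QQ @ TT @ ZZ.T, 'fro') / np.linalg.norm(B, 'fro')
```

The relative residuals res_qz, res_A and res_B are on the order of machine precision, mirroring the Rr measures reported for the standard case.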

As for the standard case, PBDTGSEN optionally computes condition numbers for the selected cluster of eigenvalues and the corresponding deflating subspaces (see, e.g., [24]) by invoking


180 R. GRANAT, B. KAGSTROM AND D. KRESSNER

Table V. Performance of PBDTGSEN on seth.

                           Random          Bottom                  Random          Bottom
Sel.   Pr × Pc     n     time     Sp     time     Sp       n     time     Sp     time     Sp
 5%     1 × 1    1500    5.12   1.00     6.56   1.00     4500    98.4   1.00     105    1.00
        2 × 2    1500    2.65   1.93     2.91   2.25     4500    39.9   2.47     48.0   2.19
        3 × 3    1500    2.66   1.92     2.74   2.39     4500    23.4   4.21     32.4   3.24
        4 × 4    1500    2.20   2.32     2.50   2.62     4500    17.3   5.69     18.6   5.65
        5 × 5    1500    1.48   3.46     1.97   3.33     4500    15.2   6.47     17.0   6.18
        6 × 6    1500    1.05   4.88     1.69   3.88     4500    10.4   9.46     11.8   8.90
        7 × 7    1500    0.77   6.65     1.25   5.25     4500    8.93   11.0     9.22   11.4
        8 × 8    1500    0.60   8.53     0.67   9.79     4500    7.33   13.4     8.15   12.9

35%     1 × 1    1500    13.6   1.00     22.0   1.00     4500    279    1.00     312    1.00
        2 × 2    1500    8.02   1.70     11.4   1.93     4500    133    2.10     218    1.43
        3 × 3    1500    7.92   1.72     7.63   2.88     4500    77.2   3.61     122    2.56
        4 × 4    1500    3.52   3.86     5.41   4.07     4500    50.4   5.54     73.7   4.23
        5 × 5    1500    3.21   4.24     4.50   4.89     4500    40.4   6.91     59.5   5.24
        6 × 6    1500    3.71   3.67     4.85   4.54     4500    32.2   8.66     43.7   7.14
        7 × 7    1500    2.21   6.16     3.33   6.61     4500    26.1   10.7     40.3   7.74
        8 × 8    1500    2.15   6.33     2.84   7.75     4500    20.8   13.4     33.3   9.37

50%     1 × 1    1500    18.9   1.00     26.9   1.00     4500    301    1.00     584    1.00
        2 × 2    1500    13.8   1.37     14.5   1.89     4500    146    2.06     288    2.03
        3 × 3    1500    11.1   1.70     10.1   2.66     4500    82.9   3.63     169    3.46
        4 × 4    1500    4.88   3.87     7.94   3.39     4500    54.9   5.48     104    5.62
        5 × 5    1500    4.01   4.71     6.57   4.10     4500    41.7   7.22     91.9   6.35
        6 × 6    1500    3.18   5.94     4.19   6.42     4500    32.8   9.18     64.9   9.00
        7 × 7    1500    3.35   5.64     4.12   6.53     4500    32.1   9.37     62.9   9.28
        8 × 8    1500    3.01   6.23     3.96   6.79     4500    25.4   11.9     45.1   12.95

 5%     1 × 1    3000    34.3   1.00     35.9   1.00     6000    -      -        -      -
        2 × 2    3000    13.5   2.54     15.3   2.35     6000    92.4   1.00     110    1.00
        3 × 3    3000    8.55   4.02     8.90   4.03     6000    51.1   1.81     58.3   1.89
        4 × 4    3000    6.13   5.60     5.79   6.20     6000    34.9   2.65     39.2   2.81
        5 × 5    3000    6.20   5.53     4.32   8.31     6000    29.2   3.16     33.0   3.33
        6 × 6    3000    4.59   7.47     3.50   10.3     6000    18.8   4.91     20.6   5.34
        7 × 7    3000    4.14   8.29     3.40   10.6     6000    16.0   5.78     19.6   5.61
        8 × 8    3000    2.69   12.8     2.69   13.3     6000    13.3   6.95     14.5   7.59

35%     1 × 1    3000    93.4   1.00     155    1.00     6000    -      -        -      -
        2 × 2    3000    44.6   2.09     74.4   2.08     6000    326    1.00     523    1.00
        3 × 3    3000    27.1   3.45     42.0   3.69     6000    180    1.81     280    1.87
        4 × 4    3000    19.2   4.86     25.4   6.10     6000    104    3.13     163    3.21
        5 × 5    3000    16.7   5.59     22.3   6.95     6000    88.1   3.70     135    3.87
        6 × 6    3000    12.4   7.53     18.1   8.56     6000    64.5   5.05     85.4   6.12
        7 × 7    3000    10.9   8.57     15.2   10.2     6000    50.7   6.43     80.9   6.46
        8 × 8    3000    9.80   9.53     11.8   13.1     6000    43.3   7.53     61.2   8.54

50%     1 × 1    3000    105    1.00     194    1.00     6000    -      -        -      -
        2 × 2    3000    53.9   1.95     98.6   1.97     6000    335    1.00     695    1.00
        3 × 3    3000    31.9   3.29     64.5   3.01     6000    187    1.79     391    1.78
        4 × 4    3000    21.5   4.88     35.8   5.42     6000    112    2.99     234    2.97
        5 × 5    3000    18.5   5.68     31.6   6.14     6000    87.5   3.83     181    3.83
        6 × 6    3000    14.1   7.45     24.3   7.98     6000    66.6   5.03     127    5.47
        7 × 7    3000    12.6   8.33     22.6   8.58     6000    62.0   5.40     114    6.10
        8 × 8    3000    9.81   10.7     20.3   9.56     6000    45.5   7.36     88.0   7.90



the generalized coupled Sylvester equation solvers and condition estimators from the SCASY software [16, 17, 30].

Finally, we repeated the experiments from the standard case, with similar experimental results; see Table V. For n = 6000, the memory available to users on one 2GB node of seth is not large enough to hold all data objects (signaled by '-'). In this case, Sp is computed using pmin = 4 in Equation (13), i.e., the presented values exemplify the speedup when going from p0 = 4 to p > p0 processors (see also n = 6000 in Table III). By assuming a parallel speedup of at least 2 when going from 1 × 1 to 2 × 2 processors for n = 6000 (see the results in Table I), we conclude that the scalability in the generalized case is as good as or even better than in the standard case.
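The baseline shift can be illustrated with a short sketch (the helper name is ours, and the baseline formula is inferred from the tabulated values rather than taken from Equation (13), which is not reproduced here). Using the 'Bottom' column for 5% selection and n = 6000 from Table V, speedups are formed relative to the smallest grid that fits the problem:

```python
def speedups(times_by_p):
    """times_by_p: dict mapping processor count p -> runtime in seconds.
    Returns dict p -> Sp, the speedup relative to the smallest available p
    (the uniprocessor run when it fits in memory, p0 = pmin otherwise)."""
    p0 = min(times_by_p)
    t0 = times_by_p[p0]
    return {p: t0 / t for p, t in sorted(times_by_p.items())}

# 5% selection, n = 6000, 'Bottom' times from Table V (grids 2x2 ... 8x8);
# the 1x1 run does not fit in one node's memory, so p0 = 4.
times = {4: 110.0, 9: 58.3, 16: 39.2, 25: 33.0, 36: 20.6, 49: 19.6, 64: 14.5}
sp = speedups(times)
```

With this convention sp[64] reproduces the tabulated Sp = 7.59 up to rounding, confirming that the n = 6000 speedups in Table V are relative to the 2 × 2 grid.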

9. Summary and future work

The lack of a fast and reliable parallel reordering routine has turned attention away from Schur-based subspace methods for solving a wide range of numerical problems in distributed memory (DM) environments. With the introduction of the algorithm presented in this paper, this situation may be subject to change.

We remark that ScaLAPACK still lacks a highly efficient parallel implementation of the QZ algorithm, and the existing parallel QR algorithm is far from delivering level 3 node performance. Moreover, the techniques developed here can be used to implement parallel versions of the advanced deflation techniques described in [9, 22].

ACKNOWLEDGEMENTS

The research was conducted using the resources of the High Performance Computing Center North (HPC2N).

REFERENCES

1. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, third edition, 1999.

2. Z. Bai and J. W. Demmel. On swapping diagonal blocks in real Schur form. Linear Algebra Appl., 186:73–95, 1993.

3. Z. Bai, J. W. Demmel, J. J. Dongarra, A. Ruhe, and H. van der Vorst, editors. Templates for the Solution of Algebraic Eigenvalue Problems. Software, Environments, and Tools. SIAM, Philadelphia, PA, 2000.

4. Z. Bai, J. W. Demmel, and A. McKenney. On computing condition numbers for the nonsymmetric eigenproblem. ACM Trans. Math. Software, 19(2):202–223, 1993.

5. M. W. Berry, J. J. Dongarra, and Y. Kim. A parallel algorithm for the reduction of a nonsymmetric matrix to block upper-Hessenberg form. Parallel Comput., 21(8):1189–1211, 1995.

6. L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. W. Demmel, I. Dhillon, J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.

7. BLAS - Basic Linear Algebra Subprograms. See http://www.netlib.org/blas/index.html.

8. K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm, I: Maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl., 23(4):929–947, 2002.



9. K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm, II: Aggressive early deflation. SIAM J. Matrix Anal. Appl., 23(4):948–973, 2002.

10. J. Choi, J. J. Dongarra, and D. W. Walker. The design of a parallel dense linear algebra software library: reduction to Hessenberg, tridiagonal, and bidiagonal form. Numer. Algorithms, 10(3-4):379–399, 1995.

11. K. Dackland and B. Kågström. Blocked algorithms and software for reduction of a regular matrix pair to generalized Schur form. ACM Trans. Math. Software, 25(4):425–454, 1999.

12. J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Software, 16:1–17, 1990.

13. J. J. Dongarra, S. Hammarling, and J. H. Wilkinson. Numerical considerations in computing invariant subspaces. SIAM J. Matrix Anal. Appl., 13(1):145–161, 1992.

14. G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.

15. A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley, 2003.

16. R. Granat and B. Kågström. Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part I: Theory and Algorithms. Technical Report UMINF-07.15, Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden. Submitted to ACM Transactions on Mathematical Software, 2007.

17. R. Granat and B. Kågström. Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part II: the SCASY Software Library. Technical Report UMINF-07.16, Department of Computing Science, Umeå University, SE-901 87 Umeå, Sweden. Submitted to ACM Transactions on Mathematical Software, 2007.

18. W. W. Hager. Condition estimates. SIAM J. Sci. Statist. Comput., 5(2):311–316, 1984.

19. G. Henry, D. S. Watkins, and J. J. Dongarra. A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures. SIAM J. Sci. Comput., 24(1):284–311, 2002.

20. N. J. Higham. Fortran codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation. ACM Trans. Math. Software, 14(4):381–396, 1988.

21. B. Kågström. A direct method for reordering eigenvalues in the generalized real Schur form of a regular matrix pair (A, B). In Linear algebra for large scale and real-time applications (Leuven, 1992), volume 232 of NATO Adv. Sci. Inst. Ser. E Appl. Sci., pages 195–218. Kluwer Acad. Publ., Dordrecht, 1993.

22. B. Kågström and D. Kressner. Multishift variants of the QZ algorithm with aggressive early deflation. SIAM J. Matrix Anal. Appl., 29(1):199–227, 2006.

23. B. Kågström and P. Poromaa. Distributed and shared memory block algorithms for the triangular Sylvester equation with sep⁻¹ estimators. SIAM J. Matrix Anal. Appl., 13(1):90–101, 1992.

24. B. Kågström and P. Poromaa. LAPACK-style algorithms and software for solving the generalized Sylvester equation and estimating the separation between regular matrix pairs. ACM Trans. Math. Software, 22(1):78–103, 1996.

25. B. Kågström and P. Poromaa. Computing eigenspaces with specified eigenvalues of a regular matrix pair (A, B) and condition estimation: theory, algorithms and software. Numer. Algorithms, 12(3-4):369–407, 1996.

26. D. Kressner. Numerical Methods and Software for General and Structured Eigenvalue Problems. PhD thesis, TU Berlin, Institut für Mathematik, Berlin, Germany, 2004.

27. D. Kressner. Block algorithms for reordering standard and generalized Schur forms. ACM Trans. Math. Software, 32(4):521–532, December 2006.

28. LAPACK - Linear Algebra Package. See http://www.netlib.org/lapack/.

29. R. Lehoucq and J. Scott. An evaluation of software for computing eigenvalues of sparse nonsymmetric matrices. Tech. Report MCS-P547-1195, Argonne National Laboratory, 1996.

30. SCASY - ScaLAPACK-style solvers for Sylvester-type matrix equations. See http://www.cs.umu.se/~granat/scasy.html.

31. V. Sima. Algorithms for Linear-Quadratic Optimization, volume 200 of Pure and Applied Mathematics. Marcel Dekker, Inc., New York, NY, 1996.

32. ScaLAPACK Users' Guide. See http://www.netlib.org/scalapack/slug/.

33. G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, New York, 1990.

34. P. Van Dooren. A generalized eigenvalue approach for solving Riccati equations. SIAM J. Sci. Statist. Comput., 2(2):121–135, 1981.