
  • Numerical Linear Algebra: A biased survey

    Daniel Kressner

    Chair of Numerical Algorithms and HPC, MATHICSE / SMA / SB / EPF Lausanne

    daniel.kressner@epfl.ch, http://anchp.epfl.ch

    Banff, 7.10.2014


  • Purpose of this talk

    - give a flavor of the general field of numerical linear algebra¹

    - discuss the role of sparsity¹

    - discuss the role of optimization in numerical linear algebra¹

    - point out some exciting new developments¹

    ¹ according to the strongly biased/limited view of the speaker

  • Outline

    - Numerical Linear Algebra
      - general principles
      - the beauty of black boxes
      - the beauty of error analysis

    - Numerical Linear Algebra and Sparsity and Optimization
      - sparse and data-sparse matrices
      - sparse and data-sparse vectors

  • General principles

  • Numerical linear algebra

    Typical tasks for a matrix A:

    - Linear systems: Ax = b

    - Eigenvalue problems: Ax = λx

    - Matrix functions: exp(A), log(A), √A, sign(A), ...
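
    As a concrete illustration (not part of the original slides), these three tasks map directly onto standard black-box library calls; a minimal sketch using NumPy/SciPy with a randomly generated symmetric matrix:

```python
import numpy as np
from scipy.linalg import expm, sqrtm

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
A = A + A.T + 2 * n * np.eye(n)       # symmetric, positive definite toy matrix
b = rng.standard_normal(n)

x = np.linalg.solve(A, b)             # linear system        A x = b
lam, V = np.linalg.eigh(A)            # eigenvalue problem   A x = lambda x
E = expm(A)                           # matrix function      exp(A)
S = sqrtm(A)                          # matrix function      sqrt(A)
```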

    Recurring principle: Exploitation of structure.

    Well-established: Structure in A (sparsity, symmetry, low-rank, ...)

    Current trends: Structure in x (sparsity, low-rank, incorporation of information from the underlying application, ...)

  • Numerical linear algebra in scientific computing...

    [Diagram: Application → Mathematical Model → Discretization / Linearization → Linear Algebra problem]

    Picture taken from http://en.wikipedia.org/wiki/Food_chain


  • ...a fundamental component

    Numerical linear algebra
    - often dominates computational time in scientific computing ⇒ imposes limitations on model complexity, discretization accuracy, ...
    - transfers knowledge across different disciplines
    - deeply dives into algorithmic design and analysis
    - provides black-box solvers
    - provides software libraries (LAPACK) at the heart of virtually every scientific computing package (Maple, Mathematica, NumPy, MATLAB, R, Trilinos)
    - is at the forefront of high-performance computing (LINPACK benchmark)

  • Software development – a recent example


    before 2005. Eigenvalue solver in ScaLAPACK 1.x known to be slow and buggy.

    2005–2009. Algorithmic developments (pipelined multi-shifts, parallel aggressive early deflation) and research code. Jointly with R. Granat and B. Kågström. Main publication: A novel parallel QR algorithm for hybrid distributed memory HPC systems. SIAM J. Sci. Comput., 32(4):2345–2378, 2010.

    2009–2011. Further algorithmic improvements, debugging, bug fixing, documentation, testing/benchmarking, incorporation of feedback from Intel and others. Efforts joined by PhD student M. Shao.

    November 2011. Release of production code as routine PxHSEQR in ScaLAPACK 2.0.

    2012–∞. Code maintenance.

    201? ACM TOMS software publication.

  • Software development – a recent example

    Old eigenvalue solver in ScaLAPACK 1.x vs. new eigenvalue solver (research code) vs. new eigenvalue solver in ScaLAPACK 2.0 (production code)

    [Bar charts: execution times]
    - 16 000 × 16 000 matrix on 100 cores¹: ScaLAPACK 1.x 16 755 sec, 2009 research code 575 sec, ScaLAPACK 2.0 320 sec
    - 100 000 × 100 000 matrix on 1 024 cores: 2009 research code ≈ 7 hours, ScaLAPACK 2.0 ≈ 2 hours

    ¹ Intel Xeon quadcore L5420 2.5 GHz nodes

  • The beauty of black boxes

    No need to care about what is inside.

  • A black box

    eig: Eigenvalues and eigenvectors.
    E = eig(A) produces a column vector E containing the eigenvalues of a square matrix A.
    ...

    Underlying variant of QR algorithm:
    - Braman/Byers/Mathias: The multishift QR algorithm. Part I: Maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal., 2002.
    - Braman/Byers/Mathias: The multishift QR algorithm. Part II: Aggressive early deflation.² SIAM J. Matrix Anal., 2002.

    ² Awarded SIAG/Linear Algebra + SIAM outstanding paper prizes.

  • Not a black box

    eig: Eigenvalues and eigenvectors.
    E = eig(A,abstol,reltol,maxit,ns,aed,adhoc,...) produces a column vector E containing the eigenvalues of a square matrix A, where

    - abstol is the tolerance for declaring eigenvalues converged

    - reltol is a relative convergence criterion, which may or may not improve accuracy. Choose at your own risk. Wilkinson [quote from Stewart'2002]: "... if you want to argue about it, I would rather be on your side."

    - ns is the number of shifts used in every iteration. Choose carefully! The correct choice depends on your machine as well as on your application!

    - aedsize is the size of the aggressive early deflation window. Choose carefully! The correct choice depends on your machine as well as on your application!

    - adhoc is a magic number used when things start falling apart

    - ...

  • The beauty of error analysis

  • Arnoldi/Lanczos/CG in a nutshell

    - Assume A allows for cheap matrix-vector multiplications.
    - For starting vector x_0, consider the Krylov subspace

        K_k(A, x_0) := span{x_0, Ax_0, A^2 x_0, ..., A^{k-1} x_0}.

    - Arnoldi method = Gram-Schmidt + twist applied to K_k(A, x_0).
    - Produces orthonormal basis U_k ∈ R^{n×k} of K_k(A, x_0).
    - Arnoldi decomposition:

        A U_k = U_k H_k + rank 1,

      where H_k is k × k Hessenberg.

    Symmetric A ⇒
    - H_k tridiagonal
    - three-term recurrence ⇒ Lanczos method
    - no need to store full basis U_k
    - CG/Lanczos = Galerkin with K_k(A, x_0)
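
    A minimal sketch of the Arnoldi process just described (plain Gram-Schmidt, no reorthogonalization; the function name and test matrix below are my own choices, not from the talk):

```python
import numpy as np

def arnoldi(A, x0, k):
    """Orthonormal basis U of K_k(A, x0) and Hessenberg H with
    A U[:, :k] = U[:, :k+1] @ H (the 'rank 1' remainder sits in the last row of H)."""
    n = len(x0)
    U = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    U[:, 0] = x0 / np.linalg.norm(x0)
    for j in range(k):
        w = A @ U[:, j]
        for i in range(j + 1):                 # Gram-Schmidt against previous basis vectors
            H[i, j] = U[:, i] @ w
            w = w - H[i, j] * U[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        U[:, j + 1] = w / H[j + 1, j]          # breakdown (H[j+1, j] = 0) not handled here
    return U, H

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)); A = (A + A.T) / 2   # symmetric => H is tridiagonal
U, H = arnoldi(A, rng.standard_normal(200), 30)
print(np.linalg.norm(A @ U[:, :30] - U @ H))             # Arnoldi relation holds to roundoff
```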

  • Numerical example

    Lanczos applied to symmetric matrix with isolated eigenvalue:

    [Plots: ‖U_k^T U_k − I‖_2 over iterations k = 0, ..., 100, and convergence of the 3 largest Ritz values]

    Total loss of orthogonality of U_k in finite-precision arithmetic ⇒ total failure of CG/Lanczos?
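
    A small experiment in the spirit of this slide (the matrix below is my own toy choice, not the one from the talk): run the plain three-term Lanczos recurrence without reorthogonalization and monitor ‖U_k^T U_k − I‖_2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Symmetric matrix with one isolated eigenvalue (100) and a cluster in [1, 2]
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.concatenate(([100.0], np.linspace(1, 2, n - 1)))) @ Q.T

u_prev, beta = np.zeros(n), 0.0
u = rng.standard_normal(n); u /= np.linalg.norm(u)
basis = [u]
for k in range(1, 100):
    w = A @ u - beta * u_prev
    alpha = u @ w
    w = w - alpha * u                         # three-term recurrence, no reorthogonalization
    beta = np.linalg.norm(w)
    u_prev, u = u, w / beta
    basis.append(u)
    if k % 20 == 0:
        Uk = np.column_stack(basis)
        print(k, np.linalg.norm(Uk.T @ Uk - np.eye(k + 1), 2))   # orthogonality loss grows
```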

  • Error analysis

    Paige'1976/1980: Lanczos performed in finite-precision arithmetic yields the perturbed Arnoldi decomposition

      A U_k = U_k H_k + rank 1 + ∆U_k,    ‖∆U_k‖ = O(machine precision)

    - Starting point for a deep understanding of the behavior of Lanczos in finite-precision arithmetic.

    - Subsequent results by Greenbaum, Meurant, Strakoš, Wülling, Zemke, and others.

    - Restores the reputation of several properties that hold in exact arithmetic, but there are subtle differences!

    - Survey [Meurant/Strakoš'2006] and book [Meurant'2006].

  • Data-sparse matrices

  • A tridiagonal matrix

    [Spy plot of A: 50 × 50 tridiagonal matrix, nz = 148]

    Exact matrix sparsity is a shy deer.

  • Inverse of tridiagonal matrix

    [Spy plot of A^{-1}: fully dense, nz = 2500]

    - Explicit inverse needed, e.g., in sparse covariance matrix estimation.

    - More common situation: Inverse implicitly present in Schur complements, e.g., direct sparse factorizations / signal processing on graphs.

  • Inverse of symmetric tridiagonal matrix, cond(A) ≈ 3

    [Plot of |A^{-1}|: entries with |value| ≤ 10^{-15} shown in white]

    - Classical result [Demko/Moss/Smith'1984] for banded matrices: Exponential decay of [A^{-1}]_{ij} with respect to |i − j|. Decay rate for pos. def.:

        (√cond(A) − 1) / (√cond(A) + 1)

    - Proof based on polynomial approximation of 1/x on [λ_min, λ_max].
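
    A quick numerical check of this decay rate (my own toy example; the full bound also involves a constant factor, so the comparison below is up to that constant):

```python
import numpy as np

n = 50
# Symmetric positive definite tridiagonal Toeplitz matrix
A = 2.5 * np.eye(n) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
Ainv = np.linalg.inv(A)

kappa = np.linalg.cond(A)
q = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)    # decay rate from Demko/Moss/Smith

i = n // 2
for j in range(i, i + 8):
    print(j - i, abs(Ainv[i, j]), q ** (j - i))    # observed entry vs. q^{|i-j|}
```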

  • Inverse of symmetric tridiagonal matrix, cond(A) ≈ 5, 40, 10^3

    [Same plot repeated for increasing condition numbers: the decay of |[A^{-1}]_{ij}| away from the diagonal becomes slower as cond(A) grows, in line with the decay rate (√cond(A) − 1) / (√cond(A) + 1).]

  • Inverse of symmetric tridiagonal matrix, cond(A) ≈ 10^3

    [Plot of |A^{-1}|: entries with |value| ≤ 10^{-15} shown in white]

    Extensions:
    - More general graph structures: decay with respect to the distance between two nodes [Benzi/Razouk'2007].
    - 2D structures [Canuto/Simoncini/Verani'2014].
    - Operator algebra framework [Bickel/Lindner'2012].

  • Inverse of symmetric tridiagonal matrix, cond(A) ≈ 10^3

    [Plot of A^{-1} with off-diagonal blocks highlighted]

    - Each off-diagonal block has rank 1.
      + Nestedness of singular vectors.
      = semi-separable matrix (represented by 2 vectors of length n)

    - 2 books by [Vandebril/Van Barel/Mastronardi'2007/2008].
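
    A numerical confirmation of the rank-1 structure of the off-diagonal blocks (toy example, my own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
off = rng.uniform(-1, 1, n - 1)
A = np.diag(4.0 + rng.random(n)) + np.diag(off, 1) + np.diag(off, -1)   # symmetric tridiagonal
Ainv = np.linalg.inv(A)

block = Ainv[:20, 20:]                       # an off-diagonal block of the (dense) inverse
s = np.linalg.svd(block, compute_uv=False)
print(s[:3])                                 # one dominant singular value, rest at roundoff level
```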

  • Hierarchical matrices

    - cluster tree of column/row indices ⇒ block partitioning

    - each admissible block replaced by a low-rank matrix using, e.g., adaptive cross approximation

    - clustering/admissibility decided via analytic properties (smoothness of kernel) or topological properties (graph clustering)

    - LU factorization, inversion, QR factorization, ... can be performed within the format

    - Most frequently used in applications featuring dense matrices: integral operators with nonlocal kernel.

    - HSS matrices / H^2 matrices = Hierarchical matrices + nestedness of low-rank factors. O(n log n) storage ⇒ O(n) storage

    - Books by Bebendorf, Börm, and Hackbusch.
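
    To illustrate why admissible blocks are compressible (a toy example, not from the talk): a kernel-matrix block coupling two well-separated clusters has rapidly decaying singular values, so a truncated SVD, or ACA, yields an accurate low-rank factorization.

```python
import numpy as np

# Two well-separated 1D clusters and the nonlocal kernel log|x - y|
x = np.linspace(0.0, 1.0, 300)               # row cluster
y = np.linspace(3.0, 4.0, 300)               # column cluster (admissible: well separated from x)
B = np.log(np.abs(x[:, None] - y[None, :]))

s = np.linalg.svd(B, compute_uv=False)
print(s[:8] / s[0])                          # fast singular value decay
print(np.sum(s / s[0] > 1e-10))              # numerical rank at tolerance 1e-10, far below 300
```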

  • Hierarchical matrices: Current trends

    - Inject additional analytical information to limit ranks: Gillman, Greengard, Hao, Martinsson, Zorin, ...
    - Randomized algorithms for low-rank compression: Chiu, Demanet, Greengard, Martinsson, ...
    - Parallel implementation: Dongarra et al., Keyes et al., Kriemann, Li et al., ...
    - Combination with sparse LU factorization: Chandrasekaran, Li, Xia, ...
    - Application to machine learning: kernel methods [Si/Hsieh/Dhillon'2014], covariance matrix estimation [Ballani/DK'2014].

  • Data-sparse vectors

  • Setting

    - Linear system or eigenvalue problem

        Ax = b,    Ax = λx,

      with x ∈ R^n.
    - Very large/HUGE n requires exploiting additional info beyond the structure of A.

    ⇒ Exploit compressibility of x.

    Examples:
    - stable wavelet/frame discretizations of PDEs ⇒ approximate sparsity
    - Lyapunov matrix equations in control theory/model reduction ⇒ approximate low (matrix) rank
    - discretizations of certain high-dimensional PDEs ⇒ approximate low (tensor) rank

  • General algorithms

    1. Iterate and compress:
    - Combine existing iterative solver with compression of iterates.
    - Analyzed only for stationary iterations.
    - Examples:
      - Adaptive wavelet method by Cohen/Dahmen/DeVore'2001/2002.
      - Power method + low tensor rank by Beylkin/Mohlenkamp'2002.
      - Richardson iteration + low tensor rank for stochastic PDEs by Khoromskij/Schwab'2011.
      - CG/BiCGstab + low tensor rank for parametric PDEs by DK/Tobler'2011.
      - Combination of adaptive wavelet + low tensor rank by Bachmayr/Dahmen'2014.
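
    A minimal sketch of the iterate-and-compress idea for d = 2 (toy data, my own): a power-type iteration for the dominant eigenpair of a Kronecker-structured operator, where the iterate is kept as an n × n matrix and truncated back to rank r after every step. For this particular Kronecker-sum operator the dominant eigenvector is exactly rank 1, so the truncation is lossless.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 60, 5
sym = lambda M: (M + M.T) / 2
A1, B2 = sym(rng.standard_normal((n, n))), sym(rng.standard_normal((n, n)))

# Operator A = I (x) A1 + B2 (x) I acting on X reshaped from x:  A(X) = A1 X + X B2
apply_A = lambda X: A1 @ X + X @ B2

def truncate(X, r):                           # compress: best rank-r approximation via SVD
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

X = truncate(rng.standard_normal((n, n)), r)
for _ in range(300):
    X = truncate(apply_A(X), r)               # iterate ... and compress
    X /= np.linalg.norm(X)
print(np.sum(X * apply_A(X)))                 # estimate of the largest-magnitude eigenvalue
```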

  • General algorithms

    2. Constrain and optimize:
    - Reformulate linear algebra problem as optimization problem.
    - Constrain admissible set to compressed vectors.
    - Works best for low-rank structures. Often much more efficient than iterate + compress.
    - Allows for greedy strategies.

    3. Specialized algorithms:
    - Low-rank solvers for Lyapunov and Riccati equations; see [Benner/Saak'2013], [Simoncini'2014] for surveys.
    - ...

  • Example: PDE eigenvalue problem

    Goal: Compute smallest eigenvalue for

      ∆u(ξ) + V(ξ)u(ξ) = λu(ξ)  in Ω = [0,1]^d,
      u(ξ) = 0  on ∂Ω.

    Assumption: Potential represented as

      V(ξ) = ∑_{j=1}^s V_j^{(1)}(ξ_1) V_j^{(2)}(ξ_2) · · · V_j^{(d)}(ξ_d).

    Finite difference / tensorized finite elements discretization ⇒

      A u = (A_L + A_V) u = λ u,

    with

      A_L = ∑_{j=1}^d  I ⊗ · · · ⊗ I (d−j times) ⊗ A_L ⊗ I ⊗ · · · ⊗ I (j−1 times),

      A_V = ∑_{j=1}^s  A_{V,j}^{(d)} ⊗ · · · ⊗ A_{V,j}^{(2)} ⊗ A_{V,j}^{(1)}.
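
    A small sketch of assembling A_L and A_V with dense Kronecker products, for a toy separable potential with s = 1 (my own choice); this is only feasible for tiny n and d, while for large problems these operators are applied matrix-free:

```python
import numpy as np
from functools import reduce

n, d = 10, 3
h = 1.0 / (n + 1)
# 1D finite-difference Laplacian on [0, 1] with Dirichlet boundary conditions
L1 = (2.0 * np.eye(n) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / h**2
I = np.eye(n)
kron_all = lambda mats: reduce(np.kron, mats)

# A_L = sum_j  I (x) ... (x) I  (x) L1 (x)  I (x) ... (x) I   (L1 in position j)
AL = sum(kron_all([I] * (d - 1 - j) + [L1] + [I] * j) for j in range(d))

# A_V for one separable term V(xi) = prod_j v_j(xi_j): A_V = v_d (x) ... (x) v_1
xi = np.linspace(h, 1 - h, n)
v = [np.diag(np.sin((j + 1) * np.pi * xi)) for j in range(d)]
AV = kron_all(v[::-1])

A = AL + AV                                   # n^d x n^d, here only 1000 x 1000
print(np.linalg.eigvalsh(A)[:3])              # smallest eigenvalues of the discretized operator
```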

  • Example: Henon-Heiles potential

    Consider Ω = [−10, 2]^d and potential ([Meyer et al. 1990; Raab et al. 2000; Faou et al. 2009])

      V(ξ) = (1/2) ∑_{j=1}^d σ_j ξ_j^2 + ∑_{j=1}^{d−1} ( σ_* (ξ_j ξ_{j+1}^2 − (1/3) ξ_j^3) + (σ_*^2/16) (ξ_j^2 + ξ_{j+1}^2)^2 )

    with σ_j ≡ 1, σ_* = 0.2.

    Discretization with n = 128 dof/dimension for d = 20 dimensions.

    - Eigenvector has n^d ≈ 10^42 entries.
    - Explicit storage of eigenvector would require 10^25 exabyte!


    Solved with accuracy 10^{-12} in less than 1 hour on a laptop.

  • Rayleigh quotients wrt low-rank matrices

    d = 2: symmetric n^2 × n^2 matrix A.

      λ_min(A) = min_{x ≠ 0} ⟨x, Ax⟩ / ⟨x, x⟩.

    We now...
    - reshape vector x into an n × n matrix X;
    - reinterpret Ax as a linear operator A: X ↦ A(X);
    - for example, if A = ∑_{k=1}^s B_k ⊗ A_k then

        A(X) = ∑_{k=1}^s A_k X B_k^T.

  • Rayleigh quotients wrt low-rank matrices

    d = 2: symmetric n^2 × n^2 matrix A.

      λ_min(A) = min_{X ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩

    with matrix inner product ⟨·,·⟩. We now...
    - restrict X to low-rank matrices.

  • Rayleigh quotients wrt low-rank matrices

    d = 2: symmetric n^2 × n^2 matrix A.

      λ_min(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.

    - Approximation error governed by low-rank approximability of X.
    - Solved by Riemannian optimization techniques or the alternating linear scheme (ALS).

  • ALS

    ALS for solving

      λ_min(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.

    Initially:
    - fix target rank r
    - choose U ∈ R^{m×r}, V ∈ R^{n×r} randomly, such that V is ONB

      λ̃ − λ = 6 × 10^3,   residual = 3 × 10^3

  • ALS

    ALS for solving

      λ_min(A) ≈ min_{X = UV^T ≠ 0} ⟨X, A(X)⟩ / ⟨X, X⟩.

    Fix V, optimize for U.

      ⟨X, A(X)⟩ = vec(UV^T)^T A vec(UV^T)
                = vec(U)^T (V ⊗ I)^T A (V ⊗ I) vec(U)

    ⇒ Compute the smallest eigenvalue of the reduced (rn × rn) matrix

      (V ⊗ I)^T A (V ⊗ I).

    Note: Computation of the reduced matrix benefits from the Kronecker structure of A.
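
    A compact sketch of these ALS half-steps for a Kronecker-structured symmetric operator A = ∑_k B_k ⊗ A_k (random toy data, my own; the reduced rn × rn matrices are formed explicitly, which is fine for moderate nr):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, s = 40, 3, 2
sym = lambda M: (M + M.T) / 2
Aks = [sym(rng.standard_normal((n, n))) for _ in range(s)]
Bks = [sym(rng.standard_normal((n, n))) for _ in range(s)]
orth = lambda M: np.linalg.qr(M)[0]

V = orth(rng.standard_normal((n, r)))
for sweep in range(8):
    # Fix V (ONB), optimize U: reduced matrix (V (x) I)^T A (V (x) I) = sum_k (V^T B_k V) (x) A_k
    M = sum(np.kron(V.T @ Bk @ V, Ak) for Ak, Bk in zip(Aks, Bks))
    lam, W = np.linalg.eigh(M)
    U = orth(W[:, 0].reshape((n, r), order="F"))        # vec(U) -> U, then orthonormalize
    # Fix U (ONB), optimize V: reduced matrix (I (x) U)^T A (I (x) U) = sum_k B_k (x) (U^T A_k U)
    M = sum(np.kron(Bk, U.T @ Ak @ U) for Ak, Bk in zip(Aks, Bks))
    lam, W = np.linalg.eigh(M)
    V = orth(W[:, 0].reshape((r, n), order="F").T)      # vec(V^T) -> V, then orthonormalize
    print(sweep, lam[0])                                # current value of the Rayleigh quotient

# For this small n the rank-constrained value can be compared with the exact lambda_min
print(np.linalg.eigvalsh(sum(np.kron(Bk, Ak) for Ak, Bk in zip(Aks, Bks)))[0])
```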

  • ALS

    Fix V, optimize for U.

      λ̃ − λ = 2 × 10^3,   residual = 2 × 10^3

  • ALS

    Orthonormalize U, fix U, optimize for V.

      ⟨X, A(X)⟩ = vec(UV^T)^T A vec(UV^T)
                = vec(V^T)^T (I ⊗ U)^T A (I ⊗ U) vec(V^T)

    ⇒ Compute the smallest eigenvalue of the reduced (rn × rn) matrix

      (I ⊗ U)^T A (I ⊗ U).

    Note: Computation of the reduced matrix benefits from the Kronecker structure of A.

  • ALS

    Orthonormalize U, fix U, optimize for V.

      λ̃ − λ = 1.5 × 10^{-7},   residual = 7.7 × 10^{-3}

  • ALS

    Orthonormalize V, fix V, optimize for U.

      λ̃ − λ = 1 × 10^{-12},   residual = 6 × 10^{-7}

  • ALS

    Orthonormalize U, fix U, optimize for V.

      λ̃ − λ = 7.6 × 10^{-13},   residual = 7.2 × 10^{-8}

  • d ≫ 1: Low-rank tensor formats

  • Tensor network diagrams

    - Introduced by Roger Penrose.
    - Heavily used in quantum mechanics (spin networks).

  • These are two matrices A,B

  • This is the matrix product C = AB

      C_{ij} = ∑_{k=1}^r A_{ik} B_{kj}

  • This is the matrix product C = UΣV^T

      C_{ij} = ∑_{k=1}^r ∑_{ℓ=1}^r U_{ik} Σ_{kℓ} V_{jℓ}

    If r ≪ n: Implicit representation of C via smaller matrices U, V, Σ.

  • This is a tensor X of order 3

    - X ∈ R^{n1×n2×n3} is a 3D array
    - X_{ijk} denotes entry (i, j, k)

  • This is a tensor X of order 3 in Tucker decomposition

      X_{ijk} = ∑_{ℓ1=1}^{r1} ∑_{ℓ2=1}^{r2} ∑_{ℓ3=1}^{r3} C_{ℓ1 ℓ2 ℓ3} U_{i ℓ1} V_{j ℓ2} W_{k ℓ3}

    Implicit representation of X via
    - r1 × r2 × r3 core tensor C
    - n1 × r1 matrix U spans first mode
    - n2 × r2 matrix V spans second mode
    - n3 × r3 matrix W spans third mode.
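
    The entry formula above written out with einsum for hypothetical small sizes; the mode-1 unfolding at the end illustrates that its rank equals r1.

```python
import numpy as np

rng = np.random.default_rng(0)
(n1, n2, n3), (r1, r2, r3) = (20, 25, 30), (4, 5, 6)
C = rng.standard_normal((r1, r2, r3))                  # core tensor
U, V, W = (rng.standard_normal((n, r)) for n, r in [(n1, r1), (n2, r2), (n3, r3)])

# X_ijk = sum_{l1,l2,l3} C_{l1 l2 l3} U_{i l1} V_{j l2} W_{k l3}
X = np.einsum("abc,ia,jb,kc->ijk", C, U, V, W)

X1 = X.reshape(n1, n2 * n3)                            # mode-1 unfolding
print(np.linalg.matrix_rank(X1))                       # = r1 (with probability 1)
```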

  • Tucker decomposition & multilinear rank

    Reshape tensor into matrix by slicing, e.g. for the first dimension:

      X(1) ∈ R^{n1×(n2·n3)}    (mode-1 unfolding)

    Multilinear rank of tensor X ∈ R^{n1×n2×n3} defined by the tuple

      r = (r1, r2, r3),  with r_i = rank(X(i)).

    [Tensor network diagram of the Tucker decomposition: core C connected to factors U, V, W]

    Representation of a rank-r tensor (Tucker decomposition):

      X = C ×_1 U ×_2 V ×_3 W,

    with U ∈ R^{n1×r1}, V ∈ R^{n2×r2}, W ∈ R^{n3×r3}, and core tensor C ∈ R^{r1×r2×r3}.

  • This is a tensor X of order 6 in TT decomposition

    - X implicitly represented by four r × n × r tensors and two n × r matrices
    - Quantum mechanics: MPS (matrix product states)
    - Introduced in numerical analysis by Oseledets and Tyrtyshnikov.

  • Ranks of a tensor in TT decomposition

    This partition corresponds to the low-rank factorization

      X^{(1,2,3)} = U V^T,   X^{(1,2,3)} ∈ R^{n1 n2 n3 × n4 n5 n6},  U ∈ R^{n1 n2 n3 × r3},  V ∈ R^{n4 n5 n6 × r3}.

    X^{(1,2,3)} is a matricization/unfolding/flattening/reshape of X:
    merge multi-indices (1,2,3) into row indices and multi-indices (4,5,6) into column indices.

    The ranks of X^{(1,...,µ)} for µ = 1, ..., d − 1 are the TT ranks of X.

  • When to expect good low-rank approximations?

    - Consider a given function-related tensor and study the best approximation error wrt TT ranks r.
    - Approximation error from separation wrt {x_1, ..., x_a}:

        f(x_1, ..., x_a, x_{a+1}, ..., x_d) ≈ ∑_{k=1}^r g_k(x_1, ..., x_a) h_k(x_{a+1}, ..., x_d)

      for a = 1, ..., d − 1.
    - Well-known: For analytic functions

        error ≲ exp(−r^{max{1/a, 1/(d−a)}}).

    - [Temlyakov'1992, Uschmajew/Schneider'2013]: For f ∈ B^{s,mix}

        error ≲ r^{−2s} (log r)^{2s(max{a, d−a}−1)}.

    Smoothness is neither sufficient nor necessary for high dimensions!

    Need to take into account the topology of the problem (e.g., [Hastings'2008] for quantum ground states). Extended in [DK/Uschmajew'2014].

  • Two tensors X, Y of order 6 in TT decomposition

  • Inner product of two tensors in TT decomposition

    - Carrying out the contractions requires O(d n r^4) instead of O(n^d) operations for tensors of order d.
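
    A minimal sketch of this contraction (random TT cores, my own toy sizes): sweep through the cores, contracting one mode at a time, and compare with the inner product of the explicitly formed tensors (only feasible here because d and n are small).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 6, 10, 3
ranks = [1] + [r] * (d - 1) + [1]
random_tt = lambda: [rng.standard_normal((ranks[m], n, ranks[m + 1])) for m in range(d)]
X, Y = random_tt(), random_tt()

# Inner product <X, Y> by sweeping through the cores
M = np.ones((1, 1))
for Gx, Gy in zip(X, Y):
    T = np.einsum("ab,aic->bic", M, Gx)      # contract M with the current core of X
    M = np.einsum("bic,bid->cd", T, Gy)      # contract with the current core of Y
print(M.item())

# Reference value from the full tensors
def tt_to_full(cores):
    T = cores[0]
    for G in cores[1:]:
        T = np.tensordot(T, G, axes=([T.ndim - 1], [0]))
    return T.reshape([n] * d)
print(np.sum(tt_to_full(X) * tt_to_full(Y)))
```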

  • This is a tensor X of order 16 in PEPS

    - PEPS = Projected Entangled Pair States
    - Schuch/Wolf/Verstraete/Cirac'2007: inner product of two PEPS is NP hard
    - Landsberg/Qi/Ye'2012: PEPS not Zariski closed

  • ALS for TT decompositions

    - Originates from quantum mechanics = one-site DMRG.
    - More general (numerical analysis) viewpoint developed in [Holtz/Rohwedder/Schneider'2012; Dolgov/Oseledets'2012].

    Goal:

      min { ⟨X, A(X)⟩ / ⟨X, X⟩ : X ∈ M_r, X ≠ 0 }

    [Diagram: M_r = set of tensors in TT decomposition with fixed TT ranks r]

    ALS: Choose one node t, fix all other nodes, determine the new tensor at node t by minimizing the Rayleigh quotient ⟨X, A(X)⟩ / ⟨X, X⟩. This is done for all nodes (a sweep), and sweeps are continued until convergence.

  • ALS for TT decompositions: Nuts & Bolts

    - Subproblems are of size nr^2 × nr^2:
      - Iterative method (LOBPCG) needed if nr^2 > O(10^2).
      - Availability of a good preconditioner is crucial.
      - Preconditioner can be inherited from the full problem.

    - Rank adaptation strategies:
      - DMRG: Merge/optimize/split neighbouring cores.
      - Cheaper alternative: Enrich neighbouring cores with local (preconditioned) gradient information [Dolgov/Savostyanov'2013], [DK/Steinlechner/Uschmajew'2013].

    - Computation of several eigenvalues possible with block-TT format [Dolgov/Khoromskij/Oseledets/Savostyanov'2014].

    - Local convergence results in [Rohwedder/Uschmajew'2013, Uschmajew/Vandereycken'2013].

  • Numerical Experiments - Henon-Heiles, d = 20

    [Plot (ALS): eigenvalue error (err_lambda), residual (res), and iteration count (nr_iter) vs. execution time [s]]

    Size = 128^20 ≈ 10^42. Maximal TT rank 40.

  • Numerical Experiments - Henon-Heiles, d = 100

    [Plot: residual and eigenvalue error vs. execution time [s]]

    - spectral discretization with 10 dof/dimension ⇒ size 10^100
    - Algorithm: Combination of ALS with preconditioned residuals [DK/Steinlechner/Uschmajew'2013].
    - rank adaptivity
    - computed smallest eigenvalue = 70.7415

  • Low-rank tensor techniques

    - Emerged during the last five years in numerical analysis.
    - Successfully applied to:
      - parameter-dependent / multi-dimensional integrals;
      - electronic structure calculations: Hartree-Fock / DFT;
      - stochastic and parametric PDEs;
      - high-dimensional Boltzmann / chemical master / Fokker-Planck / Schrödinger equations;
      - micromagnetism;
      - rational approximation problems;
      - computational homogenization;
      - computational finance;
      - multivariate regression and machine learning;
      - queuing models;
      - context-aware recommender systems;
      - ...
    - For references on these applications, see
      - L. Grasedyck, DK, Ch. Tobler (2013). A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen, 36(1).
      - W. Hackbusch (2012). Tensor Spaces and Numerical Tensor Calculus, Springer.

  • Conclusions

    - Numerical linear algebra is alive and healthy.
    - Lots of potential to incorporate techniques from compressed sensing/optimization/... into numerical linear algebra algorithms.

    Some important aspects not covered:

    - Preconditioning.
    - Randomized algorithms.
    - Communication-avoiding algorithms.
    - Merging iterative methods with (adaptive) discretization.
    - ...