Progress and directions for LAPACK and ScaLAPACK as of 2005.
The Future of LAPACK and ScaLAPACK
Jason Riedy, Yozo Hida, James Demmel
EECS Department, University of California, Berkeley
November 18, 2005
Outline
- Survey responses: What users want
- Improving LAPACK and ScaLAPACK
  - Improved Numerics
  - Improved Performance
  - Improved Functionality
  - Improved Engineering and Community
- Two Example Improvements
  - Numerics: Iterative Refinement for Ax = b
  - Performance: The MRRR Algorithm for Ax = λx
Survey: What users want
- Survey available from http://www.netlib.org/lapack-dev/
- 212 responses, from over 100 different, non-anonymous groups
- Problem sizes: 100 (8%), 1K (26%), 10K (24%), 100K (12%), 1M (6%), other (24%)
- More than 80% interested in small-to-medium SMPs
- More than 40% interested in large distributed-memory systems
- Vendor libraries seen as faster, but buggier
- Over 20% want more than double precision; 70% want out-of-core
- Requests: high-level interfaces, low-level interfaces, parallel redistribution, and tuning
Participants
- UC Berkeley: Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Vomel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads, ...
- U Tennessee, Knoxville: Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, ...
- Other academic institutions: UT Austin, UC Davis, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara, TU Berlin, FU Hagen, U Madrid, U Manchester, U Umea, U Wuppertal, U Zagreb
- Research institutions: CERFACS, LBL, UEC (Japan)
- Industrial partners: Cray, HP, Intel, MathWorks, NAG, SGI
- You?
Improved Numerics
Improved accuracy with standard asymptotic speed (some are faster!):
- Iterative refinement for linear systems and least squares (Demmel / Hida / Kahan / Li / Mukherjee / Riedy / Sarkisyan)
- Pivoting and scaling for symmetric systems, both definite and indefinite
- Jacobi SVD (and faster) — Drmac / Veselic
- Condition numbers and estimators (Higham / Cheng / Tisseur)
- Useful approximate error estimates
Improved Performance
Improved performance with at least standard accuracy:
- MRRR algorithm for eigenvalues and the SVD (Parlett / Dhillon / Vomel / Marques / Willems / Katagiri)
- Fast Hessenberg QR and QZ (Byers / Mathias / Braman; Kagstrom / Kressner)
- Fast reductions and BLAS 2.5 (van de Geijn; Bischof / Lang; Howell / Fulton)
- Recursive data layouts (Gustavson / Kagstrom / Elmroth / Jonsson)
- Generalized SVD — Bai, Wang
- Polynomial roots from semi-separable form (Gu / Chandrasekaran / Zhu / Xia / Bindel / Garmire / Demmel)
- Automated tuning, optimizations in ScaLAPACK, ...
Improved Functionality
Algorithms:
- Updating / downdating of factorizations — Stewart, Langou
- More generalized SVDs: products, CSD — Bai, Wang
- More generalized Sylvester and Lyapunov solvers — Kagstrom, Jonsson, Granat
- Quadratic eigenproblems — Mehrmann
- Matrix functions — Higham
Implementations:
- Add "missing" features to ScaLAPACK
- Generate LAPACK and ScaLAPACK for higher precisions
Improved Engineering and Community
Use new features without a rewrite:
- Use modern Fortran 95, maybe 2003: DO ... END DO, recursion, allocation (in wrappers)
- Provide higher-level wrappers for common languages: F95, C, C++
- Automatic generation of precisions and bindings; full automation (FLAME, etc.) is not quite ready for all functions
- Tests for algorithms, implementations, and installations
Open development:
Need a community for long-term evolution: http://www.netlib.org/lapack-dev/
Lots of work to do, both research and development.
Two Example Improvements
Recent, locally developed improvements.
Improved Numerics:
Iterative refinement for linear systems Ax = b:
- Extra precision ⇒ small error, dependable estimate
- Both normwise and componentwise
- (See LAWN 165 for full details.)
Improved Performance:
MRRR algorithm for eigenvalue and SVD problems:
- Optimal complexity: O(n) per value/vector
- (See LAWNs 162, 163, 166, 167, ... for more details.)
Numerics: Iterative Refinement
Improve the solution to Ax = b:
Repeat: r = b − Ax, dx = A⁻¹r, x = x + dx
Until: good enough
Not too ill-conditioned ⇒ error O(√n ε)
[Figure: normwise error E_norm vs. condition number κ_norm, two panels, 2,000,000 cases each.]
Numerics: Iterative Refinement
Improve the solution to Ax = b:
Repeat: r = b − Ax, dx = A⁻¹r, x = x + dx
Until: good enough
Dependable normwise relative error estimate.
[Figure: normwise error E_norm vs. bound B_norm, two panels, 2,000,000 cases each.]
Numerics: Iterative Refinement
Improve the solution to Ax = b:
Repeat: r = b − Ax, dx = A⁻¹r, x = x + dx
Until: good enough
Also small componentwise errors and dependable estimates.
[Figures: componentwise error E_comp vs. condition number κ_comp, and componentwise error vs. bound B_comp; 2,000,000 cases each.]
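The difference between the two error measures is easy to see on a tiny made-up example: a solution can look excellent normwise while a small component is badly wrong, which is why the slides track both. The vectors below are purely illustrative.

```python
import numpy as np

x_true = np.array([1.0, 1e-8, 2.0])
x_comp = np.array([1.0 + 1e-10, 2e-8, 2.0])  # the tiny entry is off by 2x

# Normwise relative error: max-norm of the difference over the max-norm of x.
E_norm = np.linalg.norm(x_comp - x_true, np.inf) / np.linalg.norm(x_true, np.inf)

# Componentwise relative error: worst per-entry relative error.
E_comp = np.max(np.abs(x_comp - x_true) / np.abs(x_true))
```

Here E_norm is about 5e-9 while E_comp is 1.0: the normwise measure completely hides the fact that the small component carries no correct digits.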
Relying on Condition Numbers
Need condition numbers for dependable estimates: picking the right condition number and estimating it well.
[Figure: κ_norm computed in single vs. double precision, 2,000,000 cases.]
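The effect in the plot can be reproduced cheaply with NumPy: compute the normwise condition number κ₁ = ‖A‖₁ ‖A⁻¹‖₁ in double and again in single precision. For a moderately conditioned matrix the two agree closely; once κ approaches 1/ε_single (about 1.7e7), the single-precision value saturates and stops carrying information. The random matrix here is only a stand-in for the survey's test cases.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100))   # typically modestly conditioned

# Normwise 1-norm condition number in double precision...
kappa64 = np.linalg.cond(A, 1)
# ...and the same quantity computed entirely in single precision.
kappa32 = float(np.linalg.cond(A.astype(np.float32), 1))
```

For this well-conditioned A the two values agree to several digits; repeating the experiment with a matrix whose κ exceeds 1e7 would show the single-precision estimate diverging, as in the figure.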
Performance: The MRRR Algorithm
Multiple Relatively Robust Representations
- 1999 Householder Award honorable mention for Dhillon
- Optimal complexity with small error!
  - O(nk) flops for k eigenvalues/vectors of an n × n tridiagonal matrix
  - Small residuals: ‖Tx_i − λ_ix_i‖ = O(nε)
  - Orthogonal eigenvectors: |x_i^T x_j| = O(nε)
- Similar algorithm for the SVD
- Eigenvectors computed independently ⇒ naturally parallelizable
- (LAPACK 3 had bugs, missing cases)
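The residual and orthogonality bounds above can be checked directly from SciPy, which can route symmetric tridiagonal eigenproblems to LAPACK's MRRR routine (?stemr). The lapack_driver='stemr' keyword of scipy.linalg.eigh_tridiagonal is a SciPy detail and assumed available here; the 1-D Laplacian is just a convenient test matrix.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

# Symmetric tridiagonal 1-D Laplacian: diagonal 2, off-diagonal -1.
n = 200
d = np.full(n, 2.0)
e = np.full(n - 1, -1.0)

# Ask for the MRRR driver explicitly.
w, v = eigh_tridiagonal(d, e, lapack_driver='stemr')

# Check the O(n eps)-style guarantees quoted on the slide.
T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
residual = np.linalg.norm(T @ v - v * w)        # ||T x_i - lambda_i x_i||
orthogonality = np.linalg.norm(v.T @ v - np.eye(n))
```

Both quantities come out near n·ε, consistent with the small-residual and orthogonality claims, and each eigenpair is computed independently of the others.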
Performance: The MRRR Algorithm
[Figure: performance comparison; "fast DC": Wilkinson, deflate like crazy.]
Summary
- LAPACK and ScaLAPACK are open for improvement!
- Planned improvements in numerics, performance, functionality, and engineering.
- Forming a community for long-term development.