1
High-Performance Eigensolver for Real Symmetric Matrices:
Parallel Implementations and
Applications in Electronic Structure Calculation
Yihua Bai
Department of Mathematics and Computer Science
Indiana State University
2
Contents
Current status of real symmetric eigensolvers
Motivation
BD&C algorithm – a high-performance approximate eigensolver
Parallel implementations of the BD&C algorithm
Applications in electronic structure calculation and numerical results
Summary and future work
3
Current Status of Dense Symmetric Eigensolvers
ScaLAPACK routines: PDSYEVD (divide and conquer), PDSYEVX (bisection and inverse iteration), PDSYEVR (MRRR)
4
Classical Three Steps to Decompose $A = X\Lambda X^T$

1. Reduction to symmetric tridiagonal form: $A = H T H^T$
2. Eigendecomposition of the tridiagonal matrix: $T = V\Lambda V^T$, via one of:
   - Cuppen's divide-and-conquer
   - Bisection and inverse iteration
   - Multiple Relatively Robust Representations (MRRR)
3. Back-transformation of the eigenvectors: $X = HV$
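These three steps can be sketched in a few lines of Python. This is a minimal serial illustration using SciPy stand-ins for the underlying LAPACK kernels, not the parallel ScaLAPACK routines the slides benchmark:

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
A = (A + A.T) / 2                        # real symmetric test matrix

# 1) Reduction: A = H T H^T (the Hessenberg form of a symmetric
#    matrix is tridiagonal)
T, H = hessenberg(A, calc_q=True)
d, e = np.diag(T), np.diag(T, -1)

# 2) Eigendecomposition of the tridiagonal matrix: T = V Lambda V^T
lam, V = eigh_tridiagonal(d, e)

# 3) Back-transformation: X = H V
X = H @ V
print(np.linalg.norm(A @ X - X * lam))   # ~1e-12: A X = X Lambda
```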
5
Bottleneck of Classical Approaches
Reduction time is the bottleneck
(Timing plots for PDSYEVD and PDSYEVR.)

Robert C. Ward and Yihua Bai, Performance of Parallel Eigensolvers on Electronic Structure Calculations II, Technical Report UT-CS-06-572, University of Tennessee, August 2006.
6
Limitation of Classical Approaches
They compute the eigensolution to full accuracy, while lower accuracy is frequently sufficient in electronic structure calculation.

Questions: Can we trade accuracy for efficiency? How?
7
Motivation
A high performance approximate eigensolver for
electronic structure calculation
8
Schrödinger's Equation: An Intrinsic Eigenvalue Problem

$$H\Psi = E\Psi$$
9
Computation of Electronic Structure

Solve Schrödinger's equation efficiently. Different approximation methods:
- Hartree-Fock approximation
- density functional theory
- configuration interaction
- etc.

Self-Consistent Field (SCF) method:
- Solves a generalized non-linear real symmetric eigenvalue problem iteratively
- A standard linear eigenvalue problem is solved in each iteration, typically the most time-consuming part of the electronic structure calculation (see the sketch after this list)
- Low accuracy suffices in earlier iterations
- Matrices from application problems may have locality properties
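The SCF iteration can be summarized in a short schematic. This is our own sketch, not the slides' code; `build_fock` and the overlap transform `S_inv_sqrt` are hypothetical placeholders for the chemistry-specific pieces:

```python
import numpy as np

def scf_loop(build_fock, S_inv_sqrt, max_iter=50, tol=1e-6):
    """Schematic SCF loop: each pass solves a standard symmetric
    eigenproblem obtained from the generalized one via S^{-1/2}."""
    C, eps_prev = None, None
    for _ in range(max_iter):
        F = build_fock(C)                  # depends on previous eigenvectors
        Ft = S_inv_sqrt @ F @ S_inv_sqrt   # reduce to standard eigenproblem
        eps, Ct = np.linalg.eigh(Ft)       # the costly step in every pass
        C = S_inv_sqrt @ Ct                # back-transform eigenvectors
        if eps_prev is not None and np.abs(eps - eps_prev).max() < tol:
            break                          # self-consistency reached
        eps_prev = eps
    return eps, C
```

The `eigh` call is exactly where an approximate eigensolver with a loose tolerance can pay off during the early, far-from-converged iterations.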
10
Problem Definition
Given a real symmetric matrix $A$ and an accuracy tolerance $\tau$, we want to compute

$$A \approx \hat{X} \hat{\Lambda} \hat{X}^T$$

where $\hat{X}$ and $\hat{\Lambda}$ contain the approximate eigenvectors and eigenvalues, respectively, and satisfy

1) $\|A - \hat{X}\hat{\Lambda}\hat{X}^T\|_2 = O(\tau\,\|A\|_2)$

2) $\max_{1 \le i \le n} \|(\hat{X}^T\hat{X} - I)\,e_i\|_2 = O(n\,\varepsilon_{mach})$
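Both acceptance criteria are cheap to verify numerically. A minimal sketch (the function name and the safety factor standing in for the $O(\cdot)$ constants are our own choices):

```python
import numpy as np

def check_approx_eigen(A, X, lam, tau, fudge=100.0):
    """Test the two acceptance criteria for A ~ X diag(lam) X^T;
    fudge absorbs the constants hidden in the O(.) notation."""
    n = A.shape[0]
    eps = np.finfo(A.dtype).eps
    residual = np.linalg.norm(A - (X * lam) @ X.T, 2)
    # max_i ||(X^T X - I) e_i||_2 = largest column norm of X^T X - I
    ortho = np.linalg.norm(X.T @ X - np.eye(n), axis=0).max()
    return (residual <= fudge * tau * np.linalg.norm(A, 2),
            ortho <= fudge * n * eps)
```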
11
Block Algorithms for Approximate Eigensolver
1) Block-tridiagonal divide-and-conquer (BD&C) – The centerpiece
2) Block tridiagonalization (BT) – Block tridiagonalization of sparse and “effectively” sparse matrices
3) Orthogonal reduction of full matrix to block- tridiagonal form (OBR) – Orthogonal transformations to produce block-tridiagonal matrix
12
1) BD&C Algorithm*

Given a block-tridiagonal matrix

$$M = \begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

decompose $M = V\Lambda V^T$ approximately, with $\|M - V\Lambda V^T\|_2 = O(\tau\,\|M\|_2)$, where
- $V$: numerically orthogonal eigenvector matrix
- $\Lambda$: diagonal matrix of eigenvalues
- $\tau$: accuracy tolerance
- $q$: number of blocks
* W. N. Gansterer, R. C. Ward, R. P. Muller and W. A. Goddard III, Computing Approximate Eigenpairs of Symmetric Block Tridiagonal Matrices, SIAM J. Sci. Comput., 25 (2003), pp. 65 – 85.
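For experimentation it helps to assemble such a matrix explicitly. A small helper (the function name is ours, not from the BD&C code):

```python
import numpy as np

def block_tridiag(Bs, Cs):
    """Assemble M from diagonal blocks Bs = [B1..Bq] and
    sub-diagonal blocks Cs = [C1..C_{q-1}]."""
    sizes = [B.shape[0] for B in Bs]
    offs = np.cumsum([0] + sizes)
    M = np.zeros((offs[-1], offs[-1]))
    for i, B in enumerate(Bs):
        M[offs[i]:offs[i+1], offs[i]:offs[i+1]] = B
    for i, C in enumerate(Cs):          # C_i couples block i+1 to block i
        M[offs[i+1]:offs[i+2], offs[i]:offs[i+1]] = C
        M[offs[i]:offs[i+1], offs[i+1]:offs[i+2]] = C.T
    return M
```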
13
Three Steps of BD&C

1. Subdivision: $M = \tilde{M} + \sum_{i=1}^{q-1} W_i W_i^T$, with $\tilde{M} = \operatorname{diag}\{\tilde{B}_1, \tilde{B}_2, \ldots, \tilde{B}_q\}$

2. Solve sub-problems: decompose $\tilde{B}_i = Z_i D_i Z_i^T$, $i = 1, 2, \ldots, q$

3. Synthesis (the most time-consuming step):
$$M = Z \Big( D + \sum_{i=1}^{q-1} Y_i Y_i^T \Big) Z^T$$
where $Z = \operatorname{diag}(Z_1, \ldots, Z_q)$, $D = \operatorname{diag}(D_1, \ldots, D_q)$, and $Y_i = Z^T W_i$; decompose $D + \sum_i Y_i Y_i^T = V \Lambda V^T$, then multiply $Z$ by $V$.

Complexity: $k\,n^3$, where $k$ is a function of deflation, rank, and size.
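A hedged two-block NumPy sketch makes the three steps concrete. Here the off-diagonal block has exact rank $r$, so no accuracy is lost; the real algorithm applies the rank-$r$ correction as successive rank-one updates with deflation, whereas the dense `eigh` on the middle matrix below is a simplification:

```python
import numpy as np

rng = np.random.default_rng(0)
b, r = 50, 3                                   # block size, off-diagonal rank
B1 = rng.standard_normal((b, b)); B1 = (B1 + B1.T) / 2
B2 = rng.standard_normal((b, b)); B2 = (B2 + B2.T) / 2
C = rng.standard_normal((b, r)) @ rng.standard_normal((r, b))   # rank r

# 1) Subdivision: with C = U S V^T, fold the coupling into a
#    symmetric rank-r correction  M = diag(B1~, B2~) + W W^T
U, s, Vt = np.linalg.svd(C)
U, s, V = U[:, :r], s[:r], Vt[:r].T
W = np.vstack([V * np.sqrt(s), U * np.sqrt(s)])
B1t = B1 - (V * s) @ V.T
B2t = B2 - (U * s) @ U.T

# 2) Solve sub-problems: B_i~ = Z_i D_i Z_i^T
d1, Z1 = np.linalg.eigh(B1t)
d2, Z2 = np.linalg.eigh(B2t)
Z = np.block([[Z1, np.zeros((b, b))], [np.zeros((b, b)), Z2]])
D = np.concatenate([d1, d2])

# 3) Synthesis: eigendecomposition of diag(D) + Y Y^T, with Y = Z^T W
Y = Z.T @ W
lam, Vm = np.linalg.eigh(np.diag(D) + Y @ Y.T)
X = Z @ Vm                                     # eigenvectors of M

M = np.block([[B1, C.T], [C, B2]])
print(np.linalg.norm(M - (X * lam) @ X.T))     # ~1e-13
```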
14
2) Block Tridiagonalization (BT)*
- An approximation to the original full matrix
- May require eigenvectors from the previous iteration
* Y. Bai, W. N. Gansterer and R. C. Ward, Block-Tridiagonalization of “Effectively” Sparse Symmetric Matrices, ACM Trans. Math. Softw., 30 (2004), pp. 326 – 352.
[Figure: a sparse symmetric matrix $A = (a_{ij})$, with its small entries set to zero, is covered by a block-tridiagonal matrix]

$$M = \begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

Complexity: $O(n^2)$
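A crude stand-in for BT, ignoring the reordering and the adaptive choice of block sizes that the real algorithm performs, is to keep only the block-tridiagonal part of $A$ for a fixed block size:

```python
import numpy as np

def block_tridiagonalize(A, b):
    """Keep only the block-tridiagonal part of symmetric A for a fixed
    block size b -- a crude sketch; the BT algorithm chooses the blocks
    adaptively so that ||A - M|| stays within the tolerance."""
    n = A.shape[0]
    M = np.zeros_like(A)
    for i in range(0, n, b):
        lo, hi = i, min(i + b, n)
        nxt = min(hi + b, n)
        M[lo:hi, lo:nxt] = A[lo:hi, lo:nxt]   # diagonal + upper coupling
        M[hi:nxt, lo:hi] = A[hi:nxt, lo:hi]   # lower coupling (symmetry)
    return M
```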
15
3) Orthogonal Reduction to Block-Tridiagonal Matrix (OBR) *
• For a full matrix that cannot be sparsified
• A sequence of Householder transformations produces the block-tridiagonal matrix

Complexity: $\frac{4}{3} n^3$
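A dense sketch of such a reduction, using panel QR factorizations as the orthogonal transformations (our own compact formulation, not the slides' implementation):

```python
import numpy as np

def obr(A, b):
    """Reduce symmetric A to block-tridiagonal form with block size b
    by two-sided orthogonal (Householder/QR) panel transformations."""
    M = A.copy()
    n = M.shape[0]
    Q = np.eye(n)
    for k in range(0, n - 2 * b, b):
        # orthogonal factor that zeros M[k+2b:, k:k+b]
        Qk, _ = np.linalg.qr(M[k + b:, k:k + b], mode='complete')
        M[k + b:, :] = Qk.T @ M[k + b:, :]    # apply from the left
        M[:, k + b:] = M[:, k + b:] @ Qk      # ... and from the right
        Q[:, k + b:] = Q[:, k + b:] @ Qk      # accumulate A = Q M Q^T
    return M, Q

# usage: M, Q = obr(A, 8); then norm(A - Q @ M @ Q.T) ~ 1e-13 and M is
# block-tridiagonal up to rounding
```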
16
Complexity of Major Components

- BD&C: $\frac{8}{3} n^3$
- BT: $O(n^2)$
- OBR: $\frac{4}{3} n^3$

(The constants also depend on the ranks of the off-diagonal blocks.)
17
Parallel Implementations
- Parallel block divide-and-conquer (PBD&C)*
- Preprocessing:
  - Parallel block tridiagonalization (PBT)**
  - Parallel orthogonal block-tridiagonal reduction (POBR)**

* Yihua Bai and Robert C. Ward, A Parallel Symmetric Block-Tridiagonal Divide-and-Conquer Algorithm, Technical Report UT-CS-06-571, University of Tennessee, December 2005. Submitted to ACM TOMS.
** Yihua Bai and Robert C. Ward, Parallel Block Tridiagonalization of Real Symmetric Matrices, Technical Report UT-CS-06-578, University of Tennessee, June 2006. Submitted to ACM TOMS.
18
Implementations of PBD&C
Mixed data/task parallel implementation
versus complete data parallel implementation
19
Mixed Parallel Implementation
Data distribution and redistribution
Merging sequence and workload balance
Mixed parallelism – data/task
Deflation
20
Matrix Distribution – Mixed Data/Task Parallelism
Divide processors into groups of sub-grids
Assign each sub-grid to a sub-problem
$$\begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

Block-tridiagonal matrix with $q$ diagonal blocks
21
Matrix Distribution – Example
Each diagonal block is assigned to a sub-grid
2D block-cyclic distribution on each sub-grid
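The 2D block-cyclic map itself is simple to state. A one-liner for the owner of a global entry, following the ScaLAPACK convention (zero-based coordinates):

```python
def owner(i, j, nb, prow, pcol):
    """Process coordinates owning global entry (i, j) under a 2D
    block-cyclic distribution with block size nb on a prow x pcol grid."""
    return (i // nb) % prow, (j // nb) % pcol

# e.g. owner(100, 37, nb=20, prow=2, pcol=3) -> (1, 1)
```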
22
Data Redistribution
Distribute from a 2×2 grid to a 3×3 grid
Redistribute data from one sub-grid to another (subdivision step)
23
Distribute from a 2×2 grid and a 2×4 grid to a 3×4 grid
Redistribute data for each merging operation from two sub-grids to one super-grid (synthesis step)
Data Redistribution (cont’d)
24
Merging Sequence
[Figure: binary merging tree with levels 0–4, subtree heights h_left and h_right, and idle time before the final merging operation]

The final merging operation accounts for up to 75% of the total computational cost, so the merging sequence must consider low computational complexity and workload balance at the same time for the final merge.
25
Problems
- Sub-grid construction. Example: if sub-grid 1 is 2×2 and sub-grid 2 is 5×5, should the super-grid be 1×29?
- Many communicator handles: up to 2k handles may be in use, where k = max(number of diagonal blocks, number of total processors)
- Portability across MPI implementations. Example: minor code modifications are needed when using mpimx (Myrinet MPI)
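The communicator bookkeeping comes from splitting the world communicator once per sub-grid and again per merge. A minimal mpi4py sketch, with a hypothetical two-sub-grid layout:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# hypothetical layout: ranks 0-3 form a 2x2 sub-grid, the rest a second one
color = 0 if rank < 4 else 1
subcomm = comm.Split(color=color, key=rank)

# every merging operation additionally needs a super-grid communicator
# spanning the two merged sub-grids, which is where the large handle
# count on the slide comes from
```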
26
Complete Data Parallel Implementation
$$\begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

Block-tridiagonal matrix with $q$ diagonal blocks
Assign all processors to each block in block-tridiagonal matrix
Assume a 2×2 processor grid, assigned in turn to $B_1, B_2, \ldots, B_q$ and $C_1, C_2, \ldots, C_{q-1}$.
27
Advantages and Disadvantages

Advantages:
- One communicator
- One processor grid
- Portability to different MPI platforms

Disadvantages:
- Not all processors are involved in some steps: the SVD of the off-diagonal blocks, the decomposition of the diagonal blocks, and the merging of smaller sub-problems
- Data redistribution is still needed for each merging operation
28
Numerical Results
- Mixed data/task parallel BD&C subroutine PDSBTDC vs. ScaLAPACK PDSYEVD:
  - matrices with different eigenvalue distributions and different sizes
  - a banded application matrix
- Complete data parallel BD&C subroutine PDSBTDCD vs. mixed data/task parallel BD&C subroutine PDSBTDC
29
Machine Specifications: IBM p690 system at ORNL
30
PDSBTDC vs. PDSYEVD on Matrices with Different Eigenvalue Distributions
(Timing plots: arithmetically distributed eigenvalues; geometrically distributed eigenvalues.)

$\tau = 10^{-6}$, block size $b = 20$
31
Accuracy of PDSBTDC
Residual: $R = \max_{1 \le i \le n} \dfrac{\|A\hat{x}_i - \hat{\lambda}_i \hat{x}_i\|_2}{\|A\|_2}$

Departure from orthogonality: $\max_{1 \le i \le n} \|(\hat{X}^T \hat{X} - I)\,e_i\|_2 = O(n\,\varepsilon_{mach})$

($\tau = 10^{-6}$)
32
PDSBTDC on Application Matrix
PDSBTDC with different tolerances
Polyalanine matrix, n = 5027, b = 79
33
Performance Test on UT SInRG AMD Opteron Processor 240 Cluster
Number of nodes: 64
Memory per node: 2 GB
Processors per node: 2
CPU frequency: 1.4 GHz
L2 cache: 1 MB
TLB size: 1024 4K pages
Interconnect: Myrinet 2000

Similar performance; it scales a little better.
34
PDSBTDC vs. PDSBTDCD Performance
Block-tridiagonal matrix with arithmetically distributed eigenvalues; matrix size = 12000, block size = 20, tolerance = $10^{-6}$. The data parallel implementation scales poorly in the SVD of the off-diagonal blocks and in solving the sub-problems.
35
Application in Electronic Structure Calculation
Trans-Polyacetylene
- Simple chemical structure
- Semiconducting conjugated polymer
- Used in light-emitting devices; flexible
- Fast nonlinear optical response
- Strong nonlinear susceptibility
36
Matrix Generated from trans-PA
Yihua Bai, Robert C. Ward, and Guoping Zhang, Parallel Divide-and-Conquer Algorithm for Computing Full Spectrum of Polyacetylene, Poster at the Division of Atomic, Molecular and Optical Physics (DAMOP) 2006 meeting, Knoxville, Tennessee.
37
Two Steps to Compute the Approximate Eigensolution

1. Construct a block-tridiagonal matrix from the original dense matrix $H$: $M = H + E$, where $M$ is block-tridiagonal. Algorithm: PBT.
2. Compute the eigensolution to reduced accuracy (user-defined, typically $\tau = 10^{-6}$). Algorithm: PBD&C.
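An end-to-end serial sketch of the two steps on a synthetic stand-in for $H$ (here `numpy.linalg.eigh` plays the role of PBD&C, which would solve $M$ only to the tolerance $\tau$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, bsize = 1000, 20
H = rng.standard_normal((n, n)) * 1e-8        # weak long-range coupling
H = (H + H.T) / 2 + np.diag(rng.standard_normal(n))

# Step 1 (PBT stand-in): M = H + E with M block-tridiagonal
M = np.zeros_like(H)
for i in range(0, n, bsize):
    lo, hi = i, min(i + bsize, n)
    nxt = min(hi + bsize, n)
    M[lo:hi, lo:nxt] = H[lo:hi, lo:nxt]
    M[hi:nxt, lo:hi] = H[hi:nxt, lo:hi]

# Step 2 (PBD&C stand-in): eigensolve the block-tridiagonal M
lam, X = np.linalg.eigh(M)
print(np.linalg.norm(H - M, 2) / np.linalg.norm(H, 2))  # ~1e-8, i.e. ~tau
```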
38
Trans-(CH)$_{16000}$: $n = 16000$, $\tau = 10^{-6}$.
With the lower accuracy (i.e., $\tau = 10^{-6}$), the savings in execution time are an order of magnitude.
Compare Execution Time with ScaLAPACK PDSYEVD
39
Relative Execution Time with Fixed n2/p
With a fixed per-processor problem size $n^2/p$, the relative execution time of an $O(n^3)$ algorithm should satisfy $T(2n,\,2^2 p) = 2\,T(n,\,p)$, as the reference line shows. The curve for our new parallel algorithm indicates a computational complexity between $O(n^2)$ and $O(n^3)$.
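The reference line follows from a one-line flop count, assuming perfect parallel efficiency:

```latex
% doubling n while quadrupling p keeps n^2/p fixed and doubles the time
\[
\frac{T(2n,\,4p)}{T(n,\,p)}
  = \frac{(2n)^3 / (4p)}{n^3 / p}
  = \frac{8}{4} = 2 .
\]
```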
40
Conclusion and Future Work
41
Conclusion
- PBD&C is very efficient on block-tridiagonal matrices with low ranks for the off-diagonal blocks and a high ratio of deflation
- Comparison of PDSBTDC and PDSBTDCD:
  - PDSBTDCD performs better when a smaller number of processors is in use
  - PDSBTDC scales better as the number of processors in use increases
- PBD&C combined with PBT is efficient on application matrices with the appropriate locality property
42
Future Work
A Parallel Adaptive Eigensolver
Alternative method for computation of eigenvectors
Approximation in sparse eigensolver
Incorporate PBD&C and PBT into SCF for trans-PA
Fine tuning of PDSBTDCD
43
End of Presentation
Thank you!
44
Acknowledgement
Dr. R. P. Muller, Sandia National Laboratories
Dr. G. Zhang, Indiana State University
45
Task Flowchart

Major efficiency improvements from:
- Reduced accuracy in early iterations of SCF
- Reducing the reduction bottleneck

Note: eigenvectors may be required if further efforts are made to improve efficiency.
46
Complexity of Major Components

Sequential:
- BD&C: $\frac{8}{3} n^3$
- BT: $O(n^2)$
- OBR: $\frac{4}{3} n^3$

Parallel (leading flop terms; lower-order computation and communication terms omitted):
- BD&C: $\frac{8}{3}\,\frac{n^3}{p} + \cdots$
- BT: $O\!\left(\frac{n^2}{p}\right) + \cdots$
- OBR: $\frac{4}{3}\,\frac{n^3}{p} + \cdots$

Parameters: $p$ processors; message-passing latency; time to transfer one floating-point number; time for one floating-point operation; $n_b$, the block size for the parallel 2D matrix distribution.