1
High-Performance Eigensolver for Real Symmetric Matrices:
Parallel Implementations and
Applications in Electronic Structure Calculation
Yihua Bai
Department of Mathematics and Computer Science
Indiana State University
2
Contents
Current status of real symmetric eigensolvers
Motivation
BD&C algorithm – a high-performance approximate eigensolver
Parallel implementations of the BD&C algorithm
Applications in electronic structure calculation and numerical results
Summary and future work
3
Current Status of Dense Symmetric Eigensolvers
ScaLAPACK routines: PDSYEVD (divide and conquer), PDSYEVX (bisection and inverse iteration), PDSYEVR (MRRR)
4
Classical Three Steps to Decompose $A = X\Lambda X^T$

1. Reduction to symmetric tridiagonal form: $A = H T H^T$
2. Eigendecomposition of the tridiagonal matrix: $T = V\Lambda V^T$, via one of:
   - Cuppen's divide-and-conquer
   - Bisection and inverse iteration
   - Multiple Relatively Robust Representations (MRRR)
3. Back-transformation of the eigenvectors: $X = HV$
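These three steps can be sketched in a few lines of Python. This is a minimal serial illustration using SciPy stand-ins for the underlying LAPACK kernels, not the parallel ScaLAPACK routines the slides benchmark:

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
A = (A + A.T) / 2                        # real symmetric test matrix

# 1) Reduction: A = H T H^T (the Hessenberg form of a symmetric
#    matrix is tridiagonal)
T, H = hessenberg(A, calc_q=True)
d, e = np.diag(T), np.diag(T, -1)

# 2) Eigendecomposition of the tridiagonal matrix: T = V Lambda V^T
lam, V = eigh_tridiagonal(d, e)

# 3) Back-transformation: X = H V
X = H @ V
print(np.linalg.norm(A @ X - X * lam))   # ~1e-12: A X = X Lambda
```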
5
Bottleneck of Classical Approaches
Reduction time is the bottleneck
(Timing plots for PDSYEVD and PDSYEVR.)

Robert C. Ward and Yihua Bai, Performance of Parallel Eigensolvers on Electronic Structure Calculations II, Technical Report UT-CS-06-572, University of Tennessee, August 2006.
6
Limitation of Classical Approaches
They compute the eigensolution to full accuracy, while lower accuracy is frequently sufficient in electronic structure calculation.

Questions: Can we trade accuracy for efficiency? How?
7
Motivation
A high performance approximate eigensolver for
electronic structure calculation
8
Schrödinger's Equation: An Intrinsic Eigenvalue Problem

$$H\Psi = E\Psi$$
9
Computation of Electronic Structure

Solve Schrödinger's equation efficiently. Different approximation methods:
- Hartree-Fock approximation
- density functional theory
- configuration interaction
- etc.

Self-Consistent Field (SCF) method:
- Solves a generalized non-linear real symmetric eigenvalue problem iteratively
- A standard linear eigenvalue problem is solved in each iteration, typically the most time-consuming part of the electronic structure calculation (see the sketch after this list)
- Low accuracy suffices in earlier iterations
- Matrices from application problems may have locality properties
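The SCF iteration can be summarized in a short schematic. This is our own sketch, not the slides' code; `build_fock` and the overlap transform `S_inv_sqrt` are hypothetical placeholders for the chemistry-specific pieces:

```python
import numpy as np

def scf_loop(build_fock, S_inv_sqrt, max_iter=50, tol=1e-6):
    """Schematic SCF loop: each pass solves a standard symmetric
    eigenproblem obtained from the generalized one via S^{-1/2}."""
    C, eps_prev = None, None
    for _ in range(max_iter):
        F = build_fock(C)                  # depends on previous eigenvectors
        Ft = S_inv_sqrt @ F @ S_inv_sqrt   # reduce to standard eigenproblem
        eps, Ct = np.linalg.eigh(Ft)       # the costly step in every pass
        C = S_inv_sqrt @ Ct                # back-transform eigenvectors
        if eps_prev is not None and np.abs(eps - eps_prev).max() < tol:
            break                          # self-consistency reached
        eps_prev = eps
    return eps, C
```

The `eigh` call is exactly where an approximate eigensolver with a loose tolerance can pay off during the early, far-from-converged iterations.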
10
Problem Definition
Given a real symmetric matrix $A$ and an accuracy tolerance $\tau$, we want to compute

$$A \approx \hat{X} \hat{\Lambda} \hat{X}^T$$

where $\hat{X}$ and $\hat{\Lambda}$ contain the approximate eigenvectors and eigenvalues, respectively, and satisfy

1) $\|A - \hat{X}\hat{\Lambda}\hat{X}^T\|_2 = O(\tau\,\|A\|_2)$

2) $\max_{1 \le i \le n} \|(\hat{X}^T\hat{X} - I)\,e_i\|_2 = O(n\,\varepsilon_{mach})$
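Both acceptance criteria are cheap to verify numerically. A minimal sketch (the function name and the safety factor standing in for the $O(\cdot)$ constants are our own choices):

```python
import numpy as np

def check_approx_eigen(A, X, lam, tau, fudge=100.0):
    """Test the two acceptance criteria for A ~ X diag(lam) X^T;
    fudge absorbs the constants hidden in the O(.) notation."""
    n = A.shape[0]
    eps = np.finfo(A.dtype).eps
    residual = np.linalg.norm(A - (X * lam) @ X.T, 2)
    # max_i ||(X^T X - I) e_i||_2 = largest column norm of X^T X - I
    ortho = np.linalg.norm(X.T @ X - np.eye(n), axis=0).max()
    return (residual <= fudge * tau * np.linalg.norm(A, 2),
            ortho <= fudge * n * eps)
```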
11
Block Algorithms for Approximate Eigensolver
1) Block-tridiagonal divide-and-conquer (BD&C) – The centerpiece
2) Block tridiagonalization (BT) – Block tridiagonalization of sparse and “effectively” sparse matrices
3) Orthogonal reduction of full matrix to block- tridiagonal form (OBR) – Orthogonal transformations to produce block-tridiagonal matrix
12
1) BD&C Algorithm*

Given a block-tridiagonal matrix

$$M = \begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

decompose $M = V\Lambda V^T$ approximately, with $\|M - V\Lambda V^T\|_2 = O(\tau\,\|M\|_2)$, where
- $V$: numerically orthogonal eigenvector matrix
- $\Lambda$: diagonal matrix of eigenvalues
- $\tau$: accuracy tolerance
- $q$: number of blocks
* W. N. Gansterer, R. C. Ward, R. P. Muller and W. A. Goddard III, Computing Approximate Eigenpairs of Symmetric Block Tridiagonal Matrices, SIAM J. Sci. Comput., 25 (2003), pp. 65 – 85.
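For experimentation it helps to assemble such a matrix explicitly. A small helper (the function name is ours, not from the BD&C code):

```python
import numpy as np

def block_tridiag(Bs, Cs):
    """Assemble M from diagonal blocks Bs = [B1..Bq] and
    sub-diagonal blocks Cs = [C1..C_{q-1}]."""
    sizes = [B.shape[0] for B in Bs]
    offs = np.cumsum([0] + sizes)
    M = np.zeros((offs[-1], offs[-1]))
    for i, B in enumerate(Bs):
        M[offs[i]:offs[i+1], offs[i]:offs[i+1]] = B
    for i, C in enumerate(Cs):          # C_i couples block i+1 to block i
        M[offs[i+1]:offs[i+2], offs[i]:offs[i+1]] = C
        M[offs[i]:offs[i+1], offs[i+1]:offs[i+2]] = C.T
    return M
```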
13
Three Steps of BD&C

1. Subdivision: $M = \tilde{M} + \sum_{i=1}^{q-1} W_i W_i^T$, with $\tilde{M} = \operatorname{diag}\{\tilde{B}_1, \tilde{B}_2, \ldots, \tilde{B}_q\}$

2. Solve sub-problems: decompose $\tilde{B}_i = Z_i D_i Z_i^T$, $i = 1, 2, \ldots, q$

3. Synthesis (the most time-consuming step):
$$M = Z \Big( D + \sum_{i=1}^{q-1} Y_i Y_i^T \Big) Z^T$$
where $Z = \operatorname{diag}(Z_1, \ldots, Z_q)$, $D = \operatorname{diag}(D_1, \ldots, D_q)$, and $Y_i = Z^T W_i$; decompose $D + \sum_i Y_i Y_i^T = V \Lambda V^T$, then multiply $Z$ by $V$.

Complexity: $k\,n^3$, where $k$ is a function of deflation, rank, and size.
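A hedged two-block NumPy sketch makes the three steps concrete. Here the off-diagonal block has exact rank $r$, so no accuracy is lost; the real algorithm applies the rank-$r$ correction as successive rank-one updates with deflation, whereas the dense `eigh` on the middle matrix below is a simplification:

```python
import numpy as np

rng = np.random.default_rng(0)
b, r = 50, 3                                   # block size, off-diagonal rank
B1 = rng.standard_normal((b, b)); B1 = (B1 + B1.T) / 2
B2 = rng.standard_normal((b, b)); B2 = (B2 + B2.T) / 2
C = rng.standard_normal((b, r)) @ rng.standard_normal((r, b))   # rank r

# 1) Subdivision: with C = U S V^T, fold the coupling into a
#    symmetric rank-r correction  M = diag(B1~, B2~) + W W^T
U, s, Vt = np.linalg.svd(C)
U, s, V = U[:, :r], s[:r], Vt[:r].T
W = np.vstack([V * np.sqrt(s), U * np.sqrt(s)])
B1t = B1 - (V * s) @ V.T
B2t = B2 - (U * s) @ U.T

# 2) Solve sub-problems: B_i~ = Z_i D_i Z_i^T
d1, Z1 = np.linalg.eigh(B1t)
d2, Z2 = np.linalg.eigh(B2t)
Z = np.block([[Z1, np.zeros((b, b))], [np.zeros((b, b)), Z2]])
D = np.concatenate([d1, d2])

# 3) Synthesis: eigendecomposition of diag(D) + Y Y^T, with Y = Z^T W
Y = Z.T @ W
lam, Vm = np.linalg.eigh(np.diag(D) + Y @ Y.T)
X = Z @ Vm                                     # eigenvectors of M

M = np.block([[B1, C.T], [C, B2]])
print(np.linalg.norm(M - (X * lam) @ X.T))     # ~1e-13
```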
14
2) Block Tridiagonalization (BT)*
- An approximation to the original full matrix
- May require eigenvectors from the previous iteration
* Y. Bai, W. N. Gansterer and R. C. Ward, Block-Tridiagonalization of “Effectively” Sparse Symmetric Matrices, ACM Trans. Math. Softw., 30 (2004), pp. 326 – 352.
[Figure: a sparse symmetric matrix $A = (a_{ij})$, with its small entries set to zero, is covered by a block-tridiagonal matrix]

$$M = \begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

Complexity: $O(n^2)$
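A crude stand-in for BT, ignoring the reordering and the adaptive choice of block sizes that the real algorithm performs, is to keep only the block-tridiagonal part of $A$ for a fixed block size:

```python
import numpy as np

def block_tridiagonalize(A, b):
    """Keep only the block-tridiagonal part of symmetric A for a fixed
    block size b -- a crude sketch; the BT algorithm chooses the blocks
    adaptively so that ||A - M|| stays within the tolerance."""
    n = A.shape[0]
    M = np.zeros_like(A)
    for i in range(0, n, b):
        lo, hi = i, min(i + b, n)
        nxt = min(hi + b, n)
        M[lo:hi, lo:nxt] = A[lo:hi, lo:nxt]   # diagonal + upper coupling
        M[hi:nxt, lo:hi] = A[hi:nxt, lo:hi]   # lower coupling (symmetry)
    return M
```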
15
3) Orthogonal Reduction to Block-Tridiagonal Matrix (OBR) *
• For a full matrix that cannot be sparsified
• A sequence of Householder transformations produces the block-tridiagonal matrix

Complexity: $\frac{4}{3} n^3$
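A dense sketch of such a reduction, using panel QR factorizations as the orthogonal transformations (our own compact formulation, not the slides' implementation):

```python
import numpy as np

def obr(A, b):
    """Reduce symmetric A to block-tridiagonal form with block size b
    by two-sided orthogonal (Householder/QR) panel transformations."""
    M = A.copy()
    n = M.shape[0]
    Q = np.eye(n)
    for k in range(0, n - 2 * b, b):
        # orthogonal factor that zeros M[k+2b:, k:k+b]
        Qk, _ = np.linalg.qr(M[k + b:, k:k + b], mode='complete')
        M[k + b:, :] = Qk.T @ M[k + b:, :]    # apply from the left
        M[:, k + b:] = M[:, k + b:] @ Qk      # ... and from the right
        Q[:, k + b:] = Q[:, k + b:] @ Qk      # accumulate A = Q M Q^T
    return M, Q

# usage: M, Q = obr(A, 8); then norm(A - Q @ M @ Q.T) ~ 1e-13 and M is
# block-tridiagonal up to rounding
```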
16
Complexity of Major Components

- BD&C: $\frac{8}{3} n^3$
- BT: $O(n^2)$
- OBR: $\frac{4}{3} n^3$

(The constants also depend on the ranks of the off-diagonal blocks.)
17
Parallel Implementations
- Parallel block divide-and-conquer (PBD&C)*
- Preprocessing:
  - Parallel block tridiagonalization (PBT)**
  - Parallel orthogonal block-tridiagonal reduction (POBR)**

* Yihua Bai and Robert C. Ward, A Parallel Symmetric Block-Tridiagonal Divide-and-Conquer Algorithm, Technical Report UT-CS-06-571, University of Tennessee, December 2005. Submitted to ACM TOMS.
** Yihua Bai and Robert C. Ward, Parallel Block Tridiagonalization of Real Symmetric Matrices, Technical Report UT-CS-06-578, University of Tennessee, June 2006. Submitted to ACM TOMS.
18
Implementations of PBD&C
Mixed data/task parallel implementation
versus complete data parallel implementation
19
Mixed Parallel Implementation
Data distribution and redistribution
Merging sequence and workload balance
Mixed parallelism – data/task
Deflation
20
Matrix Distribution – Mixed Data/Task Parallelism
Divide processors into groups of sub-grids
Assign each sub-grid to a sub-problem
$$\begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

Block-tridiagonal matrix with $q$ diagonal blocks
21
Matrix Distribution – Example
Each diagonal block is assigned to a sub-grid
2D block-cyclic distribution on each sub-grid
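The 2D block-cyclic map itself is simple to state. A one-liner for the owner of a global entry, following the ScaLAPACK convention (zero-based coordinates):

```python
def owner(i, j, nb, prow, pcol):
    """Process coordinates owning global entry (i, j) under a 2D
    block-cyclic distribution with block size nb on a prow x pcol grid."""
    return (i // nb) % prow, (j // nb) % pcol

# e.g. owner(100, 37, nb=20, prow=2, pcol=3) -> (1, 1)
```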
22
Data Redistribution
Distribute from a 2×2 grid to a 3×3 grid
Redistribute data from one sub-grid to another (subdivision step)
23
Distribute from a 2×2 grid and a 2×4 grid to a 3×4 grid
Redistribute data for each merging operation from two sub-grids to one super-grid (synthesis step)
Data Redistribution (cont’d)
24
Merging Sequence
[Figure: binary merging tree with levels 0–4, subtree heights h_left and h_right, and idle time before the final merging operation]

The final merging operation accounts for up to 75% of the total computational cost, so the merging sequence must consider low computational complexity and workload balance at the same time for the final merge.
25
Problems
- Sub-grid construction. Example: if sub-grid 1 is 2×2 and sub-grid 2 is 5×5, should the super-grid be 1×29?
- Many communicator handles: up to 2k handles may be in use, where k = max(number of diagonal blocks, number of total processors)
- Portability across MPI implementations. Example: minor code modifications are needed when using mpimx (Myrinet MPI)
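The communicator bookkeeping comes from splitting the world communicator once per sub-grid and again per merge. A minimal mpi4py sketch, with a hypothetical two-sub-grid layout:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# hypothetical layout: ranks 0-3 form a 2x2 sub-grid, the rest a second one
color = 0 if rank < 4 else 1
subcomm = comm.Split(color=color, key=rank)

# every merging operation additionally needs a super-grid communicator
# spanning the two merged sub-grids, which is where the large handle
# count on the slide comes from
```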
26
Complete Data Parallel Implementation
$$\begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

Block-tridiagonal matrix with $q$ diagonal blocks
Assign all processors to each block in block-tridiagonal matrix
Assume a 2×2 processor grid, assigned in turn to $B_1, B_2, \ldots, B_q$ and $C_1, C_2, \ldots, C_{q-1}$.
27
Advantages and Disadvantages

Advantages:
- One communicator
- One processor grid
- Portability to different MPI platforms

Disadvantages:
- Not all processors are involved in some steps: the SVD of the off-diagonal blocks, the decomposition of the diagonal blocks, and the merging of smaller sub-problems
- Data redistribution is still needed for each merging operation
28
Numerical Results
- Mixed data/task parallel BD&C subroutine PDSBTDC vs. ScaLAPACK PDSYEVD:
  - matrices with different eigenvalue distributions and different sizes
  - a banded application matrix
- Complete data parallel BD&C subroutine PDSBTDCD vs. mixed data/task parallel BD&C subroutine PDSBTDC
29
Machine Specifications: IBM p690 system at ORNL
30
PDSBTDC vs. PDSYEVD on Matrices with Different Eigenvalue Distributions
(Timing plots: arithmetically distributed eigenvalues; geometrically distributed eigenvalues.)

$\tau = 10^{-6}$, block size $b = 20$
31
Accuracy of PDSBTDC
Residual: $R = \max_{1 \le i \le n} \dfrac{\|A\hat{x}_i - \hat{\lambda}_i \hat{x}_i\|_2}{\|A\|_2}$

Departure from orthogonality: $\max_{1 \le i \le n} \|(\hat{X}^T \hat{X} - I)\,e_i\|_2 = O(n\,\varepsilon_{mach})$

($\tau = 10^{-6}$)
32
PDSBTDC on Application Matrix
PDSBTDC with different tolerances
Polyalanine matrix, n = 5027, b = 79
33
Performance Test on UT SInRG AMD Opteron Processor 240 Cluster
Number of nodes: 64
Memory per node: 2 GB
Processors per node: 2
CPU frequency: 1.4 GHz
L2 cache: 1 MB
TLB size: 1024 4K pages
Interconnect: Myrinet 2000

Similar performance; it scales a little better.
34
PDSBTDC vs. PDSBTDCD Performance
Block-tridiagonal matrix with arithmetically distributed eigenvalues; matrix size = 12000, block size = 20, tolerance = $10^{-6}$. The data parallel implementation scales poorly in the SVD of the off-diagonal blocks and in solving the sub-problems.
35
Application in Electronic Structure Calculation
Trans-Polyacetylene
- Simple chemical structure
- Semiconducting conjugated polymer
- Used in light-emitting devices; flexible
- Fast nonlinear optical response
- Strong nonlinear susceptibility
36
Matrix Generated from trans-PA
Yihua Bai, Robert C. Ward, and Guoping Zhang, Parallel Divide-and-Conquer Algorithm for Computing Full Spectrum of Polyacetylene, Poster at the Division of Atomic, Molecular and Optical Physics (DAMOP) 2006 meeting, Knoxville, Tennessee.
37
Two Steps to Compute the Approximate Eigensolution

1. Construct a block-tridiagonal matrix from the original dense matrix $H$: $M = H + E$, where $M$ is block-tridiagonal. Algorithm: PBT.
2. Compute the eigensolution to reduced accuracy (user-defined, typically $\tau = 10^{-6}$). Algorithm: PBD&C.
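An end-to-end serial sketch of the two steps on a synthetic stand-in for $H$ (here `numpy.linalg.eigh` plays the role of PBD&C, which would solve $M$ only to the tolerance $\tau$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, bsize = 1000, 20
H = rng.standard_normal((n, n)) * 1e-8        # weak long-range coupling
H = (H + H.T) / 2 + np.diag(rng.standard_normal(n))

# Step 1 (PBT stand-in): M = H + E with M block-tridiagonal
M = np.zeros_like(H)
for i in range(0, n, bsize):
    lo, hi = i, min(i + bsize, n)
    nxt = min(hi + bsize, n)
    M[lo:hi, lo:nxt] = H[lo:hi, lo:nxt]
    M[hi:nxt, lo:hi] = H[hi:nxt, lo:hi]

# Step 2 (PBD&C stand-in): eigensolve the block-tridiagonal M
lam, X = np.linalg.eigh(M)
print(np.linalg.norm(H - M, 2) / np.linalg.norm(H, 2))  # ~1e-8, i.e. ~tau
```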
38
Trans-(CH)$_{16000}$: $n = 16000$, $\tau = 10^{-6}$.
With the lower accuracy (i.e., $\tau = 10^{-6}$), the savings in execution time are an order of magnitude.
Compare Execution Time with ScaLAPACK PDSYEVD
39
Relative Execution Time with Fixed n2/p
With a fixed per-processor problem size $n^2/p$, the relative execution time of an $O(n^3)$ algorithm should satisfy $T(2n,\,2^2 p) = 2\,T(n,\,p)$, as the reference line shows. The curve for our new parallel algorithm indicates a computational complexity between $O(n^2)$ and $O(n^3)$.
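The reference line follows from a one-line flop count, assuming perfect parallel efficiency:

```latex
% doubling n while quadrupling p keeps n^2/p fixed and doubles the time
\[
\frac{T(2n,\,4p)}{T(n,\,p)}
  = \frac{(2n)^3 / (4p)}{n^3 / p}
  = \frac{8}{4} = 2 .
\]
```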
40
Conclusion and Future Work
41
Conclusion
- PBD&C is very efficient on block-tridiagonal matrices with low ranks for the off-diagonal blocks and a high ratio of deflation
- Comparison of PDSBTDC and PDSBTDCD:
  - PDSBTDCD performs better when a smaller number of processors is in use
  - PDSBTDC scales better as the number of processors in use increases
- PBD&C combined with PBT is efficient on application matrices with the appropriate locality property
42
Future Work
A Parallel Adaptive Eigensolver
Alternative method for computation of eigenvectors
Approximation in sparse eigensolver
Incorporate PBD&C and PBT into SCF for trans-PA
Fine tuning of PDSBTDCD
43
End of Presentation
Thank you!
44
Acknowledgement
Dr. R. P. Muller, Sandia National Laboratories
Dr. G. Zhang, Indiana State University
45
Task Flowchart

Major efficiency improvements from:
- Reduced accuracy in early iterations of SCF
- Reducing the reduction bottleneck

Note: eigenvectors may be required if further efforts are made to improve efficiency.
46
Complexity of Major Components

Sequential:
- BD&C: $\frac{8}{3} n^3$
- BT: $O(n^2)$
- OBR: $\frac{4}{3} n^3$

Parallel (leading flop terms; lower-order computation and communication terms omitted):
- BD&C: $\frac{8}{3}\,\frac{n^3}{p} + \cdots$
- BT: $O\!\left(\frac{n^2}{p}\right) + \cdots$
- OBR: $\frac{4}{3}\,\frac{n^3}{p} + \cdots$

Parameters: $p$ processors; message-passing latency; time to transfer one floating-point number; time for one floating-point operation; $n_b$, the block size for the parallel 2D matrix distribution.