High-Performance Eigensolver for Real Symmetric Matrices: Parallel Implementations and Applications in Electronic Structure Calculation

Yihua Bai
Department of Mathematics and Computer Science
Indiana State University



Contents

• Current status of real symmetric eigensolvers
• Motivation
• BD&C algorithm: a high performance approximate eigensolver
• Parallel implementations of BD&C algorithm
• Applications in electronic structure calculation and numerical results
• Summary and Future Work


Current Status of Dense Symmetric Eigensolvers

PDSYEVD PDSYEVX PDSYEVR


Classical Three Steps to Decompose $A = X \Lambda X^T$

1. Reduction to symmetric tridiagonal form: $A = H T H^T$

2. Eigen-decomposition of the tridiagonal matrix: $T = V \Lambda V^T$
   • Cuppen's divide-and-conquer
   • Bisection and inverse iteration
   • Multiple Relatively Robust Representations (MRRR)

3. Back-transformation of the eigenvectors: $X = HV$
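The three classical steps can be sketched in a few lines of NumPy. This is a minimal serial illustration, not the ScaLAPACK implementation: the function names are hypothetical, the reduction is a plain unblocked Householder loop, and `np.linalg.eigh` stands in for a dedicated tridiagonal solver such as divide-and-conquer or MRRR.

```python
import numpy as np

def householder_tridiagonalize(A):
    """Step 1: reduce symmetric A to tridiagonal T with A = H T H^T."""
    T = np.array(A, dtype=float)
    n = T.shape[0]
    H = np.eye(n)
    for k in range(n - 2):
        x = T[k + 1:, k]
        alpha = -np.copysign(np.linalg.norm(x), x[0])
        if alpha == 0.0:
            continue  # nothing to eliminate in this column
        v = x.copy()
        v[0] -= alpha
        v /= np.linalg.norm(v)
        # Apply the reflector P = I - 2 v v^T from both sides and accumulate H.
        T[k + 1:, :] -= 2.0 * np.outer(v, v @ T[k + 1:, :])
        T[:, k + 1:] -= 2.0 * np.outer(T[:, k + 1:] @ v, v)
        H[:, k + 1:] -= 2.0 * np.outer(H[:, k + 1:] @ v, v)
    return H, T

def classical_eigensolver(A):
    """Steps 1-3: tridiagonalize, solve the tridiagonal problem, back-transform."""
    H, T = householder_tridiagonalize(A)
    # Keep only the three central diagonals (drops roundoff elsewhere).
    T = np.triu(np.tril(T, 1), -1)
    T = 0.5 * (T + T.T)
    # Step 2: stand-in for a dedicated tridiagonal eigensolver.
    lam, V = np.linalg.eigh(T)
    # Step 3: back-transformation X = H V.
    return lam, H @ V
```

In this sketch the Householder loop of step 1 is the part that corresponds to the reduction stage of the classical approach.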


Bottleneck of Classical Approaches

Reduction time is the bottleneck

[Figure panels: PDSYEVD, PDSYEVR]

Robert C. Ward and Yihua Bai, Performance of Parallel Eigensolvers on Electronic Structure Calculations II, Technical Report UT-CS-06-572, University of Tennessee, August 2006.


Limitation of Classical Approaches

Classical approaches compute the eigen-solution to full accuracy, while lower accuracy is frequently sufficient in electronic structure calculation.

Questions: Trade accuracy for efficiency? How?


Motivation

A high performance approximate eigensolver for electronic structure calculation


Schrödinger's Equation: An Intrinsic Eigenvalue Problem

$$H \Psi = E \Psi$$


Computation of Electronic Structure

• Solve Schrödinger's Equation efficiently
• Different approximation methods
   • Hartree-Fock approximation
   • density functional theory
   • configuration interaction
   • …, etc.
• Self-Consistent Field method
   • Solve generalized non-linear real symmetric eigenvalue problem iteratively
   • A standard linear eigenvalue problem solved in each iteration. Typically the most time consuming part of electronic structure calculation
   • Low accuracy suffices in earlier iterations
   • Matrices from application problems may have locality properties
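The structure of a self-consistent field loop, with one linear symmetric eigenproblem per iteration, can be illustrated with a toy model. Everything specific here is an assumption for illustration only: the function name, the overlap matrix taken as the identity (so the generalized problem becomes a standard one), and the made-up density dependence of the matrix. It is not a Hartree-Fock or DFT implementation.

```python
import numpy as np

def toy_scf(h0, n_occ, coupling=0.2, tol=1e-8, max_iter=100):
    """Toy SCF loop: the matrix depends on the density built from its own eigenvectors."""
    n = h0.shape[0]
    density = np.zeros((n, n))
    for iteration in range(1, max_iter + 1):
        # Hypothetical non-linearity: the diagonal is shifted by the current density.
        fock = h0 + coupling * np.diag(np.diag(density))
        # One standard linear symmetric eigenproblem per SCF iteration.
        # An approximate solver with a loose tolerance could be used here
        # in the early iterations.
        _, vecs = np.linalg.eigh(fock)
        occupied = vecs[:, :n_occ]
        new_density = occupied @ occupied.T
        change = np.linalg.norm(new_density - density)
        density = new_density
        if change < tol:
            break
    return density, iteration
```

The eigensolver call inside the loop is the step that dominates the cost, which is why a cheaper approximate solve in early iterations is attractive.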


Problem Definition

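Given a real symmetric matrix $A$ and accuracy tolerance $\tau$, want to compute

$$A \approx \hat{X} \hat{\Lambda} \hat{X}^T$$

where $\hat{X}$ and $\hat{\Lambda}$ contain the approximate eigenvectors and eigenvalues, respectively, and satisfy

1) $\| A - \hat{X} \hat{\Lambda} \hat{X}^T \|_2 = O(\tau \|A\|_2)$

2) $\max_{i=1,\dots,n} \| (\hat{X}^T \hat{X} - I) e_i \|_2 = O(\varepsilon_{mach}\, n)$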
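The two accuracy requirements can be checked numerically for any candidate decomposition. The sketch below is an illustration with a hypothetical function name; it returns the scaled residual $\|A - X \Lambda X^T\|_2 / \|A\|_2$ and the largest column norm of $X^T X - I$, which can then be compared against the tolerance and against machine precision times $n$.

```python
import numpy as np

def decomposition_errors(A, X, lam):
    """Return (scaled residual, departure from orthogonality) for A ~ X diag(lam) X^T."""
    n = A.shape[0]
    # X * lam scales column j of X by lam[j], i.e. X @ diag(lam).
    residual = np.linalg.norm(A - (X * lam) @ X.T, 2) / np.linalg.norm(A, 2)
    # Column-wise 2-norms of (X^T X - I): max_i ||(X^T X - I) e_i||_2.
    orthogonality = np.max(np.linalg.norm(X.T @ X - np.eye(n), axis=0))
    return residual, orthogonality
```

For example, perturbing exact eigenvalues by $10^{-6} \|A\|_2$ gives a scaled residual of about $10^{-6}$ while leaving orthogonality at roundoff level.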


Block Algorithms for Approximate Eigensolver

1) Block-tridiagonal divide-and-conquer (BD&C) – The centerpiece

2) Block tridiagonalization (BT) – Block tridiagonalization of sparse and “effectively” sparse matrices

3) Orthogonal reduction of full matrix to block-tridiagonal form (OBR) – Orthogonal transformations to produce block-tridiagonal matrix


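1) BD&C Algorithm*

$$M = \begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

(block tridiagonal matrix)

Decompose: $M \approx \hat{V} \hat{\Lambda} \hat{V}^T$ such that $\| M - \hat{V} \hat{\Lambda} \hat{V}^T \|_2 = O(\tau \|M\|_2)$, where

• $\tau$: accuracy tolerance
• $q$: number of blocks
• $\hat{V}$: numerically orthogonal eigenvector matrix
• $\hat{\Lambda}$: diagonal matrix of eigenvalues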

* W. N. Gansterer, R. C. Ward, R. P. Muller and W. A. Goddard III, Computing Approximate Eigenpairs of Symmetric Block Tridiagonal Matrices, SIAM J. Sci. Comput., 25 (2003), pp. 65 – 85.
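As a concrete picture of the input, a symmetric block-tridiagonal matrix can be assembled from its diagonal blocks $B_1, \dots, B_q$ and sub-diagonal blocks $C_1, \dots, C_{q-1}$. The helper below is a hypothetical illustration (name and argument layout are assumptions), storing the result densely for clarity rather than efficiency.

```python
import numpy as np

def assemble_block_tridiagonal(diag_blocks, sub_blocks):
    """Build M from diagonal blocks B_i and sub-diagonal blocks C_i (C_i^T goes above)."""
    sizes = [B.shape[0] for B in diag_blocks]
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    M = np.zeros((offsets[-1], offsets[-1]))
    for i, B in enumerate(diag_blocks):
        rows = slice(offsets[i], offsets[i + 1])
        M[rows, rows] = B
    for i, C in enumerate(sub_blocks):
        # C_i couples block row i+1 with block column i.
        rows = slice(offsets[i + 1], offsets[i + 2])
        cols = slice(offsets[i], offsets[i + 1])
        M[rows, cols] = C
        M[cols, rows] = C.T
    return M
```

The diagonal blocks may have different sizes; block $C_i$ must have as many rows as $B_{i+1}$ and as many columns as $B_i$.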


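Three Steps of BD&C

1. Subdivision

$$M = \tilde{M} + \sum_{i=1}^{q-1} W_i W_i^T, \quad \text{with } \tilde{M} = \mathrm{diag}\{\tilde{B}_1, \tilde{B}_2, \dots, \tilde{B}_q\}$$

2. Solve sub-problems: decompose

$$\tilde{B}_i = Z_i D_i Z_i^T, \quad i = 1, 2, \dots, q$$

3. Synthesis (the most time consuming step)

$$M = Z \Big( D + \sum_{i=1}^{q-1} Y_i Y_i^T \Big) Z^T$$

where $Y_i = Z^T W_i$, $Z = \mathrm{diag}(Z_1, \dots, Z_q)$, $D = \mathrm{diag}(D_1, \dots, D_q)$.

For each rank-one modification, decompose $D + y y^T = V_i \Lambda_i V_i^T$, then multiply $V_i$ and $Z$.

Complexity: a function of deflation, rank, and size: $k\, n^3$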
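The three steps (subdivision, sub-problem solution, synthesis) can be traced on the smallest case, $q = 2$. The sketch below is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, the off-diagonal block is split through its SVD so that $M = \tilde{M} + W W^T$, and the synthesis uses a dense `np.linalg.eigh` call in place of the rank-one-update solver with deflation that gives BD&C its efficiency.

```python
import numpy as np

def two_block_bdc(B1, B2, C1, rank_tol=1e-12):
    """Eigen-decomposition of M = [[B1, C1^T], [C1, B2]] via the three BD&C steps."""
    n1, n2 = B1.shape[0], B2.shape[0]
    # 1. Subdivision: C1 = U diag(s) V^T, keep the numerically nonzero rank.
    U, s, Vt = np.linalg.svd(C1, full_matrices=False)
    keep = s > rank_tol * s[0]
    U, s, Vt = U[:, keep], s[keep], Vt[keep, :]
    W = np.vstack((Vt.T * np.sqrt(s), U * np.sqrt(s)))   # M = M_tilde + W W^T
    B1_tilde = B1 - (Vt.T * s) @ Vt
    B2_tilde = B2 - (U * s) @ U.T
    # 2. Solve the independent sub-problems.
    d1, Z1 = np.linalg.eigh(B1_tilde)
    d2, Z2 = np.linalg.eigh(B2_tilde)
    Z = np.zeros((n1 + n2, n1 + n2))
    Z[:n1, :n1] = Z1
    Z[n1:, n1:] = Z2
    D = np.concatenate((d1, d2))
    # 3. Synthesis: eigen-decompose D + Y Y^T (dense stand-in), then multiply by Z.
    Y = Z.T @ W
    lam, V = np.linalg.eigh(np.diag(D) + Y @ Y.T)
    return lam, Z @ V
```

The number of columns of `W` equals the rank of the off-diagonal block, which is why low-rank off-diagonal blocks keep the synthesis step cheap.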


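2) Block Tridiagonalization (BT)*

• An approximation to the original full matrix
• May require eigenvectors from previous iteration

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & a_{24} & \cdots & a_{2n} \\ a_{31} & a_{32} & a_{33} & a_{34} & \cdots & a_{3n} \\ a_{41} & a_{42} & a_{43} & a_{44} & \cdots & a_{4n} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & a_{n4} & \cdots & a_{nn} \end{pmatrix}$$

$$\longrightarrow \begin{pmatrix} a_{11} & 0 & a_{13} & 0 & \cdots & 0 \\ 0 & a_{22} & a_{23} & a_{24} & \cdots & 0 \\ a_{31} & a_{32} & a_{33} & 0 & \cdots & a_{3n} \\ 0 & a_{42} & 0 & a_{44} & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & a_{n3} & 0 & \cdots & a_{nn} \end{pmatrix}$$

$$\longrightarrow M = \begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

Complexity: $O(n^2)$

* Y. Bai, W. N. Gansterer and R. C. Ward, Block-Tridiagonalization of “Effectively” Sparse Symmetric Matrices, ACM Trans. Math. Softw., 30 (2004), pp. 326 – 352.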
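The idea of approximating a full matrix by a sparser one can be illustrated with plain thresholding. This is a deliberately simplified sketch and not the BT algorithm of the cited paper (which also involves steps not shown on the slide): the function name and the cutoff rule $|a_{ij}| < \tau \|A\|_2$ are assumptions. It drops small entries and reports the half-bandwidth of what remains; a matrix with half-bandwidth $b$ is block tridiagonal for diagonal blocks of size at least $b$.

```python
import numpy as np

def threshold_and_bandwidth(A, tau):
    """Zero out entries below tau * ||A||_2 and return (M, half-bandwidth of M)."""
    cutoff = tau * np.linalg.norm(A, 2)
    M = np.where(np.abs(A) >= cutoff, A, 0.0)
    rows, cols = np.nonzero(M)
    bandwidth = int(np.max(np.abs(rows - cols))) if rows.size else 0
    return M, bandwidth
```

Each dropped entry is smaller than the cutoff, so the error $E = M - A$ satisfies $\|E\|_2 \le \|E\|_F \le n \tau \|A\|_2$ in this sketch; the bound is crude and only meant to show that the perturbation is controlled by the tolerance.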


3) Orthogonal Reduction to Block-Tridiagonal Matrix (OBR) *

• A full matrix that cannot be sparsified
• A sequence of Householder transformations

Complexity: $O\left(\frac{4}{3} n^3\right)$
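One way to reduce a full symmetric matrix to block-tridiagonal form with Householder transformations is a panel-by-panel QR sweep, sketched below. This is an assumption-laden illustration rather than the OBR algorithm itself: the function name and the fixed block size `b` are hypothetical, and `np.linalg.qr` (Householder-based in LAPACK) is applied to each panel below the first sub-diagonal block.

```python
import numpy as np

def orthogonal_block_reduction(A, b):
    """Return (Q, M) with A = Q M Q^T and M block tridiagonal for block size b."""
    M = np.array(A, dtype=float)
    n = M.shape[0]
    Q_total = np.eye(n)
    for k in range(0, n - 2 * b, b):
        lo = k + b
        # QR of the panel below the diagonal block of block column k.
        Q, _ = np.linalg.qr(M[lo:, k:k + b], mode="complete")
        # Two-sided orthogonal update keeps the eigenvalues unchanged.
        M[lo:, :] = Q.T @ M[lo:, :]
        M[:, lo:] = M[:, lo:] @ Q
        Q_total[:, lo:] = Q_total[:, lo:] @ Q
    return Q_total, M
```

After the sweep every entry with $i - j \ge 2b$ is zero up to roundoff, so the result is block tridiagonal with blocks of size `b`; eigenvectors of `M` are mapped back by multiplying with `Q_total`.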


Complexity of Major Components

Algorithm    Computational Complexity
BD&C         $\frac{8}{3} n^3$
BT           $n^2$
OBR          $\frac{4}{3} n^3$

Notation:
• message passing latency
• time to transfer one floating point number
• time for one floating point operation
• ranks for off-diagonal blocks


Parallel Implementations

• Parallel block divide-and-conquer (PBD&C) *
• Preprocessing
   • Parallel block tridiagonalization (PBT)
   • Parallel orthogonal block-tridiagonal reduction (POBR) **

* Yihua Bai and Robert C. Ward, A Parallel Symmetric Block-Tridiagonal Divide-and-Conquer Algorithm, Technical Report UT-CS-06-571, University of Tennessee, December 2005. Submitted to ACM TOMS.

** Yihua Bai and Robert C. Ward, Parallel Block Tridiagonalization of Real Symmetric Matrices, Technical Report UT-CS-06-578, University of Tennessee, June 2006. Submitted to ACM TOMS.


Implementations of PBD&C

Mixed data/task parallel implementation versus complete data parallel implementation


Mixed Parallel Implementation

Data distribution and redistribution

Merging sequence and workload balance

Mixed parallelism – data/task

Deflation


Matrix Distribution – Mixed Data/Task Parallelism

Divide processors into groups of sub-grids

Assign each sub-grid to a sub-problem

$$M = \begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

Block-tridiagonal matrix with q diagonal blocks


Matrix Distribution – Example

Each diagonal block assigned a sub-grid

2D block cyclic distribution on each sub-grid
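In a 2D block cyclic distribution, the owner of a matrix entry follows from simple integer arithmetic. The sketch below uses hypothetical names, zero-based indices, and assumes the distribution starts at process (0, 0); `mb` and `nb` are the row and column block sizes of the distribution.

```python
def block_cyclic_owner(i, j, mb, nb, p_rows, p_cols):
    """Process-grid coordinates owning global entry (i, j), zero-based."""
    return (i // mb) % p_rows, (j // nb) % p_cols

def ownership_map(n, mb, nb, p_rows, p_cols):
    """Owner coordinates for every entry of an n-by-n matrix, as a nested list."""
    return [[block_cyclic_owner(i, j, mb, nb, p_rows, p_cols) for j in range(n)]
            for i in range(n)]
```

For instance, with 2-by-2 distribution blocks on a 2 × 2 grid, entry (2, 0) belongs to process (1, 0), and entry (4, 5) wraps around to process (0, 0).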


Data Redistribution

Distribute from a 2 × 2 grid to a 3 × 3 grid

Redistribute data from one sub-grid to another one (subdivision step)


Data Redistribution (cont'd)

Distribute from a 2 × 2 grid and a 2 × 4 grid to a 3 × 4 grid

Redistribute data for each merging operation from two sub-grids to one super-grid (synthesis step)


Merging Sequence

[Figure: merging tree with levels 0 to 4 and sub-tree heights $h_{left}$ and $h_{right}$; labels mark the final merging operation and the idle time]

The final merging operation accounts for up to 75% of the total computational cost. The merging sequence should therefore take both low computational complexity and workload balance into account for the final merge.


Problems

• Subgrid construction
   • Example: subgrid 1: 2 × 2; subgrid 2: 5 × 5; supergrid: 1 × 29?
• Many communicator handles
   • Can use up to 2k handles, where k = max(number of diagonal blocks, number of total processors)
• Portability on different MPI implementations
   • Example: minor modification of the code is needed when using mpimx (Myrinet MPI)
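The subgrid construction problem in the example comes from the processor count: merging a 2 × 2 and a 5 × 5 subgrid gives 29 processors, and 29 is prime. A small sketch (hypothetical helper name) that picks the most nearly square grid for a given processor count makes this visible.

```python
import math

def most_square_grid(p):
    """Return (rows, cols) with rows * cols == p and rows <= cols as close as possible."""
    rows = math.isqrt(p)
    while p % rows != 0:
        rows -= 1
    return rows, p // rows
```

With this rule, 4 + 8 = 12 processors form a 3 × 4 grid, while 4 + 25 = 29 processors can only form a 1 × 29 grid.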


Complete Data Parallel Implementation

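$$M = \begin{pmatrix} B_1 & C_1^T & & \\ C_1 & B_2 & \ddots & \\ & \ddots & \ddots & C_{q-1}^T \\ & & C_{q-1} & B_q \end{pmatrix}$$

Block-tridiagonal matrix with q diagonal blocks

Assign all processors to each block in the block-tridiagonal matrix

Assume a 2 × 2 processor grid, assigned to $B_1, B_2, \dots, B_q$ and $C_1, C_2, \dots, C_{q-1}$.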


Advantages and Disadvantages

Advantages
• One communicator
• One processor grid
• Portability to different MPI platforms

Disadvantages
• Not all processors involved in some steps
   • SVD of off-diagonal blocks
   • Decomposition of diagonal blocks
   • Merge smaller sub-problems
• Still need data redistribution for each merging operation


Numerical Results

• Mixed data/task parallel BD&C subroutine PDSBTDC vs. ScaLAPACK PDSYEVD
   • Matrices with different eigenvalue distributions and different sizes
   • Banded application matrix
• Complete data parallel BD&C subroutine PDSBTDCD vs. mixed data/task parallel BD&C subroutine PDSBTDC


Machine Specifications

IBM p690 System at ORNL


PDSBTDC vs. PDSYEVD on Matrices with Different Eigenvalue Distributions

Arithmetically distributed eigenvalues

Geometrically distributed eigenvalues

Tolerance $\tau = 10^{-6}$, block size $b = 20$


Accuracy of PDSBTDC

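Residual:

$$\mathcal{R} = \max_{i=1,\dots,n} \frac{\| A \hat{x}_i - \hat{\lambda}_i \hat{x}_i \|_2}{\|A\|_2}$$

Departure from orthogonality:

$$\mathcal{O} = \max_{i=1,\dots,n} \frac{\| (\hat{X}^T \hat{X} - I) e_i \|_2}{n}$$

Tolerance: $10^{-6}$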


PDSBTDC on Application Matrix

PDSBTDC with different tolerances

Polyalanine matrix, n = 5027, b = 79


Performance Test on UT SInRG AMD Opteron Processor 240 Cluster

Number of nodes        64
Memory per node        2 GB
Processors per node    2
CPU frequency          1.4 GHz
L2 cache               1 MB
TLB size               1024 4K pages
Interconnect           Myrinet 2000

Similar performance and scales a little better


PDSBTDC vs. PDSBTDCD Performance

Block-tridiagonal matrix with arithmetically distributed eigenvalues, matrix size = 12000, block size = 20, tolerance = $10^{-6}$.

Data parallel implementation scales down in SVD of off-diagonal blocks and solving sub-problems.


Application in Electronic Structure Calculation

Trans-Polyacetylene

• Simple chemical structure
• Semiconducting conjugated polymer
• Light emitting devices, flexible
• Fast nonlinear optical response
• Strong nonlinear susceptibility


Matrix Generated from trans-PA

Yihua Bai, Robert C. Ward, and Guoping Zhang, Parallel Divide-and-Conquer Algorithm for Computing Full Spectrum of Polyacetylene, Poster at the Division of Atomic, Molecular and Optical Physics (DAMOP) 2006 meeting, Knoxville, Tennessee.


Two Steps to Compute Approximate Eigen-Solution

1. Construct block-tridiagonal matrix from the original dense matrix H
   • M = H + E, where M is block tridiagonal
   • Algorithm: PBT

2. Compute eigensolutions to reduced accuracy
   • User defined accuracy, typically $10^{-6}$
   • Algorithm: PBD&C


Compare Execution Time with ScaLAPACK PDSYEVD

Trans-(CH)16000, n = 16000, tolerance = $10^{-6}$.

With lower accuracy (i.e., $10^{-6}$), the savings in execution time are an order of magnitude.


Relative Execution Time with Fixed $n^2/p$

With fixed per-processor problem size, the relative execution time for an $O(n^3)$ algorithm should be

$$T(2n, 2^2 p) = 2\, T(n, p)$$

as the reference line shows. The curve for our new parallel algorithm shows a computational complexity between $O(n^2)$ and $O(n^3)$.
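The slope of the reference line can be checked with a short derivation. It assumes ideal parallel efficiency and a hypothetical cost constant $c$; neither is stated on the slide.

```latex
T(n,p) \approx c\,\frac{n^3}{p} = c\,\frac{n^2}{p}\,n
\quad\Longrightarrow\quad
\frac{T(2n,\,4p)}{T(n,\,p)} = \frac{(2n)^3/(4p)}{n^3/p} = 2 .
```

Under the same assumptions an $O(n^2)$ algorithm would give $T(n,p) \approx c\, n^2/p$, which stays constant when $n^2/p$ is fixed. A measured curve lying between a flat line and the reference line is therefore consistent with a complexity between $O(n^2)$ and $O(n^3)$.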


Conclusion and Future Work


Conclusion

• PBD&C: very efficient on block tridiagonal matrices with
   • Low ranks for off-diagonal blocks
   • High ratio of deflation
• Comparison of PDSBTDC and PDSBTDCD
   • PDSBTDCD performs better with a smaller number of processors in use
   • PDSBTDC scales better as the number of processors in use increases
• PBD&C combined with PBT
   • Efficient on application matrices with specific locality property


Future Work

A Parallel Adaptive Eigensolver

Alternative method for computation of eigenvectors

Approximation in sparse eigensolver

Incorporate PBD&C and PBT into SCF for trans-PA

Fine tuning of PDSBTDCD


End of Presentation

Thank you!


Acknowledgement

Dr. R. P. Muller, Sandia National Laboratories
Dr. G. Zhang, Indiana State University


Task Flowchart

Major Efficiency improvements from

• Reduced accuracy in early iterations of SCF

• Reducing the reduction bottleneck

• Eigenvectors may be required if efforts are made to improve efficiency


Complexity of Major Components

Algorithm    Sequential
BD&C         $\frac{8}{3} n^3$
BT           $n^2$
OBR          $\frac{4}{3} n^3$

[Parallel column: expressions in $n$, $p$, and $n_b$ are illegible in the source]

Notation:
• message passing latency
• time to transfer one floating point number
• time for one floating point operation
• $n_b$: block size for parallel 2D matrix distribution