




Slide 1

Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE

S. Mniszewski*, M. Cawkwell, A. Niklasson

GPU Technology Conference

San Jose, California

March 18-21, 2013

*[email protected]


Slide 2

Background: Quantum Molecular Dynamics

• In a molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the forces acting on each atom

• Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena

• Relies on an interatomic potential to calculate forces and energies

• Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions


Slide 3

Quantum-based Interatomic Potentials

• The electronic structure of atoms and molecules is modeled explicitly

• Provides the most accurate and reliable descriptions of interatomic bonding

• Their prohibitive computational cost has prevented widespread use; better algorithms and GPU architectures are important paths forward

Energy: E = 2Tr[ρH]

Force: fi = −2Tr[ρ ∂H/∂Ri]

[Figure: schematic of time per MD time step versus number of atoms; quantum MD with improved algorithms and GPUs closes the gap to empirical pair potentials]

• Hamiltonian matrix H

• The density matrix ρ is computed self-consistently from H


Slide 4

The Density Matrix Computation

• The algorithms typically used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs
  - Due to their complexity
  - Difficulty in extracting thread-level parallelism
  - Difficulty in avoiding branching within warps

• New approach in LATTE
  - The density matrix is computed directly from the Hamiltonian through a recursive expansion of the Fermi operator with the second order spectral projection (SP2) algorithm
  - Based on a series of generalized matrix-matrix multiplications
  - Only one matrix-matrix multiplication is required per iteration
  - Maps very well to GPUs


Slide 5

The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity

Recursive Fermi operator expansion:

ρ = θ[µI − H] = lim(i→∞) fi[ fi−1[ … f0[X0] … ]]

X0 = (εmax·I − H) / (εmax − εmin)

fi[Xi] = Xi²          if 2Tr[Xi] ≥ Ne
fi[Xi] = 2Xi − Xi²    if 2Tr[Xi] < Ne

[Figure: the projection polynomials f(x) = x² and f(x) = 2x − x² on [0, 1], and their repeated composition f8(f7(… f1(x) …)), which drives eigenvalues toward occupied (1) or unoccupied (0)]
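To make the recursion concrete, here is a minimal CPU-only C++ sketch, not the LATTE implementation: the Matrix alias, the naive matmul standing in for xGEMM, and the occupation-based stopping test are all illustrative assumptions.

#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // dense row-major n x n

// C = A * B (naive O(n^3) multiply, standing in for xGEMM)
static void matmul(const Matrix& A, const Matrix& B, Matrix& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double s = 0.0;
            for (std::size_t k = 0; k < n; ++k) s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

static double trace(const Matrix& X, std::size_t n) {
    double t = 0.0;
    for (std::size_t i = 0; i < n; ++i) t += X[i*n + i];
    return t;
}

// SP2: build rho from H given eigenvalue bounds [emin, emax] and the
// electron count ne; maxIter and tol are illustrative choices.
Matrix sp2(const Matrix& H, std::size_t n, double ne,
           double emin, double emax, int maxIter = 100, double tol = 1e-10) {
    Matrix X(n * n), X2(n * n);
    // X0 = (emax*I - H) / (emax - emin): maps eigenvalues of H into [0, 1]
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            X[i*n + j] = ((i == j ? emax : 0.0) - H[i*n + j]) / (emax - emin);

    for (int it = 0; it < maxIter; ++it) {
        matmul(X, X, X2, n);  // the single matrix multiply per iteration
        double t = trace(X, n), t2 = trace(X2, n);
        // Pick the branch (square vs. expand) whose occupation 2*Tr[X]
        // lands closer to the electron count ne
        if (std::fabs(2.0 * t2 - ne) < std::fabs(2.0 * (2.0 * t - t2) - ne)) {
            X = X2;                                   // X <- X^2
        } else {
            for (std::size_t k = 0; k < n*n; ++k)
                X[k] = 2.0 * X[k] - X2[k];            // X <- 2X - X^2
        }
        if (std::fabs(2.0 * trace(X, n) - ne) < tol) break;  // simplistic stop test
    }
    return X;  // the density matrix rho
}

Expressed this way, a GPU port only has to replace matmul with a CUBLAS xGEMM call and trace with a small reduction kernel, which is exactly the structure of the algorithms on the following slides.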


Slide 6

The GPU Implementation

• Part of the LATTE codebase
  - Employs a semi-empirical tight-binding model of interatomic bonding based on formalisms derived from density functional theory
  - The density matrix build is by far the slowest step in the calculation
  - CPU version in Fortran 90

• Hardware/software architecture
  - Keeneland* cluster at the National Institute for Computational Sciences
  - CPU: 2 Intel hex-core Xeon CPUs per node, Intel Fortran Compiler, MKL
  - GPU: 3 Nvidia M2090 GPUs (previously M2070 GPUs)
  - CUDA 4.2, CUBLAS, and a thread block size of 1024

• Use of CUDA features on GPUs (a sketch follows the footnote below)
  - Unified Virtual Addressing
  - Peer-to-peer memory access/copy
  - Streams: ordered sequences of commands
  - Single-thread access to all GPUs


*J.S. Vetter, R. Glassbrook, J. Dongarra, K. Schwan, B. Loftis, S. McNally, J. Meredith, J. Rogers, P. Roth, K. Spafford, and S. Yalamanchili, “Keeneland: Bringing heterogeneous GPU computing to the computational science community,” IEEE Computing in Science and Engineering, 13(5):90-5, 2011, http://dx.doi.org/10.1109/MCSE.2011.83.
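A minimal sketch of how these CUDA features combine, assuming a single host thread drives all GPUs; error handling is trimmed and the 8-GPU cap is an arbitrary choice for the sketch, not LATTE code.

#include <cstdio>
#include <cuda_runtime.h>

// One host thread prepares every GPU: enable peer-to-peer access between all
// pairs (UVA lets one address space span the devices) and give each GPU its
// own stream so later sub-block operations can run concurrently.
int main() {
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);
    if (nGpus > 8) nGpus = 8;                      // this sketch assumes <= 8 GPUs
    cudaStream_t streams[8];
    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);                          // the single thread selects each GPU in turn
        for (int j = 0; j < nGpus; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            if (canAccess)
                cudaDeviceEnablePeerAccess(j, 0);  // direct device-to-device access/copy
        }
        cudaStreamCreate(&streams[i]);             // per-GPU command queue
    }
    std::printf("Prepared %d GPUs for peer access\n", nGpus);
    return 0;
}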


Slide 7

SP2 Algorithm Using the Hybrid CPU/GPU Approach

Estimate εmax and εmin
X = (εmax·I − H) / (εmax − εmin)
TraceX = Tr[X]                        /* trace kernel on GPU */
Until converged do
    Xtmp = X
    Xtmp = X² − Xtmp                  /* CUBLAS xGEMM */
    TraceXtmp = Tr[Xtmp]              /* trace kernel on GPU */
    if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne| then
        X = X + Xtmp
        TraceX = TraceX + TraceXtmp
    else
        X = X − Xtmp
        TraceX = TraceX − TraceXtmp
End until
ρ = X
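The /* trace kernel on GPU */ steps above amount to a reduction over the diagonal. One possible single-block CUDA form is sketched below; this is an assumption about its shape, since the LATTE kernel itself is not shown in the slides.

// Trace of an n x n matrix: one thread block reduces the diagonal in shared
// memory. Launch with the block size quoted on slide 6:
//   traceKernel<<<1, 1024>>>(d_X, n, d_trace);
__global__ void traceKernel(const double* X, int n, double* result) {
    __shared__ double partial[1024];
    int tid = threadIdx.x;
    double s = 0.0;
    for (int i = tid; i < n; i += blockDim.x)      // stride over diagonal entries
        s += X[(long long)i * n + i];
    partial[tid] = s;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {  // tree reduction
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) *result = partial[0];
}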


Slide 8

SP2 Algorithm Using the Full GPU Approach

Estimate εmax and εmin
X = (εmax·I − H) / (εmax − εmin)
TraceX = Tr[X]                        /* trace kernel on GPU */
Until converged do
    Xtmp = X
    Xtmp = X² − Xtmp                  /* CUBLAS xGEMM */
    TraceXtmp = Tr[Xtmp]              /* trace kernel on GPU */
    if |2·TraceX − 2·TraceXtmp − Ne| > |2·TraceX + 2·TraceXtmp − Ne| then
        X = X + Xtmp                  /* CUBLAS xAXPY */
        TraceX = TraceX + TraceXtmp
    else
        X = X − Xtmp                  /* CUBLAS xAXPY */
        TraceX = TraceX − TraceXtmp
End until
ρ = X
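A host-side sketch of one iteration of this loop with the cuBLAS v2 API. The function name sp2Iteration, the preallocated buffers d_X, d_Xtmp, and d_ones (a device vector of n ones), and the strided dot product used in place of the dedicated trace kernel are all assumptions made for this sketch.

#include <cublas_v2.h>
#include <cmath>

// One SP2 iteration entirely on the GPU. Note Xtmp = X*X - Xtmp is a single
// xGEMM with alpha = 1, beta = -1, so Xtmp holds X^2 - X afterwards.
void sp2Iteration(cublasHandle_t handle, double* d_X, double* d_Xtmp,
                  const double* d_ones, int n, double ne,
                  double& traceX /* running Tr[X], updated in place */) {
    const double one = 1.0, minusOne = -1.0;
    // Xtmp = X
    cublasDcopy(handle, n * n, d_X, 1, d_Xtmp, 1);
    // Xtmp = X*X - Xtmp  (the one xGEMM per iteration)
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_X, n, d_X, n, &minusOne, d_Xtmp, n);
    // Tr[Xtmp] as a dot product striding over the diagonal (stride n + 1)
    double traceXtmp = 0.0;
    cublasDdot(handle, n, d_Xtmp, n + 1, d_ones, 1, &traceXtmp);
    // Pick the projection whose occupation lands closer to Ne
    double alpha;
    if (std::fabs(2.0 * traceX - 2.0 * traceXtmp - ne) >
        std::fabs(2.0 * traceX + 2.0 * traceXtmp - ne)) {
        alpha = 1.0;   // X = X + Xtmp = X^2
    } else {
        alpha = -1.0;  // X = X - Xtmp = 2X - X^2
    }
    cublasDaxpy(handle, n * n, &alpha, d_Xtmp, 1, d_X, 1);  // CUBLAS xAXPY
    traceX += alpha * traceXtmp;
}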


Slide 9

CUBLAS Matrix Multiplication Performance (Nvidia M2070)

Song, F., Tomov, S., Dongarra, J. "Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems," 26th ACM International Conference on Supercomputing (ICS 2012), ACM, San Servolo Island, Venice, Italy, June, 2012.


Slide 10

Array Padding for Performance (M × M)

[Figure: average time per xGEMM execution (s) versus M for M = 3500-4000; panel (a) DGEMM (double precision, 0.30-0.45 s), panel (b) SGEMM (single precision, 0.14-0.22 s)]

Average time to execute the CUBLAS xGEMM generalized matrix-matrix multiplication for M × M matrices: (a) DGEMM, (b) SGEMM. The broken and solid lines correspond to computations performed with and without padding of the arrays, respectively.
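The slides do not state the padding rule itself; the sketch below assumes rounding M up to a multiple of 64 and zero-fills the border before handing the array to xGEMM. The function name padForGemm and the multiple-of-64 choice are illustrative assumptions, and, as the figure shows, the best pad depends on the GPU and CUBLAS version.

#include <cuda_runtime.h>

// Copy an M x M host matrix into a zero-padded Mpad x Mpad device buffer.
double* padForGemm(const double* h_A, int M, int& Mpad) {
    Mpad = ((M + 63) / 64) * 64;                 // round M up to a multiple of 64
    double* d_Apad = nullptr;
    cudaMalloc(&d_Apad, (size_t)Mpad * Mpad * sizeof(double));
    cudaMemset(d_Apad, 0, (size_t)Mpad * Mpad * sizeof(double));  // zero the border
    // Strided copy: place the M x M block in the top-left corner of the pad
    cudaMemcpy2D(d_Apad, (size_t)Mpad * sizeof(double),   // dst, dst pitch (bytes)
                 h_A, (size_t)M * sizeof(double),         // src, src pitch (bytes)
                 (size_t)M * sizeof(double), M,           // width (bytes), height (rows)
                 cudaMemcpyHostToDevice);
    return d_Apad;
}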


Slide 11

Performance Analysis: Density Matrix Calculation (Nvidia M2070) – Liquid Methane (10-1000 molecules)

[Figure: time per density matrix calculation (s) versus number of orbitals, in double precision, comparing diagonalization, SP2 on the CPU (algorithm 1), the hybrid CPU/GPU SP2 (algorithm 2), and the full GPU SP2 (algorithm 3); shown on log-log axes (0.01-100 s, up to ~1000 orbitals) and on linear axes (50-300 s, up to 8000 orbitals)]


Slide 12

Error Analysis in Density Matrices – Idempotency Measure

[Figure: idempotency error ||ρ² − ρ||2 (10⁻¹⁶ to 10⁻¹³) versus system size for diagonalization, SP2 on the CPU (algorithm 1), the hybrid CPU/GPU SP2 (algorithm 2), and the full GPU SP2 (algorithm 3)]

The SP2 algorithm has errors that are independent of system size whereas traditional diagonalization yields errors that increase with the number of atoms.

Idempotency: ρ² = ρ

Error = ||ρ² − ρ||2


Slide 13

Multi-GPU Generalized Matrix-Matrix Multiplication

• Multiple streams are used for the sub-block matrix-matrix multiplications, additions, and matrix traces

• The blocked matrix is reassembled efficiently via the native functionality of CUBLAS DGEMM/SGEMM

[Diagram: the operand matrices are partitioned into sub-blocks that are distributed across GPU 0 … GPU N−1; each GPU computes its sub-block products, and the result matrix is assembled from the per-GPU contributions]


Slide 14

Using Streams for SP2 Sub-block Matrix Operations

[Diagram: each GPU runs its own stream (GPU 0 – Stream 0, GPU 1 – Stream 1, …); within each stream the sub-block operations Xtmp = X² − X (xGEMM), Tr[Xtmp], and X ± Xtmp (xAXPY) execute in order, while matrix blocks are assembled in GPU 0 and redistributed between iterations; a simplified sketch follows]
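A simplified sketch of the blocked multiply on slides 13-14, assuming an even column split, at most 8 GPUs, and staging through the host; the slides instead reassemble the blocks on GPU 0 and redistribute them with peer-to-peer copies. The function name multiGpuGemm and its parameters are assumptions.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// C = A * B with the block columns of C split across GPUs. Each GPU gets a
// full copy of A plus its own contiguous block of columns of B (column-major,
// as cuBLAS expects), and queues its work on its own stream so the per-GPU
// xGEMMs run concurrently.
void multiGpuGemm(const double* h_A, const double* h_B, double* h_C,
                  int n, int nGpus, cublasHandle_t* handles,
                  cudaStream_t* streams) {
    const double one = 1.0, zero = 0.0;
    const int cols = n / nGpus;                  // assumes n divisible by nGpus
    double *d_A[8], *d_B[8], *d_C[8];            // assumes nGpus <= 8
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        cublasSetStream(handles[g], streams[g]); // this GPU's command queue
        cudaMalloc(&d_A[g], (size_t)n * n * sizeof(double));
        cudaMalloc(&d_B[g], (size_t)n * cols * sizeof(double));
        cudaMalloc(&d_C[g], (size_t)n * cols * sizeof(double));
        cudaMemcpyAsync(d_A[g], h_A, (size_t)n * n * sizeof(double),
                        cudaMemcpyHostToDevice, streams[g]);
        cudaMemcpyAsync(d_B[g], h_B + (size_t)g * cols * n,
                        (size_t)n * cols * sizeof(double),
                        cudaMemcpyHostToDevice, streams[g]);
        // This GPU's slice: C(:, g*cols : (g+1)*cols) = A * B(:, g*cols : (g+1)*cols)
        cublasDgemm(handles[g], CUBLAS_OP_N, CUBLAS_OP_N, n, cols, n,
                    &one, d_A[g], n, d_B[g], n, &zero, d_C[g], n);
        cudaMemcpyAsync(h_C + (size_t)g * cols * n, d_C[g],
                        (size_t)n * cols * sizeof(double),
                        cudaMemcpyDeviceToHost, streams[g]);
    }
    for (int g = 0; g < nGpus; ++g) {            // wait for all GPUs, then clean up
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
        cudaFree(d_A[g]); cudaFree(d_B[g]); cudaFree(d_C[g]);
    }
}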


Slide 15

Performance Analysis: Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10-1250 molecules)

[Figure: time per density matrix build (s, 0-90) versus matrix dimension (0-10000), comparing SP2 on 1, 2, and 3 GPUs with SP2 on the CPU and diagonalization]


Slide 16

Performance Analysis for 1-3 GPUs

[Figure: log-log plot of time per density matrix build (s, 0.01-100) versus matrix dimension (1000-10000) for 1, 2, and 3 GPUs, together with the corresponding optimal-scaling curves for 1, 2, and 3 GPUs]


Slide 17

Summary

• GPUs can be used effectively for the density matrix computation in quantum-mechanical models

• The recursive SP2 algorithm is well suited to the GPU architecture

• Transfers of arrays between the CPU and GPU make only a minor contribution to the overall run time

• Array padding is important for performance

• The GPU version of the SP2 algorithm provides comparable or better accuracy than traditional diagonalization

• Massive speed-ups with respect to traditional algorithms have been observed with no loss of accuracy



Slide 18

Related Publications

1. Sanville EJ, Bock N, Coe J, Mniszewski SM, Niklasson AMN, Cawkwell MJ, 2010, LATTE. Los Alamos National Laboratory (LA-CC-10-004), http://savannah.nongnu.org/projects/latte.

2.  Cawkwell MJ, Sanville EJ, Mniszewski SM, Niklasson AMN, 2012, Computing the Density Matrix in Electronic Structure Theory on Graphics Processing Units. J. Chem. Theory Comput., Vol 8, Issue 11, pp. 4094-4101, http://pubs.acs.org/doi/full/10.1021/ct300442w.

3.  Mniszewski SM, Cawkwell MJ, Niklasson AMN, 2013, Quantum-based Dynamics on Graphics Processing Units. 2013 Associate Directorate for Theory, Simulation, and Computation (ADTSC) Highlights (in press).
