Ab Initio Quantum Chemistry on Graphics Processing Units

Rethinking Algorithms for Massively Parallel Architectures

Jörg Kussmann

Theoretical Chemistry, University of Munich (LMU)

23rd May 2014

J. Kussmann Quantum Chemistry@GPU

Outline

Introduction

Challenges of Ab Initio Quantum Chemistry

Optimizing SCF-Algorithms @ GPUs

Data-Arrangement

Coulomb-, Exchange-, XC-Potential

Exchange Potential: GPU-specific optimization

Exemplary Calculations: SCF & Properties

Hybrid MPI/CUDA Parallelization

Outlook: Post-HF Algorithms @ GPUs

Challenge

SOS-MP2 @ GPUs

PART 1: Ab Initio Methods

Schrödinger equation:

$\hat{H}\Psi = i\hbar\,\dot{\Psi} \;\xrightarrow{\text{stat.}}\; \hat{H}\Psi = E\Psi$

Molecular properties:

Energetics/Geometries

Vibrational frequencies

Electric properties

Magnetic properties

Dynamic properties

Conventional methods:

Systems with up to a few hundred atoms at the HF or KS-DFT level (at least O(M³) scaling)

Aim: Reduce scaling to O(M)!

Computational Effort: SCF Calculations

Roothaan-Hall: $\mathbf{F}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\epsilon}$

$F_{\mu\nu} = h^{\mathrm{core}}_{\mu\nu} + J_{\mu\nu}[\mathbf{P}] - (1 - a)\,K_{\mu\nu}[\mathbf{P}] + V^{\mathrm{XC}}_{\mu\nu}[a, \mathbf{P}]$

a = 0: HF
0 < a < 1: hybrid DFT
a = 1: KS-DFT

Rate-determining steps:

1) Fock build: O(N²) → O(N)
2) Diagonalization F → C: O(N³) → O(N)

[Kussmann/Beer/Ochsenfeld, WIREs Comput Mol Sci 3, 614 (2013)]
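The Fock-matrix expression above can be illustrated with a minimal dense sketch in Python. It assumes a tiny basis, a full four-index integral tensor `eri[m][n][l][s] = (mn|ls)` in chemists' notation, and the exchange definition K_µν = Σ_λσ (µλ|νσ) P_λσ used later in the talk. The function name `build_fock` and its arguments are hypothetical, and any prefactors that depend on how P is normalized are left exactly as written on the slide. This is the naive quartic-cost loop, not the screened or GPU algorithm discussed in the talk.

```python
def build_fock(h_core, eri, P, a, v_xc=None):
    """Assemble F = h_core + J[P] - (1 - a) * K[P] + V_xc for a tiny dense basis.

    eri[m][n][l][s] holds (mn|ls) in chemists' notation (assumed layout).
    a = 0 gives HF, 0 < a < 1 a hybrid functional, a = 1 pure KS-DFT.
    """
    n = len(h_core)
    F = [[0.0] * n for _ in range(n)]
    for m in range(n):
        for v in range(n):
            # Coulomb: J_mv = sum_ls (mv|ls) P_ls
            J = sum(eri[m][v][l][s] * P[l][s] for l in range(n) for s in range(n))
            # Exchange: K_mv = sum_ls (ml|vs) P_ls
            K = sum(eri[m][l][v][s] * P[l][s] for l in range(n) for s in range(n))
            xc = v_xc[m][v] if v_xc is not None else 0.0
            F[m][v] = h_core[m][v] + J - (1.0 - a) * K + xc
    return F
```

Setting `a = 1.0` removes the exchange term entirely, which is why the costly K build only matters for HF and hybrid functionals.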

Example: 16 A-T base pairs

HF/SVP (ϑint = 10⁻¹⁰, ϑconv = 10⁻⁷)

1052 atoms, 11230 basis functions

3 078 087 function pairs

9.5 × 10¹² primitive two-electron integrals

O(N) Fock build (8 cores): 30 000 s

(19 SCF iterations for tight convergence)

Moore’s Law: 1965-2010

Embrace new technologies: GPUs

Implementation of GPU-algorithms

Automatic code generation

All double precision; support for higher angular-momentum quantum numbers (l-qn)

Coulomb

McMurchie-Davidson based J-engine

Pre/Post-processing on CPU

Ignore bra/ket symmetry (2 x integrals)

Exchange

McMurchie-Davidson

Evaluate complete integral on GPU

Exploit only 1 permutational symmetry (4 x integrals)

1 thread per primitive integral: fine-grained data arrangement

[Ufimtsev/Martinez, JCTC 4, 222 (2008)]

Coulomb Potential

Exchange Potential

Coulomb very fast, try to improve on exchange first...

A) Reduce scaling to linear

B) Reduce local memory effort

C) Reduce shared memory effort

A) PreLinK: O(N) Exact Exchange on GPUs

Problem: O(N) algorithms employ loads of book-keeping, branching, and communication

Loop: bra l-quantum number combination
  Loop: ket l-quantum number combination
    Loop: bra shell-pairs µ, λ
      Determine significant (µλ|σν) quartets: $Q_{\mu\lambda}\, P^{\max}_{\lambda\sigma}\, Q_{\sigma\nu} \ge \vartheta_{\mathrm{int}}$ (+ permutations)
      Loop: ket shell-pairs σ, ν
        Evaluate: K_µν, K_µσ, K_λν, K_λσ
      End Loop
    End Loop
  End Loop
End Loop

Screening within inner loop
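The loop structure above can be mimicked in a few lines of Python to show where the screening test sits. This is only a sketch under simplifying assumptions: it works on individual basis functions instead of shell pairs, it skips the outer loops over l-quantum number combinations, and it checks just the one density element shown on the slide (the "+ permutations" are omitted). The function name `screened_quartets` is hypothetical.

```python
def screened_quartets(Q, P_max, theta_int):
    """Return the (mu, lam, sig, nu) quartets that survive the in-loop test.

    Q[m][l] is the Schwarz factor sqrt((ml|ml)); P_max[l][s] is the largest
    absolute density element coupling l and s (here simply |P_ls|).
    """
    n = len(Q)
    kept = []
    for mu in range(n):
        for lam in range(n):              # bra pair (mu, lam)
            for sig in range(n):
                for nu in range(n):       # ket pair (sig, nu)
                    # The branch below runs for every candidate quartet,
                    # which is the book-keeping that hurts on a GPU.
                    if Q[mu][lam] * P_max[lam][sig] * Q[sig][nu] >= theta_int:
                        kept.append((mu, lam, sig, nu))
    return kept
```

Because the test is evaluated inside the innermost loop, threads that share a bra pair can diverge on it, which is the problem the pre-selection scheme is meant to avoid.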

Solution: Perform screening prior to integral evaluation by pre-selection: PreLinK

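$K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma)\, P_{\lambda\sigma}$

Schwarz: $(\mu\lambda|\nu\sigma) \le Q_{\mu\lambda} Q_{\nu\sigma} = \sqrt{(\mu\lambda|\mu\lambda)}\,\sqrt{(\nu\sigma|\nu\sigma)}$

PreLinK: $Q'_{\mu\nu} = \sum_{\lambda\sigma} Q_{\mu\lambda} Q_{\nu\sigma} |P_{\lambda\sigma}| \ge K_{\mu\nu} \;\longrightarrow\; \mathbf{Q}' = \mathbf{Q} \times |\mathbf{P}| \times \mathbf{Q}$

Determine significant elements of K from Q′!

[Kussmann/Ochsenfeld, JCP 138, 134114 (2013)]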
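The pre-selection idea can be sketched with plain nested lists: form Q′ = Q × |P| × Q with two matrix products and keep only the (µ, ν) pairs whose bound reaches a pre-selection threshold. The sketch assumes dense matrices and a function-level (not shell-level) Q; the names `prelink_bound`, `preselect`, and `exchange_matrix` are hypothetical, and the reference K build is only there so the bound can be checked on toy data.

```python
def matmul(A, B):
    """Dense matrix product for lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def prelink_bound(Q, P):
    """Q' = Q x |P| x Q, an upper bound on the exchange matrix elements."""
    abs_P = [[abs(x) for x in row] for row in P]
    return matmul(matmul(Q, abs_P), Q)

def preselect(Q_prime, theta_pre):
    """List the (mu, nu) pairs whose bound reaches the threshold."""
    return [(m, v) for m, row in enumerate(Q_prime)
            for v, bound in enumerate(row) if bound >= theta_pre]

def exchange_matrix(eri, P):
    """Reference K_mv = sum_ls (ml|vs) P_ls, for checking the bound only."""
    n = len(P)
    return [[sum(eri[m][l][v][s] * P[l][s] for l in range(n) for s in range(n))
             for v in range(n)] for m in range(n)]

if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1e-4]]
    P = [[1.0, 0.0], [0.0, 1.0]]
    print(preselect(prelink_bound(Q, P), 1e-3))
```

The point of the design is that the two matrix products run once per Fock build, before any integral kernel is launched, so the integral kernels themselves only ever see the pre-selected list and need no screening branches.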

A) PreLinK: Pre-Selection Threshold

[Figure panels: |P|; overestimation of K. System: 16 α-D-glucose units, HF/SVP]

Effect of pre-selection on final SCF energy

DNA fragment with 4 A-T base pairs, HF/SVP (ϑconv = 10⁻⁷, ϑint = 10⁻¹⁰).

Errors in µHartree.

Error always below convergence criterion

A) PreLinK: Timings

Linear alkanes, HF/SV, largest system: C₆₄₀H₁₂₈₂

1-4 x NVIDIA M2090 (older Fermi generation; Kepler is approx. 3 x faster)

B) Improving the Exchange: Reduced Local Memory

16 A-T base pairs, HF/SVP (ϑint = 10⁻¹⁰, ϑpre = 10⁻³, 1 x GTX Titan)

Resort to Rys-quadrature for larger total l-qn

C) Improving the Exchange: Reduced Shared Memory

Shared Memory per thread-block

Most suitable size: 8x8 thread-blocks, use shared memory for Kµν

Ex.: d-shells (l-qn = 2), 48 kB shared memory

36 Cartesian K_µν elements (6 x 6)

Memory per thread-block: 8 x 8 threads x 8 bytes (double) x 36 = 18.4 kB

Max. 2 thread-blocks per SMX, only 128 out of 192 cores

25 pure K_µν elements (5 x 5)

Memory per thread-block: 8 x 8 threads x 8 bytes (double) x 25 = 12.8 kB

Max. 3 thread-blocks per SMX, 192 out of 192 cores

Direct transformation to pure allows larger l-qn shells!

Ex.: 2 A-T base pairs, HF/TZVP

267 s (cart) vs 216 s (pure)

Significant impact: ca. 20% shorter runtime

Only ca. 7% of l-qn combinations affected
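The shared-memory arithmetic on this slide can be reproduced with a short Python sketch. It assumes that shared memory is the only resource limiting how many thread-blocks fit on one SMX, that "48 kB" means 48 x 1024 bytes, and that each thread keeps one core busy; the function names are hypothetical.

```python
SHARED_BYTES_PER_SMX = 48 * 1024   # 48 kB shared memory (assumed to mean KiB)
CORES_PER_SMX = 192                # cores per SMX, as quoted on the slide
THREADS_PER_BLOCK = 8 * 8          # 8x8 thread-blocks
BYTES_PER_DOUBLE = 8

def n_cartesian(l):
    """Number of Cartesian functions in a shell of angular momentum l."""
    return (l + 1) * (l + 2) // 2

def n_pure(l):
    """Number of pure (spherical) functions in a shell of angular momentum l."""
    return 2 * l + 1

def occupancy(n_k_elements):
    """Return (bytes per thread-block, resident blocks per SMX, busy cores)."""
    bytes_per_block = THREADS_PER_BLOCK * BYTES_PER_DOUBLE * n_k_elements
    blocks = SHARED_BYTES_PER_SMX // bytes_per_block
    busy_cores = min(CORES_PER_SMX, blocks * THREADS_PER_BLOCK)
    return bytes_per_block, blocks, busy_cores

if __name__ == "__main__":
    l = 2  # d-shells
    for label, n_func in (("cartesian", n_cartesian(l)), ("pure", n_pure(l))):
        print(label, occupancy(n_func * n_func))
```

For d-shell pairs this gives 18432 bytes, 2 blocks, and 128 busy cores in the Cartesian case, and 12800 bytes, 3 blocks, and 192 busy cores in the pure case, matching the numbers on the slide.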

Exemplary Calculations: Water-Cluster

SCF Fock-Build and Nuclear Gradient (4 x GTX Titan, PBE0/SVP, 75/302)

PreLinK for Gradients [Kussmann/Ochsenfeld, in preparation]

NMR-Shieldings @ GPU

Timings: Water-Clusters (4 x GTX Titan, PBE0/SVP, 75/302)

Algorithm

dJ/dB: Reuse SCF-kernels with l + 1, different post-processing

dK/dB: Special GPU-kernels

K [dP/dB]: 6 x SCF-kernels (skew symmetry)

CIS/RPA @ GPU

Timings: Water-Clusters (4 x GTX Titan, PBE/SVP, 75/302)

Hybrid MPI/CUDA Parallelization: SCF Calculations

HF/SVP (single Fock build, ϑint = 10⁻¹⁰, ϑpre = 10⁻³)

Systems: 16 A-T base pairs; (H₂O)₁₁₂₃

Hardware/Parallelization

Per Node: 12 CPU cores (Intel E5-2620 v2 @ 2.0 GHz), 4 GTX Titan

Primitive Load-balancing, Master-Slave work distribution

1 Gb Ethernet

Hybrid MPI/CUDA Parallelization: MutM@H2O

Post-HF @ GPUs

Challenge

Less favorable scaling: conventionally O(N⁵) at best (MP2)

Linear algebra, not integral evaluation, is rate-determining

Porting CPU algorithms yields only small speedups

Problem: DGEMM speedup is rather small (ca. 8 x)

Ansatz

Re-considering algorithms with GPUs in mind

First attempt: SOS-RI-MP2 [O(N⁴)]

[Jung/Shao/Head-Gordon, J. Comp. Chem. 12, 1953 (2007)]

Post-HF @ GPUs: SOS-RI-MP2

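$E_{\mathrm{OS}}^{\mathrm{RI\text{-}MP2}} = -\sum_{ijab} \sum_{RSR'S'} \frac{(ia|R)\,[\mathbf{J}^{-1}]_{RS}\,(S|jb)\,(ia|R')\,[\mathbf{J}^{-1}]_{R'S'}\,(S'|jb)}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}$

$J_{RS}$: two-center/two-electron integrals (aux. basis)

Laplace transform:

$E_{\mathrm{OS}}^{\mathrm{RI\text{-}AO\text{-}MP2}} = -\sum_{\alpha} \sum_{\mu\nu\lambda\sigma} \sum_{\mu'\nu'\lambda'\sigma'} \sum_{RSR'S'} P^{\mathrm{occ}}_{\mu\mu'} P^{\mathrm{virt}}_{\nu\nu'} P^{\mathrm{occ}}_{\lambda\lambda'} P^{\mathrm{virt}}_{\sigma\sigma'}\, (\mu\nu|R)\,[\mathbf{J}^{-1}]_{RS}\,(S|\lambda\sigma)\,(\mu'\nu'|R')\,[\mathbf{J}^{-1}]_{R'S'}\,(S'|\lambda'\sigma')$

Evaluation via intermediates:

$Z_{RS} = \sum_{\mu\nu\mu'\nu'} (R|\mu'\nu')\, P^{\mathrm{occ}}_{\mu\mu'} P^{\mathrm{virt}}_{\nu\nu'}\, (\mu\nu|S) = \sum_{\mu\nu} (R|\underline{\mu}\bar{\nu})\,(\mu\nu|S)$

Correlation energy: $E_{\mathrm{OS}}^{\mathrm{RI\text{-}AO\text{-}MP2}} = -\sum_{\alpha} \sum_{RS} \tilde{Z}_{RS}\, \tilde{Z}_{SR}$ with $\tilde{\mathbf{Z}} = \mathbf{Z}\mathbf{J}^{-1}$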

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
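The step from the Laplace-transformed expression to the compact form in terms of Z can be written out for a single Laplace point α. The short derivation below is a sketch that assumes symmetric pseudo-densities (so that Z is symmetric) and a symmetric J⁻¹; the tilde marks the intermediate Z J⁻¹, and the superscript (α) is a label introduced here for the contribution of one quadrature point.

```latex
\begin{align*}
E^{(\alpha)}
  &= -\sum_{RSR'S'} Z_{R'R}\,[\mathbf{J}^{-1}]_{RS}\; Z_{S'S}\,[\mathbf{J}^{-1}]_{R'S'}
     && \text{(group the $\mu\nu\mu'\nu'$ and $\lambda\sigma\lambda'\sigma'$ sums into $Z$)} \\
  &= -\sum_{R'S} (\mathbf{Z}\mathbf{J}^{-1})_{R'S}\,(\mathbf{J}^{-1}\mathbf{Z})_{R'S}
     && \text{(sum over $R$ and $S'$)} \\
  &= -\sum_{R'S} \tilde{Z}_{R'S}\,\tilde{Z}_{SR'}
   = -\mathrm{Tr}\big[\tilde{\mathbf{Z}}\tilde{\mathbf{Z}}\big],
     \qquad \tilde{\mathbf{Z}} = \mathbf{Z}\mathbf{J}^{-1}
     && \text{(use $\mathbf{Z}^{T} = \mathbf{Z}$, $\mathbf{J}^{-T} = \mathbf{J}^{-1}$)}
\end{align*}
```

Summing these contributions over all Laplace points α gives the correlation energy in the form quoted on the slide.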

Post-HF @ GPUs: SOS-RI-MP2 @ GPUs

Ansatz

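Use Cholesky factors of pseudo-densities & sparse algebra: O(N³)

Evaluate $Z_{RS}$ via J-engine on GPUs.

Algorithm

(1) Calculation of $(R|\mu\nu)$: O(N²)
(2) Calculation of $J_{RS} = (R|S)$: O(N²)
(3) Calculation of $\mathbf{J}^{-1}$: O(N³)
(4) Calculation of pseudo-densities: O(N³)
(5) Transformation of $(R|\mu\nu)$ to the pseudo-density-transformed $(R|\underline{\mu}\bar{\nu})$: O(N²)
(6) Contraction $\sum_{\mu\nu} (R|\underline{\mu}\bar{\nu})(\mu\nu|S)$ (on GPU): O(N³)
(7) Multiplication $\tilde{\mathbf{Z}} = \mathbf{Z}\mathbf{J}^{-1}$: O(N³)
(8) Contraction $\sum_{RS} \tilde{Z}_{RS}\tilde{Z}_{SR}$: O(N²)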

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
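Steps (5) to (8) of the algorithm can be traced with a small dense Python sketch. It assumes the three-index integrals are available as `B[R][m][n] = (R|mn)`, that J⁻¹ and one pair of pseudo-densities per Laplace point are already computed (steps 1 to 4), and that any quadrature weights are absorbed into the pseudo-densities, as the formulas on the slides suggest. All function names are hypothetical, and the dense loops deliberately ignore the sparsity and the GPU J-engine that the talk relies on.

```python
def matmul(A, B):
    """Dense matrix product for lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def transform_integrals(B, P_occ, P_virt):
    """Step (5): Bt[R][m][n] = sum_{m'n'} P_occ[m][m'] (R|m'n') P_virt[n][n']."""
    P_virt_T = transpose(P_virt)
    return [matmul(matmul(P_occ, B_R), P_virt_T) for B_R in B]

def build_z(Bt, B):
    """Step (6): Z[R][S] = sum_{mn} Bt[R][m][n] * (mn|S)."""
    return [[sum(x * y for row_t, row in zip(Bt_R, B_S) for x, y in zip(row_t, row))
             for B_S in B] for Bt_R in Bt]

def energy_term(Z, J_inv):
    """Steps (7) and (8): Zt = Z J^-1, then -sum_RS Zt[R][S] * Zt[S][R]."""
    Zt = matmul(Z, J_inv)
    n = len(Zt)
    return -sum(Zt[r][s] * Zt[s][r] for r in range(n) for s in range(n))

def sos_energy(B, J_inv, pseudo_densities):
    """Sum the contributions of all Laplace points (one (P_occ, P_virt) pair each)."""
    return sum(energy_term(build_z(transform_integrals(B, P_occ, P_virt), B), J_inv)
               for P_occ, P_virt in pseudo_densities)
```

In this form the only step that touches the four-index structure is the contraction in `build_z`, which has the same shape as a Coulomb-type contraction; that similarity is what makes it a candidate for evaluation with a J-engine.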

SOS-RI-MP2: J-engine@GPU

SOS-RI-MP2 @ GPU: Linear Alkanes

SOS-RI-MP2 @ GPU: DNA

Conclusions

Rethink algorithms, don’t simply transfer CPU-code

Coulomb: O(N²) J-engine, but small pre-factor

Efficient O(N) exchange evaluation on GPUs by PreLinK

Performance/Cost

(DNA16 @ HF/SVP, 1052 atoms, 11230 BF, 1 x Fock build)

Q-Chem @ 8 CPU cores: ∼ 30000 s (∼ 2000 €)

FermiONs++ @ 4 M2090: ∼ 2100 s (∼ 10000 €)

FermiONs++ @ 4 Titan: ∼ 500 s (∼ 8000 €)

∼ 60 x faster, 4 x more expensive
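The performance/cost comparison is simple arithmetic on the rounded figures above, and a short Python sketch makes the ratios explicit. The dictionary keys and the function name `compare` are hypothetical; times are in seconds and costs are in the currency unit given on the slide.

```python
# (wall time for one Fock build in s, approximate hardware cost) from the slide
RUNS = {
    "Q-Chem, 8 CPU cores": (30000.0, 2000.0),
    "FermiONs++, 4 x M2090": (2100.0, 10000.0),
    "FermiONs++, 4 x Titan": (500.0, 8000.0),
}

def compare(baseline, other, runs=RUNS):
    """Return (speedup, cost ratio) of `other` relative to `baseline`."""
    t_base, c_base = runs[baseline]
    t_other, c_other = runs[other]
    return t_base / t_other, c_other / c_base

if __name__ == "__main__":
    for name in ("FermiONs++, 4 x M2090", "FermiONs++, 4 x Titan"):
        speedup, cost_ratio = compare("Q-Chem, 8 CPU cores", name)
        print(f"{name}: {speedup:.1f} x faster, {cost_ratio:.1f} x more expensive")
```

For the 4 x Titan setup this reproduces the quoted 60 x speedup at 4 x the cost, which corresponds to roughly 15 x more throughput per unit of hardware cost on these rounded numbers.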

Fine-grained data-arrangement

strong-scaling parallelization

FermiONs++: Release 2014

Acknowledgement

◮ Prof. Dr. C. Ochsenfeld

◮ Dr. Simon Maurer

◮ Group

Thank you for your attention...
