
Ab Initio Quantum Chemistry on Graphics Processing Units

Rethinking Algorithms for Massively Parallel Architectures

Jörg Kussmann

Theoretical Chemistry, University of Munich (LMU)

23rd May 2014

J. Kussmann Quantum Chemistry@GPU

Outline

Introduction

Challenges of Ab Initio Quantum Chemistry

Optimizing SCF-Algorithms @ GPUs

Data-Arrangement

Coulomb-, Exchange-, XC-Potential

Exchange Potential: GPU-specific optimization

Exemplary Calculations: SCF & Properties

Hybrid MPI/CUDA Parallelization

Outlook: Post-HF Algorithms @ GPUs

Challenge

SOS-MP2 @ GPUs


PART 1: Ab Initio Methods

Schrödinger equation:

$\hat{H}\Psi = i\hbar\,\dot{\Psi} \;\xrightarrow{\text{stat.}}\; \hat{H}\Psi = E\Psi$

Molecular properties:

Energetics/Geometries

Vibrational frequencies

Electric properties

Magnetic properties

Dynamic properties

Conventional methods:

Systems with up to a few 100 atoms at the HF or KS-DFT level (min. $O(M^3)$)

Aim: Reduce scaling to $O(M)$!









Computational Effort: SCF Calculations

Roothaan-Hall: $FC = SC\epsilon$

$F_{\mu\nu} = h^{\text{core}}_{\mu\nu} + J_{\mu\nu}[P] - (1 - a)\,K_{\mu\nu}[P] + V^{\text{XC}}_{\mu\nu}[a, P]$

a = 0: HF

0 < a < 1: hybrid DFT

a = 1: KS-DFT

Rate-determining steps:

1) Fock build: $O(N^2) \longrightarrow O(N)$

2) Diagonalization ($F \longrightarrow C$): $O(N^3) \longrightarrow O(N)$

[Kussmann/Beer/Ochsenfeld, WIREs Comput Mol Sci 3, 614 (2013)]
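The role of the mixing parameter a in the Fock matrix can be illustrated with a minimal sketch. It assumes the matrices h_core, J, K and V_xc have already been computed and are passed as plain nested lists; the function name `build_fock` and the toy numbers are hypothetical, and V_xc is taken as given for the chosen a.

```python
def build_fock(h_core, J, K, V_xc, a):
    """Assemble F = h_core + J - (1 - a) * K + V_xc element by element.

    a = 0 keeps the full exchange matrix K, a = 1 removes it entirely.
    V_xc is supplied by the caller and must match the same value of a.
    """
    n = len(h_core)
    return [
        [h_core[m][v] + J[m][v] - (1.0 - a) * K[m][v] + V_xc[m][v]
         for v in range(n)]
        for m in range(n)
    ]


# Toy 2x2 matrices (illustrative values only, not from a real calculation).
h_core = [[1.0, 0.0], [0.0, 1.0]]
J = [[0.5, 0.1], [0.1, 0.5]]
K = [[0.2, 0.05], [0.05, 0.2]]
zero = [[0.0, 0.0], [0.0, 0.0]]

F_full_exchange = build_fock(h_core, J, K, zero, a=0.0)
F_no_exchange = build_fock(h_core, J, K, zero, a=1.0)
```

In this sketch the exchange contribution is simply scaled by (1 - a), so the same routine covers all three cases listed above.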

Example: 16 A-T base pairs

HF/SVP ($\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{conv}} = 10^{-7}$)

1052 atoms, 11230 basis functions

3 078 087 function pairs

$9.5 \times 10^{12}$ primitive 2-e$^-$ integrals

O(N) Fock build (8 cores): 30 000 s

(19 SCF iterations for tight convergence)


Moore’s Law: 1965-2010

Embrace new technologies: GPUs




Implementation of GPU-algorithms

Automatic code generation

All double precision, support for higher angular-momentum quantum numbers (l-qn)

Coulomb

McMurchie-Davidson based J-engine

Pre/Post-processing on CPU

Ignore bra/ket symmetry (2 x integrals)

Exchange

McMurchie-Davidson

Evaluate complete integral on GPU

Exploit only 1 permutational symmetry (4 x integrals)

1 thread / 1 prim. integral: fine-grained data arrangement

[Ufimtsev/Martinez, JCTC 4, 222 (2008)]


Coulomb Potential


Exchange Potential


Implementation of GPU-algorithms


Coulomb very fast, try to improve on exchange first...

A) Reduce scaling to linear

B) Reduce local memory effort

C) Reduce shared memory effort


A) PreLinK: O(N) Exact Exchange on GPUs

Problem: O(N) algorithms employ loads of book-keeping, branching, and communication

Loop: bra l-quantum number combination
  Loop: ket l-quantum number combination
    Loop: bra shell-pairs µ, λ
      Determine sig. (µλ|σν) quartets: $Q_{\mu\lambda}\, P^{\max}_{\lambda\sigma}\, Q_{\sigma\nu} \ge \vartheta_{\text{int}}$ (+ permutations)
      Loop: ket shell-pairs σ, ν
        Evaluate: $K_{\mu\nu}$, $K_{\mu\sigma}$, $K_{\lambda\nu}$, $K_{\lambda\sigma}$
      End Loop
    End Loop
  End Loop
End Loop

Screening within inner loop





Solution: Perform screening prior to integral evaluation by pre-selection: PreLinK

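$K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma)\, P_{\lambda\sigma}$

Schwarz: $(\mu\lambda|\nu\sigma) \le Q_{\mu\lambda} Q_{\nu\sigma} = \sqrt{(\mu\lambda|\mu\lambda)}\,\sqrt{(\nu\sigma|\nu\sigma)}$

PreLinK: $Q'_{\mu\nu} = \sum_{\lambda\sigma} Q_{\mu\lambda}\, Q_{\nu\sigma}\, |P_{\lambda\sigma}| \ge K_{\mu\nu} \;\longrightarrow\; Q' = Q \times |P| \times Q$

Determine significant elements of K from $Q'$!

[Kussmann/Ochsenfeld, JCP 138, 134114 (2013)]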
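The pre-selection step can be sketched in a few lines. This is a minimal dense, pure-Python illustration, assuming Q (Schwarz estimates) and P (density matrix) are small symmetric nested lists; the function names and the toy data are hypothetical, and a production code would use sparse matrix algebra to keep this step linear scaling.

```python
def matmul(A, B):
    """Plain dense matrix product for nested lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(A))]


def prelink_preselect(Q, P, theta_pre):
    """Return Q' = Q |P| Q and the (mu, nu) pairs with Q'[mu][nu] >= theta_pre.

    Q' is an upper bound to the exchange matrix elements, so pairs below the
    threshold can be discarded before any integral is evaluated.
    """
    abs_P = [[abs(x) for x in row] for row in P]
    Q_prime = matmul(matmul(Q, abs_P), Q)
    significant = [(mu, nu)
                   for mu in range(len(Q_prime))
                   for nu in range(len(Q_prime))
                   if Q_prime[mu][nu] >= theta_pre]
    return Q_prime, significant


# Toy example: three shells in a chain, only neighbours overlap.
Q = [[1.0, 0.1, 0.0],
     [0.1, 1.0, 0.1],
     [0.0, 0.1, 1.0]]
P = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
Q_prime, significant = prelink_preselect(Q, P, theta_pre=0.05)
```

With these toy values the coupling between the two end shells falls below the threshold, so that pair never reaches the integral kernels.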


A) PreLinK: Pre-Selection Threshold

|P| Overestimation of K

16 α-D-glucose units, HF/SVP


A) PreLinK: Pre-Selection Threshold

Effect of pre-selection on final SCF energy

DNA fragment with 4 A-T base pairs, HF/SVP ($\vartheta_{\text{conv}} = 10^{-7}$, $\vartheta_{\text{int}} = 10^{-10}$).

Errors in µHartree.

Error always below convergence criterion


A) PreLinK: Timings

Linear alkanes, HF/SV, max.: C$_{640}$H$_{1282}$

1-4 × NVIDIA M2090 (old generation; Kepler: approx. 3× faster)


B) Improving the Exchange: Reduced Local Memory

16 A-T base pairs, HF/SVP ($\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{pre}} = 10^{-3}$, 1 × GTX Titan)

Resort to Rys-quadrature for larger total l-qn


C) Improving the Exchange: Reduced Shared Memory

Shared Memory per thread-block

Most suitable size: 8 × 8 thread-blocks, use shared memory for $K_{\mu\nu}$

Ex.: d-shells (l-qn = 2), 48 kB shared memory

36 Cartesian $K_{\mu\nu}$ elements

Memory per thread-block: 8 × 8 threads × 8 B (double) × 36 = 18.4 kB

Max. 2 thread-blocks per SMX, only 128 out of 192 cores

25 pure $K_{\mu\nu}$ elements

Memory per thread-block: 8 × 8 threads × 8 B (double) × 25 = 12.8 kB

Max. 3 thread-blocks per SMX, 192 out of 192 cores

Direct transformation to pure allows larger l-qn shells!

Ex.: 2 A-T base pairs, HF/TZVP

267 s (cart) vs 216 s (pure)

Significant impact: 20% speedup

Only ca. 7% of l-qn combinations affected
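The shared-memory arithmetic for the d-shell example can be reproduced with a short sketch. It assumes 48 kB means 48 × 1024 bytes, one thread per core as on the slide, and that shared memory is the only limiting resource (register pressure and other occupancy limits are ignored); the function names are hypothetical.

```python
def n_cartesian(l):
    """Number of Cartesian functions in a shell with angular momentum l."""
    return (l + 1) * (l + 2) // 2


def n_pure(l):
    """Number of pure (spherical) functions in a shell with angular momentum l."""
    return 2 * l + 1


def shared_memory_occupancy(n_elements, threads_per_block=8 * 8,
                            bytes_per_value=8, shared_bytes=48 * 1024,
                            cores_per_smx=192):
    """Return (bytes per thread-block, resident blocks per SMX, busy cores)."""
    bytes_per_block = threads_per_block * bytes_per_value * n_elements
    blocks = shared_bytes // bytes_per_block
    busy_cores = min(blocks * threads_per_block, cores_per_smx)
    return bytes_per_block, blocks, busy_cores


l = 2  # d-shells
cartesian = shared_memory_occupancy(n_cartesian(l) ** 2)  # 6 x 6 = 36 elements
pure = shared_memory_occupancy(n_pure(l) ** 2)            # 5 x 5 = 25 elements
```

Under these assumptions the Cartesian case fits two thread-blocks (128 busy cores) and the pure case three (192 busy cores), matching the numbers on the slide.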


Exemplary Calculations: Water-Cluster

SCF Fock-Build and Nuclear Gradient (4 x GTX Titan, PBE0/SVP, 75/302)

PreLinK for Gradients [Kussmann/Ochsenfeld, in preparation]


NMR-Shieldings @ GPU

Timings: Water-Clusters (4 x GTX Titan, PBE0/SVP, 75/302)

Algorithm

dJ/dB: Reuse SCF-kernels with l + 1, different post-processing

dK/dB: Special GPU-kernels

K [dP/dB]: 6 x SCF-kernels (skew symmetry)


CIS/RPA @ GPU

Timings: Water-Clusters (4 x GTX Titan, PBE/SVP, 75/302)


Hybrid MPI/CUDA Parallelization: SCF Calculations

HF/SVP (single Fock build, $\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{pre}} = 10^{-3}$)

Systems: 16 A-T base pairs; (H$_2$O)$_{1123}$

Hardware/Parallelization

Per Node: 12 CPU cores (Intel E5-2620 v2 @ 2.0 GHz), 4 GTX Titan

Primitive Load-balancing, Master-Slave work distribution

1 Gb Ethernet
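The slide gives no detail on the work distribution, so the following is only a generic sketch of a simple master-worker scheme: a master hands the next batch to whichever worker becomes idle first. The function name, the largest-first ordering and the cost estimates are all assumptions for illustration, not the scheme actually used in the talk.

```python
import heapq


def distribute_batches(batch_costs, n_workers):
    """Simulate dynamic work distribution by a master process.

    Batches are handed out largest-first; each batch goes to the worker that
    becomes idle earliest. Returns the batch ids per worker and the total
    busy time of every worker.
    """
    # Heap entries are (time at which the worker becomes idle, worker id).
    idle_heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(idle_heap)
    assignment = {w: [] for w in range(n_workers)}

    ordered = sorted(enumerate(batch_costs), key=lambda item: -item[1])
    for batch_id, cost in ordered:
        idle_at, worker = heapq.heappop(idle_heap)
        assignment[worker].append(batch_id)
        heapq.heappush(idle_heap, (idle_at + cost, worker))

    loads = [0.0] * n_workers
    for idle_at, worker in idle_heap:
        loads[worker] = idle_at
    return assignment, loads


# Hypothetical cost estimates for six integral batches on two workers.
assignment, loads = distribute_batches([5, 3, 3, 2, 2, 1], n_workers=2)
```

In a real MPI/CUDA code the idle time would be discovered at run time through messages from the workers rather than simulated from cost estimates.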


Hybrid MPI/CUDA Parallelization: SCF Calculations

HF/SVP

Systems: 16 A-T base pairs; (H$_2$O)$_{1123}$


Hybrid MPI/CUDA Parallelization: MutM@H2O


Post-HF @ GPUs

Challenge

Less favorable scaling: conventionally $O(N^5)$ at best (MP2)

Not integral evaluation, but linear algebra is rate-determining

Porting CPU algorithms yields only small speedups

Problem: DGEMM speedup is rather small (ca. 8×)

Ansatz

Re-considering algorithms with GPUs in mind

First attempt: SOS-RI-MP2 [$O(N^4)$]

[Jung/Shao/Head-Gordon, J. Comp. Chem. 12, 1953 (2007)]


Post-HF @ GPUs: SOS-RI-MP2

$E^{\text{RI-MP2}}_{\text{OS}} = -\sum_{ijab} \sum_{RSR'S'} \frac{(ia|R)\,[J^{-1}]_{RS}\,(S|jb)\;(ia|R')\,[J^{-1}]_{R'S'}\,(S'|jb)}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}$

$J_{RS}$: two-center/two-electron integrals (aux. basis)

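Laplace-Transform:

$E^{\text{RI-AO-MP2}}_{\text{OS}} = -\sum_{\alpha} \sum_{\substack{\mu\nu\lambda\sigma \\ \mu'\nu'\lambda'\sigma'}} \sum_{RSR'S'} P^{\text{occ}}_{\mu\mu'}\, P^{\text{virt}}_{\nu\nu'}\, P^{\text{occ}}_{\lambda\lambda'}\, P^{\text{virt}}_{\sigma\sigma'}\; (\mu\nu|R)\,[J^{-1}]_{RS}\,(S|\lambda\sigma)\;(\mu'\nu'|R')\,[J^{-1}]_{R'S'}\,(S'|\lambda'\sigma').$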
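The step from the orbital-energy denominator to a sum over α rests on the standard Laplace identity for a positive denominator, evaluated by numerical quadrature. The symbols $t_\alpha$ and $w_\alpha$ (quadrature points and weights) are assumptions introduced here; how they are absorbed into the pseudo-densities is not shown on the slide.

```latex
\frac{1}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}
  = \int_0^\infty e^{-(\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j)\,t}\,\mathrm{d}t
  \approx \sum_{\alpha} w_\alpha\, e^{-(\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j)\,t_\alpha}
```

Because the exponential factorizes into one factor per orbital, the sums over i, j, a, b can be folded into occupied and virtual pseudo-density matrices for each quadrature point.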

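Evaluation via Intermediates:

$Z_{RS} = \sum_{\mu\nu\mu'\nu'} (R|\mu'\nu')\, P^{\text{occ}}_{\mu\mu'}\, P^{\text{virt}}_{\nu\nu'}\, (\mu\nu|S) = \sum_{\mu\nu} (R|\underline{\mu\nu})\,(\mu\nu|S)$

(underline: three-center integrals transformed with the pseudo-densities)

Correlation Energy: $E^{\text{RI-AO-MP2}}_{\text{OS}} = -\sum_{\alpha} \sum_{RS} \bar{Z}_{RS}\, \bar{Z}_{SR}$ with $\bar{Z} = Z J^{-1}$

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]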
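The final energy contraction can be sketched as follows. This is a minimal dense illustration, assuming one Z matrix per Laplace point α and a precomputed inverse J^{-1}, both as nested lists; the function names and the toy numbers are hypothetical. Since the slide's formula carries no explicit quadrature weight, the weights are assumed to be absorbed into Z.

```python
def matmul(A, B):
    """Plain dense matrix product for nested lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(A))]


def energy_term(Z, J_inv):
    """Contribution of one Laplace point: -sum_RS Zbar_RS * Zbar_SR, Zbar = Z J^-1."""
    Z_bar = matmul(Z, J_inv)
    n = len(Z_bar)
    return -sum(Z_bar[r][s] * Z_bar[s][r] for r in range(n) for s in range(n))


def os_correlation_energy(Z_per_point, J_inv):
    """Sum the contributions of all Laplace points alpha."""
    return sum(energy_term(Z, J_inv) for Z in Z_per_point)


# Toy data: two auxiliary functions, a diagonal J^-1, two Laplace points.
J_inv = [[0.5, 0.0], [0.0, 0.5]]
Z_alpha = [[1.0, 2.0], [3.0, 4.0]]
e_single = energy_term(Z_alpha, J_inv)
e_total = os_correlation_energy([Z_alpha, Z_alpha], J_inv)
```

The double sum is the trace of the product of Z̄ with itself, which is why this last step is cheap compared with forming Z.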


Post-HF @ GPUs: SOS-RI-MP2 @ GPUs

Ansatz

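Use Cholesky factors of pseudo-densities & sparse algebra → $O(N^3)$

Evaluate $Z_{RS}$ via J-engine on GPUs.

Algorithm

(1) Calculation of $(R|\mu\nu)$: $O(N^2)$

(2) Calculation of $J_{RS} = (R|S)$: $O(N^2)$

(3) Calculation of $J^{-1}$: $O(N^3)$

(4) Calculation of pseudo-densities: $O(N^3)$

(5) Transformation of $(R|\mu\nu)$ to the pseudo-density-transformed $(R|\underline{\mu\nu})$: $O(N^2)$

(6) Contraction $Z_{RS} = \sum_{\mu\nu} (R|\underline{\mu\nu})\,(\mu\nu|S)$ (@ GPU): $O(N^3)$

(7) Multiplication $\bar{Z} = Z J^{-1}$: $O(N^3)$

(8) Contraction $\sum_{RS} \bar{Z}_{RS}\, \bar{Z}_{SR}$: $O(N^2)$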

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]


SOS-RI-MP2: J-engine@GPU


SOS-RI-MP2 @ GPU: Linear Alkanes


SOS-RI-MP2 @ GPU: DNA


Conclusions

Rethink algorithms, don’t simply transfer CPU-code

Coulomb: $O(N^2)$ J-engine, but small pre-factor

Efficient O(N) exchange evaluation on GPUs by PreLinK

Performance/Cost

(DNA16 @ HF/SVP, 1052 atoms, 11230 BF, 1 x Fock)

Q-Chem @ 8 CPU cores: ∼ 30000 s (∼ 2000 €)

FermiONs++ @ 4 M2090: ∼ 2100 s (∼ 10000 €)

FermiONs++ @ 4 Titan: ∼ 500 s (∼ 8000 €)

∼ 60× faster, 4× more expensive

Fine-grained data-arrangement

strong-scaling parallelization

FermiONs++: Release 2014
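The performance/cost summary above follows from simple ratios of the quoted timings and hardware prices. The sketch below only restates that arithmetic; the dictionary layout and function name are hypothetical, and the prices are read as approximate hardware costs in euros.

```python
# (wall time in s for one Fock build, approximate hardware cost) from the slide.
runs = {
    "Q-Chem @ 8 CPU cores": (30000.0, 2000.0),
    "FermiONs++ @ 4 M2090": (2100.0, 10000.0),
    "FermiONs++ @ 4 Titan": (500.0, 8000.0),
}


def compare(baseline, candidate):
    """Return (speedup, cost ratio) of candidate relative to baseline."""
    t_base, cost_base = runs[baseline]
    t_cand, cost_cand = runs[candidate]
    return t_base / t_cand, cost_cand / cost_base


speedup_titan, cost_titan = compare("Q-Chem @ 8 CPU cores", "FermiONs++ @ 4 Titan")
speedup_m2090, cost_m2090 = compare("Q-Chem @ 8 CPU cores", "FermiONs++ @ 4 M2090")
```

The quoted "60× faster, 4× more expensive" corresponds to the Titan row, which amounts to roughly 15 times more throughput per unit of hardware cost.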


Acknowledgement

◮ Prof. Dr. C. Ochsenfeld

◮ Dr. Simon Maurer

◮ Group

Thank you for your attention...

