Ab Initio Quantum Chemistry on Graphics Processing Units
Rethinking Algorithms for Massively Parallel Architectures
Jörg Kussmann
Theoretical Chemistry, University of Munich (LMU)
23rd May 2014
J. Kussmann Quantum Chemistry@GPU
Outline
Introduction
Challenges of Ab Initio Quantum Chemistry
Optimizing SCF-Algorithms @ GPUs
Data-Arrangement
Coulomb-, Exchange-, XC-Potential
Exchange Potential: GPU-specific optimization
Exemplary Calculations: SCF & Properties
Hybrid MPI/CUDA Parallelization
Outlook: Post-HF Algorithms @ GPUs
Challenge
SOS-MP2 @ GPUs
PART 1: Ab Initio Methods
Schrödinger equation:
$\hat{H}\Psi = i\hbar\,\dot{\Psi} \;\xrightarrow{\text{stat.}}\; \hat{H}\Psi = E\Psi$
Molecular properties:
Energetics/Geometries
Vibrational frequencies
Electric properties
Magnetic properties
Dynamic properties
Conventional methods:
Systems with up to a few 100 atoms at the HF or KS-DFT level (min. $O(M^3)$)
Aim: Reduce scaling to $O(M)$!
Computational Effort: SCF Calculations
Roothaan-Hall: $FC = SC\epsilon$
$F_{\mu\nu} = h^{\text{core}}_{\mu\nu} + J_{\mu\nu}[P] - (1 - a)\,K_{\mu\nu}[P] + V^{XC}_{\mu\nu}[a, P]$
$a = 0$: HF
$0 < a < 1$: hybrid DFT
$a = 1$: KS-DFT
Rate-determining steps:
1) Fock build: $O(N^2) \rightarrow O(N)$
2) Diagonalization, $F \rightarrow C$: $O(N^3) \rightarrow O(N)$
[Kussmann/Beer/Ochsenfeld, WIREs Comput Mol Sci 3, 614 (2013)]
Example: 16 A-T base pairs
HF/SVP ($\vartheta_{int} = 10^{-10}$, $\vartheta_{conv} = 10^{-7}$)
1052 atoms, 11230 basis functions
3 078 087 function pairs
$9.5 \times 10^{12}$ primitive two-electron integrals
$O(N)$ Fock build (8 cores): 30 000 s (19 SCF iterations for tight convergence)
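The role of the mixing parameter $a$ in the Fock matrix above can be illustrated with a minimal sketch. The function name `fock_matrix` and the 2x2 toy matrices are hypothetical and contain made-up numbers, not real integrals; the exchange-correlation matrix is simply passed in already evaluated for the chosen $a$, and it is assumed to vanish for HF.

```python
def fock_matrix(h_core, J, K, V_xc, a):
    """Assemble F = h_core + J - (1 - a) * K + V_xc element by element."""
    n = len(h_core)
    return [
        [h_core[m][v] + J[m][v] - (1.0 - a) * K[m][v] + V_xc[m][v]
         for v in range(n)]
        for m in range(n)
    ]

# Toy 2x2 matrices with made-up values (illustration only).
h_core = [[-1.0, -0.2], [-0.2, -0.8]]
J = [[0.6, 0.1], [0.1, 0.5]]
K = [[0.3, 0.05], [0.05, 0.2]]
V_xc = [[-0.25, -0.04], [-0.04, -0.2]]
zero = [[0.0, 0.0], [0.0, 0.0]]

F_hf = fock_matrix(h_core, J, K, zero, a=0.0)       # HF: full exact exchange
F_ks = fock_matrix(h_core, J, K, V_xc, a=1.0)       # KS-DFT: no exact exchange
F_hybrid = fock_matrix(h_core, J, K, V_xc, a=0.75)  # hybrid: 25% exact exchange
```

With this convention the exact-exchange matrix K is only needed for $a < 1$, which is why the exchange build matters for HF and hybrid functionals but not for pure KS-DFT.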
Moore’s Law: 1965-2010
Embrace new technologies: GPUs
Implementation of GPU-algorithms
Automatic code generation
All double-precision, higher l-qn support
Coulomb
McMurchie-Davidson based J-engine
Pre/Post-processing on CPU
Ignore bra/ket symmetry (2 x integrals)
Exchange
McMurchie-Davidson
Evaluate complete integral on GPU
Exploit only 1 permutational symmetry (4 x integrals)
1 thread / 1 prim. integral: fine-grained data arrangement
[Ufimtsev/Martinez, JCTC 4, 222 (2008)]
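The factors "2 x" and "4 x" follow from counting how many quartets must be evaluated when part of the eight-fold permutational symmetry of the two-electron integrals is dropped. The sketch below is a rough count over individual basis functions rather than shells; `quartet_counts` is a hypothetical helper, and which single symmetry the exchange kernels keep is an assumption made here for illustration.

```python
def quartet_counts(n):
    """Count integral quartets (mu nu|lambda sigma) over n basis functions."""
    pairs = n * (n + 1) // 2                  # unique pairs with mu <= nu
    full_symmetry = pairs * (pairs + 1) // 2  # all 8-fold symmetry exploited
    no_braket = pairs * pairs                 # bra/ket exchange ignored
    one_symmetry = pairs * n * n              # assumption: only bra-pair symmetry kept
    return full_symmetry, no_braket, one_symmetry

full, coulomb, exchange = quartet_counts(200)
coulomb_ratio = coulomb / full    # approaches 2 for large n
exchange_ratio = exchange / full  # approaches 4 for large n
print(f"Coulomb: {coulomb_ratio:.2f} x, exchange: {exchange_ratio:.2f} x")
```

The redundant work buys a regular data layout in which every thread handles one primitive integral without symmetry-related book-keeping.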
Coulomb Potential
Exchange Potential
Implementation of GPU-algorithms
Coulomb is already very fast, so try to improve the exchange first:
A) Reduce scaling to linear
B) Reduce local memory effort
C) Reduce shared memory effort
A) PreLinK: O(N) Exact Exchange on GPUs
Problem: O(N) algorithms employ loads of book-keeping, branching, and communication
Loop: bra l-quantum number combination
  Loop: ket l-quantum number combination
    Loop: bra shell-pairs µ, λ
      Determine sig. (µλ|σν) quartets: $Q_{\mu\lambda} P^{\max}_{\lambda\sigma} Q_{\sigma\nu} \ge \vartheta_{int}$ (+ permutations)
      Loop: ket shell-pairs σ, ν
        Evaluate: $K_{\mu\nu}$, $K_{\mu\sigma}$, $K_{\lambda\nu}$, $K_{\lambda\sigma}$
      End Loop
    End Loop
  End Loop
End Loop
Screening within inner loop
Solution: Perform screening prior to integral evaluation by pre-selection: PreLinK
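$K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma)\, P_{\lambda\sigma}$
Schwarz: $(\mu\lambda|\nu\sigma) \le Q_{\mu\lambda} Q_{\nu\sigma} = \sqrt{(\mu\lambda|\mu\lambda)}\,\sqrt{(\nu\sigma|\nu\sigma)}$
PreLinK: $Q'_{\mu\nu} = \sum_{\lambda\sigma} Q_{\mu\lambda} Q_{\nu\sigma} |P_{\lambda\sigma}| \ge K_{\mu\nu}$
$\longrightarrow Q' = Q \times |P| \times Q$
Determine significant elements of K from $Q'$!
[Kussmann/Ochsenfeld, JCP 138, 134114 (2013)]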
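The pre-selection can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation in the cited paper. The "integrals" are a synthetic model, $(\mu\lambda|\nu\sigma) = \sum_k B_{\mu\lambda,k} B_{\nu\sigma,k}$, chosen only because it satisfies the Schwarz inequality by construction. The names `matmul`, `prelink_estimate`, `significant_pairs` and `eri`, as well as the threshold value, are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def prelink_estimate(Q, P):
    """Q' = Q x |P| x Q^T, which equals Q x |P| x Q for symmetric Q."""
    abs_P = [[abs(x) for x in row] for row in P]
    Q_T = [list(col) for col in zip(*Q)]
    return matmul(matmul(Q, abs_P), Q_T)

def significant_pairs(Q_prime, threshold):
    """Pre-select the (mu, nu) pairs whose estimate reaches the threshold."""
    n = len(Q_prime)
    return [(m, v) for m in range(n) for v in range(n) if Q_prime[m][v] >= threshold]

# Synthetic model integrals that decay with the "distance" |mu - lambda|.
random.seed(0)
n, n_aux = 4, 3
B = [[None] * n for _ in range(n)]
for m in range(n):
    for l in range(m, n):
        decay = math.exp(-2.0 * (l - m))
        B[m][l] = B[l][m] = [decay * random.uniform(-1.0, 1.0) for _ in range(n_aux)]

def eri(m, l, v, s):
    """Model integral (m l|v s); not a real two-electron integral."""
    return sum(x * y for x, y in zip(B[m][l], B[v][s]))

Q = [[math.sqrt(eri(m, l, m, l)) for l in range(n)] for m in range(n)]
R = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
P = [[0.5 * (R[i][j] + R[j][i]) for j in range(n)] for i in range(n)]

# Exact exchange matrix of the model, used only to check the bound.
K = [[sum(eri(m, l, v, s) * P[l][s] for l in range(n) for s in range(n))
      for v in range(n)] for m in range(n)]

Q_prime = prelink_estimate(Q, P)
selected = significant_pairs(Q_prime, threshold=1e-3)
```

Because $Q'$ is obtained from two matrix multiplications before any integral is evaluated, the list `selected` can be built up front, which keeps the screening decisions out of the inner integral loops.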
A) PreLinK: Pre-Selection Threshold
|P| Overestimation of K
16 α-D-glucose units, HF/SVP
A) PreLinK: Pre-Selection Threshold
Effect of pre-selection on final SCF energy
DNA fragment with 4 A-T base pairs, HF/SVP ($\vartheta_{conv} = 10^{-7}$, $\vartheta_{int} = 10^{-10}$).
Errors in µHartree.
Error always below convergence criterion
A) PreLinK: Timings
Linear alkanes, HF/SV, max.: C$_{640}$H$_{1282}$
1-4 x NVIDIA M2090 (older GPU generation; Kepler is approx. 3 x faster)
B) Improving the Exchange: Reduced Local Memory
16 A-T base pairs, HF/SVP ($\vartheta_{int} = 10^{-10}$, $\vartheta_{pre} = 10^{-3}$, 1 x GTX Titan)
Resort to Rys-quadrature for larger total l-qn
C) Improving the Exchange: Reduced Shared Memory
Shared Memory per thread-block
Most suitable size: 8x8 thread-blocks, use shared memory for Kµν
Ex.: d-shells (l-qn = 2), 48 kB shared memory
36 Cartesian Kµν elements
Memory per thread-block: 8 x 8 threads x 8 bytes (double) x 36 = 18.4 kB
Max. 2 thread-blocks per SMX, only 128 out of 192 cores
25 pure Kµν elements
Memory per thread-block: 8 x 8 threads x 8 bytes (double) x 25 = 12.8 kB
Max. 3 thread-blocks per SMX, 192 out of 192 cores
Direct transformation to pure allows larger l-qn shells!
Ex.: 2 A-T base pairs, HF/TZVP
267 s (cart) vs 216 s (pure)
Significant impact: 20% speedup
Only ca. 7% of l-qn combinations affected
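The shared-memory arithmetic for the d-shell example can be reproduced with a short sketch. The function `shared_memory_occupancy` is hypothetical; it assumes 48 kB means 49152 bytes, that every thread keeps one full Kµν block in shared memory, and that one resident thread keeps one core busy. Other hardware limits (registers, maximum resident blocks) are ignored.

```python
def shared_memory_occupancy(l, pure, block_dim=(8, 8), shared_bytes=48 * 1024,
                            cores_per_smx=192, bytes_per_double=8):
    """Return (bytes per thread-block, resident blocks, busy cores) for l-qn l."""
    # Functions per shell: 2l+1 pure, (l+1)(l+2)/2 Cartesian.
    n_funcs = 2 * l + 1 if pure else (l + 1) * (l + 2) // 2
    k_elements = n_funcs * n_funcs           # K block for a pair of such shells
    threads = block_dim[0] * block_dim[1]
    bytes_per_block = threads * bytes_per_double * k_elements
    blocks = shared_bytes // bytes_per_block  # limited by shared memory only
    busy_cores = min(cores_per_smx, blocks * threads)
    return bytes_per_block, blocks, busy_cores

cartesian_d = shared_memory_occupancy(2, pure=False)
pure_d = shared_memory_occupancy(2, pure=True)
print("Cartesian d-shells:", cartesian_d)
print("Pure d-shells:     ", pure_d)
```

Under these assumptions the Cartesian case fits two thread-blocks (128 threads) and the pure case three (192 threads), matching the numbers on the slide.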
Exemplary Calculations: Water-Cluster
SCF Fock-Build and Nuclear Gradient (4 x GTX Titan, PBE0/SVP, 75/302)
PreLinK for Gradients [Kussmann/Ochsenfeld, in preparation]
NMR-Shieldings @ GPU
Timings: Water-Clusters (4 x GTX Titan, PBE0/SVP, 75/302)
Algorithm
dJ/dB: Reuse SCF-kernels with l + 1, different post-processing
dK/dB: Special GPU-kernels
K [dP/dB]: 6 x SCF-kernels (skew symmetry)
CIS/RPA @ GPU
Timings: Water-Clusters (4 x GTX Titan, PBE/SVP, 75/302)
Hybrid MPI/CUDA Parallelization: SCF Calculations
HF/SVP (single Fock build, $\vartheta_{int} = 10^{-10}$, $\vartheta_{pre} = 10^{-3}$)
Systems: 16 A-T base pairs, (H2O)$_{1123}$
Hardware/Parallelization
Per Node: 12 CPU cores (Intel E5-2620 v2 @ 2.0 GHz), 4 GTX Titan
Primitive Load-balancing, Master-Slave work distribution
1 Gb Ethernet
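A simple master/worker scheme of the kind named above can be simulated as follows. This is only a sketch of the general idea, assuming the master hands the next work batch to whichever worker becomes idle first; the slides do not specify the actual distribution logic, and `simulate_master_worker` and the batch costs are hypothetical.

```python
import heapq

def simulate_master_worker(batch_costs, n_workers):
    """Greedy dispatch: the next batch goes to the worker that is free first."""
    idle_at = [(0.0, w) for w in range(n_workers)]  # (time the worker is free, id)
    heapq.heapify(idle_at)
    busy = [0.0] * n_workers
    for cost in batch_costs:
        t, w = heapq.heappop(idle_at)   # earliest idle worker asks for work
        busy[w] += cost
        heapq.heappush(idle_at, (t + cost, w))
    makespan = max(t for t, _ in idle_at)
    return makespan, busy

# Hypothetical batch costs in arbitrary time units, largest batches first.
makespan, busy = simulate_master_worker([5, 4, 3, 3, 2, 2, 1], n_workers=2)
print(makespan, busy)
```

Dispatching large batches first and small ones last is a common way to keep such a primitive scheme balanced, since the small batches fill the gaps at the end.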
Hybrid MPI/CUDA Parallelization: MutM@H2O
Post-HF @ GPUs
Challenge
Less favorable scaling, conv. $O(N^5)$ at best (MP2)
Not integral evaluation, but linear algebra rate-determining
Porting CPU-algorithms shows small speedups only
Problem: DGEMM-speedup is rather small (ca. x 8)
Ansatz
Re-considering algorithms with GPUs in mind
First attempt: SOS-RI-MP2 [$O(N^4)$]
[Jung/Shao/Head-Gordon, J. Comp. Chem. 12, 1953 (2007)]
Post-HF @ GPUs: SOS-RI-MP2
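$E_{OS}^{RI-MP2} = -\sum_{ijab} \frac{\sum_{RSR'S'} (ia|R)\,[J^{-1}]_{RS}\,(S|jb)\,(ia|R')\,[J^{-1}]_{R'S'}\,(S'|jb)}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}$
$J_{RS}$: two-center/two-electron integrals (aux. basis)
Laplace transform:
$E_{OS}^{RI-AO-MP2} = -\sum_{\alpha} \sum_{\substack{\mu\nu\lambda\sigma \\ \mu'\nu'\lambda'\sigma'}} \sum_{RSR'S'} P^{occ}_{\mu\mu'} P^{virt}_{\nu\nu'} P^{occ}_{\lambda\lambda'} P^{virt}_{\sigma\sigma'}\, (\mu\nu|R)\,[J^{-1}]_{RS}\,(S|\lambda\sigma)\,(\mu'\nu'|R')\,[J^{-1}]_{R'S'}\,(S'|\lambda'\sigma')$
Evaluation via intermediates:
$Z_{RS} = \sum_{\mu\nu\mu'\nu'} (R|\mu'\nu')\, P^{occ}_{\mu\mu'} P^{virt}_{\nu\nu'}\, (\mu\nu|S) = \sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})\,(\mu\nu|S)$
Correlation energy:
$E_{OS}^{RI-AO-MP2} = -\sum_{\alpha} \sum_{RS} \tilde{Z}_{RS} \tilde{Z}_{SR}$ with $\tilde{Z} = Z J^{-1}$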
[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
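That the intermediate-based expression reproduces the full contraction can be checked numerically on a tiny model. The sketch below uses random symmetric matrices in place of real integrals and pseudo-densities, treats a single Laplace point (the pseudo-densities normally depend on the Laplace point, and any quadrature weight is assumed to be absorbed in them), and supplies a symmetric stand-in for $J^{-1}$ directly. All names are hypothetical.

```python
import random
from itertools import product

random.seed(1)
n_ao, n_aux = 2, 2
ao, aux = range(n_ao), range(n_aux)

def random_symmetric(n):
    """Random symmetric n x n matrix as a list of rows."""
    M = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[0.5 * (M[i][j] + M[j][i]) for j in range(n)] for i in range(n)]

B = [random_symmetric(n_ao) for _ in aux]  # B[R][mu][nu] stands in for (R|mu nu)
P_occ, P_virt = random_symmetric(n_ao), random_symmetric(n_ao)
J_inv = random_symmetric(n_aux)            # stand-in for the inverse metric

def energy_direct():
    """Full contraction over all AO and auxiliary indices."""
    total = 0.0
    for m, v, l, s, m2, v2, l2, s2 in product(ao, repeat=8):
        dens = P_occ[m][m2] * P_virt[v][v2] * P_occ[l][l2] * P_virt[s][s2]
        for R, S, R2, S2 in product(aux, repeat=4):
            total += (dens * B[R][m][v] * J_inv[R][S] * B[S][l][s]
                      * B[R2][m2][v2] * J_inv[R2][S2] * B[S2][l2][s2])
    return -total

def energy_via_z():
    """Same quantity via Z and Z~ = Z J^-1."""
    # Transformed integrals: sum over mu', nu' with both pseudo-densities.
    Bt = [[[sum(B[R][m2][v2] * P_occ[m][m2] * P_virt[v][v2] for m2 in ao for v2 in ao)
            for v in ao] for m in ao] for R in aux]
    Z = [[sum(Bt[R][m][v] * B[S][m][v] for m in ao for v in ao) for S in aux]
         for R in aux]
    Zt = [[sum(Z[R][T] * J_inv[T][S] for T in aux) for S in aux] for R in aux]
    return -sum(Zt[R][S] * Zt[S][R] for R in aux for S in aux)

e_direct, e_z = energy_direct(), energy_via_z()
print(e_direct, e_z)
```

The two routes agree to rounding error, while the second needs only a transformation of the three-center quantities and small matrix products in the auxiliary basis.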
Post-HF @ GPUs: SOS-RI-MP2 @ GPUs
Ansatz
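Use Cholesky factors of pseudo-densities & sparse algebra: $O(N^3)$
Evaluate $Z_{RS}$ via J-engine on GPUs.
Algorithm
(1) Calculation of $(R|\mu\nu)$: $O(N^2)$
(2) Calculation of $J_{RS} = (R|S)$: $O(N^2)$
(3) Calculation of $J^{-1}$: $O(N^3)$
(4) Calculation of pseudo-densities: $O(N^3)$
(5) Transformation of $(R|\mu\nu)$ to the pseudo-density-transformed $(R|\underline{\mu}\overline{\nu})$: $O(N^2)$
(6) Contraction $Z_{RS} = \sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})(\mu\nu|S)$ (@ GPU): $O(N^3)$
(7) Multiplication $\tilde{Z} = Z J^{-1}$: $O(N^3)$
(8) Contraction $\sum_{RS} \tilde{Z}_{RS} \tilde{Z}_{SR}$: $O(N^2)$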
[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
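A pseudo-density is positive semidefinite and rank-deficient, so a pivoted (incomplete) Cholesky decomposition yields a tall, narrow factor. The following is a generic textbook sketch of that decomposition on a made-up 4x4 matrix of rank 2. It is not taken from the cited work, and `pivoted_cholesky` and the test matrix are hypothetical.

```python
def pivoted_cholesky(P, tol=1e-10):
    """Return L (n x rank) with P ~= L L^T for symmetric positive semidefinite P."""
    n = len(P)
    diag = [P[i][i] for i in range(n)]  # diagonal of the remaining residual
    cols = []                           # Cholesky vectors found so far
    while len(cols) < n:
        p = max(range(n), key=lambda i: diag[i])  # pivot on largest residual
        if diag[p] <= tol:
            break                                 # residual is numerically zero
        pivot = diag[p] ** 0.5
        col = [(P[i][p] - sum(c[i] * c[p] for c in cols)) / pivot for i in range(n)]
        for i in range(n):
            diag[i] -= col[i] ** 2
        cols.append(col)
    return [[c[i] for c in cols] for i in range(n)]

# Made-up rank-2 example: P = C C^T with a 4 x 2 coefficient matrix C.
C = [[1.0, 0.0], [0.5, 1.0], [0.2, 0.3], [0.0, 0.1]]
P = [[sum(a * b for a, b in zip(ri, rj)) for rj in C] for ri in C]

L = pivoted_cholesky(P)
P_rebuilt = [[sum(a * b for a, b in zip(ri, rj)) for rj in L] for ri in L]
print("rank:", len(L[0]))
```

Working with factors of this shape reduces the dimension of the transformation in step (5) from the number of basis functions to the numerical rank of the pseudo-density.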
SOS-RI-MP2: J-engine@GPU
SOS-RI-MP2 @ GPU: Linear Alkanes
SOS-RI-MP2 @ GPU: DNA
Conclusions
Rethink algorithms, don’t simply transfer CPU-code
Coulomb: $O(N^2)$ J-engine, but small pre-factor
Efficient O(N) exchange evaluation on GPUs by PreLinK
Performance/Cost
(DNA16 @ HF/SVP, 1052 atoms, 11230 BF, 1 x Fock)
Q-Chem @ 8 CPU-cores: ∼ 30000 s (∼ 2000 €)
FermiONs++ @ 4 M2090: ∼ 2100 s (∼ 10000 €)
FermiONs++ @ 4 Titan: ∼ 500 s (∼ 8000 €)
∼ 60 x faster, 4 x more expensive
Fine-grained data arrangement enables strong-scaling parallelization
FermiONs++: Release 2014
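The performance/cost comparison can be recomputed from the approximate timings and hardware costs quoted above. In this sketch the last value, speedup divided by cost factor, is a derived figure added here for illustration and does not appear on the slide; the dictionary layout and names are hypothetical.

```python
# (seconds per Fock build, approximate hardware cost) as quoted on the slide.
systems = {
    "Q-Chem @ 8 CPU cores": (30000.0, 2000.0),
    "FermiONs++ @ 4 M2090": (2100.0, 10000.0),
    "FermiONs++ @ 4 Titan": (500.0, 8000.0),
}

ref_time, ref_cost = systems["Q-Chem @ 8 CPU cores"]
summary = {}
for name, (seconds, cost) in systems.items():
    speedup = ref_time / seconds   # relative to the CPU reference
    cost_factor = cost / ref_cost  # relative hardware cost
    summary[name] = (speedup, cost_factor, speedup / cost_factor)

for name, (speedup, cost_factor, per_cost) in summary.items():
    print(f"{name}: {speedup:.1f} x faster, {cost_factor:.1f} x cost, "
          f"{per_cost:.1f} x speedup per cost")
```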
Acknowledgement
• Prof. Dr. C. Ochsenfeld
• Dr. Simon Maurer
• Group
Thank you for your attention...