Ab Initio Quantum Chemistry on Graphics Processing Units

Rethinking Algorithms for Massively Parallel Architectures

Jörg Kussmann

Theoretical Chemistry, University of Munich (LMU)

23rd May 2014

J. Kussmann Quantum Chemistry@GPU

Outline

Introduction

Challenges of Ab Initio Quantum Chemistry

Optimizing SCF-Algorithms @ GPUs

Data-Arrangement

Coulomb-, Exchange-, XC-Potential

Exchange Potential: GPU-specific optimization

Exemplary Calculations: SCF & Properties

Hybrid MPI/CUDA Parallelization

Outlook: Post-HF Algorithms @ GPUs

Challenge

SOS-MP2 @ GPUs

PART 1: Ab Initio Methods

Schrödinger equation:

$\hat{H}\Psi = i\hbar\,\dot{\Psi} \;\xrightarrow{\text{stat.}}\; \hat{H}\Psi = E\Psi$

Molecular properties:

Energetics/Geometries

Vibrational frequencies

Electric properties

Magnetic properties

Dynamic properties

Conventional methods:

Systems with up to a few 100 atoms on HF- or KS-DFT level (min. O(M^3))

Aim: Reduce scaling to O(M)!

Computational Effort: SCF Calculations

Roothaan-Hall: FC = SCε

$F_{\mu\nu} = h^{\text{core}}_{\mu\nu} + J_{\mu\nu}[P] - (1 - a) K_{\mu\nu}[P] + V^{\text{XC}}_{\mu\nu}[a, P]$

Rate-determining steps:

1) Fock-Build: O(N^2) → O(N)

2) Diagonalization: F → C, O(N^3) → O(N)

[Kussmann/Beer/Ochsenfeld, WIREs Comput Mol Sci 3, 614 (2013)]

a = 0: HF

0 < a < 1: hybrid-DFT

a = 1: KS-DFT
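A minimal Python sketch of how the Fock matrix terms above combine for a given mixing parameter a. The function name `build_fock` and all numerical values are hypothetical toy inputs; in a real program J, K and V_xc come from the integral and quadrature code.

```python
def build_fock(h_core, J, K, V_xc, a):
    """Assemble F = h_core + J - (1 - a) * K + V_xc for nested-list matrices.

    Convention as on the slide: a = 0 is HF (full exact exchange),
    a = 1 is pure KS-DFT (no exact exchange).
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    n = len(h_core)
    return [
        [h_core[i][j] + J[i][j] - (1.0 - a) * K[i][j] + V_xc[i][j]
         for j in range(n)]
        for i in range(n)
    ]


# Toy 2x2 matrices (made-up numbers, for illustration only).
h = [[-1.0, -0.2], [-0.2, -0.8]]
J = [[0.6, 0.1], [0.1, 0.5]]
K = [[0.3, 0.05], [0.05, 0.25]]
zero = [[0.0, 0.0], [0.0, 0.0]]
V_xc = [[-0.25, -0.04], [-0.04, -0.2]]

F_hf = build_fock(h, J, K, zero, a=0.0)   # HF: exchange fully included
F_dft = build_fock(h, J, K, V_xc, a=1.0)  # KS-DFT: exchange replaced by V_xc
```

Only the exchange term depends on a explicitly, which is why the exchange build can be optimized separately from the Coulomb build.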

Example: 16 A-T base pairs

HF/SVP ($\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{conv}} = 10^{-7}$)

1052 atoms, 11230 basis functions

3 078 087 function pairs

$9.5 \times 10^{12}$ primitive two-electron integrals

O(N) Fock-Build (8 cores): 30 000 s

(19 SCF-iterations for tight convergence)

Moore’s Law: 1965-2010

Embrace new technologies: GPUs

Implementation of GPU-algorithms

Automatic code generation

All double-precision, higher l-qn support

Coulomb

McMurchie-Davidson based J-engine

Pre/Post-processing on CPU

Ignore bra/ket symmetry (2 x integrals)

Exchange

McMurchie-Davidson

Evaluate complete integral on GPU

Exploit only 1 permutational symmetry (4 x integrals)

1 thread / 1 prim. integral: fine-grained data arrangement

[Ufimtsev/Martinez, JCTC 4, 222 (2008)]

Coulomb Potential

Exchange Potential

Coulomb very fast, try to improve on exchange first...

A) Reduce scaling to linear

B) Reduce local memory effort

C) Reduce shared memory effort

A) PreLinK: O(N) Exact Exchange on GPUs

Problem: O(N) algorithms employ loads of book-keeping, branching, communication

Loop: bra l-quantum number combination

Loop: ket l-quantum number combination

Loop: bra shell-pairs µ, λ

Determine sig. (µλ|σν) quartets:

$Q_{\mu\lambda} P^{\max}_{\lambda\sigma} Q_{\sigma\nu} \ge \vartheta_{\text{int}}$ + permutations

Loop: ket shell-pairs σ, ν

Evaluate: Kµν, Kµσ, Kλν, Kλσ

End Loop

Screening within inner loop
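To illustrate why screening inside the innermost loop is awkward on a GPU, the loop structure above can be sketched in Python for single basis functions instead of shell pairs. This is a simplified illustration, not the production algorithm: `count_significant_quartets` is a hypothetical name, |P| stands in for the shell-block maximum P^max, and the permutations are ignored.

```python
def count_significant_quartets(Q, P, threshold):
    """Count quartets (mu lam | sig nu) that survive the in-loop screening.

    A quartet is kept if Q[mu][lam] * |P[lam][sig]| * Q[sig][nu] >= threshold.
    Returns (kept, total).
    """
    n = len(Q)
    kept = 0
    total = 0
    for mu in range(n):
        for lam in range(n):          # bra pair (mu, lam)
            for sig in range(n):
                for nu in range(n):   # ket pair (sig, nu)
                    total += 1
                    estimate = Q[mu][lam] * abs(P[lam][sig]) * Q[sig][nu]
                    if estimate >= threshold:
                        # A real code would evaluate the integral here,
                        # so every quartet carries a data-dependent branch.
                        kept += 1
    return kept, total


identity = [[1.0, 0.0], [0.0, 1.0]]
kept, total = count_significant_quartets(identity, identity, 0.5)
```

The test is executed once per quartet, so threads working on neighbouring quartets can take different branches; this is the divergence that a pre-selection step avoids.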

Solution: Perform screening prior to integral evaluation by pre-selection: PreLinK

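$K_{\mu\nu} = \sum_{\lambda\sigma} (\mu\lambda|\nu\sigma) P_{\lambda\sigma}$

Schwarz: $(\mu\lambda|\nu\sigma) \le Q_{\mu\lambda} Q_{\nu\sigma} = \sqrt{(\mu\lambda|\mu\lambda)} \sqrt{(\nu\sigma|\nu\sigma)}$

PreLinK: $Q'_{\mu\nu} = \sum_{\lambda\sigma} Q_{\mu\lambda} Q_{\nu\sigma} |P_{\lambda\sigma}| \ge K_{\mu\nu}$

$\longrightarrow Q' = Q \times |P| \times Q$

Determine significant elements of K from Q′

[Kussmann/Ochsenfeld, JCP 138, 134114 (2013)]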
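The pre-selection can be sketched in a few lines of Python, assuming dense nested-list matrices and a symmetric Q. The function names and the toy matrices are hypothetical; a real implementation would use sparse algebra and the threshold $\vartheta_{\text{pre}}$ quoted on later slides.

```python
def matmul(A, B):
    """Plain dense matrix product for nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def prelink_significant_pairs(Q, P, threshold):
    """Return Q' = Q |P| Q and all (mu, nu) with Q'[mu][nu] >= threshold.

    Q holds the Schwarz factors Q[mu][lam] = sqrt((mu lam|mu lam)) and is
    assumed symmetric, so Q |P| Q^T equals Q |P| Q.
    """
    abs_P = [[abs(x) for x in row] for row in P]
    Q_prime = matmul(matmul(Q, abs_P), Q)
    pairs = [
        (mu, nu)
        for mu, row in enumerate(Q_prime)
        for nu, value in enumerate(row)
        if value >= threshold
    ]
    return Q_prime, pairs


# Toy "chain" of three functions: only neighbours overlap (made-up numbers).
Q = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.1], [0.0, 0.1, 1.0]]
P = [[1.0, -0.2, 0.0], [-0.2, 1.0, -0.2], [0.0, -0.2, 1.0]]
Q_prime, pairs = prelink_significant_pairs(Q, P, threshold=0.1)
```

With these inputs the element coupling the two chain ends is estimated at about 0.05 and is dropped, while all neighbouring and diagonal pairs are kept. The selection needs only two matrix products and happens before any integral kernel is launched.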

A) PreLinK: Pre-Selection Threshold

|P| Overestimation of K

16 α-D-glucose units, HF/SVP

A) PreLinK: Pre-Selection Threshold

Effect of pre-selection on final SCF energy

DNA-fragment with 4 A-T base-pairs, HF/SVP

($\vartheta_{\text{conv}} = 10^{-7}$, $\vartheta_{\text{int}} = 10^{-10}$).

Errors in µHartree.

Error always below convergence criterion

A) PreLinK: Timings

Linear alkanes, HF/SV, max.: C640H1282

1-4 x NVIDIA M2090 (older Fermi generation; Kepler: approx. 3 x faster)

B) Improving the Exchange: Reduced Local Memory

16 A-T base pairs, HF/SVP ($\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{pre}} = 10^{-3}$, 1 x GTX Titan)

Resort to Rys-quadrature for larger total l-qn

C) Improving the Exchange: Reduced Shared Memory

Shared Memory per thread-block

Most suitable size: 8x8 thread-blocks, use shared memory for Kµν

Ex.: d-shells (l-qn = 2), 48 kB shared memory

36 cartesian Kµν elements

Memory per thread-block: 8 x 8 x 8 (double) x 36 = 18.4 kB

Max. 2 thread-blocks per SMX, only 128 out of 192 cores

25 pure Kµν elements

Memory per thread-block: 8 x 8 x 8 (double) x 25 = 12.8 kB

Max. 3 thread-blocks per SMX, 192 out of 192 cores

Direct transformation to pure allows larger l-qn shells!

Ex.: 2 A-T base pairs, HF/TZVP

267 s (cart) vs 216 s (pure)

Significant impact: 20% speedup

Only ca. 7% of l-qn combinations affected
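The shared-memory arithmetic for the d-shell example can be checked with a short Python sketch. It assumes that shared memory is the only limiting resource (register pressure and other limits are ignored) and that 48 kB means 48 x 1024 bytes; the function name `smx_occupancy` is hypothetical.

```python
def smx_occupancy(n_k_elements, threads_per_block=8 * 8, bytes_per_value=8,
                  shared_bytes=48 * 1024, cores_per_smx=192):
    """Estimate thread-blocks per SMX when each thread keeps n_k_elements
    double-precision K values in shared memory.

    Returns (bytes_per_block, blocks_per_smx, busy_cores).
    """
    bytes_per_block = threads_per_block * bytes_per_value * n_k_elements
    blocks = shared_bytes // bytes_per_block
    busy_cores = min(blocks * threads_per_block, cores_per_smx)
    return bytes_per_block, blocks, busy_cores


cartesian = smx_occupancy(6 * 6)  # 6 cartesian d-functions per shell
pure = smx_occupancy(5 * 5)       # 5 pure (spherical) d-functions per shell
```

For cartesian d-shells this gives 18432 bytes per block, 2 blocks and 128 busy cores; for pure d-shells 12800 bytes, 3 blocks and all 192 cores, matching the numbers above.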

Exemplary Calculations: Water-Cluster

SCF Fock-Build and Nuclear Gradient (4 x GTX Titan, PBE0/SVP, 75/302)

PreLinK for Gradients [Kussmann/Ochsenfeld, in preparation]

NMR-Shieldings @ GPU

Timings: Water-Clusters (4 x GTX Titan, PBE0/SVP, 75/302)

Algorithm

dJ/dB: Reuse SCF-kernels with l + 1, different post-processing

dK/dB: Special GPU-kernels

K [dP/dB]: 6 x SCF-kernels (skew symmetry)

CIS/RPA @ GPU

Timings: Water-Clusters (4 x GTX Titan, PBE/SVP, 75/302)

Hybrid MPI/CUDA Parallelization: SCF Calculations

HF/SVP (Single Fock-build, $\vartheta_{\text{int}} = 10^{-10}$, $\vartheta_{\text{pre}} = 10^{-3}$)

16 A-T base pairs (H2O)1123

Hardware/Parallelization

Per Node: 12 CPU cores (Intel E5-2620 v2 @ 2.0 GHz), 4 GTX Titan

Primitive Load-balancing, Master-Slave work distribution

1 Gb Ethernet

Hybrid MPI/CUDA Parallelization: SCF Calculations

HF/SVP

16 A-T base pairs (H2O)1123

Hybrid MPI/CUDA Parallelization: MutM@H2O

Post-HF @ GPUs

Challenge

Less favorable scaling, conv. O(N^5) at best (MP2)

Not integral evaluation, but linear algebra rate-determining

Porting CPU-algorithms shows small speedups only

Problem: DGEMM-speedup is rather small (ca. x 8)

Ansatz

Re-considering algorithms with GPUs in mind

First attempt: SOS-RI-MP2 [O(N^4)]

[Jung/Shao/Head-Gordon, J. Comp. Chem. 12, 1953 (2007)]

Post-HF @ GPUs: SOS-RI-MP2

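$E^{\text{OS}}_{\text{RI-MP2}} = -\sum_{iajb} \sum_{RSR'S'} \frac{(ia|R)\,[J^{-1}]_{RS}\,(S|jb)\,(ia|R')\,[J^{-1}]_{R'S'}\,(S'|jb)}{\epsilon_a + \epsilon_b - \epsilon_i - \epsilon_j}$

$J_{RS}$: two-center/two-electron integrals (aux. basis)

Laplace-Transform:

$E^{\text{OS}}_{\text{RI-AO-MP2}} = -\sum_{\mu\nu\lambda\sigma} \sum_{\mu'\nu'\lambda'\sigma'} \sum_{RSR'S'} P^{\text{occ}}_{\mu\mu'} P^{\text{virt}}_{\nu\nu'} P^{\text{occ}}_{\lambda\lambda'} P^{\text{virt}}_{\sigma\sigma'} (\mu\nu|R)\,[J^{-1}]_{RS}\,(S|\lambda\sigma)\,(\mu'\nu'|R')\,[J^{-1}]_{R'S'}\,(S'|\lambda'\sigma')$

Evaluation via Intermediates:

$Z_{RS} = \sum_{\mu\nu\mu'\nu'} (R|\mu'\nu')\,P^{\text{occ}}_{\mu\mu'} P^{\text{virt}}_{\nu\nu'}\,(\mu\nu|S) = \sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})\,(\mu\nu|S)$

Correlation Energy:

$E^{\text{OS}}_{\text{RI-AO-MP2}} = -\sum_{RS} \tilde{Z}_{RS} \tilde{Z}_{SR}$ with $\tilde{Z} = Z J^{-1}$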

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]
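Once the intermediate Z is available, the final contraction over the auxiliary indices is a small linear-algebra step. The Python sketch below uses dense nested lists and made-up 2x2 inputs; like the slide formula it carries no Laplace quadrature weights or scaling factors. The names `matmul` and `sos_energy_from_Z` are hypothetical.

```python
def matmul(A, B):
    """Plain dense matrix product for nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def sos_energy_from_Z(Z, J_inv):
    """Return -sum_RS Zt[R][S] * Zt[S][R] with Zt = Z * J_inv."""
    Zt = matmul(Z, J_inv)
    n = len(Zt)
    return -sum(Zt[R][S] * Zt[S][R] for R in range(n) for S in range(n))


# Toy auxiliary-basis matrices (illustrative numbers only).
Z = [[2.0, 1.0], [1.0, 3.0]]
J_inv = [[0.5, 0.0], [0.0, 0.5]]
energy = sos_energy_from_Z(Z, J_inv)
```

The double sum equals the trace of the squared matrix Z J^-1, so the cost of this step is set by the matrix product rather than by the contraction itself.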

Post-HF @ GPUs: SOS-RI-MP2 @ GPUs

Ansatz

Use Cholesky-factors of pseudo-densities & sparse algebra

Evaluate ZRS via J-engine on GPUs.

Algorithm

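(1) Calculation of $(R|\mu\nu)$: O(N^2)

(2) Calculation of $J_{RS} = (R|S)$: O(N^2)

(3) Calculation of $J^{-1}$: O(N^3)

(4) Calculation of pseudo-densities: O(N^3)

(5) Transformation of $(R|\mu\nu)$ to $(R|\underline{\mu}\overline{\nu})$: O(N^2)

(6) Contraction $\sum_{\mu\nu} (R|\underline{\mu}\overline{\nu})(\mu\nu|S)$ (@ GPU): O(N^3)

(7) Multiplication $Z J^{-1}$: O(N^3)

(8) Contraction $\sum_{RS} \tilde{Z}_{RS} \tilde{Z}_{SR}$: O(N^2)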

[Maurer/Kussmann/Ochsenfeld, submitted (2014)]

SOS-RI-MP2: J-engine@GPU

SOS-RI-MP2 @ GPU: Linear Alkanes

SOS-RI-MP2 @ GPU: DNA

Conclusions

Rethink algorithms, don’t simply transfer CPU-code

Coulomb: O(N^2) J-engine, but small pre-factor

Efficient O(N) exchange evaluation on GPUs by PreLinK

Performance/Cost

(DNA16 @ HF/SVP, 1052 atoms, 11230 BF, 1 x Fock)

Q-Chem @ 8 CPU-cores: ∼ 30000 s (∼ 2000 €)

FermiONs++ @ 4 M2090: ∼ 2100 s (∼ 10000 €)

FermiONs++ @ 4 Titan: ∼ 500 s (∼ 8000 €)

∼ 60 x faster, 4 x more expensive

Fine-grained data-arrangement

strong-scaling parallelization

FermiONs++: Release 2014

Acknowledgement

◮ Prof. Dr. C. Ochsenfeld

◮ Dr. Simon Maurer

◮ Group

Thank you for your attention...
