Introducing ONETEP
Part II - Efficient implementation of a parallel code
Chris-Kriton Skylaris
Peter D. Haynes
Arash A. Mostofi
Mike C. Payne
Electronic Structure Discussion Group
Theory of Condensed Matter, Cavendish Laboratory
Cambridge, 19 May 2004
Chris-Kriton Skylaris. ESDG 19/5/2004. TCM group, Cavendish Laboratory, University of Cambridge. 1
ONETEP: A density-matrix linear-scaling DFT method
Non-orthogonal Generalised Wannier Functions (NGWFs) vs molecular orbitals (MOs)
PSINC basis set (equivalent to plane waves) for the NGWFs
[Figure: simulation cell containing a localised NGWF sphere, labelled α]
DFT is always computationally demanding – the ONETEP O(N) scheme should take full advantage of parallel computers
ONETEP two-nested-loop CG optimisation scheme
Parallel implementation using the Message Passing Interface (MPI) paradigm – each processor runs its own copy of the program with its own data
Parallelisation of data: Distribution of atoms (and NGWFs) to processors according to a space-filling curve
[Figure: atom distribution without SF curve vs with SF curve]
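To illustrate the idea, here is a minimal sketch (in Python, for illustration only; the function names and the choice of a Morton/Z-order curve are assumptions, not the ONETEP implementation) of assigning atoms to processors along a space-filling curve, so that each processor holds a spatially compact group of atoms:

```python
import numpy as np

def morton_key(ix, iy, iz, bits=10):
    # Interleave the bits of integer grid coordinates to obtain a
    # Z-order (space-filling curve) key: nearby keys => nearby atoms.
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key

def distribute_atoms(coords, cell, nproc, bits=10):
    # Quantise fractional coordinates onto a 2^bits grid, sort atoms
    # along the curve, then cut the sorted list into nproc contiguous
    # chunks so each processor gets a spatially compact set of atoms.
    frac = (coords @ np.linalg.inv(cell)) % 1.0
    grid = (frac * (1 << bits)).astype(int)
    keys = [morton_key(*g) for g in grid]
    order = np.argsort(keys)
    return np.array_split(order, nproc)
```

Atoms that are close in space receive similar keys, so matrix elements between their NGWFs cluster into dense blocks near the diagonal, as the sparsity-pattern figure below shows.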
Effect on sparsity pattern of Hamiltonian matrix
[Figure: 4650 × 4650 Hamiltonian matrix, without SF curve vs with SF curve]
“Small” sparse matrices! BRC4-RAD51 complex (3000 atoms):
• Atomic orbital “DZP” basis: 27500 × 27500; at 1% filling → 58 MB!
• ONETEP NGWFs: 7600 × 7600; at 1% filling → 4.4 MB!
Can scale up to 100 processors (~5000 atoms) without data-parallel matrices!
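The quoted memory figures follow from simple arithmetic (double-precision values only, ignoring index-array overhead):

```python
def sparse_mib(n, fill, bytes_per=8):
    # Memory (MiB) for the nonzero values of an n x n matrix stored at
    # the given fill fraction with 8-byte (double precision) entries.
    return n * n * fill * bytes_per / 2**20

print(round(sparse_mib(27500, 0.01)))     # DZP basis  -> 58
print(round(sparse_mib(7600, 0.01), 1))   # NGWFs      -> 4.4
```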
PPD & FFTbox representation of functions
φα in PPDs ⇄ φα in FFTbox (deposit PPDs / extract PPDs)
• QM operators
• Sums, products, interpolation
• Compact storage in 1D arrays
• Fast communication
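A minimal sketch of the deposit/extract idea (the array layout and function names here are hypothetical, chosen only to illustrate moving a function between compact 1D PPD storage and a dense FFTbox):

```python
import numpy as np

def deposit_ppds(values, ppd_origins, ppd_shape, box_shape):
    # Unpack a function stored as a 1D array of PPD (parallelepiped)
    # blocks into a dense FFTbox at the given block origins.
    box = np.zeros(box_shape)
    npts = int(np.prod(ppd_shape))
    for i, (ox, oy, oz) in enumerate(ppd_origins):
        block = values[i * npts:(i + 1) * npts].reshape(ppd_shape)
        box[ox:ox + ppd_shape[0],
            oy:oy + ppd_shape[1],
            oz:oz + ppd_shape[2]] = block
    return box

def extract_ppds(box, ppd_origins, ppd_shape):
    # Inverse operation: gather the PPD blocks from the FFTbox back
    # into compact 1D storage (cheap to store and communicate).
    blocks = [box[ox:ox + ppd_shape[0],
                  oy:oy + ppd_shape[1],
                  oz:oz + ppd_shape[2]].ravel()
              for ox, oy, oz in ppd_origins]
    return np.concatenate(blocks)
```

Depositing then extracting over the same origins is an exact round trip, which is what allows operators to act in the FFTbox while storage and communication stay in the compact PPD form.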
Quantum mechanical operators delocalise to the whole volume of the FFTbox. Computationally expensive!
• T̂ (kinetic energy)
• V̂ local
• V̂ non-local
• Interpolation to 2Gmax PSINC basis (fine grid)
φβ → Ôφβ
Calculation of Ôφβ for each φα is linear-scaling but not ideal (large prefactor)!
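As an example of an operator applied in the FFTbox, the kinetic operator amounts to a forward FFT, multiplication by k²/2 on the reciprocal-space grid, and an inverse FFT (a self-contained sketch, not the ONETEP routine):

```python
import numpy as np

def apply_kinetic(phi_box, lengths):
    # Apply T = -(1/2) del^2 to a function held on an FFTbox grid:
    # forward FFT, multiply by k^2/2, inverse FFT.
    shape = phi_box.shape
    ks = [2 * np.pi * np.fft.fftfreq(n, d=L / n)
          for n, L in zip(shape, lengths)]
    kx, ky, kz = np.meshgrid(*ks, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    return np.real(np.fft.ifftn(0.5 * k2 * np.fft.fftn(phi_box)))
```

Applied to a single plane wave cos(2πx/L), this returns the same function scaled by its kinetic eigenvalue (2π/L)²/2, which is a quick way to check such a routine.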
There is a way to reduce the prefactor: Enlarge FFTbox!
Smallest FFTbox ensuring:
• Hermiticity
• Uniform representation for each operator
• But: φβ needs to be re-centred for every single φα which overlaps with it!
Same advantages plus:
• φβ does not need to be re-centred. Ôφβ needs to be calculated only once!
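A toy cost model (hypothetical numbers) makes the prefactor saving concrete: with the minimal box, Ôφβ must be recomputed for every overlapping φα; with the enlarged shared box it is computed once per β.

```python
def operator_applications(n_funcs, avg_overlaps, shared_box):
    # Each application of O to a phi_beta costs a pair of FFTs.
    # Minimal box: one application per overlapping (alpha, beta) pair.
    # Enlarged shared box: one application per beta, full stop.
    return n_funcs if shared_box else n_funcs * avg_overlaps

print(operator_applications(1000, 30, shared_box=False))  # 30000
print(operator_applications(1000, 30, shared_box=True))   # 1000
```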
Efficient calculation of integrals <φα|Ô|φβ>
φα is never placed in an FFTbox!
• Apply Ô to φβ in its FFTbox
• Extract only the PPDs in common with φα
• <φα|Ô|φβ> = dot product of (small) 1D arrays (ddot)
Keep the Ôφβ FFTbox in memory and re-use it for <φβ|Ô|φβ>, <φγ|Ô|φβ>, etc.
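A sketch of the extraction-plus-ddot step (the flat-index bookkeeping here is a hypothetical stand-in for the PPD machinery):

```python
import numpy as np

def integral_via_ppds(phi_a_vals, phi_a_index, O_phi_b_box, weight):
    # phi_a_vals:  phi_alpha packed as a 1D array over its own PPD points.
    # phi_a_index: flat indices of those points inside the FFTbox grid.
    # Extract from the (O phi_beta) FFTbox only the points phi_alpha
    # covers, then one dot product (a BLAS ddot) scaled by the volume
    # element per grid point.  phi_alpha itself never enters an FFTbox.
    extracted = O_phi_b_box.ravel()[phi_a_index]
    return weight * np.dot(phi_a_vals, extracted)
```

Because the extracted array has only as many entries as φα has PPD points, the cost per integral is independent of the FFTbox size.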
Efficient calculation of the charge density, n(r)
For each α:
• form the β-sum Σβ Kαβ φβ on the coarse grid (K is the density kernel)
• interpolate φα and the β-sum to the fine grid
• multiply
• deposit in simulation cell
Only two interpolations per n(r;α), independent of the number of β!
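The steps above can be sketched as follows (`interpolate` stands in for the coarse-to-fine Fourier interpolation and is passed in as a callable; all names are illustrative):

```python
import numpy as np

def density_contrib(phi_a, phi_b_list, K_row, interpolate):
    # n(r; alpha) = phi_alpha(r) * sum_beta K^{alpha beta} phi_beta(r).
    # Summing over beta on the coarse grid FIRST means only two fine-grid
    # interpolations are needed per alpha (one for phi_alpha, one for the
    # beta-sum), however many betas overlap it.
    row_sum = sum(k * pb for k, pb in zip(K_row, phi_b_list))
    return interpolate(phi_a) * interpolate(row_sum)
```

With a mock interpolation that simply doubles each grid point, the β-count independence is easy to verify: adding more β functions changes the coarse-grid sum but never the number of interpolations.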
Efficient calculation of the NGWF gradient
Only one application of Ĥ per φα!
• extract only the PPDs belonging to φα
• “shave” values outside the region of φα
(restricted to the region of φα – suitable for updating φα in the conjugate gradients optimisation)
Linear-scaling with small prefactor
[Plot: calculation time (hours, 0–25) vs number of atoms (0–1000): CASTEP grows as ~N³, ONETEP as ~N]
Communication model
Functions on different processors which overlap need to be communicated during computation
• Use only point-to-point communication
• Keep functions to be sent in buffers and interleave communication with computation by using non-blocking sends
• Only use PPD representation during communication
• Keep in memory batches of Ôφβ to minimise communication
• Send only if there is an overlap with current batch of functions of receiving processor
• Work with processor-processor blocks of functions/matrices – do Nprocessor supersteps
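The Nprocessor supersteps can use a simple cyclic pairing (a sketch; the actual ONETEP schedule may differ): at step s, rank X sends to (X+s) mod P and receives from (X−s) mod P, so after P steps every pair of processors has exchanged data exactly once.

```python
def superstep_schedule(rank, nproc):
    # One point-to-point partner pair per superstep.  Every rank sends
    # to each other rank exactly once, and the pairing is consistent:
    # if X sends to Y at step s, then Y receives from X at step s.
    return [((rank + s) % nproc, (rank - s) % nproc)
            for s in range(nproc)]
```

Each superstep then handles one processor-processor block of functions/matrices, which keeps buffers small and lets non-blocking sends overlap with computation on the current block.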
Communication model, simplest case – evaluation of integrals: outline from the viewpoint of processor X

loop over batches:
    initialise my NGWF batch
    apply Ô to my batch
    loop over procs:
        at this superstep I will be sending to Y & receiving from Z
        send my φθ if needed by Y
        receive φη from Z if it overlaps any member of my batch
        loop over all my φθ NGWFs: accumulate the <φη|Ô|φθ> integrals
    store batch integrals in sparse form
Speedup with increasing number of processors
[Plot: speed-up vs number of processors (0–100) for BRC4 (570 atoms) and a BN DWNT (1200 atoms), compared with ideal linear speed-up]
True linear-scaling: number of iterations is small and independent of N
Test systems: Gly (10 atoms), Gly50 (353 atoms), Gly200 (1403 atoms), nanotubes (516 atoms)
[Plot: energy gain per atom (Ha, log scale 10⁻⁷–10⁻¹) vs iteration (0–25) for all four systems]
[Inset: residual norm (Ha, log scale 10⁻⁵–10⁻³) vs iteration (0–20)]
Chemical accuracy, no Basis Set Superposition Error

Basis sets, number of localised functions:
• CASTEP: plane waves, 1292 eV – ~O(N³)
• ONETEP: plane waves, 1292 eV; H: 1 NGWF, O: 4 NGWFs – ~O(N)
• NWCHEM: cc-pVTZ + diffuse; H: 25 contracted Gaussians, O: 55 contracted Gaussians – ~O(N³)

[Plot: energy (kcal/mol, 0–8) vs O–H distance (1–7 Å) for ONETEP, CASTEP and Gaussian-basis NWCHEM]
The cost of enlarging the simulation cell is independent of the number of atoms – plane-wave calculations in huge simulation cells become possible!
ONETEP calculations of carbon nanotube tip in uniform external electric field (C.-K. Skylaris & G. Csányi et al.)
[Figure: local potential, 50 Å × 50 Å × 50 Å simulation cell]
[Figure: charge density near the Fermi level, shown on a local-potential iso-surface]
Conclusions
• ONETEP – linear-scaling total energy code – takes full advantage of parallel computers
• Accuracy equivalent to conventional cubic-scaling Plane-wave / Gaussian DFT codes
• Low prefactor – breakeven with cubic-scaling codes in the region of a few hundred atoms
• Work in progress: data-parallelisation of simulation cell – even larger simulation cells
• A whole new level of large-scale first-principles simulations, from condensed matter physics to biology, is now possible – see next week’s instalment, which will be brought to you by Peter D. Haynes