1
abc cba bca cab acb cab acb acb cab cba bca cba bca N 3 -N 3 /N p N p ≤N N p : Number of processes N×N×N grid N < N p < N 2 N p = N 2 (2-N/N p )N 3 -N 2 2N 2 (N-1) N 3 -N 3 /N p (2-N/N p )N 3 -N 2 2N 2 (N-1) N 3 -N 3 /N p (2-N/N p )N 3 -N 2 2N 2 (N-1) N 3 -N 3 /N p (2-N/N p )N 3 -N 2 2N 2 (N-1) 2(N 3 -N 3 /N p ) 2N 2 (N-1) 2N 2 (N-1) 2N 2 (N-1) 2N 2 (N-1) 2(N 3 -N 3 /N p ) 2(N 3 -N 3 /N p ) 2(N 3 -N 3 /N p ) 2N 2 (N-1) 2N 2 (N-1) 2N 2 (N-1) 2N 2 (N-1) N 2 < N p < N 3 N 3 +2N p (N-1) N 3 +2N p (N-1) N 3 +2N p (N-1) N 3 +2N p (N-1) N 3 +2N p (N-1) N 3 +2N p (N-1) N 3 +2N p (N-1) N 3 +2N p (N-1) N p = N 3 3N 3 (N-1) 3N 3 (N-1) 3N 3 (N-1) 3N 3 (N-1) 3N 3 (N-1) 3N 3 (N-1) 3N 3 (N-1) 3N 3 (N-1) Motivation A Massively Parallel Domain Decomposition Method for Large-Scale DFT Electronic Structure Calculations Truong Vinh Truong Duy 1,2 and Taisuke Ozaki 1 1 Research Center for Simulation Science, Japan Advanced Institute of Science and Technology (JAIST) 2 Institute for Solid State Physics, The University of Tokyo Email: duytvt@{jaist.ac.jp,issp.u-tokyo.ac.jp}, [email protected] Fig. 4: Atom decomposition with 16,384 diamond atoms and 19 processes, and CNTs. 1. Atom Decomposition Method Two key ideas: (i) the modified recursive bisection method for recur- sively decomposing the domain by constructing a binary tree, and (ii) the moment of inertia tensor for finding a principal axis of each sub- domain to reorder the atoms based on their projection on the axis and divide them into two sub-domains to fit the binary tree structure. 2. Grid Decomposition Method Define four data structures to make data locality consistent with that of the clustered atoms for minimizing inter-process communications. 3. 3D Adaptive Order-Aware Decomposition Method for 3D FFT Automatically decompose in 1D, 2D, or 3D depending on the pro- cess number while giving priority to lower order. Summary Our method Atom decomposition method + Grid decomposition method. 3D adaptive order-aware decomposition method for 3D FFT. The parallel efficiency at 131,072 cores is 67.7% compared to the baseline of 16,384 cores with 131,072 diamond atoms. Future work Evaluate our method with non-linear scaling methods. Acknowledgements CMSI and Materials Design through Computics, MEXT. References [1] T.V.T. Duy and T. Ozaki, arXiv:1209.4506 (2012). [2] T.V.T. Duy and T. Ozaki, arXiv:1302.6189 (2013). Purpose Method Evaluation 67.7% 100% 70.8% 100% Structure C Charge density: A→B→C Hartree potential by FFT B(abc)→B(cab)→B(cba) →B(cab)→B(abc) Exchange-correlation potential: C, D Charge mixing: B(cab)→B(cba)→B(abc) Total energy: B P1 P2 P3 P4 P5 P2 P3 P4 Fig. 1: The modified recursive bisection method with the binary tree for 19 processes. Fig. 2: The moment of inertia tensor for 3D-to-1D atom reordering. Original Domain (26 Atoms) Subdomains (13 Atoms) Sub-Subdomains (7 and 6 Atoms) Fig. 3: Example of the atom decomposition method with 26 atoms. Purpose Develop a domain decomposition method for enabling large-scale DFT calculations with hundreds of thousands of atoms and cores. Objectives Approximately the same computational amount for each process. Locality held: nearby atoms assigned to the same process. Inter-process communications minimized. Applicable to any numbers of atoms and processes. Applicable to any distribution patterns of atoms in space. Computationally inexpensive. Density functional theory (DFT) Quantum mechanical modeling method used to investigate the electro- nic structure of many-body systems in physics and chemistry. Petaflops era and beyond The K computer with approximately 700,000 cores. Exaflops machines with millions of cores expected to arrive by 2020. OpenMX (Open source package for Material eXplorer) Linear scaling DFT code. Large-scale calculations demanded. http://www.openmx-square.org/ Fig. 5: Data structures and the calculation flow. Fig. 9 : Strong scaling on the K computer: 131,072 diamond atoms (left) and 26,000 atoms in the DNA structure (right) with OpenMX and O(N) Krylov subspace method. Fig. 6: The decomposition method in 2D row-wise and the communication amount. Fig. 7: Linear scaling with 2,048 cores and the diamond structure on XT5. Fig. 8: Weak scaling with the diamond structure on XT5.

A Massively Parallel Domain Decomposition Method for Large … · 2013-06-10 · abc cba bca cab acb cab acb acb cab cba bca cba bca N3-N3/N Np≤N p Np: Number of processes N×N×N

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Massively Parallel Domain Decomposition Method for Large … · 2013-06-10 · abc cba bca cab acb cab acb acb cab cba bca cba bca N3-N3/N Np≤N p Np: Number of processes N×N×N

abc

cba bca cab acb

cab acb acb cab cba bca cba bca

N3-N

3/Np

Np ≤ N

Np: Number

of processes

N×N×N grid

N < Np < N2

Np= N2

(2-N/Np)N3-N

2

2N2(N-1)

N3-N

3/Np

(2-N/Np)N3-N

2

2N2(N-1)

N3-N

3/Np

(2-N/Np)N3-N

2

2N2(N-1)

N3-N

3/Np

(2-N/Np)N3-N

2

2N2(N-1)

2(N3-N

3/Np)

2N2(N-1) 2N

2(N-1) 2N

2(N-1) 2N

2(N-1)

2(N3-N

3/Np) 2(N

3-N

3/Np) 2(N

3-N

3/Np)

2N2(N-1) 2N

2(N-1) 2N

2(N-1) 2N

2(N-1)

N2 < Np < N3N

3+2Np(N-1) N

3+2Np(N-1) N

3+2Np(N-1) N

3+2Np(N-1) N

3+2Np(N-1) N

3+2Np(N-1) N

3+2Np(N-1) N

3+2Np(N-1)

Np= N33N

3(N-1) 3N

3(N-1) 3N

3(N-1) 3N

3(N-1) 3N

3(N-1) 3N

3(N-1) 3N

3(N-1) 3N

3(N-1)

Motivation

A Massively Parallel Domain Decomposition Method forLarge-Scale DFT Electronic Structure Calculations

Truong Vinh Truong Duy1,2 and Taisuke Ozaki11Research Center for Simulation Science, Japan Advanced Institute of Science and Technology (JAIST)

2Institute for Solid State Physics, The University of TokyoEmail: duytvt@{jaist.ac.jp,issp.u-tokyo.ac.jp}, [email protected]

Fig. 4: Atom decomposition with 16,384 diamond atoms and 19 processes, and CNTs.

1. Atom Decomposition MethodTwo key ideas: (i) the modified recursive bisection method for recur-sively decomposing the domain by constructing a binary tree, and (ii)the moment of inertia tensor for finding a principal axis of each sub-domain to reorder the atoms based on their projection on the axis anddivide them into two sub-domains to fit the binary tree structure.

2. Grid Decomposition MethodDefine four data structures to make data locality consistent with that of the clustered atoms for minimizing inter-process communications.

3. 3D Adaptive Order-Aware Decomposition Method for 3D FFTAutomatically decompose in 1D, 2D, or 3D depending on the pro-cess number while giving priority to lower order.

SummaryOur method

� Atom decomposition method + Grid decomposition method.� 3D adaptive order-aware decomposition method for 3D FFT.� The parallel efficiency at 131,072 cores is 67.7% compared to the

baseline of 16,384 cores with 131,072 diamond atoms.Future work

� Evaluate our method with non-linear scaling methods.Acknowledgements

CMSI and Materials Design through Computics, MEXT.References[1] T.V.T. Duy and T. Ozaki, arXiv:1209.4506 (2012).[2] T.V.T. Duy and T. Ozaki, arXiv:1302.6189 (2013).

Purpose

Method

Evaluation

67.7%

100%

70.8%

100%

Structure C

Charge density:

A→B→C

Hartree potential by FFT

B(abc)→B(cab)→B(cba)

→B(cab)→B(abc)

Exchange-correlation

potential: C, D

Charge mixing:

B(cab)→B(cba)→B(abc)

Total energy: B

P1

P2

P3

P4

P5

P2

P3

P4

Fig. 1: The modified recursive bisection method with the binary tree for 19 processes.

Fig. 2: The moment of inertia tensor for 3D-to-1D atom reordering.

Original Domain (26 Atoms)

Subdomains (13 Atoms)

Sub-Subdomains(7 and 6 Atoms)

Fig. 3: Example of the atom decomposition method with 26 atoms.

PurposeDevelop a domain decomposition method for enabling large-scale DFT calculations with hundreds of thousands of atoms and cores.

Objectives� Approximately the same computational amount for each process.� Locality held: nearby atoms assigned to the same process.� Inter-process communications minimized. � Applicable to any numbers of atoms and processes.� Applicable to any distribution patterns of atoms in space.� Computationally inexpensive.

Density functional theory (DFT) Quantum mechanical modeling method used to investigate the electro-nic structure of many-body systems in physics and chemistry.

Petaflops era and beyond� The K computer with approximately 700,000 cores.� Exaflops machines with millions of cores expected to arrive by 2020.

OpenMX (Open source package for Material eXplorer)� Linear scaling DFT code.� Large-scale calculations demanded.

http://www.openmx-square.org/

Fig. 5: Data structures and the calculation flow.

Fig. 9 : Strong scaling on the K computer: 131,072 diamond atoms (left) and 26,000 atoms in the DNA structure (right) with OpenMX and O(N) Krylov subspace method.

Fig. 6: The decomposition method in 2D row-wise and the communication amount.

Fig. 7: Linear scaling with 2,048 cores and the diamond structure on XT5.

Fig. 8: Weak scaling with the diamond structure on XT5.