
MG Proto: A Multigrid Solver for x86 multicore Systems
Balint Joo (Jefferson Lab), Thorsten Kurth (NERSC)

Introduction
Adaptive aggregation-based multigrid (MG) methods [1,2,3] are becoming the standard solvers both for propagator calculations and, more recently, for gauge generation in lattice QCD calculations with Wilson-Clover fermions. The system solved is A x = b, where A is the Dirac operator.

Multi-Grid Solvers: V-Cycle & K-Cycle

MG aims to reduce the short-wavelength (UV) error modes on the fine grid using a smoother (S). The error due to the longer-wavelength modes is reduced on a coarser grid by solving with a coarsened operator (A_c^{-1}). A typical cycle is the V-cycle:

[Figure: V-cycle schematic showing j iterations of pre-smoothing with S, restriction R, a (recursive) bottom solve with A_c^{-1}, prolongation P, and k iterations of post-smoothing.]

The error is reduced as (e.g. [2]):

e_0 \to (I - S A)^k \, (I - P A_c^{-1} R A) \, (I - S A)^j \, e_0
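To make the cycle concrete, here is a minimal schematic C++ sketch of a recursive V-cycle. The Level struct, its callable members and the function names are illustrative assumptions for this sketch, not MG Proto's actual interfaces.

#include <cstddef>
#include <functional>
#include <vector>

// Schematic only: each level bundles its operator A, smoother S, restrictor R
// and prolongator P as callables.
struct Level {
    std::function<std::vector<float>(const std::vector<float>&)> apply_A;             // y = A x
    std::function<void(std::vector<float>&, const std::vector<float>&, int)> smooth;  // n sweeps of S
    std::function<std::vector<float>(const std::vector<float>&)> restrict_to_coarse;  // R
    std::function<std::vector<float>(const std::vector<float>&)> prolong_to_fine;     // P
};

// Approximately solve A x = b on level l with one V-cycle
// (j pre-smoothing and k post-smoothing sweeps).
void vcycle(const std::vector<Level>& levels, std::size_t l,
            std::vector<float>& x, const std::vector<float>& b, int j, int k)
{
    const Level& L = levels[l];
    if (l + 1 == levels.size()) {        // coarsest level: stand-in for the bottom solve
        L.smooth(x, b, j + k);
        return;
    }

    L.smooth(x, b, j);                   // j iterations of pre-smoothing with S

    std::vector<float> r = L.apply_A(x);                // r = b - A x
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = b[i] - r[i];

    std::vector<float> rc = L.restrict_to_coarse(r);    // restrict the residual
    std::vector<float> ec(rc.size(), 0.0f);
    vcycle(levels, l + 1, ec, rc, j, k);                // recurse: solve A_c e_c = r_c

    std::vector<float> e = L.prolong_to_fine(ec);       // prolongate the coarse correction
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += e[i];

    L.smooth(x, b, k);                   // k iterations of post-smoothing with S
}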

Near Null Space Block Aggregation
Low modes of A are 'self-similar' on cubic blocks of the lattice due to local coherence (the weak approximation). Hence one way to define R is to aggregate the fine degrees of freedom over cubic blocks using near-null-space vectors produced in a setup phase.

Fine-grid d.o.f.: V_f x N_spin x N_color. Coarse-grid d.o.f.: V_c x N_chiral x N_null.

Restriction: aggregation over sites, colors, and chiral spin components.
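A minimal sketch of such an aggregated restriction follows. The flattened site/spin/color layout, the function name restrict_vector, and the convention that spin components 0,1 and 2,3 form the two chiralities are assumptions for illustration, not MG Proto's data structures.

#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// fine and null_vecs[k] are flattened as [site][spin][color]; block_sites[b]
// lists the fine sites of cubic block b. Output ordering is
// [block][chirality][null vector].
std::vector<cfloat> restrict_vector(
    const std::vector<cfloat>& fine,
    const std::vector<std::vector<cfloat>>& null_vecs,   // N_null near-null vectors
    const std::vector<std::vector<int>>& block_sites,    // V_c aggregation blocks
    int n_spin, int n_color)
{
    const int n_null   = static_cast<int>(null_vecs.size());
    const int n_chiral = 2;
    std::vector<cfloat> coarse(block_sites.size() * n_chiral * n_null, cfloat(0.0f, 0.0f));

    for (std::size_t b = 0; b < block_sites.size(); ++b)
        for (int k = 0; k < n_null; ++k)
            for (int site : block_sites[b])
                for (int s = 0; s < n_spin; ++s)
                    for (int c = 0; c < n_color; ++c) {
                        const std::size_t idx =
                            (static_cast<std::size_t>(site) * n_spin + s) * n_color + c;
                        const int chi = s / 2;   // assumed chiral grouping of spin components
                        coarse[(b * n_chiral + chi) * n_null + k]
                            += std::conj(null_vecs[k][idx]) * fine[idx];
                    }
    return coarse;
}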

Coarse Operator
The resulting coarse operator is a nearest-neighbor operator similar in structure to the fine operator:

A_c(x) = X_0(x) + \sum_{\mu=1..8} X_\mu(x) \, \delta_{x, x+\hat{\mu}}

where X_0(x) and X_\mu(x) are matrices of dimension N_null x N_chiral.

SIMD for Matrix-Vector Operations
Applying A_c consists of 9 matrix-vector multiplications. We restrict N_null to be a multiple of 8 and vectorize using AVX-512 intrinsics:
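A hedged sketch of one such kernel: this is not the MG Proto implementation. It assumes an interleaved (re, im) single-precision layout so that one AVX-512 register holds 8 complex numbers, consistent with N_null being a multiple of 8; the function name coarse_row_dot and its arguments are illustrative.

#include <immintrin.h>
#include <complex>

// One row of y[i] = sum_j A[i][j] * x[j] in complex single precision, with the
// row of A and the vector x stored as interleaved (re, im) pairs. n_complex is
// the row length in complex numbers and is assumed to be a multiple of 8.
std::complex<float> coarse_row_dot(const float* Arow, const float* x, int n_complex)
{
    __m512 acc = _mm512_setzero_ps();
    for (int j = 0; j < 2 * n_complex; j += 16) {       // 16 floats = 8 complex numbers
        __m512 a   = _mm512_loadu_ps(Arow + j);         // [re0 im0 re1 im1 ...]
        __m512 xv  = _mm512_loadu_ps(x + j);
        __m512 are = _mm512_moveldup_ps(a);             // [re0 re0 re1 re1 ...]
        __m512 aim = _mm512_movehdup_ps(a);             // [im0 im0 im1 im1 ...]
        __m512 xsw = _mm512_permute_ps(xv, 0xB1);       // swap (re, im) within each pair

        // complex multiply: even lanes get a*c - b*d (real part),
        // odd lanes get a*d + b*c (imaginary part)
        __m512 prod = _mm512_fmaddsub_ps(are, xv, _mm512_mul_ps(aim, xsw));
        acc = _mm512_add_ps(acc, prod);
    }

    // Horizontal sums over the real (even) and imaginary (odd) lanes.
    float lanes[16];
    _mm512_storeu_ps(lanes, acc);
    float re = 0.0f, im = 0.0f;
    for (int l = 0; l < 16; l += 2) { re += lanes[l]; im += lanes[l + 1]; }
    return {re, im};
}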

Nested Parallelism for Aggregation
Aggregations for restriction and prolongation permit nested parallelism through a) parallelism over blocks and b) parallelism within blocks. Parallelism within blocks may be desirable when there are very few coarse sites (e.g. a coarse level with 16 sites on a KNL system).
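The sketch below illustrates the "manual" scheme under assumed names (block_sums and so on); it is not the MG Proto kernel. A single flat OpenMP team is split by hand into block-level workers and within-block helpers, and each helper's private partial result is combined by a serial reduction, mirroring the man./ser. red. variant benchmarked below.

#include <omp.h>
#include <cstdio>
#include <vector>

// Toy stand-in for a per-block aggregation: sum the fine values of each block.
void block_sums(const std::vector<float>& fine, int n_blocks, int sites_per_block,
                int n_outer, int n_inner, std::vector<float>& coarse)
{
    std::vector<float> partial(n_blocks * n_inner, 0.0f);

    #pragma omp parallel num_threads(n_outer * n_inner)
    {
        const int tid   = omp_get_thread_num();
        const int outer = tid / n_inner;     // which block-level worker
        const int inner = tid % n_inner;     // which helper within a block

        // Blocks are distributed over the outer index; helpers stride
        // through the sites of each block.
        for (int b = outer; b < n_blocks; b += n_outer) {
            float acc = 0.0f;
            for (int s = inner; s < sites_per_block; s += n_inner)
                acc += fine[b * sites_per_block + s];
            partial[b * n_inner + inner] = acc;
        }
        #pragma omp barrier
        // Serial reduction over the helpers, done by helper 0 of each worker.
        if (inner == 0) {
            for (int b = outer; b < n_blocks; b += n_outer) {
                float sum = 0.0f;
                for (int i = 0; i < n_inner; ++i) sum += partial[b * n_inner + i];
                coarse[b] = sum;
            }
        }
    }
}

int main() {
    const int n_blocks = 16, sites = 256;
    std::vector<float> fine(n_blocks * sites, 1.0f), coarse(n_blocks, 0.0f);
    block_sums(fine, n_blocks, sites, 4, 4, coarse);
    std::printf("coarse[0] = %f\n", coarse[0]);   // expect 256
}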

[Figure: GFLOP/s vs. number of inner threads (2 to 16), with curves for 16, 32, 64, 128 and 256 threads. Panels cover local volumes V = 4x4x4x4, 4x4x4x8, 4x4x8x8 and 4x8x8x8 (each labelled "24"), for three variants: manual nesting with serial reductions (man. ser. red.), manual nesting with parallel reductions (man. par. red.), and explicit OpenMP nesting with custom reductions (exp. cust. red.).]

Utilizing threading within a site was only really beneficial when there was not enough parallelism over sites (the V=4^4 and V=4^3x8 cases). In those cases the benefit was visible when nested parallelism was implemented manually (man.) rather than through explicit OpenMP nested (exp.) parallelism, and serial reductions (ser. red.) within the blocks proved more efficient than parallel ones. [4]

Performance Results
[Figure: performance results and strong scaling on Stampede 2 using Skylake (SKX) nodes.]
Multigrid provides approximately an 8x reduction in solve time compared with the fastest previously available solver, a mixed-precision BiCGStab from the QPhiX library. Similar performance improvements are also seen on KNL systems, e.g. Cori and Theta.

Other Optimizations
Our MG Proto implementation uses the QPhiX library for threading and vectorization on the fine grid. In addition, we have implemented Schur-decomposition-based even-odd preconditioning in all of the solvers used on all MG levels.
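As a sketch of what this means (standard block conventions assumed; which parity is eliminated is a choice of convention), splitting the lattice sites into even (e) and odd (o) sets gives the block factorization

A = \begin{pmatrix} A_{ee} & A_{eo} \\ A_{oe} & A_{oo} \end{pmatrix}
  = \begin{pmatrix} I & 0 \\ A_{oe} A_{ee}^{-1} & I \end{pmatrix}
    \begin{pmatrix} A_{ee} & A_{eo} \\ 0 & \hat{A} \end{pmatrix},
\qquad
\hat{A} = A_{oo} - A_{oe} A_{ee}^{-1} A_{eo},

so that one solves the smaller Schur-complement system and back-substitutes:

\hat{A}\, x_o = b_o - A_{oe} A_{ee}^{-1} b_e,
\qquad
x_e = A_{ee}^{-1} \left( b_e - A_{eo}\, x_o \right).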

Conclusions & Outlook
Our implementation delivers a roughly 8x improvement over our best previous solver on KNL and Skylake systems. Coincidentally, 64 nodes of Stampede perform similarly to 64 nodes of Titan running QUDA-MG in 2016. This work opens Cori and Theta to propagator calculations, and in the future to gauge generation using multigrid. It also serves as a basis for performance-portability explorations.

Acknowledgement
This work is supported by the US Department of Energy, Office of Science, Offices of Nuclear Physics and Advanced Scientific Computing Research, through the SciDAC program under contract DE-AC05-06OR23177, under which JSA LLC operates and manages Jefferson Lab, and under the 17-SC-20-SC Exascale Computing Project.

References