Adaptive Multigrid Algorithms for GPUs
M Clark, NVIDIA, with M Cheng and R Brower
Tuesday, July 30, 13
Contents
• Adaptive Multigrid
• QUDA Library
• Implementation
• Extremely Preliminary Results
• Future Work
Adaptive Multigrid
Osborn et al, arXiv:1011.2775
[Figure: seconds per solve vs. quark mass (log scale) for a 32³×256 anisotropic clover lattice on 1,024 BG/P cores, comparing mixed-precision BiCGStab with mixed-precision multigrid; the multigrid speedup grows from 2× to 13.9× to 24.5× as the mass is lowered through the m_s, m_light, and m_phys points.]
Hierarchical algorithms for LQCD
• Adaptive Geometric Multigrid
– Based on adaptive smoothed aggregation (Brezina et al 2004)
– Low modes have the weak-approximation property ⇒ locally co-linear
– Apply fixed geometric coarsening (Brannick et al 2007, Babich et al 2010)
– See also Frommer et al 2012
• Inexact Deflation (Lüscher 2007)
– Equivalent to adaptive "unsmoothed" aggregation
– Local coherence = weak-approximation property
– Uses an additive correction vs. MG's multiplicative correction
• Residual reduced by a constant factor per iteration
– Convergence in O(1) iterations, O(N) cost per iteration
– O(N) total solution cost
Multigrid V-cycle
• Solve
1. Smooth
2. Compute residual
3. Restrict residual
4. Recurse on coarse problem
5. Prolongate correction
6. Smooth
7. If not converged, goto 1
• Typically use multigrid as a preconditioner for a Krylov method
• For LQCD, we do not know the null-space components that need to be preserved on the coarse grid
Adaptive Geometric Multigrid
• Adaptively find candidate null-space vectors
– Dynamically learn the null space and use it to define the prolongator
– The algorithm is self-learning
• Setup
1. Set the solver to be a simple smoother
2. Apply the current solver to a random vector: v_i = P(D) η_i
3. If convergence is good enough, solver setup is complete
4. Construct the prolongator using fixed coarsening: (1 − P R) v_k = 0
➡ Typically use 4⁴ geometric blocks
➡ Preserve chirality when coarsening: R = γ₅ P† γ₅ = P†
5. Construct the coarse operator: D_c = P† D P
6. Recurse on the coarse problem
7. Set the solver to be the augmented V-cycle, goto 2
Motivation
[Figure: number of Dirac operator applications vs. quark mass (log scale, 100 to 10⁵) for CG, Eig-CG, and MG-GCR on 16³×64, 24³×64, and 32³×96 lattices; the multigrid (MG-GCR) counts stay nearly flat as the mass is lowered, while the CG counts grow steeply.]
Phys. Rev. Lett. 105, 201602 (2010)
The March of GPUs
[Two charts, 2007–2013: peak memory bandwidth (GB/s, axis to 300) and peak double-precision Gflop/s (axis to 1,500) for NVIDIA GPUs (M1060, Fermi M2070, Fermi+ M2090, Kepler, Kepler+; ECC off) vs. x86 CPUs (Nehalem, Westmere, and 8-core Sandy Bridge, all 3 GHz); the GPU curves pull steadily away from the CPU curves on both metrics.]
Enter QUDA
• "QCD on CUDA" – http://lattice.github.com/quda
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, etc.
• Provides:
— Various solvers for several discretizations, including multi-GPU support and domain-decomposed (Schwarz) preconditioners
— Additional performance-critical routines needed for gauge-field generation
• Maximize performance
– Exploit physical symmetries
– Mixed-precision methods
– Autotuning for high performance on all CUDA-capable architectures
– Cache blocking
Chroma (Lattice QCD) – High Energy & Nuclear Physics
Chroma, 48³×512 lattice: relative scaling (application time)
[Chart: relative scaling vs. node count (0–1,280) for XK7 (K20X) with BiCGStab, XK7 (K20X) with DD+GCR, and XE6 (2× Interlagos); the DD+GCR solver on XK7 reaches 3.58× the XE6 performance at 1,152 nodes. "XK7" node = 1× K20X + 1× Interlagos; "XE6" node = 2× Interlagos.]
The Challenge of Multigrid on GPU
• For competitiveness, MG on GPU is a must
• GPU requirements are very different from CPU
– Each thread is slow, but there are O(10,000) threads per GPU
• Fine grids run very efficiently
– A high-parallel-throughput problem
• Coarse grids are the worst possible scenario
– More cores than degrees of freedom
– Increasingly serial and latency bound
– Little's law (bytes in flight = bandwidth × latency)
– An Amdahl's law limiter
Hierarchical algorithms on heterogeneous architectures
• GPU: thousands of cores for parallel processing
• CPU: a few cores optimized for serial work
Design Goals
• Flexibility
– Deploy level i on either CPU or GPU
– All algorithmic flow decisions made at runtime
– Autotune for a given heterogeneous architecture
• (Short term) Provide optimal solvers to legacy apps
– e.g., Chroma, CPS, MILC, etc.
• (Long term) A hierarchical algorithm toolbox
– Little to no barrier to trying new algorithms
Multigrid and QUDA
• QUDA designed to abstract algorithm from the heterogeneity
LatticeField
– ColorSpinorField: cudaColorSpinorField, cpuColorSpinorField
– GaugeField: cudaGaugeField, cpuGaugeField
Multigrid and QUDA
• While envisaged to be fairly abstract
– Rarely implemented like this in practice
– A product of rapid development by different developers
• Adding multigrid required a lot of work
– Improves maintainability of QUDA across the board
Writing the same code for two architectures
• Use C++ templates to abstract architecture specifics
– Load/store order, caching modifiers, precision, intrinsics
• CPU and GPU kernels are almost identical
– Index computation (for loop → thread id)
– Block reductions (shared-memory reduction and/or atomic operations)
// Applies the grid prolongation operator (coarse to fine): CPU
template <class FineSpinor, class CoarseSpinor>
void prolongate(FineSpinor &out, const CoarseSpinor &in,
                const int *geo_map, const int *spin_map) {
  for (int x = 0; x < out.Volume(); x++) {
    for (int s = 0; s < out.Nspin(); s++) {
      for (int c = 0; c < out.Ncolor(); c++) {
        out(x, s, c) = in(geo_map[x], spin_map[s], c);
      }
    }
  }
}

// Applies the grid prolongation operator (coarse to fine): GPU
template <class FineSpinor, class CoarseSpinor>
__global__ void prolongate(FineSpinor out, const CoarseSpinor in,
                           const int *geo_map, const int *spin_map) {
  int x = blockIdx.x*blockDim.x + threadIdx.x;
  for (int s = 0; s < out.Nspin(); s++) {
    for (int c = 0; c < out.Ncolor(); c++) {
      out(x, s, c) = in(geo_map[x], spin_map[s], c);
    }
  }
}
Current Status
• First multigrid solver working in QUDA as of last Friday
• Some components still on CPU only
• Designed to interoperate with J. Osborn's qopqdp implementation
– Can verify algorithm correctness and share null-space vectors

                               GPU   CPU
Fine grid operator              ✓
Block orthogonalization               ✓
Prolongator                     ✓     ✓
Restrictor                      ✓     ✓
Construct coarse gauge field          ✓
Coarse grid operator                  ✓
Vector BLAS                     ✓     ✓
Very preliminary two-level results
[Chart: time (s) spent in prolongation, restriction, the coarse-grid solve, and smoothing for qcdlib vs. QUDA; level 0 = 24³×128, level 1 = 8³×16, with D on the GPU and D_c, P, R on the CPU.]
QUDA as a Hierarchical Algorithm Tool
• Lots of interesting questions to be explored
• Exploit closer coupling of precision and algorithm
– QUDA designed for complete run-time specification of precision at any point
– Currently supports 64-bit, 32-bit, and 16-bit
– Is 128-bit or 8-bit useful at all for hierarchical algorithms?
• Domain decomposition (DD) and multigrid
– DD approaches likely vital for strong scaling
– DD solvers are effective for high-frequency damping
– Overlapping domains likely more important at coarser scales
Summary
• Introduction to multigrid on QUDA
• Basic framework complete; proof of concept
• Still lots of work to do
– Most of the nitty-gritty details worked out
– Now time for fun
• Beta testing by end of year
– Chroma Wilson / Wilson-clover support first
• Today's lessons are relevant for exascale preparation
mclark at nvidia dot com
Backup slides
Failure of Geometric Multigrid for LQCD
[Left: error of the 2-d Laplace operator under Gauss-Seidel iteration. Right: error of the 2-d U(1) Wilson-Dirac operator after 200 Gauss-Seidel iterations.]
The Need for Just-In-Time Compilation
• Tightly coupled variables should live at the register level
• Dynamic indexing cannot be resolved into register variables
– Array accesses with indices not known at compile time spill out into global memory (L1 / L2 / DRAM)
// Applies the grid prolongation operator (coarse to fine)
template <class FineSpinor, class CoarseSpinor>
__global__ void prolongate(FineSpinor out, const CoarseSpinor in,
                           const int *geo_map, const int *spin_map) {
  int x = blockIdx.x*blockDim.x + threadIdx.x;
  for (int s = 0; s < out.Nspin(); s++) {
    for (int c = 0; c < out.Ncolor(); c++) {
      out(x, s, c) = in(geo_map[x], spin_map[s], c);
    }
  }
}
The Need for Just-In-Time Compilation
• Possible solutions
– Template over every possible N_v ⊗ precision combination for each MG kernel
– One thread per colour-matrix row (inefficient for N_v mod 32 ≠ 0)
– Compile the necessary kernel at runtime
• JIT support will be coming in CUDA 6.x
– The final performant implementation will likely require this
// Applies the grid prolongation operator (coarse to fine)
template <class FineSpinor, class CoarseSpinor, int Ncolor, int Nspin>
__global__ void prolongate(FineSpinor out, const CoarseSpinor in,
                           const int *geo_map, const int *spin_map) {
  int x = blockIdx.x*blockDim.x + threadIdx.x;
  for (int s = 0; s < Nspin; s++) {
    for (int c = 0; c < Ncolor; c++) {
      out(x, s, c) = in(geo_map[x], spin_map[s], c);
    }
  }
}
Heterogeneous Updating Scheme
• Multiplicative MG is a necessarily serial process
– Cannot utilize both GPU and CPU simultaneously
• Additive MG is parallel
– Can utilize both GPU and CPU simultaneously
• Additive MG requires an accurate coarse-grid solution
– Not amenable to multi-level
– Only need the additive correction at the CPU↔GPU level interface
The Kepler Architecture
• Kepler K20X
– 2,688 processing cores
– 3,995 SP Gflop/s peak (665.5 fma)
– Effective SIMD width of 32 threads (a warp)
• Deep memory hierarchy
– As we move away from registers:
• bandwidth decreases
• latency increases
– Each level imposes a minimum arithmetic intensity to achieve peak
• Limited on-chip memory
– 65,536 32-bit registers per SM, at most 255 registers per thread
– 48 KiB shared memory
– 1.5 MiB L2
[Diagram: the memory hierarchy, with per-level bandwidths: n multiprocessors of 192 cores each, registers (≈10.76 TB/s aggregate), shared memory/L1 and texture caches (≈2.5 TB/s), a shared 1.5 MiB L2 (≈500 GB/s), device memory at 250 GB/s, and host memory over PCIe at 8.0 GB/s per direction.]
QUDA is community driven
§ Ron Babich (NVIDIA)
§ Kip Barros (LANL)
§ Rich Brower (Boston University)
§ Michael Cheng (Boston University)
§ Justin Foley (University of Utah)
§ Joel Giedt (Rensselaer Polytechnic Institute)
§ Steve Gottlieb (Indiana University)
§ Bálint Joó (Jlab)
§ Hyung-Jin Kim (BNL)
§ Jian Liang (IHEP)
§ Claudio Rebbi (Boston University)
§ Guochun Shi (NCSA -> Google)
§ Alexei Strelchenko (Cyprus Institute -> FNAL)
§ Alejandro Vaquero (Cyprus Institute)
§ Frank Winter (UoE -> Jlab)
§ Yibo Yang (IHEP)
QUDA Mission Statement
• QUDA is
– a library enabling legacy applications to run on GPUs
– open source, so anyone can join the fun
– evolving:
• more features
• cleaner, easier to maintain
– a research tool into how to reach the exascale
• Lessons learned are mostly (platform) agnostic
• Domain-specific knowledge is key
• Free from the restrictions of DSLs, e.g., multigrid in QDP
Mapping the Wilson Dslash to CUDA
• Assign a single space-time point to each thread
– V = XYZT threads; V = 24⁴ ⇒ 3.3×10⁶ threads
– Fine-grained parallelization
• Looping over directions, each thread must
– load the neighboring spinor (24 numbers × 8)
– load the color matrix connecting the sites (18 numbers × 8)
– do the computation
– save the result (24 numbers)
• Arithmetic intensity
– 1,320 floating-point operations per site
– 1,440 bytes per site (single precision)
– 0.92 naive arithmetic intensity
review basic details of the LQCD application and of NVIDIA GPU hardware. We then briefly consider some related work in Section IV before turning to a general description of the QUDA library in Section V. Our parallelization of the quark interaction matrix is described in Section VI, and we present and discuss our performance data for the parallelized solver in Section VII. We finish with conclusions and a discussion of future work in Section VIII.
II. LATTICE QCD

The necessity for a lattice-discretized formulation of QCD arises due to the failure of the perturbative approaches commonly used for calculations in other quantum field theories, such as electrodynamics. Quarks, the fundamental particles that are at the heart of QCD, are described by the Dirac operator acting in the presence of a local SU(3) symmetry. On the lattice, the Dirac operator becomes a large sparse matrix, M, and the calculation of quark physics is essentially reduced to many solutions to systems of linear equations given by

Mx = b.   (1)

The form of M on which we focus in this work is the Sheikholeslami-Wohlert [6] (colloquially known as Wilson-clover) form, which is a central-difference discretization of the Dirac operator. When acting in a vector space that is the tensor product of a 4-dimensional discretized Euclidean spacetime, spin space, and color space, it is given by
M_{x,x'} = -\frac{1}{2}\sum_{\mu=1}^{4}\left( P^{-\mu}\otimes U^{\mu}_{x}\,\delta_{x+\hat\mu,x'} + P^{+\mu}\otimes U^{\mu\dagger}_{x-\hat\mu}\,\delta_{x-\hat\mu,x'} \right) + (4+m+A_x)\,\delta_{x,x'} \equiv -\frac{1}{2}D_{x,x'} + (4+m+A_x)\,\delta_{x,x'}.   (2)
Here δ_{x,y} is the Kronecker delta; P^{±μ} are 4×4 matrix projectors in spin space; U is the QCD gauge field, which is a field of special unitary 3×3 (i.e., SU(3)) matrices acting in color space that live between the spacetime sites (and hence are referred to as link matrices); A_x is the 12×12 clover matrix field acting in both spin and color space,¹ corresponding to a first-order discretization correction; and m is the quark mass parameter. The indices x and x' are spacetime indices (the spin and color indices have been suppressed for brevity). This matrix acts on a vector consisting of a complex-valued 12-component color-spinor (or just spinor) for each point in spacetime. We refer to the complete lattice vector as a spinor field.

Since M is a large sparse matrix, an iterative Krylov solver is typically used to obtain solutions to (1), requiring many repeated evaluations of the sparse matrix-vector product. The matrix is non-Hermitian, so either Conjugate Gradients [7] on the normal equations (CGNE or CGNR) is used, or more commonly, the system is solved directly using a non-symmetric method, e.g., BiCGstab [8]. Even-odd (also known as red-black) preconditioning is used to accelerate the solution-finding process, where the nearest-neighbor property of the D_{x,x'} matrix (see Fig. 1) is exploited to solve the Schur-complement system [9]. This has no effect on the overall efficiency since the fields are reordered such that all components of a given parity are contiguous. The quark mass controls the condition number of the matrix, and hence the convergence of such iterative solvers. Unfortunately, physical quark masses correspond to nearly indefinite matrices. Given that current leading lattice volumes are 32³×256, for >10⁸ degrees of freedom in total, this represents an extremely computationally demanding task.

¹ Each clover matrix has a Hermitian block-diagonal, anti-Hermitian block off-diagonal structure, and can be fully described by 72 real numbers.

Fig. 1. The nearest-neighbor stencil part of the lattice Dirac operator D, as defined in (2), in the μ–ν plane. The color-spinor fields are located on the sites. The SU(3) color matrices U^μ_x are associated with the links. The nearest-neighbor nature of the stencil suggests a natural even-odd (red-black) coloring for the sites.
III. GRAPHICS PROCESSING UNITS

In the context of general-purpose computing, a GPU is effectively an independent parallel processor with its own locally attached memory, herein referred to as device memory. The GPU relies on the host, however, to schedule blocks of code (or kernels) for execution, as well as for I/O. Data is exchanged between the GPU and the host via explicit memory copies, which take place over the PCI-Express bus. The low-level details of the data transfers, as well as management of the execution environment, are handled by the GPU device driver and the runtime system.

It follows that a GPU cluster embodies an inherently heterogeneous architecture. Each node consists of one or more processors (the CPU) optimized for serial or moderately parallel code and attached to a relatively large amount of memory capable of tens of GB/s of sustained bandwidth. At the same time, each node incorporates one or more processors (the GPU) optimized for highly parallel code and attached to a relatively small amount of very fast memory, capable of 150 GB/s or more of sustained bandwidth. The challenge we face is that these two powerful subsystems are connected by a narrow communications channel, the PCI-E bus, which sustains at most 6 GB/s and often less. As a consequence, it is critical to avoid unnecessary transfers between the GPU and the host.
Tesla K20X: 3,995 Gflop/s peak, 250 GB/s ⇒ balance point at AI ≈ 16 flops/byte. At a naive AI of 0.92, the dslash is bandwidth bound.
Krylov Solver Implementation
• The complete solver must be on the GPU
– Transfer b to GPU (reorder)
– Solve Mx = b
– Transfer x to CPU (reorder)
• Entire algorithms must run on GPUs
– The time-critical kernel is the stencil application (SpMV)
• Also require BLAS level-1 type operations
– e.g., AXPY operations: b += ax; NORM operations: c = (b,b)
– Roll our own kernels, for kernel fusion and custom precision
while (|r_k| > ε) {
  β_k = (r_k, r_k) / (r_{k-1}, r_{k-1})
  p_{k+1} = r_k + β_k p_k
  α = (r_k, r_k) / (p_{k+1}, A p_{k+1})
  r_{k+1} = r_k − α A p_{k+1}
  x_{k+1} = x_k + α p_{k+1}
  k = k + 1
}

conjugate gradient
Run-time autotuning
§ Motivation:
— Kernel performance (but not output) strongly dependent on launch parameters:
§ gridDim (trading off with work per thread), blockDim
§ blocks/SM (controlled by over-allocating shared memory)
§ Design objectives:
— Tune launch parameters for all performance-critical kernels at run-time as needed (on first launch).
— Cache optimal parameters in memory between launches.
— Optionally cache parameters to disk between runs.
— Preserve correctness.
Auto-tuned “warp-throttling”
§ Motivation: Increase reuse in limited L2 cache.
[Chart: dslash Gflop/s on GTX 580 and GTX 680 in double, single, and half precision, comparing tuning over blockDim only against tuning over blockDim plus blocks/SM (axis to 500 Gflop/s); adding blocks/SM throttling improves throughput.]
Run-time autotuning: Implementation
§ Parameters stored in a global cache:
  static std::map<TuneKey, TuneParam> tunecache;
§ TuneKey is a struct of strings specifying the kernel name, lattice volume, etc.
§ TuneParam is a struct specifying the tuned blockDim, gridDim, etc.
§ Kernels get wrapped in a child class of Tunable (next slide)
§ tuneLaunch() searches the cache and tunes if not found:
  TuneParam tuneLaunch(Tunable &tunable, QudaTune enabled, QudaVerbosity verbosity);
Run-time autotuning: Usage
§ Before:
  myKernelWrapper(a, b, c);
§ After:
  MyKernelWrapper *k = new MyKernelWrapper(a, b, c);
  k->apply();  // <-- automatically tunes if necessary
§ Here MyKernelWrapper inherits from Tunable and optionally overloads various virtual member functions (next slide).
§ Wrapping related kernels in a class hierarchy is often useful anyway, independent of tuning.
Virtual member functions of Tunable
§ Invoke the kernel (tuning if necessary):
— apply()
§ Save and restore state before/after tuning:
— preTune(), postTune()
§ Advance to the next set of trial parameters in the tuning:
— advanceGridDim(), advanceBlockDim(), advanceSharedBytes()
— advanceTuneParam()  // simply calls the above by default
§ Performance reporting:
— flops(), bytes(), perfString()
§ etc.
Kepler Wilson-Dslash Performance
[Chart: K20X dslash Gflop/s vs. temporal extent T (8–128) for V = 24³×T, comparing half and single precision with 8- and 12-number gauge reconstruction, including gauge-fixed ("GF") variants; axis 200–800 Gflop/s.]
Multi-dimensional lattice decomposition: multi-GPU parallelization
[Diagram: the lattice split across GPUs, with face exchange between neighboring partitions and wraparound communication at the lattice boundary.]
Domain Decomposition
• Non-overlapping blocks: simply switch off inter-GPU communication
• The preconditioner is a gross approximation
– Use an iterative solver to solve each domain system
– Require only ~10 iterations of the domain solver ⟹ 16-bit precision suffices
– Need a flexible outer solver ⟹ GCR
• A block-diagonal preconditioner imposes a λ cutoff
• Finer blocks lose long-wavelength/low-energy modes
– Keep wavelengths of O(Λ_QCD⁻¹), Λ_QCD⁻¹ ~ 1 fm
• Aniso clover (a_s = 0.125 fm, a_t = 0.035 fm) ⟹ 8³×32 blocks are ideal
– 48³×512 lattice: 8³×32 blocks ⟹ 3,456 GPUs
Clover Propagator Benchmark on Titan: Strong Scaling, QUDA+Chroma+QDP-JIT (PTX)
[Chart: sustained Tflop/s vs. Titan nodes/GPUs (0–4,608) for BiCGStab and DD+GCR on 72³×256 and 96³×256 lattices (axis to 450 Tflop/s); DD+GCR continues to scale where BiCGStab flattens out.]
B. Joo, F. Winter (JLab), M. Clark (NVIDIA)
Future Directions - Communication
• Only scratched the surface of domain-decomposition algorithms
– Disjoint additive
– Overlapping additive
– Alternating boundary conditions
– Random boundary conditions
– Multiplicative Schwarz
– Precision truncation
– Random Schwarz
Flexibility
• There is lots of variation in what constitutes "heterogeneous"
• Want the algorithm to autotune to both platforms and achieve optimal load balancing
Future Directions - Latency
• Global sums are bad
– Global synchronizations
– Performance fluctuations
• New algorithms are required
– s-step CG / BiCGstab, etc.
– e.g., pipelined CG vs. naive CG
• One-sided communication
– MPI-3 expands one-sided communications
– Cray Gemini has hardware support
– Asynchronous algorithms?
• Random Schwarz has exponential convergence
[Chart: Gflop/s vs. temporal extent (4–64) for naive CG vs. pipelined CG (axis 80–180 Gflop/s); the pipelined variant sustains higher throughput at small local volumes.]
GPU Roadmap
[Chart: NVIDIA roadmap of DP Gflop/s per watt (log scale, 0.5–32) vs. year (2008–2014+): Tesla (CUDA, FP64), Fermi, Kepler (dynamic parallelism), Maxwell (unified virtual memory), Volta (stacked DRAM).]
Future Directions - Precision
• Mixed-precision methods have become the de facto standard
– Mixed-precision Krylov solvers
– Low-precision preconditioners
• Exploit closer coupling of precision and algorithm
– Domain decomposition, adaptive multigrid
– Hierarchical-precision algorithms
– 128-bit ↔ 64-bit ↔ 32-bit ↔ 16-bit ↔ 8-bit
• Low precision is lossy compression
• Low-precision tolerance is fault tolerance