GTC 2014 | Mathias Wagner | Indiana University |
GPU-Based Lattice QCD Simulations as Thermometer for Heavy-Ion Collisions
Outline
• Strong interaction and the QCD phase transition
• Lattice QCD
• Optimizations
  • increasing flop/byte ratios
  • hiding latencies
  • exploring different strategies for parallelism
• GPU Boost and energy efficiency
Phase transitions
• water at different temperatures:
  • ice (solid)
  • water (liquid)
  • vapor (gas)
• the boiling point of water depends on pressure → phase diagram
[embedded paper page: P. Braun-Munzinger and J. Wambach, "The Phase Diagram of Strongly Interacting Matter", arXiv:0801.4256v2 [hep-ph] — shown for its Fig. 1, the phase diagram of H2O with its liquid, gaseous, crystalline, and amorphous phases, the triple point, and the critical endpoint between the vapor and liquid phases, as the everyday analogue of a phase diagram]
Phases of Quantum ChromoDynamics
•extreme conditions (temperatures, densities) are necessary to investigate properties of QCD
•important for understanding the evolution of the universe after the Big Bang
[diagram: strongly interacting matter from cold to hot — cold nuclear matter and hadron gas (quarks and gluons confined inside hadrons) → phase transition or crossover at Tc → dense hadronic matter → quark-gluon plasma (quarks and gluons are the degrees of freedom, (asymptotically) free)]
Accelerators ... the big ones
LHC @ CERN RHIC @ Brookhaven National Lab
Heavy Ion Experiments
• the phase transition occurs in heavy-ion collisions
• What thermometer can we use at 10^12 K?
• detectors measure the created particles
• interpreting the data requires theoretical input
• ab-initio approach: Lattice QCD
[diagram: Heavy Ion Collision → QGP (~10^12 K) → Expansion + Cooling → Hadronization]
Lattice QCD ingredients: Configuration generation
• sequential process
• use the RHMC algorithm to evolve the system in simulation time
• two dominant parts of the calculation (90% of the runtime):
  • fermion force: ~50% for improved actions (HISQ)
  • fermion matrix inversion: ~90% for the standard action

Steps in a Lattice QCD calculation
1. Generate an ensemble of gluon field configurations.
2. Compute quark propagators in these fixed backgrounds by solving the Dirac equation for various right-hand sides.
\dot{P} = -\frac{\partial H}{\partial Q} = -\frac{\partial S}{\partial Q} = -\left(\frac{\partial S_g}{\partial Q} + \frac{\partial S_f}{\partial Q}\right), \qquad \dot{Q} = \frac{\partial H}{\partial P} = P
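The molecular-dynamics equations above can be illustrated with a toy integrator. The following is a minimal sketch in plain C++, not the actual RHMC force computation: the names `force`, `hamiltonian`, and `leapfrog` are illustrative, and the quadratic action S(Q) = Q²/2 stands in for the gauge plus fermion action. The fictitious Hamiltonian H = P²/2 + S(Q) should be conserved up to O(dt²) by the leapfrog scheme.

```cpp
#include <cassert>
#include <cmath>

// Toy action S(Q) = Q^2/2 and its force -dS/dQ = -Q (hypothetical stand-in
// for the gauge + fermion forces dS_g/dQ + dS_f/dQ of the slides).
static double force(double q) { return -q; }
static double hamiltonian(double q, double p) { return 0.5 * p * p + 0.5 * q * q; }

// One leapfrog trajectory: half-step in P, full steps in Q, half-step in P.
static void leapfrog(double& q, double& p, double dt, int nsteps) {
  p += 0.5 * dt * force(q);       // initial half-step: Pdot = -dS/dQ
  for (int i = 0; i < nsteps - 1; ++i) {
    q += dt * p;                  // Qdot = dH/dP = P
    p += dt * force(q);
  }
  q += dt * p;
  p += 0.5 * dt * force(q);       // final half-step
}

// Energy violation after a trajectory starting at (q, p) = (1, 0).
double energy_violation(double dt, int nsteps) {
  double q = 1.0, p = 0.0;
  const double h0 = hamiltonian(q, p);
  leapfrog(q, p, dt, nsteps);
  return std::fabs(hamiltonian(q, p) - h0);
}
```

In RHMC the update has the same structure, but the force evaluation (the fermion force below) dominates the cost, which is why it is one of the two main targets for GPU acceleration.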
Dynamical Simulations with Highly Improved Staggered Quarks (R. M. Woloshyn)

Figure 1: Paths used in the ASQTAD and HISQ actions. Path coefficients for ASQTAD can be found in Ref. [4]. The HISQ effective links are constructed by first applying a Fat7 fattening to the base links (U → U_F, with coefficients 1-link: 1/8, 3-staple: 1/16, 5-staple: 1/64, 7-staple: 1/384), then an SU(3) projection (U_F → U_R), and finally an "ASQ" smearing (U_R → U_eff, with coefficients 1-link: 1 + ε/8, 3-staple: 1/16, 5-staple: 1/64, 7-staple: 1/384, Lepage: −1/8, Naik: −(1 + ε)/24, where the parameter ε is introduced to remove (am)^4 errors [9]).
The pseudo-fermion field Φ is defined on even lattice sites only to avoid a doubling of flavors from using M†M instead of M in the action. This procedure is valid since M†M has no matrix element connecting even and odd lattice sites.

A key component in dynamical simulations using molecular dynamics evolution is the computation of the fermion force, the derivative of the fermion action with respect to the base links
f_{x,\mu} = \frac{\partial S_f}{\partial U_{x,\mu}} = \frac{\partial}{\partial U_{x,\mu}} \left\langle \Phi \right| \left[ M^\dagger[U]\, M[U] \right]^{-n_f/4} \left| \Phi \right\rangle . \quad (2.3)
The derivative can be computed straightforwardly if n_f is a multiple of 4; for other numbers of fermion flavors the 4th root of M†M can be approximated by a rational expansion (the RHMC algorithm [17, 18, 19])

\left[ M^\dagger M \right]^{-n_f/4} \approx \alpha_0 + \sum_l \frac{\alpha_l}{M^\dagger M + \beta_l}, \quad (2.4)
where α_l and β_l are constants. The derivative becomes

\frac{\partial S_f}{\partial U_{x,\mu}} = -\sum_l \alpha_l \left\langle \Phi \left[ M^\dagger M + \beta_l \right]^{-1} \right| \frac{\partial}{\partial U_{x,\mu}} \left( M^\dagger[U]\, M[U] \right) \left| \left[ M^\dagger M + \beta_l \right]^{-1} \Phi \right\rangle
= -\sum_l \alpha_l \left( \left\langle X^l \right| \frac{\partial D^\dagger[U]}{\partial U_{x,\mu}} \left| Y^l \right\rangle + \left\langle Y^l \right| \frac{\partial D[U]}{\partial U_{x,\mu}} \left| X^l \right\rangle \right), \quad (2.5)
with |X^l⟩ = [M†M + β_l]^{-1}|Φ⟩ and |Y^l⟩ = D|X^l⟩. Note that X^l and Y^l are defined on even and odd sites, respectively. Taking the derivatives of D, D† with respect to U_eff, U_eff† and writing out the matrix indices we have
\left[ f_{x,\mu} \right]_{ab} = \frac{\partial S_f}{\partial \left[ U_{x,\mu} \right]_{ab}} = \sum_{y,\nu} (-1)^y \eta_{y,\nu} \left( \frac{\partial \left[ U^{\mathrm{eff}}_{y,\nu} \right]_{mn}}{\partial \left[ U_{x,\mu} \right]_{ab}} \left[ f^{(0)}_{y,\nu} \right]_{mn} + \frac{\partial \left[ U^{\mathrm{eff}\dagger}_{y,\nu} \right]_{mn}}{\partial \left[ U_{x,\mu} \right]_{ab}} \left[ f^{(0)\dagger}_{y,\nu} \right]_{mn} \right), \quad (2.6)

where f^{(0)}_{y,ν} is the vector outer product of the field variables at y and y + ν:

\left[ f^{(0)}_{y,\nu} \right]_{mn} = \begin{cases} \sum_l \alpha_l \left[ Y^l_{y+\nu} \right]_n \left[ X^{l*}_{y} \right]_m & \text{for even } y \\ \sum_l \alpha_l \left[ X^l_{y+\nu} \right]_n \left[ Y^{l*}_{y} \right]_m & \text{for odd } y \end{cases} \quad (2.7)
Measurements: Get some physical observables
• Fluctuations are a great observable to explore the QCD phase diagram
• noisy estimators → large number of random vectors η (~1500 per configuration)
• up to 20000 configurations for each temperature
• dominant operation: fermion matrix inversion (~99%)
\frac{\partial \ln\det M}{\partial\mu} = \mathrm{Tr}\left( M^{-1} \frac{\partial M}{\partial\mu} \right)

\frac{\partial^2 \ln\det M}{\partial\mu^2} = \mathrm{Tr}\left( M^{-1} \frac{\partial^2 M}{\partial\mu^2} \right) - \mathrm{Tr}\left( M^{-1} \frac{\partial M}{\partial\mu} M^{-1} \frac{\partial M}{\partial\mu} \right)

\vdots

\mathrm{Tr}\left( \frac{\partial^{n_1} M}{\partial\mu^{n_1}} M^{-1} \frac{\partial^{n_2} M}{\partial\mu^{n_2}} \cdots M^{-1} \right) = \lim_{N\to\infty} \frac{1}{N} \sum_{k=1}^{N} \eta_k^\dagger\, \frac{\partial^{n_1} M}{\partial\mu^{n_1}} M^{-1} \frac{\partial^{n_2} M}{\partial\mu^{n_2}} \cdots M^{-1} \eta_k
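The noisy estimator in the last line replaces an exact trace by an average over random vectors η_k. A self-contained sketch of the idea on a small dense matrix in plain C++ (the matrix here is an arbitrary stand-in, not the fermion matrix, and Z₂ (±1) noise is used for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Estimate Tr(A) as (1/N) * sum_k eta_k^T A eta_k with Rademacher (+-1) noise.
// For eta with independent +-1 entries, E[eta_i eta_j] = delta_ij, so the
// estimator is unbiased; only off-diagonal terms of A contribute noise.
double stochastic_trace(const std::vector<std::vector<double>>& A,
                        int nsamples, unsigned seed) {
  const std::size_t n = A.size();
  std::mt19937 rng(seed);
  std::bernoulli_distribution coin(0.5);
  double acc = 0.0;
  for (int k = 0; k < nsamples; ++k) {
    std::vector<double> eta(n);
    for (auto& e : eta) e = coin(rng) ? 1.0 : -1.0;
    double quad = 0.0;  // eta^T A eta
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j)
        quad += eta[i] * A[i][j] * eta[j];
    acc += quad;
  }
  return acc / nsamples;
}
```

In the setting above, A stands for operator products like ∂M/∂μ · M⁻¹, so every sample requires one or more fermion-matrix inversions; with ~1500 random vectors per configuration this is why the inversion dominates the measurement runtime.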
Mapping the Wilson-Clover operator to CUDA
• Each thread must
• Load the neighboring spinor (24 numbers x8)
• Load the color matrix connecting the sites (18 numbers x8)
• Load the clover matrix (72 numbers)
• Save the result (24 numbers)
• Arithmetic intensity
• 3696 floating point operations per site
• 2976 bytes per site (single precision)
• naive arithmetic intensity: 1.24 flops/byte
review basic details of the LQCD application and of NVIDIA GPU hardware. We then briefly consider some related work in Section IV before turning to a general description of the QUDA library in Section V. Our parallelization of the quark interaction matrix is described in VI, and we present and discuss our performance data for the parallelized solver in Section VII. We finish with conclusions and a discussion of future work in Section VIII.
II. LATTICE QCD

The necessity for a lattice-discretized formulation of QCD arises due to the failure of perturbative approaches commonly used for calculations in other quantum field theories, such as electrodynamics. Quarks, the fundamental particles that are at the heart of QCD, are described by the Dirac operator acting in the presence of a local SU(3) symmetry. On the lattice, the Dirac operator becomes a large sparse matrix, M, and the calculation of quark physics is essentially reduced to many solutions to systems of linear equations given by

Mx = b. (1)

The form of M on which we focus in this work is the Sheikholeslami-Wohlert [6] (colloquially known as Wilson-clover) form, which is a central-difference discretization of the Dirac operator. When acting in a vector space that is the tensor product of a 4-dimensional discretized Euclidean spacetime, spin space, and color space it is given by
M_{x,x'} = -\frac{1}{2} \sum_{\mu=1}^{4} \left( P^{-\mu} \otimes U^{\mu}_{x}\, \delta_{x+\hat\mu,x'} + P^{+\mu} \otimes U^{\mu\dagger}_{x-\hat\mu}\, \delta_{x-\hat\mu,x'} \right) + (4 + m + A_x)\, \delta_{x,x'} \equiv -\frac{1}{2} D_{x,x'} + (4 + m + A_x)\, \delta_{x,x'}. \quad (2)
Here δ_{x,y} is the Kronecker delta; P^{±μ} are 4×4 matrix projectors in spin space; U is the QCD gauge field, which is a field of special unitary 3×3 (i.e., SU(3)) matrices acting in color space that live between the spacetime sites (and hence are referred to as link matrices); A_x is the 12×12 clover matrix field acting in both spin and color space,¹ corresponding to a first-order discretization correction; and m is the quark mass parameter. The indices x and x′ are spacetime indices (the spin and color indices have been suppressed for brevity). This matrix acts on a vector consisting of a complex-valued 12-component color-spinor (or just spinor) for each point in spacetime. We refer to the complete lattice vector as a spinor field.
Since M is a large sparse matrix, an iterative Krylov solver is typically used to obtain solutions to (1), requiring many repeated evaluations of the sparse matrix-vector product. The matrix is non-Hermitian, so either Conjugate Gradients [7] on the normal equations (CGNE or CGNR) is used, or more commonly, the system is solved directly using a non-symmetric method, e.g., BiCGstab [8]. Even-odd (also known as red-black) preconditioning is used to accelerate the solution-finding process, where the nearest-neighbor property of the D_{x,x'} matrix (see Fig. 1) is exploited to solve the Schur complement system [9]. This has no effect on the overall efficiency since the fields are reordered such that all components of a given parity are contiguous. The quark mass controls the condition number of the matrix, and hence the convergence of such iterative solvers. Unfortunately, physical quark masses correspond to nearly indefinite matrices. Given that current leading lattice volumes are 32³ × 256, for > 10⁸ degrees of freedom in total, this represents an extremely computationally demanding task.

¹ Each clover matrix has a Hermitian block-diagonal, anti-Hermitian block off-diagonal structure, and can be fully described by 72 real numbers.

Fig. 1. The nearest-neighbor stencil part of the lattice Dirac operator D, as defined in (2), in the μ–ν plane. The color-spinor fields are located on the sites. The SU(3) color matrices U^μ_x are associated with the links. The nearest-neighbor nature of the stencil suggests a natural even-odd (red-black) coloring for the sites.
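The even-odd (red-black) reordering described above can be made concrete with a small indexing sketch in plain C++ (lattice extents and the `Lattice` helper are illustrative, not from any particular code). Sites are stored so that all sites of one parity are contiguous, and within a parity block a site keeps half of its lexicographic index:

```cpp
#include <cassert>
#include <vector>

// Even-odd (red-black) site indexing for a 4D lattice with even extents.
// Parity 0 = even sites, parity 1 = odd sites; each block holds V/2 sites.
struct Lattice {
  int Lx, Ly, Lz, Lt;
  int volume() const { return Lx * Ly * Lz * Lt; }
  int parity(int x, int y, int z, int t) const { return (x + y + z + t) & 1; }
  // Storage index: [parity block] + [lexicographic index / 2].
  // Consecutive lexicographic sites differ in parity, so lex/2 enumerates
  // each parity class exactly once (requires even Lx).
  int eo_index(int x, int y, int z, int t) const {
    int lex = x + Lx * (y + Ly * (z + Lz * t));
    return parity(x, y, z, t) * (volume() / 2) + lex / 2;
  }
};
```

Because the mapping is a bijection onto 0 … V−1 with each parity contiguous, the reordering itself costs nothing, which is the point made in the excerpt above.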
III. GRAPHICS PROCESSING UNITS
In the context of general-purpose computing, a GPU is effectively an independent parallel processor with its own locally-attached memory, herein referred to as device memory. The GPU relies on the host, however, to schedule blocks of code (or kernels) for execution, as well as for I/O. Data is exchanged between the GPU and the host via explicit memory copies, which take place over the PCI-Express bus. The low-level details of the data transfers, as well as management of the execution environment, are handled by the GPU device driver and the runtime system.

It follows that a GPU cluster embodies an inherently heterogeneous architecture. Each node consists of one or more processors (the CPU) that is optimized for serial or moderately parallel code and attached to a relatively large amount of memory capable of tens of GB/s of sustained bandwidth. At the same time, each node incorporates one or more processors (the GPU) optimized for highly parallel code attached to a relatively small amount of very fast memory, capable of 150 GB/s or more of sustained bandwidth. The challenge we face is that these two powerful subsystems are connected by a narrow communications channel, the PCI-E bus, which sustains at most 6 GB/s and often less. As a consequence, it is critical to avoid unnecessary transfers between the GPU and the host.
Standard staggered fermion matrix (Dslash)
• Krylov-space inversion of the fermion matrix dominates the runtime
• within the inversion, application of the sparse matrix dominates (>80%)
• memory: 8 SU(3) matrices input, 8 color vectors input, 1 color vector output
  8 × (72 + 24) bytes + 24 bytes = 792 bytes (1584 for double precision)
• flops (CM = complex mult, CA = complex add):
  4 × (2 × 3 × (3 CM + 2 CA) + 3 CA) + 3 × 3 CA = 570 flops
• flops/byte ratio: 0.72
w_x = D_{x,x'}\, v_{x'} = \sum_{\mu=0}^{3} \left\{ U_{x,\mu}\, v_{x+\hat\mu} - U^\dagger_{x-\hat\mu,\mu}\, v_{x-\hat\mu} \right\}
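The per-site counts on this slide can be checked mechanically. A small plain-C++ sketch reproducing the slide's arithmetic (the names are illustrative):

```cpp
#include <cassert>

// Flop and byte counts per lattice site for the standard staggered Dslash,
// following the slide: CM = complex multiply (6 flops), CA = complex add (2 flops).
constexpr int CM = 6, CA = 2;

// 4 dimensions x (SU(3) matrix-vector products for the forward and backward
// links plus combining adds), plus the final accumulation over directions.
constexpr int staggered_flops = 4 * (2 * 3 * (3 * CM + 2 * CA) + 3 * CA) + 3 * 3 * CA;

// Single precision: 8 links (18 floats = 72 B) + 8 neighbor vectors
// (6 floats = 24 B) read, 1 result vector (24 B) written.
constexpr int staggered_bytes_single = 8 * (72 + 24) + 24;

constexpr double staggered_intensity =
    static_cast<double>(staggered_flops) / staggered_bytes_single;
```

The resulting 570 flops over 792 bytes gives the quoted naive intensity of about 0.72 flops/byte, far below what a GPU needs to be compute-bound, which motivates the multi-r.h.s. strategy on the next slide.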
Solvers for multiple right-hand sides
• need inversions for many (~1500) 'source' vectors for a fixed gauge field (matrix)
• instead of inverting for one vector at a time, apply the matrix to n right-hand sides at once
• bytes for n vectors: 16 · (72 + n · 24) bytes + n · 24 bytes = 1152 bytes + 408 bytes · n
• flops for n vectors: 1146 flops · n

# r.h.s.     1     2     3     4     5
flops/byte   0.73  1.16  1.45  1.65  1.8
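The flops/byte row of the table follows directly from the two formulas above; a short plain-C++ check using the slide's per-site counts (the function name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Arithmetic intensity for n right-hand sides sharing one gauge field:
// the 16 link matrices (72 B each) are read once, while the vector traffic
// (24 B in per link, 24 B out) scales with n.
double flops_per_byte(int n) {
  double bytes = 16.0 * (72.0 + n * 24.0) + n * 24.0;  // = 1152 + 408 n
  double flops = 1146.0 * n;
  return flops / bytes;
}
```

The matrix reads are amortized over the n vectors, so the intensity rises from 0.73 toward the asymptotic 1146/408 ≈ 2.8 flops/byte as n grows.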
\left( w^{(1)}_x,\, w^{(2)}_x,\, \ldots,\, w^{(n)}_x \right) = D_{x,x'} \left( v^{(1)}_{x'},\, v^{(2)}_{x'},\, \ldots,\, v^{(n)}_{x'} \right)
[plot: Dslash performance in GFlops (0–600) vs. number of r.h.s. (1–5) — estimated for M2075, GTX 580, K20, and GTX Titan; measured for M2075 and K20]
Multiple right hand sides using cache and registers
•obvious solution: store matrix in registers
•possible issue: more registers / thread → occupancy / spilling
__global__ Dslashreg(w1, w2, w3, v1, v2, v3) {
  ...
  for (xp = ...) {
    w1(x) = D(x,xp) * v1(xp);
    w2(x) = D(x,xp) * v2(xp);
    w3(x) = D(x,xp) * v3(xp);
  }
}

__global__ Dslashcache(w, v) {
  ...
  offset = threadIdx.y;
  for (xp = ...)
    w(x, offset) += D(x,xp) * v(xp, offset);
}

__global__ Dslashregcache(w1, w2, w3, v1, v2, v3) {
  ...
  offset = threadIdx.y;
  for (xp = ...) {
    w1(x, offset) = D(x,xp) * v1(xp, offset);
    w2(x, offset) = D(x,xp) * v2(xp, offset);
    w3(x, offset) = D(x,xp) * v3(xp, offset);
  }
}
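The idea behind the register-blocked Dslash variant can be mimicked on the CPU: load each matrix element once and apply it to several right-hand sides before moving on. A minimal plain-C++ sketch with a dense toy matrix instead of the stencil (all names are illustrative):

```cpp
#include <cassert>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Baseline: one full matrix sweep per right-hand side
// (the matrix is read n_rhs times).
std::vector<Vec> apply_single(const Mat& D, const std::vector<Vec>& v) {
  std::vector<Vec> w(v.size(), Vec(D.size(), 0.0));
  for (std::size_t r = 0; r < v.size(); ++r)
    for (std::size_t x = 0; x < D.size(); ++x)
      for (std::size_t xp = 0; xp < D.size(); ++xp)
        w[r][x] += D[x][xp] * v[r][xp];
  return w;
}

// "Register-blocked" variant: each D(x,xp) is loaded once and reused for
// all right-hand sides, raising the flop/byte ratio as on the slide.
std::vector<Vec> apply_batched(const Mat& D, const std::vector<Vec>& v) {
  std::vector<Vec> w(v.size(), Vec(D.size(), 0.0));
  for (std::size_t x = 0; x < D.size(); ++x)
    for (std::size_t xp = 0; xp < D.size(); ++xp) {
      const double d = D[x][xp];  // one load, n_rhs uses
      for (std::size_t r = 0; r < v.size(); ++r)
        w[r][x] += d * v[r][xp];
    }
  return w;
}
```

Both loop orders compute the same result; only the memory-reuse pattern differs, which is exactly the trade-off the register, cache, and combined kernel variants explore on the GPU.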
•exploit texture cache → avoid register pressure
•load matrix through texture cache
•should hit in texture cache → only one global load
[diagram: thread-block layout — threadIdx.y selects the r.h.s.; sites x = 0 … BS−1 are processed for each of the vectors v1, v2, v3]
•combine both and explore best possible combinations
[plot: time per r.h.s. in ms (0–16) vs. number of r.h.s. (0–12), comparing kernels that keep 1, 2, or 3 r.h.s. in registers]
[plot: performance in GFlops (0–500) vs. number of r.h.s. (0–12)]
[diagram: solver loop — Initialization; then loop: fermion matrix application (Dslash); reduction α = Σ_i |p A p|_i; vector updates r ← r − ωp, x ← x + ωp; norm reduction δ_i = |r_i|, δ = Σ_i δ_i; search-direction update p ← r + βp]
Importance of linear algebra
•using multiple r.h.s. provides massive speedup for Dslash
•linear algebra scales linearly
•more important for smaller lattice sizes
[Charts: fraction of time spent in linear algebra (up to ~0.28) and percentage breakdown into Dslash / Linear Algebra / Rest, for Nt=8 and Nt=6]
GTC 2014 | Mathias Wagner | Indiana University |
[Chart: relative performance, 0.8–1.0, of the sync / streams / shuffle reduction variants, for Nt=6 and Nt=8]
A linear algebra setup for multiple r.h.s.
[Pipeline per r.h.s.: Dslash → reduction → Linear Algebra → reduction → D2H copy; WaitEvent → Linear Algebra → D2H; sync to check for convergence]
reduceSinglePassAsync(..., stream=Stream1);
// Event will be recorded once the reduction kernel has finished
cudaEventRecord(ReductionDone1, Stream1);
// make CopyStream wait until the reduction is done - signaled by the event
// this does not block the host
cudaStreamWaitEvent(CopyStream, ReductionDone1, 0);
// already issue the async copy command to CopyStream
// it will only execute after the StreamWaitEvent
cudaMemcpyAsync(..., CopyStream);
GTC 2014 | Mathias Wagner | Indiana University |
•Fermion Force includes a lot of SU(3) matrix multiplications
•loop over possible orientations in 4 dimensional space-time
•each interior function
•multiplies up to 7 SU(3) matrices
•may contain a sum over up to 56 of these multiplications
•How do different ways to perform the loops affect the performance?
Optimizing the fermion force, or 'Where to put the loops?'
[Background: R. M. Woloshyn, 'Dynamical Simulations with Highly Improved Staggered Quarks']
Figure 1: Paths used in the ASQTAD and HISQ actions. Path coefficients for ASQTAD can be found in Ref. [4]. The HISQ effective links are constructed by first applying a Fat7 fattening to the base links (U → U_F, with coefficients 1-link: 1/8, 3-staple: 1/16, 5-staple: 1/64, 7-staple: 1/384), then an SU(3) projection (U_F → U_R), and finally an 'ASQ' smearing (U_R → U_eff, with coefficients 1-link: 1 + ε/8, 3-staple: 1/16, 5-staple: 1/64, 7-staple: 1/384, Lepage: −1/8, Naik: −(1+ε)/24, where the parameter ε is introduced to remove (am)^4 errors [9]).
The pseudo-fermion field Φ is defined on even lattice sites only to avoid a doubling of flavors from using M†M instead of M in the action. This procedure is valid since M†M has no matrix element connecting even and odd lattice sites.

A key component in dynamical simulations using molecular dynamics evolution is the computation of the fermion force, the derivative of the fermion action with respect to the base links:

f_{x,\mu} = \frac{\partial S_f}{\partial U_{x,\mu}} = \frac{\partial}{\partial U_{x,\mu}} \left\langle \Phi \right| \left[ M^\dagger[U] M[U] \right]^{-n_f/4} \left| \Phi \right\rangle . \quad (2.3)

The derivative can be computed straightforwardly if n_f is a multiple of 4; for other numbers of fermion flavors the 4th root of M†M can be approximated by a rational expansion (the RHMC algorithm [17, 18, 19])

[M^\dagger M]^{-n_f/4} \approx \alpha_0 + \sum_l \frac{\alpha_l}{M^\dagger M + \beta_l} , \quad (2.4)

where α_l and β_l are constants. The derivative becomes

\frac{\partial S_f}{\partial U_{x,\mu}} = -\sum_l \alpha_l \left\langle \Phi [M^\dagger M + \beta_l]^{-1} \right| \frac{\partial}{\partial U_{x,\mu}} \left( M^\dagger[U] M[U] \right) \left| [M^\dagger M + \beta_l]^{-1} \Phi \right\rangle
= -\sum_l \alpha_l \left( \left\langle X^l \right| \frac{\partial D^\dagger[U]}{\partial U_{x,\mu}} \left| Y^l \right\rangle + \left\langle Y^l \right| \frac{\partial D[U]}{\partial U_{x,\mu}} \left| X^l \right\rangle \right) , \quad (2.5)

with |X^l⟩ = [M†M + β_l]^{-1} |Φ⟩ and |Y^l⟩ = D |X^l⟩. Note that X^l and Y^l are defined on even and odd sites respectively. Taking the derivatives of D, D† with respect to U_eff, U_eff† and writing out the matrix indices we have

\left[ f_{x,\mu} \right]_{ab} = \frac{\partial S_f}{\partial [U_{x,\mu}]_{ab}} = \sum_{y,\nu} (-1)^y \eta_{y,\nu} \left( \frac{\partial [U^{eff}_{y,\nu}]_{mn}}{\partial [U_{x,\mu}]_{ab}} \left[ f^{(0)}_{y,\nu} \right]_{mn} + \frac{\partial [U^{eff\dagger}_{y,\nu}]_{mn}}{\partial [U_{x,\mu}]_{ab}} \left[ f^{(0)\dagger}_{y,\nu} \right]_{mn} \right) , \quad (2.6)

where f^{(0)}_{y,ν} is the vector outer product of the field variables at y and y+ν:

\left[ f^{(0)}_{y,\nu} \right]_{mn} = \begin{cases} \sum_l \alpha_l [Y^l_{y+\nu}]_n [X^{l*}_y]_m & \text{for even } y \\ \sum_l \alpha_l [X^l_{y+\nu}]_n [Y^{l*}_y]_m & \text{for odd } y \end{cases} . \quad (2.7)
F_{x,\mu} = \sum_{\nu=0,\, \nu\neq\mu}^{3} \; \sum_{\rho=0,\, \rho\neq\mu,\nu}^{3} f_{x,\mu,\nu,\rho}
// host: plain loop nest over sites and directions
for(int mu=0; mu<4; mu++)
  for(int x=0; x<N; x++)
    for(int nu=0; nu<4; nu++)
      // ensure nu != mu
      for(int rho=0; rho<4; rho++)
        // ensure rho != nu, rho != mu
        f(x, mu, nu, rho)

// grid x; internal mu, nu, rho
Kernel<<< blocksize, gridsize >>>( args, x )
__global__ Kernel(args, x){
  for(int mu=0; mu<4; mu++)
    for(int nu=0; nu<4; nu++)
      // ensure nu != mu
      for(int rho=0; rho<4; rho++)
        // ensure rho != mu, rho != nu
        f(x, mu, nu, rho)
}

// external mu; grid x; internal nu, rho
for(int mu=0; mu<4; mu++)
  Kernel<<< blocksize, gridsize >>>( args, x, mu )
__global__ Kernel(args, x, mu){
  for(int nu=0; nu<4; nu++)
    // ensure nu != mu
    for(int rho=0; rho<4; rho++)
      // ensure rho != mu, rho != nu
      f(x, mu, nu, rho)
}

// external mu, nu, rho; grid x
for(int mu=0; mu<4; mu++)
  for(int nu=0; nu<4; nu++)
    // ensure nu != mu
    for(int rho=0; rho<4; rho++)
      // ensure rho != mu, rho != nu
      Kernel<<< blocksize, gridsize >>>( args, x, mu, nu, rho )
__global__ Kernel(args, x, mu, nu, rho){
  f(x, mu, nu, rho)
}

// external nu, rho; grid x*mu
for(int nu=0; nu<4; nu++)
  // ensure nu != mu
  for(int rho=0; rho<4; rho++)
    // ensure rho != mu, rho != nu
    Kernel<<< blocksize, gridsize >>>( args, xmu )
__global__ Kernel(args, xmu){
  int x  = xmu % N;
  int mu = xmu / N;
  f(x, mu, nu, rho)
}

// grid xmu, block (xmu, nu, rho); internal nu, rho
// ensure rho != mu, rho != nu: 3 (2) allowed nu (rho) values
Kernel<<< (blocksize,3,2), gridsize >>>( args, xmu )
__global__ Kernel(args, xmu){
  int x   = xmu % N;
  int mu  = xmu / N;
  int nu  = threadIdx.y;
  int rho = threadIdx.z;
  f(x, mu, nu, rho)
}
GTC 2014 | Mathias Wagner | Indiana University |
Consider some strategies for parallelism and loops
•What comes to your mind?
•let threads run over x
•use external µ
•use external µ, ν, ρ
•let threads run over x and µ, keep ν, ρ external
•let threads run over x and µ, use a 3d block for ν, ρ
F_{x,\mu} = \sum_{\nu=0,\, \nu\neq\mu}^{3} \; \sum_{\rho=0,\, \rho\neq\mu,\nu}^{3} f_{x,\mu,\nu,\rho}
[Diagram: thread blocks indexed by x and the (µ, ν, ρ) combinations, e.g. x = 0 … BS-1 for (0,1,2), (0,2,1), (0,3,1), …, up to x = N-BS … N-1 for (3,2,1)]
GTC 2014 | Mathias Wagner | Indiana University |
Benchmark some strategies for parallelism and loops
[Charts: time [ms], 0–1800, and relative performance, 0.75–1.5, for the x-grid, ext µ, ext µ ν ρ, xµ-grid, and block ν ρ variants; Nt=8 and Nt=6 on K20 and M2075]
GTC 2014 | Mathias Wagner | Indiana University |
‘nvidia-smi -ac 3004,875’ or Unleashing the K40
•K40 features GPU Boost
•increase clock from 745 to 875 MHz (~15%)
•within power headroom
•memory clock stays constant
•How does it affect bandwidth-bound applications?
[Chart: relative speedup, 0–1, of Trajectory, Inversion, and F.Force on M2075, K20, K20X, K40, and K40 @ 875]
GTC 2014 | Mathias Wagner | Indiana University |
A closer look at the speedups
•Inversion: matrix-vector multiplication
•rather simple memory access pattern
•Force: long chain of small matrix-multiplications
•Inversions scale with frequency up to 810 MHz
•Force shows weaker dependency
•but scales up to 875 MHz
•apply K20's bandwidth/flops ratio to K40 → 845 MHz
[Charts: relative performance vs. clock for K40 @ 666/745/810/875 MHz (Inversion, F.Force, 1/Freq) and for K20 @ 666/705/758 MHz]
GTC 2014 | Mathias Wagner | Indiana University |
Is it worth replacing Fermi cards?
•K40 is about a factor of two faster than the M2075
•still not fully exploiting Kepler
•For example:
•replace 100 M2075
•needs only 54 K40 (~ $250,000)
•power savings?
GTC 2014 | Mathias Wagner | Indiana University |
Real world power consumption of a RHMC trajectory
•Kepler allows identification of code parts from the power trace
•code stays within the power limits
•GPU Boost can be scaled
[Charts: power [W] vs. time [s] over a trajectory for M2075, K20, and K40, and for K40 vs. K40 @ 875 (scaled)]
GTC 2014 | Mathias Wagner | Indiana University |
Assume universities had business administration …
•Kepler is two times more efficient
•GPU environment: ~ 300 W / node
•K40 consumes less host power → 'run to idle'
•still, for 100 M2075 → 54 K40:
•energy savings: 130 MWh / year
•Germany: € 26,000 / year
•US: $ 19,000 / year
•this ignores cooling …
[Charts: average power per GPU [W] (per GPU in a 2-GPU system), time per trajectory [s], and energy per trajectory [kWh] split into GPU / host / system, for M2075, K20, K40 @ 745/810/875; GPU-only energy per trajectory ranges from 0.041 kWh (M2075) down to ~0.02 kWh (K40), full-system from 0.038 down to ~0.02 kWh]
GTC 2014 | Mathias Wagner | Indiana University |
Status of lattice data
[Plots: cumulant ratios r2, r4, r6 of baryon number (B), electric charge (Q), and strangeness (S) fluctuations vs. T = 140–240 MeV; Tc = 154(9) MeV, Nτ = 6, 8; BNL-Bielefeld preliminary]
•highly improved staggered quarks, close to the physical pion mass (ml/ms = 1/20)
GTC 2014 | Mathias Wagner | Indiana University |
Thermometer for freeze-out in heavy-ion collisions
•How can one build a thermometer?
•Oldest idea: the height of a mercury column
•a relation between a length and temperature
•Construct one from fluctuation observables
•with a monotonic and strong temperature dependence
•that can be / have been measured experimentally
•that can be calculated from first principles
R^Q_{31}(T, \mu_B) = R^{Q,0}_{31} + R^{Q,2}_{31}\, \mu_B^2
[Plot: R^Q_{31} vs. T = 140–240 MeV for µB/T = 0 and 1, Nτ = 6 and 8; HRG and free-gas limits; bands: continuum estimate]
BNL-BI, PRL 109 (2012) 192302
S. Mukherjee, MW, CPOD 2013, arXiv:1307.6255
GTC 2014 | Mathias Wagner | Indiana University |
Summary
•Increasing the flop / byte ratio without leaving linear algebra behind
•Optimizing parallelism and loops
•Kepler GPU Boost and energy consumption
•GPUs allow us to push the borders of Lattice QCD
•multiple right-hand-side solvers using mixed precision
•overcome memory limitations for multiple r.h.s. with pipelining
•Runs on Titan:Configuration generation on CPUs and ‘live-measurements’ on the GPU
GTC 2014 | Mathias Wagner | Indiana University |
Thank you and our supporters
Financial support NSF PHY-1212389 DFG GSI BILAER
GPU Clusters at Bielefeld University (Ger) Paderborn University (Ger) Indiana University (US) Jefferson Laboratory (US)
Contact: [email protected]