GTC 2014 | Mathias Wagner | Indiana University |
GPU-Based Lattice QCD Simulations as Thermometer for Heavy-Ion Collisions
Outline
• Strong interaction and the QCD phase transition
• Lattice QCD
• Optimizations
  • increasing flop/byte ratios
  • hiding latencies
  • exploring different strategies for parallelism
• GPU Boost and energy efficiency
Phase transitions
• water at different temperatures:
  • ice (solid)
  • water (liquid)
  • vapor (gas)
• the boiling point of water depends on pressure → phase diagram
[embedded paper page: P. Braun-Munzinger and J. Wambach, "The Phase Diagram of Strongly Interacting Matter", arXiv:0801.4256v2 [hep-ph] — shown for its Fig. 1, the phase diagram of H2O with its liquid, gaseous, crystalline, and amorphous phases, the triple point, and the critical endpoint between the vapor and liquid phases, as the everyday analogue of a phase diagram]
Phases of Quantum ChromoDynamics
•extreme conditions (temperatures, densities) are necessary to investigate properties of QCD
•important for understanding the evolution of the universe after the Big Bang
[diagram: strongly interacting matter from cold to hot — cold nuclear matter and hadron gas (quarks and gluons confined inside hadrons) → phase transition or crossover at Tc → dense hadronic matter → quark-gluon plasma (quarks and gluons are the degrees of freedom, (asymptotically) free)]
Accelerators ... the big ones
LHC @ CERN RHIC @ Brookhaven National Lab
Heavy Ion Experiments
• the phase transition occurs in heavy-ion collisions
• What thermometer can we use at 10^12 K?
• detectors measure the created particles
• interpreting the data requires theoretical input
• ab-initio approach: Lattice QCD
[diagram: Heavy Ion Collision → QGP (~10^12 K) → Expansion + Cooling → Hadronization]
Lattice QCD ingredients: Configuration generation
• sequential process
• use the RHMC algorithm to evolve the system in simulation time
• two dominant parts of the calculation (90% of the runtime):
  • fermion force: ~50% for improved actions (HISQ)
  • fermion matrix inversion: ~90% for the standard action

Steps in a Lattice QCD calculation
1. Generate an ensemble of gluon field configurations.
2. Compute quark propagators in these fixed backgrounds by solving the Dirac equation for various right-hand sides.
\dot{P} = -\frac{\partial H}{\partial Q} = -\frac{\partial S}{\partial Q} = -\left(\frac{\partial S_g}{\partial Q} + \frac{\partial S_f}{\partial Q}\right), \qquad \dot{Q} = \frac{\partial H}{\partial P} = P
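The molecular-dynamics equations above can be illustrated with a toy integrator. The following is a minimal sketch in plain C++, not the actual RHMC force computation: the names `force`, `hamiltonian`, and `leapfrog` are illustrative, and the quadratic action S(Q) = Q²/2 stands in for the gauge plus fermion action. The fictitious Hamiltonian H = P²/2 + S(Q) should be conserved up to O(dt²) by the leapfrog scheme.

```cpp
#include <cassert>
#include <cmath>

// Toy action S(Q) = Q^2/2 and its force -dS/dQ = -Q (hypothetical stand-in
// for the gauge + fermion forces dS_g/dQ + dS_f/dQ of the slides).
static double force(double q) { return -q; }
static double hamiltonian(double q, double p) { return 0.5 * p * p + 0.5 * q * q; }

// One leapfrog trajectory: half-step in P, full steps in Q, half-step in P.
static void leapfrog(double& q, double& p, double dt, int nsteps) {
  p += 0.5 * dt * force(q);       // initial half-step: Pdot = -dS/dQ
  for (int i = 0; i < nsteps - 1; ++i) {
    q += dt * p;                  // Qdot = dH/dP = P
    p += dt * force(q);
  }
  q += dt * p;
  p += 0.5 * dt * force(q);       // final half-step
}

// Energy violation after a trajectory starting at (q, p) = (1, 0).
double energy_violation(double dt, int nsteps) {
  double q = 1.0, p = 0.0;
  const double h0 = hamiltonian(q, p);
  leapfrog(q, p, dt, nsteps);
  return std::fabs(hamiltonian(q, p) - h0);
}
```

In RHMC the update has the same structure, but the force evaluation (the fermion force below) dominates the cost, which is why it is one of the two main targets for GPU acceleration.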
Dynamical Simulations with Highly Improved Staggered Quarks (R. M. Woloshyn)

Figure 1: Paths used in the ASQTAD and HISQ actions. Path coefficients for ASQTAD can be found in Ref. [4]. The HISQ effective links are constructed by first applying a Fat7 fattening to the base links (U → U_F, with coefficients 1-link: 1/8, 3-staple: 1/16, 5-staple: 1/64, 7-staple: 1/384), then an SU(3) projection (U_F → U_R), and finally an "ASQ" smearing (U_R → U_eff, with coefficients 1-link: 1 + ε/8, 3-staple: 1/16, 5-staple: 1/64, 7-staple: 1/384, Lepage: −1/8, Naik: −(1 + ε)/24, where the parameter ε is introduced to remove (am)^4 errors [9]).
The pseudo-fermion field Φ is defined on even lattice sites only to avoid a doubling of flavors from using M†M instead of M in the action. This procedure is valid since M†M has no matrix element connecting even and odd lattice sites.

A key component in dynamical simulations using molecular dynamics evolution is the computation of the fermion force, the derivative of the fermion action with respect to the base links
f_{x,\mu} = \frac{\partial S_f}{\partial U_{x,\mu}} = \frac{\partial}{\partial U_{x,\mu}} \left\langle \Phi \right| \left[ M^\dagger[U]\, M[U] \right]^{-n_f/4} \left| \Phi \right\rangle . \quad (2.3)
The derivative can be computed straightforwardly if n_f is a multiple of 4; for other numbers of fermion flavors the 4th root of M†M can be approximated by a rational expansion (the RHMC algorithm [17, 18, 19])

\left[ M^\dagger M \right]^{-n_f/4} \approx \alpha_0 + \sum_l \frac{\alpha_l}{M^\dagger M + \beta_l}, \quad (2.4)
where α_l and β_l are constants. The derivative becomes

\frac{\partial S_f}{\partial U_{x,\mu}} = -\sum_l \alpha_l \left\langle \Phi \left[ M^\dagger M + \beta_l \right]^{-1} \right| \frac{\partial}{\partial U_{x,\mu}} \left( M^\dagger[U]\, M[U] \right) \left| \left[ M^\dagger M + \beta_l \right]^{-1} \Phi \right\rangle
= -\sum_l \alpha_l \left( \left\langle X^l \right| \frac{\partial D^\dagger[U]}{\partial U_{x,\mu}} \left| Y^l \right\rangle + \left\langle Y^l \right| \frac{\partial D[U]}{\partial U_{x,\mu}} \left| X^l \right\rangle \right), \quad (2.5)
with |X^l⟩ = [M†M + β_l]^{-1}|Φ⟩ and |Y^l⟩ = D|X^l⟩. Note that X^l and Y^l are defined on even and odd sites, respectively. Taking the derivatives of D, D† with respect to U_eff, U_eff† and writing out the matrix indices we have
\left[ f_{x,\mu} \right]_{ab} = \frac{\partial S_f}{\partial \left[ U_{x,\mu} \right]_{ab}} = \sum_{y,\nu} (-1)^y \eta_{y,\nu} \left( \frac{\partial \left[ U^{\mathrm{eff}}_{y,\nu} \right]_{mn}}{\partial \left[ U_{x,\mu} \right]_{ab}} \left[ f^{(0)}_{y,\nu} \right]_{mn} + \frac{\partial \left[ U^{\mathrm{eff}\dagger}_{y,\nu} \right]_{mn}}{\partial \left[ U_{x,\mu} \right]_{ab}} \left[ f^{(0)\dagger}_{y,\nu} \right]_{mn} \right), \quad (2.6)

where f^{(0)}_{y,ν} is the vector outer product of the field variables at y and y + ν:

\left[ f^{(0)}_{y,\nu} \right]_{mn} = \begin{cases} \sum_l \alpha_l \left[ Y^l_{y+\nu} \right]_n \left[ X^{l*}_{y} \right]_m & \text{for even } y \\ \sum_l \alpha_l \left[ X^l_{y+\nu} \right]_n \left[ Y^{l*}_{y} \right]_m & \text{for odd } y \end{cases} \quad (2.7)
Measurements: Get some physical observables
• Fluctuations are a great observable to explore the QCD phase diagram
• noisy estimators → large number of random vectors η (~1500 per configuration)
• up to 20000 configurations for each temperature
• dominant operation: fermion matrix inversion (~99%)
\frac{\partial \ln\det M}{\partial\mu} = \mathrm{Tr}\left( M^{-1} \frac{\partial M}{\partial\mu} \right)

\frac{\partial^2 \ln\det M}{\partial\mu^2} = \mathrm{Tr}\left( M^{-1} \frac{\partial^2 M}{\partial\mu^2} \right) - \mathrm{Tr}\left( M^{-1} \frac{\partial M}{\partial\mu} M^{-1} \frac{\partial M}{\partial\mu} \right)

\vdots

\mathrm{Tr}\left( \frac{\partial^{n_1} M}{\partial\mu^{n_1}} M^{-1} \frac{\partial^{n_2} M}{\partial\mu^{n_2}} \cdots M^{-1} \right) = \lim_{N\to\infty} \frac{1}{N} \sum_{k=1}^{N} \eta_k^\dagger\, \frac{\partial^{n_1} M}{\partial\mu^{n_1}} M^{-1} \frac{\partial^{n_2} M}{\partial\mu^{n_2}} \cdots M^{-1} \eta_k
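The noisy estimator in the last line replaces an exact trace by an average over random vectors η_k. A self-contained sketch of the idea on a small dense matrix in plain C++ (the matrix here is an arbitrary stand-in, not the fermion matrix, and Z₂ (±1) noise is used for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Estimate Tr(A) as (1/N) * sum_k eta_k^T A eta_k with Rademacher (+-1) noise.
// For eta with independent +-1 entries, E[eta_i eta_j] = delta_ij, so the
// estimator is unbiased; only off-diagonal terms of A contribute noise.
double stochastic_trace(const std::vector<std::vector<double>>& A,
                        int nsamples, unsigned seed) {
  const std::size_t n = A.size();
  std::mt19937 rng(seed);
  std::bernoulli_distribution coin(0.5);
  double acc = 0.0;
  for (int k = 0; k < nsamples; ++k) {
    std::vector<double> eta(n);
    for (auto& e : eta) e = coin(rng) ? 1.0 : -1.0;
    double quad = 0.0;  // eta^T A eta
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j)
        quad += eta[i] * A[i][j] * eta[j];
    acc += quad;
  }
  return acc / nsamples;
}
```

In the setting above, A stands for operator products like ∂M/∂μ · M⁻¹, so every sample requires one or more fermion-matrix inversions; with ~1500 random vectors per configuration this is why the inversion dominates the measurement runtime.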
Mapping the Wilson-Clover operator to CUDA
• Each thread must
• Load the neighboring spinor (24 numbers x8)
• Load the color matrix connecting the sites (18 numbers x8)
• Load the clover matrix (72 numbers)
• Save the result (24 numbers)
• Arithmetic intensity
• 3696 floating point operations per site
• 2976 bytes per site (single precision)
• naive arithmetic intensity: 1.24 flops/byte
review basic details of the LQCD application and of NVIDIA GPU hardware. We then briefly consider some related work in Section IV before turning to a general description of the QUDA library in Section V. Our parallelization of the quark interaction matrix is described in VI, and we present and discuss our performance data for the parallelized solver in Section VII. We finish with conclusions and a discussion of future work in Section VIII.
II. LATTICE QCD

The necessity for a lattice-discretized formulation of QCD arises due to the failure of perturbative approaches commonly used for calculations in other quantum field theories, such as electrodynamics. Quarks, the fundamental particles that are at the heart of QCD, are described by the Dirac operator acting in the presence of a local SU(3) symmetry. On the lattice, the Dirac operator becomes a large sparse matrix, M, and the calculation of quark physics is essentially reduced to many solutions to systems of linear equations given by

Mx = b. (1)

The form of M on which we focus in this work is the Sheikholeslami-Wohlert [6] (colloquially known as Wilson-clover) form, which is a central-difference discretization of the Dirac operator. When acting in a vector space that is the tensor product of a 4-dimensional discretized Euclidean spacetime, spin space, and color space it is given by
M_{x,x'} = -\frac{1}{2} \sum_{\mu=1}^{4} \left( P^{-\mu} \otimes U^{\mu}_{x}\, \delta_{x+\hat\mu,x'} + P^{+\mu} \otimes U^{\mu\dagger}_{x-\hat\mu}\, \delta_{x-\hat\mu,x'} \right) + (4 + m + A_x)\, \delta_{x,x'} \equiv -\frac{1}{2} D_{x,x'} + (4 + m + A_x)\, \delta_{x,x'}. \quad (2)
Here δ_{x,y} is the Kronecker delta; P^{±μ} are 4×4 matrix projectors in spin space; U is the QCD gauge field, which is a field of special unitary 3×3 (i.e., SU(3)) matrices acting in color space that live between the spacetime sites (and hence are referred to as link matrices); A_x is the 12×12 clover matrix field acting in both spin and color space,¹ corresponding to a first-order discretization correction; and m is the quark mass parameter. The indices x and x′ are spacetime indices (the spin and color indices have been suppressed for brevity). This matrix acts on a vector consisting of a complex-valued 12-component color-spinor (or just spinor) for each point in spacetime. We refer to the complete lattice vector as a spinor field.
Since M is a large sparse matrix, an iterative Krylov solver is typically used to obtain solutions to (1), requiring many repeated evaluations of the sparse matrix-vector product. The matrix is non-Hermitian, so either Conjugate Gradients [7] on the normal equations (CGNE or CGNR) is used, or more commonly, the system is solved directly using a non-symmetric method, e.g., BiCGstab [8]. Even-odd (also known as red-black) preconditioning is used to accelerate the solution-finding process, where the nearest-neighbor property of the D_{x,x'} matrix (see Fig. 1) is exploited to solve the Schur complement system [9]. This has no effect on the overall efficiency since the fields are reordered such that all components of a given parity are contiguous. The quark mass controls the condition number of the matrix, and hence the convergence of such iterative solvers. Unfortunately, physical quark masses correspond to nearly indefinite matrices. Given that current leading lattice volumes are 32³ × 256, for > 10⁸ degrees of freedom in total, this represents an extremely computationally demanding task.

¹ Each clover matrix has a Hermitian block-diagonal, anti-Hermitian block off-diagonal structure, and can be fully described by 72 real numbers.

Fig. 1. The nearest-neighbor stencil part of the lattice Dirac operator D, as defined in (2), in the μ–ν plane. The color-spinor fields are located on the sites. The SU(3) color matrices U^μ_x are associated with the links. The nearest-neighbor nature of the stencil suggests a natural even-odd (red-black) coloring for the sites.
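The even-odd (red-black) reordering described above can be made concrete with a small indexing sketch in plain C++ (lattice extents and the `Lattice` helper are illustrative, not from any particular code). Sites are stored so that all sites of one parity are contiguous, and within a parity block a site keeps half of its lexicographic index:

```cpp
#include <cassert>
#include <vector>

// Even-odd (red-black) site indexing for a 4D lattice with even extents.
// Parity 0 = even sites, parity 1 = odd sites; each block holds V/2 sites.
struct Lattice {
  int Lx, Ly, Lz, Lt;
  int volume() const { return Lx * Ly * Lz * Lt; }
  int parity(int x, int y, int z, int t) const { return (x + y + z + t) & 1; }
  // Storage index: [parity block] + [lexicographic index / 2].
  // Consecutive lexicographic sites differ in parity, so lex/2 enumerates
  // each parity class exactly once (requires even Lx).
  int eo_index(int x, int y, int z, int t) const {
    int lex = x + Lx * (y + Ly * (z + Lz * t));
    return parity(x, y, z, t) * (volume() / 2) + lex / 2;
  }
};
```

Because the mapping is a bijection onto 0 … V−1 with each parity contiguous, the reordering itself costs nothing, which is the point made in the excerpt above.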
III. GRAPHICS PROCESSING UNITS
In the context of general-purpose computing, a GPU is effectively an independent parallel processor with its own locally-attached memory, herein referred to as device memory. The GPU relies on the host, however, to schedule blocks of code (or kernels) for execution, as well as for I/O. Data is exchanged between the GPU and the host via explicit memory copies, which take place over the PCI-Express bus. The low-level details of the data transfers, as well as management of the execution environment, are handled by the GPU device driver and the runtime system.

It follows that a GPU cluster embodies an inherently heterogeneous architecture. Each node consists of one or more processors (the CPU) that is optimized for serial or moderately parallel code and attached to a relatively large amount of memory capable of tens of GB/s of sustained bandwidth. At the same time, each node incorporates one or more processors (the GPU) optimized for highly parallel code attached to a relatively small amount of very fast memory, capable of 150 GB/s or more of sustained bandwidth. The challenge we face is that these two powerful subsystems are connected by a narrow communications channel, the PCI-E bus, which sustains at most 6 GB/s and often less. As a consequence, it is critical to avoid unnecessary transfers between the GPU and the host.
Standard staggered fermion matrix (Dslash)
• Krylov-space inversion of the fermion matrix dominates the runtime
• within the inversion, application of the sparse matrix dominates (>80%)
• memory: 8 SU(3) matrices input, 8 color vectors input, 1 color vector output
  8 × (72 + 24) bytes + 24 bytes = 792 bytes (1584 for double precision)
• flops (CM = complex mult, CA = complex add):
  4 × (2 × 3 × (3 CM + 2 CA) + 3 CA) + 3 × 3 CA = 570 flops
• flops/byte ratio: 0.72
w_x = D_{x,x'}\, v_{x'} = \sum_{\mu=0}^{3} \left\{ U_{x,\mu}\, v_{x+\hat\mu} - U^\dagger_{x-\hat\mu,\mu}\, v_{x-\hat\mu} \right\}
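The per-site counts on this slide can be checked mechanically. A small plain-C++ sketch reproducing the slide's arithmetic (the names are illustrative):

```cpp
#include <cassert>

// Flop and byte counts per lattice site for the standard staggered Dslash,
// following the slide: CM = complex multiply (6 flops), CA = complex add (2 flops).
constexpr int CM = 6, CA = 2;

// 4 dimensions x (SU(3) matrix-vector products for the forward and backward
// links plus combining adds), plus the final accumulation over directions.
constexpr int staggered_flops = 4 * (2 * 3 * (3 * CM + 2 * CA) + 3 * CA) + 3 * 3 * CA;

// Single precision: 8 links (18 floats = 72 B) + 8 neighbor vectors
// (6 floats = 24 B) read, 1 result vector (24 B) written.
constexpr int staggered_bytes_single = 8 * (72 + 24) + 24;

constexpr double staggered_intensity =
    static_cast<double>(staggered_flops) / staggered_bytes_single;
```

The resulting 570 flops over 792 bytes gives the quoted naive intensity of about 0.72 flops/byte, far below what a GPU needs to be compute-bound, which motivates the multi-r.h.s. strategy on the next slide.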
Solvers for multiple right-hand sides
• need inversions for many (~1500) 'source' vectors for a fixed gauge field (matrix)
• instead of inverting for one vector at a time, apply the matrix to n right-hand sides at once
• bytes for n vectors: 16 · (72 + n · 24) bytes + n · 24 bytes = 1152 bytes + 408 bytes · n
• flops for n vectors: 1146 flops · n

# r.h.s.     1     2     3     4     5
flops/byte   0.73  1.16  1.45  1.65  1.8
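The flops/byte row of the table follows directly from the two formulas above; a short plain-C++ check using the slide's per-site counts (the function name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Arithmetic intensity for n right-hand sides sharing one gauge field:
// the 16 link matrices (72 B each) are read once, while the vector traffic
// (24 B in per link, 24 B out) scales with n.
double flops_per_byte(int n) {
  double bytes = 16.0 * (72.0 + n * 24.0) + n * 24.0;  // = 1152 + 408 n
  double flops = 1146.0 * n;
  return flops / bytes;
}
```

The matrix reads are amortized over the n vectors, so the intensity rises from 0.73 toward the asymptotic 1146/408 ≈ 2.8 flops/byte as n grows.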
\left( w^{(1)}_x,\, w^{(2)}_x,\, \ldots,\, w^{(n)}_x \right) = D_{x,x'} \left( v^{(1)}_{x'},\, v^{(2)}_{x'},\, \ldots,\, v^{(n)}_{x'} \right)
[plot: Dslash performance in GFlops (0–600) vs. number of r.h.s. (1–5) — estimated for M2075, GTX 580, K20, and GTX Titan; measured for M2075 and K20]
Multiple right hand sides using cache and registers
•obvious solution: store matrix in registers
•possible issue: more registers / thread → occupancy / spilling
__global__ Dslashreg(w1, w2, w3, v1, v2, v3) {
  ...
  for (xp = ...) {
    w1(x) = D(x,xp) * v1(xp);
    w2(x) = D(x,xp) * v2(xp);
    w3(x) = D(x,xp) * v3(xp);
  }
}

__global__ Dslashcache(w, v) {
  ...
  offset = threadIdx.y;
  for (xp = ...)
    w(x, offset) += D(x,xp) * v(xp, offset);
}

__global__ Dslashregcache(w1, w2, w3, v1, v2, v3) {
  ...
  offset = threadIdx.y;
  for (xp = ...) {
    w1(x, offset) = D(x,xp) * v1(xp, offset);
    w2(x, offset) = D(x,xp) * v2(xp, offset);
    w3(x, offset) = D(x,xp) * v3(xp, offset);
  }
}
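The idea behind the register-blocked Dslash variant can be mimicked on the CPU: load each matrix element once and apply it to several right-hand sides before moving on. A minimal plain-C++ sketch with a dense toy matrix instead of the stencil (all names are illustrative):

```cpp
#include <cassert>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Baseline: one full matrix sweep per right-hand side
// (the matrix is read n_rhs times).
std::vector<Vec> apply_single(const Mat& D, const std::vector<Vec>& v) {
  std::vector<Vec> w(v.size(), Vec(D.size(), 0.0));
  for (std::size_t r = 0; r < v.size(); ++r)
    for (std::size_t x = 0; x < D.size(); ++x)
      for (std::size_t xp = 0; xp < D.size(); ++xp)
        w[r][x] += D[x][xp] * v[r][xp];
  return w;
}

// "Register-blocked" variant: each D(x,xp) is loaded once and reused for
// all right-hand sides, raising the flop/byte ratio as on the slide.
std::vector<Vec> apply_batched(const Mat& D, const std::vector<Vec>& v) {
  std::vector<Vec> w(v.size(), Vec(D.size(), 0.0));
  for (std::size_t x = 0; x < D.size(); ++x)
    for (std::size_t xp = 0; xp < D.size(); ++xp) {
      const double d = D[x][xp];  // one load, n_rhs uses
      for (std::size_t r = 0; r < v.size(); ++r)
        w[r][x] += d * v[r][xp];
    }
  return w;
}
```

Both loop orders compute the same result; only the memory-reuse pattern differs, which is exactly the trade-off the register, cache, and combined kernel variants explore on the GPU.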
•exploit texture cache → avoid register pressure
•load matrix through texture cache
•should hit in texture cache → only one global load
[diagram: thread-block layout — threadIdx.y selects the r.h.s.; sites x = 0 … BS−1 are processed for each of the vectors v1, v2, v3]
•combine both and explore best possible combinations
[plot: time per r.h.s. in ms (0–16) vs. number of r.h.s. (0–12), comparing kernels that keep 1, 2, or 3 r.h.s. in registers]
[plot: performance in GFlops (0–500) vs. number of r.h.s. (0–12)]
[diagram: solver loop — Initialization; then loop: fermion matrix application (Dslash); reduction α = Σ_i |p A p|_i; vector updates r ← r − ωp, x ← x + ωp; norm reduction δ_i = |r_i|, δ = Σ_i δ_i; search-direction update p ← r + βp]
Importance of linear algebra
•using multiple r.h.s. provides massive speedup for Dslash
•linear algebra scales linearly
•more important for smaller lattice sizes
[Charts: fraction of time spent in linear algebra (up to ~0.28) and percentage breakdown into Dslash / Linear Algebra / Rest, for Nt=8 and Nt=6]
GTC 2014 | Mathias Wagner | Indiana University |
[Chart: relative performance, 0.8–1.0, of the sync / streams / shuffle reduction variants, for Nt=6 and Nt=8]
A linear algebra setup for multiple r.h.s.
[Pipeline per r.h.s.: Dslash → reduction → Linear Algebra → reduction → D2H copy; WaitEvent → Linear Algebra → D2H; sync to check for convergence]
reduceSinglePassAsync(..., stream=Stream1);
// Event will be recorded once the reduction kernel has finished
cudaEventRecord(ReductionDone1, Stream1);
// make CopyStream wait until the reduction is done - signaled by the event
// this does not block the host
cudaStreamWaitEvent(CopyStream, ReductionDone1, 0);
// already issue the async copy command to CopyStream
// it will only execute after the StreamWaitEvent
cudaMemcpyAsync(..., CopyStream);
GTC 2014 | Mathias Wagner | Indiana University |
•Fermion Force includes a lot of SU(3) matrix multiplications
•loop over possible orientations in 4 dimensional space-time
•each interior function
•multiplies up to 7 SU(3) matrices
•may contain a sum over up to 56 of these multiplications
•How do different ways to perform the loops affect the performance?
Optimizing the fermion force, or 'Where to put the loops?'
[Background: R. M. Woloshyn, 'Dynamical Simulations with Highly Improved Staggered Quarks']
Figure 1: Paths used in the ASQTAD and HISQ actions. Path coefficients for ASQTAD can be found in Ref. [4]. The HISQ effective links are constructed by first applying a Fat7 fattening to the base links (U → U_F, with coefficients 1-link: 1/8, 3-staple: 1/16, 5-staple: 1/64, 7-staple: 1/384), then an SU(3) projection (U_F → U_R), and finally an 'ASQ' smearing (U_R → U_eff, with coefficients 1-link: 1 + ε/8, 3-staple: 1/16, 5-staple: 1/64, 7-staple: 1/384, Lepage: −1/8, Naik: −(1+ε)/24, where the parameter ε is introduced to remove (am)^4 errors [9]).
The pseudo-fermion field Φ is defined on even lattice sites only to avoid a doubling of flavors from using M†M instead of M in the action. This procedure is valid since M†M has no matrix element connecting even and odd lattice sites.

A key component in dynamical simulations using molecular dynamics evolution is the computation of the fermion force, the derivative of the fermion action with respect to the base links:

f_{x,\mu} = \frac{\partial S_f}{\partial U_{x,\mu}} = \frac{\partial}{\partial U_{x,\mu}} \left\langle \Phi \right| \left[ M^\dagger[U] M[U] \right]^{-n_f/4} \left| \Phi \right\rangle . \quad (2.3)

The derivative can be computed straightforwardly if n_f is a multiple of 4; for other numbers of fermion flavors the 4th root of M†M can be approximated by a rational expansion (the RHMC algorithm [17, 18, 19])

[M^\dagger M]^{-n_f/4} \approx \alpha_0 + \sum_l \frac{\alpha_l}{M^\dagger M + \beta_l} , \quad (2.4)

where α_l and β_l are constants. The derivative becomes

\frac{\partial S_f}{\partial U_{x,\mu}} = -\sum_l \alpha_l \left\langle \Phi [M^\dagger M + \beta_l]^{-1} \right| \frac{\partial}{\partial U_{x,\mu}} \left( M^\dagger[U] M[U] \right) \left| [M^\dagger M + \beta_l]^{-1} \Phi \right\rangle
= -\sum_l \alpha_l \left( \left\langle X^l \right| \frac{\partial D^\dagger[U]}{\partial U_{x,\mu}} \left| Y^l \right\rangle + \left\langle Y^l \right| \frac{\partial D[U]}{\partial U_{x,\mu}} \left| X^l \right\rangle \right) , \quad (2.5)

with |X^l⟩ = [M†M + β_l]^{-1} |Φ⟩ and |Y^l⟩ = D |X^l⟩. Note that X^l and Y^l are defined on even and odd sites respectively. Taking the derivatives of D, D† with respect to U_eff, U_eff† and writing out the matrix indices we have

\left[ f_{x,\mu} \right]_{ab} = \frac{\partial S_f}{\partial [U_{x,\mu}]_{ab}} = \sum_{y,\nu} (-1)^y \eta_{y,\nu} \left( \frac{\partial [U^{eff}_{y,\nu}]_{mn}}{\partial [U_{x,\mu}]_{ab}} \left[ f^{(0)}_{y,\nu} \right]_{mn} + \frac{\partial [U^{eff\dagger}_{y,\nu}]_{mn}}{\partial [U_{x,\mu}]_{ab}} \left[ f^{(0)\dagger}_{y,\nu} \right]_{mn} \right) , \quad (2.6)

where f^{(0)}_{y,ν} is the vector outer product of the field variables at y and y+ν:

\left[ f^{(0)}_{y,\nu} \right]_{mn} = \begin{cases} \sum_l \alpha_l [Y^l_{y+\nu}]_n [X^{l*}_y]_m & \text{for even } y \\ \sum_l \alpha_l [X^l_{y+\nu}]_n [Y^{l*}_y]_m & \text{for odd } y \end{cases} . \quad (2.7)
F_{x,\mu} = \sum_{\nu=0,\, \nu\neq\mu}^{3} \; \sum_{\rho=0,\, \rho\neq\mu,\nu}^{3} f_{x,\mu,\nu,\rho}
// host: plain loop nest over sites and directions
for(int mu=0; mu<4; mu++)
  for(int x=0; x<N; x++)
    for(int nu=0; nu<4; nu++)
      // ensure nu != mu
      for(int rho=0; rho<4; rho++)
        // ensure rho != nu, rho != mu
        f(x, mu, nu, rho)

// grid x; internal mu, nu, rho
Kernel<<< blocksize, gridsize >>>( args, x )
__global__ Kernel(args, x){
  for(int mu=0; mu<4; mu++)
    for(int nu=0; nu<4; nu++)
      // ensure nu != mu
      for(int rho=0; rho<4; rho++)
        // ensure rho != mu, rho != nu
        f(x, mu, nu, rho)
}

// external mu; grid x; internal nu, rho
for(int mu=0; mu<4; mu++)
  Kernel<<< blocksize, gridsize >>>( args, x, mu )
__global__ Kernel(args, x, mu){
  for(int nu=0; nu<4; nu++)
    // ensure nu != mu
    for(int rho=0; rho<4; rho++)
      // ensure rho != mu, rho != nu
      f(x, mu, nu, rho)
}

// external mu, nu, rho; grid x
for(int mu=0; mu<4; mu++)
  for(int nu=0; nu<4; nu++)
    // ensure nu != mu
    for(int rho=0; rho<4; rho++)
      // ensure rho != mu, rho != nu
      Kernel<<< blocksize, gridsize >>>( args, x, mu, nu, rho )
__global__ Kernel(args, x, mu, nu, rho){
  f(x, mu, nu, rho)
}

// external nu, rho; grid x*mu
for(int nu=0; nu<4; nu++)
  // ensure nu != mu
  for(int rho=0; rho<4; rho++)
    // ensure rho != mu, rho != nu
    Kernel<<< blocksize, gridsize >>>( args, xmu )
__global__ Kernel(args, xmu){
  int x  = xmu % N;
  int mu = xmu / N;
  f(x, mu, nu, rho)
}

// grid xmu, block (xmu, nu, rho); internal nu, rho
// ensure rho != mu, rho != nu: 3 (2) allowed nu (rho) values
Kernel<<< (blocksize,3,2), gridsize >>>( args, xmu )
__global__ Kernel(args, xmu){
  int x   = xmu % N;
  int mu  = xmu / N;
  int nu  = threadIdx.y;
  int rho = threadIdx.z;
  f(x, mu, nu, rho)
}
GTC 2014 | Mathias Wagner | Indiana University |
Consider some strategies for parallelism and loops
•What comes to your mind?
•let threads run over x
•use external µ
•use external µ, ν, ρ
•let threads run over x and µ, keep ν, ρ external
•let threads run over x and µ, use a 3d block for ν, ρ
F_{x,\mu} = \sum_{\nu=0,\, \nu\neq\mu}^{3} \; \sum_{\rho=0,\, \rho\neq\mu,\nu}^{3} f_{x,\mu,\nu,\rho}
[Diagram: thread blocks indexed by x and the (µ, ν, ρ) combinations, e.g. x = 0 … BS-1 for (0,1,2), (0,2,1), (0,3,1), …, up to x = N-BS … N-1 for (3,2,1)]
GTC 2014 | Mathias Wagner | Indiana University |
Benchmark some strategies for parallelism and loops
[Charts: time [ms], 0–1800, and relative performance, 0.75–1.5, for the x-grid, ext µ, ext µ ν ρ, xµ-grid, and block ν ρ variants; Nt=8 and Nt=6 on K20 and M2075]
GTC 2014 | Mathias Wagner | Indiana University |
‘nvidia-smi -ac 3004,875’ or Unleashing the K40
•K40 features GPU Boost
•increase clock from 745 to 875 MHz (~15%)
•within power headroom
•memory clock stays constant
•How does it affect bandwidth-bound applications?
[Chart: relative speedup, 0–1, of Trajectory, Inversion, and F.Force on M2075, K20, K20X, K40, and K40 @ 875]
GTC 2014 | Mathias Wagner | Indiana University |
A closer look at the speedups
•Inversion: matrix-vector multiplication
•rather simple memory access pattern
•Force: long chain of small matrix-multiplications
•Inversions scale with frequency up to 810 MHz
•Force shows weaker dependency
•but scales up to 875 MHz
•apply K20's bandwidth/flops ratio to K40 → 845 MHz
[Charts: relative performance vs. clock for K40 @ 666/745/810/875 MHz (Inversion, F.Force, 1/Freq) and for K20 @ 666/705/758 MHz]
GTC 2014 | Mathias Wagner | Indiana University |
Is it worth replacing Fermi cards?
•K40 is about a factor of two faster than the M2075
•still not fully exploiting Kepler
•For example:
•replace 100 M2075
•needs only 54 K40 (~ $250,000)
•power savings?
GTC 2014 | Mathias Wagner | Indiana University |
Real world power consumption of a RHMC trajectory
•Kepler allows identification of code parts from the power trace
•code stays within the power limits
•GPU Boost can be scaled
[Charts: power [W] vs. time [s] over a trajectory for M2075, K20, and K40, and for K40 vs. K40 @ 875 (scaled)]
GTC 2014 | Mathias Wagner | Indiana University |
Assume universities had business administration …
•Kepler is two times more efficient
•GPU environment: ~ 300 W / node
•K40 consumes less host power → 'run to idle'
•still, for 100 M2075 → 54 K40:
•energy savings: 130 MWh / year
•Germany: € 26,000 / year
•US: $ 19,000 / year
•this ignores cooling …
[Charts: average power per GPU [W] (per GPU in a 2-GPU system), time per trajectory [s], and energy per trajectory [kWh] split into GPU / host / system, for M2075, K20, K40 @ 745/810/875; GPU-only energy per trajectory ranges from 0.041 kWh (M2075) down to ~0.02 kWh (K40), full-system from 0.038 down to ~0.02 kWh]
GTC 2014 | Mathias Wagner | Indiana University |
Status of lattice data
[Plots: cumulant ratios r2, r4, r6 of baryon number (B), electric charge (Q), and strangeness (S) fluctuations vs. T = 140–240 MeV; Tc = 154(9) MeV, Nτ = 6, 8; BNL-Bielefeld preliminary]
•highly improved staggered quarks, close to the physical pion mass (ml/ms = 1/20)
GTC 2014 | Mathias Wagner | Indiana University |
Thermometer for freeze-out in heavy-ion collisions
•How can one build a thermometer?
•Oldest idea: the height of a mercury column
•a relation between a length and temperature
•Construct one from fluctuation observables
•with a monotonic and strong temperature dependence
•that can be / have been measured experimentally
•that can be calculated from first principles
R^Q_{31}(T, \mu_B) = R^{Q,0}_{31} + R^{Q,2}_{31}\, \mu_B^2
[Plot: R^Q_{31} vs. T = 140–240 MeV for µB/T = 0 and 1, Nτ = 6 and 8; HRG and free-gas limits; bands: continuum estimate]
BNL-BI, PRL 109 (2012) 192302
S. Mukherjee, MW, CPOD 2013, arXiv:1307.6255
GTC 2014 | Mathias Wagner | Indiana University |
Summary
•Increasing the flop / byte ratio without leaving linear algebra behind
•Optimizing parallelism and loops
•Kepler GPU Boost and energy consumption
•GPUs allow us to push the borders of Lattice QCD
•multiple right-hand-side solvers using mixed precision
•overcome memory limitations for multiple r.h.s. with pipelining
•Runs on Titan:Configuration generation on CPUs and ‘live-measurements’ on the GPU
GTC 2014 | Mathias Wagner | Indiana University |
Thank you and our supporters
Financial support NSF PHY-1212389 DFG GSI BILAER
GPU Clusters at Bielefeld University (Ger) Paderborn University (Ger) Indiana University (US) Jefferson Laboratory (US)
Contact: [email protected]