Page 1: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder

Caltech ASCI Center

Page 2

Why Use the GPU?

• Semiconductor trends
  – cost
  – wires vs. compute
  – Stanford streaming supercomputer

• Parallelism
  – many functional units
  – graphics is prime example

• Harvesting this power
  – which applications are suitable?
  – which abstractions are useful?

• History
  – massively parallel SIMD machines
  – media processing

[Chart: performance in ps/Inst (log scale, 1e-4 to 1e+7) versus year (1980 to 2020); series "Perf (ps/Inst)" and "Linear (ps/Inst)"; annotations "Possible" and "Actual". Chart courtesy Bill Dally.]

[Images: Imagine stream processor (Bill Dally, Stanford); Connection Machine CM2 (Thinking Machines)]

Page 3

Contributions and Related Work

• Contributions
  – numerical algorithms on GPU
    • unstructured grids: conjugate gradients
    • regular grids: multigrid
  – what abstractions are needed?

• Numerical algorithms
  – Goodnight et al. 2003 (MG)
  – Hall et al. 2003 (cache)
  – Harris et al. 2002 (FD sim.)
  – Hillesland et al. 2003 (optimization)
  – Krueger & Westermann 2003 (NLA)
  – Strzodka (PDEs)

Page 4

Streaming Model

• Abstract model
  – Purcell, et al. 2002
  – data structures: streams
  – algorithms: kernels

• Concrete model
  – render a rectangle
  – data structures: textures
  – algorithms: fragment programs

[Diagram, abstract model: input record stream; Kernel; globals; output record stream]

[Diagram, concrete model: Rasterizer (set up texture indices and all associated data); Fragment program (for all pixels in parallel); Texture as read-only memory; Output goes to texture; Bind buffer to texture]
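The abstract model can be illustrated with a minimal CPU-side sketch: a kernel is a function applied independently to every record of an input stream, with read-only globals, producing an output stream. The names `run_kernel` and `saxpy_kernel` are hypothetical and chosen only for this illustration; on the GPU the loop would be replaced by rendering a rectangle with a fragment program.

```python
def run_kernel(kernel, stream, globals_):
    """Apply `kernel` to every record of the input stream.

    No record reads another record's output; that independence is what
    allows all fragments to be processed in parallel on the GPU.
    """
    return [kernel(record, globals_) for record in stream]


def saxpy_kernel(record, globals_):
    """Example kernel (hypothetical): alpha * x + y for one record."""
    x, y = record
    return globals_["alpha"] * x + y


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]
# The input record stream pairs up the two vectors element by element.
out = run_kernel(saxpy_kernel, list(zip(xs, ys)), {"alpha": 2.0})
```

In the concrete model the two input lists would live in textures, `alpha` would be a program constant, and `out` would be written to a buffer that is then bound as a texture for the next pass.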

Page 5

Sparse Matrices: Geometric Flow

• Ubiquitous in numerical computing
  – discretization of PDEs: animation
    • finite elements, differences, volumes
  – optimization, editing, etc., etc.

• Example here:
  – processing of surfaces

• Canonical non-linear problem
  – mean curvature flow
  – implicit time discretization
    • solve sequence of SPD systems

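[Equations: matrix entries $a_{ij}$ and $a_{ii}$, built from a sum of two cotangent terms per edge $ij$, a factor of 4, $A_i$, the time step, and a sum over $j \in N(i)$]

[Equation: flow of $x_i(t)$ in terms of $H_i(t)$ and $n_i(t)$]

Velocity opposite mean curvature normal

[Equation: linear system in $A$ relating $x^{i}$ and $x^{i+1}$]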

Page 6

Conjugate Gradients

• High level code
  – inner loop
  – matrix-vector multiply
  – sum-reduction
  – scalar-vector MAD

• Inner product
  – fragment-wise multiply
  – followed by sum-reduction
  – odd dimensions can be handled
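As an illustration of how the three primitives named above combine, here is a minimal CPU-side sketch of conjugate gradients in plain Python. It is not the authors' GPU code: the function names are hypothetical, the matrix is passed in as a `matvec` callback, and the pairwise `sum_reduce` only mimics the multi-pass reduction (each pass halves the data, and an odd leftover element is carried over).

```python
def sum_reduce(values):
    """Pairwise sum-reduction; each pass roughly halves the length."""
    vals = list(values)
    if not vals:
        return 0.0
    while len(vals) > 1:
        half = len(vals) // 2
        reduced = [vals[2 * k] + vals[2 * k + 1] for k in range(half)]
        if len(vals) % 2 == 1:  # odd length: carry the last element over
            reduced.append(vals[-1])
        vals = reduced
    return vals[0]


def dot(a, b):
    """Inner product: element-wise multiply followed by sum-reduction."""
    return sum_reduce([x * y for x, y in zip(a, b)])


def mad(alpha, x, y):
    """Scalar-vector multiply-add: alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradients(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for SPD A, given only a function computing A p."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(r)          # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol * tol:
            break
        Ap = matvec(p)                 # matrix-vector multiply
        alpha = rr / dot(p, Ap)        # sum-reduction
        x = mad(alpha, p, x)           # scalar-vector MAD
        r = mad(-alpha, Ap, r)
        rr_new = dot(r, r)
        p = mad(rr_new / rr, p, r)
        rr = rr_new
    return x


# Small SPD example: A = [[4, 1], [1, 3]], b = [1, 2].
A = [[4.0, 1.0], [1.0, 3.0]]
solution = conjugate_gradients(
    lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A],
    [1.0, 2.0],
)
```

Each loop iteration uses one matrix-vector multiply, two inner products, and three multiply-adds, which is why those primitives are the ones worth mapping to fragment programs.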

Page 7

y=Ax

Aj – off-diagonal matrix elements

R – pointers to segments

Page 8

Row-Vector Product

X – vector elements

R – pointers to segments

Ai – diagonal matrix elements

J – pointers to xj

Aj – off-diagonal matrix elements

Fragment program
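The row-vector product over the five arrays listed above can be sketched on the CPU as follows. The exact texture layout is not given on the slide, so this sketch assumes that `R[i]` is the start offset of row i's segment in `J` and `Aj`, with one extra sentinel entry `R[n]` marking the end; that convention and the function names are assumptions for illustration.

```python
def row_vector_product(i, X, R, Ai, J, Aj):
    """Compute y_i = Ai[i] * X[i] + sum of Aj[k] * X[J[k]] over row i's segment."""
    acc = Ai[i] * X[i]                     # diagonal term
    for k in range(R[i], R[i + 1]):        # off-diagonal segment of row i
        acc += Aj[k] * X[J[k]]             # J[k] is an indirection into X
    return acc


def matvec(X, R, Ai, J, Aj):
    """y = A x, one independent row-vector product per output element."""
    return [row_vector_product(i, X, R, Ai, J, Aj) for i in range(len(X))]


# 3x3 example: A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
Ai = [2.0, 2.0, 2.0]            # diagonal entries
R = [0, 1, 3, 4]                # segment starts, plus end sentinel (assumed)
J = [1, 0, 2, 1]                # column index of each off-diagonal entry
Aj = [-1.0, -1.0, -1.0, -1.0]   # off-diagonal values
y = matvec([1.0, 2.0, 3.0], R, Ai, J, Aj)
```

Because each output element depends only on read-only inputs, the body of `row_vector_product` is what a single fragment would execute, with the loop length fixed per fragment program.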

Page 9

Apply to All Pixels

• Two extremes
  – one row at a time: setup overhead
  – all rows at once: limited by worst row

• Middle ground
  – organize “batches” of work

• How to arrange batches?
  – order rows by non-zero entries
    • optimal packing is NP-hard

• We choose fixed size rectangles
  – fragment pipe is quantized
  – simple experiments reveal best size
    • 26 x 18 – 91% efficient
    • wasted fragments on diagonal

[Plot: time versus area (pixels)]

Page 10

Packing (Greedy)

Non-zero entries per row (sorted):

15 13 13 12 12 11 10 9 9 9 9 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 6 5 5 4

Batches:

15 13 13
12 12 11
10 9 9
9 9 8
8 8 8
8 7 7
7 7 7
7 7 7
7 7 6

Each batch is bound to an appropriate fragment program.

All this setup is done only once, at the beginning of time. It depends only on mesh connectivity.
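The greedy packing can be sketched as: sort rows by non-zero count, then cut the sorted order into fixed-size batches, so that rows sharing a batch need similar loop lengths. The sketch below uses the row counts from this slide with a batch size of 3 rows, matching the batches shown; the function names and the efficiency measure (useful entries divided by entries issued when every row in a batch runs for the batch maximum) are assumptions for illustration, not the authors' definition of the 91% figure.

```python
def pack_rows(nnz_per_row, batch_size):
    """Greedy packing: sort rows by descending non-zero count, then chunk."""
    order = sorted(range(len(nnz_per_row)),
                   key=lambda r: nnz_per_row[r], reverse=True)
    return [order[s:s + batch_size] for s in range(0, len(order), batch_size)]


def packing_efficiency(nnz_per_row, batches):
    """Useful work divided by issued work (assumed measure, see lead-in)."""
    useful = sum(nnz_per_row)
    issued = sum(len(b) * max(nnz_per_row[r] for r in b) for b in batches)
    return useful / issued


counts = [15, 13, 13, 12, 12, 11, 10, 9, 9, 9, 9, 8, 8, 8, 8, 8,
          7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 5, 5, 4]
batches = pack_rows(counts, 3)
# Non-zero counts per batch, for comparison with the slide.
batch_counts = [[counts[r] for r in b] for b in batches]
efficiency = packing_efficiency(counts, batches)
```

Since the sort depends only on the number of non-zeros per row, the batches can be computed once from the mesh connectivity and reused for every solve.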

Page 11

Recomputing Matrix

• Matrix entries depend on surface
  – must “render” into matrix
  – two additional indirection textures
    • previous and next

Page 12

Results (NV30@500MHz)

• 37k elements
  – matrix multiply
    • 33 instructions, 120 per second
    • only 13 flops
    • latency limited
  – reduction
    • 7 inst/frag/pass, 3400 per second
  – CG solve: 20 per second

Page 13

Regular Grids

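• Poisson solver as example
  – multigrid approach
  – this time variables on “pixel grid”
    • e.g.: Navier-Stokes

$\frac{\partial u}{\partial t} = -(u \cdot \nabla) u + \nu \nabla^2 u + b, \qquad \nabla \cdot u = 0$

After discretization: solve Poisson eq. $\nabla^2 p = \nabla \cdot u$ at each time step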

Page 14

Poisson Equation

• Appears all over the place
  – easy to discretize on regular grid
  – matrix multiply is stencil application
  – FD Laplace stencil (centered at (i,j)):

     0   1   0
     1  -4   1
     0   1   0

$\nabla^2 X_{i,j} = X_{i-1,j} + X_{i+1,j} + X_{i,j-1} + X_{i,j+1} - 4 X_{i,j}$

• Use iterative matrix solver
  – just need application of stencil
    • easy: just like filtering
    • incorporate geometry (Jacobian)
    • variable coefficients
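To make "matrix multiply is stencil application" concrete, here is a small sketch that applies the FD Laplace stencil (center -4, four neighbors 1, corners 0) to the interior of a grid stored as a list of rows. The function name is hypothetical, unit grid spacing is assumed, and boundary values are simply left at zero in the output.

```python
def apply_laplace_stencil(X):
    """Apply the 5-point Laplace stencil to every interior grid point."""
    n, m = len(X), len(X[0])
    Y = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            Y[i][j] = (X[i - 1][j] + X[i + 1][j]
                       + X[i][j - 1] + X[i][j + 1]
                       - 4.0 * X[i][j])
    return Y


# X(i, j) = i^2 + j^2 has a constant Laplacian of 4 on a unit grid.
quadratic = [[float(i * i + j * j) for j in range(5)] for i in range(5)]
lap_quadratic = apply_laplace_stencil(quadratic)

# A linear function is annihilated by the stencil.
linear = [[float(i + 2 * j) for j in range(5)] for i in range(5)]
lap_linear = apply_laplace_stencil(linear)
```

No matrix is ever stored here: the stencil weights play the role of the matrix rows, which is why the operation looks just like image filtering.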

Page 15

Multigrid

[Diagram: fine-to-coarse-to-fine cycle with five Relax steps joined by two Projection steps and two Interpolation steps]

• Fine to coarse to fine cycle
  – high freq. error removed quickly
  – lower frequency error takes longer

Relax, Project, Interpolate
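The relax, project, interpolate cycle can be sketched in one dimension. The example below is a generic textbook V-cycle for -u'' = f on [0, 1] with zero boundary values, using weighted Jacobi relaxation, full-weighting projection, and linear interpolation; these specific choices, the 1D setting, and all function names are assumptions made for a short runnable illustration, not the authors' 2D GPU implementation.

```python
import math


def relax(u, f, h, sweeps=2, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps; quickly damps high-frequency error."""
    for _ in range(sweeps):
        old = list(u)
        for i in range(1, len(u) - 1):
            jacobi = 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i])
            u[i] = (1.0 - omega) * old[i] + omega * jacobi
    return u


def residual(u, f, h):
    """r = f - A u for the 1D stencil (-1, 2, -1) / h^2."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def project(r):
    """Full weighting (1/4, 1/2, 1/4) from the fine to the coarse grid."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for i in range(1, nc):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def interpolate(ec):
    """Linear interpolation to the fine grid, using floor in the indexing."""
    nf = 2 * (len(ec) - 1)
    return [0.5 * (ec[i // 2] + ec[(i + 1) // 2]) for i in range(nf + 1)]


def v_cycle(u, f, h):
    if len(u) <= 3:                       # coarsest grid: one unknown
        u[1] = 0.5 * h * h * f[1]         # exact solve of 2 u / h^2 = f
        return u
    relax(u, f, h)
    rc = project(residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)   # coarse error equation
    e = interpolate(ec)
    for i in range(1, len(u) - 1):
        u[i] += e[i]                      # coarse-grid correction
    relax(u, f, h)
    return u


n = 64
h = 1.0 / n
f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n + 1)]
u = [0.0] * (n + 1)
for _ in range(10):
    v_cycle(u, f, h)
# Compare against the exact solution u(x) = sin(pi x).
err = max(abs(u[i] - math.sin(math.pi * i * h)) for i in range(n + 1))
```

Relaxation alone would need many sweeps to remove the smooth error component; moving to the coarser grid turns that component into a higher-frequency one relative to the grid, where relaxation is effective again.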

Page 16

Computations and Storage Layout

• Lots of stencil applications
  – matrix multiply: 3x3 stencil
  – projection: 3x3 stencil
  – interpolation: 2x2(!)
    • floor op in indexing

• Storage for matrices and DOFs
  – variables in one texture
  – matrices in 9(=3x3) textures
  – all textures packed
    • exploit 4 channels
    • domain decomp.
    • padded boundary

Projection stencil:

          1  2  1
  1/16 ·  2  4  2
          1  2  1

Interpolation:

$v^{h}_{i} = \frac{1}{4} \sum_{d \in \{0,1\}^2} v^{2h}_{\lfloor (i+d)/2 \rfloor}$

[Diagram: the four texture channels, labeled x, y, z, w]
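The 3x3 projection stencil with weight 1/16 and the 2x2 interpolation with a floor op in the indexing can be sketched on plain nested lists. The sketch assumes square grids of size (2n+1) x (2n+1) on the fine level and (n+1) x (n+1) on the coarse level, and leaves coarse boundary values at zero during projection; those conventions and the function names are assumptions, and the texture packing into four channels is not modeled.

```python
PROJECTION = [[1, 2, 1],
              [2, 4, 2],
              [1, 2, 1]]   # integer weights, scaled by 1/16 below


def project(fine):
    """Fine -> coarse: weighted 3x3 average around fine point (2I, 2J)."""
    n = (len(fine) - 1) // 2
    coarse = [[0.0] * (n + 1) for _ in range(n + 1)]
    for I in range(1, n):
        for J in range(1, n):
            acc = 0.0
            for a in (-1, 0, 1):
                for b in (-1, 0, 1):
                    acc += PROJECTION[a + 1][b + 1] * fine[2 * I + a][2 * J + b]
            coarse[I][J] = acc / 16.0
    return coarse


def interpolate(coarse):
    """Coarse -> fine: average of 4 coarse reads at floor((i + d) / 2)."""
    n = len(coarse) - 1
    size = 2 * n + 1
    fine = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            acc = 0.0
            for di in (0, 1):
                for dj in (0, 1):
                    acc += coarse[(i + di) // 2][(j + dj) // 2]
            fine[i][j] = 0.25 * acc
    return fine


# A linear coarse field I + J is reproduced exactly as (i + j) / 2.
coarse_linear = [[float(I + J) for J in range(3)] for I in range(3)]
fine_linear = interpolate(coarse_linear)

# A constant fine field projects to the same constant in the interior.
coarse_const = project([[1.0] * 5 for _ in range(5)])
```

The floor in the index means every fine point performs the same four reads and the same average, with no branching on whether the point coincides with a coarse point, lies on a coarse edge, or sits in the middle of a coarse cell.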

Page 17

Coarser Matrices

• Operator at coarser level
  – needed for relaxation at all levels

• Triple matrix product…
  – work out terms and map to stencils
    • exploit local support of stencils
    • straightforward but t-e-d-i-o-u-s

$A_c = S A_f P$

[Equations: coarse stencil entries written as sums over offset pairs in $\{-1,0,1\}^2$ of products of $S$ entries and fine stencil entries of $A$ at doubled indices, with factors of 1/4]
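The triple matrix product can be checked on a tiny 1D example with dense matrices. The sketch assumes that S denotes projection and P interpolation, uses the 1D stencil (-1, 2, -1) as the fine operator, linear interpolation for P, and full weighting S = P^T / 2; all of these are assumptions for illustration, since the slide works with 2D stencils rather than explicit matrices.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def laplacian_1d(m):
    """m x m matrix of the stencil (-1, 2, -1) over interior points."""
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        A[i][i] = 2.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < m - 1:
            A[i][i + 1] = -1.0
    return A


def interpolation_1d(mc):
    """(2*mc + 1) x mc linear interpolation matrix P."""
    P = [[0.0] * mc for _ in range(2 * mc + 1)]
    for j in range(mc):
        f = 2 * j + 1          # fine point coinciding with coarse point j
        P[f][j] = 1.0
        P[f - 1][j] = 0.5
        P[f + 1][j] = 0.5
    return P


def projection_1d(mc):
    """mc x (2*mc + 1) full-weighting matrix S = P^T / 2."""
    P = interpolation_1d(mc)
    return [[0.5 * P[f][j] for f in range(len(P))] for j in range(mc)]


mc = 3
Af = laplacian_1d(2 * mc + 1)
Ac = matmul(matmul(projection_1d(mc), Af), interpolation_1d(mc))
```

For this example `Ac` comes out as one quarter of the same (-1, 2, -1) stencil, so the coarse operator again has local support and can be stored as a stencil; on the slide's 2D grids the same product is worked out term by term instead of forming matrices.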

Page 18

Results (NV30@500MHz)

• 257x257 grid
  – matrix multiply - 27 instructions
    • 1370 per second
  – interpolation 10 inst.
  – projection 19 inst.

• Overall performance
  – 257x257 at 80 fps!

Page 19

Conclusions

• Enhancements
  – global registers for reductions
  – texture fetch with offset
  – rectangular texture border
  – scalar versus vector problems

• Where are we now?
  – good streaming processor
  – twice as fast as CPU implementation
  – lots of room for improvement

• Scientific computing compiler
  – better languages! Brook? C*?
  – manage layout in a buffer