
Page 1:

Performance of Multigrid Solvers on GPUs


Lehrstuhl für Informatik 10 (Systemsimulation)

Universität Erlangen-Nürnberg

www10.informatik.uni-erlangen.de

Sydney, July 2010

WCCM Minisymposium Computational Mechanics on GPUs and modern many-core processors

Harald Köstler, Daniel Ritter, Christian Feichtinger and U. Rüde

(LSS Erlangen, [email protected])

Page 2:


Overview

Algorithms: Multigrid

Parallel Multigrid

Performance Engineering

Multigrid on GPUs

Applications for Image Processing

Multigrid on GPU-Clusters

waLBerla Framework

Conclusions

Page 3:


Multigrid and Lattice Boltzmann for CFD

Page 4:


Multigrid: V-Cycle

Goal: solve $A_h u_h = f_h$ using a hierarchy of grids

[Figure: V-cycle diagram - relax on the fine grid, form the residual, restrict it to the coarser grid, solve there by recursion, interpolate the correction back, correct, and relax again]
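To make the recursion concrete, here is a minimal host-side sketch of one V-cycle for a 1D Poisson problem. It is illustrative only: the damped-Jacobi smoother, full weighting, linear interpolation, and all names are our choices, not the code behind these slides.

```cpp
// Minimal 1D V-cycle sketch for -u'' = f on [0,1], Dirichlet boundaries.
// Grids have 2^k + 1 points including the two boundary points.
#include <cstdio>
#include <vector>

using Grid = std::vector<double>;

// Damped Jacobi sweeps on A_h u_h = f_h (the "relax" step).
void smooth(Grid& u, const Grid& f, double h, int sweeps) {
    const double w = 2.0 / 3.0;
    for (int s = 0; s < sweeps; ++s) {
        Grid old = u;
        for (std::size_t i = 1; i + 1 < u.size(); ++i)
            u[i] = (1.0 - w) * old[i]
                 + w * 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i]);
    }
}

// r = f - A_h u (the "residual" step).
Grid residual(const Grid& u, const Grid& f, double h) {
    Grid r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    return r;
}

// Full-weighting restriction, fine -> coarse (the "restrict" step).
Grid restrict_fw(const Grid& r) {
    Grid rc((r.size() - 1) / 2 + 1, 0.0);
    for (std::size_t i = 1; i + 1 < rc.size(); ++i)
        rc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);
    return rc;
}

// Linear interpolation of the coarse correction ("interpolate" + "correct").
void interpolate_correct(Grid& u, const Grid& ec) {
    for (std::size_t i = 0; i + 1 < ec.size(); ++i) {
        u[2 * i]     += ec[i];                     // coinciding fine point
        u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1]); // in-between fine point
    }
    u[u.size() - 1] += ec[ec.size() - 1];
}

// One V(nu1,nu2)-cycle; the coarse problem is solved "by recursion".
void vcycle(Grid& u, const Grid& f, double h, int nu1, int nu2) {
    if (u.size() <= 3) {                           // coarsest grid: one unknown
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1]); // solve exactly
        return;
    }
    smooth(u, f, h, nu1);                          // pre-smoothing
    Grid rc = restrict_fw(residual(u, f, h));      // residual, then restrict
    Grid ec(rc.size(), 0.0);
    vcycle(ec, rc, 2.0 * h, nu1, nu2);             // coarse-grid solve by recursion
    interpolate_correct(u, ec);                    // prolongate and correct
    smooth(u, f, h, nu2);                          // post-smoothing
}

int main() {
    const int n = (1 << 10) + 1;                   // 1025 points, h = 1/1024
    Grid u(n, 0.0), f(n, 1.0);                     // solve -u'' = 1
    for (int c = 0; c < 10; ++c)
        vcycle(u, f, 1.0 / (n - 1), 2, 2);         // V(2,2)-cycles
    std::printf("u(0.5) = %.6f (exact: 0.125)\n", u[n / 2]);
}
```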

Page 5:


Parallel High Performance FE Multigrid

Parallelize "plain vanilla" multigrid:

partition domain

parallelize all operations on all grids (see the halo-exchange sketch below)

use clever data structures
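A minimal sketch of what "parallelize all operations on all grids" means in practice: each rank owns a slab of every grid level and refreshes one ghost layer from its neighbours before each smoothing sweep. This is illustrative MPI code under assumed names, not the implementation used by the authors.

```cpp
// Ghost-layer exchange for a slab-partitioned grid level (illustrative).
// `slab` holds `local_rows` owned rows of `row_len` values plus one ghost
// row at each end; `up`/`down` are neighbour ranks, MPI_PROC_NULL at the
// physical boundary (which turns that transfer into a no-op).
#include <mpi.h>
#include <vector>

void exchange_ghosts(std::vector<double>& slab, int row_len, int local_rows,
                     int up, int down, MPI_Comm comm) {
    double* first_owned = slab.data() + row_len;              // row 1
    double* last_owned  = slab.data() + local_rows * row_len; // row local_rows
    double* ghost_top   = slab.data();                        // row 0
    double* ghost_bot   = slab.data() + (local_rows + 1) * row_len;
    // send first owned row up, receive the bottom ghost row from below
    MPI_Sendrecv(first_owned, row_len, MPI_DOUBLE, up,   0,
                 ghost_bot,   row_len, MPI_DOUBLE, down, 0, comm, MPI_STATUS_IGNORE);
    // send last owned row down, receive the top ghost row from above
    MPI_Sendrecv(last_owned,  row_len, MPI_DOUBLE, down, 1,
                 ghost_top,   row_len, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int row_len = 8, local_rows = 4;
    std::vector<double> slab((local_rows + 2) * row_len, double(rank));
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
    exchange_ghosts(slab, row_len, local_rows, up, down, MPI_COMM_WORLD);
    MPI_Finalize();
}
```

The same exchange runs on every grid level; on coarse levels the owned slab shrinks to a few rows, which is exactly where the short messages and idle processors discussed next come from.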

Do not worry (so much) about Coarse Grids

idle processors?

short messages?

sequential dependency in grid hierarchy?

Multigrid vs. Domain Decomposition

DD without a coarse grid does not scale (algorithmically) and is inefficient for large problems / many processors

DD with coarse grids is like multigrid and is just as difficult to parallelize

We get good results for parallel multigrid ...

Bey‘s Tetrahedral Refinement

Page 6:


#Cores   #unkn. (x 10^6)   Ph. 1 [s]   Ph. 2 [s]   Time to sol. [s]
     4          134.2         3.16        6.38*          37.9
     8          268.4         3.27        6.67*          39.3
    16          536.9         3.35        6.75*          40.3
    32        1,073.7         3.38        6.80*          40.6
    64        2,147.5         3.53        4.92           42.3
   128        4,295.0         3.60        7.06*          43.2
   252        8,455.7         3.87        7.39*          46.4
   504       16,911.4         3.96        5.44           47.6
  2040       68,451.0         4.92        5.60           59.0
  3825      128,345.7         6.90                       82.8
  4080      136,902.0                     5.68
  6102      205,353.1                     6.33
  8152      273,535.7                     7.43*
  9170      307,694.1                     7.75*

Parallel scalability of a scalar elliptic problem in 3D, discretized by tetrahedral finite elements. Times to solution on SGI Altix (Itanium-2, 1.6 GHz; LRZ Garching).

Largest problem solved to date: 3.07 × 10¹¹ DOFs (1.8 trillion tets) on 9170 processors in roughly 90 secs.

B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006; also: "Is 1.7 × 10¹⁰ unknowns the largest finite element system that can be solved today?", SuperComputing, Nov. 2005.

Page 7:


Multigrid for Image Processing on GPUs

Page 8:

Imaging in Gradient Space

Joint project with Siemens Medical Solutions

Applications like high dynamic range compression can be done efficiently in gradient space

Same principle as for Fourier space: the image has to be transformed to gradient space and back

Transformation to gradient space is fast (finite differences)

Transformation back requires the fast solution of a Poisson equation
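Made explicit, this is the standard gradient-domain formulation (e.g. Fattal et al. 2002; the notation here is generic, not copied from the slides): with finite differences one forms $\mathbf{G} = \nabla I$, edits it to $\tilde{\mathbf{G}} = \Phi(\mathbf{G})$ (for HDR compression, $\Phi$ attenuates large gradients), and recovers the output image as the least-squares potential of the edited field:

```latex
\min_{\tilde I} \int_\Omega \bigl\lVert \nabla\tilde I - \tilde{\mathbf{G}} \bigr\rVert^2 \, d\Omega
\;\Longleftrightarrow\;
\Delta \tilde I = \nabla\cdot\tilde{\mathbf{G}}
```

So the inverse transform is exactly one Poisson solve, which multigrid delivers in O(N) work per image.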


Page 9:


Red-black Splitting

Store red and black values in two different arrays

Doubles the performance
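A minimal CUDA sketch of the split layout (our illustration with assumed names and a simple Poisson smoother, not the kernel measured on the GTX 295): red points (i+j even) and black points (i+j odd) are packed into two half-width arrays, so each colored pass streams through contiguous memory, and the two passes together form one red-black Gauss-Seidel iteration.

```cuda
// Red-black Gauss-Seidel sweep with split storage (illustrative sketch).
// An (nx+2) x (ny+2) grid with Dirichlet halo is stored as two packed
// arrays of width W = (nx+2)/2: `red` holds points with i+j even,
// `black` those with i+j odd. All neighbours of a point have the other
// colour, so each pass reads one array and writes the other.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float* a, int n, float v) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = v;
}

__global__ void smooth_color(float* self, const float* other, const float* f,
                             int color, int nx, int ny, int W, float h2) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // packed column
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int t = (j + color) & 1;        // column offset of this color in row j
    int i = 2 * k + t;              // unpacked fine-grid column
    if (i < 1 || i > nx || j < 1 || j > ny) return;  // leave halo untouched
    float up    = other[(j - 1) * W + k];
    float down  = other[(j + 1) * W + k];
    float left  = other[j * W + k + t - 1];
    float right = other[j * W + k + t];
    self[j * W + k] = 0.25f * (up + down + left + right + h2 * f[j * W + k]);
}

int main() {
    const int nx = 4094, ny = 4094;           // interior points, nx even
    const int W = (nx + 2) / 2, n = W * (ny + 2);
    const float h = 1.0f / (nx + 1);
    float *red, *black, *f_red, *f_black;
    cudaMalloc(&red,     n * sizeof(float));
    cudaMalloc(&black,   n * sizeof(float));
    cudaMalloc(&f_red,   n * sizeof(float));
    cudaMalloc(&f_black, n * sizeof(float));
    cudaMemset(red,   0, n * sizeof(float));  // u = 0, halo stays 0
    cudaMemset(black, 0, n * sizeof(float));
    fill<<<(n + 255) / 256, 256>>>(f_red,   n, 1.0f);  // rhs f = 1, split
    fill<<<(n + 255) / 256, 256>>>(f_black, n, 1.0f);  // the same way as u
    dim3 tb(16, 16), gb((W + 15) / 16, (ny + 2 + 15) / 16);
    for (int it = 0; it < 100; ++it) {        // one iteration = two passes
        smooth_color<<<gb, tb>>>(red,   black, f_red,   0, nx, ny, W, h * h);
        smooth_color<<<gb, tb>>>(black, red,   f_black, 1, nx, ny, W, h * h);
    }
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(red); cudaFree(black); cudaFree(f_red); cudaFree(f_black);
}
```

Splitting the colors into separate arrays is the point of the layout: a strided checkerboard wastes half of every memory transaction, while the packed arrays are read and written at full bandwidth.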


Page 10:


Multigrid on GTX 295 with red-black splitting

[Figure: runtime of one V(2,2)-cycle in ms (0-18 ms) versus image size]

Page 11:


Memory Bandwidth


As a percentage of the maximum measured (rounded) streaming bandwidth (100 GB/s)

[Figure: achieved percentage of memory bandwidth (0-100%) versus image size]
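Such a percentage comes from a straightforward division; a sketch of the accounting (the 12-byte traffic figure and the example rate are our assumptions, not measurements from the talk):

```latex
\text{utilization} = \frac{B_{\text{update}} \cdot R}{B_{\text{stream}}}
\qquad\text{e.g.}\qquad
\frac{12\,\mathrm{B/update} \,\times\, 2.5\times 10^{9}\,\mathrm{updates/s}}{100\,\mathrm{GB/s}} = 30\,\%
```

Here $B_{\text{update}}$ is the compulsory memory traffic per lattice update (in single precision roughly read $u$, read $f$, write $u$, i.e. 12 bytes, if neighbour values come from cache) and $R$ is the measured update rate.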

Page 12:


Using Clusters of GPUs with waLBerla

Page 13:

waLBerla: Parallel LBM Framework for CFD Applications


Page 14:

Moderately Parallel Computation: 1000 Bubbles

Simulation: 1000 bubbles, 510 × 510 × 530 = 1.4 × 10⁸ lattice cells, 70,000 time steps, 77 GB, 64 processes, 72 hours (4,608 core hours)

Visualization: 770 images, approx. 12,000 core hours for rendering

Best Paper Award, Stefan Donath (LSS Erlangen) at ParCFD, May 2009


Page 15:

Simulation of Segregation Processes


Segregation simulation of 12,013 objects with two different shapes at different time steps, simulated on 2,048 cores in a box. Density values of 0.8 kg/dm³ and 1.2 kg/dm³ are used for the objects in water with density 1 kg/dm³ and a gravitational field. Lighter particles rise to the top of the box, while heavier particles sink to the bottom.

Page 16:

Weak Scaling

[Figure: parallel efficiency (0.5-1.0) versus number of cores (100 to 300,000) for 40×40×40 and 80×80×80 lattice cells per core]

Scaling from 64 to 294,912 cores

sparsely packed particles: 150,994,944,000 lattice cells, 83,804,982 rigid spherical objects

Largest simulation to date: 8 trillion (10¹²) variables per time step (LBM alone), 50 TByte

Jugene Blue Gene/P, Jülich Supercomputing Centre

Page 17:

waLBerla Software Architecture for GPU Usage

Patch Architecture

Only LBM on GPU: no free surfaces, no FSI

NEC Nehalem cluster: Xeon E5560, 2.8 GHz, 12 GB per node
2 GPUs per node (NVIDIA Tesla S1070), 30 nodes, up to 60 GPUs

Page 18:

GPU Performance Results and Comparison

How far away is "real-time CFD"? 25 GLup/s would compute 25 frames per second for an LBM grid with resolution 1000 × 1000 × 1000 (see the estimate after the comparison below).

Up to 500 MLup/s on a single GPU for the plain LBM kernel (SP); 250 MLup/s per GPU in the cluster; compares to 75 MLup/s for a Nehalem node (8 cores)

A GPU node (2 GPUs) delivers performance like 6 Nehalem nodes (48 cores) or 75 IBM Blue Gene/P nodes

30 GPU nodes (60 GPUs) are equivalent to 137 Nehalem nodes (1,096 cores) or 1,275 Jugene/P nodes (5,100 cores)
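A back-of-the-envelope check of the real-time target against the measured rates above (our arithmetic, not from the slides):

```latex
1000^3\ \text{cells} \times 25\ \text{frames/s} = 2.5\times10^{10}\ \text{Lup/s} = 25\ \text{GLup/s},
\qquad
\frac{25\ \text{GLup/s}}{250\ \text{MLup/s per in-cluster GPU}} = 100\ \text{GPUs}
```

So the 60-GPU cluster above comes within a factor of two of real time at this resolution, counting only the plain LBM kernel.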

Page 19:


Results for Multigrid within waLBerla

Page 20:

Multi-GPU: Weak Scaling within waLBerla

[Figure: weak scaling for 256³ unknowns per processing unit - runtime for a (2,2)-V-cycle in seconds (0-2.5 s) versus number of processing units (0-16), with curves for the CPU and GPU versions in single and double precision]

Scalability of the 3D multigrid solver in waLBerla looks good, but absolute performance is not yet satisfactory. Compare with stand-alone 2D results: 0.019 s for 4096² unknowns.

Page 21:


Conclusions and Outlook

Page 22:

SIAM CS&E 2011


CS&E Applications: in disciplinary areas such as physics, chemistry, biology, etc.; in multi-disciplinary and emerging areas; in industry

Modeling and Simulation: multi-physics and multi-scale problems, kinetic methods, meshless methods, molecular and particle-based methods, discrete and event-driven models, hybrid models, validation and verification, uncertainty quantification

Chairs: Padma Raghavan, Pennsylvania State University; Ulrich Ruede, Erlangen

Page 23:

Acknowledgements

Collaborators in Erlangen: WTM, LSE, LSTM, LGDV, RRZE, LME, Neurozentrum, Radiologie, Applied Mathematics, Theoretical Physics, etc.

Especially for foams: C. Körner (WTM)

International: Utah, Technion, Constanta, Ghent, Boulder, München, CAS, Zürich, Delhi, ...

Dissertation projects:

N. Thürey, T. Pohl, S. Donath, S. Bogner (LBM, free surfaces, two-phase flows)

M. Mohr, B. Bergen, U. Fabricius, H. Köstler, C. Freundl, T. Gradl, B. Gmeiner (massively parallel PDE solvers)

M. Kowarschik, J. Treibig, M. Stürmer, J. Habich (architecture-aware algorithms)

K. Iglberger, T. Preclik, K. Pickel (rigid body dynamics)

J. Götz, C. Feichtinger (massively parallel LBM software, suspensions)

C. Mihoubi, D. Bartuschat (complex geometries, parallel LBM)

(Long-term) guests in 2009-2010:

Dr. S. Ganguly, IIT Kharagpur (Humboldt): electroosmotic flows

Prof. V. Buwa, IIT Delhi (Humboldt): gas-fluid-solid flows

Felipe Aristizabal, McGill Univ., Canada: LBM with Brownian motion

Prof. Popa, Constanta, Romania (DAAD): numerical linear algebra

Prof. N. Zakaria, Universiti Petronas, Malaysia

Prof. Hanke, Prof. Oppelstrup, KTH Stockholm (DAAD): mathematical modelling

~25 Diplom/Master theses, ~30 Bachelor theses

Funding by KONWIHR, DFG, BMBF, EU, Elitenetzwerk Bayern


Page 24:


Thanks for your attention!

Questions?

Slides, reports, theses, and animations are available for download at:

www10.informatik.uni-erlangen.de