
Page 1:

Performance of Multigrid Solvers on GPUs


Lehrstuhl für Informatik 10 (Systemsimulation)

Universität Erlangen-Nürnberg

www10.informatik.uni-erlangen.de

Sydney, July 2010

WCCM Minisymposium Computational Mechanics on GPUs and modern many-core processors

Harald Köstler, Daniel Ritter, Christian Feichtinger and U. Rüde

(LSS Erlangen, [email protected])

Page 2:


Overview

Algorithms: Multigrid

Parallel Multigrid

Performance Engineering

Multigrid on GPUs

Applications for Image Processing

Multigrid on GPU-Clusters

waLBerla Framework

Conclusions

Page 3:


Multigrid and Lattice Boltzmann for CFD

Page 4:


Multigrid: V-Cycle

Goal: solve $A_h u_h = f_h$ using a hierarchy of grids

[Figure: V-cycle diagram - relax on the fine grid, form the residual, restrict it to the coarser grid, solve there by recursion, interpolate the correction back, correct, and relax again]
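To make the recursion concrete, here is a minimal host-side sketch of one V-cycle for a 1D Poisson problem. It is illustrative only: the damped-Jacobi smoother, full weighting, linear interpolation, and all names are our choices, not the code behind these slides.

```cpp
// Minimal 1D V-cycle sketch for -u'' = f on [0,1], Dirichlet boundaries.
// Grids have 2^k + 1 points including the two boundary points.
#include <cstdio>
#include <vector>

using Grid = std::vector<double>;

// Damped Jacobi sweeps on A_h u_h = f_h (the "relax" step).
void smooth(Grid& u, const Grid& f, double h, int sweeps) {
    const double w = 2.0 / 3.0;
    for (int s = 0; s < sweeps; ++s) {
        Grid old = u;
        for (std::size_t i = 1; i + 1 < u.size(); ++i)
            u[i] = (1.0 - w) * old[i]
                 + w * 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i]);
    }
}

// r = f - A_h u (the "residual" step).
Grid residual(const Grid& u, const Grid& f, double h) {
    Grid r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    return r;
}

// Full-weighting restriction, fine -> coarse (the "restrict" step).
Grid restrict_fw(const Grid& r) {
    Grid rc((r.size() - 1) / 2 + 1, 0.0);
    for (std::size_t i = 1; i + 1 < rc.size(); ++i)
        rc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);
    return rc;
}

// Linear interpolation of the coarse correction ("interpolate" + "correct").
void interpolate_correct(Grid& u, const Grid& ec) {
    for (std::size_t i = 0; i + 1 < ec.size(); ++i) {
        u[2 * i]     += ec[i];                     // coinciding fine point
        u[2 * i + 1] += 0.5 * (ec[i] + ec[i + 1]); // in-between fine point
    }
    u[u.size() - 1] += ec[ec.size() - 1];
}

// One V(nu1,nu2)-cycle; the coarse problem is solved "by recursion".
void vcycle(Grid& u, const Grid& f, double h, int nu1, int nu2) {
    if (u.size() <= 3) {                           // coarsest grid: one unknown
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1]); // solve exactly
        return;
    }
    smooth(u, f, h, nu1);                          // pre-smoothing
    Grid rc = restrict_fw(residual(u, f, h));      // residual, then restrict
    Grid ec(rc.size(), 0.0);
    vcycle(ec, rc, 2.0 * h, nu1, nu2);             // coarse-grid solve by recursion
    interpolate_correct(u, ec);                    // prolongate and correct
    smooth(u, f, h, nu2);                          // post-smoothing
}

int main() {
    const int n = (1 << 10) + 1;                   // 1025 points, h = 1/1024
    Grid u(n, 0.0), f(n, 1.0);                     // solve -u'' = 1
    for (int c = 0; c < 10; ++c)
        vcycle(u, f, 1.0 / (n - 1), 2, 2);         // V(2,2)-cycles
    std::printf("u(0.5) = %.6f (exact: 0.125)\n", u[n / 2]);
}
```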

Page 5:


Parallel High Performance FE Multigrid

Parallelize "plain vanilla" multigrid:

partition domain

parallelize all operations on all grids (see the halo-exchange sketch below)

use clever data structures
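A minimal sketch of what "parallelize all operations on all grids" means in practice: each rank owns a slab of every grid level and refreshes one ghost layer from its neighbours before each smoothing sweep. This is illustrative MPI code under assumed names, not the implementation used by the authors.

```cpp
// Ghost-layer exchange for a slab-partitioned grid level (illustrative).
// `slab` holds `local_rows` owned rows of `row_len` values plus one ghost
// row at each end; `up`/`down` are neighbour ranks, MPI_PROC_NULL at the
// physical boundary (which turns that transfer into a no-op).
#include <mpi.h>
#include <vector>

void exchange_ghosts(std::vector<double>& slab, int row_len, int local_rows,
                     int up, int down, MPI_Comm comm) {
    double* first_owned = slab.data() + row_len;              // row 1
    double* last_owned  = slab.data() + local_rows * row_len; // row local_rows
    double* ghost_top   = slab.data();                        // row 0
    double* ghost_bot   = slab.data() + (local_rows + 1) * row_len;
    // send first owned row up, receive the bottom ghost row from below
    MPI_Sendrecv(first_owned, row_len, MPI_DOUBLE, up,   0,
                 ghost_bot,   row_len, MPI_DOUBLE, down, 0, comm, MPI_STATUS_IGNORE);
    // send last owned row down, receive the top ghost row from above
    MPI_Sendrecv(last_owned,  row_len, MPI_DOUBLE, down, 1,
                 ghost_top,   row_len, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int row_len = 8, local_rows = 4;
    std::vector<double> slab((local_rows + 2) * row_len, double(rank));
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
    exchange_ghosts(slab, row_len, local_rows, up, down, MPI_COMM_WORLD);
    MPI_Finalize();
}
```

The same exchange runs on every grid level; on coarse levels the owned slab shrinks to a few rows, which is exactly where the short messages and idle processors discussed next come from.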

Do not worry (so much) about Coarse Grids

idle processors?

short messages?

sequential dependency in grid hierarchy?

Multigrid vs. Domain Decomposition

DD without a coarse grid does not scale (algorithmically) and is inefficient for large problems / many processors

DD with coarse grids is like multigrid and is just as difficult to parallelize

We get good results for parallel multigrid ...

Bey‘s Tetrahedral Refinement

Page 6:


#Cores   #unkn. (x 10^6)   Ph. 1 [s]   Ph. 2 [s]   Time to sol. [s]
     4          134.2         3.16        6.38*          37.9
     8          268.4         3.27        6.67*          39.3
    16          536.9         3.35        6.75*          40.3
    32        1,073.7         3.38        6.80*          40.6
    64        2,147.5         3.53        4.92           42.3
   128        4,295.0         3.60        7.06*          43.2
   252        8,455.7         3.87        7.39*          46.4
   504       16,911.4         3.96        5.44           47.6
  2040       68,451.0         4.92        5.60           59.0
  3825      128,345.7         6.90                       82.8
  4080      136,902.0                     5.68
  6102      205,353.1                     6.33
  8152      273,535.7                     7.43*
  9170      307,694.1                     7.75*

Parallel scalability of a scalar elliptic problem in 3D, discretized by tetrahedral finite elements. Times to solution on SGI Altix (Itanium-2, 1.6 GHz; LRZ Garching).

Largest problem solved to date: 3.07 × 10¹¹ DOFs (1.8 trillion tets) on 9170 processors in roughly 90 secs.

B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006; also: "Is 1.7 × 10¹⁰ unknowns the largest finite element system that can be solved today?", SuperComputing, Nov. 2005.

Page 7:


Multigrid for Image Processing on GPUs

Page 8:

Imaging in Gradient Space

Joint project with Siemens Medical Solutions

Applications like high dynamic range compression can be done efficiently in gradient space

Same principle as for Fourier space: the image has to be transformed to gradient space and back

Transformation to gradient space is fast (finite differences)

Transformation back requires the fast solution of a Poisson equation
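Made explicit, this is the standard gradient-domain formulation (e.g. Fattal et al. 2002; the notation here is generic, not copied from the slides): with finite differences one forms $\mathbf{G} = \nabla I$, edits it to $\tilde{\mathbf{G}} = \Phi(\mathbf{G})$ (for HDR compression, $\Phi$ attenuates large gradients), and recovers the output image as the least-squares potential of the edited field:

```latex
\min_{\tilde I} \int_\Omega \bigl\lVert \nabla\tilde I - \tilde{\mathbf{G}} \bigr\rVert^2 \, d\Omega
\;\Longleftrightarrow\;
\Delta \tilde I = \nabla\cdot\tilde{\mathbf{G}}
```

So the inverse transform is exactly one Poisson solve, which multigrid delivers in O(N) work per image.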


Page 9:


Red-black Splitting

Store red and black values in two different arrays

Doubles the performance
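A minimal CUDA sketch of the split layout (our illustration with assumed names and a simple Poisson smoother, not the kernel measured on the GTX 295): red points (i+j even) and black points (i+j odd) are packed into two half-width arrays, so each colored pass streams through contiguous memory, and the two passes together form one red-black Gauss-Seidel iteration.

```cuda
// Red-black Gauss-Seidel sweep with split storage (illustrative sketch).
// An (nx+2) x (ny+2) grid with Dirichlet halo is stored as two packed
// arrays of width W = (nx+2)/2: `red` holds points with i+j even,
// `black` those with i+j odd. All neighbours of a point have the other
// colour, so each pass reads one array and writes the other.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float* a, int n, float v) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = v;
}

__global__ void smooth_color(float* self, const float* other, const float* f,
                             int color, int nx, int ny, int W, float h2) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // packed column
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int t = (j + color) & 1;        // column offset of this color in row j
    int i = 2 * k + t;              // unpacked fine-grid column
    if (i < 1 || i > nx || j < 1 || j > ny) return;  // leave halo untouched
    float up    = other[(j - 1) * W + k];
    float down  = other[(j + 1) * W + k];
    float left  = other[j * W + k + t - 1];
    float right = other[j * W + k + t];
    self[j * W + k] = 0.25f * (up + down + left + right + h2 * f[j * W + k]);
}

int main() {
    const int nx = 4094, ny = 4094;           // interior points, nx even
    const int W = (nx + 2) / 2, n = W * (ny + 2);
    const float h = 1.0f / (nx + 1);
    float *red, *black, *f_red, *f_black;
    cudaMalloc(&red,     n * sizeof(float));
    cudaMalloc(&black,   n * sizeof(float));
    cudaMalloc(&f_red,   n * sizeof(float));
    cudaMalloc(&f_black, n * sizeof(float));
    cudaMemset(red,   0, n * sizeof(float));  // u = 0, halo stays 0
    cudaMemset(black, 0, n * sizeof(float));
    fill<<<(n + 255) / 256, 256>>>(f_red,   n, 1.0f);  // rhs f = 1, split
    fill<<<(n + 255) / 256, 256>>>(f_black, n, 1.0f);  // the same way as u
    dim3 tb(16, 16), gb((W + 15) / 16, (ny + 2 + 15) / 16);
    for (int it = 0; it < 100; ++it) {        // one iteration = two passes
        smooth_color<<<gb, tb>>>(red,   black, f_red,   0, nx, ny, W, h * h);
        smooth_color<<<gb, tb>>>(black, red,   f_black, 1, nx, ny, W, h * h);
    }
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(red); cudaFree(black); cudaFree(f_red); cudaFree(f_black);
}
```

Splitting the colors into separate arrays is the point of the layout: a strided checkerboard wastes half of every memory transaction, while the packed arrays are read and written at full bandwidth.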


Page 10:


Multigrid on GTX 295 with red-black splitting

[Figure: runtime of one V(2,2)-cycle in ms (0-18 ms) versus image size]

Page 11:


Memory Bandwidth


As a percentage of the maximum measured (rounded) streaming bandwidth (100 GB/s)

[Figure: achieved percentage of memory bandwidth (0-100%) versus image size]
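Such a percentage comes from a straightforward division; a sketch of the accounting (the 12-byte traffic figure and the example rate are our assumptions, not measurements from the talk):

```latex
\text{utilization} = \frac{B_{\text{update}} \cdot R}{B_{\text{stream}}}
\qquad\text{e.g.}\qquad
\frac{12\,\mathrm{B/update} \,\times\, 2.5\times 10^{9}\,\mathrm{updates/s}}{100\,\mathrm{GB/s}} = 30\,\%
```

Here $B_{\text{update}}$ is the compulsory memory traffic per lattice update (in single precision roughly read $u$, read $f$, write $u$, i.e. 12 bytes, if neighbour values come from cache) and $R$ is the measured update rate.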

Page 12:


Using Clusters of GPUs with waLBerla

Page 13:

waLBerla: Parallel LBM Framework for CFD Applications


Page 14:

Moderately Parallel Computation: 1000 Bubbles

Simulation: 1000 bubbles, 510 × 510 × 530 = 1.4 × 10⁸ lattice cells, 70,000 time steps, 77 GB, 64 processes, 72 hours (4,608 core hours)

Visualization: 770 images, approx. 12,000 core hours for rendering

Best Paper Award, Stefan Donath (LSS Erlangen) at ParCFD, May 2009


Page 15:

Simulation of Segregation Processes


Segregation simulation of 12,013 objects with two different shapes at different time steps, simulated on 2,048 cores in a box. Density values of 0.8 kg/dm³ and 1.2 kg/dm³ are used for the objects in water with density 1 kg/dm³ and a gravitational field. Lighter particles rise to the top of the box, while heavier particles sink to the bottom.

Page 16:

Weak Scaling

[Figure: parallel efficiency (0.5-1.0) versus number of cores (100 to 300,000) for 40×40×40 and 80×80×80 lattice cells per core]

Scaling from 64 to 294,912 cores

sparsely packed particles: 150,994,944,000 lattice cells, 83,804,982 rigid spherical objects

Largest simulation to date: 8 trillion (10¹²) variables per time step (LBM alone), 50 TByte

Jugene Blue Gene/P, Jülich Supercomputing Centre

Page 17:

waLBerla Software Architecture for GPU Usage

Patch Architecture

Only LBM on GPU: no free surfaces, no FSI

NEC Nehalem cluster: Xeon E5560, 2.8 GHz, 12 GB per node
2 GPUs per node (NVIDIA Tesla S1070), 30 nodes, up to 60 GPUs

Page 18:

GPU Performance Results and Comparison

How far away is "real-time CFD"? 25 GLup/s would compute 25 frames per second for an LBM grid with resolution 1000 × 1000 × 1000 (see the estimate after the comparison below).

Up to 500 MLup/s on a single GPU for the plain LBM kernel (SP); 250 MLup/s per GPU in the cluster; compares to 75 MLup/s for a Nehalem node (8 cores)

A GPU node (2 GPUs) delivers performance like 6 Nehalem nodes (48 cores) or 75 IBM Blue Gene/P nodes

30 GPU nodes (60 GPUs) are equivalent to 137 Nehalem nodes (1,096 cores) or 1,275 Jugene/P nodes (5,100 cores)
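A back-of-the-envelope check of the real-time target against the measured rates above (our arithmetic, not from the slides):

```latex
1000^3\ \text{cells} \times 25\ \text{frames/s} = 2.5\times10^{10}\ \text{Lup/s} = 25\ \text{GLup/s},
\qquad
\frac{25\ \text{GLup/s}}{250\ \text{MLup/s per in-cluster GPU}} = 100\ \text{GPUs}
```

So the 60-GPU cluster above comes within a factor of two of real time at this resolution, counting only the plain LBM kernel.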

Page 19:


Results for Multigrid within waLBerla

Page 20:

Multi-GPU: Weak Scaling within waLBerla

[Figure: weak scaling for 256³ unknowns per processing unit - runtime for a (2,2)-V-cycle in seconds (0-2.5 s) versus number of processing units (0-16), with curves for the CPU and GPU versions in single and double precision]

Scalability of the 3D multigrid solver in waLBerla looks good, but absolute performance is not yet satisfactory. Compare with stand-alone 2D results: 0.019 s for 4096² unknowns.

Page 21:


Conclusions and Outlook

Page 22:

SIAM CS&E 2011


CS&E Applications: in disciplinary areas such as physics, chemistry, biology, etc.; in multi-disciplinary and emerging areas; in industry

Modeling and Simulation: multi-physics and multi-scale problems, kinetic methods, meshless methods, molecular and particle-based methods, discrete and event-driven models, hybrid models, validation and verification, uncertainty quantification

Chairs: Padma Raghavan, Pennsylvania State University; Ulrich Ruede, Erlangen

Page 23:

Acknowledgements

Collaborators in Erlangen: WTM, LSE, LSTM, LGDV, RRZE, LME, Neurozentrum, Radiologie, Applied Mathematics, Theoretical Physics, etc.

Especially for foams: C. Körner (WTM)

International: Utah, Technion, Constanta, Ghent, Boulder, München, CAS, Zürich, Delhi, ...

Dissertation projects:

N. Thürey, T. Pohl, S. Donath, S. Bogner (LBM, free surfaces, two-phase flows)

M. Mohr, B. Bergen, U. Fabricius, H. Köstler, C. Freundl, T. Gradl, B. Gmeiner (massively parallel PDE solvers)

M. Kowarschik, J. Treibig, M. Stürmer, J. Habich (architecture-aware algorithms)

K. Iglberger, T. Preclik, K. Pickel (rigid body dynamics)

J. Götz, C. Feichtinger (massively parallel LBM software, suspensions)

C. Mihoubi, D. Bartuschat (complex geometries, parallel LBM)

(Long-term) guests in 2009-2010:

Dr. S. Ganguly, IIT Kharagpur (Humboldt): electroosmotic flows

Prof. V. Buwa, IIT Delhi (Humboldt): gas-fluid-solid flows

Felipe Aristizabal, McGill Univ., Canada: LBM with Brownian motion

Prof. Popa, Constanta, Romania (DAAD): numerical linear algebra

Prof. N. Zakaria, Universiti Petronas, Malaysia

Prof. Hanke, Prof. Oppelstrup, KTH Stockholm (DAAD): mathematical modelling

~25 Diplom/Master theses, ~30 Bachelor theses

Funding by KONWIHR, DFG, BMBF, EU, Elitenetzwerk Bayern


Page 24:


Thanks for your attention!

Questions?

Slides, reports, theses, and animations are available for download at:

www10.informatik.uni-erlangen.de