Performance of Multigrid Solvers on GPUs
1
Lehrstuhl für Informatik 10 (Systemsimulation)
Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
Sydney, July 2010
WCCM Minisymposium: Computational Mechanics on GPUs and modern many-core processors
Harald Köstler, Daniel Ritter, Christian Feichtinger and U. Rüde
(LSS Erlangen, [email protected])
2
Overview
Algorithms: Multigrid
Parallel Multigrid
Performance Engineering
Multigrid on GPUs
Applications for Image Processing
Multigrid on GPU-Clusters
waLBerla Framework
Conclusions
3
Multigrid and Lattice Boltzmann for CFD
4
Multigrid: V-Cycle
Goal: solve A^h u^h = f^h using a hierarchy of grids
Relax on the fine grid (pre-smoothing)
Compute the residual
Restrict the residual to the coarser grid
Solve the coarse-grid problem by recursion
Interpolate the correction
Correct the fine-grid approximation and relax again (post-smoothing)
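To make the recursion concrete, here is a minimal sketch of a V-cycle for the 1D model problem -u'' = f; the weighted-Jacobi smoother (omega = 2/3), full-weighting restriction, linear interpolation, and the V(2,2) choice are illustrative assumptions, not necessarily the exact components used later in the talk.

```c
/* Minimal 1D multigrid V(2,2)-cycle for -u'' = f on (0,1), u(0)=u(1)=0.
 * Illustrative sketch only: smoother, transfer operators and grid sizes
 * are assumptions, not the production solver from the talk. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* weighted Jacobi sweeps (omega = 2/3) on the n+1 point grid, spacing h */
static void smooth(double *u, const double *f, int n, double h, int iters) {
    double *tmp = malloc((n + 1) * sizeof(double));
    for (int it = 0; it < iters; ++it) {
        for (int i = 1; i < n; ++i)
            tmp[i] = u[i] / 3.0 + (u[i-1] + u[i+1] + h * h * f[i]) / 3.0;
        memcpy(u + 1, tmp + 1, (n - 1) * sizeof(double));
    }
    free(tmp);
}

static void vcycle(double *u, const double *f, int n, double h) {
    if (n <= 2) {                       /* coarsest grid: direct solve */
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1]);
        return;
    }
    smooth(u, f, n, h, 2);              /* pre-smoothing */
    int nc = n / 2;
    double *r  = calloc(n + 1,  sizeof(double));  /* fine residual      */
    double *fc = calloc(nc + 1, sizeof(double));  /* restricted residual */
    double *ec = calloc(nc + 1, sizeof(double));  /* coarse error, = 0   */
    for (int i = 1; i < n; ++i)         /* r = f - A u */
        r[i] = f[i] - (2.0 * u[i] - u[i-1] - u[i+1]) / (h * h);
    for (int i = 1; i < nc; ++i)        /* full-weighting restriction */
        fc[i] = 0.25 * (r[2*i-1] + 2.0 * r[2*i] + r[2*i+1]);
    vcycle(ec, fc, nc, 2.0 * h);        /* solve coarse problem by recursion */
    for (int i = 1; i < nc; ++i)        /* linear interpolation + correction */
        u[2*i] += ec[i];
    for (int i = 0; i < nc; ++i)
        u[2*i+1] += 0.5 * (ec[i] + ec[i+1]);
    smooth(u, f, n, h, 2);              /* post-smoothing */
    free(r); free(fc); free(ec);
}

int main(void) {
    int n = 64;
    double h = 1.0 / n;
    double *u = calloc(n + 1, sizeof(double));
    double *f = malloc((n + 1) * sizeof(double));
    for (int i = 0; i <= n; ++i) f[i] = 1.0;
    for (int c = 0; c < 10; ++c) vcycle(u, f, n, h);
    printf("u(0.5) = %.6f (exact: 0.125)\n", u[n/2]);
    free(u); free(f);
    return 0;
}
```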
5
Parallel High-Performance FE Multigrid
Parallelize "plain vanilla" multigrid:
partition domain
parallelize all operations on all grids (see the halo-exchange sketch below)
use clever data structures
Do not worry (so much) about Coarse Grids
idle processors?
short messages?
sequential dependency in grid hierarchy?
Multigrid vs. Domain Decomposition
DD without a coarse grid does not scale (algorithmically) and is inefficient for large problems / many processors
DD with coarse grids is like multigrid and is just as difficult to parallelize
We get good results for parallel multigrid ...
Bey's Tetrahedral Refinement
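Parallelizing all operations on all grids means every smoothing or residual sweep must first refresh the ghost (halo) layers of its partition. Below is a minimal sketch of such an exchange for a 1D row partition, assuming MPI; the actual HHG data structures behind the results on the next slide are more elaborate.

```c
/* Sketch: ghost-layer exchange for a grid partitioned by rows across
 * MPI ranks, as needed before each sweep on each level. Illustrative
 * only; layout and partitioning are assumptions. */
#include <mpi.h>
#include <stdlib.h>

/* each rank owns `rows` rows of `cols` points, plus one ghost row
 * above (row 0) and one below (row rows+1) */
static void exchange_ghosts(double *u, int rows, int cols, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send first owned row up, receive ghost row from below */
    MPI_Sendrecv(&u[1 * cols],          cols, MPI_DOUBLE, up,   0,
                 &u[(rows + 1) * cols], cols, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send last owned row down, receive ghost row from above */
    MPI_Sendrecv(&u[rows * cols],       cols, MPI_DOUBLE, down, 1,
                 &u[0],                 cols, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rows = 4, cols = 8;
    double *u = calloc((rows + 2) * cols, sizeof(double));
    exchange_ghosts(u, rows, cols, MPI_COMM_WORLD);
    free(u);
    MPI_Finalize();
    return 0;
}
```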
6
#Cores   #unknowns ×10⁶   Ph. 1 [s]   Ph. 2 [s]   Time to sol. [s]
4             134.2          3.16        6.38*         37.9
8             268.4          3.27        6.67*         39.3
16            536.9          3.35        6.75*         40.3
32          1,073.7          3.38        6.80*         40.6
64          2,147.5          3.53        4.92          42.3
128         4,295.0          3.60        7.06*         43.2
252         8,455.7          3.87        7.39*         46.4
504        16,911.4          3.96        5.44          47.6
2040       68,451.0          4.92        5.60          59.0
3825      128,345.7          6.90          –           82.8
4080      136,902.0          5.68
6102      205,353.1          6.33
8152      273,535.7          7.43*
9170      307,694.1          7.75*
Parallel scalability of a scalar elliptic problem in 3D, discretized by tetrahedral finite elements.
Times to solution on SGI Altix (Itanium-2, 1.6 GHz, LRZ Garching).
Largest problem solved to date: 3.07 × 10¹¹ DOFs (1.8 trillion tets) on 9170 processors in roughly 90 seconds.
B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006;
also: "Is 1.7 × 10¹⁰ unknowns the largest finite element system that can be solved today?", SuperComputing, Nov. 2005.
7
Multigrid for Image Processing on GPUs
Imaging in Gradient Space
Joint project with Siemens Medical Solutions
Applications like high dynamic range compression can be done efficiently in gradient space
Same principle as for Fourier space: the image is transformed to gradient space and back
Transformation to gradient space is fast (finite differences)
Transformation back requires a fast solution of a Poisson equation
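In outline (a sketch under assumed conventions, not Siemens' actual pipeline): build the gradient field with forward differences, manipulate it (e.g. attenuate large gradients for HDR compression), then recover the image by solving the Poisson equation lap(u) = div(g). All function names below are illustrative.

```c
/* Gradient-space round trip for a row-major image: forward differences
 * to gradient space, then the Poisson right-hand side div(g) for the
 * back-transformation. The Poisson solve itself (e.g. the multigrid
 * V-cycle sketched earlier) is left abstract. */
static void to_gradient_space(const float *img, float *gx, float *gy,
                              int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int i = y * w + x;
            gx[i] = (x + 1 < w) ? img[i + 1] - img[i] : 0.0f; /* forward diff */
            gy[i] = (y + 1 < h) ? img[i + w] - img[i] : 0.0f;
        }
}

/* rhs of  lap(u) = div(g)  via backward differences (together with the
 * forward differences above this yields the standard 5-point Laplacian) */
static void divergence(const float *gx, const float *gy, float *rhs,
                       int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int i = y * w + x;
            float dgx = gx[i] - ((x > 0) ? gx[i - 1] : 0.0f);
            float dgy = gy[i] - ((y > 0) ? gy[i - w] : 0.0f);
            rhs[i] = dgx + dgy;
        }
}

int main(void) {
    enum { WID = 8, HGT = 8 };
    float img[WID * HGT], gx[WID * HGT], gy[WID * HGT], rhs[WID * HGT];
    for (int i = 0; i < WID * HGT; ++i) img[i] = (float)(i % WID);
    to_gradient_space(img, gx, gy, WID, HGT);
    /* ...here one would attenuate large |g| for HDR compression... */
    divergence(gx, gy, rhs, WID, HGT);
    /* rhs now feeds the Poisson solve that recovers the image */
    return 0;
}
```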
8
Red-black Splitting
Store red and black values in two different arrays
Doubles the performance
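Why the split layout helps: with each colour stored contiguously, a half-sweep of red-black Gauss-Seidel streams dense arrays instead of touching every other element, which suits coalesced GPU memory access. A minimal CPU sketch with split storage follows; on the GPU, the loop body becomes one thread per cell of the active colour. The packing scheme and constants are illustrative assumptions.

```c
/* Red-black Gauss-Seidel for the 2D Poisson equation with split storage:
 * red and black unknowns of a W x H grid live in two dense arrays with
 * W/2 values per row, so each half-sweep streams contiguous memory. */
#include <stdlib.h>

enum { W = 512, H = 512 };           /* grid size, W even */

/* one half-sweep updating cells of colour c (0 = red, 1 = black),
 * stored row-wise in `mine`; the other colour is read from `other` */
static void half_sweep(float *mine, const float *other, const float *f,
                       int c, float h2) {
    for (int y = 1; y < H - 1; ++y) {
        int off = (y + c) % 2;       /* x-offset of colour c in row y */
        for (int k = 0; k < W / 2; ++k) {
            int x = 2 * k + off;
            if (x < 1 || x >= W - 1) continue;      /* skip boundary */
            /* all four neighbours have the opposite colour; their
             * packed column is floor(x/2) resp. floor((x +- 1)/2) */
            float nsum = other[(y - 1) * (W / 2) + x / 2]
                       + other[(y + 1) * (W / 2) + x / 2]
                       + other[y * (W / 2) + (x - 1) / 2]
                       + other[y * (W / 2) + (x + 1) / 2];
            mine[y * (W / 2) + k] = 0.25f * (nsum + h2 * f[y * W + x]);
        }
    }
}

int main(void) {
    float *red   = calloc(H * (W / 2), sizeof(float));
    float *black = calloc(H * (W / 2), sizeof(float));
    float *f     = calloc(H * W, sizeof(float));
    f[(H / 2) * W + W / 2] = 1.0f;                  /* point source */
    for (int it = 0; it < 100; ++it) {              /* plain RB relaxation */
        half_sweep(red,   black, f, 0, 1.0f);       /* h = 1 */
        half_sweep(black, red,   f, 1, 1.0f);
    }
    free(red); free(black); free(f);
    return 0;
}
```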
9
Multigrid on GTX 295 with red-black splitting
10
[Figure: runtime of one V(2,2)-cycle in ms (0-18) over image size]
11
Memory Bandwidth
As a percentage of the maximum measured (rounded) streaming bandwidth of 100 GB/s (see the estimation sketch below)
[Figure: achieved percentage of streaming memory bandwidth (0-100%) over image size]
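How such a percentage is obtained (a sketch with placeholder numbers, not the measured data from the chart): count the bytes a sweep must move, divide by the measured runtime, and compare with the streaming bandwidth. The traffic model of 4 floats per cell is an assumption that depends on the actual kernel and caching.

```c
/* Estimate achieved memory bandwidth of a smoothing sweep as a fraction
 * of the measured streaming bandwidth. All inputs are hypothetical. */
#include <stdio.h>

int main(void) {
    double stream_bw = 100e9;          /* measured streaming bandwidth, B/s */
    long   cells     = 4096L * 4096L;  /* image size (placeholder)          */
    double runtime   = 4.0e-3;         /* measured sweep time, s (placeholder) */
    /* assumed traffic per cell: read own and other colour plus rhs,
     * write the result: 4 floats = 16 bytes */
    double bytes = (double)cells * 4 * sizeof(float);
    double bw    = bytes / runtime;
    printf("achieved: %.1f GB/s = %.0f%% of streaming bandwidth\n",
           bw / 1e9, 100.0 * bw / stream_bw);
    return 0;
}
```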
12
Using Clusters of GPUs with waLBerla
waLBerla: Parallel LBM Framework for CFD Applications
13
Moderately Parallel Computation: 1000 Bubbles
Simulation: 1000 bubbles, 510 × 510 × 530 = 1.4 × 10⁸ lattice cells, 70,000 time steps, 77 GB, 64 processes, 72 hours = 4,608 core hours
Visualization: 770 images, approx. 12,000 core hours for rendering
Best Paper Award, Stefan Donath (LSS Erlangen) at ParCFD, May 2009
14
Simulation of Segregation Processes
15
Segregation simulation of 12,013 objects with two different shapes at different time steps, simulated on 2,048 cores in a box. Density values of 0.8 kg/dm³ and 1.2 kg/dm³ are used for the objects in water with density 1 kg/dm³ and a gravitational field. Lighter particles rise to the top of the box, while heavier particles sink to the bottom.
Weak Scaling
[Figure: parallel efficiency (0.5 to 1.0) over number of cores (100 to 300,000); curves for 40×40×40 and 80×80×80 lattice cells per core]
16
Scaling from 64 to 294,912 cores
sparsely packed particles
150,994,944,000 lattice cells
83,804,982 rigid spherical objects
Largest simulation to date: 8 trillion (10¹²) variables per time step (LBM alone), 50 TByte
Jugene Blue Gene/P, Jülich Supercomputing Centre
waLBerla Software Architecture for GPU Usage
Patch architecture
Only LBM on GPU (no free surfaces, no FSI)
NEC Nehalem cluster: Xeon E5560, 2.8 GHz, 12 GB per node
2 GPUs per node (NVIDIA Tesla S1070)
30 nodes, up to 60 GPUs
17
GPU Performance Results and Comparison
How far away is "real-time CFD"?
25 GLups would compute 25 frames per second for an LBM grid with resolution 1000 × 1000 × 1000 (10⁹ cells, so 25 × 10⁹ cell updates per second yield 25 time steps per second)
18
Up to 500 MLup/s on a single GPU for the plain LBM kernel (SP)
250 MLup/s per GPU in the cluster
Compares to 75 MLup/s for a Nehalem node (8 cores)
A GPU node (2 GPUs) delivers performance like
6 Nehalem nodes (48 cores)
75 IBM Blue Gene/P nodes
30 GPU nodes (60 GPUs) are equivalent to
137 Nehalem nodes (1,096 cores)
1,275 Jugene/P nodes (5,100 cores)
19
Results for Multigrid within waLBerla
Multi-GPU: Weak Scaling within waLBerla
20
[Figure: runtime for a (2,2)-V-cycle in seconds (0 to 2.5) over the number of processing units (0 to 16); weak scaling with 256³ unknowns per processing unit; curves for the CPU and GPU versions in single and double precision]
Scalability for 3D multigrid in waLBerla looks good, but absolute performance is not yet satisfactory. Compare with stand-alone 2D results: 0.019 s for 4096² unknowns.
21
Conclusions and Outlook
SIAM CS&E 2011
22
CS&E Applications in disciplinary areas such as
physics, chemistry, biology, etc.multi-disciplinary and
emerging areasindustry
Modeling and Simulation
multi-physics and multi-scale problems
kinetic methods
meshless methods
molecular and particle-based methods
discrete and event-driven models
hybrid models
validation and verification
uncertainty quantification
Chairs: Padma Raghavan, Pennsylvania State University; Ulrich Rüde, Erlangen
Acknowledgements
Collaborators in Erlangen: WTM, LSE, LSTM, LGDV, RRZE, LME, Neurozentrum, Radiologie, Applied Mathematics, Theoretical Physics, etc.
Especially for foams: C. Körner (WTM)
International: Utah, Technion, Constanta, Ghent, Boulder, München, CAS, Zürich, Delhi, ...
Dissertation projects: N. Thürey, T. Pohl, S. Donath, S. Bogner (LBM, free surfaces, 2-phase flows)
M. Mohr, B. Bergen, U. Fabricius, H. Köstler, C. Freundl, T. Gradl, B. Gmeiner (Massively parallel PDE-solvers)
M. Kowarschik, J. Treibig, M. Stürmer, J. Habich (architecture aware algorithms)
K. Iglberger, T. Preclik, K. Pickel (rigid body dynamics)
J. Götz, C. Feichtinger (Massively parallel LBM software, suspensions)
C. Mihoubi, D. Bartuschat (Complex geometries, parallel LBM)
(Long-term) guests in 2009-2010: Dr. S. Ganguly, IIT Kharagpur (Humboldt) - Electroosmotic Flows
Prof. V. Buwa, IIT Delhi (Humboldt) - Gas-Fluid-Solid flows
Felipe Aristizabal, McGill Univ., Canada (LBM with Brownian Motion)
Prof. Popa, Constanta, Romania (DAAD) - Numerical Linear Algebra
Prof. N. Zakaria, Universiti Petronas, Malaysia
Prof. Hanke, Prof. Oppelstrup, KTH Stockholm (DAAD) - Mathematical Modelling
~25 Diplom/Master theses, ~30 Bachelor theses
Funding by KONWIHR, DFG, BMBF, EU, Elitenetzwerk Bayern
23
24
Thanks for your attention!
Questions?
Slides, reports, theses, and animations available for download at:
www10.informatik.uni-erlangen.de