Upload
phungthien
View
221
Download
2
Embed Size (px)
Citation preview
LBM BASED FLOW SIMULATION USING
GPU COMPUTING PROCESSOR
CETHILUMR5008
GPU COMPUTING PROCESSOR
Frédéric Kuznik, [email protected]
11LBE Workshop - 15/05/2008 - CEMAGREF Antony
Framework
Introduction
Hardware architecture
CUDA overviewCUDA overview
Implementation details
A simple case: the lid driven cavity problem
22LBE Workshop - 15/05/2008 - CEMAGREF Antony
Presentation of my activities
CETHIL: Thermal Sciences Center of Lyon
Current work: simulation of fluid flows with
heat transfer
CETHILUMR5008
33
Use/development of Thermal LBM1,2 LBM computation using GPU3
Objective: coupling efficiently the two approaches
1 A double population lattice Boltzman method with non-uniform mesh for the simulation of natural convection in a
square cavity, Int. J. Heat and Fluid Flow, vol. 28(5), pp. 862-870, 2007
2 Numerical prediction of natural convection occuring in building components: a double population lattice Boltzmann
method, Numerical Heat Transfer A, vol. 52(4), pp. 315-335, 2007
3 LBM based flow simulation using GPU computing processor, Computers & Mathematics with Applications, under
reviewLBE Workshop - 15/05/2008 - CEMAGREF Antony
Introduction
Due to the market (realtime and
high definition 3-D graphics), GPU
has evolved into highly parrallel,
multithreaded, multi-core
44
(from CUDA Programming Guide 06/07/2008)
GFLOPS=109 floating point operations per second
multithreaded, multi-core
processor.
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Introduction
The bandwidth of GPU has also
evolved for faster graphics
purpose.
55
(from CUDA Programming Guide 06/07/2008)
purpose.
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Previous works using GPU & LBM
GPU Cluster for high performance computing (Fan et al.
2004): 32 nodes cluster of GeForce 5800 ultra
LB-Stream Computing or over 1 Billion Lattice Updates per
second on a single PC (J. Tölke ICMMES 2007): GeForce 8800
ultra
66
ultra
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Hardware
Commercial graphics card (can beincluded in a desktop PC)
240 processors running at1.35GHz 1.35GHz
1 Go DDR3 with 141.7GB/s bandwidth
Theoretical peak at 1000GFLOPS
Price around 500€
77
nVIDIA GeForce GTX 280
LBE Workshop - 15/05/2008 - CEMAGREF Antony
GPU Hardware Architecture
◊ 15 texture processors (TPC)
◊ 1 TPC is composed of 2
streaming multiprocessors (SM)
◊ 1 SM is composed of 8
88
◊ 1 SM is composed of 8
streaming processors (SP) and 2
special functions units
◊ Each SM has 16kB of shared
memory
LBE Workshop - 15/05/2008 - CEMAGREF Antony
GPU Programming InterfaceNVIDIA® CUDATM C language programming environment
(version 2.0)
Definitions:
Device=GPU
Host=CPU
Kernel=function that is called from the host and runs on
99
Kernel=function that is called from the host and runs onthe device
A cuda kernel is exucuted by an array of threads
LBE Workshop - 15/05/2008 - CEMAGREF Antony
GPU Programming Interface
A kernel is executed by a gridof thread block
1 thread block is executed by 1multiprocessor
A thread block contains a
1010
A thread block contains amaximum of 1024 threads
Threads are executed byprocessors within a singlemultiprocessor
=> Threads from differentblocks cannot cooperate
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Kernel memory access
Registers
Shared memory: on-chip (fast),
small (16kB)
Global memory: off-chip, large
1111
Global memory: off-chip, large
The host can read or write
global memory only.
LBE Workshop - 15/05/2008 - CEMAGREF Antony
LBM ALGORITHM
Step 1: Collision (Local – 70% computational time)
Step 2: Propagation (Neighbors – 28% computational time)
1212
Step 2: Propagation (Neighbors – 28% computational time)
+ Boundary conditions
(from BERNSDORF J., How to make my LB-code faster – software planning,
implementation and performance tuning, ICMMES’08, Netherlands)
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Implementation details
Thread block
1313
Grid of blocksLattice grid
1- Decomposition of the fluid domain into a lattice grid
2- Indexing the lattice nodes using thread ID
3- Decomposition of the CUDA domain into thread blocks
4- Execution of the kernel by the thread blocks
Translation of lattice grid
into CUDA topology
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Pseudo-code
Combine collision and propagation steps:
for each thread block
for each thread
load fi in shared memory
1414
i
compute collision step
do the propagation step
end
end
Exchange informations across boundaries
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Application: 2D lid driven cavity
Problem description u=U0, v=0
u=0, v=0u=0, v=0
D2Q9 MRT model of d’Humières (1992)
1515
y
x
u=0, v=0u=0, v=0
u=0, v=0
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Results Re=1000
Reference data from Erturk et al. 2005:
0.1
0.2
0.3
0.7
0.8
0.9
1
1616
xv
velo
city
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-0.5
-0.4
-0.3
-0.2
-0.1
0
0.1
data from Erturk et al.LBM
u velocity
y
-0.25 0 0.25 0.5 0.75 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
data from Erturk et al.LBM
Vertical velocity profil for x=0.5 Horizontal velocity profil for y=0.5
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Performances
Mesh
grid size
Number of Threads
16 32 64 128 256
256² 71.0 71.0 71.1 71.1 70.9
512² 284.2 282.7 284.3 284.3 283.7
1717
512² 284.2 282.7 284.3 284.3 283.7
1024² 346.4 690.2 953.0 997.5 1007.9
2048² 240.0 583.0 803.3 961.3 1001.5
3072² 289.3 572.4 845.2 941.2 974.2
Performances in MLUPS
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Performances: CPU comparisons
80
100
120
140
160
180
200
256²
512²
1024²
GP
U/C
PU
1818
0
20
40
60
80
0 128 256
1024²
2048²
3072²
Number of Threads per block
GP
U/C
PU
CPU = Pentium IV, 3.0 GHz
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Conclusions
The GPU allows a gain of about 180 for the calculation time
Multiple GPU server exists: NVIDIA TESLA S1070 composed of
1919
4 GTX 280 (theoretical gain 720)
The evolution of GPU is not finished !
But using double precision floating point, the gain falls to 20 !
LBE Workshop - 15/05/2008 - CEMAGREF Antony
Outlooks
A workgroup concerning LBM and GPU
1 master student with Prof. Bernard TOURANCHEAU (ENS Lyon –
INRIA) and a PhD student in 2009
1 master student with Prof. Eric Favier (ENISE – DIPI)
2020
A workgroup concerning Thermal LBM …?
Everybody is welcome to join the workgroup !
x
y
0 0.5 10
0.5
1
x
y
0 0.5 10
0.5
1
x
y
0 0.5 10
0.5
1
Ra=103 Ra=104
Ra=105 Ra=106x
y
0 0.5 10
0.5
1
LBE Workshop - 15/05/2008 - CEMAGREF Antony
THANK YOU FOR YOUR ATTENTION
2121
THANK YOU FOR YOUR ATTENTION
LBE Workshop - 15/05/2008 - CEMAGREF Antony