32
Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest and Dr. P.A. Ramachandran and Dr. M.P. Dudukovic Optimization, HPC, and Pre- and Post-Processing I Session. 6th OpenFOAM Workshop Penn State University. June 15th 2011 Chemical Reaction Engineering Laboratory (CREL) Department of Energy, Environmental, and Chemical Engineering. Washington University, St. Louis, MO.

Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA

Daniel P. Combest and Dr. P.A. Ramachandranand Dr. M.P. Dudukovic

Optimization, HPC, and Pre- and Post-Processing I Session.6th OpenFOAM Workshop Penn State University. June 15th 2011

Chemical Reaction Engineering Laboratory (CREL)Department of Energy, Environmental, and Chemical

Engineering. Washington University, St. Louis, MO.

Page 2: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Objectives

2

Page 3: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

3

Introduction to The GPU and CUDA

What exactly is CUDA?Defined as: Compute Unified Device Architecture. I.e. a parallel computing architecture used in graphics processing units (GPU), developed by Nvidia.

Page 4: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

4

Introduction to The GPU and CUDA

What exactly is CUDA?Defined as: Compute Unified Device Architecture. I.e. a parallel computing architecture used in graphics processing units (GPU), developed by Nvidia.

What is CUDA C/C++?A language that provides an interface so that parallel algorithms can be run on CUDA enabled Nvidia GPUs

Page 5: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

5

Introduction to The GPU and CUDAGPU v.s CPU Calculations

CPU-GPU Comparison of Floating-point operations per second [1]

Page 6: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

6

Introduction to The GPU and CUDA

Why are we interested?Larger problems require more computing resources (LES, coupled physics)

GPUs are fast when used properly

They are relatively cheap

Page 7: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

7

Introduction to The GPU and CUDA

Why are we interested?Larger problems require more computing resources (LES, coupled physics)

GPUs are fast when used properly

They are relatively cheap

Where can GPUs be applied?Where parallel algorithms live

● Linear algebra i.e. sparse matrix math

Page 8: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

8

Introduction to The GPU and CUDA

Why are we interested?Larger problems require more computing resources (LES, coupled physics)

GPUs are fast when used properly

They are relatively cheap

Where can GPUs be applied?Where parallel algorithms live

● Linear algebra i.e. sparse matrix math

Why don't we compile everything to work on the GPU? Only programs written in CUDA language can be parallelized on GPU. So we cannot just recompile OF.

Page 9: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

9

Integrating CUSP into OpenFOAM

“Cusp is a library for sparse linear algebra and graph computations on CUDA. Cusp provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems.”[2]

Provided Template Solvers:• (Bi-) Conjugate Gradient (-Stabilized)• GMRES

Matrix Storage • CSR, COO, HYB, DIA

Provided Preconditioners• Jacobi (diagonal) preconditioners• Sparse Approximate inverse preconditioner• Smoothed-Aggregation Algebraic Multigrid preconditioner

cusp-Library http://code.google.com/p/cusp-library/

Page 10: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

10

Integrating CUSP into OpenFOAM

“Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust provides a flexible high-levelinterface for GPU programming that greatly enhances developer productivity. “ [3]

http://code.google.com/p/thrust/

Page 11: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

11

Integrating CUSP into OpenFOAM

AX b

=

AX b

=

OpenFOAM solve(…);

Cusp-based solver on GPU

Thrust Methods

cusp Methods

Page 12: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

12

Integrating CUSP into OpenFOAM

AX b

=

lduMatrix is converted to COO Using thrust::copy() in C++

AX b

=

OpenFOAM solve(…);

Cusp-based solver on GPU

Thrust Methods

cusp Methods

Page 13: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

13

Integrating CUSP into OpenFOAM

AX b

=

lduMatrix is converted to COO Using thrust::copy() in C++

COO is transferred to GPU In CUDA Code

AX b

=

OpenFOAM solve(…);

Cusp-based solver on GPU

Thrust Methods

cusp Methods

Page 14: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

14

Integrating CUSP into OpenFOAM

AX b

=

lduMatrix is converted to COO Using thrust::copy() in C++

COO is transferred to GPU In CUDA Code

COO is converted to other formats on GPUAnd passed to CUSP-based solver with convergence criteriaA

X b

=

OpenFOAM solve(…);

Cusp-based solver on GPU

Thrust Methods

cusp Methods

Page 15: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

15

Integrating CUSP into OpenFOAM

AX b

=

lduMatrix is converted to COO Using thrust::copy() in C++

COO is transferred to GPU In CUDA Code

COO is converted to other formats on GPUAnd passed to CUSP-based solver with convergence criteriaA

X b

= Residual calculated using OF normalized residual method

OpenFOAM solve(…);

Cusp-based solver on GPU

Thrust Methods

cusp Methods

Page 16: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

16

Integrating CUSP into OpenFOAM

AX b

=

AX b

=

OpenFOAM solve(…);

Pass X vector and solver performance data back to OpenFOAM using thrust-methods

Thrust Methods

Cusp-based solver on GPU

Page 17: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

17

Preliminary ResultsA test Problem.

02 =∇ T

2D Heat Equation

Vary N from 10-2000 where N2 = nCells

Page 18: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Preliminary ResultsSolver Settings

All CG solvers

Tolerance = 1e-10;MaxIter 1000;

solver GAMG; tolerance 1e-10; smoother GaussSeidel; nPreSweeps 0; nPostSweeps 2; cacheAgglomeration true; nCellsInCoarsestLevel sqrt(nCells); agglomerator faceAreaPair; mergeLevels 1;

Page 19: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Preliminary ResultsSetup

CUDA version 4.0CUSP version 0.2Thrust version 1.4Ubuntu 10.04

CPU: Dual Intel Xeon Quad Core E5430 2.66GHzMotherboard: Tyan S5396RAM: 24 gig

GPU: Tesla C2050 3GB DDR5515 Gflops peak double precision1.03 Tflops Peak single precision14 MP * 32 cores/MP = 448 coresHost-device memory bw = 1566 MB/sec (Motherboard specific)

Page 20: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

20

Preliminary ResultsSolve Time

0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 45000000

200

400

600

800

1000

1200

1400

Solve() Time Comparison

cusplink_SmAPCGGAMGcusplink_DPCGcusplink_CGDPCG-parallel4DPCG-parallel6-s231DPCGCG

nCells

Tim

e [s

eco

nd

s]

Page 21: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

21

Preliminary ResultsSolution Speedup

0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 45000000

2

4

6

8

10

12

14

16

18Speedup Comparison

DPCGDPCG-parallel4DPCG-parallel6-s231DPCG-parallel6-s161cusplink_DPCGcusplink_CG

nCells

Speedup

Speedup = Ts/Tp = TOFCG

/Tother

Page 22: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

22

Preliminary ResultsSolution Speedup

0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 45000000

20

40

60

80

100

120

140

Speedup Comparison

DPCGDPCG-parallel4DPCG-parallel6-s231DPCG-parallel6-s161cusplink_CGcusplink_DPCGGAMGGAMG6cusplink_SmAPCG

nCells

Speedup

Speedup = Ts/Tp = TOFCG

/Tother

Page 23: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

23

Preliminary ResultsSolution Speedup

0 200000 400000 600000 800000 1000000 12000000

10

20

30

40

50

60

Speedup Comparison

DPCGDPCG-parallel4DPCG-parallel6-s231DPCG-parallel6-s161cusplink_CGcusplink_DPCGGAMG6GAMGcusplink_SmAPCG

nCells

Speedup

Speedup = Ts/Tp = TOFCG

/Tother

Page 24: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Preliminary Results

24

Page 25: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Important Considerations

25

Page 26: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Next Steps

26

Page 27: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Take Home Messages● The GPU only solves the Ax=b system● We have double precision● GPUs have been integrated into OpenFOAM using Thrust and CUSP● As cusp and thrust improve, nothing needs to be changed in this code, only to update cusp and thrust.● They have been shown to be faster in the cases provided, because it is mostly solving Ax = b.● Residuals are calculated the same as in OpenFOAM● Multi-GPU still needs attention.● The results show that memory bandwidth still is an issue with this particular setup and results could be faster with other setup.

Page 28: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Acknowledgements

Funding and SupportNvidia Professor Partnership Program

Chemical Reaction Engineering Laboratory (CREL) MRE Fund (http://crelonweb.eec.wustl.edu/)

OpenFOAM Developers Community

AdvisorsDr. Ramachandran

Dr. Dudukovic

28

Page 29: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Sources1. Nvidia CUDA Programming Guide, Version 4.0, 2011. Nvidia

Corporation. 2. Nathan Bell and Michael Garland, Cusp: Generic Parallel

Algorithms for Sparse Matrix and Graph Computations, 2010, http://cusp-library.googlecode.com,Version 0.1.0

3.  Jared Hoberock and Nathan Bell, Thrust: A Parallel Template Library, 2010, http://www.meganewtons.com/,Version 1.3.0

29

Page 30: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Questions?

30

Contact Info:Dan [email protected]

Thanks for your attention!

Page 31: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

Preliminary ResultsSolution Speedup

0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 45000000

20

40

60

80

100

120

140

Speedup

cusplink_CGcusplink_DPCGcusplink_SmAPCGDPCGDPCG-parallel4DPCG-parallel6-s231DPCG-parallel6-s161GAMGGAMG6

nCells

Sp

ee

du

p

Speedup = Ts/Tp = TOFCG

/Tother

Page 32: Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA · 2011-06-15 · Implementing Fast Parallel Linear System Solvers In OpenFOAM based on CUDA Daniel P. Combest

For matrix A x = b,

residual is defined as

res = b - Ax

We then apply residual scaling with the following normalisation factor procedure:

Type xRef = gAverage(x);

wA = A x; pA = A xRef;

NormFactor = gSum(cmptMag(wA - pA) + cmptMag(source - pA)) + matrix.small_;

and the scaled residual is:

residual = gSum(cmptMag(source - wA))/normFactor;

I will save you from complications with vectors and tensors in my block solver. :-)

Enjoy,

Hrv

Source: http://www.cfd-online.com/Forums/openfoam-solving/57903-residuals-convergence-segregated-solvers.html

Residual Scaling