
Stan Posey

NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com

Agenda: GPU Progress and Directions for CAE

- Introduction of GPUs in HPC
- Progress of CFD on GPUs
- Review of OpenFOAM on GPUs
- Discussion on WRF Developments

CFD Algorithm Suitability for GPUs

CFD speed-ups have been demonstrated across a range of time schemes and spatial discretizations. The speed-up factors below are based on comparisons with an 8-core Xeon Sandy Bridge CPU, and the strategy noted for each quadrant is the recommended GPU porting approach.

- Explicit (usually compressible), structured grid: stencil operations, uniform memory references; ~15x; strategy: directives
- Explicit (usually compressible), unstructured grid: stencil operations, renumbering schemes; ~5x; strategy: directives
- Implicit (usually incompressible), structured grid: linear algebra solver, uniform memory references; ~5x; strategy: libraries
- Implicit (usually incompressible), unstructured grid: linear algebra solver, renumbering schemes (where most ISV codes sit); ~2-3x; strategy: libraries

Turbostream: CFD for Turbomachinery

Sample Turbostream GPU simulations: a typical routine simulation and a large-scale simulation (~19x speedup).

Source: http://www.turbostream-cfd.com/

SD++ and Jameson Aerodynamics Research

Stanford University Aerospace Computing Lab – Prof. Antony Jameson

GPU Application: Jameson-developed CFD software SD++ for high-order-method aerodynamic simulations.

GPU Benefit: 16 x Tesla M2070 complete the run in 15 hours vs. 202 hours (more than one week) on 16 x Xeon X5670, giving fast turnaround of complex LES simulations that would otherwise be impractical with CPUs only.

Benchmark case: transitional flow over an SD7003 airfoil, 21M DOF, Ma = 0.2, Re = 60K, AoA = 4 deg, 4th-order scheme, 400K RK iterations.

Fighter Jet Engine Noise Reduction on GPUs

U.S. DoD Naval Research Laboratory – Laboratory for Computational Physics and Fluid Dynamics

GPU Application: NRL-developed CFD software JENRE for simulation of jet engine acoustics.

GPU Benefit: Tesla M2070 gives 3x vs. a hex-core Intel (Westmere) CPU, enabling more detailed mesh simulations over longer durations of jet engine transient conditions.

Commercial Aircraft Wing Design on GPUs

COMAC (Commercial Aircraft Corporation of China) and SJTU (Shanghai Jiao Tong University)

GPU Application: SJTU-developed CFD software NUS3D for aerodynamic simulations of wing shapes.

GPU Benefit: Tesla C2070 gives 20x – 37x vs. a single-core Intel Core i7 CPU, allowing faster simulation of more wing design candidates than wind-tunnel testing; work is expanding to multi-GPU and full-aircraft configurations.

Images: COMAC wing candidate; ONERA M6 wing CFD simulation.

GPU Development Status for CFD

- Particle-based CFD (LBM, SPH, etc.) is generally a better fit than continuum methods.
- Fully deployed explicit solvers generally outperform implicit solvers.
- Explicit i,j,k stencil operations are a good fit for massively parallel threads.
- Most CFD is distributed-parallel across CPU multicores/nodes; this fits the GPU parallel model well and preserves the costly MPI investment. The focus is on hybrid parallel schemes that utilize all CPU cores plus the GPU.
- The GPU development strategy depends on the profile starting point (see the sketch after this list):
  - Legacy explicit scheme: compiler directives such as OpenACC
  - New explicit scheme: CUDA and stencil libraries
  - Legacy implicit scheme: CUDA and libraries for the solver; OpenACC for the rest
  - New implicit scheme: CUDA and libraries for the solver, matrix assembly, etc.
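As a concrete illustration of why explicit i,j,k stencil operations map so well to GPU threads, here is a minimal CUDA sketch of a structured-grid update (hypothetical field names, not from any of the codes above; a legacy code would more likely expose the same triple loop to the GPU through OpenACC directives rather than hand-written CUDA):

```cuda
// Minimal sketch of an explicit i,j,k stencil update on a structured grid.
// One GPU thread per interior cell; indexing is pure arithmetic with
// uniform memory references, which is why this pattern accelerates well.
#include <cuda_runtime.h>

__global__ void stencil_update(const float* u, float* u_new,
                               int nx, int ny, int nz, float c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1 || k < 1 || k >= nz - 1)
        return;  // leave boundary cells untouched

    int idx = (k * ny + j) * nx + i;            // flattened (i, j, k) index
    u_new[idx] = u[idx] + c * (u[idx - 1]       + u[idx + 1]
                             + u[idx - nx]      + u[idx + nx]
                             + u[idx - nx * ny] + u[idx + nx * ny]
                             - 6.0f * u[idx]);  // 7-point stencil
}

// Typical launch: one thread per cell of the nx x ny x nz grid.
// dim3 block(8, 8, 8);
// dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);
// stencil_update<<<grid, block>>>(d_u, d_u_new, nx, ny, nz, 0.1f);
```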

FluiDyna and Aerodynamic-Aware Surface Design

- RTT DeltaGen for photo-realistic 3D visualization
- Integrates FluiDyna LBultra CFD functionality as a plug-in
- The designer only needs to specify resolution and velocity
- Simulation data is displayed live with GPU performance

Courtesy of FluiDyna and the LBultra CFD software: www.fluidyna.de

Prometech and Particleworks for Multiphase Flow

- MPS-based method developed at the University of Tokyo (Prof. Koshizuka)
- Example: oil flow in an HB gearbox
- Benchmark: Particleworks 3.0 on GPU vs. a 4-core Core i7

Courtesy of Prometech Software and the Particleworks CFD software: http://www.prometech.co.jp

Availability of Commercial DSFD-Based Software

ISV Software | Application | Method | GPU Status
PowerFLOW | Aerodynamics | LBM | Evaluation
LBultra | Aerodynamics | LBM | Available v2.0
XFlow | Aerodynamics | LBM | Evaluation
Project Falcon | Aerodynamics | LBM | Evaluation
Particleworks | Multiphase/FS | MPS (~SPH) | Available v3.1
BARRACUDA | Multiphase/FS | MP-PIC | In development
EDEM | Discrete phase | DEM | In development
ANSYS Fluent – DDPM | Multiphase/FS | DEM | In development
STAR-CCM+ | Multiphase/FS | DEM | Evaluation
AFEA | High impact | SPH | Available v2.0
ESI | High impact | SPH, ALE | In development
LSTC | High impact | SPH, ALE | Evaluation
Altair | High impact | SPH, ALE | Evaluation

Grid-Based Commercial CFD and GPU Progress

ISV | Primary Applications (green in the original slide marked applications CUDA-ready during 2013)
ANSYS | ANSYS Mechanical; ANSYS Fluent; ANSYS HFSS
DS SIMULIA | Abaqus/Standard; Abaqus/Explicit; Abaqus/CFD
MSC Software | MSC Nastran; Marc; Adams
Altair | RADIOSS; AcuSolve
CD-adapco | STAR-CD; STAR-CCM+
Autodesk | AS Mechanical; Moldflow; AS CFD
ESI Group | PAM-CRASH imp; CFD-ACE+
Siemens | NX Nastran
LSTC | LS-DYNA; LS-DYNA CFD
Mentor | FloEFD; FloTherm
Metacomp | CFD++

Additional Commercial GPU Developments

ISV | Domain | Location | Primary Applications
FluiDyna | CFD | Germany | Culises for OpenFOAM; LBultra
Vratis | CFD | Poland | Speed-IT for OpenFOAM; ARAEL
Prometech | CFD | Japan | Particleworks
Turbostream | CFD | England, UK | Turbostream
IMPETUS | Explicit FEA | Sweden | AFEA
AVL | CFD | Austria | FIRE
CoreTech | CFD (molding) | Taiwan | Moldex3D
Intes | Implicit FEA | Germany | PERMAS
Next Limit | CFD | Spain | XFlow
CPFD | CFD | USA | BARRACUDA
Flow Science | CFD | USA | FLOW-3D

Status Summary of ISVs and GPU Computing

- Every primary ISV has products available on GPUs or undergoing evaluation.
- The 4 largest ISVs all have products based on GPUs, some at the 3rd generation: #1 ANSYS, #2 DS SIMULIA, #3 MSC Software, and #4 Altair.
- 4 of the top 5 ISV applications are available on GPUs today: ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, MSC Nastran (and LS-DYNA, implicit only).
- Several new ISVs were founded with GPUs as a primary competitive strategy: Prometech, FluiDyna, Vratis, IMPETUS, Turbostream.
- Open-source CFD OpenFOAM is available on GPUs today with many options. Commercial options: FluiDyna, Vratis; open-source options: Cufflink, Symscape ofgpu, RAS, etc.

CFD Algorithm Characterization: Discretization

Spatial discretization categories: structured-grid finite volume (FV), unstructured finite volume (FV), and unstructured finite element (FE).

CFD Algorithm Characterization: Time Integration

The same discretization categories (structured FV, unstructured FV, unstructured FE) are split by time integration: explicit schemes (usually compressible) and implicit schemes (usually incompressible).

CFD Algorithm Characterization: Time Integration (continued)

For the explicit (usually compressible) schemes across structured FV, unstructured FV, and unstructured FE: the numerical operations are i,j,k stencil updates with no linear "solver". Profiles are typically flat, so the GPU strategy is compiler directives (OpenACC).

GPU Acceleration Relative to a Single 8-Core CPU: Explicit Codes

- Structured-grid FV, explicit (usually compressible): Turbostream, SJTU RANS, at ~15x
- Unstructured FV/FE, explicit: SD++ (Stanford, Prof. Jameson), FEFLO (Prof. Lohner), Veloxi, at ~5x

GPU Acceleration Relative to a Single 8-Core CPU: Implicit Solver Characteristics

For the implicit (usually incompressible) codes, the dominant work is sparse-matrix linear algebra in iterative solvers. This hot spot is roughly 50% of runtime but a small percentage of the lines of code, so the GPU strategy is CUDA and libraries.

GPU Acceleration Relative to a Single 8-Core CPU: Implicit Codes

- Unstructured FV, implicit (usually incompressible): ANSYS Fluent, Culises for OpenFOAM, SpeedIT for OpenFOAM, CFD-ACE+, FIRE, at ~2x
- Unstructured FE, implicit: Moldflow, AcuSolve, Moldex3D, at ~2x

(Explicit entries as on the previous chart: Turbostream and SJTU RANS at ~15x; SD++, FEFLO, Veloxi at ~5x.)

Commercial CFD Focus on Sparse Solvers for GPU

A typical implicit CFD application is split between CPU and GPU:

- CPU: read input, matrix set-up, global solution, write output
- GPU: implicit sparse matrix operations, which account for 50% - 65% of profile time but a small percentage of the lines of code

GPU approaches for the sparse operations: hand-written CUDA, GPU libraries such as cuBLAS, or OpenACC directives. (OpenACC is also being investigated for moving more tasks onto the GPU.) A sketch of the core sparse-matrix kernel follows.
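To make the hot spot concrete, here is a minimal CUDA sketch of the kind of sparse matrix-vector product (CSR format) that sits at the center of these implicit solvers. In practice ISVs call tuned library routines (e.g., cuSPARSE) rather than a hand-written kernel like this one; the array names are hypothetical.

```cuda
// Minimal sketch: CSR sparse matrix-vector product y = A * x, one thread
// per matrix row. Real solvers use tuned library kernels, but the data
// access pattern (indirect, row-by-row) is the same.
#include <cuda_runtime.h>

__global__ void csr_spmv(int n_rows,
                         const int* row_ptr,    // size n_rows + 1
                         const int* col_idx,    // size nnz
                         const double* values,  // size nnz
                         const double* x,
                         double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double sum = 0.0;
    for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj)
        sum += values[jj] * x[col_idx[jj]];   // indirect access via col_idx
    y[row] = sum;
}

// Launch example:
// csr_spmv<<<(n_rows + 255) / 256, 256>>>(n_rows, d_row_ptr, d_col_idx,
//                                         d_vals, d_x, d_y);
```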

NVIDIA-Developed Library of Linear Solvers

- Library of nested solvers for large sparse Ax = b. Nesting creates a solver hierarchy, e.g. BiCGStab preconditioned by AMG, with Jacobi or MC-DILU as smoothers.
- Example solvers:
  - Jacobi: simple local (neighbor) operations, no/little setup
  - BiCGStab: local and global operations, no setup
  - MC-DILU: graph coloring and factorization at setup
  - AMG: multi-level scheme, with graph coarsening and matrix-matrix products at setup on each level
- Goal: accelerate state-of-the-art multi-level linear solvers in targeted application domains. Primary targets are CFD and reservoir simulation; other domains will follow.
- Focus on difficult-to-parallelize algorithms, parallelizing both the setup and solve phases. Difficult problems: parallel graph algorithms, sparse matrix manipulation, parallel smoothers. No groups have successfully mapped production-quality algorithms to fine-grained parallel architectures.
- Ensure the NVIDIA architecture team understands these applications and is influenced by them.
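As an illustration of the simplest level of such a hierarchy, here is a minimal CUDA sketch of one Jacobi relaxation sweep on a CSR matrix. In a nested scheme like the one described above, this kind of kernel would serve as the smoother inside AMG or as a simple preconditioner for BiCGStab; the array names are hypothetical and not the library's actual API.

```cuda
// Minimal sketch: one Jacobi sweep x_new = x + D^{-1} (b - A x) on a CSR
// matrix, one thread per row. Only local neighbor data is needed, which is
// why Jacobi requires essentially no setup.
#include <cuda_runtime.h>

__global__ void jacobi_sweep(int n_rows,
                             const int* row_ptr, const int* col_idx,
                             const double* values,
                             const double* b, const double* x,
                             double* x_new)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double diag = 1.0;   // fallback if the row has no stored diagonal
    double r = b[row];   // residual component b_i - sum_j a_ij x_j
    for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj) {
        int col = col_idx[jj];
        if (col == row) diag = values[jj];
        r -= values[jj] * x[col];
    }
    x_new[row] = x[row] + r / diag;
}
```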

ISV Progress with NVIDIA CFD Solver Library

Committed:
- ANSYS – ANSYS Fluent and ANSYS CFD: #1 in CFD
- FluiDyna – Culises library for use in OpenFOAM: OpenFOAM is #2 in CFD for leveraged hardware

Evaluation:
- Autodesk – AS Moldflow: the leader in plastic injection molding simulation
- Autodesk – AS CFD: important to the design engineering market and being hosted on the Autodesk cloud

Discussion:
- CD-adapco – STAR-CCM+: the #2 CFD code for software revenue, either #2 or #3 for leveraged hardware
- ESI – CFD-ACE+: important CFD code in the semiconductor/electronics industry, along with others
- Cradle – SC/Tetra: #3 CFD in Japan (behind ANSYS Fluent and STAR-CCM+) and the primary CFD code at Toyota

Targets:
- Altair – AcuSolve: GMRES
- Metacomp – CFD++: AMG
- Mentor – FloEFD: AMG
- SIMULIA – Abaqus/CFD: uses ML from PETSc
- LSTC – LS-DYNA CFD: AMG
- AVL – FIRE: AMG
- Convergent Technologies – Converge CFD: GMRES

ANSYS and NVIDIA Technical Collaboration

Release | ANSYS Mechanical | ANSYS Fluent | ANSYS EM
13.0 Dec 2010 | SMP, single GPU, sparse and PCG/JCG solvers | (none) | ANSYS Nexxim
14.0 Dec 2011 | + Distributed ANSYS; + multi-node support | Radiation heat transfer (beta) | ANSYS Nexxim
14.5 Nov 2012 | + Multi-GPU support; + hybrid PCG; + Kepler GPU support | + Radiation HT; + GPU AMG solver (beta), single GPU | ANSYS Nexxim
15.0 Q4-2013 | + CUDA 5 Kepler tuning | + Multi-GPU AMG solver; + CUDA 5 Kepler tuning | ANSYS Nexxim; ANSYS HFSS (transient)

ANSYS Fluent 14.5 and Radiation HT on GPU

Radiation HT applications:
- Underhood cooling
- Cabin comfort HVAC
- Furnace simulations
- Solar loads on buildings
- Combustor in a turbine
- Electronics passive cooling

VIEWFAC utility: runs on CPUs, GPUs, or both, with ~2x speedup.
RAY TRACING utility: uses the OptiX library from NVIDIA, with up to ~15x speedup (GPU only).

ANSYS Fluent CPU Job Profile for Coupled PBNS

Runtime is spent in the non-linear iteration loop:

1. Assemble the linear system of equations (~35% of runtime)
2. Solve the linear system of equations Ax = b (~65% of runtime): accelerate this first
3. Check convergence; if not converged, repeat the loop, otherwise stop

A rough speedup bound based on this profile follows.
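An Amdahl-style bound (my own illustration, not stated on the slide) makes the profile numbers concrete, assuming only the solve phase, a fraction f of runtime, is accelerated by a factor S:

```latex
% Overall speedup when only a fraction f of runtime is accelerated by a factor S:
S_{\mathrm{overall}} = \frac{1}{(1 - f) + f/S}
% With f = 0.65 (the solve) and S = 5:  1 / (0.35 + 0.13) \approx 2.1
% Even as S -> \infty the ceiling is 1 / 0.35 \approx 2.9, which is why matrix
% assembly becomes the natural next target for the GPU.
```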

ANSYS Fluent GPU-Based AMG Solver from NVIDIA

ANSYS Fluent 14.5 performance, results by NVIDIA, Nov 2012. Chart: ANSYS Fluent AMG solver time per iteration (seconds, lower is better) for airfoil (hex, 784K cells) and aircraft (hex, 1798K cells) models, comparing a Tesla K20X against a Core i7-3930K with 6 cores used (2 x Core i7-3930K in the system). Times are for the solver only; the GPU is 2.4x faster on both cases.

Solver settings:
- CPU Fluent solver: F-cycle, agg8, DILU, 0 pre-sweeps, 3 post-sweeps
- GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0 pre-sweeps, 3 post-sweeps

Comparison of AMG Cycles on CPU and GPU

2D convection case: the F-cycle is best for both CPU and GPU (chart compares CPU-F and GPU-F cycle times; lower is better).

GPUs and Distributed Cluster Computing

Diagram: the geometry is decomposed into partitions (1-4) that are placed on independent cluster nodes (N1-N4) for CPU distributed parallel processing; the nodes run distributed-parallel using MPI and combine into the global solution.

GPUs and Distributed Cluster Computing (with GPUs)

The same decomposition then executes on CPU + GPU: each node's partition is accelerated by a GPU (G1-G4), with the GPUs used shared-memory parallel (OpenMP) underneath the distributed MPI parallelism. A hedged sketch of this hybrid layout follows.
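As a minimal sketch of the hybrid layout above (an assumed structure, not taken from any particular ISV code): each MPI rank owns one partition, binds to a local GPU, and keeps the existing MPI communication for the global solution.

```cuda
// Minimal sketch of the hybrid MPI + GPU layout: one MPI rank per partition,
// each rank bound to a local GPU. The existing MPI structure is preserved;
// only the per-partition compute moves to the device.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, n_ranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

    // Bind this rank (partition) to one of the GPUs on its node.
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    if (n_devices > 0) cudaSetDevice(rank % n_devices);

    // ... load this rank's partition, copy it to the GPU, and run the
    //     per-partition solver kernels on the device ...

    // Halo / global-solution exchange stays in MPI: copy boundary data back
    // to the host and use MPI_Sendrecv / MPI_Allreduce exactly as before.

    MPI_Finalize();
    return 0;
}
```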

ANSYS Fluent Preview for 2 x CPU + 2 x Tesla K20X

ANSYS Fluent 15.0 preview performance, results by NVIDIA, Feb 2013. Chart: solver time (lower is better) for helix (tet, 1173K cells) and airfoil (hex, 784K cells) models, comparing 2 x K20X (with only 2 CPU cores used) against 2 x Xeon E5-2680 Sandy Bridge CPUs (16 cores total). Times are for the solver only; speedups are 1.7x (helix) and 2.1x (airfoil).

Solver settings:
- CPU Fluent solver: F-cycle, agg8, DILU, 0 pre-sweeps, 3 post-sweeps
- GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0 pre-sweeps, 3 post-sweeps

ANSYS Fluent Scaling Results for 4 x Tesla K20X

ANSYS Fluent 15.0 preview performance, results by NVIDIA, Mar 2013. Chart: solver speedup (higher is better) from 1 to 4 K20X GPUs, against perfect scaling, for three cases: helix (tet, 1.2M cells), airfoil (hex, 0.78M cells), and sedan (mixed, 3.6M cells; this case starts at 2 GPUs). Results are for the solver only.

GPU solver settings: V-cycle, agg8/2, MC-DILU, 0 pre-sweeps, 3 post-sweeps.
Hardware setup: 2 server nodes, 2 GPUs per node, InfiniBand network.

ANSYS Fluent 15.0 Multi-GPU Demonstration

Multi-GPU acceleration of a 16-core ANSYS Fluent simulation of external aerodynamics: a 16-core server node (two 8-core Xeon E5-2667 CPUs) plus Tesla K20X GPUs (G1-G4) delivers a 2.9x solver speedup over the CPU-only configuration.

Summary: Opportunity for Advanced CFD

Problem statement:
- CFD users demand increased levels of model resolution for improved simulation accuracy.
- CFD use today is 80% steady-state RANS, largely as a short-cut to faster turnaround.
- Fluid flow is inherently unsteady and in need of better turbulence treatment.
- CPU-based HPC limits advanced CFD.

Opportunity:
- CFD ISVs have developed URANS, DES, and LES capabilities, but these see very limited use.
- CPU-based turnaround times are impractical for many product development workflows.
- Large Eddy Simulation (LES) is of most interest and has a high degree of arithmetic intensity.
- GPU computing can offer a practical solution for LES that does not exist today with CPUs.

Conclusions for CAE on GPUs

Opportunities exist for GPUs to provide significant performance acceleration for solver-intensive large jobs:
- Improved product quality
- Shorter product engineering cycles (faster time-to-market)
- Better total cost of ownership (TCO)
- Reduced energy consumption in the CAE process

Simulations recently considered intractable are now possible:
- Large Eddy Simulation (LES), with its high degree of arithmetic intensity
- Parameter optimization with a greatly increased number of jobs


Stan Posey

NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com