
Stan Posey

NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com

Agenda: GPU Progress and Directions for CAE

- Introduction of GPUs in HPC
- Progress of CFD on GPUs
- Review of OpenFOAM on GPUs
- Discussion on WRF Developments

CFD Algorithm Suitability for GPUs

CFD speed-ups have been demonstrated across a range of time schemes and spatial discretizations. The speed-up factors below are based on comparisons with an 8-core Xeon Sandy Bridge CPU, and the strategy noted for each quadrant is the recommended GPU porting approach.

- Explicit (usually compressible), structured grid: stencil operations, uniform memory references; ~15x; strategy: directives
- Explicit (usually compressible), unstructured grid: stencil operations, renumbering schemes; ~5x; strategy: directives
- Implicit (usually incompressible), structured grid: linear algebra solver, uniform memory references; ~5x; strategy: libraries
- Implicit (usually incompressible), unstructured grid: linear algebra solver, renumbering schemes (where most ISV codes sit); ~2-3x; strategy: libraries

Turbostream: CFD for Turbomachinery

Sample Turbostream GPU simulations: a typical routine simulation and a large-scale simulation (~19x speedup).

Source: http://www.turbostream-cfd.com/

SD++ and Jameson Aerodynamics Research

Stanford University Aerospace Computing Lab – Prof. Antony Jameson

GPU Application: Jameson-developed CFD software SD++ for high-order-method aerodynamic simulations.

GPU Benefit: 16 x Tesla M2070 complete the run in 15 hours vs. 202 hours (more than one week) on 16 x Xeon X5670, giving fast turnaround of complex LES simulations that would otherwise be impractical with CPUs only.

Benchmark case: transitional flow over an SD7003 airfoil, 21M DOF, Ma = 0.2, Re = 60K, AoA = 4 deg, 4th-order scheme, 400K RK iterations.

Fighter Jet Engine Noise Reduction on GPUs

U.S. DoD Naval Research Laboratory – Laboratory for Computational Physics and Fluid Dynamics

GPU Application: NRL-developed CFD software JENRE for simulation of jet engine acoustics.

GPU Benefit: Tesla M2070 gives 3x vs. a hex-core Intel (Westmere) CPU, enabling more detailed mesh simulations over longer durations of jet engine transient conditions.

Commercial Aircraft Wing Design on GPUs

COMAC (Commercial Aircraft Corporation of China) and SJTU (Shanghai Jiao Tong University)

GPU Application: SJTU-developed CFD software NUS3D for aerodynamic simulations of wing shapes.

GPU Benefit: Tesla C2070 gives 20x – 37x vs. a single-core Intel Core i7 CPU, allowing faster simulation of more wing design candidates than wind-tunnel testing; work is expanding to multi-GPU and full-aircraft configurations.

Images: COMAC wing candidate; ONERA M6 wing CFD simulation.

GPU Development Status for CFD

- Particle-based CFD (LBM, SPH, etc.) is generally a better fit than continuum methods.
- Fully deployed explicit solvers generally outperform implicit solvers.
- Explicit i,j,k stencil operations are a good fit for massively parallel threads.
- Most CFD is distributed-parallel across CPU multicores/nodes; this fits the GPU parallel model well and preserves the costly MPI investment. The focus is on hybrid parallel schemes that utilize all CPU cores plus the GPU.
- The GPU development strategy depends on the profile starting point (see the sketch after this list):
  - Legacy explicit scheme: compiler directives such as OpenACC
  - New explicit scheme: CUDA and stencil libraries
  - Legacy implicit scheme: CUDA and libraries for the solver; OpenACC for the rest
  - New implicit scheme: CUDA and libraries for the solver, matrix assembly, etc.
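As a concrete illustration of why explicit i,j,k stencil operations map so well to GPU threads, here is a minimal CUDA sketch of a structured-grid update (hypothetical field names, not from any of the codes above; a legacy code would more likely expose the same triple loop to the GPU through OpenACC directives rather than hand-written CUDA):

```cuda
// Minimal sketch of an explicit i,j,k stencil update on a structured grid.
// One GPU thread per interior cell; indexing is pure arithmetic with
// uniform memory references, which is why this pattern accelerates well.
#include <cuda_runtime.h>

__global__ void stencil_update(const float* u, float* u_new,
                               int nx, int ny, int nz, float c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1 || k < 1 || k >= nz - 1)
        return;  // leave boundary cells untouched

    int idx = (k * ny + j) * nx + i;            // flattened (i, j, k) index
    u_new[idx] = u[idx] + c * (u[idx - 1]       + u[idx + 1]
                             + u[idx - nx]      + u[idx + nx]
                             + u[idx - nx * ny] + u[idx + nx * ny]
                             - 6.0f * u[idx]);  // 7-point stencil
}

// Typical launch: one thread per cell of the nx x ny x nz grid.
// dim3 block(8, 8, 8);
// dim3 grid((nx + 7) / 8, (ny + 7) / 8, (nz + 7) / 8);
// stencil_update<<<grid, block>>>(d_u, d_u_new, nx, ny, nz, 0.1f);
```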

FluiDyna and Aerodynamic-Aware Surface Design

- RTT DeltaGen for photo-realistic 3D visualization
- Integrates FluiDyna LBultra CFD functionality as a plug-in
- The designer only needs to specify resolution and velocity
- Simulation data is displayed live with GPU performance

Courtesy of FluiDyna and the LBultra CFD software: www.fluidyna.de

Prometech and Particleworks for Multiphase Flow

- MPS-based method developed at the University of Tokyo (Prof. Koshizuka)
- Example: oil flow in an HB gearbox
- Benchmark: Particleworks 3.0 on GPU vs. a 4-core Core i7

Courtesy of Prometech Software and the Particleworks CFD software: http://www.prometech.co.jp

Availability of Commercial DSFD-Based Software

ISV Software | Application | Method | GPU Status
PowerFLOW | Aerodynamics | LBM | Evaluation
LBultra | Aerodynamics | LBM | Available v2.0
XFlow | Aerodynamics | LBM | Evaluation
Project Falcon | Aerodynamics | LBM | Evaluation
Particleworks | Multiphase/FS | MPS (~SPH) | Available v3.1
BARRACUDA | Multiphase/FS | MP-PIC | In development
EDEM | Discrete phase | DEM | In development
ANSYS Fluent – DDPM | Multiphase/FS | DEM | In development
STAR-CCM+ | Multiphase/FS | DEM | Evaluation
AFEA | High impact | SPH | Available v2.0
ESI | High impact | SPH, ALE | In development
LSTC | High impact | SPH, ALE | Evaluation
Altair | High impact | SPH, ALE | Evaluation

Grid-Based Commercial CFD and GPU Progress

ISV | Primary Applications (green in the original slide marked applications CUDA-ready during 2013)
ANSYS | ANSYS Mechanical; ANSYS Fluent; ANSYS HFSS
DS SIMULIA | Abaqus/Standard; Abaqus/Explicit; Abaqus/CFD
MSC Software | MSC Nastran; Marc; Adams
Altair | RADIOSS; AcuSolve
CD-adapco | STAR-CD; STAR-CCM+
Autodesk | AS Mechanical; Moldflow; AS CFD
ESI Group | PAM-CRASH imp; CFD-ACE+
Siemens | NX Nastran
LSTC | LS-DYNA; LS-DYNA CFD
Mentor | FloEFD; FloTherm
Metacomp | CFD++

Additional Commercial GPU Developments

ISV | Domain | Location | Primary Applications
FluiDyna | CFD | Germany | Culises for OpenFOAM; LBultra
Vratis | CFD | Poland | Speed-IT for OpenFOAM; ARAEL
Prometech | CFD | Japan | Particleworks
Turbostream | CFD | England, UK | Turbostream
IMPETUS | Explicit FEA | Sweden | AFEA
AVL | CFD | Austria | FIRE
CoreTech | CFD (molding) | Taiwan | Moldex3D
Intes | Implicit FEA | Germany | PERMAS
Next Limit | CFD | Spain | XFlow
CPFD | CFD | USA | BARRACUDA
Flow Science | CFD | USA | FLOW-3D

Status Summary of ISVs and GPU Computing

- Every primary ISV has products available on GPUs or undergoing evaluation.
- The 4 largest ISVs all have products based on GPUs, some at the 3rd generation: #1 ANSYS, #2 DS SIMULIA, #3 MSC Software, and #4 Altair.
- 4 of the top 5 ISV applications are available on GPUs today: ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, MSC Nastran (and LS-DYNA, implicit only).
- Several new ISVs were founded with GPUs as a primary competitive strategy: Prometech, FluiDyna, Vratis, IMPETUS, Turbostream.
- Open-source CFD OpenFOAM is available on GPUs today with many options. Commercial options: FluiDyna, Vratis; open-source options: Cufflink, Symscape ofgpu, RAS, etc.

CFD Algorithm Characterization: Discretization

Spatial discretization categories: structured-grid finite volume (FV), unstructured finite volume (FV), and unstructured finite element (FE).

CFD Algorithm Characterization: Time Integration

The same discretization categories (structured FV, unstructured FV, unstructured FE) are split by time integration: explicit schemes (usually compressible) and implicit schemes (usually incompressible).

CFD Algorithm Characterization: Time Integration (continued)

For the explicit (usually compressible) schemes across structured FV, unstructured FV, and unstructured FE: the numerical operations are i,j,k stencil updates with no linear "solver". Profiles are typically flat, so the GPU strategy is compiler directives (OpenACC).

GPU Acceleration Relative to a Single 8-Core CPU: Explicit Codes

- Structured-grid FV, explicit (usually compressible): Turbostream, SJTU RANS, at ~15x
- Unstructured FV/FE, explicit: SD++ (Stanford, Prof. Jameson), FEFLO (Prof. Lohner), Veloxi, at ~5x

GPU Acceleration Relative to a Single 8-Core CPU: Implicit Solver Characteristics

For the implicit (usually incompressible) codes, the dominant work is sparse-matrix linear algebra in iterative solvers. This hot spot is roughly 50% of runtime but a small percentage of the lines of code, so the GPU strategy is CUDA and libraries.

GPU Acceleration Relative to a Single 8-Core CPU: Implicit Codes

- Unstructured FV, implicit (usually incompressible): ANSYS Fluent, Culises for OpenFOAM, SpeedIT for OpenFOAM, CFD-ACE+, FIRE, at ~2x
- Unstructured FE, implicit: Moldflow, AcuSolve, Moldex3D, at ~2x

(Explicit entries as on the previous chart: Turbostream and SJTU RANS at ~15x; SD++, FEFLO, Veloxi at ~5x.)

Commercial CFD Focus on Sparse Solvers for GPU

A typical implicit CFD application is split between CPU and GPU:

- CPU: read input, matrix set-up, global solution, write output
- GPU: implicit sparse matrix operations, which account for 50% - 65% of profile time but a small percentage of the lines of code

GPU approaches for the sparse operations: hand-written CUDA, GPU libraries such as cuBLAS, or OpenACC directives. (OpenACC is also being investigated for moving more tasks onto the GPU.) A sketch of the core sparse-matrix kernel follows.
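To make the hot spot concrete, here is a minimal CUDA sketch of the kind of sparse matrix-vector product (CSR format) that sits at the center of these implicit solvers. In practice ISVs call tuned library routines (e.g., cuSPARSE) rather than a hand-written kernel like this one; the array names are hypothetical.

```cuda
// Minimal sketch: CSR sparse matrix-vector product y = A * x, one thread
// per matrix row. Real solvers use tuned library kernels, but the data
// access pattern (indirect, row-by-row) is the same.
#include <cuda_runtime.h>

__global__ void csr_spmv(int n_rows,
                         const int* row_ptr,    // size n_rows + 1
                         const int* col_idx,    // size nnz
                         const double* values,  // size nnz
                         const double* x,
                         double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double sum = 0.0;
    for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj)
        sum += values[jj] * x[col_idx[jj]];   // indirect access via col_idx
    y[row] = sum;
}

// Launch example:
// csr_spmv<<<(n_rows + 255) / 256, 256>>>(n_rows, d_row_ptr, d_col_idx,
//                                         d_vals, d_x, d_y);
```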

NVIDIA-Developed Library of Linear Solvers

- Library of nested solvers for large sparse Ax = b. Nesting creates a solver hierarchy, e.g. BiCGStab preconditioned by AMG, with Jacobi or MC-DILU as smoothers.
- Example solvers:
  - Jacobi: simple local (neighbor) operations, no/little setup
  - BiCGStab: local and global operations, no setup
  - MC-DILU: graph coloring and factorization at setup
  - AMG: multi-level scheme, with graph coarsening and matrix-matrix products at setup on each level
- Goal: accelerate state-of-the-art multi-level linear solvers in targeted application domains. Primary targets are CFD and reservoir simulation; other domains will follow.
- Focus on difficult-to-parallelize algorithms, parallelizing both the setup and solve phases. Difficult problems: parallel graph algorithms, sparse matrix manipulation, parallel smoothers. No groups have successfully mapped production-quality algorithms to fine-grained parallel architectures.
- Ensure the NVIDIA architecture team understands these applications and is influenced by them.
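As an illustration of the simplest level of such a hierarchy, here is a minimal CUDA sketch of one Jacobi relaxation sweep on a CSR matrix. In a nested scheme like the one described above, this kind of kernel would serve as the smoother inside AMG or as a simple preconditioner for BiCGStab; the array names are hypothetical and not the library's actual API.

```cuda
// Minimal sketch: one Jacobi sweep x_new = x + D^{-1} (b - A x) on a CSR
// matrix, one thread per row. Only local neighbor data is needed, which is
// why Jacobi requires essentially no setup.
#include <cuda_runtime.h>

__global__ void jacobi_sweep(int n_rows,
                             const int* row_ptr, const int* col_idx,
                             const double* values,
                             const double* b, const double* x,
                             double* x_new)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double diag = 1.0;   // fallback if the row has no stored diagonal
    double r = b[row];   // residual component b_i - sum_j a_ij x_j
    for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; ++jj) {
        int col = col_idx[jj];
        if (col == row) diag = values[jj];
        r -= values[jj] * x[col];
    }
    x_new[row] = x[row] + r / diag;
}
```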

ISV Progress with NVIDIA CFD Solver Library

Committed:
- ANSYS – ANSYS Fluent and ANSYS CFD: #1 in CFD
- FluiDyna – Culises library for use in OpenFOAM: OpenFOAM is #2 in CFD for leveraged hardware

Evaluation:
- Autodesk – AS Moldflow: the leader in plastic injection molding simulation
- Autodesk – AS CFD: important to the design engineering market and being hosted on the Autodesk cloud

Discussion:
- CD-adapco – STAR-CCM+: the #2 CFD code for software revenue, either #2 or #3 for leveraged hardware
- ESI – CFD-ACE+: important CFD code in the semiconductor/electronics industry, along with others
- Cradle – SC/Tetra: #3 CFD in Japan (behind ANSYS Fluent and STAR-CCM+) and the primary CFD code at Toyota

Targets:
- Altair – AcuSolve: GMRES
- Metacomp – CFD++: AMG
- Mentor – FloEFD: AMG
- SIMULIA – Abaqus/CFD: uses ML from PETSc
- LSTC – LS-DYNA CFD: AMG
- AVL – FIRE: AMG
- Convergent Technologies – Converge CFD: GMRES

ANSYS and NVIDIA Technical Collaboration

Release | ANSYS Mechanical | ANSYS Fluent | ANSYS EM
13.0 Dec 2010 | SMP, single GPU, sparse and PCG/JCG solvers | (none) | ANSYS Nexxim
14.0 Dec 2011 | + Distributed ANSYS; + multi-node support | Radiation heat transfer (beta) | ANSYS Nexxim
14.5 Nov 2012 | + Multi-GPU support; + hybrid PCG; + Kepler GPU support | + Radiation HT; + GPU AMG solver (beta), single GPU | ANSYS Nexxim
15.0 Q4-2013 | + CUDA 5 Kepler tuning | + Multi-GPU AMG solver; + CUDA 5 Kepler tuning | ANSYS Nexxim; ANSYS HFSS (transient)

ANSYS Fluent 14.5 and Radiation HT on GPU

Radiation HT applications:
- Underhood cooling
- Cabin comfort HVAC
- Furnace simulations
- Solar loads on buildings
- Combustor in a turbine
- Electronics passive cooling

VIEWFAC utility: runs on CPUs, GPUs, or both, with ~2x speedup.
RAY TRACING utility: uses the OptiX library from NVIDIA, with up to ~15x speedup (GPU only).

ANSYS Fluent CPU Job Profile for Coupled PBNS

Runtime is spent in the non-linear iteration loop:

1. Assemble the linear system of equations (~35% of runtime)
2. Solve the linear system of equations Ax = b (~65% of runtime): accelerate this first
3. Check convergence; if not converged, repeat the loop, otherwise stop

A rough speedup bound based on this profile follows.
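An Amdahl-style bound (my own illustration, not stated on the slide) makes the profile numbers concrete, assuming only the solve phase, a fraction f of runtime, is accelerated by a factor S:

```latex
% Overall speedup when only a fraction f of runtime is accelerated by a factor S:
S_{\mathrm{overall}} = \frac{1}{(1 - f) + f/S}
% With f = 0.65 (the solve) and S = 5:  1 / (0.35 + 0.13) \approx 2.1
% Even as S -> \infty the ceiling is 1 / 0.35 \approx 2.9, which is why matrix
% assembly becomes the natural next target for the GPU.
```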

ANSYS Fluent GPU-Based AMG Solver from NVIDIA

ANSYS Fluent 14.5 performance, results by NVIDIA, Nov 2012. Chart: ANSYS Fluent AMG solver time per iteration (seconds, lower is better) for airfoil (hex, 784K cells) and aircraft (hex, 1798K cells) models, comparing a Tesla K20X against a Core i7-3930K with 6 cores used (2 x Core i7-3930K in the system). Times are for the solver only; the GPU is 2.4x faster on both cases.

Solver settings:
- CPU Fluent solver: F-cycle, agg8, DILU, 0 pre-sweeps, 3 post-sweeps
- GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0 pre-sweeps, 3 post-sweeps

Comparison of AMG Cycles on CPU and GPU

2D convection case: the F-cycle is best for both CPU and GPU (chart compares CPU-F and GPU-F cycle times; lower is better).

GPUs and Distributed Cluster Computing

Diagram: the geometry is decomposed into partitions (1-4) that are placed on independent cluster nodes (N1-N4) for CPU distributed parallel processing; the nodes run distributed-parallel using MPI and combine into the global solution.

GPUs and Distributed Cluster Computing (with GPUs)

The same decomposition then executes on CPU + GPU: each node's partition is accelerated by a GPU (G1-G4), with the GPUs used shared-memory parallel (OpenMP) underneath the distributed MPI parallelism. A hedged sketch of this hybrid layout follows.
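As a minimal sketch of the hybrid layout above (an assumed structure, not taken from any particular ISV code): each MPI rank owns one partition, binds to a local GPU, and keeps the existing MPI communication for the global solution.

```cuda
// Minimal sketch of the hybrid MPI + GPU layout: one MPI rank per partition,
// each rank bound to a local GPU. The existing MPI structure is preserved;
// only the per-partition compute moves to the device.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, n_ranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);

    // Bind this rank (partition) to one of the GPUs on its node.
    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    if (n_devices > 0) cudaSetDevice(rank % n_devices);

    // ... load this rank's partition, copy it to the GPU, and run the
    //     per-partition solver kernels on the device ...

    // Halo / global-solution exchange stays in MPI: copy boundary data back
    // to the host and use MPI_Sendrecv / MPI_Allreduce exactly as before.

    MPI_Finalize();
    return 0;
}
```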

ANSYS Fluent Preview for 2 x CPU + 2 x Tesla K20X

ANSYS Fluent 15.0 preview performance, results by NVIDIA, Feb 2013. Chart: solver time (lower is better) for helix (tet, 1173K cells) and airfoil (hex, 784K cells) models, comparing 2 x K20X (with only 2 CPU cores used) against 2 x Xeon E5-2680 Sandy Bridge CPUs (16 cores total). Times are for the solver only; speedups are 1.7x (helix) and 2.1x (airfoil).

Solver settings:
- CPU Fluent solver: F-cycle, agg8, DILU, 0 pre-sweeps, 3 post-sweeps
- GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0 pre-sweeps, 3 post-sweeps

ANSYS Fluent Scaling Results for 4 x Tesla K20X

ANSYS Fluent 15.0 preview performance, results by NVIDIA, Mar 2013. Chart: solver speedup (higher is better) from 1 to 4 K20X GPUs, against perfect scaling, for three cases: helix (tet, 1.2M cells), airfoil (hex, 0.78M cells), and sedan (mixed, 3.6M cells; this case starts at 2 GPUs). Results are for the solver only.

GPU solver settings: V-cycle, agg8/2, MC-DILU, 0 pre-sweeps, 3 post-sweeps.
Hardware setup: 2 server nodes, 2 GPUs per node, InfiniBand network.

ANSYS Fluent 15.0 Multi-GPU Demonstration

Multi-GPU acceleration of a 16-core ANSYS Fluent simulation of external aerodynamics: a 16-core server node (two 8-core Xeon E5-2667 CPUs) plus Tesla K20X GPUs (G1-G4) delivers a 2.9x solver speedup over the CPU-only configuration.

Summary: Opportunity for Advanced CFD

Problem statement:
- CFD users demand increased levels of model resolution for improved simulation accuracy.
- CFD use today is 80% steady-state RANS, largely as a short-cut to faster turnaround.
- Fluid flow is inherently unsteady and in need of better turbulence treatment.
- CPU-based HPC limits advanced CFD.

Opportunity:
- CFD ISVs have developed URANS, DES, and LES capabilities, but these see very limited use.
- CPU-based turnaround times are impractical for many product development workflows.
- Large Eddy Simulation (LES) is of most interest and has a high degree of arithmetic intensity.
- GPU computing can offer a practical solution for LES that does not exist today with CPUs.

Conclusions for CAE on GPUs

Opportunities exist for GPUs to provide significant performance acceleration for solver-intensive large jobs:
- Improved product quality
- Shorter product engineering cycles (faster time-to-market)
- Better total cost of ownership (TCO)
- Reduced energy consumption in the CAE process

Simulations recently considered intractable are now possible:
- Large Eddy Simulation (LES), with its high degree of arithmetic intensity
- Parameter optimization with a greatly increased number of jobs


Stan Posey

NVIDIA, Santa Clara, CA, USA; sposey@nvidia.com