Porting the physical parametrizations on GPUs using directives
X. Lapillonne, O. Fuhrer,
Cristiano Padrin, Piero Lanucara, Alessandro Cheloni
Federal Department of Home Affairs FDHA, Federal Office of Meteorology and Climatology MeteoSwiss
2 08/09/2011, COSMO GM X. Lapillonne
Outline
• Computing with GPUs
• COSMO parametrizations on GPUs using directives
• Running COSMO on a hybrid system
• Summary
Computing on Graphical Processing Units (GPUs)
• Benefit from the highly parallel architecture of GPUs
• Higher peak performance at lower cost / power consumption.
• High memory bandwidth
                        Cores  Freq.   Peak Perf.  Peak Perf.  Memory     Power
                               (GHz)   S.P.        D.P.        Bandwidth  Cons.
                                       (GFLOPs)    (GFLOPs)    (GB/s)     (W)
CPU: AMD Magny-Cours      12    2.1       202         101        42.7      115
GPU: Fermi M2050         448    1.15     1030         515       144        225

GPU vs CPU: ≈ ×5 peak performance, ≈ ×3 memory bandwidth
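The ×5 / ×3 annotations follow directly from the table values; a quick sanity check (the script itself is not from the slides):

```python
# Table values from the slide: peak performance and memory bandwidth
# for the AMD Magny-Cours CPU and the NVIDIA Fermi M2050 GPU.
cpu = {"sp_gflops": 202, "dp_gflops": 101, "bw_gbs": 42.7}
gpu = {"sp_gflops": 1030, "dp_gflops": 515, "bw_gbs": 144}

sp_ratio = gpu["sp_gflops"] / cpu["sp_gflops"]  # ~5.1
dp_ratio = gpu["dp_gflops"] / cpu["dp_gflops"]  # ~5.1
bw_ratio = gpu["bw_gbs"] / cpu["bw_gbs"]        # ~3.4

print(f"peak perf ratio ~x{dp_ratio:.1f}, bandwidth ratio ~x{bw_ratio:.1f}")
```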
Execution model
[Figure: execution model — sequential code runs on the host (CPU); kernels are executed by many parallel threads on the device (GPU); data transfers connect the two memories]
• Copy data from CPU to GPU (CPU and GPU memories are separate)
• Load specific GPU program (Kernel)
• Execution: Same kernel is executed by all threads, SIMD parallelism (Single instruction, multiple data)
• Copy back data from GPU to CPU
Outline
• Computing with GPUs
• COSMO parametrizations on GPUs using directives
• Running COSMO on a hybrid system
• Summary
The directive approach, an example
Without loop reordering:

    !$acc data region local(a,b)
    !$acc update device(b)
    ! initialization
    !$acc region
    do k=1,nlev
       do i=1,N
          a(i,k)=0.0D0
       end do
    end do
    !$acc end region
    ! first layer
    !$acc region
    do i=1,N
       a(i,1)=0.1D0
    end do
    !$acc end region
    ! vertical computation
    !$acc region
    do k=2,nlev
       do i=1,N
          a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
       end do
    end do
    !$acc end region
    !$acc update host(a)
    !$acc end data region

With loop reordering (i outermost):

    !$acc data region local(a,b)
    !$acc update device(b)
    ! initialization
    !$acc region do kernel
    do i=1,N
       do k=1,nlev
          a(i,k)=0.0D0
       end do
    end do
    !$acc end region
    ! first layer
    !$acc region
    do i=1,N
       a(i,1)=0.1D0
    end do
    !$acc end region
    ! vertical computation
    !$acc region do kernel
    do i=1,N
       do k=2,nlev
          a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
       end do
    end do
    !$acc end region
    !$acc update host(a)
    !$acc end data region

N=1000, nlev=60: t = 555 μs (without reordering) vs t = 225 μs (with reordering)

Note: PGI directives; 3 different kernels; array “a” remains on the GPU between the different kernel calls
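As a CPU reference (not from the slides), the kernel in the Fortran example above can be transcribed into NumPy; `b` is filled with ones here as an arbitrary test input:

```python
import numpy as np

N, nlev = 1000, 60          # sizes quoted on the slide
b = np.ones((N, nlev))      # arbitrary input data (assumption, not from the slide)

# initialization
a = np.zeros((N, nlev))
# first layer (Fortran a(i,1) -> Python a[:, 0])
a[:, 0] = 0.1
# vertical computation: a recurrence in k, but every column i is independent,
# which is why the parametrizations parallelize naturally over i
for k in range(1, nlev):
    a[:, k] = 0.95 * a[:, k - 1] + np.exp(-2.0 * a[:, k] ** 2) * b[:, k]
```

Note that `a[:, k]` on the right-hand side is still zero when the statement is evaluated, matching the Fortran statement order.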
COSMO physical parametrizations with directives
Note: Directives are tested in standalone version of various parametrizations
• OMP-acc: discussed within the OpenMP committee; currently only supported by a test version of the Cray compiler.
  + possible future standard
  Currently ported: microphysics (hydci_pp), radiation (fesft)
• PGI directives: developed for the PGI compiler, quite advanced.
  + most mature; OMP-acc is a subset of the PGI directives
  - vendor specific
  Currently ported: microphysics (hydci_pp), radiation (fesft), turbulence (turbdiff)
• F2C-ACC: developed by NOAA.
  + freely available; generates CUDA code (possibility to further optimize and debug at this stage)
  - ongoing project
  Currently ported: microphysics (hydci_pp)
COSMO physical parametrizations with directives
• Specific GPU optimizations have been introduced:
  • loop reordering (if necessary)
  • replacement of arrays with scalars
• For a typical COSMO-2 simulation on a CPU cluster, the subroutines hydci_pp, fesft and turbdiff represent respectively 6.7%, 8% and 7.3% of the total execution time.
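As a hypothetical back-of-envelope (not in the slides), Amdahl's law shows why accelerating only these three routines, about 22% of the run time, yields a limited overall gain — which motivates the full GPU approach discussed in the summary:

```python
# Fractions of total execution time quoted on the slide for
# hydci_pp, fesft and turbdiff in a typical COSMO-2 simulation.
f = 0.067 + 0.080 + 0.073   # ~0.22 of the run time is ported

def overall_speedup(f, s):
    """Amdahl's law: whole-run speed-up when a fraction f is accelerated by s."""
    return 1.0 / ((1.0 - f) + f / s)

# Even a 5x speed-up of the physics gives only ~1.21x overall.
print(overall_speedup(f, 5.0))
```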
Performance results: CPU/GPU comparison
• CPU: AMD 12-core “Magny-Cours”, MPI-parallel code. Note: there is no MPI communication, as the parametrizations are column-independent
• GPU: Fermi card (M2050)
• Test case: 2010-09-18 12:00z, +3 h
• Codes are tested using one subdomain of size nx x ny x nz= 80 x 60 x 60
Results: Comparison with CPU
• Speed-up between 2.4x and 6.5x
• Note: expected speed-up is between 3x and 5x, depending on whether the problem is compute-bound or memory-bandwidth-bound.
• Overhead of data transfer for microphysics and turbulence is very large.
Comparison PGI, F2C-ACC
• First investigations show that the F2C-ACC implementation is 1.2 times faster
• Showing execution + data transfer time for the microphysics, run on the same GPU
Outline
• Computing with GPUs
• COSMO parametrizations on GPUs using directives
• Running COSMO on a hybrid system
• Summary
Possible future implementations in COSMO
[Diagrams: two variants, each showing Dynamics, Phys. parametrization (Microphysics, Turbulence, Radiation) and I/O; “C++, CUDA” and “Directives” label the implementations]

• Variant 1: only the physical parametrizations run on the GPU — data movement for each routine
• Variant 2: everything runs on the GPU — data remain on the device, only sent to the CPU for I/O and communication
Running COSMO-2 on a hybrid system

[Diagram: multicore processors and GPUs]

• One (or more) multicore CPUs
• Domain decomposition
• One GPU per subdomain
Outline
• Computing with GPUs
• COSMO parametrizations on GPUs using directives
• Running COSMO on a hybrid system
• Summary
Summary
• Within the POMPA project, investigations are being carried out to port the COSMO code to GPU architectures
• Different directive approaches are considered for porting the physical parametrizations to this architecture: PGI, OMP-acc and F2C-ACC
• Compared with a high-end 12-core CPU, a speed-up between 2.4x and 6.5x was observed using one Fermi GPU card with the PGI directives
• These results are within the expected values, considering the hardware properties
• The large overhead of data transfer shows that the full GPU approach (i.e. data remain on the GPU, all computations on the device) is the preferred approach for COSMO
• First investigations on the microphysics show a speed-up of 1.2x for F2C-ACC with respect to the PGI implementation on the GPU
Results, Fermi card using PGI directives
[Bar chart: sustained performance (GFlop/s, scale 0–30) for microphysics, radiation and turbulence]
• The double-precision peak performance of a Fermi card is 515 GFlop/s, i.e. respectively 5%, 4.5% and 2.5% of peak is achieved for the microphysics, radiation and turbulence schemes
• Test domain: nx x ny x nz = 80 x 60 x 60
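The sustained GFlop/s values implied by these fractions of peak can be recomputed directly (a quick check, not from the slides):

```python
# Double-precision peak of the Fermi M2050 and the quoted fractions of peak
# achieved by each parametrization (values from the slide).
peak_dp = 515.0  # GFlop/s
fractions = {"microphysics": 0.05, "radiation": 0.045, "turbulence": 0.025}

# ~25.8, ~23.2 and ~12.9 GFlop/s, consistent with the chart's 0-30 scale.
for name, frac in fractions.items():
    print(f"{name}: ~{peak_dp * frac:.1f} GFlop/s")
```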
Scaling with system size N=nx x ny
[Plot: performance vs system size N = nx × ny; annotation at nx × ny = 100 × 100: not enough parallelism]
The directive approach

Example with OMP-acc:

    !$omp acc_data acc_shared(a,b,c)
    !$omp acc_update acc(a,b)
    !$omp acc_region
    do k = 1,n1
       do i = 1,n3
          c(i,k) = 0.0
          do j = 1,n2
             c(i,k) = c(i,k) + a(i,j) * b(j,k)
          enddo
       enddo
    enddo
    !$omp end acc_region
    !$omp acc_update host(c)
    !$omp end acc_data

Generates kernel at loop level
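The loop nest in the OMP-acc example above is an ordinary matrix product, c(i,k) = Σ_j a(i,j)·b(j,k); a CPU transcription in Python makes this explicit (the dimensions n1, n2, n3 are chosen arbitrarily here, since the slide does not specify them):

```python
import numpy as np

# Arbitrary small dimensions (assumption, not from the slide).
n1, n2, n3 = 4, 5, 3
rng = np.random.default_rng(0)
a = rng.standard_normal((n3, n2))   # a(i,j), i = 1..n3, j = 1..n2
b = rng.standard_normal((n2, n1))   # b(j,k), j = 1..n2, k = 1..n1

# Literal transcription of the triple loop from the directive example.
c = np.zeros((n3, n1))
for k in range(n1):
    for i in range(n3):
        for j in range(n2):
            c[i, k] += a[i, j] * b[j, k]

assert np.allclose(c, a @ b)        # the loop nest is just c = a * b
```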
Computing on Graphical Processing Units (GPUs)
• To be efficient the code needs to take advantage of fine grain parallelism so as to execute 1000s of threads in parallel.
• GPU code:
  • Programming level: OpenCL, CUDA, CUDA Fortran (PGI), …
    Best performance, but requires a complete rewrite
  • Directive-based approach: OpenMP-acc, PGI, HMPP, F2C-ACC
    Smaller modifications to the original code
    The resulting code is still understandable by Fortran programmers and can be easily modified
    Possible performance sacrifice with respect to CUDA code
    No standard for the moment (but work within the OpenMP committee)
• Data transfer time between host and GPU may strongly reduce the overall performance