
Porting the physical parametrizations on GPU using directives


Page 1: Porting the physical parametrizations on GPU using directives

06/09/2011, COSMO GM, Xavier Lapillonne

Porting the physical parametrizations on GPU using directives

X. Lapillonne, O. Fuhrer

Eidgenössisches Departement des Innern EDI, Bundesamt für Meteorologie und Klimatologie MeteoSchweiz

Page 2: Porting the physical parametrizations on GPU using directives


Outline

• Physics with 2D data structure

• Porting the physical parametrizations to GPU using directives

• Running COSMO on a hybrid GPU-CPU system

Page 3: Porting the physical parametrizations on GPU using directives


New data structure

• 2D data fields inside the physics packages, with one horizontal and one vertical dimension: f(nproma,ke), with nproma = ie x je / nblock.

• Goals:
  • The physics packages could be shared with the ICON code.
  • Blocking strategy: all physics parametrizations can be computed while the data remains in the cache.

• organize_physics should be structured as follows:

    call init_radiation
    call init_turbulence
    ...
    do ib = 1, nblock
      call copy_to_block
      call organize_radiation
      ...
      call organize_turbulence
      call copy_back
    end do

  where the data inside organize_scheme is in block form t_b(nproma,ke) (a sketch of the repacking is given after this list).

• Note: an OpenMP parallelization could be introduced around the block loop.

• Routines below organize_scheme will be shared with ICON. Fields are passed via the argument list:

    call fesft(t_b(:,:), …
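
The repacking done by copy_to_block is not shown on the slide; below is a minimal sketch of what it could look like for a single 3D field. The argument list and the simple contiguous split of the ie x je plane are illustrative assumptions, not the actual COSMO implementation.

    ! Illustrative sketch only: repack a 3D field t(ie,je,ke) into the blocked
    ! 2D form t_b(nproma,ke) for block ib, assuming nproma = ie*je/nblock and
    ! a plain contiguous split of the collapsed horizontal index.
    subroutine copy_to_block(t, t_b, ib, ie, je, ke, nproma)
      implicit none
      integer, intent(in)  :: ib, ie, je, ke, nproma
      real,    intent(in)  :: t(ie, je, ke)
      real,    intent(out) :: t_b(nproma, ke)
      integer :: ip, k, ij, i, j
      do k = 1, ke
        do ip = 1, nproma
          ij = (ib - 1) * nproma + ip   ! collapsed horizontal index of this block element
          i  = mod(ij - 1, ie) + 1      ! recover (i,j) from the collapsed index
          j  = (ij - 1) / ie + 1
          t_b(ip, k) = t(i, j, k)
        end do
      end do
    end subroutine copy_to_block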

Page 4: Porting the physical parametrizations on GPU using directives


Current status

• Base code: COSMO 4.18

• 2D versions of the microphysics (hydci_pp), radiation (Ritter-Geleyn) and turbulence (turbtran + turbdiff) schemes.

• For the moment, microphysics and radiation are in separate block loops. The turbulence scheme still copies 3D fields (i.e. turbdiff(t(:,je,:), …).

Next steps

• All 3 parametrizations (microphysics + radiation + turbulence) in a common block loop

• Performance analysis

• OMP parallelization (?)

Longer term

• All parametrizations required for operational runs should be inside the block loop and in 2-dimensional form.

Page 5: Porting the physical parametrizations on GPU using directives


Outline

• Physics with 2D data structure

• Porting the physical parametrizations to GPU using directives

• Running COSMO on a hybrid GPU-CPU system

Page 6: Porting the physical parametrizations on GPU using directives


Computing on Graphical Processing Units (GPUs)

• Benefit from the highly parallel architecture of GPUs

• Higher peak performance at lower cost / power consumption.

• High memory bandwidth

                          Cores   Freq. (GHz)   Peak perf. SP (GFlop/s)   Peak perf. DP (GFlop/s)   Memory bandwidth (GB/s)   Power cons. (W)
    CPU: AMD Magny-Cours   12      2.1           202                       101                       42.7                      115
    GPU: Fermi M2050       448     1.15          1030                      515                       144                       225
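
Not stated on the slide, but a rough roofline-style reading of these figures gives the break-even arithmetic intensity on each architecture:

    \frac{515~\text{GFlop/s}}{144~\text{GB/s}} \approx 3.6~\text{Flop/byte (GPU)}, \qquad
    \frac{101~\text{GFlop/s}}{42.7~\text{GB/s}} \approx 2.4~\text{Flop/byte (CPU)}

Kernels whose arithmetic intensity is below these values are limited by memory bandwidth rather than by peak floating-point throughput, which, given the low fractions of peak reported later, appears to be the regime of most of these parametrizations.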

Page 7: Porting the physical parametrizations on GPU using directives


Execution model

[Diagram: execution model. Sequential execution on the host (CPU); kernels are launched on the device (GPU) and run as many parallel threads; data transfers move data between host and device.]

• Copy data from CPU to GPU (CPU and GPU memory are separate)

• Load specific GPU program (Kernel)

• Execution: Same kernel is executed by all threads, SIMD parallelism (Single instruction, multiple data)

• Copy back data from GPU to CPU


Page 8: Porting the physical parametrizations on GPU using directives


The directive approach, an example

Version A, as initially written (note: PGI directives; the code consists of 3 different kernels, and the array "a" remains on the GPU between the different kernel calls):

    !$acc data region local(a,b)
    !$acc update device(b)
    ! initialization
    !$acc region
    do k=1,nlev
      do i=1,N
        a(i,k)=0.0D0
      end do
    end do
    !$acc end region

    ! first layer
    !$acc region
    do i=1,N
      a(i,1)=0.1D0
    end do
    !$acc end region

    ! vertical computation
    !$acc region
    do k=2,nlev
      do i=1,N
        a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
    end do
    !$acc end region
    !$acc update host(a)
    !$acc end data region

Version B, after loop reordering (horizontal index i as the outer loop):

    !$acc data region local(a,b)
    !$acc update device(b)
    ! initialization
    !$acc region do kernel
    do i=1,N
      do k=1,nlev
        a(i,k)=0.0D0
      end do
    end do
    !$acc end region

    ! first layer
    !$acc region
    do i=1,N
      a(i,1)=0.1D0
    end do
    !$acc end region

    ! vertical computation
    !$acc region do kernel
    do i=1,N
      do k=2,nlev
        a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
    end do
    !$acc end region
    !$acc update host(a)
    !$acc end data region

Timings for N=1000, nlev=60: version A t = 555 μs, version B (loop reordering) t = 225 μs.
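
The examples above use the pre-OpenACC PGI Accelerator syntax. As a rough orientation only (this translation is an assumption, not part of the slides), the reordered version B corresponds approximately to the following standard OpenACC code:

    ! Sketch of version B in standard OpenACC syntax (assumed translation).
    program directive_example
      implicit none
      integer, parameter :: N = 1000, nlev = 60
      real(kind=8) :: a(N, nlev), b(N, nlev)
      integer :: i, k
      b = 1.0D0
      !$acc data create(a) copyin(b)        ! "a" stays on the GPU between kernels

      ! initialization: one thread per column i, k loop sequential inside
      !$acc parallel loop
      do i = 1, N
        do k = 1, nlev
          a(i,k) = 0.0D0
        end do
      end do

      ! first layer
      !$acc parallel loop
      do i = 1, N
        a(i,1) = 0.1D0
      end do

      ! vertical computation: parallel over columns, sequential recurrence in k
      !$acc parallel loop
      do i = 1, N
        do k = 2, nlev
          a(i,k) = 0.95D0*a(i,k-1) + exp(-2*a(i,k)**2)*b(i,k)
        end do
      end do

      !$acc update host(a)
      !$acc end data
      print *, a(1, nlev)
    end program directive_example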

Page 9: Porting the physical parametrizations on GPU using directives


Physical parametrizations on GPU using directives

• Physical parametrizations are tested using standalone code.

• Currently ported parametrizations:
  • PGI: microphysics (hydci_pp), radiation (fesft), turbulence (only turbdiff so far)
  • OMP-acc (Cray): microphysics, radiation

• GPU optimization: loop reordering, replacement of arrays with scalars

• Note: the hydci_pp, fesft and turbdiff subroutines represent respectively 6.7%, 8% and 7.3% of the total execution time of a typical COSMO-2 run (a rough implication of these fractions is sketched after this list).

• The current OMP-acc directives are a subset of the PGI directives, and it is possible to write the PGI code so that there is an almost one-to-one translation to OMP-acc.

• First investigations show similar performance for the two compilers, but this needs further analysis.
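
Not on the slide, but the quoted time fractions imply an Amdahl-type bound on what porting only these three schemes can buy for the full model (f ≈ 0.067 + 0.08 + 0.073 ≈ 0.22):

    S_{\text{overall}} = \frac{1}{(1-f) + f/S_{\text{kernel}}}
                       = \frac{1}{0.78 + 0.22/5} \approx 1.2
    \quad \text{for a kernel speed-up } S_{\text{kernel}} = 5

i.e. even a large speed-up of the ported schemes alone translates into a modest full-model gain, which is consistent with the longer-term plan of putting all parametrizations inside the block loop and on the GPU.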

Page 10: Porting the physical parametrizations on GPU using directives


Results, Fermi card using PGI directives

[Bar chart: sustained performance in GFlop/s (axis 0-30) for the microphysics, radiation and turbulence schemes]

• Peak performance of a Fermi card in double precision is 515 GFlop/s, i.e. we reach respectively 5%, 4.5% and 2.5% of peak performance for the microphysics, radiation and turbulence schemes (see the conversion below).

• Theoretical bandwidth is 140 GB/s, but the maximum achievable is around 110 GB/s.

• Test domain: nx x ny x nz = 80 x 60 x 60
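
As a cross-check (derived from the quoted peak value, not read off the chart), 5%, 4.5% and 2.5% of the 515 GFlop/s double-precision peak correspond to

    0.050 \times 515 \approx 26~\text{GFlop/s}, \quad
    0.045 \times 515 \approx 23~\text{GFlop/s}, \quad
    0.025 \times 515 \approx 13~\text{GFlop/s}

for the microphysics, radiation and turbulence schemes respectively.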

[Bar chart: overall memory throughput in GB/s (axis 0-120) for the microphysics, radiation and turbulence schemes]

Page 11: Porting the physical parametrizations on GPU using directives


Results: Comparison with CPU

[Bar chart: speed-up with respect to a 12-core CPU (Palu), axis 0-7, for microphysics, radiation and turbulence; one bar for execution time only and one for execution + data transfer]

• The parallel CPU code runs on a 12-core AMD Magny-Cours CPU; note that there are no MPI communications in these standalone test codes.

• Note: the expected speed-up is between 3x and 5x, depending on whether the problem is compute bound or memory-bandwidth bound (a rough estimate from the hardware figures is given after this list).

• The overhead of data transfer for the microphysics and turbulence is very large.
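
The 3x to 5x expectation can be read off the hardware table on slide 6 (a derivation, not stated on the slide): a bandwidth-bound kernel should gain roughly the ratio of memory bandwidths, a compute-bound kernel roughly the ratio of double-precision peaks:

    \frac{144~\text{GB/s}}{42.7~\text{GB/s}} \approx 3.4, \qquad
    \frac{515~\text{GFlop/s}}{101~\text{GFlop/s}} \approx 5.1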

Page 12: Porting the physical parametrizations on GPU using directives


Comments on the observed performance

• The microphysics has the largest compute intensity (with respect to memory access) and as such is best suited to the GPU.

• The lower speed-up observed for the radiation is relative: it essentially comes from the fact that the radiation code is very well optimized and vectorized on the CPU (~9% of peak performance).

• The turbulence scheme requires more memory access.

Next steps

• Port the turbtran subroutine with PGI, plus additional tests and optimizations (October 2011)

• Further investigation of the radiation and turbulence schemes with Cray directives (November 2011)

• GPU version of microphysics + radiation + turbulence inside COSMO (November-December 2011)

Page 13: Porting the physical parametrizations on GPU using directives


Outline

• Physics with 2D data structure

• Porting the physical parametrizations to GPU using directives

• Running COSMO on a hybrid GPU-CPU system

Page 14: Porting the physical parametrizations on GPU using directives


Possible future implementations in COSMO

[Diagram, variant 1: dynamics, the physics parametrizations (microphysics, turbulence, radiation) and I/O on the CPU, with each routine offloaded to the GPU and its data moved back and forth]

[Diagram, variant 2: dynamics, microphysics, turbulence and radiation all resident on the GPU; dynamics in C++ / CUDA, physics parametrizations via directives]

• Variant 1: data movement for each routine

• Variant 2, "Full GPU": data remain on the device and are only sent to the CPU for I/O and communication (a sketch of this layout follows)
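
A minimal sketch of the "full GPU" layout (illustrative names and a dummy kernel, not the actual COSMO code): the model state is created on the device once, all kernels work on device data, and the state is copied back only when I/O (or an MPI exchange) needs it.

    program full_gpu_sketch
      implicit none
      integer, parameter :: nproma = 128, ke = 60, nblock = 40, ntime = 10
      real(kind=8) :: t(nproma, ke, nblock)   ! illustrative prognostic field, blocked layout
      integer :: istep, ib
      t = 280.0D0
      !$acc data copy(t)                      ! state resident on the GPU for the whole run
      do istep = 1, ntime
        do ib = 1, nblock
          call physics_block(t(:,:,ib), nproma, ke)   ! no per-routine host-device transfer
        end do
        if (mod(istep, 5) == 0) then
          !$acc update host(t)                ! copy back only when output is due
          print *, 'output at step', istep, maxval(t)
        end if
      end do
      !$acc end data
    contains
      subroutine physics_block(tb, np, nk)
        integer, intent(in) :: np, nk
        real(kind=8), intent(inout) :: tb(np, nk)
        integer :: ip, k
        !$acc parallel loop collapse(2) present(tb)   ! data are already on the device
        do k = 1, nk
          do ip = 1, np
            tb(ip, k) = tb(ip, k) + 0.001D0   ! stand-in for a real parametrization
          end do
        end do
      end subroutine physics_block
    end program full_gpu_sketch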

Page 15: Porting the physical parametrizations on GPU using directives


Running COSMO-2 on a hybrid system

[Diagram: node with multicore processors and GPUs]

• One (or more) multicore CPUs

• Domain decomposition

• One GPU per subdomain (a device-binding sketch is given below)
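
Assigning one GPU per subdomain is not detailed on the slide; a common way to do it (a sketch assuming the OpenACC runtime API is available and that ranks are placed round-robin on each node) is to bind each MPI rank to a device at start-up:

    program bind_gpu_per_rank
      use mpi
      use openacc
      implicit none
      integer :: ierr, rank, ngpus, mydev
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      ngpus = acc_get_num_devices(acc_device_nvidia)
      ! device numbering may be 0- or 1-based depending on the OpenACC version/runtime
      mydev = mod(rank, ngpus)
      call acc_set_device_num(mydev, acc_device_nvidia)
      call acc_init(acc_device_nvidia)
      ! ... each rank now runs the kernels of its own subdomain on its own GPU ...
      call MPI_Finalize(ierr)
    end program bind_gpu_per_rank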

Page 16: Porting the physical parametrizations on GPU using directives


Summary

• Porting of the microphysics, radiation and turbulence schemes to GPU was successfully carried out using a directive-based approach

• Compared with a 12-core CPU, a speed-up between 2.4x and 6.5x was observed using one Fermi GPU card

• These results are within the expected values considering hardware properties

• The large overhead of data transfer shows that the "full GPU" approach (i.e. data remain on the GPU, all computation on the device) is the preferred approach for COSMO

Page 17: Porting the physical parametrizations on GPU using directives


Additional slides

Page 18: Porting the physical parametrizations on GPU using directives


Comparison between PGI and OMP-acc

PGI version:

    !$acc data region local(a)
    ! time loop
    do itime=1,nt
      ! initialization
      !$acc region
      do k=1,nlev
        do i=1,N
          a(i,k)=0.0D0
        end do
      end do
      !$acc end region

      ! first layer
      !$acc region do kernel
      do i=1,N
        a(i,1)=0.1D0
      end do
      !$acc end region

      ! vertical computation
      !$acc region do kernel
      do i=1,N
        do k=2,nlev
          a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
        end do
      end do
      !$acc end region
    end do ! end time loop
    !$acc update host(a)
    !$acc end data region

OMP-acc (Cray) version:

    !$omp acc_data acc_shared(a)
    ! time loop
    do itime=1,nt
      ! initialization
      !$omp acc_region_loop
      do k=1,nlev
        do i=1,N
          a(i,k)=0.0D0
        end do
      end do
      !$omp end acc_region_loop

      ! first layer
      !$omp acc_region_loop
      do i=1,N
        a(i,1)=0.1D0
      end do
      !$omp end acc_region_loop

      ! vertical computation
      !$omp acc_region_loop kernel
      do i=1,N
        do k=2,nlev
          a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
        end do
      end do
      !$omp end acc_region_loop
    end do ! end time loop
    !$omp acc_update host(a)
    !$omp end acc_data

Page 19: Porting the physical parametrizations on GPU using directives


CrayPAT information

    MAIN_ / mo_gscp_dwd_hydci_pp_ (x10)
    ------------------------------------------------------------------------
    User time (approx)          2.999 secs       7197500711 cycles
    System to D1 refill         2.434M/sec       7300271 lines
    System to D1 bandwidth      148.576MB/sec    467217344 bytes
    D2 to D1 bandwidth          1025.770MB/sec   3225672832 bytes
    L2 to System BW per core    140.940MB/sec    443203504 bytes
    HW FP Ops / User time       435.162M/sec     1308546592 ops    4.5% peak (DP)

    MAIN_ / src_radiation_fesft_ (x1)
    ------------------------------------------------------------------------
    User time (approx)          7.226 secs       17342858074 cycles   100.0% Time
    System to D1 refill         11.380M/sec      82232710 lines
    System to D1 bandwidth      694.569MB/sec    5262893440 bytes
    D2 to D1 bandwidth          1162.252MB/sec   8806624128 bytes
    L2 to System BW per core    645.679MB/sec    4892446080 bytes
    HW FP Ops / User time       893.252M/sec     6511701846 ops     9.3% peak (DP)

    MAIN_ / turbulence_diff_ref_turbdiff_ (x10)
    ------------------------------------------------------------------------
    User time (approx)          4.397 secs       10551890928 cycles   100.0% Time
    System to D1 refill         15.757M/sec      69278266 lines
    System to D1 bandwidth      961.741MB/sec    4433809024 bytes
    D2 to D1 bandwidth          485.462MB/sec    2238073856 bytes
    L2 to System BW per core    982.474MB/sec    4529394160 bytes
    HW FP Ops / User time       326.405M/sec     1452061875 ops     3.4% peak (DP)

Page 20: Porting the physical parametrizations on GPU using directives


Palu Results

Page 21: Porting the physical parametrizations on GPU using directives


Results, microphysics, double precision, Palu

[Bar chart: speed-up (DP), axis 0-8, for 1 CPU (12 cores), 2 CPUs (24 cores), GPU-Fermi Cray and GPU-Fermi PGI; one bar for speed-up without data transfer and one including data transfer]

Page 22: Porting the physical parametrizations on GPU using directives


Results, Radiation, double precision, Palu

[Bar chart: speed-up (DP), axis 0-3, for 1 CPU (12 cores), 2 CPUs (24 cores), GPU-Fermi Cray and GPU-Fermi PGI; one bar for speed-up without data transfer and one including data transfer]

Page 23: Porting the physical parametrizations on GPU using directives


Results, Turbulence, double precision, Palu

[Bar chart: speed-up (DP), axis 0-3.5, for 1 CPU (12 cores), 2 CPUs (24 cores), GPU-Fermi Cray and GPU-Fermi PGI; one bar for speed-up without data transfer and one including data transfer]