Progress Porting WRF to GPU using OpenACC

Carl Ponder, Ph.D. (HPC Applications Performance, NVIDIA Developer Technology)
Alexey Romanenko and Alexey Snytnikov (NVIDIA OpenSource Applications)
WRF Porting Plan
WRF developers interested in GPU port if performance is suitable.
OpenACC extensions are acceptable because:
● This is an open standard, unlike CUDA Fortran
● Minimal changes, like OpenMP, unlike OpenCL
Changes to WRF modules will need to be negotiated with developers in order to provide support.
Tradeoff is extent of changes versus performance gain.
Status So Far
OpenACC, minimal rewrite
Updated our changes to the WRF 3.6.1 (latest) bugfix release
Physics Models:
Thompson: 2x
Morrison: 4.5x
Kessler: 2x
Speedups for 1 proc, 1 core versus 1 core + 1 GPU
Measured in isolation from the rest of WRF
Still working on the Dynamics to get end-to-end
Need scaling to more cores sharing GPU
WRF Parallel Performance Issues
WRF has a flattish profile. By Amdahl's Law, speeding up any particular loop doesn't help much.
Loops tend NOT to re-process data, so moving arrays host ↔ GPU costs more than parallel processing would save.
Overall speedup requires speedup of many loops.
Also requires keeping data resident on GPU for the bulk of the computation.
Similar issue with HYCOM and other codes.
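The flat-profile point can be made concrete with Amdahl's law (the numbers below are illustrative, not WRF measurements):

```latex
% Overall speedup S from accelerating a fraction p of the runtime by factor s:
S = \frac{1}{(1-p) + p/s}
% Hypothetical flat-profile example: one loop is p = 0.10 of the runtime
% and is accelerated by s = 20:
S = \frac{1}{0.90 + 0.10/20} = \frac{1}{0.905} \approx 1.10
% i.e. a 20x kernel speedup yields only ~10% overall gain.
```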
WRF Re-Coding Issues
Large code, almost 1 million lines
Slow compiles – 30 minutes on PGI, 8 hours on Cray
Fortran with modules
Subroutines with thousands of lines
Makes manual analysis difficult
Matching data update and present directives
Matching data & kernel/parallel directives with compound statements
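The matching problem above can be illustrated with a minimal sketch (array and routine names are hypothetical): a `present` clause on a compute construct is only valid if some enclosing `data` construct, possibly in a different routine or module, actually placed those arrays on the device.

```fortran
! Hypothetical sketch: "present" on the compute construct must be matched
! by an enclosing "data" region that created/copied the arrays to the GPU.
!$acc data copyin(u, ru) copy(tendency)
  call compute_tendency(u, ru, tendency)   ! callee may live in another module
!$acc end data

! ... inside compute_tendency:
!$acc parallel present(u, ru, tendency)
!$acc loop
! ... loops over the arrays ...
!$acc end parallel
```

In a million-line code, verifying by hand that every such pair lines up is a large part of the porting effort.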
WRF Profile (VampirTrace), 1 Core (PSG/IvyBridge)

 excl. time   incl. time        calls  name
   131.570s     374.428s    258215163  zolri2_
   119.127s     119.127s    454156530  psim_unstable_
   112.558s     112.558s    497228964  psih_unstable_
    93.230s      93.230s          920  module_advect_em_advect_scalar_
    89.757s      89.757s           40  module_mp_morr_two_moment_morr_two_moment_micro_
    85.464s      85.464s          400  module_advect_em_advect_scalar_pd_
    70.295s     160.052s           40  module_mp_morr_two_moment_mp_morr_two_moment_
    60.257s      60.257s          280  module_small_step_em_advance_w_
    24.522s      24.522s       501264  module_ra_rrtm_rtrn_
    23.987s     398.414s      4898677  zolri_
    22.205s      22.205s          120  module_big_step_utilities_em_horizontal_pressure_gradient_gpu_
    20.347s      20.347s     98269398  psih_stable_
    17.660s      17.660s          280  module_small_step_em_advance_uv_
    17.544s      17.544s     92211474  psim_stable_
    17.133s      18.855s        14160  module_bl_ysu_ysu2d_
    16.803s      16.803s         1200  module_em_rk_update_scalar_
    13.758s      13.758s         5120  module_big_step_utilities_em_zero_tend_gpu_
    13.608s     438.738s        14160  module_sf_sfclayrev_sfclayrev1d_
    12.828s      12.828s          280  module_small_step_em_advance_mu_t_
    12.398s      12.398s          120  module_big_step_utilities_em_calc_p_rho_phi_gpu_
    11.297s      37.330s           40  module_pbl_driver_pbl_driver_
    10.807s      10.807s          520  module_big_step_utilities_em_horizontal_diffusion_
    .....
     3.623s       3.623s          400  module_em_rk_update_scalar_pd_
     0.278s    1237.894s           40  solve_em_
   28.685ms     183.921s         1200  module_em_rk_scalar_tend_
   14.344ms      67.973s          120  module_em_rk_tendency_
    8.652ms      20.679s           40  module_first_rk_step_part2_first_rk_step_part2_
Total time ~3600s under VampirTrace
OpenACC Speedups & Slowdowns, 1 Core (PSG/IvyBridge) vs. 1 Core + 1 K40

 CPU excl.    CPU incl.       calls   GPU excl.    GPU incl.  name
  131.570s     374.428s   258215163    154.832s     435.752s  zolri2_
  119.127s     119.127s   454156530    121.169s     121.169s  psim_unstable_
  112.558s     112.558s   497228964    140.449s     140.449s  psih_unstable_
   93.230s      93.230s         920     78.497s      78.497s  module_advect_em_advect_scalar_
   89.757s      89.757s          40     28.348s      28.348s  module_mp_morr_two_moment_morr_two_moment_micro_
   85.464s      85.464s         400      3.742s       3.742s  module_advect_em_advect_scalar_pd_
   70.295s     160.052s          40     29.847s      58.195s  module_mp_morr_two_moment_mp_morr_two_moment_
   60.257s      60.257s         280      7.896s       7.896s  module_small_step_em_advance_w_
   24.522s      24.522s      501264     25.075s      25.075s  module_ra_rrtm_rtrn_
   23.987s     398.414s     4898677     27.724s     463.477s  zolri_
   22.205s      22.205s         120      5.213s       5.213s  module_big_step_utilities_em_horizontal_pressure_gradient_gpu_
   20.347s      20.347s    98269398     26.609s      26.609s  psih_stable_
   17.660s      17.660s         280      2.003s       2.003s  module_small_step_em_advance_uv_
   17.544s      17.544s    92211474     23.846s      23.846s  psim_stable_
   17.133s      18.855s       14160     17.037s      18.753s  module_bl_ysu_ysu2d_
   16.803s      16.803s        1200      2.218s       2.218s  module_em_rk_update_scalar_
   13.758s      13.758s        5120      1.313s       1.313s  module_big_step_utilities_em_zero_tend_gpu_
   13.608s     438.738s       14160     20.420s     515.050s  module_sf_sfclayrev_sfclayrev1d_
   12.828s      12.828s         280      1.282s       1.282s  module_small_step_em_advance_mu_t_
   12.398s      12.398s         120      0.230s       0.230s  module_big_step_utilities_em_calc_p_rho_phi_gpu_
   11.297s      37.330s          40     11.361s      37.275s  module_pbl_driver_pbl_driver_
   10.807s      10.807s         520      1.399s       1.399s  module_big_step_utilities_em_horizontal_diffusion_
   .....
    3.623s       3.623s         400     23.905s      23.905s  module_em_rk_update_scalar_pd_
    0.278s    1237.894s          40     82.966s    1072.641s  solve_em_
  28.685ms     183.921s        1200     11.346s      85.263s  module_em_rk_scalar_tend_
  14.344ms      67.973s         120     11.437s      48.555s  module_em_rk_tendency_
   8.652ms      20.679s          40     20.731s      28.992s  module_first_rk_step_part2_first_rk_step_part2_
Total time ~3600s under VampirTrace
Cumulative Speedups, 1 Core (PSG/IvyBridge) vs. 1 Core + 1 K40 (same measurements as the previous slide)
Total time ~3600s under VampirTrace
WRF GPU Profile (nvprof)
 Time (s)    Name
 107.7675    cuStreamSynchronize
 27.52339    rk_update_scalar_pd_1493_gpu
 23.91459    morr_two_moment_micro_3412_gpu
 21.09411    [CUDA memcpy DtoH]
 20.81395    [CUDA memcpy HtoD]
 19.31991    cuEventSynchronize
 3.999928    advance_w_1553_gpu
 3.106227    morr_two_moment_micro_1365_gpu
 2.620422    mp_morr_two_moment_799_gpu
 2.508038    spec_bdytend_gpu_2241_gpu
 2.387098    advance_w_1628_gpu
 1.445499    cuLaunchKernel
Total time ~1800 s under nvprof.
WRF GPU Profile (nvprof)
 Time (s)    Name
 107.7675    cuStreamSynchronize
 27.52339    rk_update_scalar_pd_1493_gpu  ---> <1 second
 23.91459    morr_two_moment_micro_3412_gpu
 21.09411    [CUDA memcpy DtoH]
 20.81395    [CUDA memcpy HtoD]
 19.31991    cuEventSynchronize
 3.999928    advance_w_1553_gpu
 3.106227    morr_two_moment_micro_1365_gpu
 2.620422    mp_morr_two_moment_799_gpu
 2.508038    spec_bdytend_gpu_2241_gpu
 2.387098    advance_w_1628_gpu
 1.445499    cuLaunchKernel
Total time ~1800 s under nvprof.
Loop Speedups
So far any loop we accelerate goes to near-zero time
Usually don't have to restructure the loops, just add OpenACC annotations
Will revisit this once we have a broader speedup
GPU-coordination versus CPU-Cache Striding

This is the most efficient arrangement for CPU execution:

DO j = j_start, j_end
  DO k = kts, ktf
    DO i = i_start, i_end
      mrdx=msfux(i,j)*rdx   ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j)=tendency(i,k,j)-mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO
● Fortran arrays are stored in column-major order
● Each inner iteration scans consecutive entries in the same cache line
● Each outer iteration processes a sequence of cache lines, with no repetition
For OpenMP parallelism, you will tend to split at the outer loop and let the inner loops proceed as before.
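For comparison, a sketch of what the OpenMP version of the loop above might look like (directives assumed, not taken from the WRF source): the parallel do splits the outer j loop across cores while each thread keeps the cache-friendly inner i stride.

```fortran
! Sketch only: OpenMP splits the outer loop across cores; the inner
! loops keep the original cache-friendly unit stride in i.
!$omp parallel do private(j, k, i, mrdx)
DO j = j_start, j_end
  DO k = kts, ktf
    DO i = i_start, i_end
      mrdx = msfux(i,j)*rdx
      tendency(i,k,j) = tendency(i,k,j) - mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO
!$omp end parallel do
```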
GPU-coordination versus CPU-Cache Striding
[Figure: the j-th matrix block of (i,k,j) data, with the k-stride running down each column and i-adjacency across columns. Adjacent (i,j) threads in a warp process adjacent elements of the k-th column.]
GPU-coordination versus CPU-Cache Striding

This is the most efficient arrangement for GPU execution:

!$acc parallel
!$acc loop collapse(2)
DO j = j_start, j_end
  DO i = i_start, i_end
    DO k = kts, ktf
      mrdx=msfux(i,j)*rdx   ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j)=tendency(i,k,j)-mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO
!$acc end parallel
● Iterations of the outer loops are mapped to GPU threads
● Adjacent threads in a warp process consecutive elements in the same column
● The sequential inner loop processes consecutive columns with coordinated threads
Loop/Conditional Interchange

3072 j_loop_y_flux_6 : DO j = j_start, j_end+1
3073
3074   IF( (j >= j_start_f ) .and. (j <= j_end_f) ) THEN  ! use full stencil
3075
3076     DO k=kts,ktf
3077       DO i = i_start, i_end
3078         vel = rv(i,k,j)
3079         fqy( i, k, jp1 ) = vel*flux6( &
3080                 field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
3081                 field(i,k,j  ), field(i,k,j+1), field(i,k,j+2), vel )
3082       ENDDO
3083     ENDDO
3084
3085
3086   ELSE IF ( j == jds+1 ) THEN  ! 2nd order flux next to south boundary
…..
3167 ENDDO j_loop_y_flux_6
________________________________________________________________________________________________________
3091 j_loop_y_flux_6 : DO k=kts,ktf
3092   DO i = i_start, i_end
3093     fqy_1 = 0.
3094     fqy_2 = 0.
3095     DO j = j_start, j_end+1
….
3102       IF( (j >= j_start_f ) .and. (j <= j_end_f) ) THEN  ! use full stencil
3103         vel = rv(i,k,j)
3104         fqy_1 = vel*flux6( &
3105                 field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
3106                 field(i,k,j  ), field(i,k,j+1), field(i,k,j+2), vel )
3107       ELSE IF ( j == jds+1 ) THEN  ! 2nd order flux next to south boundary
….
3142     ENDDO
…..
3146   ENDDO
3147 ENDDO j_loop_y_flux_6
Horror Loop (dyn_em/module_advect_em.F)

7452 DO j=j_start, j_end
7453   DO k=kts, ktf
7454 #ifdef XEON_SIMD
7455 !DIR$ vector always
7456 #endif
7457     DO i=i_start, i_end
7458
7459       ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
7460               - dt*( msftx(i,j)*msfty(i,j)*( &
7461                      rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
7462                      rdy*(fqyl(i,k,j+1)-fqyl(i,k,j))   ) &
7463                     +msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )
7464
7465       flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
7466                       rdx*( max(0.,fqx (i+1,k,j)) &
7467                            -min(0.,fqx (i  ,k,j)) ) &
7468                      +rdy*( max(0.,fqy (i,k,j+1)) &
7469                            -min(0.,fqy (i,k,j  )) ) ) &
7470                      +msfty(i,j)*rdzw(k)*( min(0.,fqz (i,k+1,j)) &
7471                                           -max(0.,fqz (i,k  ,j)) ) )
7472
7473       IF( flux_out .gt. ph_low ) THEN
7474
7475         scale = max(0.,ph_low/(flux_out+eps))
7476         IF( fqx (i+1,k,j) .gt. 0.) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
7477         IF( fqx (i  ,k,j) .lt. 0.) fqx(i  ,k,j) = scale*fqx(i  ,k,j)
7478         IF( fqy (i,k,j+1) .gt. 0.) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
7479         IF( fqy (i,k,j  ) .lt. 0.) fqy(i,k,j  ) = scale*fqy(i,k,j  )
7480         ! note: z flux is opposite sign in mass coordinate because
7481         ! vertical coordinate decreases with increasing k
7482         IF( fqz (i,k+1,j) .lt. 0.) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
7483         IF( fqz (i,k  ,j) .gt. 0.) fqz(i,k  ,j) = scale*fqz(i,k  ,j)
7484
7485       END IF
7487     ENDDO
7488   ENDDO
7489 ENDDO
Horror Loop (dyn_em/module_advect_em.F)

7450 !$acc kernels
7451 !$acc loop independent collapse(3) private(ph_low,flux_out)
7452 DO j=j_start, j_end
7453   DO k=kts, ktf
7454 #ifdef XEON_SIMD
7455 !DIR$ vector always
7456 #endif
7457     DO i=i_start, i_end
7458
7459       ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
7460               - dt*( msftx(i,j)*msfty(i,j)*( &
7461                      rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
7462                      rdy*(fqyl(i,k,j+1)-fqyl(i,k,j))   ) &
7463                     +msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )
7464
7465       flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
7466                       rdx*( max(0.,fqx (i+1,k,j)) &
7467                            -min(0.,fqx (i  ,k,j)) ) &
7468                      +rdy*( max(0.,fqy (i,k,j+1)) &
7469                            -min(0.,fqy (i,k,j  )) ) ) &
7470                      +msfty(i,j)*rdzw(k)*( min(0.,fqz (i,k+1,j)) &
7471                                           -max(0.,fqz (i,k  ,j)) ) )
7472
7473       IF( flux_out .gt. ph_low ) THEN
7474
7475         scale = max(0.,ph_low/(flux_out+eps))
7476         IF( fqx (i+1,k,j) .gt. 0.) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
7477         IF( fqx (i  ,k,j) .lt. 0.) fqx(i  ,k,j) = scale*fqx(i  ,k,j)
7478         IF( fqy (i,k,j+1) .gt. 0.) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
7479         IF( fqy (i,k,j  ) .lt. 0.) fqy(i,k,j  ) = scale*fqy(i,k,j  )
7480         ! note: z flux is opposite sign in mass coordinate because
7481         ! vertical coordinate decreases with increasing k
7482         IF( fqz (i,k+1,j) .lt. 0.) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
7483         IF( fqz (i,k  ,j) .gt. 0.) fqz(i,k  ,j) = scale*fqz(i,k  ,j)
7484
7485       END IF
7487     ENDDO
7488   ENDDO
7489 ENDDO
7490 !$acc end kernels
Mirroring the Record-Structure
TYPE( domain ), POINTER :: grid
real,DIMENSION(grid%sm31:grid%em31,grid%sm33:grid%em33) :: msfux
ALLOCATE(grid%msfux(sm31:em31,sm33:em33),STAT=ierr)
real ,DIMENSION(:,:) ,POINTER :: msfux_pt
CALL alloc_gpu_extra2d(grid%msfux,msfux_pt)
___________________________________________________________________
SUBROUTINE alloc_gpu_extra1d(grid_pt,pt)
real ,DIMENSION(:) ,POINTER :: pt
real ,DIMENSION(:) ,POINTER :: grid_pt
!$acc enter data create(grid_pt)
pt => grid_pt
END SUBROUTINE alloc_gpu_extra1d
OpenACC – Original Intent
Well-bracketed regions
Position data on GPU, process it, save result & discard the rest:
!$acc data create ( …. ) &
!$acc copy (…..)
….
call subroutine
…
!$acc end data
Finding instead that we're swapping arrays between Host ↔ GPU memory at random points
Partial Port – GPU Data Placement
Need to reduce amount of data movement between Host ↔ GPU
Requires porting more code, so it can operate on GPU-resident data
Explicit swaps have to be used when moving from GPU-code to host-code.
!$acc update device(....)
!$acc update host(...)
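A sketch of the swap pattern (the routine name is hypothetical): when control passes from ported to unported code, the host copy must be refreshed first, and the device copy refreshed on the way back.

```fortran
! Hypothetical sketch of crossing the GPU/host boundary mid-solver.
!$acc update host(tendency)         ! refresh host copy before host-only code
call legacy_host_routine(tendency)  ! not yet ported; runs on the host
!$acc update device(tendency)       ! push the result back before GPU loops resume
```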
GPU Data Placement (WRF)
wrf_run
-> integrate
-> solve_interface
-> solve_em (DATA PLACEMENT HERE)
-> microphysics_driver -> mp_gt_driver
-> first_rk_step_part1
-> first_rk_step_part2
-> rk_scalar_tend
-> advect_scalar_pd
Mixing OpenACC & CPU Code
In some cases it may be clearer to interleave the two in the same file:
#ifdef _OPENACC
! GPU-optimized loops
#else
! CPU-optimized loops
#endif
In other cases it may be clearer to duplicate the file and use the build-process to select them:
module_advect_em.F
module_advect_em.OpenACC.F
You can view the changes side-by-side with sdiff.
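A minimal runnable illustration of the side-by-side view (the file contents here are made up; in practice you would diff the two module files named above):

```shell
# Create two toy variants of a loop and compare them side by side with sdiff.
printf 'DO k=kts,ktf\nENDDO\n' > cpu.F
printf '!$acc loop\nDO k=kts,ktf\nENDDO\n' > acc.F
sdiff cpu.F acc.F || true   # sdiff exits nonzero when the files differ
```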