Progress Porting WRF to GPU using OpenACC

Carl Ponder, Ph.D. (HPC Applications Performance, NVIDIA Developer Technology)
Alexey Romanenko and Alexey Snytnikov (NVIDIA OpenSource Applications)
WRF Porting Plan
WRF developers interested in GPU port if performance is suitable.
OpenACC extensions are acceptable because:
● This is an open standard, unlike CUDA Fortran
● Minimal changes, like OpenMP, unlike OpenCL
Changes to WRF modules will need to be negotiated with developers in order to provide support.
Tradeoff is extent of changes versus performance gain.
Status So Far
OpenACC, minimal rewrite
Updated our changes to the WRF 3.6.1 (latest) bugfix release
Physics Models:
Thompson: 2x
Morrison: 4.5x
Kessler: 2x
Speedups for 1 proc, 1 core versus 1 core + 1 GPU
Measured in isolation from the rest of WRF
Still working on the Dynamics to get end-to-end
Need scaling to more cores sharing GPU
WRF Parallel Performance Issues
WRF has a flattish profile. By Amdahl's Law, speeding up any particular loop doesn't help much.
Loops tend NOT to re-process data, so moving arrays host ↔ GPU costs more than parallel processing would save.
Overall speedup requires speedup of many loops.
Also requires keeping data resident on GPU for the bulk of the computation.
Similar issue with HYCOM and other codes.
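The flat-profile point can be made concrete with Amdahl's law (the numbers below are illustrative, not WRF measurements):

```latex
% Overall speedup S from accelerating a fraction p of the runtime by factor s:
S = \frac{1}{(1-p) + p/s}
% Hypothetical flat-profile example: one loop is p = 0.10 of the runtime
% and is accelerated by s = 20:
S = \frac{1}{0.90 + 0.10/20} = \frac{1}{0.905} \approx 1.10
% i.e. a 20x kernel speedup yields only ~10% overall gain.
```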
WRF Re-Coding Issues
Large code, almost 1 million lines
Slow compiles – 30 minutes on PGI, 8 hours on Cray
Fortran with modules
Subroutines with thousands of lines
Makes manual analysis difficult
Matching data update and present directives
Matching data & kernel/parallel directives with compound statements
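The matching problem above can be illustrated with a minimal sketch (array and routine names are hypothetical): a `present` clause on a compute construct is only valid if some enclosing `data` construct, possibly in a different routine or module, actually placed those arrays on the device.

```fortran
! Hypothetical sketch: "present" on the compute construct must be matched
! by an enclosing "data" region that created/copied the arrays to the GPU.
!$acc data copyin(u, ru) copy(tendency)
  call compute_tendency(u, ru, tendency)   ! callee may live in another module
!$acc end data

! ... inside compute_tendency:
!$acc parallel present(u, ru, tendency)
!$acc loop
! ... loops over the arrays ...
!$acc end parallel
```

In a million-line code, verifying by hand that every such pair lines up is a large part of the porting effort.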
WRF Profile (VampirTrace), 1 Core (PSG/IvyBridge)

 excl. time   incl. time        calls  name
   131.570s     374.428s    258215163  zolri2_
   119.127s     119.127s    454156530  psim_unstable_
   112.558s     112.558s    497228964  psih_unstable_
    93.230s      93.230s          920  module_advect_em_advect_scalar_
    89.757s      89.757s           40  module_mp_morr_two_moment_morr_two_moment_micro_
    85.464s      85.464s          400  module_advect_em_advect_scalar_pd_
    70.295s     160.052s           40  module_mp_morr_two_moment_mp_morr_two_moment_
    60.257s      60.257s          280  module_small_step_em_advance_w_
    24.522s      24.522s       501264  module_ra_rrtm_rtrn_
    23.987s     398.414s      4898677  zolri_
    22.205s      22.205s          120  module_big_step_utilities_em_horizontal_pressure_gradient_gpu_
    20.347s      20.347s     98269398  psih_stable_
    17.660s      17.660s          280  module_small_step_em_advance_uv_
    17.544s      17.544s     92211474  psim_stable_
    17.133s      18.855s        14160  module_bl_ysu_ysu2d_
    16.803s      16.803s         1200  module_em_rk_update_scalar_
    13.758s      13.758s         5120  module_big_step_utilities_em_zero_tend_gpu_
    13.608s     438.738s        14160  module_sf_sfclayrev_sfclayrev1d_
    12.828s      12.828s          280  module_small_step_em_advance_mu_t_
    12.398s      12.398s          120  module_big_step_utilities_em_calc_p_rho_phi_gpu_
    11.297s      37.330s           40  module_pbl_driver_pbl_driver_
    10.807s      10.807s          520  module_big_step_utilities_em_horizontal_diffusion_
    .....
     3.623s       3.623s          400  module_em_rk_update_scalar_pd_
     0.278s    1237.894s           40  solve_em_
   28.685ms     183.921s         1200  module_em_rk_scalar_tend_
   14.344ms      67.973s          120  module_em_rk_tendency_
    8.652ms      20.679s           40  module_first_rk_step_part2_first_rk_step_part2_
Total time ~3600s under VampirTrace
OpenACC Speedups & Slowdowns, 1 Core (PSG/IvyBridge) vs. 1 Core + 1 K40

 CPU excl.    CPU incl.       calls   GPU excl.    GPU incl.  name
  131.570s     374.428s   258215163    154.832s     435.752s  zolri2_
  119.127s     119.127s   454156530    121.169s     121.169s  psim_unstable_
  112.558s     112.558s   497228964    140.449s     140.449s  psih_unstable_
   93.230s      93.230s         920     78.497s      78.497s  module_advect_em_advect_scalar_
   89.757s      89.757s          40     28.348s      28.348s  module_mp_morr_two_moment_morr_two_moment_micro_
   85.464s      85.464s         400      3.742s       3.742s  module_advect_em_advect_scalar_pd_
   70.295s     160.052s          40     29.847s      58.195s  module_mp_morr_two_moment_mp_morr_two_moment_
   60.257s      60.257s         280      7.896s       7.896s  module_small_step_em_advance_w_
   24.522s      24.522s      501264     25.075s      25.075s  module_ra_rrtm_rtrn_
   23.987s     398.414s     4898677     27.724s     463.477s  zolri_
   22.205s      22.205s         120      5.213s       5.213s  module_big_step_utilities_em_horizontal_pressure_gradient_gpu_
   20.347s      20.347s    98269398     26.609s      26.609s  psih_stable_
   17.660s      17.660s         280      2.003s       2.003s  module_small_step_em_advance_uv_
   17.544s      17.544s    92211474     23.846s      23.846s  psim_stable_
   17.133s      18.855s       14160     17.037s      18.753s  module_bl_ysu_ysu2d_
   16.803s      16.803s        1200      2.218s       2.218s  module_em_rk_update_scalar_
   13.758s      13.758s        5120      1.313s       1.313s  module_big_step_utilities_em_zero_tend_gpu_
   13.608s     438.738s       14160     20.420s     515.050s  module_sf_sfclayrev_sfclayrev1d_
   12.828s      12.828s         280      1.282s       1.282s  module_small_step_em_advance_mu_t_
   12.398s      12.398s         120      0.230s       0.230s  module_big_step_utilities_em_calc_p_rho_phi_gpu_
   11.297s      37.330s          40     11.361s      37.275s  module_pbl_driver_pbl_driver_
   10.807s      10.807s         520      1.399s       1.399s  module_big_step_utilities_em_horizontal_diffusion_
   .....
    3.623s       3.623s         400     23.905s      23.905s  module_em_rk_update_scalar_pd_
    0.278s    1237.894s          40     82.966s    1072.641s  solve_em_
  28.685ms     183.921s        1200     11.346s      85.263s  module_em_rk_scalar_tend_
  14.344ms      67.973s         120     11.437s      48.555s  module_em_rk_tendency_
   8.652ms      20.679s          40     20.731s      28.992s  module_first_rk_step_part2_first_rk_step_part2_
Total time ~3600s under VampirTrace
Cumulative Speedups, 1 Core (PSG/IvyBridge) vs. 1 Core + 1 K40 (same measurements as the previous slide)
Total time ~3600s under VampirTrace
WRF GPU Profile (nvprof)
 Time (s)    Name
 107.7675    cuStreamSynchronize
 27.52339    rk_update_scalar_pd_1493_gpu
 23.91459    morr_two_moment_micro_3412_gpu
 21.09411    [CUDA memcpy DtoH]
 20.81395    [CUDA memcpy HtoD]
 19.31991    cuEventSynchronize
 3.999928    advance_w_1553_gpu
 3.106227    morr_two_moment_micro_1365_gpu
 2.620422    mp_morr_two_moment_799_gpu
 2.508038    spec_bdytend_gpu_2241_gpu
 2.387098    advance_w_1628_gpu
 1.445499    cuLaunchKernel
Total time ~1800 s under nvprof.
WRF GPU Profile (nvprof)
 Time (s)    Name
 107.7675    cuStreamSynchronize
 27.52339    rk_update_scalar_pd_1493_gpu  ---> <1 second
 23.91459    morr_two_moment_micro_3412_gpu
 21.09411    [CUDA memcpy DtoH]
 20.81395    [CUDA memcpy HtoD]
 19.31991    cuEventSynchronize
 3.999928    advance_w_1553_gpu
 3.106227    morr_two_moment_micro_1365_gpu
 2.620422    mp_morr_two_moment_799_gpu
 2.508038    spec_bdytend_gpu_2241_gpu
 2.387098    advance_w_1628_gpu
 1.445499    cuLaunchKernel
Total time ~1800 s under nvprof.
Loop Speedups
So far any loop we accelerate goes to near-zero time
Usually don't have to restructure the loops, just add OpenACC annotations
Will revisit this once we have a broader speedup
GPU-coordination versus CPU-Cache Striding

This is the most efficient arrangement for CPU execution:

DO j = j_start, j_end
  DO k = kts, ktf
    DO i = i_start, i_end
      mrdx=msfux(i,j)*rdx   ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j)=tendency(i,k,j)-mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO
● Fortran arrays are stored in column-major order
● Each inner iteration scans consecutive entries in the same cache line
● Each outer iteration processes a sequence of cache lines, with no repetition
For OpenMP parallelism, you will tend to split at the outer loop and let the inner loops proceed as before.
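For comparison, a sketch of what the OpenMP version of the loop above might look like (directives assumed, not taken from the WRF source): the parallel do splits the outer j loop across cores while each thread keeps the cache-friendly inner i stride.

```fortran
! Sketch only: OpenMP splits the outer loop across cores; the inner
! loops keep the original cache-friendly unit stride in i.
!$omp parallel do private(j, k, i, mrdx)
DO j = j_start, j_end
  DO k = kts, ktf
    DO i = i_start, i_end
      mrdx = msfux(i,j)*rdx
      tendency(i,k,j) = tendency(i,k,j) - mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO
!$omp end parallel do
```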
GPU-coordination versus CPU-Cache Striding
[Figure: the j-th matrix block of (i,k,j) data, with the k-stride running down each column and i-adjacency across columns. Adjacent (i,j) threads in a warp process adjacent elements of the k-th column.]
GPU-coordination versus CPU-Cache Striding

This is the most efficient arrangement for GPU execution:

!$acc parallel
!$acc loop collapse(2)
DO j = j_start, j_end
  DO i = i_start, i_end
    DO k = kts, ktf
      mrdx=msfux(i,j)*rdx   ! ADT eqn 44, 1st term on RHS
      tendency(i,k,j)=tendency(i,k,j)-mrdx*0.25 &
          *((ru(i+1,k,j)+ru(i,k,j))*(u(i+1,k,j)+u(i,k,j)) &
           -(ru(i,k,j)+ru(i-1,k,j))*(u(i,k,j)+u(i-1,k,j)))
    ENDDO
  ENDDO
ENDDO
!$acc end parallel
● Iterations of the outer loops are mapped to GPU threads
● Adjacent threads in a warp process consecutive elements in the same column
● The sequential inner loop processes consecutive columns with coordinated threads
Loop/Conditional Interchange

3072 j_loop_y_flux_6 : DO j = j_start, j_end+1
3073
3074   IF( (j >= j_start_f ) .and. (j <= j_end_f) ) THEN  ! use full stencil
3075
3076     DO k=kts,ktf
3077       DO i = i_start, i_end
3078         vel = rv(i,k,j)
3079         fqy( i, k, jp1 ) = vel*flux6( &
3080                 field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
3081                 field(i,k,j  ), field(i,k,j+1), field(i,k,j+2), vel )
3082       ENDDO
3083     ENDDO
3084
3085
3086   ELSE IF ( j == jds+1 ) THEN  ! 2nd order flux next to south boundary
…..
3167 ENDDO j_loop_y_flux_6
________________________________________________________________________________________________________
3091 j_loop_y_flux_6 : DO k=kts,ktf
3092   DO i = i_start, i_end
3093     fqy_1 = 0.
3094     fqy_2 = 0.
3095     DO j = j_start, j_end+1
….
3102       IF( (j >= j_start_f ) .and. (j <= j_end_f) ) THEN  ! use full stencil
3103         vel = rv(i,k,j)
3104         fqy_1 = vel*flux6( &
3105                 field(i,k,j-3), field(i,k,j-2), field(i,k,j-1), &
3106                 field(i,k,j  ), field(i,k,j+1), field(i,k,j+2), vel )
3107       ELSE IF ( j == jds+1 ) THEN  ! 2nd order flux next to south boundary
….
3142     ENDDO
…..
3146   ENDDO
3147 ENDDO j_loop_y_flux_6
Horror Loop (dyn_em/module_advect_em.F)

7452 DO j=j_start, j_end
7453   DO k=kts, ktf
7454 #ifdef XEON_SIMD
7455 !DIR$ vector always
7456 #endif
7457     DO i=i_start, i_end
7458
7459       ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
7460               - dt*( msftx(i,j)*msfty(i,j)*( &
7461                      rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
7462                      rdy*(fqyl(i,k,j+1)-fqyl(i,k,j))   ) &
7463                     +msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )
7464
7465       flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
7466                       rdx*( max(0.,fqx (i+1,k,j)) &
7467                            -min(0.,fqx (i  ,k,j)) ) &
7468                      +rdy*( max(0.,fqy (i,k,j+1)) &
7469                            -min(0.,fqy (i,k,j  )) ) ) &
7470                      +msfty(i,j)*rdzw(k)*( min(0.,fqz (i,k+1,j)) &
7471                                           -max(0.,fqz (i,k  ,j)) ) )
7472
7473       IF( flux_out .gt. ph_low ) THEN
7474
7475         scale = max(0.,ph_low/(flux_out+eps))
7476         IF( fqx (i+1,k,j) .gt. 0.) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
7477         IF( fqx (i  ,k,j) .lt. 0.) fqx(i  ,k,j) = scale*fqx(i  ,k,j)
7478         IF( fqy (i,k,j+1) .gt. 0.) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
7479         IF( fqy (i,k,j  ) .lt. 0.) fqy(i,k,j  ) = scale*fqy(i,k,j  )
7480         ! note: z flux is opposite sign in mass coordinate because
7481         ! vertical coordinate decreases with increasing k
7482         IF( fqz (i,k+1,j) .lt. 0.) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
7483         IF( fqz (i,k  ,j) .gt. 0.) fqz(i,k  ,j) = scale*fqz(i,k  ,j)
7484
7485       END IF
7487     ENDDO
7488   ENDDO
7489 ENDDO
Horror Loop (dyn_em/module_advect_em.F)

7450 !$acc kernels
7451 !$acc loop independent collapse(3) private(ph_low,flux_out)
7452 DO j=j_start, j_end
7453   DO k=kts, ktf
7454 #ifdef XEON_SIMD
7455 !DIR$ vector always
7456 #endif
7457     DO i=i_start, i_end
7458
7459       ph_low = (mub(i,j)+mu_old(i,j))*field_old(i,k,j) &
7460               - dt*( msftx(i,j)*msfty(i,j)*( &
7461                      rdx*(fqxl(i+1,k,j)-fqxl(i,k,j)) + &
7462                      rdy*(fqyl(i,k,j+1)-fqyl(i,k,j))   ) &
7463                     +msfty(i,j)*rdzw(k)*(fqzl(i,k+1,j)-fqzl(i,k,j)) )
7464
7465       flux_out = dt*( (msftx(i,j)*msfty(i,j))*( &
7466                       rdx*( max(0.,fqx (i+1,k,j)) &
7467                            -min(0.,fqx (i  ,k,j)) ) &
7468                      +rdy*( max(0.,fqy (i,k,j+1)) &
7469                            -min(0.,fqy (i,k,j  )) ) ) &
7470                      +msfty(i,j)*rdzw(k)*( min(0.,fqz (i,k+1,j)) &
7471                                           -max(0.,fqz (i,k  ,j)) ) )
7472
7473       IF( flux_out .gt. ph_low ) THEN
7474
7475         scale = max(0.,ph_low/(flux_out+eps))
7476         IF( fqx (i+1,k,j) .gt. 0.) fqx(i+1,k,j) = scale*fqx(i+1,k,j)
7477         IF( fqx (i  ,k,j) .lt. 0.) fqx(i  ,k,j) = scale*fqx(i  ,k,j)
7478         IF( fqy (i,k,j+1) .gt. 0.) fqy(i,k,j+1) = scale*fqy(i,k,j+1)
7479         IF( fqy (i,k,j  ) .lt. 0.) fqy(i,k,j  ) = scale*fqy(i,k,j  )
7480         ! note: z flux is opposite sign in mass coordinate because
7481         ! vertical coordinate decreases with increasing k
7482         IF( fqz (i,k+1,j) .lt. 0.) fqz(i,k+1,j) = scale*fqz(i,k+1,j)
7483         IF( fqz (i,k  ,j) .gt. 0.) fqz(i,k  ,j) = scale*fqz(i,k  ,j)
7484
7485       END IF
7487     ENDDO
7488   ENDDO
7489 ENDDO
7490 !$acc end kernels
Mirroring the Record-Structure
TYPE( domain ), POINTER :: grid
real,DIMENSION(grid%sm31:grid%em31,grid%sm33:grid%em33) :: msfux
ALLOCATE(grid%msfux(sm31:em31,sm33:em33),STAT=ierr)
real ,DIMENSION(:,:) ,POINTER :: msfux_pt
CALL alloc_gpu_extra2d(grid%msfux,msfux_pt)
___________________________________________________________________
SUBROUTINE alloc_gpu_extra1d(grid_pt,pt)
real ,DIMENSION(:) ,POINTER :: pt
real ,DIMENSION(:) ,POINTER :: grid_pt
!$acc enter data create(grid_pt)
pt => grid_pt
END SUBROUTINE alloc_gpu_extra1d
OpenACC – Original Intent
Well-bracketed regions
Position data on GPU, process it, save result & discard the rest:
!$acc data create ( …. ) &
!$acc copy (…..)
….
call subroutine
…
!$acc end data
Finding instead that we're swapping arrays between Host ↔ GPU memory at random points
Partial Port – GPU Data Placement
Need to reduce amount of data movement between Host ↔ GPU
Requires porting more code, so it can operate on GPU-resident data
Explicit swaps have to be used when moving from GPU-code to host-code.
!$acc update device(....)
!$acc update host(...)
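A sketch of the swap pattern (the routine name is hypothetical): when control passes from ported to unported code, the host copy must be refreshed first, and the device copy refreshed on the way back.

```fortran
! Hypothetical sketch of crossing the GPU/host boundary mid-solver.
!$acc update host(tendency)         ! refresh host copy before host-only code
call legacy_host_routine(tendency)  ! not yet ported; runs on the host
!$acc update device(tendency)       ! push the result back before GPU loops resume
```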
GPU Data Placement (WRF)
wrf_run
-> integrate
-> solve_interface
-> solve_em (DATA PLACEMENT HERE)
-> microphysics_driver -> mp_gt_driver
-> first_rk_step_part1
-> first_rk_step_part2
-> rk_scalar_tend
-> advect_scalar_pd
Mixing OpenACC & CPU Code
In some cases it may be clearer to interleave the two in the same file:
#ifdef _OPENACC
! GPU-optimized loops
#else
! CPU-optimized loops
#endif
In other cases it may be clearer to duplicate the file and use the build-process to select them:
module_advect_em.F
module_advect_em.OpenACC.F
You can view the changes side-by-side with sdiff.
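A minimal runnable illustration of the side-by-side view (the file contents here are made up; in practice you would diff the two module files named above):

```shell
# Create two toy variants of a loop and compare them side by side with sdiff.
printf 'DO k=kts,ktf\nENDDO\n' > cpu.F
printf '!$acc loop\nDO k=kts,ktf\nENDDO\n' > acc.F
sdiff cpu.F acc.F || true   # sdiff exits nonzero when the files differ
```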