Parallel Execution of WRF on GPU: Performance Comparison
Space Science and Engineering Center, University of Wisconsin–Madison
19th International TOVS Study Conference (ITSC-XIX), Jeju Island, South Korea, 26 March–1 April 2014

Code Validation
- Fused multiply-add was turned off (--fmad=false)
- The GNU C math library was used on the GPU, i.e. powf(), expf(), sqrt(), and logf() were replaced by routines from the GNU C library -> bit-exact output compared to the gfortran-compiled CPU code
- Small output differences with fast math enabled (--use_fast_math); shown in the figure below, followed by a sketch of this kind of check
Figure: Potential temperature [K] difference between CPU and GPU outputs.
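A minimal sketch of this kind of bit-exactness check (a hypothetical harness, not the poster's actual validation code): it evaluates the same single-precision expression on CPU and GPU and counts bitwise mismatches. On the poster's setup, swapping the CUDA math intrinsics for GNU C library routines and compiling with --fmad=false is what drove the mismatch count to zero.

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cuda_runtime.h>

// Compile without fused multiply-add, as on the poster:
//   nvcc --fmad=false validate.cu -o validate
__global__ void eval_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // powf/expf/logf here are the CUDA device intrinsics; the
        // poster replaced them with GNU C library routines to obtain
        // bit-exact agreement with the gfortran/glibc CPU build.
        out[i] = powf(in[i], 2.0f) + expf(in[i]) - logf(in[i] + 1.0f);
    }
}

int main() {
    const int n = 1024;
    float h_in[n], h_out[n], h_ref[n];
    for (int i = 0; i < n; ++i) h_in[i] = 0.001f * (i + 1);

    // CPU reference using glibc math (what gfortran links against).
    for (int i = 0; i < n; ++i)
        h_ref[i] = powf(h_in[i], 2.0f) + expf(h_in[i]) - logf(h_in[i] + 1.0f);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    eval_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Bitwise comparison, in the spirit of the poster's methodology.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (memcmp(&h_out[i], &h_ref[i], sizeof(float)) != 0) ++mismatches;
    printf("mismatching values: %d of %d\n", mismatches, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}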
Performance Profile of WRF
Figure: January 2000 30-km workload profiling by John Michalakes, "Code restructuring to improve performance in WRF model physics on Intel Xeon Phi", Workshop on Programming Weather, Climate, and Earth-System Models on Heterogeneous Multi-core Platforms, September 20, 2013.
Conclusions
• Great interest in the community in accelerators
• Continuing work on accelerating other WRF modules using CUDA C (~20 modules finished)
• Lessons learned during the CUDA C implementation of WSM5 will be applied to OpenACC 2.0/OpenMP 4.0 optimization of WRF modules for GPU/Intel MIC
WRF System Components
The WRF physics components are microphysics, cumulus parametrization, planetary boundary layer (PBL), land-surface model, and shortwave/longwave radiation (Jimy Dudhia: WRF physics options).
Introduction
• WRF is a mesoscale and global Weather Research and Forecasting model
• Designed for both operational forecasters and atmospheric researchers
• WRF is currently in operational use at numerous weather centers around the world
• WRF is suitable for a broad spectrum of applications, across domain scales ranging from meters to hundreds of kilometers
• Increases in computational power enable:
  - Increased vertical as well as horizontal resolution
  - More timely delivery of forecasts
  - Probabilistic forecasts based on ensemble methods
• Why accelerators?
  - Cost performance
  - Need for strong scaling
GPU Speedups

WRF module name                               Speedup
Single moment 6-class microphysics            500x
Eta microphysics                              272x
Purdue Lin microphysics                       692x
Stony Brook University 5-class microphysics   896x
Betts-Miller-Janjic convection                105x
Kessler microphysics                          816x
New Goddard shortwave radiation               134x
Single moment 3-class microphysics            331x
New Thompson microphysics                     153x
Double moment 6-class microphysics            206x
Dudhia shortwave radiation                    409x
Goddard microphysics                          1311x
Double moment 5-class microphysics            206x
Total Energy Mass Flux surface layer          214x
Mellor-Yamada Nakanishi Niino surface layer   113x
Single moment 5-class microphysics            350x
Pleim-Xiu surface layer                       665x
Parallel Execution of WRF on GPU

CPUs focus on per-core performance:
 - Sandy Bridge: 4 cores, 115.2 Gflops, 90 Watts (~1.4 Gflops/Watt), memory bandwidth: 34.1 GB/s
GPUs focus on parallel execution:
 - Tesla K40: 2,880 cores, 4290 Gflops (peak SP), 1430 Gflops (peak DP), 235 Watts (~18.3 Gflops/Watt), memory bandwidth: 288 GB/s

• GPUs have more cores than CPUs
• GPUs are optimized for data-parallel work
• A GPU needs thousands of threads for full efficiency (see the kernel sketch below)
• GPUs have a higher floating-point capability than CPUs
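As a minimal sketch of how WRF column physics is typically mapped onto those thousands of threads (assumed data layout and placeholder arithmetic, not the poster's actual WSM5 code): one thread per horizontal (i, j) grid point, each sweeping its own vertical column of k levels, so even a modest domain exposes tens of thousands of independent threads.

#include <stdio.h>
#include <cuda_runtime.h>

// One thread per horizontal grid point; each thread loops over its
// vertical column, which is the usual decomposition for WRF physics.
__global__ void column_physics(float *t, const float *q,
                               int ni, int nj, int nk) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= ni || j >= nj) return;
    for (int k = 0; k < nk; ++k) {
        int idx = (k * nj + j) * ni + i;   // i-fastest layout (assumption)
        t[idx] += 0.1f * q[idx];           // placeholder column update
    }
}

int main() {
    const int ni = 256, nj = 256, nk = 35;  // illustrative domain size
    size_t bytes = (size_t)ni * nj * nk * sizeof(float);
    float *d_t, *d_q;
    cudaMalloc(&d_t, bytes);
    cudaMalloc(&d_q, bytes);
    cudaMemset(d_t, 0, bytes);
    cudaMemset(d_q, 0, bytes);

    dim3 block(16, 16);                     // 256 threads per block
    dim3 grid((ni + block.x - 1) / block.x,
              (nj + block.y - 1) / block.y);
    column_physics<<<grid, block>>>(d_t, d_q, ni, nj, nk);
    cudaDeviceSynchronize();
    printf("launched %d x %d blocks of %d threads\n",
           grid.x, grid.y, block.x * block.y);

    cudaFree(d_t);
    cudaFree(d_q);
    return 0;
}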
Speedups are measured against single-threaded, non-vectorized CPU code compiled with gfortran 4.4.6.
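Speedups of this kind are computed as (CPU wall time) / (GPU wall time) for the same module. A minimal CUDA timing sketch (hypothetical harness with a stand-in kernel, not the poster's benchmark code) might look like:

#include <stdio.h>
#include <chrono>
#include <cuda_runtime.h>

// SAXPY stands in for a physics module; CUDA events time the kernel.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = new float[n], *y = new float[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // CPU baseline: single-threaded loop, analogous to the poster's
    // single-threaded, non-vectorized gfortran build.
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i) y[i] = 2.0f * x[i] + y[i];
    auto t1 = std::chrono::high_resolution_clock::now();
    double cpu_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    saxpy<<<(n + 255) / 256, 256>>>(2.0f, dx, dy, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    printf("CPU: %.3f ms, GPU: %.3f ms, speedup: %.1fx\n",
           cpu_ms, gpu_ms, cpu_ms / gpu_ms);

    cudaFree(dx); cudaFree(dy);
    delete[] x; delete[] y;
    return 0;
}

Note that this times the kernel only; whether host-device data transfers are included changes the reported speedup considerably.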
Figure: WRF model integration procedure.