
Page 1: Multi-core Acceleration of NWP

Multi-core Acceleration of NWP

John Michalakes, NCAR
John Linford, Virginia Tech
Manish Vachharajani, University of Colorado
Adrian Sandu, Virginia Tech

HPC Users Forum, September 10, 2009

Page 2: Multi-core Acceleration of NWP

Outline

• WRF and multi-core overview
• Cost breakdown and kernel repository
• Two cases
• Path forward

Page 3: Multi-core Acceleration of NWP

WRF Overview

• Large collaborative effort to develop a community weather model
  – 10,000+ registered users
  – Applications:
    • Numerical weather prediction
    • High-resolution climate
    • Air quality research/prediction
    • Wildfire
    • Atmospheric research
• Software designed for HPC
  – Ported to and in use on virtually all types of system in the Top500
  – 2007 Gordon Bell finalist
• Why acceleration?
  – Exploit fine-grained parallelism
  – Cost performance ($ and e-)
  – Need for strong scaling

[Figure: 5-day global WRF forecast at 20 km horizontal resolution]

Page 4: Multi-core Acceleration of NWP

(Repeats the WRF Overview bullets from the previous slide, with the forecast visualization; image courtesy Peter Johnsen, Cray)

Page 5: Multi-core Acceleration of NWP

Multi-/Many-core

• Graphics Processing Units
  – NVIDIA GTX280, AMD
  – High-end versions of commodity graphics cards
  – O(100) physical SIMD cores supporting O(1000)-way concurrent threads
  – Separate co-processor to host CPU, PCIe connection
  – Large register files, fast (but very small) shared memory
  – Programmed using special-purpose threading languages: CUDA, OpenCL
  – Higher-level language support in development (e.g. PGI 9)
• "Traditional" multi-core
  – Xeon 5500, Opteron Istanbul, Power6/7
  – Much improved memory bandwidth (5x on STREAM*)
  – Hyperthreading/SMT
  – Includes heterogeneity in the form of SIMD units and instructions
  – x86 instruction set; native C, Fortran, OpenMP, ...
• Cell Broadband Engine
  – PowerXCell 8i
  – PowerPC with 8 co-processors on a chip
  – No shared memory but relatively large local stores per core
  – Cores separately programmed; all computation and data movement programmer-controlled

*http://www.advancedclustering.com/company-blog/stream-benchmarking.html

Page 6: Multi-core Acceleration of NWP

WRF Cost Breakdown and Kernel Repository

[Pie chart: percentages of total run time from a single-processor profile: microphysics, radiation, planetary boundary layer, cumulus, TKE, surface processes, dynamics and other]

www.mmm.ucar.edu/wrf/WG2/GPU

Page 7: Multi-core Acceleration of NWP

WSM5 Microphysics

• WRF Single Moment 5-Tracer (WSM5)* scheme
• Represents condensation, precipitation, and thermodynamic effects of latent heat release
• Operates independently up each column of the 3D WRF domain (see the sketch below)
• Expensive and relatively computationally intense (~2 ops per word)

*Hong, S., J. Dudhia, and S. Chen (2004). Monthly Weather Review, 132(1):103-120.
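
A minimal CUDA sketch of this column-independent structure, not the actual WRF code: the kernel name (column_physics), the field names (t, qv), and the physics update are placeholders. One GPU thread owns one (i, j) column and sweeps the k levels serially, which is the general pattern for mapping column physics such as WSM5 onto a GPU.

#include <cuda_runtime.h>
#include <cstdio>

// One thread per (i, j) column; each thread loops over the k levels.
__global__ void column_physics(float *t, float *qv, int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column index in x
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // column index in y
    if (i >= nx || j >= ny) return;

    for (int k = 0; k < nz; ++k) {
        int idx = (k * ny + j) * nx + i;             // i fastest: coalesced access
        // Placeholder update standing in for condensation/precipitation physics.
        qv[idx] = fmaxf(0.0f, qv[idx] - 1.0e-6f * t[idx]);
    }
}

int main()
{
    const int nx = 64, ny = 64, nz = 27;             // illustrative domain size
    const size_t n = (size_t)nx * ny * nz;
    float *t, *qv;
    cudaMallocManaged((void **)&t,  n * sizeof(float));
    cudaMallocManaged((void **)&qv, n * sizeof(float));
    for (size_t p = 0; p < n; ++p) { t[p] = 280.0f; qv[p] = 1.0e-3f; }

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    column_physics<<<grid, block>>>(t, qv, nx, ny, nz);
    cudaDeviceSynchronize();
    printf("qv[0] = %g\n", qv[0]);
    cudaFree(t);
    cudaFree(qv);
    return 0;
}

Keeping i fastest in memory means adjacent threads touch adjacent addresses, so device memory accesses coalesce; the serial k loop stays inside the thread because the column is where the data dependencies live.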

Page 8: Multi-core Acceleration of NWP

WSM5 Microphysics

contributed by Roman Dubtsov, Intel

Page 9: Multi-core Acceleration of NWP

WSM5 Microphysics

• CUDA version distributed with WRFV3
• Users have seen 1.2-1.3x improvement
  – Case/system dependent
  – Makes other parts of the code run faster (!)
• PGI has implemented it with 9.0 accelerator directives and seen comparable speedups and overheads from transfer cost

[Chart: WRF CONUS 12 km benchmark, total seconds and microphysics seconds; courtesy Brent Leback and Craig Toepfer, PGI]

Page 10: Multi-core Acceleration of NWP

Kernel: WRF-Chem

• WRF model coupled to atmospheric chemistry for air quality research and air pollution forecasting*
  – Time evolution and advection of tens to hundreds of chemical species being produced and consumed at varying rates in networks of reactions
  – Many times the cost of the core meteorology; seemingly ideal for acceleration

*Grell et al., WRF Chem Version 3.0 User's Guide, http://ruc.fsl.noaa.gov/wrf/WG11
**Hairer, E. and G. Wanner. Solving ODEs II: Stiff and Differential-Algebraic Problems, Springer, 1996.
***Damian et al. (2002). Computers & Chemical Engineering 26, 1567-1579.

Page 11: Multi-core Acceleration of NWP

Kernel: WRF-Chem

• Rosenbrock** solver for stiff system of ODEs at each cell
  – Computation at each cell is independent: perfectly parallel
  – Solver itself is not parallelizable
  – 600K floating-point ops per cell (compare to 5K ops/cell for meteorology)
  – 1 million load-stores per cell
  – Very large footprint: 15 KB of state at each cell

Page 12: Multi-core Acceleration of NWP

Kernel: WRF-Chem

• KPP generates a Fortran solver that is called at each grid cell in the 3D domain
• Multi-core implementation
  – Insert OpenMP directives and multithread the loop over grid cells:

!$OMP PARALLEL DO
DO J = 1, ...
  DO K = 1, ...
    DO I = 1, ...
      ! KPP-generated solver called here for cell (I,K,J)

Page 13: Multi-core Acceleration of NWP

Kernel: WRF-Chem

• Cell BE implementation
  – PPU acts as master, invoking SPEs cell-by-cell
  – SPUs round-robin through the domain, triple buffering
  – Enhancement (layout sketched below):
    • SPUs process cells in blocks of four
    • Cell index innermost in blocks, to utilize SIMD
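
A conceptual C++ sketch of the "blocks of four" layout, not actual Cell/SPE code: NSPEC, the rate array, and the update itself are placeholders. The point is only the data layout, with the cell index innermost so that each species update touches four cells at once and can map onto a 4-wide SIMD unit.

#include <cstdio>

constexpr int NSPEC = 8;   // number of chemical species (placeholder value)
constexpr int BLOCK = 4;   // cells processed together; matches a 4-wide SIMD unit

// conc[s][c] holds species s for the four cells of one block.
// The cell index c is innermost and contiguous, so the inner loop vectorizes.
void update_block(float conc[NSPEC][BLOCK], const float rate[NSPEC])
{
    for (int s = 0; s < NSPEC; ++s)
        for (int c = 0; c < BLOCK; ++c)
            conc[s][c] -= rate[s] * conc[s][c];   // placeholder chemistry step
}

int main()
{
    float conc[NSPEC][BLOCK], rate[NSPEC];
    for (int s = 0; s < NSPEC; ++s) {
        rate[s] = 0.01f * (s + 1);
        for (int c = 0; c < BLOCK; ++c) conc[s][c] = 1.0f;
    }
    update_block(conc, rate);
    printf("conc[0][0] = %g\n", conc[0][0]);
    return 0;
}

The same layout idea carries over to the SIMD units on the conventional multi-core CPUs mentioned earlier.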

Page 14: Multi-core Acceleration of NWP

Kernel: WRF-Chem

• GPU implementation (see the sketch after the diagram below)
  – Host CPU controls the outer loop over steps in the Rosenbrock algorithm
  – Each step is implemented as a kernel over all cells in the domain
  – Thread-shared memory utilized where possible (but not much)
  – Cells are masked out of the computation as they reach convergence
  – Cell index (thread index) innermost for coalesced access to device memory

[Diagram: CPU on one side, GPU on the other; the host loop issues a sequence of kernel invocations, one per solver step]
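
A minimal CUDA sketch of this launch pattern, not the actual WRF-Chem solver: the kernels (init, stage_kernel, check_convergence), the per-cell arithmetic, and the tolerance are placeholders. It shows the host-driven outer loop, one kernel invocation per step, a per-cell convergence mask, and the cell (thread) index as the fastest-varying index for coalesced access.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void init(float *y, int ncell)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < ncell) y[cell] = 1.0f;                 // placeholder initial state
}

// One solver "step" over all cells; converged cells are skipped.
__global__ void stage_kernel(float *y, const int *converged, int ncell)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x; // cell index innermost
    if (cell >= ncell || converged[cell]) return;     // mask converged cells
    y[cell] *= 0.5f;                                  // placeholder stage update
}

__global__ void check_convergence(const float *y, int *converged, int ncell, float tol)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < ncell && y[cell] < tol) converged[cell] = 1;
}

int main()
{
    const int ncell = 1 << 20, threads = 256;
    const int blocks = (ncell + threads - 1) / threads;
    float *y;
    int *conv;
    cudaMalloc((void **)&y, ncell * sizeof(float));
    cudaMalloc((void **)&conv, ncell * sizeof(int));
    cudaMemset(conv, 0, ncell * sizeof(int));
    init<<<blocks, threads>>>(y, ncell);

    // Host CPU controls the outer loop; each step is a kernel over all cells.
    for (int step = 0; step < 10; ++step) {
        stage_kernel<<<blocks, threads>>>(y, conv, ncell);
        check_convergence<<<blocks, threads>>>(y, conv, ncell, 1.0e-3f);
    }
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(y);
    cudaFree(conv);
    return 0;
}

In this sketch, masking is done per thread rather than by compacting the list of active cells, which keeps indexing and coalescing simple at the cost of idle threads late in the integration.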

Page 15: Multi-core Acceleration of NWP

Chemistry Performance on GPU

Page 16: Multi-core Acceleration of NWP

Chemistry Performance on GPU

Page 17: Multi-core Acceleration of NWP

Some preliminary conclusions

• Chemistry kinetics
  – Very expensive but not computationally intense, so data movement costs are high and the memory footprint is very large (15 KB per cell)
  – Each Cell BE core has a local store large enough to allow effective overlap of computation and communication
  – Today's GPUs have ample device memory bandwidth, but not enough fast local memory for the working set
  – Xeon 5500 and Power6 have sufficient cache, concurrency, and bandwidth
• WRF microphysics
  – More computationally intense, so the GPU has an edge, but the Xeon is closing
  – PCIe transfer costs tip the balance to the Xeon, but this can be addressed
  – Haven't tried on Cell
• In all cases, the conventional multi-core CPU is easier to program, debug, and optimize

Garcia, J., R. Kelly, and T. Voran. Computing Spectropolarimetric Signals on Accelerator Hardware: Comparing the Cell BE and NVIDIA GPUs. Proceedings of the 2009 LCI Conference, 10-12 March 2009, Boulder, CO.

Page 18: Multi-core Acceleration of NWP

Accelerators for Weather and Climate?

• Considerable conversion effort and maintenance issues, esp. for large legacy codes.

• What speedup justifies the effort?
• What limits speedups?
  – Fast, close, and large enough memory resources
  – Distance from the host processor
  – A moving baseline: CPUs are getting faster
• Can newer generations of accelerators address this?
  – Technically: probable
  – Business case: ???