22
© Crown copyright Met Office From Knights ENDGame to Gung Ho beginnings: Porting old, and developing new, dynamical cores in the age of accelerators Dr Chris Maynard [email protected]

From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Embed Size (px)

Citation preview

Page 1: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

From Knights ENDGame to Gung Ho beginnings:

Porting old, and developing new, dynamical cores in the age of accelerators

Dr Chris Maynard [email protected]

Page 2: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

Introduction

End of Moore’s Law1 – Rise of multi-core processors The free lunch is over – Herb Sutter 1Processor speed

If you were ploughing a field, which would you rather use: Two strong oxen or 1024 chickens?

- Seymour Cray

Page 3: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Parallelism

© Crown copyright Met Office

Data parallel Domain decomposition Bind MPI task to CPU

Task parallel Processor perform different work I/O server

ILP – SIMD FMA, SSE/AVX CPU 512bit wide SIMD MIC Coalesced memory access (per warp) GPU

Hybrid programming MPI + OpenMP – affinity? Hierarchical memory spaces NUMA Accelerators host + device Asynchronous use?

Oversubscribed concurrency CPUs hyper-threading/SMT Not much benefit due to Out-of-order-execution MIC instruction interleave GPU hide memory latency

Page 4: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

The Unified Model and ENDGame

© Crown copyright Met Office

Unified Model – NWP and Climate – 25 years old

ENDGame is 3rd generation dynamical core Half million lines of F90 (F77)/openMP threading low-level, incomplete coverage

T/T 2

4

Long-lat finite difference semi-implicit semi-lagrangian

Much improved stability and scalability c.f. New Dymics

Page 5: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

Machines

© Crown copyright Met Office

STFC Daresbury 1 Dual socket 8 core Sandybridge host + KNC No queues, instant accesses

Stampede, TACC - 6400 nodes! batch system, multi-MIC

Met Office IBM Power7 2 machines ~ 500 32-core nodes each

Cray XC30 UK National Academic HPC service – 12-core dual socket Ivybridge

Page 6: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

96x72x70 single Phi – UM

A: 1 OMP thread

B: 2 OMP, -O2, -O3 and compile thread=2

C: 2 OMP, -mkl, seg-size 20-120

D: 4 OMP, seg-size

E 2x Phi 6x10 each – 2 OMP

6x10 MPI tasks, + OpenMP threads – domain decomposition KMP_AFFINITY="verbose,granularity=thread,compact"

Compiled for native mode -mmic

Page 7: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

Interpreting the results •  Threading performance is poor

–  Segments for radiation routines doesn’t help

–  Is there enough work per thread?

•  MKL really helps radiation routines –  Lots of trig func - library boots performance 30%

•  Two KNC cards is faster –  Even for MPI over PCIe - enough data parallelism

–  OMP at low level in UM – incomplete coverage

•  Bigger problem size? –  192x144 runs out of global memory on 1 and 2 cards

–  Same local volume size as N48 on 4!

Page 8: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

Focus on the solver - threading

Solver performance is dominated by preconditioner – tri_sor

Red-black checkerboard for threading N48 – 96 x 72 / 6 x 10 (MPI) = 16 x 7(8) per task

split over threads?

Stopping criteria for BiCGStab 10-3

Satisfy this bound with single precision Halve data transfer from memory

Compare N96 (192 x 144 grid) 4 x KNC - 12x20 MPI task 2 x Sandy Bridge nodes (4-sockets) – 4x8 MPI tasks

1 Power 7 node (4 –sockets) – 4x8 MPI tasks

Page 9: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

Tri-diagonal SOR preconditioner (strong scaling)

PWR 7 is faster (no off node comms) No SMT scaling MIC is slowest shows some thread scaling

Single precision solver is much faster Reduces memory traffic Reduces communication On all architectures

Page 10: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

Data layout critical

Data layout is lexicographical (i,j,k)in UM loop order is red-black in tri_sor BAD cache use and SIMD

Change to linear red and black data layout

LGC is universal in UM including comms - data transformation Try out in serial-stand alone solver code – not full UM

Page 11: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Code length and complexity increase

© Crown copyright Met Office

Matrix-vector routine LCG ~ 100 lines RB ~ 700 lines Just for i and j both even!

Page 12: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Architecture comparison

© Crown copyright Met Office

Serial code only – mimic memory access fully occupied node Run multiple copies per node stress memory performance Global problem size, per socket, same on all architectures Amount of data transformation per memory system is the same, amount of work per core is different

Page 13: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Performance comparison

Page 14: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Summary

© Crown copyright Met Office

• ENDGame solver data layout is lexicographical but loop order is Red-black • improve using single precision and change data layout

•  will require lots of coding in UM • These changes help on all architectures (least on MIC) • Vector/SIMD – Compiler reports successful vectorisation, but no performance

•  peer at assembler code suggest all iterations in peel loop, vector length is 1?

• Try OpenMP SIMD with ifort14?

Page 15: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Stop press …

© Crown copyright Met Office

ifort –align array64byte

16x128x64 4 threads bound to core KNC 2x IB 1.3x Thanks to Tom Henderson for tip!

Page 16: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

First look at Gung Ho

© Crown copyright Met Office

Met Office, Bath, Exeter, Imperial, Leeds, Reading, Warwick and STFC Daresbury Design and build new scalable dynamical core Numerics still under R & D High level software architecture design is complete First implementation developments started this year: Dynamo

Finite Element Method (FEM) – possibly higher order Unstructured mesh (semi-uniform horizontal, vertically graded) Element topology (i.e. Triangular, quadrilateral, hexagonal) not yet fixed

Page 17: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

Design Principle

Scientific programming

Find numerical solution (and estimate of the uncertainty) to a (set of) mathematical equations which describe the action of a physical system

Parallel programming and optimisation are the methods by which large problems can be solved faster than real-time.

Roles: Natural and Computational scientists

What tasks are performed and not job descriptions

Page 18: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

© Crown copyright Met Office

Layered architecture

Parallel code Science code

Page 19: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Comparing ENDGame and Gung Ho

© Crown copyright Met Office

ENDGame 32x32x64 domain RB BiCGstab solver (tri_sor) preconditioner 25 times

Gung Ho 32x32x64 Galerkin Projection on lowest order Quads for W1, W2 and W3 spaces 4 colours, no preconditioner, BiCGstab, 10 times

Single socket Ivy bridge processor – Open scaling wrt single thread

Page 20: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

ENDGame versus Gung Ho Intel Xeon Phi

© Crown copyright Met Office

ENDGame – problem to small

EG artificial prpblem size

GH – native mode

Gung Ho Offload

Page 21: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Summary

© Crown copyright Met Office

7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives to offload kernel Gung Ho absolute performance KNC native twice as slow as Ivybridge offload twice as slow as native Data movement is not optimised, (fused kernels) No SIMD performance – kernel structure not yet finalised for cache and SIMD performance Parallel porting only requires changes to PSy layer science code is unchanged

Page 22: From Knights ENDGame to Gung Ho beginningsdata1.gfdl.noaa.gov/multi-core/presentations/maynard_5a.pdf7 Open MP loops with colouring for Gung Ho no code restructuring 4 lines of directives

Conclusions

Demise of Moore’s Law means applications must exploit increased parallelism and hierarchy

UM ported in native mode to Xeon Phi

initial performance is disappointing

code changes single: precision solver, checkerboarding: improve performance but more on other architectures (need full implementation in UM)

Gung Ho portable and scalable (hopefully) by design

Design for ILP/SIMD at low level evolving

© Crown copyright Met Office