Benchmarking Best Practices in the RAPS Consortium (CAS2K11). Contact: luis.kornblueh@zmaw.de



  • Slide 1 (CAS2K11)

    Benchmarking Best Practices in the

    RAPS Consortium

    George Mozdzynski, ECMWF

    RAPS Chairman

  • Slide 2

    Outline

     Benchmark evolution

     What is RAPS?

     Benchmarking ‘best’ practices

     RAPS futures

     RAPS forum


  • Slide 3

    Benchmark Evolution


    1970's → 1980's → 1990's → 2000+

    Whetstone/Dhrystone, Linpack100, Livermore Loops, SpecFP,
    NAS Parallel Benchmarks, Ping-Pong/Ping-Ping, IOzone, HPL (top500),
    and full application benchmarks with MPI + OpenMP

    homo benchmarkus

  • Slide 4

    What is RAPS?

     Real Applications on Parallel Systems

     European Software Initiative

     RAPS Consortium (founded early 90’s)

     Working group of hardware vendors

     Programming model (MPI + F90/F95 + OpenMP)

    “The partners of the RAPS Consortium develop portable

    parallel versions of their production codes which are made

    available to a Working Group of Hardware Vendors for

    benchmarking and testing.”

  • Slide 5

    RAPS Consortium

    CCLRC, CERFACS, CPTEC, CSCS, DWD, DKRZ, ECMWF,
    Fraunhofer SCAI/ITWM, Met Office UK, MPI-M, METEO-FRANCE, NERC, ONERA

  • Slide 6

    Working Group of Hardware Vendors

    Bull, Cray, Fujitsu, HP, IBM, INTEL, NEC, Oracle, SGI

  • Slide 7

    Why RAPS?

     Portability of codes (F90/F95/F2003, C/C++, MPI, OpenMP)

     Availability of benchmark codes ahead of a formal procurement

     RAPS community

    - Some influence on vendors adopting standards

     Regular meetings in past years (24 in total)

  • Slide 8

    RAPS process

     RAPS benchmarks distributed by individual organizations

     No official membership required for vendors

     Vendors approach individual orgs for benchmarks

    - Confidentiality / Licence agreement

     Meetings

    - Every 2 years as part of the ECMWF ‘Use of HPC in Meteorology’ workshop

    - Chairmanship with host organization (this will probably change)

  • Slide 9

    RAPS benchmarks

     Today

    - DWD

    - ECMWF

    - Met Office

    - MPI-M

    - Meteo-France

     Commitment to produce up to date benchmarks reflecting key operational applications of consortium members

  • Slide 10

    Benchmark Best Practices

     Using ECMWF’s IFS as a case study

     Benchmark preparation

     What are correct results?

     Testing

    - With different compilers

    - On different systems

  • Slide 11

    Benchmark preparation

     Benchmarks should be bug free 

     No formal way of proving a program is bug free

     But we can do a lot to find bugs

    - Using compiler options

    - Using memory debugging tools

    - Run combinations of number of MPI tasks and OpenMP threads

    - Enforcing bit reproducible results in our applications

     These approaches should be part of normal testing of each new cycle/release of the application and not just for benchmarking

    - Doing this work post release is unproductive, back stitches …


  • Slide 12

    Using compilers to find bugs /1

     Compile with no optimization

    - On IBM, -O0 or -qnooptimize

    - Will pick up potential OpenMP bugs where some variables should be PRIVATE but are by default SHARED

    - We get away with it at higher compiler optimization as these variables are more likely to be held in registers

    - Remember a vendor benchmarker will probably revert to this if they have problems with some ‘safe’ compiler options

    - Object libraries built with no optimization are needed for use with debuggers

    - On IBM, routines with OpenMP also need -qsmp=omp:noopt
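    A minimal C sketch (illustrative only, not IFS code) of the shared-vs-private class of OpenMP bug the slide describes. The fix shown is the standard one: make the per-iteration temporary loop-local so it is implicitly private; compiled without OpenMP, the loop simply runs serially and still gives the right answer, which is why such bugs hide at higher optimization levels.

    ```c
    #include <stdio.h>

    int main(void) {
        double a[8], sum = 0.0;

        /* If 'tmp' were declared at function scope and left SHARED (the
         * OpenMP default), threads could overwrite each other's value
         * before it is used. Declaring it inside the loop body (or in a
         * private() clause) gives each thread its own copy. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 8; i++) {
            double tmp = (double)i * 2.0;  /* loop-local: implicitly private */
            a[i] = tmp;
            sum += tmp;
        }

        printf("%.1f\n", sum);  /* 0+2+4+...+14 = 56.0 */
        return 0;
    }
    ```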


  • Slide 13

    Using compilers to find bugs /2

     Compile with subscript checking

    - On IBM, -qcheck or -C

    - Can be done with normal (production) compiler optimization

    - A few months of effort was initially required to get IFS to run with compiler subscript checking (circa 2002)

    - Many bugs detected since then have repaid that effort

    - Problem areas requiring resolution:

     Unused field index in array actual argument on subroutine call

     Incorrect subroutine dummy argument size

    - Resolution straightforward, a combination of:

     Unused field indices set to 1 (IFS namelist variable NUNDEFLD=1)

     Use of interface blocks, assumed-shape arrays, modules
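    What the compiler inserts under -qcheck / -C can be sketched in C as an explicit range test on every subscript (function and names here are invented for illustration): instead of silently reading or corrupting memory, an out-of-range index aborts with a diagnostic.

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    #define N 5

    /* Hypothetical hand-rolled equivalent of compiler subscript checking:
     * guard each access with a bounds test and abort on violation. */
    static double checked_get(const double *a, int n, int i) {
        if (i < 0 || i >= n) {
            fprintf(stderr, "subscript %d out of bounds [0,%d)\n", i, n);
            exit(1);
        }
        return a[i];
    }

    int main(void) {
        double a[N] = {10, 20, 30, 40, 50};
        printf("%.0f\n", checked_get(a, N, 4));  /* in range: prints 50 */
        /* checked_get(a, N, 5) would abort with a diagnostic instead of
         * reading past the end of the array. */
        return 0;
    }
    ```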


  • Slide 14

    Using compilers to find bugs /3

     Initialize arrays to NaNs

    - On IBM, -qinitauto=7FF7FFFF for automatic variables

    - Plus some local code to intercept mallocs and initialize to NaNs

     For IFS, activated via environment variable EC_FILL_NANS=1 to catch all memory allocations (heap/stack)

    - Finds the occasional bug where scalars or arrays are not initialized correctly (or at all)
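    A minimal sketch (function name invented; not the IFS EC_FILL_NANS code) of the malloc interception idea: fill fresh allocations with the bit pattern 0x7FF7FFFF, which is a NaN when read as a 32-bit float, so any computation consuming uninitialized data propagates NaNs instead of plausible-looking garbage.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <math.h>

    /* Allocate and fill with 0x7FF7FFFF: exponent bits all ones, mantissa
     * nonzero, i.e. a NaN in IEEE-754 single precision. */
    static void *nan_fill_malloc(size_t bytes) {
        void *p = malloc(bytes);
        if (p) {
            uint32_t *w = p;
            for (size_t i = 0; i < bytes / 4; i++)
                w[i] = 0x7FF7FFFFu;
        }
        return p;
    }

    int main(void) {
        float *a = nan_fill_malloc(16 * sizeof(float));
        /* Reading an element before writing it now yields NaN. */
        printf("%d\n", isnan(a[0]) ? 1 : 0);
        free(a);
        return 0;
    }
    ```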


  • Slide 15

    Using memory debugging tools to find bugs


  • Slide 16

    Using memory debugging tools to find bugs and leaks

     Using RogueWave’s MemoryScape

     Started from a Totalview debugging run (at ECMWF)

    - Guard blocks

     When memory is malloc’d and free’d

    - Red Zones

     Align each malloc on page boundary (last word)

     Next page is RO, so SEGV on write attempt

     Supported on Linux

     Not supported on AIX

    - Check for memory leaks
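    A minimal POSIX sketch (names invented; not RogueWave's implementation) of the Red Zone scheme described above: place the allocation so it ends exactly at a page boundary and make the following page read-only, so the first write past the end faults immediately. Linux/POSIX only (mmap/mprotect); error handling kept minimal.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map data pages plus one guard page; return a pointer positioned so
     * the block ends right at the guard page. */
    static void *red_zone_alloc(size_t bytes) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t span = (bytes + page - 1) / page * page;  /* round up */
        char *base = mmap(NULL, span + page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) return NULL;
        mprotect(base + span, page, PROT_READ);  /* guard page is RO */
        return base + span - bytes;
    }

    int main(void) {
        double *a = red_zone_alloc(100 * sizeof(double));
        a[99] = 1.0;      /* last valid element: fine */
        /* a[100] = 1.0;     would touch the guard page -> SIGSEGV */
        printf("%.1f\n", a[99]);
        return 0;
    }
    ```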


  • Slide 17

    IFS RAPS11 model-only benchmark

     Model resolutions T159, T399, T799, T1279 and T2047

     Full outputs from IBM Power6 (all resolutions)

     No output of model fields (i.e. not an I/O benchmark)

     Reference job scripts

    - 24 time steps

    - test of correctness

     Long run job scripts, for performance runs

    - use same executable as for reference runs

     Results ([SP,GP] norms) should be bit-reproducible when changing

    - number of MPI tasks

    - number of OpenMP threads


    Results of ERROR calculation (T159/Intel/pgi-10.6)

    The error calculated from the results shows

    that the calculations are correct

    The maximum error is = 0.16518 %
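    The bit-reproducibility requirement above exists because floating-point addition is not associative: summing partial results in thread-arrival order can change the last bits. This toy C example (not IFS code) shows the effect, and why fixing the reduction order makes norms identical for any MPI/OpenMP decomposition.

    ```c
    #include <stdio.h>

    int main(void) {
        double x[3] = {1.0, 1e16, -1e16};

        /* Same three numbers, two summation orders, two answers. */
        double fwd = (x[0] + x[1]) + x[2];  /* 1.0 absorbed into 1e16 */
        double bwd = (x[2] + x[1]) + x[0];  /* cancellation first, 1.0 kept */

        printf("%.1f %.1f\n", fwd, bwd);  /* 0.0 1.0 */
        return 0;
    }
    ```

    A fixed, decomposition-independent reduction order (e.g. always summing per-gridpoint partials in the same sequence) removes this source of divergence.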

  • Slide 18

    Maximum Error Correctness Check

    (optimized libs v non-optimized libs on P6)


    [Chart: Maximum Error (%) vs Model Hours, for T159 (125 km) and T1279 (16 km)]

    Maximum Error (all time steps) computed from saved reference norms and norms from the running model (at each atmospheric level) using the following spectral variables:

    - Vorticity
    - Divergence
    - Temperature
    - Humidity
    - Kinetic Energy

    IFS CY37R3 (RAPS12)

  • Slide 19

    Maximum Error Correctness Check

    (optimized libs v non-optimized libs on P6)


    [Chart: Maximum Error (%), y-axis 0.0 to 1.2, vs Model Hours 1 to 47, for T159 (125 km) and T1279 (16 km)]

    T1279 TSTEP=600s

    24 steps = 4 model hours

    T159 TSTEP=3600s

    24 steps = 24 hours

    Acceptable max error is below 1 %

    -at 24 model time-steps

    -to keep run time short
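    The pass/fail test described above can be sketched as follows (values and the function name are invented for illustration; the real check uses the saved IFS spectral norms): compute the maximum relative error in percent against the reference and require it to stay below 1% over the 24 reference time steps.

    ```c
    #include <stdio.h>
    #include <math.h>

    /* Maximum relative error, in percent, between reference and run norms. */
    static double max_error_pct(const double *ref, const double *run, int n) {
        double worst = 0.0;
        for (int i = 0; i < n; i++) {
            double e = fabs(run[i] - ref[i]) / fabs(ref[i]) * 100.0;
            if (e > worst) worst = e;
        }
        return worst;
    }

    int main(void) {
        double ref[4] = {1.000, 2.000, 4.000, 8.000};  /* saved reference */
        double run[4] = {1.001, 1.999, 4.002, 8.000};  /* this run        */
        printf("%s\n", max_error_pct(ref, run, 4) < 1.0 ? "PASS" : "FAIL");
        return 0;
    }
    ```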

  • Slide 20

    IFS RAPS11 benchmark DR_HOOK_OPT=prof output provided for all model cases (T2047 extract)


    Name of the executable : /fws2/lb/work/rd/mpm/RAPS11/T2047/../bin/ifsMASTER

    Number of MPI-tasks : 512

    Number of OpenMP-threads : 8

    Wall-times over all MPI-tasks (secs) : Min=1020.070, Max=1025.730, Avg=1022.972, StDev=1.261

    Routines whose total time (i.e. sum) > 1.000 secs will be included in the listing

    Avg-% Avg.time Min.time Max.time St.dev Imbal-% # of calls : Name of the routine

    7.91% 80.951 71.799 86.287 1.554 16.79% 6545408 : MXMAOP

    4.15% 42.488 39.259 47.132
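    The Imbal-% column in the listing is consistent with (max - min) / max * 100 over the per-task wall times; checking against the MXMAOP row above (min 71.799 s, max 86.287 s) reproduces its 16.79%.

    ```c
    #include <stdio.h>

    int main(void) {
        /* Min/Max wall times of MXMAOP from the DR_HOOK extract above. */
        double min = 71.799, max = 86.287;
        double imbal = (max - min) / max * 100.0;
        printf("%.2f\n", imbal);  /* 16.79 */
        return 0;
    }
    ```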