
CCSM Portability and Performance, Software Engineering Challenges, and Future Targets

Tony Craig
National Center for Atmospheric Research
Boulder, Colorado, USA

CAS Meeting, September 7-11, 2003, Annecy, France

Topics

• CCSM SE and design overview
• Coupler design and performance
• Production and performance
  – portability
  – scaling
• SE challenges
• The future

CCSM Overview

• CCSM = Community Climate System Model (NCAR)

• Designed to evaluate and understand Earth's global climate, both historical and future
• Multiple executables (5)
  – Atmosphere (CAM), MPI/OpenMP
  – Ocean (POP), MPI
  – Land (CLM), MPI/OpenMP
  – Sea Ice (CSIM), MPI
  – Coupler (CPL6), MPI

CCSM SE Overview

• Good science is the top priority
• Fortran 90 (mostly)
• 500k lines of code
• Community project, dozens of developers
• Collaborations are critical
  – University community
  – DOE - SciDAC
  – NASA - ESMF
• Regular code releases
• NetCDF history files
• Binary restart files
• Many levels of parallelism: multiple executables, MPI, OpenMP (see the sketch below)
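A minimal, illustrative sketch (not CCSM source) of the two lower levels of parallelism inside a single component: OpenMP threads split the work owned by each MPI process, and MPI combines results across processes (e.g. a global sum). The multiple-executable level is handled at launch time and is not shown; all names and sizes here are invented.

program hybrid_sketch
   use mpi
   implicit none
   integer :: ierr, rank, nprocs, i
   integer, parameter :: n = 1000
   real(8) :: local(n), partial, total

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   local = 1.0d0
   partial = 0.0d0

   ! OpenMP threads split the loop owned by this MPI process
!$omp parallel do reduction(+:partial)
   do i = 1, n
      partial = partial + local(i)
   end do
!$omp end parallel do

   ! MPI combines the per-process results across the executable
   call MPI_Reduce(partial, total, 1, MPI_REAL8, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
   if (rank == 0) print *, 'global sum =', total, 'over', nprocs, 'processes'

   call MPI_Finalize(ierr)
end program hybrid_sketch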

CCSM “Hub and Spoke” System

[Diagram: hub-and-spoke layout with cpl at the center, connected to atm, ocn, ice, and lnd]

• Each component is a separate executable

• Each component runs on a unique set of hardware processors
• All communications go through the coupler
• Coupler
  – communicates with all components
  – maps (interpolates) data and merges fields (see the sketch below)
  – computes some fluxes
  – has diagnostic, history, and restart capability
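A self-contained sketch of the two coupler operations named above: mapping a field between grids as a sparse-matrix multiply with precomputed interpolation weights, and merging ocean and sea-ice fluxes weighted by ice fraction. The grid sizes, weights, and field values are toy numbers invented for illustration, not CCSM data.

program coupler_sketch
   implicit none
   integer, parameter :: nsrc = 4, ndst = 2, nlinks = 4
   ! Sparse interpolation matrix stored as (row, col, weight) triplets;
   ! in practice these weights come from an offline mapping file.
   integer :: row(nlinks) = (/ 1, 1, 2, 2 /)
   integer :: col(nlinks) = (/ 1, 2, 3, 4 /)
   real(8) :: wgt(nlinks) = (/ 0.5d0, 0.5d0, 0.5d0, 0.5d0 /)
   real(8) :: src(nsrc), dst(ndst)
   real(8) :: ocn_flux(ndst), ice_flux(ndst), ice_frac(ndst), merged(ndst)
   integer :: n

   ! Mapping: dst = W * src (weighted interpolation from source to destination grid)
   src = (/ 1.0d0, 2.0d0, 3.0d0, 4.0d0 /)
   dst = 0.0d0
   do n = 1, nlinks
      dst(row(n)) = dst(row(n)) + wgt(n) * src(col(n))
   end do

   ! Merging: combine ocean and sea-ice fluxes weighted by the ice fraction
   ocn_flux = (/ 10.0d0, 10.0d0 /)
   ice_flux = (/ 2.0d0, 2.0d0 /)
   ice_frac = (/ 0.3d0, 0.8d0 /)
   merged = ice_frac * ice_flux + (1.0d0 - ice_frac) * ocn_flux

   print *, 'mapped field:', dst
   print *, 'merged flux: ', merged
end program coupler_sketch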

The CCSM coupler

• Recent redesign (cpl6)
• Create a fully parallel, distributed-memory coupler
• Implement M-to-N communication between components
• Improve communication performance to minimize bottlenecks at higher resolutions in the future
• Improve coupling interfaces; abstract the communication method away from components
• Improve usability, flexibility, and extensibility of the coupled system
• Improve overall performance

The Solution

cpl6, built on MCT* and MPH**

*Model Coupling Toolkit (DOE Argonne National Lab)
**Multi-Component Handshaking Library (DOE Lawrence Berkeley National Lab)

Build a new coupler framework with abstracted, parallel communication software in the foundation. Create a coupler application instantiation, called cpl6, that reproduces the functionality of cpl5.

cpl6 Design: Another view of CCSM

• In cpl5, MPI was the coupling interface
• In cpl6, the "coupler" is now attached to each component
  – Components are unaware of the coupling method (a hypothetical interface sketch follows the diagram)
  – Coupling work can be carried out on component processors
  – A separate coupler is no longer absolutely required

[Diagram: the atm, lnd, ice, ocn, and cpl components each sit on top of a coupling interface layer, which sits on the hardware processors]
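A hypothetical sketch of the idea behind such an interface layer: the component calls generic send/receive routines and never touches MPI directly, so the communication method underneath can change without modifying the component. The module and routine names (cpl_interface, cpl_send, cpl_recv) are invented for illustration and are not the actual cpl6 API.

module cpl_interface
   use mpi
   implicit none
   private
   public :: cpl_send, cpl_recv
contains
   subroutine cpl_send(field, dest, tag, comm)
      real(8), intent(in) :: field(:)
      integer, intent(in) :: dest, tag, comm
      integer :: ierr
      ! The communication method is hidden here; it could be replaced
      ! (e.g. by an M-to-N rearranger) without touching the component.
      call MPI_Send(field, size(field), MPI_REAL8, dest, tag, comm, ierr)
   end subroutine cpl_send

   subroutine cpl_recv(field, src, tag, comm)
      real(8), intent(out) :: field(:)
      integer, intent(in) :: src, tag, comm
      integer :: ierr
      call MPI_Recv(field, size(field), MPI_REAL8, src, tag, comm, MPI_STATUS_IGNORE, ierr)
   end subroutine cpl_recv
end module cpl_interface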

CCSM Communication: cpl5 vs cpl6

[Diagram: cpl5 path = copy, gather to the component root, root-to-root communication, scatter, copy; cpl6 path = direct M-to-N communication with no gather/scatter and no extra copies]

Production configuration: coupler on 8 pes, ice component on 16 pes, 240 transfers of 21 fields.
cpl5 communication = 61.5 s; cpl6 communication = 18.5 s.
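The core of an M-to-N transfer is a "router" that tells each of the M source processes which of the N destination processes owns each point it holds, so data can be sent directly instead of being funneled through root processes. The serial sketch below illustrates that idea for a simple block decomposition; the actual MCT router works from general decomposition descriptors rather than this toy layout, and all sizes and names here are invented.

program mton_router_sketch
   implicit none
   integer, parameter :: nglobal = 16   ! global points (toy size)
   integer, parameter :: msrc = 4       ! source (component) processes
   integer, parameter :: ndst = 2       ! destination (coupler) processes
   integer :: g, src_rank, dst_rank

   do src_rank = 0, msrc - 1
      print *, 'source rank', src_rank, 'sends:'
      do g = 1, nglobal
         if (owner(g, msrc) == src_rank) then
            dst_rank = owner(g, ndst)
            print *, '  global point', g, ' -> destination rank', dst_rank
         end if
      end do
   end do

contains
   ! Block decomposition: rank that owns global index g when the global
   ! domain is split evenly over npes processes.
   integer function owner(g, npes)
      integer, intent(in) :: g, npes
      owner = (g - 1) * npes / nglobal
   end function owner
end program mton_router_sketch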

CCSM Production

• Forward integration of relatively coarse models
  – atm/land: T42 (128x64, L26)
  – ocn/ice: 1 degree (320x384, L40)
• Finite difference and spectral, explicit and implicit methods, vertical physics, global sums, nearest-neighbor communication
• I/O not a bottleneck (5 GB / simulated year)
• Restart capability (750 MB)
• Separate harvesting to the local mass storage system
• Auto resubmit

CCSM Throughput vs Resolution

Atmosphere Resolution   Ocean Resolution   Processors   Throughput* (yrs/day)
T31 (3.7 deg)           3 deg              68           12.0
T42 (2.8 deg)           1 deg              152          10.9
T42                     1 deg              120          8.6
T42                     1 deg              104          7.2
T85 (1.4 deg)           1 deg              200          4.0
T170 (0.7 deg)          1 deg              400          2.0 (estimate)

*IBM Power4 system, bluesky, as of 9/1/2003

CCSM Throughput vs Platform

Platform                  Processors   Throughput* (yrs/day)
IBM Power3 wh (NCAR)      104          3.5
IBM Power3 nh (NERSC)     112          4.0
SGI O3K (LANL)            112          4.0
Linux cluster** (ANL)     104          4.0-7.0 (estimate)
HP/CPQ Alpha (PSC)        104          6.0 (estimate)
IBM Power4 (NCAR)         104          7.2
IBM Power4 (NCAR)         152          10.9
NEC Earth Simulator       80           20.0 (estimate)
Cray X1 (ORNL)            80           20.0 (estimate)

*T42/1 degree atm/ocn resolution
**ANL jazz machine, 2.4 GHz Pentium

Ocean Model Performance and Scaling

[Figure: POP ocean model performance and scaling plot; courtesy of PW Jones, PH Worley, Y Yoshida, JB White III, and J Levesque]

CCSM Component Scaling

[Figure: seconds per simulated day vs. number of processors (2 to 80) for the atm, lnd, ice, ocn, and cpl components; CCSM2_2_beta08, T42_gx1v3, IBM Power4, bluesky]

CCSM Load Balance Example

Processor allocation: 64 ocn, 48 atm, 16 ice, 8 lnd, 16 cpl (152 total)

[Figure: timeline of seconds per simulated day for each component on its processor set; CCSM2_2_beta08, IBM Power4, bluesky]

Challenges in the Environment (1)

• Machines often not well balanced
  – chip speed
  – interconnect
  – memory access
  – cache
  – I/O
  – vector capabilities
• Each machine is "balanced" differently
• Optimum coding strategy often depends largely on the platform
• Need to develop "flexible" software strategies

RISC vs Vector

• Data layout: index order, data structure layout
• Floating-point operation count (if tests) versus pipelining (masking); see the sketch after this list
• Loop ordering and loop structure
• Vectorization impacts parallelization
• Memory access, cache blocking, array layouts, array usage
• Bottom line (in my opinion):

– Truly effective cache reuse is very hard to achieve on real codes

– Sustained performance on some RISC machines is disappointing

– Poor vectorization costs an order of magnitude in performance on vector machines

– We are now (re-)vectorizing and expect to pay little or no performance penalty on RISC machines
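An illustrative sketch (not CCSM code) of the "if versus masking" trade-off from the list above: the branch form does fewer floating-point operations, while the masked form keeps the loop body branch-free so it pipelines and vectorizes cleanly. The loop and field names are made up.

program mask_vs_if
   implicit none
   integer, parameter :: n = 1000000
   real(8) :: t(n), flux_if(n), flux_mask(n), mask(n)
   integer :: i

   call random_number(t)

   ! Branch form: compute only where the condition holds
   do i = 1, n
      if (t(i) > 0.5d0) then
         flux_if(i) = 2.0d0 * t(i)
      else
         flux_if(i) = 0.0d0
      end if
   end do

   ! Masked form: compute everywhere, then select with a 0/1 mask
   ! (more flops, but no branch inside the loop)
   do i = 1, n
      mask(i) = merge(1.0d0, 0.0d0, t(i) > 0.5d0)
      flux_mask(i) = mask(i) * (2.0d0 * t(i))
   end do

   print *, 'results agree: ', all(abs(flux_if - flux_mask) < 1.0d-12)
end program mask_vs_if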

Challenges in the Environment (2)

• Startup and control of multiple executables
• Compilers and libraries
• Tools
  – Debuggers inadequate for multiple executables and MPI/OpenMP parallel models
  – Timing and profiling tools generally inadequate
    • IBM HPM getting better
    • Jumpshot works well
    • Cray performance tools look promising
  – Have avoided instrumenting code (risk, robustness, #if)
  – Use print statements and calls to the system clock (see the sketch below)
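A minimal sketch of that kind of hand-rolled timing: bracket a region with system_clock calls and print the elapsed wall-clock time. The timed loop is just a stand-in for real model work, and clock wrap-around is ignored.

program timer_sketch
   implicit none
   integer :: count_start, count_end, count_rate
   real(8) :: x, elapsed
   integer :: i

   call system_clock(count_start, count_rate)

   ! Region being timed (stand-in for real model work)
   x = 0.0d0
   do i = 1, 10000000
      x = x + sin(real(i, 8))
   end do

   call system_clock(count_end)
   elapsed = real(count_end - count_start, 8) / real(count_rate, 8)
   print *, 'region took', elapsed, 'seconds (result =', x, ')'
end program timer_sketch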

Summary

• Science top priority, large community project, regular model releases

• SE improvements are continuous; cpl6 is a success
• Machines change rapidly and are highly variable in architecture
• Component scaling and CCSM load balance are acceptable
• (Re-)vectorization is underway
• Tools and machine software can present significant challenges

Future

• Increased coupling flexibility
  – Single executable
  – Mixed concurrent/serial design

• Continue to work on scalar and parallel performance in all models

• Take advantage of libraries and collaborations for performance-portability software; add more layering; leverage external efforts
  – NASA/ESMF
  – DOE/SciDAC
  – University community
  – Others
• IBM is still an important production platform for CCSM
• CCSM is moving onto vector platforms and Linux clusters; production capability on these platforms is still to be determined

THE END