The U.S. DOE Accelerated Climate Modeling for Energy Project
Robert Jacob
April 22, 2015
Third Workshop on Coupling Technologies for Earth System Models
Argonne National Laboratory
• $675M operating budget
• 3,200 employees
• 1,450 scientists and engineers
• 750 Ph.D.s
ACME in a nutshell… A new U.S. climate modeling effort led by the U.S. Department of Energy Office of Biological and Environmental Research
or…
“A collaboration among the DOE national laboratories (and a few other institutions) to develop and apply the most complete, leading-edge climate and Earth system models for the most challenging and demanding climate-change research problems and DOE mission needs while efficiently using DOE Leadership Computing Facilities.”
Why should DOE in particular be interested?
Leading DOE machines at our last meeting
• Mira, an IBM Blue Gene/Q system (ALCF): 48 racks, 49,152 nodes, 786 TB of memory, peak flop rate 10 PF
• Titan, a Cray XK7 system (OLCF): 299,008 cores, 18,688 NVIDIA Kepler K20x GPUs, 710 TB of memory, peak flop rate 27 PF
Upcoming DOE machines
• Intel/Cray Aurora (ALCF): 50,000 Xeon Phi (“Knights Hill”) nodes, approx. 150 PF, production in 2019
• IBM/NVIDIA Summit (OLCF): 3,400 Power9 nodes with multiple NVIDIA Volta GPUs per node, approx. 150 PF, production in 2018
• Intel/Cray Cori (NERSC): 9,300 Xeon Phi (“Knights Landing”) nodes, approx. 30 PF, production in 2017
ACME Project Goals
• a series of prediction and simulation experiments addressing scientific questions and mission needs;
• a well-documented and tested, continuously advancing, evolving, and improving system of model codes that comprise the ACME Earth system model;
• the ability to make effective use of leading (and “bleeding”) edge computational facilities soon after their deployment at DOE national laboratories; and
• an infrastructure to support code development, hypothesis testing, simulation execution, and analysis of results.
Climate Science Drivers for ACME
Water cycle: How do the hydrological cycle and water resources interact with the climate system on local to global scales?
Biogeochemistry: How do biogeochemical cycles interact with global climate change?
Cryosphere: How do rapid changes in cryospheric systems interact with the climate system?
ACME: let specific science questions drive development
Atmosphere: More accurate simulation of aerosols, clouds, wind, and precipitation
Land: More accurate simulation of terrestrial feedbacks from more complex carbon, nutrient, and water cycles
Ocean: Introduction of multi-resolution dynamics to more accurately simulate ocean heat uptake and water masses
Sea ice: Recast numerics to focus resolution in polar regions, and add icebergs, sea-ice strength, and snow physics
Land ice: Addition of the first realistic, dynamic coupled ice-sheet model
ACME Roadmap
[Diagram: science drivers pose questions, hypotheses, experiments, and requirements that feed development]
ACME size/start:
• No new funding: 6-7 existing DOE lab climate projects combined into one program
• 8 U.S. national laboratories and 6 partner institutions
• 85 researchers working ¼ time or more
• Total effort ~43 FTE
• Started from a beta tag of CESM1.3
  – Using cpl7/MCT for the coupler.
• Started July 1, 2014. 3 years initially.
ACME organization
ACME Council: Dave Bader, Chair
Executive Committee: W. Collins, M. Taylor, R. Jacob, P. Jones, P. Rasch, P. Thornton, D. Williams
Ex Officio: J. Edmonds, J. Hack, W. Large, E. Ng
Executive Committee Chair: D. Bader
Chief Scientist: William Collins
Chief Computational Scientist: Mark Taylor
Project Engineer: Renata McCoy
Group leads (each group also has its own task leaders):
• Coupled Simulation Group: Dave Bader, Bill Collins, Mark Taylor
• Workflow Group: Dean Williams, Katherine Evans
• Software Engineering/Coupler Group: Robert Jacob, Andrew Salinger
• Performance/Algorithms Group: Patrick Worley, Hans Johansen
• Land Group: Peter Thornton, William Riley
• Atmosphere Group: Philip Rasch, Shaocheng Xie
• Ocean/Ice Group: Philip Jones, Todd Ringler
ACME Workflow group: Developing a comprehensive approach to enable large-scale climate science
The end-to-end workflow integrates:
• a model run simulation manager (AKUNA)
• a data publishing/sharing/archiving infrastructure (ESGF)
• secure data transport (Globus)
• analysis/diagnostics/visualization tools (UV-CDAT)
• a provenance capture framework (ProvEn) to improve reproducibility and tracking
ACME Performance Group: Simulation throughput, with a target of 5 simulated years/day
• Performance Monitoring and Analysis
• Internode and I/O: load balancing, communication algorithm optimization, computation/communication overlap, exploiting additional concurrency
• On-node: accelerators, threading, memory management, programming models
• Next-Generation Architectures: NERSC NESAP, OLCF-4 CAAR, ALCF-3 ESP
• Work with DOE computer scientists in other projects involving performance and fast numerical libraries.
ACME Software Engineering/Coupler Group
• Best Practices: common tools and methodologies adopted across ACME science/tech teams
  – Developer's test suites
  – Continuous integration with Jenkins
  – Repository setup and workflow
• I/O: parallel I/O at ACME model scales, increased use of in-situ diagnostics
• Modularity and configurability: modular interfaces for all new models, runtime configurability
• Coupling: coupler performance, coupler/main design, and MCT!
ACME git workflow
Model Coupling Toolkit
• A set of Fortran90 datatypes and functions for building parallel coupled models, with or without a coupler (which MCT doesn't provide).
  – All models are assumed to be parallel with MPI.
  – 2-sided send/recv model for moving data, similar to MPI.
  – Support for online parallel interpolation using offline-calculated weights.
  – Model registry, decomposition descriptors (by numbering dofs), distributed data type, communication tables.
  – Functions for time accumulation, spatial averaging, and merging. (A usage sketch follows this list.)
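To make those pieces concrete, here is a minimal sketch (not from the slides) of a two-component MCT hookup: half the MPI ranks act as a toy “ocean” that sends one field to the other half acting as a toy “atmosphere”. The module/use idioms follow MCT's documented style, but the component ids, field name, grid size, and 1-D decomposition are invented for illustration.

  program mct_sketch
    use m_MCTWorld,     only: MCTWorld_init => init
    use m_GlobalSegMap, only: GlobalSegMap, GlobalSegMap_init => init, &
                              GlobalSegMap_lsize => lsize
    use m_AttrVect,     only: AttrVect, AttrVect_init => init, &
                              AttrVect_importRAttr => importRAttr
    use m_Router,       only: Router, Router_init => init
    use m_Transfer,     only: MCT_Send => send, MCT_Recv => recv
    implicit none
    include 'mpif.h'
    integer, parameter :: NCOMPS = 2, OCN = 1, ATM = 2
    integer, parameter :: NPOINTS = 96           ! hypothetical global grid size
    integer :: ierr, wrank, wsize, mycomm, myid, rank, npes, lsz
    integer :: start(1), length(1)
    real(8), dimension(:), pointer :: field
    type(GlobalSegMap) :: gsmap
    type(AttrVect)     :: av
    type(Router)       :: rout

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, wrank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, wsize, ierr)

    ! Split the ranks into two components and register both with MCT.
    myid = merge(OCN, ATM, wrank < wsize/2)
    call MPI_Comm_split(MPI_COMM_WORLD, myid, wrank, mycomm, ierr)
    call MCTWorld_init(NCOMPS, MPI_COMM_WORLD, mycomm, myid)

    ! Decomposition descriptor: each rank owns one contiguous segment of
    ! the global index space (dofs numbered 1..NPOINTS).
    call MPI_Comm_rank(mycomm, rank, ierr)
    call MPI_Comm_size(mycomm, npes, ierr)
    length(1) = NPOINTS/npes
    start(1)  = rank*length(1) + 1
    call GlobalSegMap_init(gsmap, start, length, 0, mycomm, myid)
    lsz = GlobalSegMap_lsize(gsmap, mycomm)

    ! Distributed data type holding one real attribute named 'sst'.
    call AttrVect_init(av, rList='sst', lsize=lsz)

    ! Build communication tables to the other component, then exchange.
    if (myid == OCN) then
       allocate(field(lsz)); field = 273.15d0
       call AttrVect_importRAttr(av, 'sst', field, lsz)
       call Router_init(ATM, gsmap, mycomm, rout)
       call MCT_Send(av, rout)
    else
       call Router_init(OCN, gsmap, mycomm, rout)
       call MCT_Recv(av, rout)
    end if
    call MPI_Finalize(ierr)
  end program mct_sketch

Run with an even number of ranks that divides the grid size (e.g. mpirun -np 4); the ocean half sends and the atmosphere half receives.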
Model Coupling Toolkit
• At our last meeting (Feb. 2013), this was still news.
• Now, a little embarrassing.
[Screenshot: MCT website as of 4/22/15]
A lot has been happening with MCT…
[Screenshot: GitX display of recent MCT repository history]
Model Coupling Toolkit, v2.9
• Moved repository from the Argonne git server to github.com: https://github.com/MCSclimate/MCT
• New features to aid in studying Router initialization:
  – GSMap and MCTWorld print(): print contents to an ascii file for later reading.
  – Router init internal timers, invoked with an optional string argument to Router init (see the sketch below).
  – RouterTest.F90: a test program that reads in the output GSMaps and MCTWorld info and builds a Router. It will build on the same number of procs and same decomposition as the original model. Great for creating coupling benchmarks!
Model Coupling Toolkit, v2.9 (continued)
• Support for NAG 6.0
• Support for Mac builds
• Bug fixes, including ones found by valgrind (many thanks to NCAR's Sean Santos for the above)
• mpi-serial 2.0: a small single-node MPI library (a toy example follows this list).
  – For programs that assume MPI and users who don't want to install a full MPI library on their laptop/desktop. Doesn't require mpirun.
  – Not a stub library: MPI_Send/Recv really copies data.
  – New in 2.0: many more MPI datatypes/functions added; self-contained build system (autoconf).
  – Developed by ALCF's Raymond Loy.
• MCT 2.9 release is imminent.
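To show the use case, here is a toy program (invented for illustration) written against the ordinary MPI API. Linked against a full MPI library it runs as usual; linked against mpi-serial it runs as a single process with no mpirun, and the send genuinely copies data into the receive buffer. It sticks to core calls that a small MPI subset can be expected to provide.

  program mpi_serial_demo
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, req, status(MPI_STATUS_SIZE)
    real(8) :: sendbuf(4), recvbuf(4)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    sendbuf = (/ 1.0d0, 2.0d0, 3.0d0, 4.0d0 /)
    recvbuf = 0.0d0

    ! Non-blocking send to self matched by a blocking receive: legal
    ! single-rank MPI, which mpi-serial satisfies by copying the buffer.
    call MPI_Isend(sendbuf, 4, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, req, ierr)
    call MPI_Recv(recvbuf, 4, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, status, ierr)
    call MPI_Wait(req, status, ierr)

    print *, 'rank', rank, 'received', recvbuf
    call MPI_Finalize(ierr)
  end program mpi_serial_demo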
Model Coupling Toolkit: development process
• We use gitworkflows on github:
  – Anyone with a free github account can create a fork (git clone) of the MCT repo.
  – Make a branch and develop your cool new feature.
  – Submit a “Pull Request” to have your feature included in master.
• A few developers make branches directly in the main MCT repo:
  – The change gets reviewed and tested (by me) before inclusion.
  – The branch gets merged to master.
• Bugs and proposed new features are discussed in github issues.
• Documentation lives on the github wiki.
Model Coupling Toolkit: future plans
• Near term:
  – Router initialization benchmarks based on ACME science cases (0.25-degree atmosphere, 1/10th-degree ocean, 10K cores).
  – Improve scaling/timing of Router init and Rearranger/MCT_Send/Recv communication for ACME cases on LCFs, releasing any MCT improvements.
• Long term: MCT-MOAB
  – Talked about before. MOAB is the Mesh Oriented Database.
  – Tried using MOAB's Fortran interface to build MCT datatypes/functions. Not very satisfying.
  – New plan (notion): MCT-MOAB in C, with MOAB's C/C++ interface and Fortran interfaces defined using the F2003 standard.
Exascale is coming
[Chart: Top500 list and projected performance (top500.org)]
#1: Tianhe-2 (33 PF); #2: Titan (DOE OLCF, 17 PF); #5: Mira (DOE ALCF, 8.5 PF)
Exascale impact on software (including coupling software)
• Massive in-node parallelism (exponential growth):
  – Programmer cannot hand-pick work granularity.
  – Deeper memory hierarchy.
  – “Communication is expensive, FLOPS are free.”
• Power as a managed system resource:
  – Turning on/off components.
  – Selecting algorithms for speed within the power envelope.
  – Adjusting arithmetic precision.
  – Potentially adjusting fault protection.
• Dynamic parallelism and work decomposition.
• Fault tolerance actively managed in software at many levels.
• Architecture organization:
  – Heterogeneous cores.
  – Specialized functional units.
  – In-situ NVRAM.
Coupling at Exascale: possible problems
• The coupler (for Earth sub-system interfaces) is almost entirely 2D:
  – Limited amount of parallelism.
  – Also not a huge number of flops compared to the full model.
  – Not a huge memory demand, except for datatypes that grow with the number of cores.
• But the coupler does lots of memory movement (see the sketch below):
  – Moving data between a model's native data type and the coupler data type.
  – Moving data from one model's processors to another's.
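In MCT terms, those copies look like the following sketch, which assumes the import/rearrange/export idiom from MCT's documentation; the routine names exist in MCT, but this particular subroutine, its arguments, and the field name are invented for illustration.

  subroutine couple_step(av_src, av_dst, rearr, sst_native)
    use m_AttrVect,   only: AttrVect, AttrVect_lsize => lsize, &
                            AttrVect_importRAttr => importRAttr, &
                            AttrVect_exportRAttr => exportRAttr
    use m_Rearranger, only: Rearranger, Rearranger_rearrange => rearrange
    implicit none
    type(AttrVect)   :: av_src, av_dst       ! coupler-side distributed fields
    type(Rearranger) :: rearr                ! precomputed communication tables
    real(8), dimension(:), pointer :: sst_native  ! sending model's own storage
    real(8), dimension(:), pointer :: sst_out
    integer :: lsz

    ! Copy 1: model-native array -> coupler data type (AttrVect).
    call AttrVect_importRAttr(av_src, 'sst', sst_native, size(sst_native))

    ! Copy 2: move the field from one decomposition/processor set to the other.
    call Rearranger_rearrange(av_src, av_dst, rearr)

    ! Copy 3: coupler data type -> receiving model's native array.
    lsz = AttrVect_lsize(av_dst)
    allocate(sst_out(lsz))
    call AttrVect_exportRAttr(av_dst, 'sst', sst_out, lsz)
    ! ... hand sst_out to the receiving model ...
  end subroutine couple_step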
Coupling at Exascale: to do
• More parallelism through more components executing concurrently:
  – Ensembles.
  – Different models.
• Reduce memory movement:
  – One data type across all model components?
  – Co-located decompositions.
Co-located decomposition
[Diagram: two models with unrelated decompositions vs. two models with related (co-located) decompositions]
More information
http://climatemodeling.science.energy.gov/projects/accelerated-climate-modeling-energy
http://www.mcs.anl.gov/mct/