Exascale computing: endgame or new beginning for … · Exascale computing: endgame or new beginning for ... Center() =-4.0 * data ... the International Conference on High-Performance

T. Schulthess17th Workshop on HPC in Meteorology @ ECMWF, Reading, Wednesday October 26, 2016

Thomas C. Schulthess

1

Exascale computing: endgame or new beginning for climate modelling


Operational system for NWP at CSCS

2

Albis & Lema (in production through 3/2016) New system: Kesch & Escha


MeteoSwiss’ performance ambitions

3

1

5

10

15

20

25

30

35

40

Constant budget for investments and operations

24x Ensemble with multiple forecasts

Grid 2.2 km ! 1.1 km

10x

Requirements from MeteoSwissData assimilation6x

We need a 40x improvement between 2012 and 2015 at constant cost

?

T. Schulthess17th Workshop on HPC in Meteorology @ ECMWF, Reading, Wednesday October 26, 2016 4

September 15, 2015

Today’s Outlook: GPU-accelerated Weather ForecastingJohn Russell

“Piz Kesch”


2012

2013

2014

2015

2016

2017+Summit Aurora

Lightweight coresGPU - accelerated hybrid

DARPA HPCS

Tsuname-3.0 U. Tokyo &Tsukuba

2011

post-KMulti-core

MeteoSwiss

5

TaihuLight


State of the art implementation of new system for MeteoSwiss

• New system needs to be installed Q3/2015

• Assuming 2x improvement in per-socket performance:~20x more X86 sockets would require 30 Cray XC cabinets

Current Cray XC30/XC40 platform (space for 40 racks XC)

New system for Meteo Swiss if we build it like the German Weather Service (DWD) did theirs, or UK Met Office, or ECMWF … (30 racks XC)

Albis & Lema: 3 cabinets Cray XE6 installed Q2/2012

Thinking inside the box was not a good option!CSCS machine room

6


COSMO-OPCODE:

7

% Code Lines (F90) % Runtime

Runtime based 2 km production model of MeteoSwiss

Original code (with OpenACC for GPU) Rewrite in C++ (with CUDA backend for GPU)

‣monolithic Fortran 90 code ‣ 250,000 lines of code

a legacy code migration project


How we solved the problem

8

1

5

10

15

20

25

30

35

40

Constant budget for investments and operations

Grid 2.2 km ! 1.1 km

24x Ensemble with multiple forecasts

Data assimilation

10x

1.7x from software refactoring (old vs. new implementation on x86)

2.8x Mathematical improvements (resource utilisation)

2.8x Moore’s Law on x86

2.3x Change in architecture (CPU ! GPU)

1.3x additional processors

Requirements from MeteoSwiss

6x

Investment in software allowed mathematical improvements and change in architecture

There is no silver bullet!

Bonus: reduction in power!

T. Schulthess25 Years CSCS, Lugano, Wednesday October 19, 2016 9

$500,000,000 $2,000,000,000

$13,000Source: Andy Keane @ ISC’10

Why commodity processors?



The end of Denard Scaling

11

L/

W/

tox

/

xd/

NA 1/2

constant

n+ n+

source drain

GATE

WIRING

p substrate, doping

Voltage, V/

SCALING Voltage: Oxide: Wire width: Gate Width: Diffusion: Substrate:

CONSEQUENCE: Higher density: Higher speed: Power/ckt: Power density:

V/tox

/

W/

L/

NA

xd/

↵2

↵

Oxide layer thickness ~1nm

Source: Ronald Luijten, IBM-ZRL

The end of Robert H. Dennard (1974)



Who consumes how much energy on a 28nm processor

13

•64 bit floating point unit: 20 pJ

•256-bit access 8kB SRAM: 50 pJ

•256-bit bus across die: 1,000 pJ

•Read/write to DRAM: 16,000 pJ

20 mm

Source: Bill Dally, 2011By a wide margin, most energy is spend in moving data on the die and to memory

Developing algorithms that maximise data locality should be THE TOP PRIORITY


Source: Rajeeb Hazra’s (HPC@Intel) talk at SOS14, March 2010


“Piz Daint”

15

Cray XC30 with 5272 hybrid, GPU accelerated compute nodes

Compute node: > Host: Intel Xeon E5 2670 (SandyBridge 8c) > Accelerator: One NVIDIA K20X GPU (GK110)

13 m (~0.43 µs)

2.5 GHz (~0.38 ns)

0.73 GHz (~1.4ns)

Latency (μs) Bandwidth (GB/s)Aries chip 1.28 9.89Within Chassis 1.61 9.65Between Chassis 1.58 9.82Between Groups 2.53 9.63


Architectural diversity is here to stay, because it is a consequence of the dusk of CMOS scaling

(Moore’s Law)

16

What are the implications?

Complexity in software is one, but we don’t understand all implications

Physics of the computer matters more than ever


velocities

pressure

temperature

water

turbulence

physics et al. tendencies vertical adv. water adv.horizontal adv.3x fast wave solver~10x1x

Timestepexplicit (leapfrog)implicit (sparse solver)explicit (RK3)implicit (sparse)


Stencil example: Laplace operator in 2D

18

lap(i,j,k) = –4.0 * data(i,j,k) + data(i+1,j,k) + data(i-1,j,k) + data(i,j+1,k) + data(i,j-1,k);


1. Loop-logic defines stencil application domain and order2. Stencil defines the operator to be applied

do k = kstart, kend do j = jstart, jend do i = istart, iend lap(i, j, k) = -4.0 * data(i, j , k) + & data(i+1, j, , k) + data(i-1, j , k) + & data(i , j+1, k) + data(i , j-1, k) end do end doend do

Two main components of an operator on a structured grid


IJKRealField lapfield, datafield;

Stencil stencil;

StencilCompiler::Build(

pack_parameters(

Param<lap, cInOut>(lapfield),

Param<data, cIn>(datafield)

),

concatenate_sweeps(

define_sweep<KLoopFullDomain>(

define_stages(

StencilStage<Laplace, IJRangeComplete>()

)

)

)

);

stencil.Apply();

enum data, lap ;

template<typename TEnv>

struct Laplace

STENCIL_STAGE(Tenv)

STAGE_PARAMETER(FullDomain, data)

STAGE_PARAMETER(FullDomain, lap)

static void Do()

lap::Center() =

-4.0 * data::Center() +

data::At(iplus1) +

data::At(iminus1) +

data::At(jplus1) +

data::At(jminus1);

;

T. SchulthessGTC Europe, Amsterdam, Thursday September 29, 2016 21

enum data, lap ;

template<typename TEnv>

struct Laplace

STENCIL_STAGE(Tenv)

STAGE_PARAMETER(FullDomain, data)

STAGE_PARAMETER(FullDomain, lap)

static void Do()

lap::Center() =

-4.0 * data::Center() +

data::At(iplus1) +

data::At(iminus1) +

data::At(jplus1) +

data::At(jminus1);

;

IJKRealField lapfield, datafield;

Stencil stencil;

StencilCompiler::Build(

pack_parameters(

Param<lap, cInOut>(lapfield),

Param<data, cIn>(datafield)

),

concatenate_sweeps(

define_sweep<KLoopFullDomain>(

define_stages(

StencilStage<Laplace, IJRangeComplete>()

)

)

)

);

stencil.Apply();

Stencil Loop logic


Architecture dependent backend

• The same user-level code can be compiled with different, architecture dependent backends

• multi-core CPU (x86) – SIMD • kij-storage • ij-blocking • Coarse: OpenMP theads • Fine: vectorisation by compiler

• GPU (Tesla) – SIMT • ijk-storage • Coarse: CUDA thread blocks • Fine: CUDA threads • software managed caching

STELLA backends

•  The same high-level user code can be compiled using different backends

•  CPU (x86 multi-core) •  kij-storage •  ij-blocking •  Coarse: OpenMP threads •  Fine: vectorization by compiler

•  GPU (NVIDIA) •  ijk-storage •  Coarse: CUDA thread blocks •  Fine: CUDA threads •  software managed caching

A single switch chooses the

STELLA backend

Horizontal IJ-plane

Block0 Block1

Block2 Block3

Coarse-grained parallelism"

Fine-grained parallelism (vectorization)"


References and Collaborators

• Peter Messmer and his team at the NVIDIA co-design lab at ETH Zurich • Teams at CSCS and Meteo Suisse, group of Christoph Schaer @ ETH Zurich • O. Fuhrer, C. Osuna, X. Lapillonne, T. Gysi, B. Cumming, M. Bianco, A. Arteaga, T. C. Schulthess, “Towards a performance portable, architecture agnostic implementation strategy for weather and climate models”, Supercomputing Frontiers and Innovations, vol. 1, no. 1 (2014), see superfri.org

• G. Fourestey, B. Cumming, L. Gilly, and T. C. Schulthess, “First experience with validating and using the Cray power management database tool”, Proceedings of the Cray Users Group 2014 (CUG14) (see arxiv.org for reprint)

• B. Cumming, G. Fourestey, T. Gysi, O. Fuhrer, M. Fatica, and T. C. Schulthess, “Application centric energy-efficiency study of distributed multi-core and hybrid CPU-GPU systems”, Proceedings of the International Conference on High-Performance Computing, Networking, Storage and Analysis, SC’14, New York, NY, USA (2014). ACM

• T. Gysi, C. Osuna, O. Fuhrer, M. Bianco and T. C. Schulthess, “STELLA: A domain-specific tool for structure grid methods in weather and climate models”, to be published in Proceedings of the International Conference on High-Performance Computing, Networking, Storage and Analysis, SC’15, New York, NY, USA (2015). ACM

http://superfri.org

http://arxiv.org


COSMO: old and new (refactored) code

24

main (current / Fortran)

physics (Fortran)

dynamics (Fortran)

MPI

system

main (new / Fortran)

physics (Fortran)

with OpenMP / OpenACC

dynamics (C++)

MPI or whatever

system

Generic Comm. Library

boundary conditions & halo exchg.

stencil library

X86 GPU

Shared Infrastructure


Physical model

Algorithmic description

Compilation

Computer

Imperative codelap(i,j,k) = –4.0 * data(i,j,k) + data(i+1,j,k) + data(i-1,j,k) + data(i,j+1,k) + data(i,j-1,k);

Domain science & applied mathematics

Computer engineering

370 NATURE PHYSICS | VOL 11 | MAY 2015 | www.nature.com/naturephysics

COMMENTARY | FOCUS

Whereas this long-term sustained exponential growth had profound impact on the productivity of scientists and opened many new avenues in physics research, not all types of problems in scientific computing have seen the same performance improvements. For example, the sustained performance of climate codes, as documented by the European Centre for Medium-Range Weather Forecasts (ECMWF) over approximately the same period as the Top500 project, has improved only by a factor of 100 per decade (Peter Bauer, manuscript in preparation). This is still an exponential growth, but it demonstrates the significant decrease in efficiency for software applications in some fields. This is more important, as meteorological and climate simulations have been around since the dawn of modern computing1. They rely on complex, but typically well-engineered computer codes that have been designed to run on the top supercomputing systems. If experts use computers inefficiently, what does this say about the applications developed by regular researchers?

In this Commentary, I discuss state-of-the-art programming and parallelization in physics today. I try to analyse the challenges in writing efficient scientific software and examine possible ways in which physicists can deal with the rapidly increasing complexity of computer architectures. To do so it is important to first recall the main uses of computing in physics.

Imperative code

Compilation

Physical modelMathematical description


Computer

Domain science and applied mathematics


= − ∇p + ρg − 2Ω×(ρv) + F

= − (cpd /cvdp∇· v + (cpd /cvd−1)Qh

= p· +Qh

= − ∇ · Fv − (Il + I f)= ∇ · (P l,f + F l,f) + Il,f

= p[Rd (1+(Rυ/Rd−1) qυ− q l−qf)T ]−1

ODSLMN ± GDWDLMNGDWDLMNGDWDLMNGDWDLMNGDWDLMN

Wind ρv·

Pressure p·

Water ρq· υ

ρq· l,f

Density ρ

Temperature ρcpd T·

Figure 1 | Traditional computational science workflow. A physical model (of the atmosphere, for instance) is first translated into mathematical equations (here, non-hydrostatic compressible Euler equations), which in turn are solved with algorithms (such as finite differences on a structured grid), implemented in a program (for example, stencil code), and subsequently compiled into machine code that executes on a canonical computer architecture. The green line marks the separation of work. The physical model image is adapted from ref. 9, NPG. The supercomputer image © British Crown Copyright, The Met Office / Science Photo Library.

Predictions and data analysisLong before the advent of modern computing, modelling and simulation were used in physics in two ways. The first and best known (which we call the traditional way) is the use of computers to solve challenging theoretical problems that have no known analytical solution. In this case, the theory is well understood and the governing equations are solved numerically with elaborate computational methods to make quantitative and verifiable predictions. Sometimes the numerical solution of a theoretical problem may lead to new insights in its own right, as was the case with the discovery of the fluctuation theorem2. This was an argument for defining computer simulations as a third, independent pillar of science, complementing theory and experiment3. For our purpose, this distinction is not necessary, as from a computational point of view we are still solving known equations. The simulations are carefully planned — that is, the mathematical analysis and algorithms are well known and the elaborate computer codes, as in the case of climate simulations, have been developed and optimized. Scientists, and physicists in particular, will not shy away from great efforts in using cutting-edge technologies to solve such problems, and they will use imperative programming languages such as C or FORTRAN with machine-level codes to squeeze every last bit of performance out of a computing system.

The second, and profoundly different, use is the analysis of experimental data with the help of modelling and simulations before the theory and governing equations are known. This is essentially what Johannes Kepler did when he analysed Tycho Brahe’s planetary orbit data with heliocentric elliptical models to discover the three famous laws that now carry his name — Newton’s theory of gravitation, which explains Kepler’s laws, came later. Scientists today use computers to rapidly prototype models, thereby assimilating in a matter of seconds or minutes many orders of magnitude more data than Kepler did in months of laborious manual computations. Along with the development of electronic computing came large experimental facilities, which significantly increased the importance of systematic exploratory tools for data analysis. This lead to a substantial improvement of mathematical algorithms over the past few decades, which, together with the emergence of social media on the World Wide Web, have made this exploratory use relevant to areas outside of natural sciences, for instance in economics and social sciences. These have, in turn, led to the argument that a fundamentally new, fourth paradigm of science is emerging: ‘data science’3. For our present purpose, however, this distinction is again not necessarily important. But, for this second exploratory use of modelling and simulation scientists use more descriptive programming languages like Python or Ruby, and they rely on existing libraries even if they are not optimized.

Scientists and computer engineersProgramming serves two complementary purposes: one is to specify the computation and the other is to manage computer resources. Most scientists are familiar with the former, whereas the latter is considered to be primarily the concern of computer engineers. The distinction is important as it allows a clear separation of concerns: scientists only need to know about the complexity of models and mathematics, and system engineers only need to focus on the complexity of the computer.

In this ideal case, the programming environment allows scientists to specify the computational tasks in terms of human-readable equations — descriptive programming — that are independent of the underlying system, which is portable across many platforms. The Python programming language, with its many associated libraries and tools, provides such an environment, but at the cost of performance. When the computation is big and has to be scaled, performance does matter. In this case scientists have the choice of algorithms

© 2015 Macmillan Publishers Limited. All rights reserved

Mathematical description

Schulthess, Nature Physics, vol 11, 369-373 (2015)


Imperative code

Compilation

Computer



Physical model





Physical model



Imperative code

Compilation

Computer


Computer engineering Schulthess, Nature Physics, vol 11, 369-373 (2015)


Imperative code

Compilation

Computer



Physical model


H = −t!

⟨ij⟩,σ

c†iσcjσ + U!

i

ni↑ni↓

Algorithmic descriptionGc(si, lk+1) = Gc(si, l0) + [a0|a1|...|ak] × [b0|b1|...|bk]t

Gc(si, lk+1) = Gc(si, lk) + ak × btk

28



Physical model


Imperative code

Compilation





COMMENTARY | FOCUS



Imperative code

Compilation



Computer



= − ∇p + ρg − 2Ω×(ρv) + F


= p· +Qh


= p[Rd (1+(Rυ/Rd−1) qυ− q l−qf)T ]−1


Wind ρv·

Pressure p·

Water ρq· υ

ρq· l,f

Density ρ












Imperative code

Compilation





COMMENTARY | FOCUS



Imperative code

Compilation



Computer



= − ∇p + ρg − 2Ω×(ρv) + F


= p· +Qh


= p[Rd (1+(Rυ/Rd−1) qυ− q l−qf)T ]−1


Wind ρv·

Pressure p·

Water ρq· υ

ρq· l,f

Density ρ









Physical model



Physical model


Imperative code

Compilation





COMMENTARY | FOCUS



Imperative code

Compilation



Computer



= − ∇p + ρg − 2Ω×(ρv) + F


= p· +Qh


= p[Rd (1+(Rυ/Rd−1) qυ− q l−qf)T ]−1


Wind ρv·

Pressure p·

Water ρq· υ

ρq· l,f

Density ρ










T. SchulthessSimons Foundation Lectures, Frontier of Data Science, New York, Wednesday, May 11, 2016 32

Science applications using a descriptive and dynamic developer environment



Imperative code

Architecture 1

Compiler frontend

Optimisation / low-level libraries / runtime

Architecture specific backends

Architecture 2 Architecture N…

Multi-disciplinary co-design of tools, libraries, programming environment

dynamic environmentfor model develop.

tools forhigh-performance

scientific computing


iPython/notebookJUPYTER


The good news

33

C++ 11, 14, HPX-3, … 17, 20, …

C++ standard is evolving quickly and implementations follow!


Who will pay for the implementation of Fortran, OpenACC, OpenMP, …?


Source: Andy Keane @ ISC’10

for 8 GPUs, or $16k a piece

$500,000,000 $2,000,000,000

$13,000



Areas of interest include (but are not limited to):

- Theuseofadvancedcomputingsystemsforlarge-scalescientificapplications- Implementationstrategiesforscienceapplicationsinenergy-efficientcomputing architectures- Domain-specific,languages,librariesorframeworks- Theintegrationoflarge-scaleexperimentalandobservationalscientificdataandhigh- performance data analytics and computing- Bestpracticesforsustainablesoftwaredevelopmentandscientificapplication development

Committee ChairsJackWells(OakRidgeNationalLaboratory,USA)TorstenHoefler(ETHZurich,Switzerland)

Submission GuidelinesWeinvitepapersof5-10pagesinlength,whichwillberevieweddoubleblind.Fullsubmis-sionsguidelinescanbefoundatwww.pasc17.org.

-Submissionsclose:12 December 2016-Firstreviewnotification:31 January 2017-Revisedsubmissionsclose:1 March 2017-Finalacceptancenotification:11 April 2017

Conference Participation and ProceedingsAcceptedmanuscriptswillbepublished in the ACM Digital Libraryonthefirstdayoftheconference.Authorswillbegiven30-minutepresentationslotsattheconference,groupedintopicallyfocused,parallelsessions.

Post-Conference Journal SubmissionFollowingtheconference,authorswillhavetheopportunity to develop their papers for publication in a relevant, computationally focused, domain-specific journal.

Authorsthusstandtobenefitfromtherapidandbroaddisseminationofresultsaffordedbytheconferencevenueandassociatedproceedings,and,fromtheimpactassociatedwithpublicationinahigh-qualityscientificjournal.

The Platform for Advanced Scientific Computing (PASC) invites submissions for the PASC17 Conference, co-sponsored by the Association for Computing Machinery (ACM) and SIGHPC, which will be held at the Palazzo dei Congressi in Lugano, Switzerland, from June 26 to 28, 2017.

PASC17isaninterdisciplinaryeventinhighperformancecomputingthatbringstogetherdomainscience,appliedmathematicsandcomputerscience-wherecomputerscienceisfocusedonenablingtherealizationofscientificcomputation.

Wearesolicitinghigh-qualitycontributionsoforiginalresearchrelatingtohighperformancecomputingineightdomain-specifictracks:

pasc17.pasc-conference.org

CLIMATE &WEATHER

SOLID EARTH DYNAMICS

LIFE SCIENCES

CHEMISTRY&MATERIALS

PHYSICS

COMPUTER SCIENCE & APPLIED MATHEMATICS

ENGINEERING

EMERGING DOMAINS (SPECIAL TOPIC: PRECISION MEDICINE)

Call for Papers

Documents

Exascale computing: endgame or new beginning for … · Exascale computing: endgame or new beginning for ... Center() =-4.0 * data ... the International Conference on High-Performance