Tomorrow’s Exascale Systems: Not Just Bigger Versions of Today’s Peta-Computers
Thomas Sterling
Professor of Informatics and Computing, Indiana University
Chief Scientist and Associate Director Center for Research in Extreme Scale Technologies (CREST)
School of Informatics and Computing
Indiana University
November 20, 2013
Tianhe-2: Half-way to Exascale
• China, 2013: the 30 PetaFLOPS dragon
• Developed in cooperation between NUDT and Inspur for the National Supercomputer Center in Guangzhou
• Peak performance of 54.9 PFLOPS
– 16,000 nodes contain 32,000 Xeon Ivy Bridge processors and 48,000 Xeon Phi accelerators, totaling 3,120,000 cores
– 162 cabinets in a 720 m2 footprint
– Total 1.404 PB memory (88 GB per node)
– Each Xeon Phi board utilizes 57 cores for an aggregate 1.003 TFLOPS at a 1.1 GHz clock
– Proprietary TH Express-2 interconnect (fat tree with thirteen 576-port switches)
– 12.4 PB parallel storage system
– 17.6 MW power consumption under load; 24 MW including (water) cooling
– 4,096 SPARC V9-based Galaxy FT-1500 processors in the front-end system
Exaflops by 2019 (maybe)
[TOP500 performance development chart, 1994–2020: SUM, N=1, and N=500 trend lines rising from 100 Mflop/s toward 1 Eflop/s]
Courtesy of Erich Strohmaier, LBNL
Elements of an MFE Integrated Model Complex Multi-scale, Multi-physics Processes
Courtesy of Bill Tang, Princeton
GTC simulation | Computer name | PE# used | Speed (TF) | Particle # | Time steps | Physics Discovery (Publication)
---------------|---------------|----------|------------|------------|------------|--------------------------------
1998 | Cray T3E (NERSC) | 10^2 | 10^-1 | 10^8 | 10^4 | Ion turbulence zonal flow (Science, 1998)
2002 | IBM SP (NERSC) | 10^3 | 10^0 | 10^9 | 10^4 | Ion transport scaling (PRL, 2002)
2007 | Cray XT3/4 (ORNL) | 10^4 | 10^2 | 10^10 | 10^5 | Electron turbulence (PRL, 2007); EP transport (PRL, 2008)
2009 | Jaguar/Cray XT5 (ORNL) | 10^5 | 10^3 | 10^11 | 10^5 | Electron transport scaling (PRL, 2009); EP-driven MHD modes
2012-13 (current) | Cray XT5/Titan (ORNL); Tianhe-1A (China) | 10^5 | 10^4 | 10^12 | 10^5 | Kinetic-MHD; Turbulence + EP + MHD
2018 (future) | To Extreme Scale HPC Systems | 10^6 | | 10^13 | 10^6 | Turbulence + EP + MHD + RF

Progress in Turbulence Simulation Capability (Faster Computer) → Achievement of Improved Fusion Energy Physics Insights
* Example here of the GTC code (Z. Lin, et al.) delivering production runs @ TF in 2002 and PF in 2009
Courtesy of Bill Tang, Princeton
Practical Constraints for Exascale
• Sustained performance
– Exaflops
– 100 Petabytes
– 125 Petabytes/sec.
• Cost
– Deployment: $200M
– Operational support
• Power
– Energy required to run the computer
– Energy for cooling (removing heat from the machine)
– 20 Megawatts
• Reliability
– One factor of availability
• Generality
– How good is it across a range of problems?
– Strong scaling
• Productivity
– User programmability
– Performance portability
• Size
– Floor space: 4,000 sq. meters
– Access way for power and signal cabling
Execution Model Phase Change
• Guiding principles for system design and operation
– Semantics, Mechanisms, Policies, Parameters, Metrics
– Driven by technology opportunities and challenges
– Historically catalyzed by paradigm shift
• Decision chain across system layers
– For reasoning towards optimization of design and operation
• Essential for co-design of all system layers
– Architecture, runtime and OS, programming models
– Reduces design complexity from O(N^2) to O(N)
– Enables holistic reasoning about concepts and tradeoffs
• Empowers discrimination, commonality, portability
– Establishes a phylum of HPC-class systems
Execution model timeline: Von Neumann Model (1949) → SIF-MOE Model (1968) → Vector Model (1975) → SIMD-array Model (1983) → CSP Model (1991) → ?? Model (2020)
Total Power
[Chart: total power (MW), 0 to 10, 1992–2012, for heavyweight, lightweight, and heterogeneous systems]
Courtesy of Peter Kogge, UND
Technology Demands a New Response
Total Concurrency
[Chart: total concurrency (flops/cycle), 10^1 to 10^7, 1992–2012, for heavyweight, lightweight, and heterogeneous systems]
Courtesy of Peter Kogge, UND
Performance Factors - SLOWER
• Starvation
– Insufficiency of concurrency of work
– Impacts scalability and latency hiding
– Affects programmability
• Latency
– Time-measured distance for remote access and services
– Impacts efficiency
• Overhead
– Critical-time additional work to manage tasks & resources
– Impacts efficiency and granularity for scalability
• Waiting for contention resolution
– Delays due to simultaneous access requests to shared physical or logical resources
P = e(L,O,W) * S(s) * a(r) * U(E)

where:
P – performance (ops)
e – efficiency (0 < e < 1)
s – application's average parallelism
a – availability (0 < a < 1)
r – reliability (0 < r < 1)
U – normalization factor/compute unit
E – watts per average compute unit
The Negative Impact of Global Barriers in Astrophysics Codes
Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations) using 1M particles over four timesteps on 128 processors. Red indicates computation; blue indicates waiting for communication.
Goals of a New Execution Model for Exascale
• Serve as a discipline to govern future scalable system architectures, programming methods, and runtime
• Latency hiding at all system distances
– Latency-mitigating architectures
• Exploit parallelism in diversity of forms and granularity
• Provide a framework for efficient fine-grain synchronization and scheduling (dispatch)
• Enable optimized runtime adaptive resource management and task scheduling for dynamic load balancing
• Support full virtualization for fault tolerance and power management, and continuous optimization
• Self-aware infrastructure for power management
• Semantics of failure response for graceful degradation
• Complexity of operation as an emergent behavior from simplicity of design, high replication, and local adaptation for global optima in time and space
ParalleX Execution Model
• Lightweight multi-threading
– Divides work into smaller tasks
– Increases concurrency
• Message-driven computation
– Move work to data
– Keeps work local, stops blocking
• Constraint-based synchronization
– Declarative criteria for work
– Event driven
– Eliminates global barriers
• Data-directed execution
– Merger of flow control and data structure
• Shared name space
– Global address space
– Simplifies random gathers
ParalleX Addresses Critical Challenges (1)
• Starvation
– Lightweight threads for an additional level of parallelism
– Lightweight threads with rapid context switching for non-blocking
– Low overhead for finer granularity and more parallelism
– Parallelism discovery at runtime through data-directed execution
– Overlap of successive phases of computation for more parallelism
• Latency
– Lightweight thread context switching for non-blocking
– Overlap computation and communication to limit effects
– Message-driven computation to reduce latency to put work near data
– Reduce number and size of global messages
ParalleX Addresses Critical Challenges (2)
• Overhead
– Eliminates (mostly) global barriers
– However, ultimately will require hardware support in the limit
– Uses synchronization objects exhibiting high semantic power
– Reduces context switching time
– Not all actions require thread instantiation
• Waiting due to contention
– Adaptive resource allocation with redundant resources (like hardware support for threads)
– Eliminates polling and reduces the number of sources of synchronization contention
HPX Runtime Design
• The current version of HPX provides the following infrastructure, as defined by the ParalleX execution model:
– Complexes (ParalleX Threads) and ParalleX Thread Management
– Parcel Transport and Parcel Management
– Local Control Objects (LCOs)
– Active Global Address Space (AGAS)
Overlapping computational phases for hydrodynamics
MPI vs. HPX
Computational phases for LULESH (a mini-app for hydrodynamics codes). Red indicates work; white indicates waiting for communication. Overdecomposition: MPI used 64 processes, while HPX used 1,000 threads spread across 64 cores.
Dynamic load balancing via message-driven work-queue execution for Adaptive Mesh Refinement (AMR)
Application: Adaptive Mesh Refinement (AMR) for Astrophysics simulations
Conclusions
• HPC is in a (6th) phase change
• Ultra high scale computing of the next decade will require a new model of computation to effectively exploit new technologies and guide system co-design
• ParalleX is an example of an experimental execution model that addresses key challenges to Exascale
• Early experiments prove encouraging for enhancing the scaling of graph-based, numeric-intensive, and knowledge-management applications