Tomorrow’s Exascale Systems: Not Just Bigger Versions of Today’s Peta-Computers
Thomas Sterling
Professor of Informatics and Computing, Indiana University
Chief Scientist and Associate Director Center for Research in Extreme Scale Technologies (CREST)
School of Informatics and Computing
Indiana University
November 20, 2013
Tianhe-2: Half-way to Exascale
• China, 2013: the 30 PetaFLOPS dragon
• Developed in cooperation between NUDT and Inspur for the National Supercomputer Center in Guangzhou
• Peak performance of 54.9 PFLOPS
– 16,000 nodes contain 32,000 Xeon Ivy Bridge processors and 48,000 Xeon Phi accelerators, totaling 3,120,000 cores
– 162 cabinets in a 720 m2 footprint
– Total 1.404 PB memory (88 GB per node)
– Each Xeon Phi board utilizes 57 cores for an aggregate 1.003 TFLOPS at a 1.1 GHz clock
– Proprietary TH Express-2 interconnect (fat tree with thirteen 576-port switches)
– 12.4 PB parallel storage system
– 17.6 MW power consumption under load; 24 MW including (water) cooling
– 4,096 SPARC V9-based Galaxy FT-1500 processors in the front-end system
Exaflops by 2019 (maybe)
[TOP500 performance development chart, 1994–2020: SUM, N=1, and N=500 trend lines rising from 100 Mflop/s toward 1 Eflop/s]
Courtesy of Erich Strohmaier, LBNL
Elements of an MFE Integrated Model Complex Multi-scale, Multi-physics Processes
Courtesy of Bill Tang, Princeton
GTC simulation | Computer name | PE# used | Speed (TF) | Particle # | Time steps | Physics Discovery (Publication)
---------------|---------------|----------|------------|------------|------------|--------------------------------
1998 | Cray T3E (NERSC) | 10^2 | 10^-1 | 10^8 | 10^4 | Ion turbulence zonal flow (Science, 1998)
2002 | IBM SP (NERSC) | 10^3 | 10^0 | 10^9 | 10^4 | Ion transport scaling (PRL, 2002)
2007 | Cray XT3/4 (ORNL) | 10^4 | 10^2 | 10^10 | 10^5 | Electron turbulence (PRL, 2007); EP transport (PRL, 2008)
2009 | Jaguar/Cray XT5 (ORNL) | 10^5 | 10^3 | 10^11 | 10^5 | Electron transport scaling (PRL, 2009); EP-driven MHD modes
2012-13 (current) | Cray XT5/Titan (ORNL); Tianhe-1A (China) | 10^5 | 10^4 | 10^12 | 10^5 | Kinetic-MHD; Turbulence + EP + MHD
2018 (future) | To Extreme Scale HPC Systems | 10^6 | | 10^13 | 10^6 | Turbulence + EP + MHD + RF

Progress in Turbulence Simulation Capability (Faster Computer) → Achievement of Improved Fusion Energy Physics Insights
* Example here of the GTC code (Z. Lin, et al.) delivering production runs @ TF in 2002 and PF in 2009
Courtesy of Bill Tang, Princeton
Practical Constraints for Exascale
• Sustained performance
– Exaflops
– 100 Petabytes
– 125 Petabytes/sec.
• Cost
– Deployment: $200M
– Operational support
• Power
– Energy required to run the computer
– Energy for cooling (removing heat from the machine)
– 20 Megawatts
• Reliability
– One factor of availability
• Generality
– How good is it across a range of problems?
– Strong scaling
• Productivity
– User programmability
– Performance portability
• Size
– Floor space: 4,000 sq. meters
– Access way for power and signal cabling
Execution Model Phase Change
• Guiding principles for system design and operation
– Semantics, Mechanisms, Policies, Parameters, Metrics
– Driven by technology opportunities and challenges
– Historically catalyzed by paradigm shift
• Decision chain across system layers
– For reasoning towards optimization of design and operation
• Essential for co-design of all system layers
– Architecture, runtime and OS, programming models
– Reduces design complexity from O(N^2) to O(N)
– Enables holistic reasoning about concepts and tradeoffs
• Empowers discrimination, commonality, portability
– Establishes a phylum of HPC-class systems
Execution model timeline: Von Neumann Model (1949) → SIF-MOE Model (1968) → Vector Model (1975) → SIMD-array Model (1983) → CSP Model (1991) → ?? Model (2020)
Total Power
[Chart: total power (MW), 0 to 10, 1992–2012, for heavyweight, lightweight, and heterogeneous systems]
Courtesy of Peter Kogge, UND
Technology Demands a New Response
Total Concurrency
[Chart: total concurrency (flops/cycle), 10^1 to 10^7, 1992–2012, for heavyweight, lightweight, and heterogeneous systems]
Courtesy of Peter Kogge, UND
Performance Factors - SLOWER
• Starvation
– Insufficiency of concurrency of work
– Impacts scalability and latency hiding
– Affects programmability
• Latency
– Time-measured distance for remote access and services
– Impacts efficiency
• Overhead
– Critical-time additional work to manage tasks & resources
– Impacts efficiency and granularity for scalability
• Waiting for contention resolution
– Delays due to simultaneous access requests to shared physical or logical resources
P = e(L,O,W) * S(s) * a(r) * U(E)

where:
P – performance (ops)
e – efficiency (0 < e < 1)
s – application's average parallelism
a – availability (0 < a < 1)
r – reliability (0 < r < 1)
U – normalization factor/compute unit
E – watts per average compute unit
The Negative Impact of Global Barriers in Astrophysics Codes
Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations) using 1M particles over four timesteps on 128 processors. Red indicates computation; blue indicates waiting for communication.
Goals of a New Execution Model for Exascale
• Serve as a discipline to govern future scalable system architectures, programming methods, and runtime
• Latency hiding at all system distances
– Latency-mitigating architectures
• Exploit parallelism in diversity of forms and granularity
• Provide a framework for efficient fine-grain synchronization and scheduling (dispatch)
• Enable optimized runtime adaptive resource management and task scheduling for dynamic load balancing
• Support full virtualization for fault tolerance and power management, and continuous optimization
• Self-aware infrastructure for power management
• Semantics of failure response for graceful degradation
• Complexity of operation as an emergent behavior from simplicity of design, high replication, and local adaptation for global optima in time and space
ParalleX Execution Model
• Lightweight multi-threading
– Divides work into smaller tasks
– Increases concurrency
• Message-driven computation
– Move work to data
– Keeps work local, stops blocking
• Constraint-based synchronization
– Declarative criteria for work
– Event driven
– Eliminates global barriers
• Data-directed execution
– Merger of flow control and data structure
• Shared name space
– Global address space
– Simplifies random gathers
ParalleX Addresses Critical Challenges (1)
• Starvation
– Lightweight threads for an additional level of parallelism
– Lightweight threads with rapid context switching for non-blocking
– Low overhead for finer granularity and more parallelism
– Parallelism discovery at runtime through data-directed execution
– Overlap of successive phases of computation for more parallelism
• Latency
– Lightweight thread context switching for non-blocking
– Overlap computation and communication to limit effects
– Message-driven computation to reduce latency to put work near data
– Reduce number and size of global messages
ParalleX Addresses Critical Challenges (2)
• Overhead
– Eliminates (mostly) global barriers
– However, ultimately will require hardware support in the limit
– Uses synchronization objects exhibiting high semantic power
– Reduces context switching time
– Not all actions require thread instantiation
• Waiting due to contention
– Adaptive resource allocation with redundant resources (like hardware support for threads)
– Eliminates polling and reduces the number of sources of synchronization contention
HPX Runtime Design
• The current version of HPX provides the following infrastructure, as defined by the ParalleX execution model:
– Complexes (ParalleX Threads) and ParalleX Thread Management
– Parcel Transport and Parcel Management
– Local Control Objects (LCOs)
– Active Global Address Space (AGAS)
Overlapping computational phases for hydrodynamics
MPI vs. HPX
Computational phases for LULESH (a mini-app for hydrodynamics codes). Red indicates work; white indicates waiting for communication. Overdecomposition: MPI used 64 processes, while HPX used 1,000 threads spread across 64 cores.
Dynamic load balancing via message-driven work-queue execution for Adaptive Mesh Refinement (AMR)
Application: Adaptive Mesh Refinement (AMR) for Astrophysics simulations
Conclusions
• HPC is in a (6th) phase change
• Ultra high scale computing of the next decade will require a new model of computation to effectively exploit new technologies and guide system co-design
• ParalleX is an example of an experimental execution model that addresses key challenges to Exascale
• Early experiments prove encouraging for enhancing the scaling of graph-based, numeric-intensive, and knowledge-management applications