ORNL is managed by UT-Battelle
for the US Department of Energy
Quantifying Resiliency in the
Extreme Scale HPC Co-Design
Space
Jeffrey S. Vetter
Jeremy Meredith
http://ft.ornl.gov [email protected]
Dagstuhl Seminar #15281
Algorithms and Scheduling Techniques to
Manage Resilience and Power Consumption
in Distributed Systems
9 Jul 2015
3
Overview
• Our community has major challenges in HPC as we move to extreme scale
– Power, Performance, Resilience, Productivity
– Major shifts in architectures, software, applications
– Not just HPC: the most uncertainty in two decades
• New technologies are emerging to address some of these challenges
– Heterogeneous computing
– Nonvolatile memory
• Consequently, we now have critical needs in
– Portable programming models
– Better design and analysis tools for procurement, optimization, etc.
• Aspen is a tool for structured design and analysis
– Co-design applications and architectures for performance, power, and resiliency
4
Surveying the HPC Landscape:
Today and Tomorrow
5
Notional Exascale Architecture Targets
(From Exascale Arch Report 2009)
| System attributes | 2001 | 2010 | "2015" | "2018" |
| System peak | 10 Tera | 2 Peta | 200 Petaflop/sec | 1 Exaflop/sec |
| Power | ~0.8 MW | 6 MW | 15 MW | 20 MW |
| System memory | 0.006 PB | 0.3 PB | 5 PB | 32-64 PB |
| Node performance | 0.024 TF | 0.125 TF | 0.5 TF or 7 TF | 1 TF or 10 TF |
| Node memory BW | | 25 GB/s | 0.1 TB/sec or 1 TB/sec | 0.4 TB/sec or 4 TB/sec |
| Node concurrency | 16 | 12 | O(100) or O(1,000) | O(1,000) or O(10,000) |
| System size (nodes) | 416 | 18,700 | 50,000 or 5,000 | 1,000,000 or 100,000 |
| Total node interconnect BW | | 1.5 GB/s | 150 GB/sec or 1 TB/sec | 250 GB/sec or 2 TB/sec |
| MTTI | | day | O(1 day) | O(1 day) |

Where two values appear under "2015" and "2018", they correspond to the report's two notional design points.

Parallel I/O ??

http://science.energy.gov/ascr/news-and-resources/workshops-and-conferences/grand-challenges/
6
Today’s Status
7
(Un-)Balanced Systems ??
| System attributes | 2001 (Seaborg) | 2010 (Jaguar) | 2014 (Titan) | est. 2018 (Summit) | Summit/Titan | "2015" (notional) | "2018" (notional) |
| System peak (PF) | 0.01 | 2 | 27 | 136 | 5.0 | 200 | 1,000 |
| Power (MW) | 0.8 | 6 | 9 | 10 | 1.1 | 15 | 20 |
| Node main memory (GB) | | 16 | 38 | 512 | 13.5 | | |
| System memory (PB) | 0.006 | 0.3 | 0.7106 | 1.7408 | 2.4 | 5 | 32-64 |
| Node persistent memory (GB) | | | | 800 | inf | | |
| System persistent memory (PB) | | | | 2.72 | inf | | |
| Node performance (TF) | 0.024 | 0.125 | 1.4 | 40 | 28.6 | 0.5 or 7 | 1 or 10 |
| Node memory BW | | 25 GB/s | | | | 0.1 or 1 TB/s | 0.4 or 4 TB/s |
| Node concurrency | 16 | 12 | | POWER9s + VOLTAs | | O(100) or O(1,000) | O(1,000) or O(10,000) |
| System size (nodes) | 416 | 18,700 | 18,700 | 3,400 | 0.2 | 50,000 or 5,000 | 1,000,000 or 100,000 |
| Total node interconnect BW | | 1.5 GB/s | | | | 150 GB/s or 1 TB/s | 250 GB/s or 2 TB/s |
| Injection bandwidth per node (GB/s) | | 7.6 | 20 | 23 | 1.2 | | |
| File system capacity (PB) | | 6 | 32 | 120 | 3.8 | | |
| File system bandwidth (TB/s) | | 0.3 | 1 | 1 | 1.0 | | |
| MTTI | | day | | | | O(1 day) | O(1 day) |

Where two values appear under the notional "2015" and "2018" columns, they are the two design points from the exascale report.

• Power is constant
• 1/5 of the node count
• Heterogeneous
• I/O and NIC bandwidth has plateaued
• NVM is new!
8
Notional Future Architecture

[Diagram: notional future node architecture, with nodes connected by an interconnection network]
Investigating Emerging Technologies
• Heterogeneous Computing
• NV Memory
• Optical interconnect, Silicon photonics
• Storage systems (key-value)
10
Earlier Experimental Computing Systems (past
decade)
• The past decade has started the trend away from traditional ‘simple’ architectures
• Examples
– Cell, GPUs, FPGAs, SoCs, etc
• Mainly driven by facilities costs and successful (sometimes heroic) application examples
[Figure: popular architectures since ~2004]
11
We’ve seen this for decades; Why is it different this
time?
Bob Colwell, Hotchips 25
IEEE Spectrum: Avago acquires Broadcom; Intel acquires Altera; who's next?
12
Emerging Computing Architectures – Now
• Heterogeneous processing
– Latency tolerant cores
– Throughput cores
– Special purpose hardware (e.g., AES, MPEG, RND)
• Memory
– Fused, configurable memory
– 2.5D and 3D Stacking
– HMC, HBM, WIDEIO2, LPDDR4, etc
– New devices (PCRAM, ReRAM)
• Interconnects
– Collective offload
– Scalable topologies
• Storage
– Active storage
– Non-traditional storage architectures (key-value stores)
• Improving performance and programmability in the face of increasing complexity
– Power, resilience
HPC (and mobile, enterprise, embedded) computer design is more fluid now than at any point in the past two decades.
13
Recent announcements
14
NVRAM Technology Continues to Improve – Driven by
Market Forces
http://www.eetasia.com/STATIC/ARTICLE_IMAGES/201212/EEOL_2012DEC28_STOR_MFG_NT_01.jpg
15
Comparison of emerging memory technologies
Jeffrey Vetter, ORNL
Robert Schreiber, HP Labs
Trevor Mudge, University of Michigan
Yuan Xie, Penn State University
| | SRAM | DRAM | eDRAM | 2D NAND Flash | 3D NAND Flash | PCRAM | STTRAM | 2D ReRAM | 3D ReRAM |
| Data retention | N | N | N | Y | Y | Y | Y | Y | Y |
| Cell size (F^2) | 50-200 | 4-6 | 19-26 | 2-5 | <1 | 4-10 | 8-40 | 4 | <1 |
| Minimum F demonstrated (nm) | 14 | 25 | 22 | 16 | 64 | 20 | 28 | 27 | 24 |
| Read time (ns) | <1 | 30 | 5 | 10^4 | 10^4 | 10-50 | 3-10 | 10-50 | 10-50 |
| Write time (ns) | <1 | 50 | 5 | 10^5 | 10^5 | 100-300 | 3-10 | 10-50 | 10-50 |
| Number of rewrites | 10^16 | 10^16 | 10^16 | 10^4-10^5 | 10^4-10^5 | 10^8-10^10 | 10^15 | 10^8-10^12 | 10^8-10^12 |
| Read power | Low | Low | Low | High | High | Low | Medium | Medium | Medium |
| Write power | Low | Low | Low | High | High | High | Medium | Medium | Medium |
| Power (other than R/W) | Leakage | Refresh | Refresh | None | None | None | None | Sneak | Sneak |
| Maturity | | | | | | | | | |
http://ft.ornl.gov/trac/blackcomb
16
Opportunities for NVM in Emerging Systems
• Burst Buffers
• In-mem tables
• In situ visualization
J.S. Vetter and S. Mittal, “Opportunities for Nonvolatile Memory Systems in Extreme-Scale High-Performance Computing,” Computing in Science &
Engineering, 17(2):73-82, 2015, 10.1109/MCSE.2015.4.
http://ft.ornl.gov/eavl
With so many complex architectural
choices, we need new methods and
tools
• Performance Portable Programming Models
• Design tools for architectures and applications
– Performance
– Resiliency
– Power
17
19
Workflow within the Exascale Ecosystem

[Diagram: the exascale co-design workflow. Application co-design (domain/algorithm analysis, proxy apps, application design) interacts with computer science co-design (analysis, programming models, tools, compilers, runtime, OS, I/O, system software, SW solutions) and hardware co-design (vendor and open analysis, models, simulators, emulators, experimental and prototype hardware, HW design stack), exchanging system designs and HW constraints.]
“(Application driven) co-design is the process
where scientific problem requirements influence
computer architecture design, and technology
constraints inform formulation and design of
algorithms and software.” – Bill Harrod (DOE)
Slide courtesy of ExMatEx Co-design team.
21
Prediction Techniques Ranked
22
Prediction Techniques Ranked
24
Aspen: Abstract Scalable Performance Engineering Notation
Creation
• Static analysis via compilers
• Empirical, Historical
• Manual for future applications
Use
• Interactive tools for graphs, queries
• Design space optimization
• Drive simulators
• Feedback to runtime systems
Representation in Aspen
• Modular
• Sharable
• Composable
• Reflects prog structure
Existing models for MD, UHPC CP 1,
Lulesh, 3D FFT, CoMD, VPFFT, …
Source code → Aspen code
K. Spafford and J.S. Vetter, “Aspen: A Domain Specific Language for Performance Modeling,” in SC12: ACM/IEEE International Conference for High Performance
Computing, Networking, Storage, and Analysis, 2012
Researchers are using Aspen for parallel applications, scientific workflows, capacity planning, quantum computing, etc
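To make the idea concrete, the following is a minimal sketch in plain Python (not Aspen syntax) of the kind of analytical model and query an Aspen-style tool supports: given a problem size and notional machine parameters, report flop and byte counts and a roofline-style lower bound on runtime. The kernel, machine numbers, and function names are illustrative assumptions, not taken from the talk.

    # Illustrative sketch only: a toy analytical model in plain Python, not Aspen syntax.
    # It mimics the kind of query an Aspen model answers: flops, bytes, and a
    # roofline-style runtime lower bound for a given problem size and machine.

    def matmul_model(n):
        """Toy model of a dense n x n single-precision matrix multiply."""
        flops = 2.0 * n ** 3             # one multiply-add per inner-loop iteration
        bytes_moved = 3 * 4.0 * n ** 2   # three n x n arrays, assuming ideal reuse
        return flops, bytes_moved

    def runtime_bound(flops, bytes_moved, peak_flops, mem_bw):
        """Lower bound: the kernel is limited either by compute or by memory bandwidth."""
        return max(flops / peak_flops, bytes_moved / mem_bw)

    # Hypothetical node: 10 TF/s peak, 1 TB/s memory bandwidth (assumed values).
    peak, bw = 10e12, 1e12
    for n in (1024, 4096, 16384):
        f, b = matmul_model(n)
        print(f"n={n}: flops={f:.3g}, bytes={b:.3g}, "
              f"time >= {runtime_bound(f, b, peak, bw):.2e} s")

A real Aspen model expresses such counts declaratively so that tools can query them, sweep design parameters, or drive simulators, rather than requiring hand-written scripts like this one.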
25
Manual Example of LULESH
26
Creating Aspen Models
S. Lee, J.S. Meredith, and J.S. Vetter, “COMPASS: A Framework for Automated Performance Modeling and Prediction,” in ACM
International Conference on Supercomputing (ICS). Newport Beach, California: ACM, 2015, 10.1145/2751205.2751220.
27
Simple MM example generated from COMPASS
28
Example Queries
Automated Big O Notation
Idealized Concurrency
29
Example Queries
Computational Intensity and Memory Usage per Subroutine
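As a rough illustration of this query, the sketch below (plain Python; the subroutine names and counts are hypothetical, not COMPASS or Aspen output) derives computational intensity as flops per byte from the kind of per-subroutine flop and memory-traffic counts a model exposes.

    # Hypothetical per-subroutine counts: (flops, bytes moved, resident footprint in bytes).
    subroutines = {
        "CalcForces":  (4.0e9, 1.6e9, 2.0e8),
        "UpdateState": (5.0e8, 2.0e9, 3.0e8),
    }

    for name, (flops, bytes_moved, footprint) in subroutines.items():
        intensity = flops / bytes_moved   # computational (arithmetic) intensity, flops per byte
        print(f"{name}: {intensity:.2f} flops/byte, "
              f"{bytes_moved / 1e9:.1f} GB moved, {footprint / 1e6:.0f} MB footprint")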
31
Fast and Accurate Enough for Online Decisions
32
PANORAMA Overview
[Diagram: PANORAMA architecture. The Pegasus workflow framework and the Aspen modeling language and system support infrastructure design, model validation, workflow execution simulation, anomaly detection and diagnosis, and resource mapping and adaptation, using raw and correlated monitoring data from resources including ExoGENI, OLCF, NERSC, APS, SNS, HPSS, VDF, visualization platforms, and the ESnet testbed.]
E. Deelman, C. Carothers et al., "PANORAMA: An Approach to Performance Modeling and Diagnosis of Extreme Scale Workflows," International Journal of High Performance Computing Applications, (to appear), 2015.
End-to-end Resiliency Design using
Aspen
35
Data Vulnerability Factor: Why a new metric and
methodology?
• Analytical model of resiliency that includes important features of architecture and application
– Fast
– Flexible
• Balance multiple design dimensions
– Application requirements
– Architecture (memory capacity and type)
• Focus on main memory initially
• Prioritize vulnerabilities of application data
L. Yu, D. Li et al., “Quantitatively modeling application resilience with the data vulnerability factor (Best Student Paper Finalist),” in
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. New Orleans, Louisiana:
IEEE Press, 2014, pp. 695-706, 10.1109/sc.2014.62.
36
DVF Defined

Hardware effects: the hardware failure rate ($FIT$), the execution time ($T$), and the data footprint size ($S_d$) determine the expected number of errors:

$N_{error} = FIT \times T \times S_d$

Application effects: the hardware access pattern determines the number of hardware accesses ($N_{ha}$) to each data structure.

Data structure vulnerability: $DVF_d = N_{error} \times N_{ha}$

Application vulnerability: $DVF_a = \sum_{i=1}^{n} DVF_{d_i}$

We focus on a specific hardware component, the main memory, in this work. A larger DVF indicates higher vulnerability, and vice versa.
37
Implementing DVF
• Extend Aspen performance modeling language
• Specify memory access patterns
• Combine error rates with memory regions and performance
• Assign a DVF to each application memory region and sum over regions for the application, as sketched below
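A minimal sketch of that calculation, in plain Python rather than the extended Aspen language, is shown below; the region names, failure rate, runtime, and access counts are assumed for illustration, and only the formulas follow the DVF definition above.

    # DVF sketch following the definition above:
    #   N_error = FIT * T * S_d ;  DVF_d = N_error * N_ha ;  DVF_app = sum over regions.
    # All inputs here are hypothetical.

    def dvf_region(fit, exec_time, footprint_bytes, accesses):
        n_error = fit * exec_time * footprint_bytes   # expected errors striking this region
        return n_error * accesses                     # data-structure DVF

    regions = {
        # name: (footprint in bytes, number of memory accesses), illustrative values
        "matA": (16_000, 51),
        "matB": (16_000, 204),
    }

    fit, runtime = 1e-9, 100.0   # assumed per-byte failure rate and runtime (seconds)
    per_region = {name: dvf_region(fit, runtime, s, n) for name, (s, n) in regions.items()}
    dvf_app = sum(per_region.values())
    print(per_region, "application DVF =", f"{dvf_app:.3e}")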
38
Workflow to calculate Data Vulnerability Factor
39
An Example of an Aspen Program for DVF

Pseudocode:

    procedure VM(A, B, C)
        for i ← 1, 1000 do
            C[i] ← C[i] + A[i*4] * B[i*8]
        end for
    end procedure

Extended Aspen statements:

    kernel vecmul {
        execute mainblock2 [1] {
            flops [2*(n^3)] as sp, fmad, simd
            access {1000} from {matA} as stream(4,16)
            access {4000} from {matB} as stream(4,32)
            access {8000} from {matC} as stream(4,4)
        }
    }

Syntax tree produced by the extended parser (excerpt):

    Resilience Statements:
        Footprint Sizes:  Int: 16,000
        Data Structures:  Ident: matA
        Access Pattern:   Stream  Int: 4  Int: 16

Resilience modeling results from the extended compiler:

    Data structure A:
        Number of errors: 30,400
        Number of memory accesses: 51
        DVF: 1.5504e+06
    …

(The reported DVF follows the definition above: $DVF_d = N_{error} \times N_{ha} = 30{,}400 \times 51 = 1.5504 \times 10^6$.)
40
DVF results provide insight for balancing interacting factors
41
DVF: next steps
• Evaluate different architectures
– How much no-ECC, ECC, NVM?
• Evaluate software and applications
– ABFT
– C/R
– TMR
– Containment domains
– Fault tolerant MPI
• End-to-End analysis
– Where should we bear the cost for resiliency?
• Not everywhere!
43
Summary
• Our community has major challenges in HPC as we move to extreme scale
– Power, Performance, Resilience, Productivity
– Major shifts in architectures, software, applications
– Not just HPC: the most uncertainty in two decades
• New technologies are emerging to address some of these challenges
– Heterogeneous computing
– Nonvolatile memory
• Consequently, we now have critical needs in
– Portable programming models
– Performance prediction for procurement, optimization, etc.
• Aspen is a tool we have developed for performance prediction
– Co-design applications and architectures for performance, power, and resiliency
44
Acknowledgements
• Contributors and Sponsors
– Future Technologies Group: http://ft.ornl.gov
– US Department of Energy Office of Science
• DOE Vancouver Project: https://ft.ornl.gov/trac/vancouver
• DOE Blackcomb Project: https://ft.ornl.gov/trac/blackcomb
• DOE ExMatEx Codesign Center: http://codesign.lanl.gov
• DOE Cesar Codesign Center: http://cesar.mcs.anl.gov/
• DOE Exascale Efforts: http://science.energy.gov/ascr/research/computer-science/
– Scalable Heterogeneous Computing Benchmark team: http://bit.ly/shocmarx
– US National Science Foundation Keeneland Project: http://keeneland.gatech.edu
– US DARPA
– NVIDIA CUDA Center of Excellence