Performance and Power Co-Design of Exascale Systems and Applications
Adolfy Hoisie
Work with Kevin Barker, Darren Kerbyson, Abhinav Vishnu
Performance and Architecture Lab (PAL)
Pacific Northwest National Laboratory
5th Parallel Tools Workshop, Dresden, September 27, 2011
Outline
Static performance modeling
Dynamic modeling
Modeling for Exascale
Tentative conclusions
The fallacy of simple metrics: efficiency
Example 1: Efficiency of applications
Example 2: Efficiency of systems
– Code A on Machine X (500 MFLOPS peak per CPU, 2 FLOPS per CP):
  » Time = 522 sec; MFLOPS = 26.1 (5.2% of peak)
– Code A on Machine Y (3600 MFLOPS peak per CPU, 4 FLOPS per CP):
  » Time = 91.1 sec; MFLOPS = 113.0 (3.1% of peak)
Machine Y runs the code ~5.7x faster even though its efficiency (% of peak) is lower.

| Solver    | Flops (%) | Flops       | Mflop/s | % Peak | Time (s) |
| Original  | 64%       | 29.8 x 10^9 | 448.8   | 5.6%   | 66.351   |
| Optimized | 25%       | 8.2 x 10^9  | 257.7   | 3.2%   | 31.905   |

The optimized solver is ~2.1x faster despite a lower Mflop/s rate and a lower fraction of peak: efficiency metrics reward doing more flops, not solving the problem sooner.
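A quick arithmetic check (a minimal sketch in Python; the figures come straight from the table above) makes the fallacy explicit: what matters is the runtime implied by flop count and achieved rate, not the rate itself.

```python
# Runtime implied by flop count and achieved rate (values from the table above).
original  = {"flops": 29.8e9, "mflops": 448.8}
optimized = {"flops": 8.2e9,  "mflops": 257.7}

def runtime_s(case):
    # time (s) = total flops / achieved rate (flop/s)
    return case["flops"] / (case["mflops"] * 1e6)

t_orig, t_opt = runtime_s(original), runtime_s(optimized)
print(f"original:  {t_orig:5.1f} s")   # ~66.4 s
print(f"optimized: {t_opt:5.1f} s")    # ~31.8 s
print(f"speedup:   {t_orig / t_opt:.2f}x, with LOWER Mflop/s and % of peak")
```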
Rough taxonomy of modeling
Simulation
» Greatest architectural flexibility but impractical for real applications
Trace-driven experiments
» Results often lack generality
Quasi-analytical modeling
» Can tackle full apps on full machines
» Uses a set of input knobs
» Tool-neutral
Benchmarking
» Limited to current implementation of the code
» Limited to currently-available architectures
» Difficult to distinguish between real performance and machine idiosyncrasies
Attributes of a Performance Model
Encapsulates application behavior
– Abstracts application into communication and computation components
– Focuses on first-order effects, ignoring distracting details
Separates performance concerns
– Inherent properties of application structure (e.g., data dependencies)
– System performance characteristics (e.g., MPI latency)
Performance Prediction
[Diagram: a code model combined with a system model, together with the problem configuration and software parameters, yields a performance prediction for the code executing on the system.]
A Performance Modeling Process Flow
[Flow diagram: identify application characteristics (data structures, decomposition, memory usage, parallel activities, frequency of use, …); construct or refine the application model; acquire performance characteristics by running micro-benchmarks on the system, or from specifications of future (promised) performance; combine, then validate by comparing the model against measured runs of the code on the system. Once validated, the model can be trusted and used to verify current performance, test new configurations (HW and/or SW), compare systems, and propose future systems.]
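To make the flow concrete, here is a minimal sketch of the kind of quasi-analytical model the process produces. The structure (runtime split into computation and communication terms, with system characteristics as input knobs) follows the slides; the function name, parameters, and values are illustrative, not any specific PAL model.

```python
# Minimal quasi-analytical model sketch (illustrative, not a specific PAL model):
# T(P) = steps * (T_compute + T_comm), with system characteristics as "knobs".

def model_runtime(cells_per_pe, time_per_cell, msgs_per_step,
                  msg_bytes, latency, bandwidth, steps):
    """Predicted runtime (s) for a weak-scaled, nearest-neighbor code."""
    t_compute = cells_per_pe * time_per_cell                 # per-PE work
    t_comm = msgs_per_step * (latency + msg_bytes / bandwidth)
    return steps * (t_compute + t_comm)

# Example knobs: measured via micro-benchmarks (current system) or taken
# from specifications (future, "promised" performance).
predicted = model_runtime(cells_per_pe=100_000, time_per_cell=50e-9,
                          msgs_per_step=4, msg_bytes=64_000,
                          latency=4e-6, bandwidth=1.6e9, steps=1_000)
print(f"predicted runtime: {predicted:.3f} s")

# Validation step from the flow: compare against a measured run.
measured = 5.2  # hypothetical measurement
error = abs(predicted - measured) / measured
print(f"model error: {error:.1%} -> trusted if within tolerance")
```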
Partial list of modeled systems & codes
Machines
– ASCI Q
– ASCI BlueMountain
– ASCI White
– ASCI Red
– CRAY T3E
– Earth Simulator
– Itanium-2 cluster
– BlueGene/L
– BlueGene/P
– CRAY X-1
– ASC Red Storm
– ASC Purple
– IBM PERCS
– IBM Blue Waters
– Clearspeed accelerators
– SiCortex SC5832
– Roadrunner
– Jaguar
– …
Codes
– SWEEP3D
– SAGE
– TYCHO
– Partisn
– LBMHD
– HYCOM
– MCNP
– POP
– KRAK
– RF-CTH
– CICE
– S3D
– VPIC
– GTC
– …
Modeling in action as a co-design process – IBM PERCS
Modeling was used to explore and guide the design of PERCS with an application suite (HPCS phases 1 & 2)
The design feedback loop was exercised with increasing speed
Numerous configurations and options were explored
[Diagram: application characteristics and simulated single-PE, single-chip run times from the PERCS simulator feed the performance model, which drives system design choices: network topology, latency, bandwidth, contention, cores per chip, …]
Large-scale Performance Predictions
[Chart (IBM and PNNL predictions): runtime ratio vs. the best network, from 1.0 to 2.0, across the FC1, OCS-FC1, OCS-FC2, OCS-D, 2D, 3D, and FT topologies, for HYCOM, LBMHD, RF-CTH2, KRAK, SAGE, Sweep3D, and POP.]
Topology comparison through co-design
Example: 2,048-PE job (256-node system, 64-way)
– FC: fully-connected, 1-hop
– OCS: 1-hop or 2-hop
– 2D, 3D: meshes
– FT: fat-tree
– OCS-D: OCS-Dynamic
Best hardware: 50 ns latency, 4 GB/s links
The graph shows the runtime of each network relative to the best-performing network (a model sketch follows).
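As a hedged sketch of how a model can drive such a topology comparison (the hop counts and the cost function below are illustrative assumptions, not the actual PERCS study; only the latency and link bandwidth come from the slide):

```python
# Illustrative topology comparison: communication cost = hops * latency
# + bandwidth term. Hop counts are assumed averages for a 256-node system.

LATENCY = 50e-9    # best hardware latency from the slide (s)
BANDWIDTH = 4e9    # link bandwidth from the slide (B/s)

# Assumed average hop counts (illustrative only, not measured values).
TOPOLOGY_HOPS = {"FC1": 1, "OCS-FC1": 1, "OCS-FC2": 2,
                 "OCS-D": 1.5, "2D mesh": 10.7, "3D mesh": 6.4, "FT": 4}

def comm_time(topology, msg_bytes):
    hops = TOPOLOGY_HOPS[topology]
    return hops * LATENCY + msg_bytes / BANDWIDTH

times = {t: comm_time(t, msg_bytes=8_192) for t in TOPOLOGY_HOPS}
best = min(times.values())
for topo, t in sorted(times.items(), key=lambda kv: kv[1]):
    print(f"{topo:8s} ratio vs. best network: {t / best:.3f}")
```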
Modeling as a co-design tool
Where is the time being spent?
– ~63% compute on Cell
– ~20% latency (Cell <-> AMD)
– ~5% bandwidth (Cell <-> AMD)
– ~8% latency (InfiniBand)
– ~3% bandwidth (InfiniBand)
The pipeline is unavoidable
Latency dominates communication (Cell <-> AMD is the major component)
Uses ‘probable’ HW parameters
[Chart: predicted time breakdown (0–100%) by node count (1–128) and compute units per node (1–18 CU): inter-node bandwidth and latency, AMD <-> Cell bandwidth and latency, Compute_Pipe (Cell), and Compute_Block (Cell).]
An example of modeling in action
Assumptions (hypothetical system):
– Weak scaling
– Assumed subgrids
– Processing time per cell
– Inter-PE (on accelerator): bandwidth = 1 GB/s, latency = 50 ns
– Inter-node (MPI): bandwidth = 1.6 GB/s, latency = 4 µs
[Chart: modeled cycle time (0–30 ms) vs. compute processor / AD count (1 to 16,384), for configurations from AD = 1 PE to AD = 128 PEs.]
At the largest scale (16,384 compute processors and 16,384 accelerators), the performance improvement is ~3.5x when using accelerators with 128x more PEs; a model sketch in the same spirit follows.
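A minimal sketch of how such assumptions could drive this kind of prediction. Only the latency and bandwidth figures come from the slide; the cell count, per-cell cost, and halo sizes are invented placeholders, and the model form is illustrative.

```python
# Hypothetical-system cycle-time sketch (weak scaling). Latency/bandwidth
# values come from the slide; cells, t_cell, halo_bytes are placeholders.

PE_BW, PE_LAT   = 1e9,   50e-9   # inter-PE on accelerator (B/s, s)
MPI_BW, MPI_LAT = 1.6e9, 4e-6    # inter-node MPI (B/s, s)

def cycle_time_s(pes_per_ad, cells=1_000_000, t_cell=10e-9, halo_bytes=40_000):
    """Modeled cycle time per accelerator device (AD)."""
    compute = (cells / pes_per_ad) * t_cell                        # split work
    intra = PE_LAT + halo_bytes / PE_BW if pes_per_ad > 1 else 0.0 # on-device halo
    inter = MPI_LAT + halo_bytes / MPI_BW                          # MPI halo
    return compute + intra + inter

for pes in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"AD={pes:3d} PEs: cycle time = {cycle_time_s(pes) * 1e3:6.2f} ms")
```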
Challenges ahead: Performance from concurrency with faults and power
|                     | 2008 (1st petascale: Roadrunner) | 2018? (1st exascale) |
| System Peak         | 1.4 peta  | 1 exa                |
| Power               | 2.5 MW    | 20 MW                |
| System Memory       | 0.3 PB    | 32-64 PB             |
| Node Performance    | 425 GF    | 1 TF - 10 TF         |
| Node Concurrency    | 40        | O(1,000) - O(10,000) |
| System Size (nodes) | 3,240     | 1K - 100K            |
| System Concurrency  | 128,160   | ~1 billion           |
| MTTI                | days      | < 1 day              |
" System Architecture: connectivity " Technology innovations: chip architecture, chip stacking, optical networks " Multi-dimensional: Performance + Power + Resilience
Economics show the shift in importance from performance to including power& FT
Current predictions of exa-flop system power requirements:
[Chart: projected exascale system power (MW) from IBM/BlueGene (12/10), Intel (03/11), uHPC (2010), and Nvidia (05/11), ranging up to 400+ MW, against the DOE goal.]
Expected energy cost per year: at best $20M (at $1M per MW-year, i.e., a 20 MW system)
If the system costs $100M, then over a 5-year system life more than half of the total cost will be energy ($20M/year x 5 years = $100M for energy alone)
!" #!" $!" %!" &!" '!" (!" )!" *!" +!" #!!" ##!" #$!"
,-./"-0123"
456789:"9;<=37>;;=7<"
,/"?900@=6<>6=0A="
B-./"C77=DD"
,E"F>:"
!"#$%&'()'*+,'+%&'-.'
Data based from B. Dally, IPDPS keynote, May 2011
What can you do with a nJ?
~30 flops = the energy of 1 data movement (DM) on chip; ~60 flops = 1 DM off-chip
It’s all about the data movement
Locality, Locality, Locality
Towards Exascale: Exploration of deep memory hierarchies
Architectural factors
– Swim lanes: multi-core vs. heterogeneous
– Fused CPU/GPU will impact memory performance
– Deeper memory hierarchies, for power as well as performance
Application factors
– Greater concurrency, greater locality, less synchronization
– Greater focus on data/memory factors
[Chart: distribution of page access frequency (Hz) vs. page count (%) for GTC and SAGE.]
Memory access phase behavior indicates potential power saving “windows of opportunity”.
Less frequently used pages can be migrated to low-power memory
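As an illustration of how such “windows of opportunity” could be exploited, here is a hedged sketch of a page-migration policy. The frequency threshold and the migrate_to_low_power hook are hypothetical, not an existing interface.

```python
# Hypothetical page-migration policy sketch: pages sampled below a frequency
# threshold become candidates for low-power memory. The threshold and the
# migration hook are assumptions for this sketch.

ACCESS_THRESHOLD_HZ = 10.0  # assumed cutoff separating hot and cold pages

def select_cold_pages(page_access_hz):
    """Return page IDs whose sampled access frequency is below the threshold."""
    return [page for page, hz in page_access_hz.items()
            if hz < ACCESS_THRESHOLD_HZ]

def migrate_to_low_power(pages):
    # Placeholder for a real migration mechanism (e.g., an OS/runtime call).
    for page in pages:
        print(f"migrating page {page:#x} to low-power memory")

# Example: sampled per-page access frequencies (Hz), as in the GTC/SAGE data.
samples = {0x1000: 950.0, 0x2000: 2.5, 0x3000: 0.1, 0x4000: 120.0}
migrate_to_low_power(select_cold_pages(samples))
```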
Changes of direction in modeling
" Performance at what cost ? " Reliability at what cost ?
" Looking at Performance, Power and Reliability will lead to multi-dimensional optimizations: " Trade-offs " Performance at what power " Reliability at what power " Data-movement costs " Power steering
Reliability
Power
Co-Design of power-constrained systems
Modeling can be used to quantify power consumption as well as performance
Measurement of current components and simulation of future technologies
Optimization directed by modeled predictions
[Diagram: feedback cycle linking Modeling, Measurement/Simulation, and Optimization.]
Feedback cycle can represent both off-line and on-line activities:
1. Static design-space exploration
2. Dynamic application/resource steering
Measuring Power Today
Without specialized hardware, direct power measurement is not possible
So, indirect methods have been proposed:
– Determining power from temperature
» Processor temperature is easy to measure
» But it is difficult to correlate temperature with activity
– Determining power from performance counters
» Complex relationship between processor activity and power (see the sketch below)
For higher accuracy, dedicated measurement hardware is needed
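A hedged sketch of the counter-based approach. The counter names, coefficients, and the linear form are illustrative assumptions; as noted above, the real activity/power relationship is more complex and architecture-specific.

```python
# Illustrative linear power model fit from performance-counter rates.
# Counter names and coefficients are assumptions for this sketch.

# Coefficients (W per event/s), e.g. obtained offline by regressing counter
# rates against an external meter such as a Watts Up device.
IDLE_POWER_W = 45.0
COEFF = {"instructions": 4.0e-9, "l2_misses": 3.0e-8, "dram_accesses": 9.0e-8}

def estimate_power_w(counter_rates):
    """Estimate node power (W) from sampled counter rates (events/s)."""
    return IDLE_POWER_W + sum(COEFF[c] * rate
                              for c, rate in counter_rates.items())

sample = {"instructions": 9.0e9, "l2_misses": 2.0e7, "dram_accesses": 5.0e7}
print(f"estimated power: {estimate_power_w(sample):.1f} W")  # ~86.1 W
```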
Measuring Power Today
Power measurement hardware comes in two flavors
External to the compute node (e.g., Watts Up)
– Measurement device sits between the power socket and the compute node
– Often relatively inexpensive (O($100)), scalable to clusters
– Typically low temporal and spatial fidelity (e.g., 1 Hz; cannot separate consumed power on a per-component basis)
Internal
– Home-grown solutions requiring “surgery” inside the node
– Single-node solutions; not scalable to clusters
– Hardware vendors use custom boards not available to the research community
Where Do We Want To Be?
Tools at the single-node level
– Where’s my Power-PAPI? Extend the concept of performance counters to power counters
» Valid power counters may vary by architecture
» Determining power requires sampling voltage and current, which may inhibit temporal resolution, leading to “stale” data
– Software control
» The PNNL-Power library has this capability, but measurements are coarse-grained
» The goal is to associate measurements with software activities (see the sketch after this list)
– Requires close collaboration with the hardware community
Tools at the cluster level
– Aggregating data across nodes within the cluster (including the network)
– Again, analogous to performance tools today
– Limits to scalability?
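One way to picture the goal of associating power measurements with software activities is a region-based interface, sketched below. This is hypothetical: it is neither the PAPI API nor the PNNL-Power library, and read_power_w() stands in for whatever sensor a real tool would use.

```python
# Hypothetical "Power-PAPI"-style region interface; only the goal
# (attributing power samples to software activities) comes from the slide.

import time

def read_power_w():
    # Placeholder for a real sensor read (external meter, custom board, ...).
    return 85.0

class PowerRegion:
    """Context manager attributing a coarse energy estimate to a code region."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self.t0, self.p0 = time.time(), read_power_w()
        return self
    def __exit__(self, *exc):
        t1, p1 = time.time(), read_power_w()
        avg_w = (self.p0 + p1) / 2        # coarse: only two samples
        print(f"{self.name}: ~{avg_w * (t1 - self.t0):.2f} J")

with PowerRegion("solver_phase"):
    sum(i * i for i in range(1_000_000))  # stand-in for real work
```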
Expanding Modeling Methodology to include Power
Power modeling at scale is similar to performance modeling:
– Application behaviors in common
– Resource metrics differ (time, power, etc.)
Obtaining the characteristics will differ, e.g.:
– Cycle-accurate simulation + micro-benchmarks for performance
– Cycle-accurate power simulation + micro-benchmarks for power
Mirror the performance approach (a sketch follows), e.g.:
– Early design: estimate core, memory, and communication power
– Later design: cycle-accurate power simulation & refined network/communication power
– Implementation (small scale): measurement possible
– Implementation (large scale): validation of system power
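A minimal sketch of mirroring the performance model with an energy model: attribute a power draw to each modeled time component, then sum P·t. The component power values and the additive form are illustrative assumptions.

```python
# Energy model mirroring the performance model: E = sum(P_i * t_i) plus an
# idle baseline. Power values are placeholders, not measurements.

POWER_W = {"compute": 120.0, "memory": 40.0, "network": 25.0, "idle": 30.0}

def modeled_energy_j(times_s):
    """Energy (J) from modeled per-component busy times plus idle power."""
    total_t = max(times_s.values())   # assume components overlap in time
    active = sum(POWER_W[c] * t for c, t in times_s.items())
    return active + POWER_W["idle"] * total_t

phase_times = {"compute": 5.0, "memory": 1.2, "network": 0.4}
print(f"modeled energy: {modeled_energy_j(phase_times):.0f} J")  # ~808 J
```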
Issues
Level of abstraction for modeling?
– Depends on the definition of system power
– Depends on validation against an existing system
System space to be explored?
– Dimensions in the design space -> parameterization
– Range of the space of interest, and what would a baseline look like?
Tool design and development
Workload (of common interest?)
– Use of many applications
Analysis: design space
– Power budget allocation -> performance/energy optimization
Analysis: dynamic possibilities
– Power steering (a sketch follows)
Analysis: comparison to other possible future systems
Use an iterative design flow
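As a picture of what dynamic power steering could look like, here is a hedged sketch. The node budget, phase classification, allocation fractions, and the set_power_cap hook are all hypothetical.

```python
# Hypothetical power-steering loop: shift a fixed node power budget between
# CPU and memory according to the detected application phase.

NODE_BUDGET_W = 200.0  # assumed per-node power budget

def set_power_cap(component, watts):
    # Placeholder for a real control mechanism (e.g., a firmware interface).
    print(f"cap {component}: {watts:.0f} W")

def steer(phase):
    """Allocate the node budget based on the current phase (assumed splits)."""
    if phase == "compute-bound":
        cpu_w = 0.8 * NODE_BUDGET_W   # favor the cores
    elif phase == "memory-bound":
        cpu_w = 0.5 * NODE_BUDGET_W   # shift budget toward memory
    else:
        cpu_w = 0.65 * NODE_BUDGET_W
    set_power_cap("cpu", cpu_w)
    set_power_cap("memory", NODE_BUDGET_W - cpu_w)

for phase in ("compute-bound", "memory-bound", "communication"):
    steer(phase)
```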
A few general remarks
Modeling applied in practice: system and application design, analysis, prediction, and testing
Modeling is the quantitative tool of co-design
Power, performance, and reliability will be the modeling triad on the path to Exascale
Significant gaps exist in methodology development and practice
Investment needs to accompany system and application development for Exascale
Power/energy is not the sole domain of any one level of the stack – but we need dynamic, quantitative tools