Performance and Power Co-Design of Exascale Systems and Applications
Adolfy Hoisie
Work with Kevin Barker, Darren Kerbyson, Abhinav Vishnu
Performance and Architecture Lab (PAL)
Pacific Northwest National Laboratory
5th Parallel Tools Workshop, Dresden, September 27, 2011
Outline
Static performance modeling
Dynamic modeling
Modeling for Exascale
Tentative conclusions
The fallacy of simple metrics: efficiency
Example 1: Efficiency of applications
Example 2: Efficiency of systems
– Code A on Machine X (500 MFLOPS peak per CPU, 2 FLOPS per CP):
  » Time = 522 sec; MFLOPS = 26.1 (5.2% of peak)
– Code A on Machine Y (3600 MFLOPS peak per CPU, 4 FLOPS per CP):
  » Time = 91.1 sec; MFLOPS = 113.0 (3.1% of peak)
Machine Y runs the code ~5.7x faster even though its efficiency (% of peak) is lower.

| Solver    | Flops (%) | Flops       | Mflop/s | % Peak | Time (s) |
| Original  | 64%       | 29.8 x 10^9 | 448.8   | 5.6%   | 66.351   |
| Optimized | 25%       | 8.2 x 10^9  | 257.7   | 3.2%   | 31.905   |

The optimized solver is ~2.1x faster despite a lower Mflop/s rate and a lower fraction of peak: efficiency metrics reward doing more flops, not solving the problem sooner.
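A quick arithmetic check (a minimal sketch in Python; the figures come straight from the table above) makes the fallacy explicit: what matters is the runtime implied by flop count and achieved rate, not the rate itself.

```python
# Runtime implied by flop count and achieved rate (values from the table above).
original  = {"flops": 29.8e9, "mflops": 448.8}
optimized = {"flops": 8.2e9,  "mflops": 257.7}

def runtime_s(case):
    # time (s) = total flops / achieved rate (flop/s)
    return case["flops"] / (case["mflops"] * 1e6)

t_orig, t_opt = runtime_s(original), runtime_s(optimized)
print(f"original:  {t_orig:5.1f} s")   # ~66.4 s
print(f"optimized: {t_opt:5.1f} s")    # ~31.8 s
print(f"speedup:   {t_orig / t_opt:.2f}x, with LOWER Mflop/s and % of peak")
```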
Rough taxonomy of modeling
Simulation
» Greatest architectural flexibility but impractical for real applications
Trace-driven experiments
» Results often lack generality
Quasi-analytical modeling
» Can tackle full apps on full machines
» Uses a set of input knobs
» Tool-neutral
Benchmarking
» Limited to current implementation of the code
» Limited to currently-available architectures
» Difficult to distinguish between real performance and machine idiosyncrasies
Attributes of a Performance Model
Encapsulates application behavior
– Abstracts application into communication and computation components
– Focuses on first-order effects, ignoring distracting details
Separates performance concerns
– Inherent properties of application structure (e.g., data dependencies)
– System performance characteristics (e.g., MPI latency)
Performance Prediction
[Diagram: a code model combined with a system model, together with the problem configuration and software parameters, yields a performance prediction for the code executing on the system.]
A Performance Modeling Process Flow
[Flow diagram: identify application characteristics (data structures, decomposition, memory usage, parallel activities, frequency of use, …); construct or refine the application model; acquire performance characteristics by running micro-benchmarks on the system, or from specifications of future (promised) performance; combine, then validate by comparing the model against measured runs of the code on the system. Once validated, the model can be trusted and used to verify current performance, test new configurations (HW and/or SW), compare systems, and propose future systems.]
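To make the flow concrete, here is a minimal sketch of the kind of quasi-analytical model the process produces. The structure (runtime split into computation and communication terms, with system characteristics as input knobs) follows the slides; the function name, parameters, and values are illustrative, not any specific PAL model.

```python
# Minimal quasi-analytical model sketch (illustrative, not a specific PAL model):
# T(P) = steps * (T_compute + T_comm), with system characteristics as "knobs".

def model_runtime(cells_per_pe, time_per_cell, msgs_per_step,
                  msg_bytes, latency, bandwidth, steps):
    """Predicted runtime (s) for a weak-scaled, nearest-neighbor code."""
    t_compute = cells_per_pe * time_per_cell                 # per-PE work
    t_comm = msgs_per_step * (latency + msg_bytes / bandwidth)
    return steps * (t_compute + t_comm)

# Example knobs: measured via micro-benchmarks (current system) or taken
# from specifications (future, "promised" performance).
predicted = model_runtime(cells_per_pe=100_000, time_per_cell=50e-9,
                          msgs_per_step=4, msg_bytes=64_000,
                          latency=4e-6, bandwidth=1.6e9, steps=1_000)
print(f"predicted runtime: {predicted:.3f} s")

# Validation step from the flow: compare against a measured run.
measured = 5.2  # hypothetical measurement
error = abs(predicted - measured) / measured
print(f"model error: {error:.1%} -> trusted if within tolerance")
```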
Partial list of modeled systems & codes
Machines
– ASCI Q
– ASCI BlueMountain
– ASCI White
– ASCI Red
– CRAY T3E
– Earth Simulator
– Itanium-2 cluster
– BlueGene/L
– BlueGene/P
– CRAY X-1
– ASC Red Storm
– ASC Purple
– IBM PERCS
– IBM Blue Waters
– Clearspeed accelerators
– SiCortex SC5832
– Roadrunner
– Jaguar
– …
Codes
– SWEEP3D
– SAGE
– TYCHO
– Partisn
– LBMHD
– HYCOM
– MCNP
– POP
– KRAK
– RF-CTH
– CICE
– S3D
– VPIC
– GTC
– …
Modeling in action as a co-design process – IBM PERCS
Modeling was used to explore and guide the design of PERCS with an application suite (HPCS phases 1 & 2)
The design feedback loop was exercised with increasing speed
Numerous configurations and options were explored
[Diagram: application characteristics and simulated single-PE, single-chip run times from the PERCS simulator feed the performance model, which drives system design choices: network topology, latency, bandwidth, contention, cores per chip, …]
Large-scale Performance Predictions
[Chart (IBM and PNNL predictions): runtime ratio vs. the best network, from 1.0 to 2.0, across the FC1, OCS-FC1, OCS-FC2, OCS-D, 2D, 3D, and FT topologies, for HYCOM, LBMHD, RF-CTH2, KRAK, SAGE, Sweep3D, and POP.]
Topology comparison through co-design
Example: 2,048-PE job (256-node system, 64-way)
– FC: fully-connected, 1-hop
– OCS: 1-hop or 2-hop
– 2D, 3D: meshes
– FT: fat-tree
– OCS-D: OCS-Dynamic
Best hardware: 50 ns latency, 4 GB/s links
The graph shows the runtime of each network relative to the best-performing network (a model sketch follows).
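As a hedged sketch of how a model can drive such a topology comparison (the hop counts and the cost function below are illustrative assumptions, not the actual PERCS study; only the latency and link bandwidth come from the slide):

```python
# Illustrative topology comparison: communication cost = hops * latency
# + bandwidth term. Hop counts are assumed averages for a 256-node system.

LATENCY = 50e-9    # best hardware latency from the slide (s)
BANDWIDTH = 4e9    # link bandwidth from the slide (B/s)

# Assumed average hop counts (illustrative only, not measured values).
TOPOLOGY_HOPS = {"FC1": 1, "OCS-FC1": 1, "OCS-FC2": 2,
                 "OCS-D": 1.5, "2D mesh": 10.7, "3D mesh": 6.4, "FT": 4}

def comm_time(topology, msg_bytes):
    hops = TOPOLOGY_HOPS[topology]
    return hops * LATENCY + msg_bytes / BANDWIDTH

times = {t: comm_time(t, msg_bytes=8_192) for t in TOPOLOGY_HOPS}
best = min(times.values())
for topo, t in sorted(times.items(), key=lambda kv: kv[1]):
    print(f"{topo:8s} ratio vs. best network: {t / best:.3f}")
```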
Modeling as a co-design tool
Where is the time being spent?
– ~63% compute on Cell
– ~20% latency (Cell <-> AMD)
– ~5% bandwidth (Cell <-> AMD)
– ~8% latency (InfiniBand)
– ~3% bandwidth (InfiniBand)
The pipeline is unavoidable
Latency dominates communication (Cell <-> AMD is the major component)
Uses ‘probable’ HW parameters
[Chart: predicted time breakdown (0–100%) by node count (1–128) and compute units per node (1–18 CU): inter-node bandwidth and latency, AMD <-> Cell bandwidth and latency, Compute_Pipe (Cell), and Compute_Block (Cell).]
An example of modeling in action
Assumptions (hypothetical system):
– Weak scaling
– Assumed subgrids
– Processing time per cell
– Inter-PE (on accelerator): bandwidth = 1 GB/s, latency = 50 ns
– Inter-node (MPI): bandwidth = 1.6 GB/s, latency = 4 µs
[Chart: modeled cycle time (0–30 ms) vs. compute processor / AD count (1 to 16,384), for configurations from AD = 1 PE to AD = 128 PEs.]
At the largest scale (16,384 compute processors and 16,384 accelerators), the performance improvement is ~3.5x when using accelerators with 128x more PEs; a model sketch in the same spirit follows.
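A minimal sketch of how such assumptions could drive this kind of prediction. Only the latency and bandwidth figures come from the slide; the cell count, per-cell cost, and halo sizes are invented placeholders, and the model form is illustrative.

```python
# Hypothetical-system cycle-time sketch (weak scaling). Latency/bandwidth
# values come from the slide; cells, t_cell, halo_bytes are placeholders.

PE_BW, PE_LAT   = 1e9,   50e-9   # inter-PE on accelerator (B/s, s)
MPI_BW, MPI_LAT = 1.6e9, 4e-6    # inter-node MPI (B/s, s)

def cycle_time_s(pes_per_ad, cells=1_000_000, t_cell=10e-9, halo_bytes=40_000):
    """Modeled cycle time per accelerator device (AD)."""
    compute = (cells / pes_per_ad) * t_cell                        # split work
    intra = PE_LAT + halo_bytes / PE_BW if pes_per_ad > 1 else 0.0 # on-device halo
    inter = MPI_LAT + halo_bytes / MPI_BW                          # MPI halo
    return compute + intra + inter

for pes in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"AD={pes:3d} PEs: cycle time = {cycle_time_s(pes) * 1e3:6.2f} ms")
```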
Challenges ahead: Performance from concurrency with faults and power
|                     | 2008 (1st petascale: Roadrunner) | 2018? (1st exascale) |
| System Peak         | 1.4 peta  | 1 exa                |
| Power               | 2.5 MW    | 20 MW                |
| System Memory       | 0.3 PB    | 32-64 PB             |
| Node Performance    | 425 GF    | 1 TF - 10 TF         |
| Node Concurrency    | 40        | O(1,000) - O(10,000) |
| System Size (nodes) | 3,240     | 1K - 100K            |
| System Concurrency  | 128,160   | ~1 billion           |
| MTTI                | days      | < 1 day              |
" System Architecture: connectivity " Technology innovations: chip architecture, chip stacking, optical networks " Multi-dimensional: Performance + Power + Resilience
Economics show the shift in importance from performance to including power& FT
Current predictions of exa-flop system power requirements:
[Chart: projected exascale system power (MW) from IBM/BlueGene (12/10), Intel (03/11), uHPC (2010), and Nvidia (05/11), ranging up to 400+ MW, against the DOE goal.]
Expected energy cost per year: at best $20M (at $1M per MW-year, i.e., a 20 MW system)
If the system costs $100M, then over a 5-year system life more than half of the total cost will be energy ($20M/year x 5 years = $100M for energy alone)
!" #!" $!" %!" &!" '!" (!" )!" *!" +!" #!!" ##!" #$!"
,-./"-0123"
456789:"9;<=37>;;=7<"
,/"?900@=6<>6=0A="
B-./"C77=DD"
,E"F>:"
!"#$%&'()'*+,'+%&'-.'
Data based from B. Dally, IPDPS keynote, May 2011
What can you do with a nJ?
~30 flops = the energy of 1 data movement (DM) on chip; ~60 flops = 1 DM off-chip
It’s all about the data movement
Locality, Locality, Locality
Towards Exascale: Exploration of deep memory hierarchies
Architectural factors
– Swim lanes: multi-core vs. heterogeneous
– Fused CPU/GPU will impact memory performance
– Deeper memory hierarchies, for power as well as performance
Application factors
– Greater concurrency, greater locality, less synchronization
– Greater focus on data/memory factors
[Chart: distribution of page access frequency (Hz) vs. page count (%) for GTC and SAGE.]
Memory access phase behavior indicates potential power saving “windows of opportunity”.
Less frequently used pages can be migrated to low-power memory
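As an illustration of how such “windows of opportunity” could be exploited, here is a hedged sketch of a page-migration policy. The frequency threshold and the migrate_to_low_power hook are hypothetical, not an existing interface.

```python
# Hypothetical page-migration policy sketch: pages sampled below a frequency
# threshold become candidates for low-power memory. The threshold and the
# migration hook are assumptions for this sketch.

ACCESS_THRESHOLD_HZ = 10.0  # assumed cutoff separating hot and cold pages

def select_cold_pages(page_access_hz):
    """Return page IDs whose sampled access frequency is below the threshold."""
    return [page for page, hz in page_access_hz.items()
            if hz < ACCESS_THRESHOLD_HZ]

def migrate_to_low_power(pages):
    # Placeholder for a real migration mechanism (e.g., an OS/runtime call).
    for page in pages:
        print(f"migrating page {page:#x} to low-power memory")

# Example: sampled per-page access frequencies (Hz), as in the GTC/SAGE data.
samples = {0x1000: 950.0, 0x2000: 2.5, 0x3000: 0.1, 0x4000: 120.0}
migrate_to_low_power(select_cold_pages(samples))
```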
Changes of direction in modeling
" Performance at what cost ? " Reliability at what cost ?
" Looking at Performance, Power and Reliability will lead to multi-dimensional optimizations: " Trade-offs " Performance at what power " Reliability at what power " Data-movement costs " Power steering
Reliability
Power
Co-Design of power-constrained systems
Modeling can be used to quantify power consumption as well as performance
Measurement of current components and simulation of future technologies
Optimization directed by modeled predictions
[Diagram: feedback cycle linking Modeling, Measurement/Simulation, and Optimization.]
Feedback cycle can represent both off-line and on-line activities:
1. Static design-space exploration
2. Dynamic application/resource steering
Measuring Power Today
Without specialized hardware, direct power measurement is not possible
So, indirect methods have been proposed:
– Determining power from temperature
» Processor temperature is easy to measure
» But it is difficult to correlate temperature with activity
– Determining power from performance counters
» Complex relationship between processor activity and power (see the sketch below)
For higher accuracy, dedicated measurement hardware is needed
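A hedged sketch of the counter-based approach. The counter names, coefficients, and the linear form are illustrative assumptions; as noted above, the real activity/power relationship is more complex and architecture-specific.

```python
# Illustrative linear power model fit from performance-counter rates.
# Counter names and coefficients are assumptions for this sketch.

# Coefficients (W per event/s), e.g. obtained offline by regressing counter
# rates against an external meter such as a Watts Up device.
IDLE_POWER_W = 45.0
COEFF = {"instructions": 4.0e-9, "l2_misses": 3.0e-8, "dram_accesses": 9.0e-8}

def estimate_power_w(counter_rates):
    """Estimate node power (W) from sampled counter rates (events/s)."""
    return IDLE_POWER_W + sum(COEFF[c] * rate
                              for c, rate in counter_rates.items())

sample = {"instructions": 9.0e9, "l2_misses": 2.0e7, "dram_accesses": 5.0e7}
print(f"estimated power: {estimate_power_w(sample):.1f} W")  # ~86.1 W
```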
Measuring Power Today
Power measurement hardware comes in two flavors
External to the compute node (e.g., Watts Up)
– Measurement device sits between the power socket and the compute node
– Often relatively inexpensive (O($100)), scalable to clusters
– Typically low temporal and spatial fidelity (e.g., 1 Hz; cannot separate consumed power on a per-component basis)
Internal
– Home-grown solutions requiring “surgery” inside the node
– Single-node solutions; not scalable to clusters
– Hardware vendors use custom boards not available to the research community
Where Do We Want To Be?
Tools at the single-node level
– Where’s my Power-PAPI? Extend the concept of performance counters to power counters
» Valid power counters may vary by architecture
» Determining power requires sampling voltage and current, which may inhibit temporal resolution, leading to “stale” data
– Software control
» The PNNL-Power library has this capability, but measurements are coarse-grained
» The goal is to associate measurements with software activities (see the sketch after this list)
– Requires close collaboration with the hardware community
Tools at the cluster level
– Aggregating data across nodes within the cluster (including the network)
– Again, analogous to performance tools today
– Limits to scalability?
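One way to picture the goal of associating power measurements with software activities is a region-based interface, sketched below. This is hypothetical: it is neither the PAPI API nor the PNNL-Power library, and read_power_w() stands in for whatever sensor a real tool would use.

```python
# Hypothetical "Power-PAPI"-style region interface; only the goal
# (attributing power samples to software activities) comes from the slide.

import time

def read_power_w():
    # Placeholder for a real sensor read (external meter, custom board, ...).
    return 85.0

class PowerRegion:
    """Context manager attributing a coarse energy estimate to a code region."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self.t0, self.p0 = time.time(), read_power_w()
        return self
    def __exit__(self, *exc):
        t1, p1 = time.time(), read_power_w()
        avg_w = (self.p0 + p1) / 2        # coarse: only two samples
        print(f"{self.name}: ~{avg_w * (t1 - self.t0):.2f} J")

with PowerRegion("solver_phase"):
    sum(i * i for i in range(1_000_000))  # stand-in for real work
```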
Expanding Modeling Methodology to include Power
Power modeling at scale is similar to performance modeling:
– Application behaviors in common
– Resource metrics differ (time, power, etc.)
Obtaining the characteristics will differ, e.g.:
– Cycle-accurate simulation + micro-benchmarks for performance
– Cycle-accurate power simulation + micro-benchmarks for power
Mirror the performance approach (a sketch follows), e.g.:
– Early design: estimate core, memory, and communication power
– Later design: cycle-accurate power simulation & refined network/communication power
– Implementation (small scale): measurement possible
– Implementation (large scale): validation of system power
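A minimal sketch of mirroring the performance model with an energy model: attribute a power draw to each modeled time component, then sum P·t. The component power values and the additive form are illustrative assumptions.

```python
# Energy model mirroring the performance model: E = sum(P_i * t_i) plus an
# idle baseline. Power values are placeholders, not measurements.

POWER_W = {"compute": 120.0, "memory": 40.0, "network": 25.0, "idle": 30.0}

def modeled_energy_j(times_s):
    """Energy (J) from modeled per-component busy times plus idle power."""
    total_t = max(times_s.values())   # assume components overlap in time
    active = sum(POWER_W[c] * t for c, t in times_s.items())
    return active + POWER_W["idle"] * total_t

phase_times = {"compute": 5.0, "memory": 1.2, "network": 0.4}
print(f"modeled energy: {modeled_energy_j(phase_times):.0f} J")  # ~808 J
```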
Issues
Level of abstraction for modeling?
– Depends on the definition of system power
– Depends on validation against an existing system
System space to be explored?
– Dimensions in the design space -> parameterization
– Range of the space of interest, and what would a baseline look like?
Tool design and development
Workload (of common interest?)
– Use of many applications
Analysis: design space
– Power budget allocation -> performance/energy optimization
Analysis: dynamic possibilities
– Power steering (a sketch follows)
Analysis: comparison to other possible future systems
Use an iterative design flow
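As a picture of what dynamic power steering could look like, here is a hedged sketch. The node budget, phase classification, allocation fractions, and the set_power_cap hook are all hypothetical.

```python
# Hypothetical power-steering loop: shift a fixed node power budget between
# CPU and memory according to the detected application phase.

NODE_BUDGET_W = 200.0  # assumed per-node power budget

def set_power_cap(component, watts):
    # Placeholder for a real control mechanism (e.g., a firmware interface).
    print(f"cap {component}: {watts:.0f} W")

def steer(phase):
    """Allocate the node budget based on the current phase (assumed splits)."""
    if phase == "compute-bound":
        cpu_w = 0.8 * NODE_BUDGET_W   # favor the cores
    elif phase == "memory-bound":
        cpu_w = 0.5 * NODE_BUDGET_W   # shift budget toward memory
    else:
        cpu_w = 0.65 * NODE_BUDGET_W
    set_power_cap("cpu", cpu_w)
    set_power_cap("memory", NODE_BUDGET_W - cpu_w)

for phase in ("compute-bound", "memory-bound", "communication"):
    steer(phase)
```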
A few general remarks
Modeling applied in practice: system and application design, analysis, prediction, and testing
Modeling is the quantitative tool of co-design
Power, performance, and reliability will be the modeling triad on the path to Exascale
Significant gaps exist in methodology development and practice
Investment needs to accompany system and application development for Exascale
Power/energy is not the sole domain of any one level of the stack – but we need dynamic, quantitative tools