Unexplored energy aspects of scalable heterogeneous computing systems
Holger Fröning Computer Engineering Group
Institute of Computer Engineering Ruprecht-Karls University of Heidelberg
HiPEAC Computing Systems Week, Milano, IT, 23.09.2015
Abstract
Concurrency galore currently defines computing at all levels, leading to a vast amount of parallelism even in small computing systems. Technology constraints prohibit a reversal of this trend, and the still unsatisfied need for more computing power has led to a pervasive use of accelerators to speed up computations. While this helped to overcome the end of Dennard scaling and significantly increased the energy efficiency of computations, communication is left unaddressed. However, current analyses show that a large fraction of power consumption originates from communication, and this fraction will actually increase for future technologies, making communication more expensive than computation in terms of both energy and time. Amplified by the advent of Big Data, we are observing a fundamental transition to communication-centric systems composed of heterogeneous computing units. This talk reviews some of our explorations and optimizations in the area of energy-efficient computing. While cooling, power distribution, specialized processors and other aspects have already received plenty of attention, we focus instead on areas that are as yet unexplored but, in our opinion, also contribute significantly to overall power and energy consumption. The talk concludes with some observations about the impact of technology trends on energy and anticipated related research questions.
Post-Dennard performance scaling: the good part
• Multi-/many-core trend won't revert
  • End of Dennard scaling
  • Technological constraints
  • Single-threaded execution model obsolete
• Technology diversity is pervasive
  • Various processors – x86, ARM, GPUs, MICs, FPGAs, …
  • Storage – DRAM, FLASH, SSD, spinning disks, …
  • Interconnect – {10,40,100}GE, iWARP, IB, …
  • Systems – IBM Blue Gene, Cray Blue Waters, …
  • Amazon EC2 – on-demand GPUs
  • Upcoming technologies: PCM, photonics, die stacking, …
Post-Dennard performance scaling: the bleak part
Kathy Yelick, 2009
Power and Energy
• Energy consumption
  • Technological, economic, and ecological constraints
  • OPEX and CAPEX
  • Not only affects Exascale
  • 20MW = ~20M$/yr
• You should know when you're off- or on-die
  • Note on optical interconnects
• Data movement is more expensive than computation
US DOE, Scientific Grand Challenges: Architectures and Technology for Extreme Scale Computing, San Diego, CA, 2009.
Exascale trend
• Who is still sure about that 20MW number?
• DOE goal
  • 20MW = 50 GFLOPs/Watt
  • Sustained, not theoretical peak
  • Best number today is approx. 30 GFLOPs/Watt (peak)
  • Sustained at scale: 5 GFLOPs/Watt
• Human brain
  • 20W
  • ~1E06x more energy-efficient
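The 20MW goal and the 50 GFLOPs/Watt figure above are two views of the same budget; a minimal sketch, assuming an exact 1 EFLOP/s sustained target:

```python
# DOE Exascale goal: 1 EFLOP/s sustained within a 20MW power budget.
EXAFLOPS = 1e18          # FLOP/s
POWER_BUDGET = 20e6      # Watt

efficiency = EXAFLOPS / POWER_BUDGET       # FLOP/s per Watt
print(efficiency / 1e9)                    # -> 50.0 GFLOPs/Watt

# Equivalently, the energy budget per floating-point operation:
print(round(1 / efficiency * 1e12))        # -> 20 pJ/FLOP
```

That 20 pJ/FLOP figure is the anchor for the per-bit network budgets derived later in the talk.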
Year  Presenter        Statement
?     (forgot)         <10MW
2010  Craig Stunkel    <20-25MW
2010  Bill Dally       15-20MW
2011  Kathy Yelick     20MW
2012  William Harrod   20MW
2013  Horst Simon      20-30MW
2015  Keren Bergman    max. 100MW

The question is not when we will reach Exascale, but when Exascale will come within 20MW.
(Another) fundamental transition
Fundamental transition to communication-centric systems composed of heterogeneous computing units

Drivers: the multi-/many-core revolution in combination with Big Data, energy constraints, technology diversity, the need for specialization, and improved sharing.
Energy in scalable interconnection networks
Does it matter after all?
• Pitfall: don't make assumptions based on maximum power ratings
  • At TDP, processors outshine anything
  • But are processors always operating at 100% load?
• Energy-proportional: at x% load, a component should consume only x% of its peak power
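The energy-proportionality idea can be made concrete with a minimal sketch; the peak power and the 60% idle floor below are illustrative assumptions, not measurements from the talk:

```python
PEAK_W = 100.0  # illustrative peak power of some component

def proportional_power(load):
    """Ideal energy-proportional component: power tracks load (load in [0, 1])."""
    return PEAK_W * load

def typical_power(load, idle_frac=0.6):
    """Non-proportional component: fixed idle floor plus a load-dependent part."""
    return PEAK_W * idle_frac + PEAK_W * (1 - idle_frac) * load

print(proportional_power(0.2))   # -> 20.0 W at 20% load
print(typical_power(0.2))        # -> 68.0 W at 20% load
```

At low load, the gap between the two curves is exactly the energy-proportionality deficit the following slides argue networks must close.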
Quote: "It's a myth that interconnect power is important" (commercial company, panel presentation, mid-2015)
[Chart: component power share at TDP – CPUs, GPU, memory, network]
It does!
• System power
  • Scalable energy-efficient network
  • Direct network, integrated switches
• Dynamic range of components
• Many memory-bound applications
  • E.g., emerging integer applications (R. Murphy, Sandia) and graph computations: DFS & BFS, connected components, isomorphism, shortest path, graph partitioning, BLAST (alignment search), zChaff (satisfiability)
• Exception: compute-bound applications with perfect overlap
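The "perfect overlap" exception can be sketched with a back-of-the-envelope check: communication only disappears from the power picture if the compute time per block covers the transfer time. The throughput figures below are illustrative assumptions:

```python
def overlaps(flops, bytes_moved, gflops=1000.0, gb_per_s=10.0):
    """True if computation is long enough to hide the data transfer."""
    compute_s = flops / (gflops * 1e9)           # time spent computing
    transfer_s = bytes_moved / (gb_per_s * 1e9)  # time spent moving data
    return compute_s >= transfer_s

# Compute-bound kernel: many FLOPs per byte, transfers hide behind compute.
print(overlaps(flops=1e9, bytes_moved=1e6))   # -> True
# Graph-traversal-like kernel (~1 B/FLOP): the network stays busy.
print(overlaps(flops=1e6, bytes_moved=1e6))   # -> False
```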
[Chart: component power share at TDP vs. idle – CPUs, GPU, memory, network]
It does! (Seconding opinions)
• We need energy-proportional components
  • Processors have already improved significantly
• Lesson learned from embedded systems: everything matters
Dennis Abts, Michael R. Marty, Philip M. Wells, Peter Klausler, and Hong Liu. 2010. Energy proportional datacenter networks. ISCA ’10
S. Rumley et al., "Design Methodology for Optimizing Optical Interconnection Networks in High Performance Systems", ISC 2015.
• Google paper on energy-proportional networks: up to 50% savings on network power; at 32k nodes, 1.1MW for a folded Clos vs. 0.7MW for a flattened butterfly
• S. Rumley et al. (ISC 2015): networks continue to consume ~20% of system power even with optical links
• DOE report on Top 10 Exascale Challenges: "Interconnect technology: Increasing the performance and energy efficiency of data movement"
A short analysis of application/system verbosity
• Verbosity (B/FLOP): inverse of operational intensity (FLOP/B)
• Inspired by Keren Bergman: Optical Interconnection Networks for Ultra-High Bandwidth Energy Efficient Data Movement in HPC, ISC 2015 session on On-Chip & Off-Chip Interconnection Networks for Future HPC Systems
• Examples today: 18pJ/bit (PCIe Gen3), ~25-30pJ/bit (electrical link), 10pJ/bit (optical cable, on top)
  • None of these examples is energy-proportional!
Power budget [MWatt]            100     50      20
Energy efficiency [GFLOPs/J]     10     20      50
Energy per FLOP [pJ]            100     50      20
Network power (20%) [MWatt]      20     10       4
Network budget per FLOP [pJ]     20     10       4

Verbosity [B/FLOP]   Network budget per bit [pJ/bit]
1.000                   2.5     1.3     0.5   (Amdahl's rule)
0.100                  25.0    12.5     5.0   (anticipated case, Sequoia)
0.017                 147.1    73.5    29.4   (Titan)
0.001                2500.0  1250.0   500.0   (Tianhe-2)
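The table above follows from the power budget, the assumed 20% network share, and the application's verbosity; a minimal sketch reproducing its entries (the function name is ours):

```python
def network_budget_per_bit(power_budget_mw, verbosity, net_share=0.20, exaflops=1e18):
    """pJ/bit available to the network at 1 EFLOP/s for a given verbosity (B/FLOP)."""
    energy_per_flop = power_budget_mw * 1e6 / exaflops   # J/FLOP overall
    net_per_flop = energy_per_flop * net_share           # J/FLOP for the network
    bits_per_flop = verbosity * 8                        # B/FLOP -> bit/FLOP
    return net_per_flop / bits_per_flop * 1e12           # J -> pJ

print(round(network_budget_per_bit(20, 1.000), 1))   # -> 0.5   (Amdahl's rule)
print(round(network_budget_per_bit(100, 0.017), 1))  # -> 147.1 (Titan)
print(round(network_budget_per_bit(20, 0.001), 1))   # -> 500.0 (Tianhe-2)
```

Comparing these budgets against the 18-30pJ/bit of today's links makes the gap explicit: at a Sequoia-like verbosity of 0.1 B/FLOP, the 20MW budget allows only 5pJ/bit.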
Where do my Joules go?
• Serialization technology dominates power consumption
  • Clock recovery, high frequencies, equalization, pre-emphasis, …
[Chart: power share for a NIC with integrated switch – links (6): 71%, PCIe: 14%, core: 15%]
[Chart: link power scaling – normalized power consumption for 4, 8, and 12 lanes at 2.5, 5, and 10 GHz]
• It is link width that matters, not frequency
  • CML = Current Mode Logic
  • Linear scaling for the 10GHz case
• This leads us to many research thrusts!
Research thrust 1: specialized communication
Beyond CPU-centric communication
[Diagram: source and target nodes, each with CPU, GPU, and NIC attached to the PCIe root, plus host memory and GPU memory. Annotations: 100x; start-up latency of 1.5usec; start-up latency of 15usec; GPU-controlled Put/Get (IB Verbs)]
“… a bad semantic match between communication primitives required by the application and those provided by the network.” - DOE Subcommittee Report, Top Ten Exascale Research Challenges. 02/10/2014
Allreduce – Power and Energy analysis
Lena Oden, Benjamin Klenk and Holger Fröning, Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs, 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2014), May 26-29, 2014, Chicago, IL, US.
For this case: saved 50% of the energy
[Charts: performance normalized to MPI (GGAS vs. RMA) across the benchmarks nbody_small, nbody_large, sum_small, sum_large, himeno, randomAccess; and energy consumption normalized to MPI for the same benchmarks plus the average]
Towards specialized communication models for heterogeneous systems
• 12 nodes (each with 2x Intel Ivy Bridge, an Nvidia K20, and an EXTOLL FPGA)
• Normalized to MPI: >1 means better performance, <1 worse
• We are currently working on bringing this solution to a wider user community
Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.
Research thrust 2: Integrated power models
Integrated Power Model
PCIe power
• Power is all about serialization, remember?
  • PCIe for HPC is typically 16x; communication is bulky
  • Nodes will grow wider with more memory and cores
• Newer PCIe 3.0 & 4.0 systems support L0s "standby" states
  • Active State Power Management (ASPM)
• Experimental analysis yields frustrating results => an alternative methodology is required
Jeffrey Young, Richard Vuduc, The Near-Term Implications of Network Low-Power States and Next-Generation Interconnects on Power Modeling, Workshop on Modeling and Simulation of Systems and Applications (MODSIM), 2015.
Extending network simulation for power
• Power-aware network simulation by extending OMNeT++
• Power states for each link
  • Electrical and optical links
• State selection logic in front of each link
  • Various policies possible; transition time matters!
• Using traces, not synthetic traffic
• Joint effort with Pedro Garcia et al. (UCLM) & Jeff Young (GT)
• InfiniBand, Ethernet, EXTOLL, PCIe
• Interested? Talk to us!
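The state selection logic in front of each link can be sketched as a tiny state machine; the timeout policy, state names, and numbers below are illustrative assumptions, not the actual simulator extension:

```python
ACTIVE, LOW_POWER = "active", "low-power"

class Link:
    """A link with a simple timeout-based power-state selection policy."""
    def __init__(self, idle_timeout=1.0, wake_delay=0.5):
        self.state = ACTIVE
        self.idle_timeout = idle_timeout    # idle time before sleeping
        self.wake_delay = wake_delay        # penalty to leave low-power
        self.last_activity = 0.0

    def transmit(self, now):
        """Send a packet; return the extra delay paid for power management."""
        penalty = self.wake_delay if self.state == LOW_POWER else 0.0
        self.state = ACTIVE
        self.last_activity = now
        return penalty

    def tick(self, now):
        """State selection: drop to low-power after a long enough idle gap."""
        if self.state == ACTIVE and now - self.last_activity >= self.idle_timeout:
            self.state = LOW_POWER

link = Link()
print(link.transmit(0.0))   # -> 0.0, link already active
link.tick(2.0)              # idle for 2.0 > timeout: enter low-power
print(link.state)           # -> low-power
print(link.transmit(2.0))   # -> 0.5, pay the wake-up delay
```

The trade-off the slide points at is visible even here: a short timeout saves idle power but makes bursty trace traffic pay the wake-up delay repeatedly.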
[Chart: link power scaling – normalized power consumption for 4, 8, and 12 lanes at 2.5, 5, and 10 GHz]
Research questions
• Dynamic range - which power-state granularity is required?
  • It seems that wide links provide better opportunities
• State transitioning - how much time can we tolerate?
  • Reported numbers vary from O(1)ns over 10us to milliseconds
  • Can we buffer or predict traffic well enough?
• Compensating potential oversubscription - which techniques do we need to tolerate congestion?
  • Traditional congestion management? Path diversity? Adaptive routing?
• Predictability - what are the key characteristics to predict power consumption?
  • Can I tell power consumption from looking at the code?
  • Simulation or modeling? Performance modeling languages like ASPEN seem very promising
Conclusion
Past? Future? History usually repeats.

Solves power for processing, not necessarily power for data movement.
Summary
• Exascale networks
  • Performance is mainly a question of cost (IMHO)
  • Resilience is challenging!
  • Power contribution is key for sustainable Exascale computing
• We anticipate huge efforts to improve energy efficiency for processing and memory
  • Heterogeneity, processing-in-memory (PIM)
  • => The power fraction of the network increases!
• Don't assume that all HPC workloads will result in high loads
  • Memory-bound applications, graph traversals and computations, …
• Energy-proportional networks matter!
• In general: it will be all about data movement; computation will be almost free
Credits
Contributions: Lena Oden (former PhD student), Benjamin Klenk (PhD student), Alexander Matz (PhD student), Felix Zahn (PhD student)
Discussions: Sudha Yalamanchili (Georgia Tech), Jeff Young (Georgia Tech), Pedro Garcia et al. (UCLM), Maximilian Thürmer & Markus Müller (Heidelberg University)
Sponsoring: Nvidia, Xilinx, German Excellence Initiative, Google

Current main interactions
Thank you!