Unexplored energy aspects of scalable heterogeneous computing systems
Holger Fröning Computer Engineering Group
Institute of Computer Engineering Ruprecht-Karls University of Heidelberg
HiPEAC Computing Systems Week, Milano, IT, 23.09.2015
Abstract
Concurrency galore currently defines computing at all levels, leading to a vast amount of parallelism even in small computing systems. Technology constraints prohibit a reversal of this trend, and the still unsatisfied need for more computing power has led to a pervasive use of accelerators to speed up computations. While this helped to overcome the end of Dennard scaling and significantly increased the energy efficiency of computations, communication is left unaddressed. However, current analyses show that a large fraction of power consumption originates from communication, and this fraction will actually increase for future technologies, making communication more expensive than computation in terms of both energy and time. Amplified by the advent of Big Data, we are observing a fundamental transition to communication-centric systems composed of heterogeneous computing units. This talk reviews some of our explorations and optimizations in the area of energy-efficient computing. While cooling, power distribution, specialized processors and other aspects have already received plenty of attention, we focus instead on areas that are as yet unexplored but, in our opinion, also contribute significantly to overall power and energy consumption. The talk concludes with some observations about the impact of technology trends on energy and anticipated related research questions.
Post-Dennard performance scaling: the good part
• Multi-/many-core trend won't revert
  • End of Dennard scaling
  • Technological constraints
  • Single-threaded execution model obsolete
• Technology diversity is pervasive
  • Various processors – x86, ARM, GPUs, MICs, FPGAs, …
  • Storage – DRAM, FLASH, SSD, spinning disks, …
  • Interconnect – {10,40,100}GE, iWARP, IB, …
  • Systems – IBM Blue Gene, Cray Blue Waters, …
  • Amazon EC2 – on-demand GPUs
  • Upcoming technologies: PCM, photonics, die stacking, …
Post-Dennard performance scaling: the bleak part
Kathy Yelick, 2009
Power and Energy
• Energy consumption
  • Technological, economic, and ecological constraints
  • OPEX and CAPEX
  • Not only affects Exascale
  • 20MW = ~20M$/yr
• You should know when you're off- or on-die
  • Note on optical interconnects
• Data movement is more expensive than computation
US DOE, Scientific Grand Challenges: Architectures and Technology for Extreme Scale Computing, San Diego, CA, 2009.
Exascale trend
• Who is still sure about that 20MW number?
• DOE goal
  • 20MW = 50 GFLOPs/Watt
  • Sustained, not theoretical peak
  • Best number today is approx. 30 GFLOPs/Watt (peak)
  • Sustained at scale: 5 GFLOPs/Watt
• Human brain
  • 20W
  • ~1E06x more energy-efficient
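The 20MW goal and the 50 GFLOPs/Watt figure above are two views of the same budget; a minimal sketch, assuming an exact 1 EFLOP/s sustained target:

```python
# DOE Exascale goal: 1 EFLOP/s sustained within a 20MW power budget.
EXAFLOPS = 1e18          # FLOP/s
POWER_BUDGET = 20e6      # Watt

efficiency = EXAFLOPS / POWER_BUDGET       # FLOP/s per Watt
print(efficiency / 1e9)                    # -> 50.0 GFLOPs/Watt

# Equivalently, the energy budget per floating-point operation:
print(round(1 / efficiency * 1e12))        # -> 20 pJ/FLOP
```

That 20 pJ/FLOP figure is the anchor for the per-bit network budgets derived later in the talk.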
Year  Presenter        Statement
?     (forgot)         <10MW
2010  Craig Stunkel    <20-25MW
2010  Bill Dally       15-20MW
2011  Kathy Yelick     20MW
2012  William Harrod   20MW
2013  Horst Simon      20-30MW
2015  Keren Bergman    max. 100MW

The question is not when we will reach Exascale, but when Exascale will come within 20MW.
(Another) fundamental transition
Fundamental transition to communication-centric systems composed of heterogeneous computing units

Drivers: the multi-/many-core revolution in combination with Big Data, energy constraints, technology diversity, the need for specialization, and improved sharing.
Energy in scalable interconnection networks
Does it matter after all?
• Pitfall: don't make assumptions based on maximum power ratings
  • At TDP, processors outshine anything
  • But are processors always operating at 100% load?
• Energy-proportional: at x% load, a component should consume only x% of its peak power
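The energy-proportionality idea can be made concrete with a minimal sketch; the peak power and the 60% idle floor below are illustrative assumptions, not measurements from the talk:

```python
PEAK_W = 100.0  # illustrative peak power of some component

def proportional_power(load):
    """Ideal energy-proportional component: power tracks load (load in [0, 1])."""
    return PEAK_W * load

def typical_power(load, idle_frac=0.6):
    """Non-proportional component: fixed idle floor plus a load-dependent part."""
    return PEAK_W * idle_frac + PEAK_W * (1 - idle_frac) * load

print(proportional_power(0.2))   # -> 20.0 W at 20% load
print(typical_power(0.2))        # -> 68.0 W at 20% load
```

At low load, the gap between the two curves is exactly the energy-proportionality deficit the following slides argue networks must close.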
Quote: "It's a myth that interconnect power is important" (commercial company, panel presentation, mid-2015)
[Chart: component power share at TDP – CPUs, GPU, memory, network]
It does!
• System power
  • Scalable energy-efficient network
  • Direct network, integrated switches
• Dynamic range of components
• Many memory-bound applications
  • E.g., emerging integer applications (R. Murphy, Sandia) and graph computations: DFS & BFS, connected components, isomorphism, shortest path, graph partitioning, BLAST (alignment search), zChaff (satisfiability)
• Exception: compute-bound applications with perfect overlap
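The "perfect overlap" exception can be sketched with a back-of-the-envelope check: communication only disappears from the power picture if the compute time per block covers the transfer time. The throughput figures below are illustrative assumptions:

```python
def overlaps(flops, bytes_moved, gflops=1000.0, gb_per_s=10.0):
    """True if computation is long enough to hide the data transfer."""
    compute_s = flops / (gflops * 1e9)           # time spent computing
    transfer_s = bytes_moved / (gb_per_s * 1e9)  # time spent moving data
    return compute_s >= transfer_s

# Compute-bound kernel: many FLOPs per byte, transfers hide behind compute.
print(overlaps(flops=1e9, bytes_moved=1e6))   # -> True
# Graph-traversal-like kernel (~1 B/FLOP): the network stays busy.
print(overlaps(flops=1e6, bytes_moved=1e6))   # -> False
```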
[Chart: component power share at TDP vs. idle – CPUs, GPU, memory, network]
It does! (Seconding opinions)
• We need energy-proportional components
  • Processors have already improved significantly
• Lesson learned from embedded systems: everything matters
Dennis Abts, Michael R. Marty, Philip M. Wells, Peter Klausler, and Hong Liu. 2010. Energy proportional datacenter networks. ISCA ’10
S. Rumley et al., "Design Methodology for Optimizing Optical Interconnection Networks in High Performance Systems", ISC 2015.
• Google paper on energy-proportional networks: up to 50% savings on network power; at 32k nodes, 1.1MW for a folded Clos vs. 0.7MW for a flattened butterfly
• S. Rumley et al. (ISC 2015): networks continue to consume ~20% of system power even with optical links
• DOE report on Top 10 Exascale Challenges: "Interconnect technology: Increasing the performance and energy efficiency of data movement"
A short analysis of application/system verbosity
• Verbosity (B/FLOP): inverse of operational intensity (FLOP/B)
• Inspired by Keren Bergman: Optical Interconnection Networks for Ultra-High Bandwidth Energy Efficient Data Movement in HPC, ISC 2015 session on On-Chip & Off-Chip Interconnection Networks for Future HPC Systems
• Examples today: 18pJ/bit (PCIe Gen3), ~25-30pJ/bit (electrical link), 10pJ/bit (optical cable, on top)
  • None of these examples is energy-proportional!
Power budget [MWatt]            100     50      20
Energy efficiency [GFLOPs/J]     10     20      50
Energy per FLOP [pJ]            100     50      20
Network power (20%) [MWatt]      20     10       4
Network budget per FLOP [pJ]     20     10       4

Verbosity [B/FLOP]   Network budget per bit [pJ/bit]
1.000                   2.5     1.3     0.5   (Amdahl's rule)
0.100                  25.0    12.5     5.0   (anticipated case, Sequoia)
0.017                 147.1    73.5    29.4   (Titan)
0.001                2500.0  1250.0   500.0   (Tianhe-2)
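The table above follows from the power budget, the assumed 20% network share, and the application's verbosity; a minimal sketch reproducing its entries (the function name is ours):

```python
def network_budget_per_bit(power_budget_mw, verbosity, net_share=0.20, exaflops=1e18):
    """pJ/bit available to the network at 1 EFLOP/s for a given verbosity (B/FLOP)."""
    energy_per_flop = power_budget_mw * 1e6 / exaflops   # J/FLOP overall
    net_per_flop = energy_per_flop * net_share           # J/FLOP for the network
    bits_per_flop = verbosity * 8                        # B/FLOP -> bit/FLOP
    return net_per_flop / bits_per_flop * 1e12           # J -> pJ

print(round(network_budget_per_bit(20, 1.000), 1))   # -> 0.5   (Amdahl's rule)
print(round(network_budget_per_bit(100, 0.017), 1))  # -> 147.1 (Titan)
print(round(network_budget_per_bit(20, 0.001), 1))   # -> 500.0 (Tianhe-2)
```

Comparing these budgets against the 18-30pJ/bit of today's links makes the gap explicit: at a Sequoia-like verbosity of 0.1 B/FLOP, the 20MW budget allows only 5pJ/bit.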
Where do my Joules go?
• Serialization technology dominates power consumption
  • Clock recovery, high frequencies, equalization, pre-emphasis, …
[Chart: power share for a NIC with integrated switch – links (6): 71%, PCIe: 14%, core: 15%]
[Chart: link power scaling – normalized power consumption for 4, 8, and 12 lanes at 2.5, 5, and 10 GHz]
• It is link width that matters, not frequency
  • CML = Current Mode Logic
  • Linear scaling for the 10GHz case
• This leads us to many research thrusts!
Research thrust 1: specialized communication
Beyond CPU-centric communication
[Diagram: source and target nodes, each with CPU, GPU, and NIC attached to the PCIe root, plus host memory and GPU memory. Annotations: 100x; start-up latency of 1.5usec; start-up latency of 15usec; GPU-controlled Put/Get (IB Verbs)]
“… a bad semantic match between communication primitives required by the application and those provided by the network.” - DOE Subcommittee Report, Top Ten Exascale Research Challenges. 02/10/2014
Allreduce – Power and Energy analysis
Lena Oden, Benjamin Klenk and Holger Fröning, Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs, 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2014), May 26-29, 2014, Chicago, IL, US.
For this case: saved 50% of the energy
[Charts: performance normalized to MPI (GGAS vs. RMA) across the benchmarks nbody_small, nbody_large, sum_small, sum_large, himeno, randomAccess; and energy consumption normalized to MPI for the same benchmarks plus the average]
Towards specialized communication models for heterogeneous systems
• 12 nodes (each with 2x Intel Ivy Bridge, an Nvidia K20, and an EXTOLL FPGA)
• Normalized to MPI: >1 means better performance, <1 worse
• We are currently working on bringing this solution to a wider user community
Benjamin Klenk, Lena Oden, Holger Fröning, Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time, 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.
Research thrust 2: Integrated power models
Integrated Power Model
PCIe power
• Power is all about serialization, remember?
  • PCIe for HPC is typically 16x; communication is bulky
  • Nodes will grow wider with more memory and cores
• Newer PCIe 3.0 & 4.0 systems support L0s "standby" states
  • Active State Power Management (ASPM)
• Experimental analysis yields frustrating results => an alternative methodology is required
Jeffrey Young, Richard Vuduc, The Near-Term Implications of Network Low-Power States and Next-Generation Interconnects on Power Modeling, Workshop on Modeling and Simulation of Systems and Applications (MODSIM), 2015.
Extending network simulation for power
• Power-aware network simulation by extending OMNeT++
• Power states for each link
  • Electrical and optical links
• State selection logic in front of each link
  • Various policies possible; transition time matters!
• Using traces, not synthetic traffic
• Joint effort with Pedro Garcia et al. (UCLM) & Jeff Young (GT)
• InfiniBand, Ethernet, EXTOLL, PCIe
• Interested? Talk to us!
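The state selection logic in front of each link can be sketched as a tiny state machine; the timeout policy, state names, and numbers below are illustrative assumptions, not the actual simulator extension:

```python
ACTIVE, LOW_POWER = "active", "low-power"

class Link:
    """A link with a simple timeout-based power-state selection policy."""
    def __init__(self, idle_timeout=1.0, wake_delay=0.5):
        self.state = ACTIVE
        self.idle_timeout = idle_timeout    # idle time before sleeping
        self.wake_delay = wake_delay        # penalty to leave low-power
        self.last_activity = 0.0

    def transmit(self, now):
        """Send a packet; return the extra delay paid for power management."""
        penalty = self.wake_delay if self.state == LOW_POWER else 0.0
        self.state = ACTIVE
        self.last_activity = now
        return penalty

    def tick(self, now):
        """State selection: drop to low-power after a long enough idle gap."""
        if self.state == ACTIVE and now - self.last_activity >= self.idle_timeout:
            self.state = LOW_POWER

link = Link()
print(link.transmit(0.0))   # -> 0.0, link already active
link.tick(2.0)              # idle for 2.0 > timeout: enter low-power
print(link.state)           # -> low-power
print(link.transmit(2.0))   # -> 0.5, pay the wake-up delay
```

The trade-off the slide points at is visible even here: a short timeout saves idle power but makes bursty trace traffic pay the wake-up delay repeatedly.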
[Chart: link power scaling – normalized power consumption for 4, 8, and 12 lanes at 2.5, 5, and 10 GHz]
Research questions
• Dynamic range - which power-state granularity is required?
  • It seems that wide links provide better opportunities
• State transitioning - how much time can we tolerate?
  • Reported numbers vary from O(1)ns over 10us to milliseconds
  • Can we buffer or predict traffic well enough?
• Compensating potential oversubscription - which techniques do we need to tolerate congestion?
  • Traditional congestion management? Path diversity? Adaptive routing?
• Predictability - what are the key characteristics to predict power consumption?
  • Can I tell power consumption from looking at the code?
  • Simulation or modeling? Performance modeling languages like ASPEN seem very promising
Conclusion
Past? Future? History usually repeats.

Solves power for processing, not necessarily power for data movement.
Summary
• Exascale networks
  • Performance is mainly a question of cost (IMHO)
  • Resilience is challenging!
  • Power contribution is key for sustainable Exascale computing
• We anticipate huge efforts to improve energy efficiency for processing and memory
  • Heterogeneity, processing-in-memory (PIM)
  • => The power fraction of the network increases!
• Don't assume that all HPC workloads will result in high loads
  • Memory-bound applications, graph traversals and computations, …
• Energy-proportional networks matter!
• In general: it will be all about data movement; computation will be almost free
Credits
Contributions: Lena Oden (former PhD student), Benjamin Klenk (PhD student), Alexander Matz (PhD student), Felix Zahn (PhD student)
Discussions: Sudha Yalamanchili (Georgia Tech), Jeff Young (Georgia Tech), Pedro Garcia et al. (UCLM), Maximilian Thürmer & Markus Müller (Heidelberg University)
Sponsoring: Nvidia, Xilinx, German Excellence Initiative, Google

Current main interactions
Thank you!