
Challenges and Opportunities in Designing Energy-Efficient High-Performance Computing Platforms


Page 1: Challenges and Opportunities in Designing Energy-Efficient High-Performance Computing Platforms

Chita R. Das
High Performance Computing Laboratory
Department of Computer Science & Engineering
The Pennsylvania State University
EEHiPC, December 19, 2010

Page 2: Talk Outline

• Technology Scaling Challenges
• State-of-the-Art Design Challenges
• Opportunity: Heterogeneous Architectures
  – Technology – 3D, TFET, optics, STT-RAM
  – Processor – new devices, core heterogeneity
  – Memory – STT-RAM, PCM, etc.
  – Interconnect – network heterogeneity
• Conclusions

Page 3: Computing Walls

[Chart: total number of cores per chip (normalized to 2008), projected 2008–2022, following Moore's Law. Data from ITRS 2008.]

Page 4: Computing Walls – Utilization and Power Wall

[Chart: total number of cores per chip (normalized to 2008) vs. the number that can be powered on for 100 watts, 2008–2022; the gap grows to roughly 3x vs. 1x. Data from ITRS 2008.]

P ≈ CV²f: lowering V reduces P, but speed also decreases with V (a small worked example follows this slide).

[Chart: projected supply voltage (V), 2009–2022.]

• High-performance MOS started out with a 12 V supply.
• Current high-performance μPs have a 1 V supply ⇒ (12/1)² = 144x over 28 years.
• Only (1/0.6)² ≈ 2.8x is left in the next 12 years!
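
To make the voltage-scaling arithmetic concrete, here is a minimal Python sketch evaluating P ≈ CV²f at the supply voltages quoted above; the capacitance and frequency values are placeholders, since only the voltage ratio matters for this comparison.

    # Dynamic power P ≈ C * V^2 * f. Only the voltage ratio matters here,
    # so C and f are arbitrary placeholders.
    def dynamic_power(c_farads, v_volts, f_hz):
        return c_farads * v_volts ** 2 * f_hz

    p_12v = dynamic_power(1.0, 12.0, 1.0)   # early high-performance MOS, 12 V
    p_1v  = dynamic_power(1.0, 1.0, 1.0)    # today's ~1 V supply
    p_0v6 = dynamic_power(1.0, 0.6, 1.0)    # projected ~0.6 V floor

    print(p_12v / p_1v)   # 144.0 -> the 144x gained over 28 years
    print(p_1v / p_0v6)   # ~2.78 -> only ~2.8x left in the next 12 years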

Page 5: Computing Walls – Memory Bandwidth

[Chart: total number of package pins, 2008–2022. Data from ITRS 2009.]

Memory bandwidth: pin count increases only about 4x, compared to a 25x increase in cores.

Page 6: Computing Walls – Reliability Wall

[Chart: projected required decrease in failure rate per transistor (normalized to 2007), 2007–2022; the reduction in failure rate per transistor required for a reliable IC. Data from ITRS 2007.]

The failure rate per transistor must decrease exponentially as we go deeper into the nanometer regime.

Page 7: Computing Walls – Wire Delay

[Chart: global and local wire RC delay (picoseconds), 2008–2022.]

Global wires no longer scale.

Page 8: State-of-the-Art in Architecture Design

[Diagram: a notional 16 nm multi-core processor with 25K 64b FPUs, 37.5 TFLOPS, 150 W (compute only), on a 20 mm die:
  • 64b FPU: 0.015 mm², 4 pJ/op, 3 GHz
  • 64b 1 mm on-chip channel: 2 pJ/word
  • 10 mm on-chip wire: 20 pJ, 4 cycles
  • 64b off-chip channel: 64 pJ/word]

• The energy required to move a 64b word across the die is equivalent to the energy for about 10 FLOPs (a back-of-the-envelope check follows this slide).
• Traditional designs have approximately 75% of their energy consumed by overhead.

Performance = Parallelism. Efficiency = Locality.

Source: Bill Harrod, DARPA IPTO, 2009.
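
A back-of-the-envelope check, in Python, of the "one word moved across the die ≈ 10 FLOPs" claim, using only the per-operation energies listed on this slide.

    # Energy figures taken from the slide: 10 mm of on-chip wire moves a
    # 64b word for 20 pJ; a 64b FPU op costs 4 pJ; the die is 20 mm wide.
    FLOP_PJ = 4.0
    WIRE_PJ_PER_MM = 20.0 / 10.0
    DIE_MM = 20.0

    move_across_die_pj = WIRE_PJ_PER_MM * DIE_MM   # 40 pJ
    print(move_across_die_pj / FLOP_PJ)            # 10.0 -> ~10 FLOPs of energy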

Page 9: Energy cost for different operations

  Operation                      Energy (pJ)   DP FLOPs   Insts*
  I$ fetch                            33          0.67      2.0
  Register access                     10.5        0.2       0.6
  Access 3 operands from D$          100          2         6
  Access 3 operands from L2 D$       460          9        27
  Access 3 operands off-chip         762         15        45
  Access 3 operands from DRAM       6000        120       360

* The Insts column gives the average number of instructions that can be performed for this energy.

Energy is dominated by data and instruction movement.
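
The "DP FLOPs" column of the table can be reproduced with a short Python sketch. The per-FLOP energy (~49 pJ) is not stated on the slide; it is inferred from the first row (33 pJ ≈ 0.67 DP FLOPs), so treat it as an assumption.

    # Convert each operation's energy into double-precision-FLOP equivalents.
    # DP_FLOP_PJ is inferred from the table's first row, not given explicitly.
    DP_FLOP_PJ = 33 / 0.67          # ~49 pJ per DP FLOP (assumption)
    ENERGY_PJ = {
        "I$ fetch": 33,
        "register access": 10.5,
        "3-operand D$ access": 100,
        "3-operand L2 D$ access": 460,
        "3-operand off-chip access": 762,
        "3-operand DRAM access": 6000,
    }
    for op, pj in ENERGY_PJ.items():
        print(f"{op}: {pj} pJ ≈ {pj / DP_FLOP_PJ:.1f} DP FLOPs")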

Page 10: Conventional Architecture (90 nm) – Energy is dominated by overhead

[Chart: energy breakdown of a conventional 90 nm architecture across FPU, local, global, off-chip, overhead, and DRAM components; per-operation energies range from roughly 1.0E-10 J to 1.4E-8 J. Source: Dally.]

Page 11: Where is this overhead coming from?

• Complex microarchitecture: OOO execution, register renaming, branch prediction, …
• Complex memory hierarchy
• High/unnecessary data movement
• Orthogonal design style
• Limited understanding of application requirements

Page 12: Both Put Together…

• Power (joules/operation) becomes the deciding factor in designing HPC systems.
• Hardware acquisition cost no longer dominates the total cost of ownership (TCO).

Page 13: IT Infrastructure Optimization: The New Math

• Until now: minimize equipment, software/license, and service/management costs.
• Going forward: power and physical infrastructure costs to house the IT become equally important.
• Become “greener” in the process.

[Chart: worldwide server installed base (M units) and spending (US$B), 1996–2010, split into new server spending, server management cost, and power & cooling costs; power & cooling and management costs are becoming comparable to new server spending. Source: IDC.]

Cumulative consumption: 1 watt consumed at the server cascades to approximately 2.84 watts of total consumption (Source: Emerson):

  Server component                               1.00 W
  + DC-DC conversion (0.18 W)                 →  1.18 W
  + AC-DC conversion (0.31 W)                 →  1.49 W
  + Power distribution (0.04 W)               →  1.53 W
  + UPS (0.14 W)                              →  1.67 W
  + Cooling (1.07 W)                          →  2.74 W
  + Building switchgear/transformer (0.10 W)  →  2.84 W
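
The cascade is just cumulative addition; this small Python sketch, with the stage values copied from the slide, reproduces the 2.84 W total and the resulting burden factor per server watt.

    # Each watt drawn by the server drags along overhead in every stage of the
    # power and cooling chain (values from the Emerson figure on this slide).
    STAGE_OVERHEAD_W = [
        ("DC-DC conversion", 0.18),
        ("AC-DC conversion", 0.31),
        ("Power distribution", 0.04),
        ("UPS", 0.14),
        ("Cooling", 1.07),
        ("Building switchgear/transformer", 0.10),
    ]

    total_w = 1.00                      # 1 W consumed by the server component
    for stage, overhead in STAGE_OVERHEAD_W:
        total_w += overhead
        print(f"after {stage:35s}: {total_w:.2f} W")
    print(f"burden factor: {total_w:.2f}x per server watt")   # ~2.84x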

Page 14: A Holistic Green Strategy

[Diagram: layered view spanning "Greening of Technology" and "Technology for Greening" (Source: A. Sivasubramaniam):
  • Core: servers, storage, networking
  • Support: UPS, power distribution, chillers, lighting, real estate
  • Facilities: operations, office spaces, factories, travel and transportation, energy sourcing, …]

Page 15: Processor Power Efficiency Based on the ExtremeScale Study

• Bill Dally's strawman processor architecture.
• A possible processor design methodology for achieving 28 pJ/FLOP.
• Requires optimization of the communication, computation, and memory components.

[Chart: path from a conventional design (2.5 nJ/FLOP; 631 pJ/FLOP) down to 28 pJ/FLOP by minimizing overhead and minimizing DRAM energy.]

Source: Bill Harrod, DARPA IPTO, 2009.

Page 16: Opportunity – Heterogeneous Architectures

• Multicore era
• Heterogeneous multicore architectures provide the most compelling architectural trajectory for mitigating these problems:
  – Hybrid cores: big, small, accelerators, GPUs
  – Hybrid memory subsystem: SRAM, TFET, STT-RAM
  – Heterogeneous interconnect

Page 17: A Holistic Design Paradigm

• Heterogeneity in devices/circuits
• Heterogeneity in micro-architecture
• Heterogeneity in memory design
• Heterogeneity in interconnect

Page 18: Technology Heterogeneity

• Heterogeneity in technology:
  – CMOS-based scaling is expected to continue until 2022.
  – Exploiting emerging technologies to design different cores/components is promising because it enables cores with power/performance trade-offs that were not possible before.

[Chart: V/F scaling of CMOS and TFET devices. TFETs provide higher performance than CMOS-based designs at lower voltages.]

Page 19: Processor Cores – Heterogeneous Compute Nodes

[Diagram: heterogeneous compute nodes pairing core types with workload classes:
  • Big cores – latency-critical
  • Small cores – throughput-critical
  • GPGPUs – bandwidth-critical
  • Accelerators/ASICs – latency-/time-critical]

Page 20: Memory Architecture

• Role of novel technologies in memory systems

[Table: comparison of memory technologies.]

Page 21: Heterogeneous Interconnect

[Charts: buffer utilization and link utilization across the network.]

Non-uniformity comes from the non-edge-symmetric network and X-Y routing. So:
• Why clock all routers at the same frequency? Variable-frequency routers for designing NoCs.
• Why allocate all routers similar area/buffer/link resources? Heterogeneous routers/NoCs.

Page 22: Software Support

• Compiler support
  – Thread remapping to minimize power: migrate threads to TFET cores to reduce power.
  – Dynamic instruction morphing: the runtime system morphs a thread's instructions to match the heterogeneous hardware the thread is mapped to.
• OS support
  – Heterogeneity-aware scheduling support (a toy sketch follows this slide).
  – Run-time thread migration support.
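
As a rough illustration of heterogeneity-aware scheduling (in Python; not the specific policy from this work), the sketch below routes latency-critical, compute-bound threads to big CMOS cores and everything else to low-voltage TFET cores. The Thread attributes, thresholds, and core names are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Thread:
        name: str
        latency_critical: bool   # e.g. interactive or on the critical path
        memory_bound: bool       # stalls on memory, gains little from a fast core

    def assign_core(t: Thread) -> str:
        """Toy heterogeneity-aware policy: keep latency-critical, compute-bound
        threads on big CMOS cores; push the rest to low-voltage TFET cores."""
        if t.latency_critical and not t.memory_bound:
            return "big CMOS core"
        return "TFET core"

    for t in [Thread("query", True, False), Thread("scan", False, True)]:
        print(t.name, "->", assign_core(t))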

Page 23: Current Research in HPCL – Problems with Current NoCs

• NoC power consumption is a concern today.

[Chart: Intel 80-core tile power profile¹ – dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%.]

• With technology scaling, NoC power can be as high as 40-60 W for 128 nodes.²

1. "A 5-GHz Mesh Interconnect for a Teraflops Processor," Y. Hoskote, S. Vangal, A. Singh, N. Borkar, S. Borkar, IEEE MICRO, 2007.
2. "Networks for Multi-core Chips: A Contrarian View," S. Borkar, Special Session at ISLPED 2007.

Page 24: Network Performance/Power

[Chart: normalized power and normalized latency vs. injection ratio (flits/node/cycle), roughly 0.01 to 0.40.]

Observation:
• At low load: low power consumption.
• At high load: high power consumption and congestion.

The proposed approach¹:
• At low load: optimize for performance (reduce zero-load latency and accelerate flits).
• At high load: manage congestion and power.

1. "A Case for Dynamic Frequency Tuning in On-Chip Networks," MICRO 2009.

Page 25: Frequency Tuning Rationale

[Diagram: frequency tuning along a path of routers. When a router becomes congested, the upstream router throttles, i.e. its frequency is lowered, depending upon its total buffer utilization; uncongested routers see no change or have their frequency boosted.]
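
A minimal Python sketch of a RAFT-style frequency tuning rule driven by buffer utilization; the thresholds, step size, and frequency bounds are illustrative assumptions, not the values from the MICRO 2009 paper.

    def tune_frequency(freq_ghz: float, buffer_utilization: float,
                       f_min: float = 1.0, f_max: float = 4.0) -> float:
        """Toy RAFT-style rule: throttle a router feeding a congested neighbor,
        boost an underutilized one. Thresholds and step size are illustrative."""
        HIGH, LOW, STEP = 0.75, 0.25, 0.5
        if buffer_utilization > HIGH:        # congestion building up downstream
            return max(f_min, freq_ghz - STEP)
        if buffer_utilization < LOW:         # light load: accelerate flits
            return min(f_max, freq_ghz + STEP)
        return freq_ghz                      # moderate load: no change

    print(tune_frequency(3.0, 0.9))  # -> 2.5 (throttle)
    print(tune_frequency(3.0, 0.1))  # -> 3.5 (boost)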

Page 26: Performance/Power Improvement with RAFT

[Chart: latency (ns) vs. injection ratio (flits/node/cycle) under uniform random (UR) traffic for BaseCase, FreqThrtl, FreqBoost, and FreqTune.]

• FreqBoost at low load (optimize performance); FreqThrtl at high load (optimize performance and power).
• FreqTune gives both power reduction and throughput improvement: 36% reduction in latency, 31% increase in throughput, and 14% power reduction across all traffic patterns.

Page 27: A Case for Heterogeneous NoCs

• Using the same amount of link resources and fewer buffer resources than a homogeneous network, this proposal demonstrates that a carefully designed heterogeneous network can reduce average latency, improve network throughput, and reduce power.
• Explore the types, number, and placement of heterogeneous routers in the network.

[Diagram: mesh with small and big routers connected by narrow and wide links.]

Page 28: HeteroNoC Performance-Power Envelope

[Chart: normalized EDP, latency, and power ratios for configurations labeled 192, 128, 256, and HeteroNoCs.]

• 22% throughput improvement
• 25% latency reduction
• 28% power reduction

Page 29: 3D Stacking = Increased Locality!

Many more neighbors within a few minutes' reach!

Page 30: Reduced Global Interconnect Length

[Diagram: reduced global interconnect length with 3D stacking.]

• Delay/power reduction
• Bandwidth increase
• Smaller footprint
• Mixed-technology integration

Page 31: 3D Routers for 3D Networks

• One router in each grid tile (total area = 4L²)
• Stack layers in 3D (total area = L²)
• Stack router components in 3D (total area = L²)

Results from "MIRA: A Multi-layered On-Chip Interconnect Router Architecture," ISCA 2008.

Page 32: Conclusions

• We need a coherent approach to address the sub-micron technology problems in designing energy-efficient HPC systems.
• Heterogeneous multicores can address these problems and are likely to be the future architecture trajectory.
• But the design of such systems is extremely complex.
• It needs an integrated technology-hardware-software-application approach.

Page 33: HPCL Collaborators

Faculty: Vijaykrishnan Narayanan, Yuan Xie, Anand Sivasubramaniam, Mahmut Kandemir

Students: Sueng-Hwan Lim, Bikash Sharma, Adwait Jog, Asit Mishra, Reetuparna Das, Dongkook Park, Jongman Kim

Partially supported by: NSF, DARPA, DOE, Intel, IBM, HP, Google, Samsung

Page 34:

THANK YOU!

Questions?