
Challenges and Opportunities in Designing Energy-Efficient High-Performance Computing Platforms


Page 1: Challenges and Opportunities in Designing Energy-Efficient High-Performance Computing Platforms

Chita R. Das
High Performance Computing Laboratory
Department of Computer Science & Engineering
The Pennsylvania State University
EEHiPC, December 19, 2010

Page 2: Talk Outline

• Technology Scaling Challenges
• State-of-the-Art Design Challenges
• Opportunity: Heterogeneous Architectures
  – Technology – 3D, TFET, optics, STT-RAM
  – Processor – new devices, core heterogeneity
  – Memory – STT-RAM, PCM, etc.
  – Interconnect – network heterogeneity
• Conclusions

Page 3: Computing Walls

[Chart: total number of cores per chip (normalized to 2008), projected 2008–2022, following Moore's Law. Data from ITRS 2008.]

Page 4: Computing Walls – Utilization and Power Wall

[Chart: total number of cores per chip (normalized to 2008) vs. the number that can be powered on for 100 watts, 2008–2022; the gap grows to roughly 3x vs. 1x. Data from ITRS 2008.]

P ≈ CV²f: lowering V reduces P, but speed also decreases with V (a small worked example follows this slide).

[Chart: projected supply voltage (V), 2009–2022.]

• High-performance MOS started out with a 12 V supply.
• Current high-performance μPs have a 1 V supply ⇒ (12/1)² = 144x over 28 years.
• Only (1/0.6)² ≈ 2.8x is left in the next 12 years!
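
To make the voltage-scaling arithmetic concrete, here is a minimal Python sketch evaluating P ≈ CV²f at the supply voltages quoted above; the capacitance and frequency values are placeholders, since only the voltage ratio matters for this comparison.

    # Dynamic power P ≈ C * V^2 * f. Only the voltage ratio matters here,
    # so C and f are arbitrary placeholders.
    def dynamic_power(c_farads, v_volts, f_hz):
        return c_farads * v_volts ** 2 * f_hz

    p_12v = dynamic_power(1.0, 12.0, 1.0)   # early high-performance MOS, 12 V
    p_1v  = dynamic_power(1.0, 1.0, 1.0)    # today's ~1 V supply
    p_0v6 = dynamic_power(1.0, 0.6, 1.0)    # projected ~0.6 V floor

    print(p_12v / p_1v)   # 144.0 -> the 144x gained over 28 years
    print(p_1v / p_0v6)   # ~2.78 -> only ~2.8x left in the next 12 years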

Page 5: Computing Walls – Memory Bandwidth

[Chart: total number of package pins, 2008–2022. Data from ITRS 2009.]

Memory bandwidth: pin count increases only about 4x, compared to a 25x increase in cores.

Page 6: Computing Walls – Reliability Wall

[Chart: projected required decrease in failure rate per transistor (normalized to 2007), 2007–2022; the reduction in failure rate per transistor required for a reliable IC. Data from ITRS 2007.]

The failure rate per transistor must decrease exponentially as we go deeper into the nanometer regime.

Page 7: Computing Walls – Wire Delay

[Chart: global and local wire RC delay (picoseconds), 2008–2022.]

Global wires no longer scale.

Page 8: State-of-the-Art in Architecture Design

[Diagram: a notional 16 nm multi-core processor with 25K 64b FPUs, 37.5 TFLOPS, 150 W (compute only), on a 20 mm die:
  • 64b FPU: 0.015 mm², 4 pJ/op, 3 GHz
  • 64b 1 mm on-chip channel: 2 pJ/word
  • 10 mm on-chip wire: 20 pJ, 4 cycles
  • 64b off-chip channel: 64 pJ/word]

• The energy required to move a 64b word across the die is equivalent to the energy for about 10 FLOPs (a back-of-the-envelope check follows this slide).
• Traditional designs have approximately 75% of their energy consumed by overhead.

Performance = Parallelism. Efficiency = Locality.

Source: Bill Harrod, DARPA IPTO, 2009.
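
A back-of-the-envelope check, in Python, of the "one word moved across the die ≈ 10 FLOPs" claim, using only the per-operation energies listed on this slide.

    # Energy figures taken from the slide: 10 mm of on-chip wire moves a
    # 64b word for 20 pJ; a 64b FPU op costs 4 pJ; the die is 20 mm wide.
    FLOP_PJ = 4.0
    WIRE_PJ_PER_MM = 20.0 / 10.0
    DIE_MM = 20.0

    move_across_die_pj = WIRE_PJ_PER_MM * DIE_MM   # 40 pJ
    print(move_across_die_pj / FLOP_PJ)            # 10.0 -> ~10 FLOPs of energy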

Page 9: Energy cost for different operations

  Operation                      Energy (pJ)   DP FLOPs   Insts*
  I$ fetch                            33          0.67      2.0
  Register access                     10.5        0.2       0.6
  Access 3 operands from D$          100          2         6
  Access 3 operands from L2 D$       460          9        27
  Access 3 operands off-chip         762         15        45
  Access 3 operands from DRAM       6000        120       360

* The Insts column gives the average number of instructions that can be performed for this energy.

Energy is dominated by data and instruction movement.
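
The "DP FLOPs" column of the table can be reproduced with a short Python sketch. The per-FLOP energy (~49 pJ) is not stated on the slide; it is inferred from the first row (33 pJ ≈ 0.67 DP FLOPs), so treat it as an assumption.

    # Convert each operation's energy into double-precision-FLOP equivalents.
    # DP_FLOP_PJ is inferred from the table's first row, not given explicitly.
    DP_FLOP_PJ = 33 / 0.67          # ~49 pJ per DP FLOP (assumption)
    ENERGY_PJ = {
        "I$ fetch": 33,
        "register access": 10.5,
        "3-operand D$ access": 100,
        "3-operand L2 D$ access": 460,
        "3-operand off-chip access": 762,
        "3-operand DRAM access": 6000,
    }
    for op, pj in ENERGY_PJ.items():
        print(f"{op}: {pj} pJ ≈ {pj / DP_FLOP_PJ:.1f} DP FLOPs")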

Page 10: Conventional Architecture (90 nm) – Energy is dominated by overhead

[Chart: energy breakdown of a conventional 90 nm architecture across FPU, local, global, off-chip, overhead, and DRAM components; per-operation energies range from roughly 1.0E-10 J to 1.4E-8 J. Source: Dally.]

Page 11: Where is this overhead coming from?

• Complex microarchitecture: OOO execution, register renaming, branch prediction, …
• Complex memory hierarchy
• High/unnecessary data movement
• Orthogonal design style
• Limited understanding of application requirements

Page 12: Both Put Together…

• Power (joules/operation) becomes the deciding factor in designing HPC systems.
• Hardware acquisition cost no longer dominates the total cost of ownership (TCO).

Page 13: IT Infrastructure Optimization: The New Math

• Until now: minimize equipment, software/license, and service/management costs.
• Going forward: power and physical infrastructure costs to house the IT become equally important.
• Become “greener” in the process.

[Chart: worldwide server installed base (M units) and spending (US$B), 1996–2010, split into new server spending, server management cost, and power & cooling costs; power & cooling and management costs are becoming comparable to new server spending. Source: IDC.]

Cumulative consumption: 1 watt consumed at the server cascades to approximately 2.84 watts of total consumption (Source: Emerson):

  Server component                               1.00 W
  + DC-DC conversion (0.18 W)                 →  1.18 W
  + AC-DC conversion (0.31 W)                 →  1.49 W
  + Power distribution (0.04 W)               →  1.53 W
  + UPS (0.14 W)                              →  1.67 W
  + Cooling (1.07 W)                          →  2.74 W
  + Building switchgear/transformer (0.10 W)  →  2.84 W
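
The cascade is just cumulative addition; this small Python sketch, with the stage values copied from the slide, reproduces the 2.84 W total and the resulting burden factor per server watt.

    # Each watt drawn by the server drags along overhead in every stage of the
    # power and cooling chain (values from the Emerson figure on this slide).
    STAGE_OVERHEAD_W = [
        ("DC-DC conversion", 0.18),
        ("AC-DC conversion", 0.31),
        ("Power distribution", 0.04),
        ("UPS", 0.14),
        ("Cooling", 1.07),
        ("Building switchgear/transformer", 0.10),
    ]

    total_w = 1.00                      # 1 W consumed by the server component
    for stage, overhead in STAGE_OVERHEAD_W:
        total_w += overhead
        print(f"after {stage:35s}: {total_w:.2f} W")
    print(f"burden factor: {total_w:.2f}x per server watt")   # ~2.84x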

Page 14: A Holistic Green Strategy

[Diagram: layered view spanning "Greening of Technology" and "Technology for Greening" (Source: A. Sivasubramaniam):
  • Core: servers, storage, networking
  • Support: UPS, power distribution, chillers, lighting, real estate
  • Facilities: operations, office spaces, factories, travel and transportation, energy sourcing, …]

Page 15: Processor Power Efficiency Based on the ExtremeScale Study

• Bill Dally's strawman processor architecture.
• A possible processor design methodology for achieving 28 pJ/FLOP.
• Requires optimization of the communication, computation, and memory components.

[Chart: path from a conventional design (2.5 nJ/FLOP; 631 pJ/FLOP) down to 28 pJ/FLOP by minimizing overhead and minimizing DRAM energy.]

Source: Bill Harrod, DARPA IPTO, 2009.

Page 16: Opportunity – Heterogeneous Architectures

• Multicore era
• Heterogeneous multicore architectures provide the most compelling architectural trajectory for mitigating these problems:
  – Hybrid cores: big, small, accelerators, GPUs
  – Hybrid memory subsystem: SRAM, TFET, STT-RAM
  – Heterogeneous interconnect

Page 17: A Holistic Design Paradigm

• Heterogeneity in devices/circuits
• Heterogeneity in micro-architecture
• Heterogeneity in memory design
• Heterogeneity in interconnect

Page 18: Technology Heterogeneity

• Heterogeneity in technology:
  – CMOS-based scaling is expected to continue until 2022.
  – Exploiting emerging technologies to design different cores/components is promising because it enables cores with power/performance trade-offs that were not possible before.

[Chart: V/F scaling of CMOS and TFET devices. TFETs provide higher performance than CMOS-based designs at lower voltages.]

Page 19: Processor Cores – Heterogeneous Compute Nodes

[Diagram: heterogeneous compute nodes pairing core types with workload classes:
  • Big cores – latency-critical
  • Small cores – throughput-critical
  • GPGPUs – bandwidth-critical
  • Accelerators/ASICs – latency-/time-critical]

Page 20: Memory Architecture

• Role of novel technologies in memory systems

[Table: comparison of memory technologies.]

Page 21: Heterogeneous Interconnect

[Charts: buffer utilization and link utilization across the network.]

Non-uniformity comes from the non-edge-symmetric network and X-Y routing. So:
• Why clock all routers at the same frequency? Variable-frequency routers for designing NoCs.
• Why allocate all routers similar area/buffer/link resources? Heterogeneous routers/NoCs.

Page 22: Software Support

• Compiler support
  – Thread remapping to minimize power: migrate threads to TFET cores to reduce power.
  – Dynamic instruction morphing: the runtime system morphs a thread's instructions to match the heterogeneous hardware the thread is mapped to.
• OS support
  – Heterogeneity-aware scheduling support (a toy sketch follows this slide).
  – Run-time thread migration support.
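
As a rough illustration of heterogeneity-aware scheduling (in Python; not the specific policy from this work), the sketch below routes latency-critical, compute-bound threads to big CMOS cores and everything else to low-voltage TFET cores. The Thread attributes, thresholds, and core names are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Thread:
        name: str
        latency_critical: bool   # e.g. interactive or on the critical path
        memory_bound: bool       # stalls on memory, gains little from a fast core

    def assign_core(t: Thread) -> str:
        """Toy heterogeneity-aware policy: keep latency-critical, compute-bound
        threads on big CMOS cores; push the rest to low-voltage TFET cores."""
        if t.latency_critical and not t.memory_bound:
            return "big CMOS core"
        return "TFET core"

    for t in [Thread("query", True, False), Thread("scan", False, True)]:
        print(t.name, "->", assign_core(t))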

Page 23: Current Research in HPCL – Problems with Current NoCs

• NoC power consumption is a concern today.

[Chart: Intel 80-core tile power profile¹ – dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%.]

• With technology scaling, NoC power can be as high as 40-60 W for 128 nodes.²

1. "A 5-GHz Mesh Interconnect for a Teraflops Processor," Y. Hoskote, S. Vangal, A. Singh, N. Borkar, S. Borkar, IEEE MICRO, 2007.
2. "Networks for Multi-core Chips: A Contrarian View," S. Borkar, Special Session at ISLPED 2007.

Page 24: Network Performance/Power

[Chart: normalized power and normalized latency vs. injection ratio (flits/node/cycle), roughly 0.01 to 0.40.]

Observation:
• At low load: low power consumption.
• At high load: high power consumption and congestion.

The proposed approach¹:
• At low load: optimize for performance (reduce zero-load latency and accelerate flits).
• At high load: manage congestion and power.

1. "A Case for Dynamic Frequency Tuning in On-Chip Networks," MICRO 2009.

Page 25: Frequency Tuning Rationale

[Diagram: frequency tuning along a path of routers. When a router becomes congested, the upstream router throttles, i.e. its frequency is lowered, depending upon its total buffer utilization; uncongested routers see no change or have their frequency boosted.]
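
A minimal Python sketch of a RAFT-style frequency tuning rule driven by buffer utilization; the thresholds, step size, and frequency bounds are illustrative assumptions, not the values from the MICRO 2009 paper.

    def tune_frequency(freq_ghz: float, buffer_utilization: float,
                       f_min: float = 1.0, f_max: float = 4.0) -> float:
        """Toy RAFT-style rule: throttle a router feeding a congested neighbor,
        boost an underutilized one. Thresholds and step size are illustrative."""
        HIGH, LOW, STEP = 0.75, 0.25, 0.5
        if buffer_utilization > HIGH:        # congestion building up downstream
            return max(f_min, freq_ghz - STEP)
        if buffer_utilization < LOW:         # light load: accelerate flits
            return min(f_max, freq_ghz + STEP)
        return freq_ghz                      # moderate load: no change

    print(tune_frequency(3.0, 0.9))  # -> 2.5 (throttle)
    print(tune_frequency(3.0, 0.1))  # -> 3.5 (boost)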

Page 26: Performance/Power Improvement with RAFT

[Chart: latency (ns) vs. injection ratio (flits/node/cycle) under uniform random (UR) traffic for BaseCase, FreqThrtl, FreqBoost, and FreqTune.]

• FreqBoost at low load (optimize performance); FreqThrtl at high load (optimize performance and power).
• FreqTune gives both power reduction and throughput improvement: 36% reduction in latency, 31% increase in throughput, and 14% power reduction across all traffic patterns.

Page 27: A Case for Heterogeneous NoCs

• Using the same amount of link resources and fewer buffer resources than a homogeneous network, this proposal demonstrates that a carefully designed heterogeneous network can reduce average latency, improve network throughput, and reduce power.
• Explore the types, number, and placement of heterogeneous routers in the network.

[Diagram: mesh with small and big routers connected by narrow and wide links.]

Page 28: HeteroNoC Performance-Power Envelope

[Chart: normalized EDP, latency, and power ratios for configurations labeled 192, 128, 256, and HeteroNoCs.]

• 22% throughput improvement
• 25% latency reduction
• 28% power reduction

Page 29: 3D Stacking = Increased Locality!

Many more neighbors within a few minutes' reach!

Page 30: Reduced Global Interconnect Length

[Diagram: reduced global interconnect length with 3D stacking.]

• Delay/power reduction
• Bandwidth increase
• Smaller footprint
• Mixed-technology integration

Page 31: 3D Routers for 3D Networks

• One router in each grid tile (total area = 4L²)
• Stack layers in 3D (total area = L²)
• Stack router components in 3D (total area = L²)

Results from "MIRA: A Multi-layered On-Chip Interconnect Router Architecture," ISCA 2008.

Page 32: Conclusions

• We need a coherent approach to address the sub-micron technology problems in designing energy-efficient HPC systems.
• Heterogeneous multicores can address these problems and are likely to be the future architecture trajectory.
• But the design of such systems is extremely complex.
• It needs an integrated technology-hardware-software-application approach.

Page 33: HPCL Collaborators

Faculty: Vijaykrishnan Narayanan, Yuan Xie, Anand Sivasubramaniam, Mahmut Kandemir

Students: Sueng-Hwan Lim, Bikash Sharma, Adwait Jog, Asit Mishra, Reetuparna Das, Dongkook Park, Jongman Kim

Partially supported by: NSF, DARPA, DOE, Intel, IBM, HP, Google, Samsung

Page 34:

THANK YOU!

Questions?