© 2005, Kevin Skadron Designing Cool Chips in an Era of Gigascale Integration: History, Challenges, and Opportunities Kevin Skadron LAVA/HotSpot Lab Dept


Designing Cool Chips in an Era of Gigascale Integration:

History, Challenges, and Opportunities

Kevin Skadron

LAVA/HotSpot Lab
Dept. of Computer Science
University of Virginia
Charlottesville, VA


“Cooking-Aware” Computing?


ITRS Projections

• These are targets, doubtful that they are feasible

• Growth in power density means cooling costs continue to grow

• High-performance designs seem to be shifting away from clock frequency toward # cores

ITRS 2004

Year                          2003   2006   2010   2013   2016
Tech node (nm)                 100     70     45     32     22
Vdd (high perf) (V)            1.2    1.1    1.0    0.9    0.8
Vdd (low power) (V)            1.0    0.9    0.7    0.6    0.5
Frequency (high perf) (GHz)    3.0    6.8   15.1   23.0   39.7
Max power (W)
  High-perf w/ heatsink        149    180    198    198    198
  Cost-performance              80     98    120    138    158
  Hand-held                    2.1    2.4    2.8    3.0    3.0

Slide annotations: "2001 – was 0.4"; "2001 – was 288"


Power Evolution

[Figure: max power (Watts), log scale from 1 to 100, for i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with "??" marking what comes next.]

Source: Intel

Zero-Sum Architecture!


Leakage – A Growing Problem
• The fraction of leakage power is increasing exponentially with each generation
• Also exponentially dependent on temperature
• Curiously, ITRS 2004 projections are lower than what industry is currently reporting
• Changes tradeoffs! Idle logic hurts, e.g. CMPs

[Figure: static power / dynamic power (percentage, 0 to 70) vs. temperature (298 K to 373 K) for the 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm nodes. The ratio increases across generations. Data derived from ITRS 2001.]


Thermal Packaging is Expensive
• Nvidia GeForce 5900 card – “dustbuster”

Source: Tech-Report.com


Architecture Trends
• High-performance market
  • “Fat” (wide, superscalar) CPUs and high frequencies giving way to multiple cores, plateau in frequencies
    – Huge number of multi-core product announcements
    – # cores might be the next marketing buzz
  • Multiple threads per core
    – This probably won’t scale – limit of 2-4 thread contexts
  • Interesting example: Sun Niagara
    – 8 4-threaded cores
• Across all market segments
  • Growing integration (SoC)
  • Specialized co-processors and offload engines
  • Growing heterogeneity
    – Part of the programming model in SoCs
    – Not part of the programming model in CMPs!


Basketball Analogy
• Recent trends in high-performance processors are like building a team around Shaq when you have a limited budget
  • Huge salary (power) to one player
  • Huge ego, team friction (heat)
  • Shaq can’t get much better (except possibly his free throws) (diminishing returns)
• New trend: multiple CPUs on a chip (CMP/SoC)
  • Don’t need superstars (less power per core, better energy efficiency)
  • Choose team players (better heat distribution)
  • Performance scales linearly with cores
  • Heterogeneous cores possible (SoCs)
  • Detroit Pistons


Talk Outline
• Different philosophies of Power-Aware design
  • Energy efficient vs. low power vs. temperature-aware
• Power Management Techniques
  • Dynamic
  • Static
  • Temperature

• Summary of Important Challenges

• My perspective tends to be architecture-centric, and slanted toward high-performance desktop/server/etc. CPUs


Metrics

• Power
  • Average power, instantaneous power, peak power
• Energy
  • Energy (MIPS/W)
  • Energy-Delay product (MIPS^2/W)
  • Energy-Delay^2 product (MIPS^3/W) – voltage independent!
• Temperature
  • Correlated with power density over sufficiently large time periods
• No good figures of merit for trading off thermal efficiency against performance, area, or energy efficiency

(Zyuban, GVLSI’02)

[Figure labels: Low-Power Design; Power-Aware/Energy-Efficient Design; Temperature-Aware Design; Design for power delivery]
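
A minimal sketch of why the energy-delay^2 metric is called voltage independent. MIPS/W is the inverse of energy per instruction, MIPS^2/W the inverse of energy × delay, and MIPS^3/W the inverse of energy × delay^2. The scaling model below is an assumption for illustration, not taken from the slides: energy per instruction scales as V^2 and delay as 1/V (threshold voltage ignored).

```python
def metrics(v, e0=1.0, d0=1.0):
    """Energy, delay, ED and ED^2 per instruction at relative supply voltage v.

    Assumed first-order model (not from the slides): energy scales as v**2
    and delay scales as 1/v. e0 and d0 are the values at v = 1.
    """
    energy = e0 * v ** 2
    delay = d0 / v
    return {"E": energy, "D": delay,
            "ED": energy * delay, "ED2": energy * delay ** 2}

for v in (1.0, 0.8, 0.6):
    m = metrics(v)
    print(f"V={v:.1f}  E={m['E']:.3f}  ED={m['ED']:.3f}  ED2={m['ED2']:.3f}")
```

Under these assumptions E and ED both change with the supply voltage while ED^2 stays constant, so ED^2 compares designs without rewarding a simple voltage change.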


Circuit Techniques

• Transistor sizing

• Signal and clock gating

• Dynamic vs. static logic

• Circuit restructuring

• Low power caches, register files, queues

• These typically reduce the capacitance being switched


Clock Gating, Signal Gating

“Disabling a functional block when it is not required for an extended period”

• Implementation
  • Simple gate that replaces one buffer in the clock tree
  • Signal gating is similar, helps avoid glitches
  • Delay is generally not a concern except at fine granularities
• Choice of circuit design and clock gating style can have a dramatic effect on temperature distribution

[Figure: a ctrl input gates the signal feeding a functional unit]


Circuit Restructuring
• Pipeline (tolerate smaller, longer-latency circuitry)
• Parallelize (can reduce frequency)
• Reorder inputs so that most active input is closest to output (reduces switched capacitance)
• Restructure gates (equivalent functions are not equivalent in switched capacitance)

Example: Parallelizing (maintain throughput)

              One logic block    Two logic blocks
Supply        Vdd                Vdd/2
Freq          1                  0.5
Vdd           1                  0.5
Throughput    1                  1
Power         1                  0.25
Area          1                  2
Pwr Den       1                  0.125

Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
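
The numbers in the parallelizing example follow from the dynamic power relation P = C V^2 f applied per block. A small sketch, with the function name and the unit capacitance and area chosen here purely for illustration:

```python
def parallelize(n_blocks, freq, vdd, cap=1.0, area=1.0):
    """Relative throughput, power, area and power density for n identical blocks.

    Uses dynamic power P = C * V^2 * f per block. The baseline is one block
    at freq = vdd = 1; cap and area per block are normalized to 1.
    """
    power = n_blocks * cap * vdd ** 2 * freq
    total_area = n_blocks * area
    return {
        "throughput": n_blocks * freq,
        "power": power,
        "area": total_area,
        "power_density": power / total_area,
    }

print(parallelize(1, 1.0, 1.0))   # baseline block
print(parallelize(2, 0.5, 0.5))   # two blocks at half frequency and half Vdd
```

The example idealizes voltage as scaling linearly with frequency; with a real threshold voltage the supply cannot drop all the way to Vdd/2 at half frequency, so the savings would be smaller.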


Architectural-Level Techniques
• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques

[Slide annotations: “Prevalent”; “Growing or Imminent”]


Optimal Pipeline Depth
• Increased power and diminishing returns vs. increased throughput
• 5-10 stages, 15-30 FO4
  • Srinivasan et al, MICRO-35; Hartstein and Puzak, ACM TACO, Dec. 2004

[Figure: single issue vs. 4-wide issue, plotted against pipeline stages. Source: Hartstein and Puzak, ACM TACO, Dec. 2004]


Multi-threading
• Do more useful work per unit time
  • Amortize overhead and leakage
• Switch-on-event MT
  • Switch on cache misses, etc. (Ex: Sun Niagara “throughput computing”)
  • Can even rotate among threads every instruction (Tera/Cray)
• Simultaneous Multithreading/HyperThreading
  • For superscalar – eliminate waste
  • Intel Pentium 4, IBM POWER5, Alpha 21464




• Integration (e.g. multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
  • Limits
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression

[Slide annotations: “Prevalent”; “Growing or Imminent”]


Compiler Techniques for Low Power
• Basic idea is for the compiler to identify opportunities for using low-power modes
• Compiler-guided DVS
  • Reduce voltage in memory-bound program regions
    – Hsu and Kremer, ISLPED’01, PLDI’03; Xie et al, PLDI’03
• Dynamic resource configuration/hibernation
  • Deactivate modules when they won’t be used for a long time (>> sleep/wakeup time)…avoids waiting for timeout
    – Heath et al, PACT’02
• Profile/compiler-guided adaptation
  • Subroutine-guided (“positional”) adaptation (Huang et al, ISCA’03)
    – Uses profiling and a hierarchy of low-power modes
• Much work in this area – this only touches the surface


Static Power Dissipation
• Static power: dissipation due to leakage current
  • Exponentially dependent on T, Vdd, Vth
• Most important sources of static power: subthreshold leakage and gate leakage
  • We will focus on subthreshold
  • Gate leakage has essentially been ignored
    – New gate insulation materials may solve problem


Thermal Runaway
• The leakage-temperature feedback can lead to a positive feedback loop
  • Temperature increases -> leakage increases -> temperature increases -> leakage increases -> …

Source: www.usswisconsin.org
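
The feedback loop can be sketched as a fixed-point iteration. The model is an assumption made for illustration: a single thermal resistance from die to ambient, and leakage power that grows exponentially with temperature. All parameter values are hypothetical.

```python
import math

def settle_temperature(p_dyn, r_th, t_amb=318.0, p_leak0=10.0, beta=0.02,
                       t_ref=318.0, t_limit=500.0, max_iter=1000):
    """Iterate T = t_amb + r_th * (p_dyn + leakage(T)) until it converges.

    Returns the steady-state temperature in kelvin, or None if the loop
    exceeds t_limit (thermal runaway) or fails to converge.
    All parameter values are illustrative, not measured data.
    """
    t = t_amb
    for _ in range(max_iter):
        # Leakage grows exponentially with temperature (assumed form).
        p_leak = p_leak0 * math.exp(beta * (t - t_ref))
        t_next = t_amb + r_th * (p_dyn + p_leak)
        if t_next > t_limit:
            return None
        if abs(t_next - t) < 1e-6:
            return t_next
        t = t_next
    return None

print(settle_temperature(p_dyn=60.0, r_th=0.3))   # good cooling: settles
print(settle_temperature(p_dyn=60.0, r_th=1.5))   # poor cooling: runs away
```

With the lower thermal resistance the loop gain is below one and the temperature settles; with the higher one no stable operating point exists in this model.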


A Smorgasbord
• Transistor sizing
• Multi Vth
  • Dynamic threshold voltage – reverse body bias – Transmeta Efficeon
  • Transmeta uses runtime compilation and load monitoring to select thresholds
• Stack effect
• Sleep transistors
• DVS
  • Coarse or fine grained
• Low leakage caches, register files, queues
• Techniques for reducing gate leakage
• Hurry up and wait
  • Low leakage: maintain min possible V, f
  • High leakage: use high V/f to finish work quickly, then go to sleep


Sleep Transistors

• Recent work suggests that a properly sized, low-Vth footer transistor can preserve enough leakage to keep the cell active (Li et al, PACT’02; Agarwal et al, DAC’02)
  • Great care must be taken when switching back to full voltage: noise can flip bits
  • Extra latency may be necessary when re-activating
• Similar to principles in sub-threshold computing
  • Ex – sensor motes for wireless sensor networks
  • Concerns about susceptibility to SEU



Worst-Case leads to Over-design
• Average case temperature lower than worst-case
  • Aggressive clock gating
  • Application variations
  • Underutilized resources, e.g. FP units during integer code
  • Currently 20-40% difference

[Figure labels: TDP; reduced target power density; reduced cooling cost. Source: Gunther et al, ITJ 2001]


Temporal, Spatial Variations

[Figure: temperature variation of SPEC applu over time]

Localized hot spots dictate cooling solution


Temperature-Aware Design

• Worst-case design is wasteful

• Power management is not sufficient for chip-level thermal management

• Must target blocks with high power density

• When they are hot

• Spreading heat helps

– Even if energy not affected

– Even if average temperature goes up

• This also helps reduce leakage


Role of Architecture?
Dynamic thermal management (DTM)
• Automatic hardware response when temp. exceeds cooling
• Cut power density at runtime, on demand
• Trade reduced costs for occasional performance loss
• Architecture natural granularity for thermal management
  • Activity, temperature correlated within arch. units
  • DTM response can target hottest unit: permits fine-tuned response compared to OS or package
  • Modern architectures offer rich opportunities for remapping computation
    – e.g., CMPs/SoCs, graphics processors, tiled architectures
    – e.g., register file

• Thermal engineering must consider role of architecture

• Thermal engineers and architects need to collaborate


Existing DTM Implementations

• Intel Pentium 4: Global clock gating with shut-down fail-safe

• Intel Pentium M: Dynamic voltage scaling

• Transmeta Crusoe: Dynamic voltage scaling

• IBM Power 5: Probably fetch gating

• ACPI: OS configurable combination of passive & active cooling

• These solutions sacrifice time (slower or stalled execution) to reduce power density

• Better: a solution in “space”
  • Tradeoff between exacerbating leakage (more idle logic) or reducing leakage (lower temperatures)


Alternative: Migrating Computation

This is only a simplistic illustrative example


Space vs. Time
• Moving the hotspot, rather than throttling it, reduces performance overhead by almost 60%

[Figure: slowdown factor (axis 1.00 to 1.40) for DVS, FG, Hyb, and MC, grouped under “Time” and “Space”. Bar values as extracted: 1.270, 1.359, 1.231, 1.112.]

The greater the replication and spread, the greater the opportunities
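
A short sketch of how an overhead reduction can be computed from slowdown factors. The pairing of the four bar values with the four labels is an assumption based on their order in the extracted chart; overhead is taken to be the slowdown factor minus one.

```python
# Assumed pairing of chart values with labels (order of extraction).
slowdown = {"DVS": 1.270, "FG": 1.359, "Hyb": 1.231, "MC": 1.112}

def overhead_reduction(baseline, candidate):
    """Fractional reduction in performance overhead (slowdown - 1)."""
    base = slowdown[baseline] - 1.0
    cand = slowdown[candidate] - 1.0
    return 1.0 - cand / base

for name in ("DVS", "FG", "Hyb"):
    print(f"MC vs {name}: {overhead_reduction(name, 'MC'):.1%} less overhead")
```

Under that pairing, the comparison of migrating computation (MC) against DVS comes out a little under 60%.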


Sources of Variations

[Figure: heat flux (W/cm2, 0 to 250) across the die; results in Vcc variation]

[Figure: temperature variation (°C, 40 to 110) across the die; hot spots]

[Figure: random dopant fluctuations; mean number of dopant atoms (log scale, 10 to 10000) vs. technology node (1000, 500, 250, 130, 65, 32 nm)]

[Figure: sub-wavelength lithography, 1980 to 2020; lithography wavelength (365nm, 248nm, 193nm, 13nm EUV) vs. technology generation (180nm, 130nm, 90nm, 65nm, 45nm, 32nm), with the gap between them marked. Source: Mark Bohr, Intel]

Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004

Impact of Static Variations

[Figure: normalized frequency (0.9 to 1.4) vs. normalized leakage (Isb, 1 to 5) at 130nm. Frequency spread ~30%; leakage power spread ~5-10X.]



Parameter Variations
• Parameter variations mess everything up!
  • T -> variation in Vcc, leakage -> T
  • Vcc -> speed variation, leakage -> T
  • Manufacturing (L, W, Vth, etc) -> speed, Vcc, T
  • Packaging variations (TIM, roughness) -> T
• Some transistors/functional units won’t work, some will be lousy, some will fail over time, and some will be intermittent
• Guard banding won’t work
  • Design devolves to worst component, can’t easily bound intermittent behavior
  • T/P problems may no longer be limited to specific units
  • Makes dynamic logic even more difficult


Future Architectures

• Asymmetry unavoidable

• Specialized units (part of programming model)

• Power management (can try to hide this)

• Thermal throttling (hard to hide this)

• Parameter variations (hard to hide this without extreme performance loss)


Raw Architecture (MIT)

[Figure labels: Compute Processor; Routers; On-chip networks]

Source: MIT RAW project

Only one of many examples of tiled architectures


Future Architectures
• Increasing integration, e.g. increasing # cores, e.g. Niagara
  • Clustered architectures
  • Tiled architectures
  • Multiple voltage islands
• Asymmetry unavoidable
  • Specialized units (part of programming model)
  • Power management (can try to hide this)
  • Thermal throttling (hard to hide this)
  • Parameter variations (hard to hide this without extreme performance loss)
• Increasing problems with yield, failures in time (Redundancy: costly; graceful degradation: introduces asymmetry)


Power and Thermal Security
• A consequence of designing for expected rather than worst-case conditions
• Energy-drain attacks
• Voltage stability attacks (dI/dt)
• Thermal attacks
  • Thermal throttling
  • Denial of service
  • Direct physical damage


Summary
• Reviewed current techniques for managing dynamic power, leakage power, temperature
  • A major obstacle with architectural techniques is the difficulty of predicting performance impact

• Continuing integration makes power an ever-present concern

• Thermal limits and parameter variations are becoming serious obstacles

• Spread heat in space, not time

• Security challenges


Soap-Box
• Architecture solutions are essential
• Thermal engineers, circuit designers, CAD designers, and architects all need to work together
• Joint infrastructure
  • Simulators – esp. pre-RTL tools
  • Test chips
    – Ex: Combine architecture and circuit research on a single test chip


More Info

http://www.cs.virginia.edu/~skadron

LAVA Lab


Backup Slides


Hot Chips are No Longer Cool!

[Figure: power density (Watts/cm2), log scale from 1 to 1000, for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, compared against a hot plate, a nuclear reactor, and a rocket nozzle. Further labels: “Today’s laptops:” and “SIA”.]

* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference key note - 1999.


ITRS quotes – thermal challenges

• For small dies with high pad count, high power density, or high frequency, “operating temperature, etc for these devices exceed the capabilities of current assembly and packaging technology.”

• “Thermal envelopes imposed by affordable packaging discourage very deep pipelining.”

• Intel recently canceled its NetBurst microarchitecture

– Press reports suggest thermal envelopes were a factor


Thermal Packaging is Expensive
• P4 packaging

Source: Intel web site


Thermal Packaging is Expensive
• Laptops and other constrained form factors


Trends in Battery Technology

• Battery lifetime is increasing perhaps 8-10%/yr. (Powers, Proc. of IEEE 1995)

• Not keeping up with rate of growth in energy consumption

Source: Rabaey 1995, cited in Irwin et al, “Low Power Design Methodologies, Hardware and Software Issues”, tutorial at PACT 2000


Dynamic Power Consumption
• Power dissipated due to switching activity
• A capacitance is charged and discharged

[Figure: load capacitance $C_L$ charged from Vdd and discharged, at the frequency $f$]

$P = a\,C_L V^2 f$ (charge/discharge at the frequency $f$)
$E_c = \tfrac{1}{2} C_L V^2$
$E_d = \tfrac{1}{2} C_L V^2$


Transistor Sizing
• Transistor sizing plays an important role in reducing power
  • Delay ~ K / ln K
  • Power ~ K / (K-1)
• Optimum K for both power and delay must be pursued

[Figure: chain of stages with capacitances $C_0, C_1, \ldots, C_{N-1}, C_N$; $K = C_i / C_{i-1}$]
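
A sketch of the delay/power tradeoff in the taper factor K for a buffer chain. The load ratio C_N / C_0 and the unit-delay model (each stage's delay proportional to K) are assumptions chosen for illustration.

```python
import math

def chain_cost(k, load_ratio=1000.0):
    """Relative delay and switched capacitance of a tapered buffer chain.

    k is the taper factor C_i / C_(i-1); load_ratio is C_N / C_0 (assumed).
    Delay follows k / ln(k); switched capacitance follows k / (k - 1).
    """
    stages = math.log(load_ratio) / math.log(k)   # number of stages needed
    delay = stages * k            # each stage drives k times its input cap
    power = k / (k - 1.0)         # geometric-series sum, relative to C_N
    return delay, power

for k in (2.0, math.e, 4.0, 8.0):
    delay, power = chain_cost(k)
    print(f"K={k:.2f}  delay={delay:.1f}  power={power:.2f}")
```

In this model delay is minimized at K = e while switched capacitance keeps falling as K grows, which is why a compromise value of K has to be chosen.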


Signal Gating

“techniques to mask unwanted switching activities from propagating forward, causing unnecessary power dissipation”

• Implementation
  • Simple gate
  • Tristate buffer
  • ...
• Control signal needed
  • Generation requires additional logic
• Especially helps to prevent power dissipation due to glitches

[Figure: a ctrl input gates the signal to produce the output]

Cache Design
• Banked organization
  • Targets switched capacitance
  • $C_{access} = R \cdot C \cdot C_{cell} / B$
• Dividing word line
  • Same effect for wordlines
• Reducing voltage swings
  • Sense amplifiers used to detect $V_{diff}$ across bitlines
  • Read operation can complete as soon as $V_{diff}$ is detected
  • Limiting voltage swing saves a fraction of power
• Pulse word lines
  • Enabling the word line for the time needed to discharge bitcell voltage
  • Designer needs to estimate access time and implement a pulse generator

Low Power Register File Design
• RF’s usually single-ended bitlines
• Modified storage cell
  • Lot of zeros fetched from the RF
  • Bitline connections are modified to eliminate bitline discharge when reading a zero

Tseng and Asanovic, ICSD, 2000
Zyuban and Kogge, ISLPED 1998

Efficient Issue Queue
• Useful comparison
  • Empty entries and ready entries consume energy
    – Wakeup of empty entries can be disabled: gating off precharge logic using valid bit
    – Wakeup of ready sources can be disabled: gating off precharge logic using ready bit
  • Folegnani and Gonzalez, ISCA 2001
• Energy-efficient Comparators
  • Traditional comparators dissipate energy on a mismatch in any bit position
  • 10%-20% of source operands match each cycle
  • Solution: comparators that dissipate energy on a match
  • Kucuk et al, ISLPED 2001

Multi Clock Domain Architecture
• Domains must be carefully chosen
  • Small cost on communications
  • Re-using existing structures for cross-domain synchronization
• Example
  • 5 domains
    – Front-end
    – Integer unit
    – FP unit
    – On-chip cache unit
    – Main memory

Multi Clock Domain Architecture

[Figure: CPU partitioned into clock domains. Front-end: L1 i-cache, branch predict, fetch, IFQ, rename, dispatch. Integer: IIQ, int. register file, int. FUs. Floating Point: FIQ, fp. register file, fp. FUs. Memory: LSQ, L1 d-cache, L2 unified cache. Main Memory is a separate domain.]

Magklis et al, ISCA 2003

Multi Clock Domain Architecture
• Advantages
  • Local clock design is not aware of global skew
  • Each domain limited by its local critical path, allowing higher frequencies
  • Different voltage regulators allow for a finer-grain energy control
  • Frequency/voltage of each domain can be tailored to its dynamic requirements
  • Clock Power is reduced
• Drawbacks
  • Complexity and penalty of synchronizers
  • Feasibility of multiple voltage regulators

Sleep Modes
• ACPI: Advanced Configuration and Power Interface
  • Developed by Microsoft, HP, Toshiba, Phoenix and Intel
  • Replaces APM and PnP BIOS
• Establishes interfaces for OS-directed power management
• Defines various power states, e.g. Cx, Sx… with various power-performance tradeoffs; OS can choose


Dynamic Voltage/Frequency Scaling
• Allow the device to dynamically adapt the voltage (and the frequency)
  • $P \sim V_{dd}^2$
  • $f \sim (V_{dd} - V_{th})^k / V_{dd}$, i.e. delay $\sim V_{dd} / (V_{dd} - V_{th})^k$
• Tradeoff between power reductions and delay increase
• But this is a very powerful paradigm
  – Approx. quadratic or cubic reduction in power (power density) relative to frequency reduction
  – Most other techniques are linear with respect to perf. loss
  – DVS switching overhead must be taken into account (PLL, etc.)
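
A sketch of how frequency and dynamic power fall together as the supply voltage is lowered. It assumes the alpha-power-law form f ~ (Vdd - Vth)^k / Vdd and dynamic power P ~ Vdd^2 f; the threshold voltage, exponent, and nominal voltage are illustrative values, not taken from the slides.

```python
def dvs_point(vdd, vth=0.3, k=1.5, vdd_nom=1.2):
    """Relative frequency and dynamic power at supply vdd, normalized to vdd_nom.

    Assumes f ~ (vdd - vth)**k / vdd and P ~ vdd**2 * f.
    vth, k and vdd_nom are illustrative values.
    """
    def fmax(v):
        return (v - vth) ** k / v

    freq = fmax(vdd) / fmax(vdd_nom)
    power = (vdd / vdd_nom) ** 2 * freq
    return freq, power

for v in (1.2, 1.1, 1.0, 0.9):
    f, p = dvs_point(v)
    print(f"Vdd={v:.1f}  rel. freq={f:.2f}  rel. power={p:.2f}")
```

Power drops faster than frequency at every step, which is the better-than-linear payoff the slide refers to. Switching overhead is not modeled here.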


DVS “Critical Power Slope”
• It may be more efficient not to use DVS, and to run at the highest possible frequency, then go into a sleep mode!
  • Depends on power dissipation in sleep mode vs. power dissipation at lowest voltage
• This has been formalized as the critical power slope (Miyoshi et al, ICS’02):
  • $m_{critical} = (P_{f_{min}} - P_{idle}) / f_{min}$
  • If the actual slope $m = (P_f - P_{f_{min}}) / (f - f_{min}) < m_{critical}$ then it is more energy efficient to run at the highest frequency, then go to sleep

• Switching overheads must be taken into account
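
A sketch of the critical-power-slope test, with a direct energy comparison as a cross-check. The function names and the numbers in the example are hypothetical; switching overheads are ignored, and frequencies are in GHz with work in billions of cycles.

```python
def critical_slope(p_fmin, p_idle, f_min):
    """m_critical = (P_fmin - P_idle) / f_min."""
    return (p_fmin - p_idle) / f_min

def prefer_race_to_sleep(p_f, f, p_fmin, f_min, p_idle):
    """True if running at f and then sleeping beats running at f_min."""
    m = (p_f - p_fmin) / (f - f_min)
    return m < critical_slope(p_fmin, p_idle, f_min)

def energy(p_run, f, p_idle, cycles, period):
    """Energy to run `cycles` at frequency f, then idle for the rest of `period`."""
    busy = cycles / f
    return p_run * busy + p_idle * (period - busy)

# Hypothetical operating points: 10 W at 1 GHz, 15 W at 2 GHz, 1 W idle.
print(prefer_race_to_sleep(p_f=15.0, f=2.0, p_fmin=10.0, f_min=1.0, p_idle=1.0))
print(energy(15.0, 2.0, 1.0, cycles=1.0, period=1.0),
      energy(10.0, 1.0, 1.0, cycles=1.0, period=1.0))
```

The slope test and the direct energy comparison agree because both reduce to comparing (P - P_idle) / f at the two operating points.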


Multi Clock Domain Architecture

• Multiple clock domains inside the processor

• Globally-asynchronous locally synchronous (GALS) clock style

• Independent voltage/frequency scaling among domains

• Synchronizers to ensure inter-domain communication


Application-Specific Hardware
• Specialized logic is usually much lower power
• Co-processors
  • Ex: TCP/IP offload, codecs, etc.
• Functional units
  • Ex: Intel SSE, specialized arithmetic (e.g., graphics), etc.
  • Ex: Custom instructions in configurable cores (e.g., Tensilica)
• Specific example: Zoran ER4525 – cell phone
  • ARM microcontroller, no DSP!
  • Video capture & pre/post processing
  • Video codec
  • 2D/3D rendering
  • Video display
  • Security


Power Savings for Real Time Systems
• Soft vs. hard real time
• Most work has focused on DVS scheduling
• Example: Multimedia apps must process every frame within a time limit
  • Slow down the processor to just meet deadlines
    – Based on frame type (Hughes et al MICRO 2001)
    – Based on queue occupancy (Lu et al, ICCD 2003)
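
The "slow down to just meet deadlines" idea can be sketched as picking the lowest available frequency that still finishes a frame in time. This is a generic illustration, not the algorithm of either cited paper; the function name, the cycle estimates, and the frequency list are hypothetical.

```python
def pick_frequency(cycles_needed, deadline_s, freqs_hz):
    """Lowest available frequency that finishes cycles_needed within deadline_s.

    Returns None if even the highest frequency misses the deadline.
    """
    for f in sorted(freqs_hz):
        if cycles_needed / f <= deadline_s:
            return f
    return None

freqs = [600e6, 800e6, 1000e6]   # hypothetical DVS settings
frame_deadline = 1.0 / 30.0      # 30 frames per second
for cycles in (15e6, 30e6, 40e6):
    print(cycles, pick_frequency(cycles, frame_deadline, freqs))
```

A real scheduler would also need a per-frame estimate of the cycles required (for example from frame type or queue occupancy) and would have to account for voltage-switching time.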

Leakage Control

[Figure: three leakage-control circuits: body bias (Vbp and Vbn, +Ve and -Ve bias, relative to Vdd), stack effect (equal loading), and sleep transistor. Annotated “2-10X reduction”.]


Low-Leakage Caches
• Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
  • Uses sleep transistor on Vdd/ground for each cache line
  • Typically considered non-state-preserving, but recent work (Agarwal et al, DAC’02) suggests that gated-Vss may preserve state
  • Many algorithms for determining when to gate
  • Simplest (Kaxiras et al, ISCA-28): Two-bit access counter and decay interval
  • Adaptive decay intervals - hard
• Drowsy cache (Flautner et al, ISCA-29)
  • Uses dual supply voltages: normal Vdd and a low Vdd close to the threshold voltage
  • State preserving, but requires an extra cycle to wake up – two extra cycles if tags are decayed
• State preservation using leakage currents (Li et al, PACT’02; Agarwal et al, DAC’02)
  • Similar to gated-Vss but designed to keep supply voltage high enough to preserve state (100-120 mV)
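
A loose sketch of the two-bit counter idea for deciding when to gate a cache line. It is an illustration only: the class name, the tick granularity (a quarter of the decay interval), and the reset-on-access behavior are assumptions, and the hardware described in the cited papers may differ in detail.

```python
class DecayLine:
    """One cache line with a two-bit saturating idle counter (illustrative)."""

    def __init__(self):
        self.counter = 0      # 0..3, coarse ticks since the last access
        self.powered = True   # False once the line has been gated off

    def access(self):
        """Reset the idle counter; report whether the line had been gated."""
        was_gated = not self.powered
        self.powered = True   # a gated line must be refetched and re-powered
        self.counter = 0
        return was_gated

    def tick(self):
        """Called once per coarse global tick (assumed: decay interval / 4)."""
        if not self.powered:
            return
        if self.counter == 3:
            self.powered = False   # idle for a full decay interval: gate it
        else:
            self.counter += 1

line = DecayLine()
for _ in range(4):
    line.tick()
print(line.powered, line.access())
```

The coarse global tick keeps the per-line state to two bits, at the price of a decay interval that is only accurate to within one tick.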


DVS
• Chip or “island” granularity
• Leakage depends exponentially on Vdd
• Fine granularities
  • Requires routing multiple Vdd’s, voltage steppers
• Dynamic switching
  • Instead of Vth or sleep transistor, can use low voltage to put a logic block to sleep
  • Reliability questions
    – Low voltage reduces Qcrit


Gate Leakage
• Not clear if new oxide materials will arrive in time
• Any technique that reduces Vdd helps
• Otherwise it seems difficult to develop architecture techniques that directly attack gate leakage
• In fact, very little work has been done in this area
• One example: domino gates (Hamzaoglu & Stan, ISLPED’02)
  • Replace traditional NMOS pull-down network with a PMOS pull-up network
  • Gate leakage is greater in NMOS than PMOS
  • But PMOS domino gate is slower
• Note: Gate oxide so thin - especially prone to manufacturing variations


Application Variations

• Wide variation across applications

• Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT)

[Figure: temperature (Kelvin, 370 to 420) for gzip, mcf, swim, mgrid, applu, eon, and mesa; one panel for single-threaded (ST) execution and one for SMT]


Dynamic Thermal Management (DTM)
(Brooks and Martonosi, HPCA 2001)

[Figure: temperature vs. time with DTM disabled and with DTM/response engaged. Levels marked: designed for cooling capacity w/out DTM; DTM trigger level; designed for cooling capacity w/ DTM. The gap between the two design points is labeled system cost savings. Source: David Brooks 2002]


Thermal Modeling
• Want a fine-grained, dynamic model of temperature
  • At a granularity architects can reason about
  • That accounts for adjacency and package
  • That does not require detailed designs
  • That is fast enough for practical use
• HotSpot - a compact model based on thermal R, C
  • Parameterized to automatically derive a model based on various…
    – Architectures
    – Power models
    – Floorplans
    – Thermal Packages


Our Model (Lateral)


Our Model (Lateral and Vertical)

Interface material (not shown)

Derived from material and geometric properties


Validation
• Validated and calibrated using MICRED test chips
  • 9x9 array of power dissipators and sensors
  • Compared to HotSpot configured with same grid, package

• Within 7% for both steady-state and transient step-response

• Interface material (chip/spreader) matters a lot


HotSpot
• Time evolution of temperature is driven by unit activities and power dissipations averaged over 10K cycles
  • Power dissipations can come from any power simulator, act as “current sources” in RC circuit
  • Simulation overhead in Wattch/SimpleScalar: < 1%
• Requires models of
  • Floorplan: important for adjacency
  • Package: important for spreading and time constants
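
To illustrate the thermal-RC idea, here is a sketch of a single block modeled as one RC node, with power acting as the current source. This is not HotSpot's code or interface: HotSpot builds a network of many lateral and vertical R and C elements from the floorplan and package, whereas this example has one node and made-up parameter values.

```python
def simulate_block(power_trace, r_th=0.8, c_th=0.05, t_amb=318.0, dt=0.001):
    """Forward-Euler temperature trace of one block modeled as a thermal RC node.

    power_trace: average power (W) for each interval of dt seconds; it plays
    the role of a current source in the equivalent RC circuit.
    r_th (K/W), c_th (J/K), t_amb (K) and dt (s) are illustrative values.
    """
    temps = []
    t = t_amb
    for p in power_trace:
        # C * dT/dt = P - (T - T_amb) / R
        t += dt * (p - (t - t_amb) / r_th) / c_th
        temps.append(t)
    return temps

trace = simulate_block([20.0] * 1000)   # 20 W held for one second
print(f"final temperature: {trace[-1]:.2f} K")
```

The temperature approaches t_amb + P * r_th with time constant r_th * c_th, which is why averaging power over intervals much shorter than that time constant loses little accuracy.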


Hybrid DTM
• DVS is attractive because of its cubic advantage
  • $P \propto V^2 f$
  • This factor dominates when DTM must be aggressive
  • But changing DVS setting can be costly
    – Resynchronize PLL
    – Sensitive to sensor noise -> spurious changes
• Fetch gating is attractive because it can use instruction level parallelism to reduce impact of DTM
  • Only effective when DTM is mild
• So use both!


Migrating Computation
• When one unit overheats, migrate its functionality to a distant, spare unit (MC)
  • Spare register file (Skadron et al. 2003)
  • Separate core (CMP) (Heo et al. 2003)
  • Microarchitectural clusters
  • etc.
• Raises many interesting issues
  • Cost-benefit tradeoff for that area
  • Use both resources (scheduling)
  • Extra power for long-distance communication
  • Floorplanning


Hybrid DTM, cont.
• Combine fetch gating with DVS
  • When DVS is better, use it
  • Otherwise use fetch gating
  • Determined by magnitude of temperature overshoot
  • Crossover at FG duty cycle of 3
• FG has low overhead: helps reduce cost of sensor noise

[Figure: slowdown (1.0 to 1.4) vs. fetch-gating duty cycle (20 down to 2) for FG, DVS, and Hyb]


Hybrid DTM, cont.

• DVS doesn’t need more than two settings for thermal control

• Lower voltage cools chip faster

• FG by itself does need multiple duty cycles and hence requires PI control

• But in a hybrid configuration, FG does not require PI control

• FG is only used at mild DTM settings

• Can pick one fixed duty cycle

• This is beneficial because feedback control is vulnerable to noise
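
A sketch of the hybrid policy as a simple threshold decision: mild overshoot gets fetch gating at one fixed duty cycle, larger overshoot gets the single low DVS setting. The function name, the trigger temperature, and the margin separating mild from severe overshoot are hypothetical values chosen for illustration.

```python
def hybrid_dtm(temp, trigger=358.0, severe_margin=2.0):
    """Pick a DTM response from the sensed temperature (kelvin).

    trigger and severe_margin are hypothetical thresholds. Mild overshoot
    uses fetch gating at one fixed duty cycle; larger overshoot switches
    to the single low-voltage DVS setting. No feedback controller is used.
    """
    overshoot = temp - trigger
    if overshoot <= 0:
        return "none"
    if overshoot < severe_margin:
        return "fetch_gating"
    return "dvs_low"

for t in (350.0, 359.0, 361.0):
    print(t, hybrid_dtm(t))
```

Because each response has a single fixed setting, the policy needs only comparisons against thresholds, which is less sensitive to sensor noise than a PI loop.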


Sensors
• Almost half of DTM overhead is due to
  • Guard banding due to offset errors and lack of co-located sensors
  • Spurious sensor readings due to noise
• Need localized, fine-grained sensing
  • Need new sensor designs that are cheap and can be used liberally – co-locate with hotspots
  • But these may be imprecise
  • Many sensor designs look promising
  • Need new data fusion techniques to reduce imprecision, possibly combine heterogeneous sensors


Multi-clustered Microarchitecture
