Upload
madison-wells
View
218
Download
0
Embed Size (px)
Citation preview
© 2
005,
Kev
in S
kadr
on
Designing Cool Chips in an Era of Gigascale Integration:
History, Challenges, and Opportunities
Kevin Skadron
LAVA/HotSpot LabDept. of Computer Science
University of VirginiaCharlottesville, VA
2
© 2
005,
Kev
in S
kadr
on
“Cooking-Aware” Computing?
3
© 2
005,
Kev
in S
kadr
on
ITRS Projections
• These are targets, doubtful that they are feasible
• Growth in power density means cooling costs continue to grow
• High-performance designs seem to be shifting away from clock frequency toward # cores
ITRS 2004
Year 2003 2006 2010 2013 2016Tech node (nm) 100 70 45 32 22Vdd (high perf) (V) 1.2 1.1 1.0 0.9 0.8Vdd (low power) (V) 1.0 0.9 0.7 0.6 0.5Frequency (high perf) (GHz) 3.0 6.8 15.1 23.0 39.7
High-perf w/ heatsink 149 180 198 198 198Cost-performance 80 98 120 138 158Hand-held 2.1 2.4 2.8 3.0 3.0
Max power (W)
2001 – was 0.4
2001 – was 288
4
© 2
005,
Kev
in S
kadr
on
Power EvolutionM
ax
Po
we
r (W
att
s)
i386 i386
i486 i486
Pentium® Pentium®
Pentium® w/MMX tech.
Pentium® w/MMX tech.
1
10
100
Pentium® Pro Pentium® Pro
Pentium® II Pentium® II
Pentium® 4Pentium® 4Pentium® 4Pentium® 4
??
Pentium® III Pentium® III
Source: Intel
Zero-Sum Architecture!
5
© 2
005,
Kev
in S
kadr
on
Leakage – A Growing Problem• The fraction of leakage power is increasing exponentially with
each generation• Also exponentially dependent on temperature• Curiously, ITRS 2004 projections are lower than what industry
is currently reporting• Changes tradeoffs! Idle logic hurts, e.g. CMPs
Static power/ Dynamic Power
0
10
20
30
40
50
60
70
298
303
308
313
318
323
328
333
338
343
348
353
358
363
368
373
Temperature(K)
Pe
rcen
tag
e
180nm 130nm 100nm 90nm 80nm 70nm
Increasingratioacrossgenerations
(DataderivedfromITRS2001)
6
© 2
005,
Kev
in S
kadr
on
Thermal Packaging is Expensive• Nvidia GeForce 5900 card – “dustbuster”
Source: Tech-Report.com
7
© 2
005,
Kev
in S
kadr
on
Architecture Trends• High-performance market
• “Fat” (wide, superscalar) CPUs and high frequencies giving way to multiple cores, plateau in frequencies
– Huge number of multi-core product announcements
– # cores might be the next marketing buzz
• Multiple threads per core
– This probably won’t scale – limit of 2-4 thread contexts
• Interesting example: Sun Niagara
– 8 4-threaded cores
• Across all market segments• Growing integration (SoC)
• Specialized co-processors and offload engines
• Growing heterogeneity• Part of the programming model in SoCs
• Not part of the programming model in CMPs!
8
© 2
005,
Kev
in S
kadr
on
Basketball Analogy• Recent trends in high-performance
processors are like building a team around Shaq when you have a limited budget
• Huge salary (power) to one player• Huge ego, team friction (heat)• Shaq can’t get much better (except possibly his
free throws) (diminishing returns)• New trend: multiple CPUs on a chip (CMP/SoC)
• Don’t need superstars (less power per core, better energy efficiency)
• Choose team players (better heat distribution)• Performance scales linearly with cores• Heterogeneous cores possible (SoCs)• Detroit Pistons
9
© 2
005,
Kev
in S
kadr
on
Talk Outline• Different philosophies of Power-Aware
design• Energy efficient vs. low power vs. temperature-
aware
• Power Management Techniques• Dynamic• Static• Temperature
• Summary of Important Challenges
• My perspective tends to be architecture-centric, and slanted toward high-performance desktop/server/etc. CPUs
10
© 2
005,
Kev
in S
kadr
on
Metrics
• Power• Average power, instantaneous power, peak power
• Energy• Energy (MIPS/W)
• Energy-Delay product (MIPS2/W)
• Energy-Delay2 product (MIPS3/W) – voltage independent!
• Temperature• Correlated with power density over sufficiently
large time periods
• No good figures of merit for trading off thermal efficiency against performance, area, or energy efficiency
(Zyuban, GVLSI’02)
Low-Power DesignPower-Aware/
Energy-EfficientDesign
Temperature-Aware Design
Design for power delivery
11
© 2
005,
Kev
in S
kadr
on
12
© 2
005,
Kev
in S
kadr
on
Circuit Techniques • Transistor sizing
• Signal and clock gating
• Dynamic vs. static logic
• Circuit restructuring
• Low power caches, register files, queues
• These typically reduce the capacitance being switched
13
© 2
005,
Kev
in S
kadr
on
Clock Gating, Signal Gating
• Implementation• Simple gate that replaces
one buffer in the clock tree• Signal gating is similar, helps
avoid glitches• Delay is generally not a concern
except at fine granularities
• Choice of circuit design andclock gating style can have a dramatic effect on temperaturedistribution
““Disabling a functional block when it is not required for an extended Disabling a functional block when it is not required for an extended periodperiod””
signal
ctrl
functionalunit
functionalunit
14
© 2
005,
Kev
in S
kadr
on
Circuit Restructuring• Pipeline (tolerate smaller, longer-latency circuitry)• Parallelize (can reduce frequency)• Reorder inputs so that most active input is closest
to output (reduces switched capacitance)• Restructure gates (equivalent functions are not
equivalent in switched capacitance)
Logic BlockLogic BlockFreq = 1Vdd = 1Throughput = 1Power = 1Area = 1 Pwr Den = 1
Vdd
Logic BlockLogic Block
Freq = 0.5Vdd = 0.5Throughput = 1Power = 0.25Area = 2Pwr Den = 0.125
Vdd/2
Logic BlockLogic Block
Example: Parallelizing (maintain throughput)
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
15
© 2
005,
Kev
in S
kadr
on
Architectural-Level Techniques• Sleep modes• Pipeline depth• Energy-efficient front end
• Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)• Multi-threading• Dynamic voltage/frequency scaling• Multi clock domain architectures (similar to GALS)• Power islands• Encoding/compression
• Can reduce both switched capacitance and cross talk• Application specific hardware
• Co-processors, functional units, etc.• Compiler techniques
Prevalent
Growing or Imminent
16
© 2
005,
Kev
in S
kadr
on
Optimal Pipeline Depth
Hartstein and Puzak, ACM TACO, Dec. 2004
• Increased power and diminishing returns vs. increased throughput
• 5-10 stages, 15-30 FO4• Srinivasan et al, MICRO-35, Hartstein and Puzak, ACM TACO,
Dec. 2004
Single issue
4-wide issue
Pipeline Stages
17
© 2
005,
Kev
in S
kadr
on
Architectural-Level Techniques• Sleep modes• Pipeline depth• Energy-efficient front end
• Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)• Multi-threading• Dynamic voltage/frequency scaling• Multi clock domain architectures (similar to GALS)• Power islands• Encoding/compression
• Can reduce both switched capacitance and cross talk• Application specific hardware
• Co-processors, functional units, etc.• Compiler techniques
Prevalent
Growing or Imminent
18
© 2
005,
Kev
in S
kadr
on
Multi-threading• Do more useful work per unit time
• Amortize overhead and leakage
• Switch-on-event MT• Switch on cache misses, etc. (Ex: Sun Niagara
“throughput computing”)
• Can even rotate among threads every instruction (Tera/Cray)
• Simultaneous Multithreading/HyperThreading• For superscalar – eliminate waste
• Intel Pentium 4, IBM POWER5, Alpha 21464
19
© 2
005,
Kev
in S
kadr
on
Architectural-Level Techniques• Sleep modes• Pipeline depth• Energy-efficient front end
• Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)• Multi-threading• Dynamic voltage/frequency scaling
• Limits• Multi clock domain architectures (similar to GALS)• Power islands• Encoding/compression
• Can reduce both switched capacitance and cross talk• Application specific hardware
• Co-processors, functional units, etc.• Compiler techniques
Prevalent
Growing or Imminent
20
© 2
005,
Kev
in S
kadr
on
Compiler Techniques for Low Power• Basic idea is for the compiler to identify
opportunities for using low-power modes
• Compiler-guided DVS• Reduce voltage in memory-bound program regions
– Hsu and Kremer, ISLPED’01, PLDI’03; Xie et al, PLDI’03• Dynamic resource configuration/hibernation
• Deactivate modules when they won’t be used for a long time (>> sleep/wakeup time)…avoids waiting for timeout
– Heath et al, PACT’02• Profile/compiler-guided adaptation
• Subroutine-guided (“positional”) adapation (Huang et al, ISCA’03)
– Uses profiling and a hierarchy of low-power modes
• Much work in this area – this only touches the surface
21
© 2
005,
Kev
in S
kadr
on
22
© 2
005,
Kev
in S
kadr
on
Static Power Dissipation• Static power: dissipation due to leakage
current
• Exponentially dependent on T, Vdd, Vth
• Most important sources of static power: subthreshold leakage and gate leakage• We will focus on subthreshold
• Gate leakage has essentially been ignored – New gate insulation materials may solve problem
23
© 2
005,
Kev
in S
kadr
on
Thermal Runaway• The leakage-temperature feedback can lead
to a positive feedback loop• Temperature increases leakage increases
temperature increases leakage increases • …
Source: www.usswisconsin.org
24
© 2
005,
Kev
in S
kadr
on
A Smorgasbord• Transistor sizing• Multi Vth
• Dynamic threshold voltage – reverse body bias – Transmeta Efficeon• Transmeta uses runtime compilation and load monitoring to
select thresholds
• Stack effect• Sleep transistors• DVS
• Coarse or fine grained
• Low leakage caches, register files, queues• Techniques for reducing gate leakage• Hurry up and wait
• Low leakage: maintain min possible V, f• High leakage: use high V/f to finish work quickly, then go to
sleep
25
© 2
005,
Kev
in S
kadr
on
Sleep Transistors
• Recent work suggests that a properly sized, low-Vth footer transistor can preserve enough leakage to keep the cell active (Li et al, PACT’02; Agarwal et al, DAC’02)• Great care must be taken when
switching back to full voltage: noise can flip bits
• Extra latency may be necessarywhen re-activating
• Similar to principles in sub-threshold computing• Ex – sensor motes for wireless
sensor networks
• Concerns about susceptibility to SEU
Logic BlockLogic Block
26
© 2
005,
Kev
in S
kadr
on
A Smorgasbord• Transistor sizing• Multi Vth
• Dynamic threshold voltage – reverse body bias – Transmeta Efficeon• Transmeta uses runtime compilation and load monitoring to
select thresholds
• Stack effect• Sleep transistors• DVS
• Coarse or fine grained
• Low leakage caches, register files, queues• Techniques for reducing gate leakage• Hurry up and wait
• Low leakage: maintain min possible V, f• High leakage: use high V/f to finish work quickly, then go to
sleep
27
© 2
005,
Kev
in S
kadr
on
28
© 2
005,
Kev
in S
kadr
on
Worst-Case leads to Over-design• Average case temperature lower than worst-case
• Aggressive clock gating
• Application variations
• Underutilized resources, e.g. FP units during integer code
• Currently 20-40% difference
Source: Gunther et al, ITJ 2001
Reduced targetpower density
Reduced coolingcost
TDP
29
© 2
005,
Kev
in S
kadr
on
Temporal, Spatial VariationsTemperature variationof SPEC applu over time
Localized hot spots dictate cooling solution
30
© 2
005,
Kev
in S
kadr
on
Temperature-Aware Design• Worst-case design is wasteful
• Power management is not sufficient for chip-level thermal management
• Must target blocks with high power density
• When they are hot
• Spreading heat helps
– Even if energy not affected
– Even if average temperature goes up
• This also helps reduce leakage
31
© 2
005,
Kev
in S
kadr
on
Role of Architecture?Dynamic thermal management (DTM)
• Automatic hardware response when temp. exceeds cooling• Cut power density at runtime, on demand• Trade reduced costs for occasional performance loss
• Architecture natural granularity for thermal management• Activity, temperature correlated within arch. units• DTM response can target hottest unit: permits fine-tuned
response compared to OS or package• Modern architectures offer rich opportunities for remapping
computation– e.g., CMPs/SoCs, graphics processors, tiled architectures– e.g., register file
• Thermal engineering must consider role of architecture
• Thermal engineers and architects need to collaborate
32
© 2
005,
Kev
in S
kadr
on
Existing DTM Implementations• Intel Pentium 4: Global clock gating with
shut-down fail-safe
• Intel Pentium M: Dynamic voltage scaling
• Transmeta Crusoe: Dynamic voltage scaling
• IBM Power 5: Probably fetch gating
• ACPI: OS configurable combination of passive & active cooling
• These solutions sacrifice time (slower or stalled execution) to reduce power density
• Better: a solution in “space”• Tradeoff between exacerbating leakage (more idle logic) or
reducing leakage (lower temperatures)
33
© 2
005,
Kev
in S
kadr
on
Alternative: Migrating Computation
This is only a simplistic illustrative example
34
© 2
005,
Kev
in S
kadr
on
Space vs. Time• Moving the hotspot, rather than throttling it,
reduces performance overhead by almost 60%
1.270
1.359
1.231
1.112
1.00
1.10
1.20
1.30
1.40
DVS FG Hyb MC
Slo
wd
ow
n F
ac
tor
Time Space
The greater the replication and spread,
the greater the opportunities
35
© 2
005,
Kev
in S
kadr
on
36
Sources of VariationsSources of Variations
0
50
100
150
200
250
He
at
Flu
x (
W/c
m2
)
Heat Flux (W/cm2)Results in Vcc variation
40
50
60
70
80
90
100
110
Te
mp
era
ture
(C
)
Temperature Variation (°C)Hot spots
10
100
1000
10000
1000 500 250 130 65 32
Technology Node (nm)
Me
an
Nu
mb
er
of
Do
pa
nt
Ato
ms
Random Dopant Fluctuations
0.01
0.1
1
1980 1990 2000 2010 2020
micron
10
100
1000
nm
193nm193nm248nm248nm
365nm365nmLithographyLithographyWavelengthWavelength
65nm65nm90nm90nm
130nm130nm
GenerationGeneration
GapGap
45nm45nm32nm32nm 13nm 13nm
EUVEUV
180nm180nm
Source: Mark Bohr, Intel
Sub-wavelength Lithography
Source: S
hekhar Borkar, keynote presentation, M
ICR
O-37, 2004
37
Impact of Static VariationsImpact of Static Variations
130nm
30%
5X
FrequencyFrequency~30%~30%
LeakageLeakagePowerPower~5-10X~5-10X
0.90.9
1.01.0
1.11.1
1.21.2
1.31.3
1.41.4
11 22 33 44 55Normalized Leakage (Isb)Normalized Leakage (Isb)
No
rmal
ized
Fre
qu
en
cyN
orm
aliz
ed F
req
ue
ncy
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
38
© 2
005,
Kev
in S
kadr
on
Parameter Variations• Parameter variations mess everything up!• T variation in Vcc, leakage T• Vcc speed variation, leakage T• Manufacturing (L, W, Vth, etc) speed, Vcc, T• Packaging variations (TIM, roughness) T
• Some transistors/functional units won’t work, some will be lousy, some will fail over time, and some will be intermittent
• Guard banding won’t work• Design devolves to worst component, can’t easily bound
intermittent behavior• T/P problems may no longer be limited to specific
units• Makes dynamic logic even more difficult
39
© 2
005,
Kev
in S
kadr
on
Future Architectures• Asymmetry unavoidable
• Specialized units (part of programming model)
• Power management (can try to hide this)
• Thermal throttling (hard to hide this)
• Parameter variations (hard to hide this without extreme performance loss)
40
© 2
005,
Kev
in S
kadr
on
Raw Architecture (MIT)
ComputeProcessor
Routers
On-chip networks
Source: MIT RAW project
Only one of many examples of tiled architectures
41
© 2
005,
Kev
in S
kadr
on
Future Architectures• Increasing integration, e.g. increasing
# cores, e.g. Niagara
• Clustered architectures
• Tiled architectures
• Multiple voltage islands
• Asymmetry unavoidable• Specialized units (part of programming model)
• Power management (can try to hide this)
• Thermal throttling (hard to hide this)
• Parameter variations (hard to hide this without extreme performance loss)
• Increasing problems with yield, failures in time(Redundancy: costly; graceful degradation: introduces asymmetry)
42
© 2
005,
Kev
in S
kadr
on
Power and Thermal Security• A consequence of designing for expected
rather than worst-case conditions
• Energy-drain attacks
• Voltage stability attacks (dI/dt)
• Thermal attacks• Thermal throttling
• Denial of service
• Direct physical damage
43
© 2
005,
Kev
in S
kadr
on
Summary• Reviewed current techniques for managing
dynamic power, leakage power, temperature• A major obstacle with architectural techniques
is the difficulty of predicting performance impact
• Continuing integration makes power an ever-present concern
• Thermal limits and parameter variations are becoming serious obstacles
• Spread heat in space, not time
• Security challenges
44
© 2
005,
Kev
in S
kadr
on
Soap-Box• Architecture solutions are essential
• Thermal engineers, circuit designers, CAD designers, and architects all need to work together
• Joint infrastructure• Simulators – esp. pre-RTL tools
• Test chips
– Ex: Combine architecture and circuit research on a single test chip
45
© 2
005,
Kev
in S
kadr
on
46
© 2
005,
Kev
in S
kadr
on
More Info
http://www.cs.virginia.edu/~skadron
LAVA Lab
47
© 2
005,
Kev
in S
kadr
on
Backup Slides
48
© 2
005,
Kev
in S
kadr
on
Hot Chips are No Longer Cool!W
att
s/c
m2
1
10
100
1000
i386i386i486i486
Pentium® Pentium®
Pentium® ProPentium® Pro
Pentium® IIPentium® IIPentium® IIIPentium® IIIHot plateHot plate
Nuclear ReactorNuclear ReactorNuclear ReactorNuclear Reactor
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process * “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference key note - 1999.Technologies” – Fred Pollack, Intel Corp. Micro32 conference key note - 1999.
Pentium® 4Pentium® 4
RocketRocketNozzleNozzleRocketRocketNozzleNozzle
Today’slaptops:
SIA
49
© 2
005,
Kev
in S
kadr
on
ITRS quotes – thermal challenges• For small dies with high pad count, high power
density, or high frequency, “operating temperature, etc for these devices exceed the capabilities of current assembly and packaging technology.”
• “Thermal envelopes imposed by affordable packaging discourage very deep pipelining.”
• Intel recently canceled its NetBurst microarchitecture
– Press reports suggest thermal envelopes were a factor
50
© 2
005,
Kev
in S
kadr
on
Thermal Packaging is Expensive• P4 packaging
Source: Intel web site
51
© 2
005,
Kev
in S
kadr
on
• Laptops and other constrained form factors
Thermal Packaging is Expensive
52
© 2
005,
Kev
in S
kadr
on
Trends in Battery Technology• Battery lifetime is increasing perhaps 8-
10%/yr.(Powers, Proc. of IEEE 1995)
• Not keeping up with rate of growth in energy consumption
Source: Rabaey 1995, cited in Irwin et al, “Low Power Design Methodologies, Hardware and Software Issues”, tutorial at PACT 2000
53
© 2
005,
Kev
in S
kadr
on
Dynamic Power Consumption• Power dissipated due to switching activity
• A capacitance is charged and discharged
Vdd
Charge/discharge at the frequency Charge/discharge at the frequency ffP=a CLV2 f
Ec=1/2CLV2
Ed=1/2CLV2
54
© 2
005,
Kev
in S
kadr
on
Transistor Sizing• Transistor sizing plays an important role to reduce
power
• Delay ~ (k / ln K)
• Power ~ K / (K-1)
• Optimum K for both power and delay must be pursued
C0 C1 CN-1 CN
K = Ci/Ci-1
55
© 2
005,
Kev
in S
kadr
on
Signal Gating
• Implementation• Simple gate
• Tristate buffer
• ...
• Control signal needed• Generation requires additional logic
• Especially helps to prevent power dissipation due to glitches
““techniques to mask unwanted switching activities from propagating techniques to mask unwanted switching activities from propagating forward, causing unnecessary power dissipationforward, causing unnecessary power dissipation””
signal
ctrl
Output
5656
Cache DesignCache Design
Banked organizationBanked organization Targets switched capacitanceTargets switched capacitance CCaccessaccess = R = R C C C Ccellcell / B/ B
Dividing word line Dividing word line Same effect for wordlinesSame effect for wordlines
Reducing voltage swingsReducing voltage swings Sense amplifiers used to detect VSense amplifiers used to detect Vdiffdiff across bitlines across bitlines
Read operation can complete as soon as VRead operation can complete as soon as Vdiff diff is detectedis detected Limiting voltage swing saves a fraction of powerLimiting voltage swing saves a fraction of power
Pulse word linesPulse word lines Enabling the word line for the time needed to discharge bitcell Enabling the word line for the time needed to discharge bitcell
voltagevoltage Designer needs to estimate access time and implement a pulse Designer needs to estimate access time and implement a pulse
generatorgenerator
5757
Low Power Register File DesignLow Power Register File Design
RF’s usually single-ended bitlinesRF’s usually single-ended bitlines Modified storage cellModified storage cell
Lot of zeros fetched from the RFLot of zeros fetched from the RF Bitline connections are modified to eliminate bitline discharge Bitline connections are modified to eliminate bitline discharge
when reading a zerowhen reading a zero
Tseng and Asanovic, ICSD, 2000Zyuban and Kogge, ISLPED 1998
5858
Efficient Issue QueueEfficient Issue Queue
Useful comparisonUseful comparison Empty entries and ready entries consume energyEmpty entries and ready entries consume energy
• Wakeup of empty entries can be disabledWakeup of empty entries can be disabled Gating off precharge logic using valid bitGating off precharge logic using valid bit
• Wakeup of ready sources can be disabledWakeup of ready sources can be disabled Gating off precharge logic using ready bitGating off precharge logic using ready bit
Folegnani and Gonzalez, ISCA 2001Folegnani and Gonzalez, ISCA 2001
Energy-efficient ComparatorsEnergy-efficient Comparators Traditional comparators dissipate energy on a mismatch in any Traditional comparators dissipate energy on a mismatch in any
bit position.bit position. 10%-20% of source operands match each cycle10%-20% of source operands match each cycle Solution: comparators that dissipate energy in a matchSolution: comparators that dissipate energy in a match
Kuckuc Kuckuc et alet al, ISLPED 2001, ISLPED 2001
5959
Multi Clock Domain ArchitectureMulti Clock Domain Architecture
Domains must be carefully chosenDomains must be carefully chosen Small cost on communicationsSmall cost on communications Re-using existing structures for cross-domain Re-using existing structures for cross-domain
synchronizatoinsynchronizatoin
ExampleExample 5 domains5 domains
• Front-endFront-end• Integer unitInteger unit• FP unitFP unit• On-chip cache unitOn-chip cache unit• Main memoryMain memory
6060
Multi Clock Domain ArchitectureMulti Clock Domain Architecture
L2unifiedcache
L1d-cache
LSQ
MemoryFront-end
branchpredict renameL1
i-cache
fetch dispatchIFQ
int.registerfile
int.FUs
IIQ
Integer
fp.registerfile
fp.FUs
FIQ
Floating Point
MainMemory
CPU
Magklis et al, ISCA 2003
6161
Multi Clock Domain ArchitectureMulti Clock Domain Architecture AdvantagesAdvantages
Local clock design is not aware of global skewLocal clock design is not aware of global skew Each domain limited by its local critical path, allowing Each domain limited by its local critical path, allowing
higher frequencieshigher frequencies Different voltage regulators allow for a finer-grain Different voltage regulators allow for a finer-grain
energy controlenergy control Frequency/voltage of each domain can be tailored to Frequency/voltage of each domain can be tailored to
its dynamic requirementsits dynamic requirements Clock Power is reducedClock Power is reduced
DrawbacksDrawbacks Complexity and penalty of synchronizersComplexity and penalty of synchronizers Feasibility of multiple voltage regulatorsFeasibility of multiple voltage regulators
6262
Sleep Modes Sleep Modes
ACPI: Advance Configuration and Power InterfaceACPI: Advance Configuration and Power Interface Developed by Microsoft, HP, Toshiba, Phoenix and IntelDeveloped by Microsoft, HP, Toshiba, Phoenix and Intel Replaces APM and PnP BIOSReplaces APM and PnP BIOS
Establishes interfaces for OS-directed power-Establishes interfaces for OS-directed power-managementmanagement
Defines various power states, e.g. Cx, Sx… with various Defines various power states, e.g. Cx, Sx… with various power-performance tradeoffs—OS can choosepower-performance tradeoffs—OS can choose
63
© 2
005,
Kev
in S
kadr
on
Dynamic Voltage/Frequency Scaling• Allow the device to dynamically adapt the
voltage (and the frequency)• P ~ Vdd
2
• F ~ Vdd/(Vdd-Vth)k
• Tradeoff between power reductions and delay increase
• But this is a vey powerful paradigm– Approx. quadratic or cubic reduction in power
(power density) relative to frequency reduction– Most other techniques are linear with respect
to perf. loss– DVS switching overhead must be taken into
account (PLL, etc.)
64
© 2
005,
Kev
in S
kadr
on
DVS “Critical Power Slope”• It may be more efficient not to use DVS, and
to run at the highest possible frequency, then go into a sleep mode!• Depends on power dissipation in sleep mode vs.
power dissipation at lowest voltage
• This has been formalized as the critical power slope (Miyoshi et al, ICS’02):• mcritical = (Pfmin
– Pidle) / fmin
• If the actual slope m = (Pf - Pfmin) / (f – fmin) < mcritical
then it is more energy efficient to run at the highest frequency, then go to sleep
• Switching overheads must be taken into account
65
© 2
005,
Kev
in S
kadr
on
Multi Clock Domain Architecture• Multiple clock domains inside the processor
• Globally-asynchronous locally synchronous (GALS) clock style
• Independent voltage/frequency scaling among domains
• Synchronizers to ensure inter-domain communication
66
© 2
005,
Kev
in S
kadr
on
Application-Specific Hardware• Specialized logic is usually much lower power
• Co-processors• Ex: TCP/IP offload, codecs, etc.
• Functional units• Ex: Intel SSE, specialized arithmetic (e.g., graphics), etc.
• Ex: Custom instructions in configurable cores (e.g., Tensilica)
• Specific example: Zoran ER4525 – cell phone• ARM microcontroller, no DSP!
• Video capture & pre/post processing
• Video codec
• 2D/3D rendering
• Video display
• Security
67
© 2
005,
Kev
in S
kadr
on
Power Savings for Real Time Systems• Soft vs. hard real time
• Most work has focused on DVS scheduling
• Example: Multimedia apps must process every frame within a time limit• Slow down the processor to just meet deadlines
– Based on frame type (Hughes et al MICRO 2001)
– Based on queue occupancy (Lu et al, ICCD 2003)
68
Leakage ControlLeakage ControlBody Bias
VddVbp
Vbn-Ve
+Ve
2-10X2-10XReductionReduction
Sleep Transistor
Logic BlockLogic Block
2-1000X2-1000XReductionReduction
Stack Effect
Equal Loading
5-10X5-10XReductionReduction
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
69
© 2
005,
Kev
in S
kadr
on
Low-Leakage Caches• Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
• Uses sleep transistor on Vdd/ground for each cache line
• Typically considered non-state-preserving, but recent work (Agarwal et al, DAC’02) suggests that gated-Vss it may preserve state
• Many algorithms for determining when to gate
• Simplest (Kaxiras et al, ISCA-28): Two-bit access counter and decay interval
• Adaptive decay intervals - hard
• Drowsy cache (Flautner et al, ISCA-29)• Uses dual supply voltages: normal Vdd and a low Vdd close to the
threshold voltage
• State preserving, but requires an extra cycle to wake up – two extra cycles if tags are decayed
• State preservation using leakage currents (Li et al, PACT’02; Agarwal et al, DAC’02)• Similar to gated-Vss but designed to keep supply voltage high enough
to preserve state (100-120 mV)
70
© 2
005,
Kev
in S
kadr
on
DVS• Chip or “island” granularity
• Leakage depends exponentially on Vdd
• Fine granularities• Requires routing multiple Vdd’s, voltage
steppers
• Dynamic switching• Instead of Vth or sleep transistor, can use low
voltage to put a logic block to sleep
• Reliability questions
– Low voltage reduces Qcrit
71
© 2
005,
Kev
in S
kadr
on
Gate Leakage• Not clear if new oxide materials will arrive in time• Any technique that reduces Vdd helps• Otherwise it seems difficult to develop architecture
techniques that directly attack gate leakage• In fact, very little work has been done in this area
• One example: domino gates (Hamzaoglu & Stan, ISLPED’02)• Replace traditional NMOS pull-down network with a PMOS
pull-up network• Gate leakage is greater in NMOS than PMOS• But PMOS domino gate is slower
• Note: Gate oxide so thin - especially prone to manufacturing variations
72
© 2
005,
Kev
in S
kadr
on
Application Variations
• Wide variation across applications
• Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT)
370
380
390
400
410
420
gzip mcf swim mgrid applu eon mesa
Ke
lvin
370
380
390
400
410
420
gzip mcf swim mgrid applu eon mesa
Ke
lvin
ST
SMT
73
© 2
005,
Kev
in S
kadr
on
Dynamic Thermal Management (DTM)(Brooks and Martonosi, HPCA 2001)
Time
Tem
pera
ture
DTM Disabled DTM/Response Engaged
Designed for Cooling Capacity w/out DTM
DTM TriggerLevel
Designed for Cooling Capacity w/ DTM
SystemCost Savings
Source: David Brooks 2002
74
© 2
005,
Kev
in S
kadr
on
Thermal Modeling• Want a fine-grained, dynamic model of
temperature• At a granularity architects can reason about• That accounts for adjacency and package• That does not require detailed designs• That is fast enough for practical use
HotSpot - a compact model based on thermal R, C• Parameterized to automatically derive a model
based on various…– Architectures– Power models– Floorplans– Thermal Packages
75
© 2
005,
Kev
in S
kadr
on
Our Model (Lateral)
76
© 2
005,
Kev
in S
kadr
on
Our Model (Lateral and Vertical)
Interface material(not shown)
Derived from material and geometric properties
77
© 2
005,
Kev
in S
kadr
on
Validation• Validated and calibrated using MICRED test
chips• 9x9 array of power dissipators and sensors• Compared to HotSpot configured with same grid,
package
• Within 7% for both steady-state and transient step-response
• Interface material (chip/spreader) matters a lot
78
© 2
005,
Kev
in S
kadr
on
HotSpot• Time evolution of temperature is driven by
unit activities and power dissipations averaged over 10K cycles• Power dissipations can come from any power
simulator, act as “current sources” in RC circuit
• Simulation overhead in Wattch/SimpleScalar: < 1%
• Requires models of• Floorplan: important for adjacency
• Package: important for spreading and time constants
79
© 2
005,
Kev
in S
kadr
on
Hybrid DTM• DVS is attractive because of its cubic advantage
• P V2f
• This factor dominates when DTM must be aggressive
• But changing DVS setting can be costly
– Resynchronize PLL
– Sensitive to sensor noise spurious changes
• Fetch gating is attractive because it can use instruction level parallelism to reduce impact of DTM
• Only effective when DTM is mild
• So use both!
80
© 2
005,
Kev
in S
kadr
on
Migrating Computation• When one unit overheats, migrate its
functionality to a distant, spare unit (MC)• Spare register file (Skadron et al. 2003)
• Separate core (CMP) (Heo et al. 2003)
• Microarchitectural clusters
• etc.
• Raises many interesting issues• Cost-benefit tradeoff for that area
• Use both resources (scheduling)
• Extra power for long-distance communication
• Floorplanning
81
© 2
005,
Kev
in S
kadr
on
Hybrid DTM, cont.• Combine fetch gating with DVS
• When DVS is better, use it
• Otherwise use fetch gating
• Determined by magnitude of temperature overshoot
• Crossover at FG duty cycle of 3
• FG has low overhead: helps reduce cost of sensor noise
1.0
1.1
1.2
1.3
20 5 2Duty Cycle
Slo
wd
ow
n
1.0
1.1
1.2
1.3
1.4
05101520Duty Cycle
Slo
wd
ow
n
FG
DVSHyb
82
© 2
005,
Kev
in S
kadr
on
Hybrid DTM, cont.
• DVS doesn’t need more than two settings for thermal control
• Lower voltage cools chip faster
• FG by itself does need multiple duty cycles and hence requires PI control
• But in a hybrid configuration, FG does not require PI control
• FG is only used at mild DTM settings
• Can pick one fixed duty cycle
• This is beneficial because feedback control is vulnerable to noise
83
© 2
005,
Kev
in S
kadr
on
Sensors• Almost half of DTM overhead is due to
• Guard banding due to offset errors and lack of co-located sensors
• Spurious sensor readings due to noise
• Need localized, fine-grained sensing• Need new sensor designs that are cheap
and can be used liberally – co-locate with hotspots
• But these may be imprecise
• Many sensor designs look promising• Need new data fusion techniques to reduce
imprecision, possibly combine heterogeneous sensors
84
© 2
005,
Kev
in S
kadr
on
Multi-clustered Microarchitecture