High Performance Processors in a Power Limited World
Sam Naffziger, AMD Senior Fellow
Outline
• Today's processor design landscape: Trends
• Issues making designers' lives difficult: Power limits, Scaling effects
• Design opportunities: Circuit level, Architectural
• Summary
The All-Consuming Quest for Greater Performance at Lower Cost
Increasing Transistor Density
Moore’s Law has served us well.
Increasing Performance
Processor Frequency vs. Time
The amazing frequency increases of the past decade have leveled off – Why?
[Chart: MPU performance (MHz) vs. time, Jan-97 to Jan-09, log scale from 100 to 10000; frequency levels off near 4 GHz, bounded by power limits and process issues]
Power Consumption Background
Source: Holt, HotChips 17
Power has always challenged circuit integration
Bipolar → NMOS → CMOS
We’ve been bailed out by technology in the past
Scaling Background
2005 ITRS Projections of Vt and Vdd
[Chart: Vt and Vdd (mV, 0 to 2500) projected over 2000 to 2010]
Source: Horowitz et al, IEDM 2005
[Chart: Dennard scaling: feature size (µm), Vdd, and Power/10 vs. time, Jan-85 to Jan-03, log scale]
Scaling doesn’t bail us out any more
Realistic power limit
The Processor Designer: Process Scaling Issues
Power Consumption Background
P ≈ C_TOT·α·F·Vdd² + N_TOT·α·F·Vdd·I_CO + N_ON·I_LEAK·Vdd
    (switching power)    (crossover power)     (leakage power)
• Reducing Vdd
• Reducing C_TOT
• Reducing I_LEAK, I_CO
• Reducing α
The process guys have had the biggest impact on these.
But now, not only are those improvements fading, we also face a host of new challenges:
• Variation
• Voltage droop
• Wire non-scaling
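The power equation above can be explored numerically. A minimal sketch, where every parameter value is hypothetical and chosen only to illustrate the relative size of the three components:

```python
# Illustrative sketch of the slide's power equation (all parameter values
# are invented for illustration, not measured from any real processor):
#   P ≈ C_tot·α·F·Vdd² + N_tot·α·F·Vdd·I_co + N_on·I_leak·Vdd

def processor_power(c_tot, n_tot, n_on, alpha, f, vdd, i_co, i_leak):
    switching = c_tot * alpha * f * vdd**2        # dynamic switching power
    crossover = n_tot * alpha * f * vdd * i_co    # short-circuit (crossover) power
    leakage   = n_on * i_leak * vdd               # static leakage power
    return switching, crossover, leakage

# Hypothetical numbers: 100 nF total switched cap, 2 GHz, 1.2 V,
# 15% activity factor, 100M leaking devices at 50 nA each
sw, co, lk = processor_power(
    c_tot=100e-9, n_tot=50e6, n_on=100e6,
    alpha=0.15, f=2e9, vdd=1.2,
    i_co=1e-16, i_leak=50e-9)

print(f"switching {sw:.1f} W, crossover {co:.1f} W, leakage {lk:.1f} W")
```

Note the Vdd² factor on the switching term: this is why the later slides treat voltage reduction as the highest-leverage knob.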
Scaling Issues
The Silicon Age Still on a Roll, But …
ITRS Roadmap:
Technology Node (nm)        90 → 65 → 45 → 32 → 22 → 16 → 8
High Volume Manufacturing   2004 → 2006 → 2008 → 2010 → 2012 → 2014 → 2018
Delay = CV/I scaling        0.7 → ~0.7 → >0.7 (delay scaling will slow down)
Energy/Logic Op scaling     >0.35 → >0.5 → >0.5 (energy scaling will slow down)
RC Delay                    1 (no improvement)
Variability                 Medium → High → Very High
Bulk Planar CMOS            High probability → Low probability
Alternate, 3G etc.          Low probability → High probability
But … all this scaling has some nasty side effects
[Roadmap: 65nm (2007) → 45nm (2010) → 32nm (2013) → 22nm (2016); bulk → PDSOI → FDSOI; stressors + substrate engineering → high-µ materials; SiON/poly → high-k/metal gate stacks; planar → 3D, with MuGFET/MuCFET for electrostatic control]
Source: European Nanoelectronics Initiative Advisory Council (ENIAC)
Device Variation Reverse Scales
Source: Pelgrom, IEEE lecture 5/11/06
Variations subtract directly from cycle time; power efficiency drops
Circuit margins degrade
The problem: atoms don't scale
One impact of variation is leakage spread
[Chart: standby leakage Sidd (A) vs. Fmax (a.u.): a 1.25× spread in Fmax corresponds to a >3× spread in Sidd]
Note: chip Sidd is set by the "smallest" gates; Fmax is set by the slowest gates.
Scaling Intrinsically Hurts Supply Integrity
[Chart: leakage %, normalized Vdd, and Vdd droop % (0 to 120%) across technology nodes 250nm, 180nm, 130nm, 90nm]
Source: Bose, Hotchips 17
With power per core staying constant but area, voltage, and cycle times dropping, we have a big challenge. Requiring a higher voltage to hit frequency has a quadratic power impact.
Circuit Designer
Some Ways to Shoulder the Variation Burden: Adaptive clocking
Empirically set the clock edge to optimize frequency
Higher granularity → more variation tolerance
LBIST and GA search algorithms show promise for per-part optimization
[Diagram: programmable delay buffers]
Some Ways to Shoulder the Variation Burden: Self Healing Designs
Simplest example is cache ECC on memory arrays
Next level is Intel’s Pellston technology implemented on Montecito and Tulsa
Disable defective lines detected by multiple ECC errors
Future directions involve self-checking with redundant logic and retry
• Predict the result through parity, residues, or redundant logic
• On an error, replay the calculation before committing architectural state
• If the replay is correct, it was a transient error (particle strike, Vdd droop, random noise coupling, etc.)
• If incorrect, reduce frequency, increase voltage, or retry with an alternate execution path
From Fall Microprocessor Forum 2006
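The residue predict-and-replay idea above can be sketched as a toy model (this is not any real processor's implementation; the glitchy adder simulating a transient error is invented for illustration):

```python
# Toy sketch of residue checking with replay. A mod-3 residue of an
# addition is predicted cheaply from the operands; a mismatch with the
# datapath result flags a (possibly transient) error and triggers replay
# before the result is committed. Note a mod-3 check can miss errors
# that happen to preserve the residue.

def add_with_residue_check(a, b, adder=lambda x, y: x + y, max_retries=1):
    for attempt in range(max_retries + 1):
        result = adder(a, b)               # the "datapath" result
        predicted = (a % 3 + b % 3) % 3    # redundant residue prediction
        if result % 3 == predicted:
            return result                  # check passed: commit
        # mismatch: replay the calculation
    raise RuntimeError("persistent error: reduce frequency or raise voltage")

# A faulty adder that flips a bit on its first invocation only,
# mimicking a transient error such as a particle strike
class GlitchyAdder:
    def __init__(self): self.calls = 0
    def __call__(self, x, y):
        self.calls += 1
        s = x + y
        return s ^ 1 if self.calls == 1 else s

print(add_with_residue_check(5, 7, adder=GlitchyAdder()))  # prints 12
```

The first (corrupted) result 13 fails the residue check; the replay returns the correct 12, matching the slide's "if replay correct, it was a transient error" path.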
Adaptive Supply Voltage
[Contour plot: energy/operation vs. channel length (short, nominal, long) and Vdd (low, high)]
Per-part and dynamic voltage management are key
More range flexibility and finer-grained response will provide differentiation
Integrated Power and Thermal Management
“Fuse and forget” is no longer viable
Too much variation in environment, manufacturing, and operating conditions
Some means of dynamic optimization is needed
[Chart: Sidd (A) vs. Fmax (a.u.)]
Integrated Power and Thermal Management
An autonomous programmable controller enables real time optimizations
An embedded controller provides the needed flexibility
• OS interfacing
• Multi-core management
• Per-part optimization
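What such an embedded controller's policy loop might look like can be sketched as follows. This is hypothetical firmware logic: the thresholds and the step-up/step-down policy are invented for illustration, and only the frequency ladder is taken from the P-state table later in the deck; real controllers also integrate OS hints, per-part fused limits, and multi-core state.

```python
# Hypothetical sketch of an embedded power/thermal controller's policy
# loop. Thermal limit overrides performance demand; otherwise frequency
# steps follow utilization.

FREQ_MHZ = [2600, 2400, 2200, 2000, 1800, 1000]   # ladder, highest first

def next_p_state(current_idx, temp_c, utilization,
                 temp_limit=95.0, high_util=0.80, low_util=0.30):
    if temp_c > temp_limit:                 # thermal event: step down
        return min(current_idx + 1, len(FREQ_MHZ) - 1)
    if utilization > high_util:             # demand high: step up
        return max(current_idx - 1, 0)
    if utilization < low_util:              # demand low: save power
        return min(current_idx + 1, len(FREQ_MHZ) - 1)
    return current_idx                      # stay put

# Simulated samples: busy and cool, busy and hot, then nearly idle
idx = 0
for temp, util in [(70, 0.9), (97, 0.9), (80, 0.1)]:
    idx = next_p_state(idx, temp, util)
print(FREQ_MHZ[idx])  # prints 2200
```

The hot sample forces a step down even though utilization is high, which is the "integrated power and thermal management" point of the slide.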
Chip Architect
Traversing the Power Contour
[Contour plots: power consumption and frequency vs. channel length (short, nominal, long) and Vdd (low, high)]
P ≈ C_TOT·α·F·Vdd² + N_TOT·α·F·Vdd·I_CO + N_ON·I_LEAK·Vdd
    (switching power)    (crossover power)     (leakage power)
Traversing the Power Contour for a Given Implementation
[Contour plot: energy/operation vs. channel length and Vdd; max performance at short channel and high Vdd, best power efficiency at long channel and low Vdd]
[Chart: Performance³/Watt (0 to 1.2) vs. channel length (nominal, short, long) and Vdd (low, nominal, high)]
For comparing architectural efficiency, Performance³/W is most effective
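One reason Performance³/W works as an architecture metric: under a simplified model where frequency scales linearly with voltage and dynamic power scales as V³, the metric is independent of the operating voltage, so it compares designs rather than operating points. A quick numeric check of that simplified model (the model and values are illustrative, not from the slides):

```python
# Why Performance^3/W compares architectures rather than operating points:
# with f ∝ V and P_dyn = C·f·V² ∝ C·V³, perf³/W reduces to IPC³/C,
# independent of the chosen voltage. (Simplified first-order model only.)

def perf3_per_watt(ipc, c_eff, v):
    f = v                     # frequency ∝ voltage (simplified)
    perf = ipc * f
    power = c_eff * f * v**2  # dynamic power ∝ V³
    return perf**3 / power

# Same architecture at two operating voltages gives the same metric
m_low = perf3_per_watt(ipc=1.5, c_eff=1.0, v=0.9)
m_high = perf3_per_watt(ipc=1.5, c_eff=1.0, v=1.2)
print(abs(m_low - m_high) < 1e-9)  # prints True
```

A lower exponent (e.g. plain Performance/W) would reward simply running slower, which is why the deck uses the cube.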
• The industry has been moving from "hyper-pipelining" with short pipe stages to something more moderate
[Chart: pipe stages (0 to 35) for Willamette (2000), Prescott (2004), Core 2 (2006), Opteron, and Power6, relative to the optimal pipeline depth; the moderate designs are ~2× more power efficient]
A Look at Mobile Processor Power
A Look at Mobile System Power
Mobile System Power
TDP vs. Average Power
• Rest of system
• Chipset
• Memory controller
• CPU
• Memory
If a laptop burned TDP power all the time, battery life would be measured in minutes
How do we get mobile average power so much lower than TDP?
MobileMark 2002, Tj = 95, 1800 MHz, 1.35 V
[Chart: core power and IO power (W, 0 to 35) vs. time (0 to 70 min) with C1, PowerNow!, and C3 enabled, across Adobe Flash, MS Outlook, Word, PowerPoint, Excel, and Netscape workloads]
The Answer: Take Advantage of Typically Low CPU Utilization
Processor Utilization During Normal Laptop Usage as Approximated by MobileMark
Reducing Power and Cooling Requirements with Processor Performance States
P-State   Frequency   Voltage   Power
P0        2600 MHz    1.40 V    ~95 W
P1        2400 MHz    1.35 V    ~90 W
P2        2200 MHz    1.30 V    ~76 W
P3        2000 MHz    1.25 V    ~65 W
P4        1800 MHz    1.20 V    ~55 W
P5        1000 MHz    1.10 V    ~32 W
(P-state selected HIGH → LOW based on processor utilization)
Up to 75% power savings (at idle)!
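The P-state table can be checked numerically. This sketch treats the slide's "~" wattages as exact just to compute the savings each state offers relative to P0 (the table alone gives about 66% at P5; the quoted "up to 75% at idle" reflects additional idle savings beyond a pure P-state switch):

```python
# P-state table from the slide: (frequency MHz, voltage V, ~power W)
P_STATE_TABLE = {
    "P0": (2600, 1.40, 95), "P1": (2400, 1.35, 90), "P2": (2200, 1.30, 76),
    "P3": (2000, 1.25, 65), "P4": (1800, 1.20, 55), "P5": (1000, 1.10, 32),
}

def savings_vs_p0(state):
    """Fractional power saved by running in `state` instead of P0."""
    p0_watts = P_STATE_TABLE["P0"][2]
    return 1 - P_STATE_TABLE[state][2] / p0_watts

for s, (mhz, v, w) in P_STATE_TABLE.items():
    print(f"{s}: {mhz} MHz @ {v:.2f} V, ~{w} W, {savings_vs_p0(s):.0%} below P0")
```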
Average CPU Core Power (measured at CPU)
[Chart: average core power (W, 0 to 25) with AMD PowerNow!™ disabled vs. enabled: 10,500 connections (~62% CPU utilization) -33%; 5,000 connections (~40% CPU utilization) -62%; idle (in OS) -75%]
Additionally, "C-states" reduce power further by cutting clocks completely and dropping voltage to retention levels
Improving Peak Performance per Watt
[Scatter plot: Watts/(Spec·Vdd²·L) vs. Spec2000·L, log-log scales]
Source: Horowitz et al, IEDM 2005
Adding Features to Increase Performance
• Increasing execution efficiency has historically hurt power efficiency
• However, the cubic reduction of power with V/F scaling has tended to make this a good tradeoff
[Chart: energy/op, frequency, and performance vs. IPC (0.5 to 2.0)]
Adding Features to Increase Performance Works with V/F Scaling
[Chart: energy/op, frequency, and performance vs. IPC (0.5 to 2.0)]
If we hit VMIN, however, the game is over:
• Voltage scaling has its limits
• More power-efficient designs have an advantage
• High-power designs get penalized due to higher di/dt, higher temperatures, etc.
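The tradeoff the last two slides describe can be made concrete. Under the same simplified model as before (f ∝ V, so dynamic power ∝ f³ at fixed capacitance), holding total power constant means frequency only drops as the cube root of the capacitance growth, so an IPC feature wins whenever its IPC gain exceeds the cube root of its energy/op cost. The specific gain and cost numbers below are invented for illustration:

```python
# Sketch of the slide's argument: a feature raises IPC but also switched
# capacitance (energy/op). Holding power constant via V/F scaling
# (P ∝ C·f³ when f ∝ V), frequency drops only as cap_growth^(1/3),
# so net performance = ipc_gain · cap_growth^(-1/3).
# Once at VMIN this trade is no longer available and the feature must
# win on IPC alone.

def net_speedup(ipc_gain, cap_growth):
    freq_scale = cap_growth ** (-1 / 3)   # constant power: f ∝ (1/C)^(1/3)
    return ipc_gain * freq_scale

print(net_speedup(1.20, 1.30) > 1.0)  # 20% IPC for 30% more cap: a win
print(net_speedup(1.05, 1.50) > 1.0)  # 5% IPC for 50% more cap: a loss
```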
How Hard is Improving Existing Processors?
[Scatter plot: Watts/(Spec·Vdd²·L) vs. Spec2000·L, log-log scales]
Source: Horowitz et al, IEDM 2005
Peak performance costs more energy/operation
Most of the big-hitter improvements have been heavily mined already. Next-generation AMD cores have well over 50% of clocks gated off even for high-power code.
Current and Next Generation Core Comparison
[Bar chart: switched capacitance, split into clock and logic, for Gen1 peak vs. Gen2 peak]
Despite more features, the next-generation core has substantially lower switched capacitance
Multi-Core to the Rescue?
Sounds like a great story, what’s the catch?
[Diagram: baseline cores + cache (Voltage = 1, Frequency = 1, Area = 1, Power = 1, Perf = 1, Perf/Watt = 1) vs. double the cores + cache (Voltage = 0.85, Frequency = 0.85, Area = 2, Power = 1, Perf ≈ 1.7, Perf/Watt ≈ 1.7)]
Multi-Core to the Rescue?
Some of the catches:
What if you're already at VMIN? Then you need to cut frequency in half to stay within the power limit
How much parallelizable code is really out there?
More compute capacity means more IO and memory bandwidth demands …
Multi-Core Issues: Amdahl’s Law
There is almost always a portion of an application that cannot be parallelized
This portion becomes a bottleneck as the number of threads is increased
A typical value is in the range of 10%
[Chart: multi-core speedup vs. cores (1 to 8) with serial code and constant power considered, for serial fractions of 0, 5%, 10%, 15%, 20%, plus the ideal case]
Just 10% serial code drops the 8-core performance improvement by 41%
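The 41% figure follows directly from Amdahl's law (the chart also folds in constant-power scaling, but the quoted number matches the plain formula):

```python
# Amdahl's law check of the slide's claim: 10% serial code at 8 cores.

def amdahl_speedup(serial_fraction, cores):
    """Speedup when only the parallel fraction scales with core count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

s = amdahl_speedup(0.10, 8)
print(round(s, 2))          # prints 4.71, instead of the ideal 8
print(round(1 - s / 8, 2))  # prints 0.41, the quoted 41% drop
```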
Multi-Core Issues: IO Power
All those extra cores need their own data …
IO power has been pretty constant for years, in the range of 20 mW per Gb/s
• If we increase IO power accordingly but hold total chip power constant with V/F scaling, things get worse
• Overall performance drops by another 10% or so …
[Chart: multi-core speedup vs. cores (1 to 8) with serial code, constant power, and IO power considered, for serial fractions of 0, 5%, 10%, 15%, 20%, plus the ideal case]
The Transition to Parallel Applications
Parallel Applications
Small number of applications (worked by experts for 10+ yrs)
Awkward development, analysis and debug environments
Parallel programming is hard!
Amdahl’s law is still a law…
SW productivity is already in a crisis; this worsens things!
Single-threaded Applications
Most of today’s applications
Well understood optimization techniques
Advanced development, analysis and debug tools
Conceptually, easy to think about
Establishing an appropriate balance is key for managing this important transition
Other Architectural Directions: Integration
Not only does integrating more system components (e.g. memory controllers, IO) improve performance …
Bose, HotChips 17
Typical Server Power Breakdown
… integration reduces power significantly as well:
• IO communication overhead drops
• CPU-integrated power management can dynamically optimize
• Power efficiency of special-function components (e.g. graphics accelerators, network processors) greatly exceeds that of general-purpose CPUs
[Diagrams: legacy system with I/O hubs, USB, PCI, PCIe™ bridges, and external XMB memory buffers vs. integrated system with per-socket SRQ, crossbar, HyperTransport links (8 GB/s), and on-die memory controllers]
System-level Power Consumption
Dual-core packages with legacy technology:
• 692 watts for processors (173 W each)
• 48 watts for external memory controller
→ 95% more power than:
Dual-core AMD Opteron™ processors:
• 380 watts for processors (95 W each)
• Integrated memory controllers
Source: Mixture of publicly available data sheets and AMD internal estimates. Actual system power measurements may vary based on configuration and components used
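The arithmetic behind the comparison checks out, using only the wattages the slide gives (four dual-core packages at 173 W plus a 48 W external memory controller, vs. four Opterons at 95 W with integrated controllers):

```python
# System-level power comparison from the slide.
legacy = 4 * 173 + 48   # processors + external memory controller
opteron = 4 * 95        # memory controller is integrated on-die

print(legacy, opteron)                           # prints 740 380
print(f"{legacy / opteron - 1:.0%} more power")  # prints 95% more power
```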
[Annotated diagram: the legacy system totals 740 watts (692 watts for processors; a 14-watt memory controller hub plus four XMBs at 8.5 watts each make up the 48 watts of external memory control); the Opteron system totals 380 watts]
Other Architectural Directions: Integration
Barriers?
• Integration of heterogeneous designs is non-trivial
• IP barriers
• Schedule issues with multiple converging components
[Diagram: heterogeneous SoC with a big CPU, several small CPUs, special accelerators, memory, IO, and RF]
Integrating dual designs for the processor core enables both peak performance and throughput/Watt
[Scatter plot: Watts/(Spec·Vdd²·L) vs. Spec2000·L, log-log scales]
Summary (1 of 2)
Silicon process technology is unlikely to be the major engine of processor performance increases in the future.
Major circuit-related challenges that we've only just started to address lie ahead:
• Design for variation tolerance and mitigation
• Maintaining dynamic voltage headroom within reliability- and variation-imposed limits
• Adaptive, self-healing techniques are a key direction
[Diagram: designers squeezed between variation, leakage, Vdd droop, and hot spots]
Summary (2 of 2)
• CPU architectures are converging on modest pipe length, limited-issue, out-of-order designs
• Multi-core is good, but has limits in the not-too-distant future
• Heterogeneous integration is a key direction
Silicon process technology is unlikely to be the major engine of processor performance increases in the future
We’re up to the challenge, but it will be a joint effort …
Moore's Law + Design Community