Technologies for Reducing Power
Trevor Mudge, Bredt Family Professor of Engineering
Computer Science and Engineering
The University of Michigan, Ann Arbor
SAMOS X, July 18th 2010
ACAL – University of Michigan
Technologies for Reducing Power
Near threshold operation
3D die stacking
Replacing DRAM with Flash memory
Background: Moore's Law
- Density of components doubles, without increase in cost, every 2 years (ignoring NRE costs, etc.)
- 65➞45➞32➞22➞16➞11➞8 nm (= F nanometers)
- Intel has 32nm in production
Energy per clock cycle:

    E ≈ C·Vdd² + Ileak·Vdd/f

where Vdd is the supply voltage, f is the frequency, C is the switched capacitance, and Ileak is the leakage current. Vth is the "threshold" voltage at which the gate switches, e.g. Vth ≈ 300 mV and Vdd ≈ 1 V.
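The two terms of the energy relation above can be evaluated numerically; all parameter values in this sketch are illustrative, not taken from the talk:

```python
def energy_per_cycle(c_eff, vdd, i_leak, f):
    """Energy per clock cycle: dynamic C*Vdd^2 term plus leakage
    energy over one cycle (I_leak * Vdd / f)."""
    dynamic = c_eff * vdd ** 2
    leakage = i_leak * vdd / f
    return dynamic + leakage

# Illustrative values: 1 nF effective switched capacitance,
# Vdd = 1.0 V, 10 mA total leakage, 1 GHz clock.
e = energy_per_cycle(1e-9, 1.0, 10e-3, 1e9)
print(f"{e * 1e9:.3f} nJ/cycle")
```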
The Good Old Days—Dennard Scaling
If s is the linear dimension scaling factor (s ≈ √2):

  Device dimension   tox, L, W   1/s
  Voltage            V           1/s
  Current            I           1/s
  Capacitance        εA/tox      1/s
  Delay              VC/I        1/s
  Power              VI          1/s²
  Power density      VI/A        1
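The table's scaling factors can be tabulated mechanically; this short sketch (mine, not from the talk) reproduces the classic Dennard column:

```python
import math

# Classic Dennard scaling: each quantity per process generation,
# with linear scaling factor s (~sqrt(2)).
def dennard_scale(s=math.sqrt(2)):
    return {
        "dimension": 1 / s,
        "voltage": 1 / s,
        "current": 1 / s,
        "capacitance": 1 / s,
        "delay": 1 / s,            # VC/I = (1/s)(1/s)/(1/s)
        "power": 1 / s ** 2,       # VI = (1/s)(1/s)
        "power_density": 1.0,      # VI/A = (1/s^2)/(1/s^2)
    }

print(dennard_scale())
```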
Recent Trends
The emerging dilemma: more and more gates can fit on a die, but cooling constraints are restricting their use.

Circuit supply voltages are no longer scaling. Therefore, power doesn't decrease at the same rate that transistor count is increasing – energy density is skyrocketing!

    P ≈ C·Vdd²·A·f + Ileak·Vdd·A

where A = gate area (scaling 1/s²) and C = capacitance (scaling < 1/s). The dynamic term dominates; Vdd is stagnant while A and C are shrinking.
Impact on Dennard scaling
If s is the linear dimension scaling factor (s ≈ √2):

  Device dimension   tox, L, W   1/s
  Voltage            V           1/s ➞ 1
  Current            I           1/s
  Capacitance        εA/tox      1/s
  Delay              VC/I        1/s ➞ 1
  Power              VI          1/s² ➞ 1/s
  Power density      VI/A        1 ➞ s
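The post-Dennard column follows from holding voltage constant; a sketch of that derivation (mine, not from the talk):

```python
import math

# Redo the scaling table with Vdd no longer scaling (V -> 1, not 1/s).
def post_dennard_scale(s=math.sqrt(2)):
    v = 1.0              # voltage now stagnant
    i = 1 / s            # current still scales
    c = 1 / s            # capacitance still scales
    area = 1 / s ** 2
    delay = v * c / i            # = 1: no longer improves
    power = v * i                # = 1/s instead of 1/s^2
    density = power / area       # = s: grows every generation
    return delay, power, density

print(post_dennard_scale())
```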
Techniques for Reducing Power
- Near threshold operation—Vdd near Vth
- 3D die stacking
- Replacing DRAM with Flash memory
Today: Super-Vth, High Performance, Power Constrained

[Figure: normalized power, energy, and performance—energy/operation and log(delay) vs. supply voltage, from 0 through Vth to Vnom. The super-Vth region is marked with the Core i7: 3+ GHz, 40 mW/MHz.]

Energy per operation is the key metric for efficiency. Goal: same performance, low energy per operation.
Subthreshold Design

[Figure: energy/operation and log(delay) vs. supply voltage, marking the sub-Vth and super-Vth regions: roughly 16X lower energy per operation in sub-Vth, at 500–1000X the delay.]

Operating in the sub-threshold region gives us huge power gains at the expense of performance. OK for sensors!
Near-Threshold Computing (NTC)

[Figure: energy/operation and log(delay) vs. supply voltage, marking the sub-Vth, NTC, and super-Vth regions, annotated with ~10X and ~50-100X delay increases and ~6-8X and ~2-3X energy reductions.]

Near-Threshold Computing (NTC):
• 60-80X power reduction
• 6-8X energy reduction
• Invest a portion of the extra transistors from scaling to overcome barriers
Restoring performance
- Delay increases by 10X
- Computation requires N operations
- Break into N/10 parallel subtasks—execution time restored
- Total energy is still 8X less—operation count unchanged
- Power 80X less
- Predicated on being able to parallelize workloads
- Suitable for a subset of applications—as noted earlier:
  - streams of independent tasks—a server
  - data parallel—signal/image processing
- Important to have a solution for code that is difficult to parallelize—single thread performance
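The arithmetic above can be sketched directly; the 10X delay and 8X energy factors are the slide's NTC numbers, everything else is in relative units of the baseline core:

```python
# NTC trade-off sketch: per-op delay grows 10X, per-op energy falls
# 8X at near-threshold. One NTC core therefore draws 8*10 = 80X less
# power; splitting N ops into N/10 parallel subtasks restores the
# execution time while total energy (unchanged by parallelization)
# stays 8X lower.
def ntc_tradeoff(delay_factor=10, energy_factor=8):
    core_power = (1 / energy_factor) / delay_factor  # energy/op over time/op
    total_energy = 1 / energy_factor                 # whole computation
    return core_power, total_energy

core_power, total_energy = ntc_tradeoff()
print(f"per-core power: {1 / core_power:.0f}X less")
print(f"total energy:   {1 / total_energy:.0f}X less")
```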
Interesting consequences: SRAM

[Figures: normalized energy vs. VDD (0.2-1.2 V) for logic and SRAM, with the leakage, dynamic, and total components marked; normalized delay vs. VDD (0.0-1.2 V) for logic and SRAM.]

- SRAM has a lower activity rate than logic
- VDD for minimum energy operation (VMIN) is higher
- Logic naturally operates at a lower VMIN than SRAM—and slower
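The VMIN effect can be illustrated with a toy energy-per-operation model (the constants below are illustrative, not fitted to the talk's curves): dynamic energy falls as Vdd², while leakage energy grows as delay explodes near threshold, and the balance point moves up as activity drops.

```python
# Toy model: lower activity rate (SRAM-like) pushes the
# energy-optimal supply voltage (VMIN) higher.
def energy_per_op(vdd, vth=0.3, activity=1.0):
    dynamic = activity * vdd ** 2
    delay = vdd / max(vdd - vth, 1e-6) ** 1.5   # alpha-power-law-ish
    leakage = 0.01 * vdd * delay                # leak current * Vdd * time
    return dynamic + leakage

def vmin(activity):
    vs = [0.35 + 0.001 * i for i in range(900)]  # sweep 0.35-1.25 V
    return min(vs, key=lambda v: energy_per_op(v, activity=activity))

# Logic-like (activity 1.0) vs. SRAM-like (activity 0.1):
print(vmin(activity=1.0), vmin(activity=0.1))
```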
NTC—Opportunities and Challenges
Opportunities:
- New architectures
- Optimize processes to gain back some of the 10X delay
- 3D integration—fewer thermal restrictions

Challenges:
- Low voltage memory: new SRAM designs; robustness analysis at near-threshold
- Variation: Razor and other in-situ delay monitoring techniques; adaptive body biasing
- Performance loss: many-core designs to improve parallelism; core boosting to improve single thread performance
Proposed Parallel Architecture

[Diagram: n clusters, each with k cores (fcore, Vddcore, Vthcore) sharing a cache/SRAM run at k*fcore (Vddmem, Vthmem), connected through level converters to a 2nd-level memory; contrasted with a conventional design in which each core and its cache/SRAM run in a single (f0, Vdd0, Vth0) domain.]

1. R. Dreslinski, B. Zhai, T. Mudge, D. Blaauw, and D. Sylvester. An Energy Efficient Parallel Architecture Using Near Threshold Operation. 16th Int. Conf. on Parallel Architectures and Compilation Techniques (PACT), Romania, Sep. 2007, pp. 175-188.
2. B. Zhai, R. Dreslinski, D. Blaauw, T. Mudge, and D. Sylvester. Energy Efficient Near-threshold Chip Multi-processing. Int. Symp. on Low Power Electronics and Design (ISLPED), Aug. 2007, pp. 32-37.
Cluster Results (Cholesky benchmark, 230MHz equivalent performance)
- Baseline: single CPU @ 233MHz
- NTC 4-core, one core per L1: 53% avg. savings over baseline
- Clustered NTC, multiple cores per L1 (3 cores/cluster, 2 clusters): 74% avg. savings over baseline
New NTC Architectures
Recall, SRAM is run at a higher VDD than the cores with little energy penalty, so caches operate faster than cores. We can therefore introduce clustered architectures:
- Multiple cores share an L1
- The L1 is operated fast enough to satisfy all core requests in 1 cycle
- Cores see the view of a private single-cycle L1

Advantages (leading to lower power):
- Clustered sharing
- Less coherence/snoop traffic

Drawbacks (increased power):
- Core conflicts evicting L1 data (more misses)
- Additional bus/interconnect from cores to L1 (not as tightly coupled)

[Diagram: per-core private L1s on a bus/switched network to next-level memory, vs. clusters of four cores sharing an L1.]
Digression—Chip Makers' Response
- Exchanged cores for frequency: multi-/many-cores
- Risky behavior: "if we build it, they will come"
  - predicated on the solution to a tough problem—parallelizing software
- Multi-cores have only been successes in:
  - throughput environments—servers
  - heterogeneous environments—SoCs
  - data parallel applications
- Parallel processing is application specific—that's OK
  - treat parallel machines as attached processors
  - true in SoCs for some time—control plane / data plane separation

[Sketch: performance vs. frequency, replaced by performance vs. number of cores.]
Measured thread level parallelism—TLP

G. Blake, R. Dreslinski, T. Mudge (University of Michigan), and K. Flautner (ARM). Evolution of Thread-Level Parallelism in Desktop Applications. ISCA 2010, to appear.

Caveat: desktop applications.
Single thread performance: Boosting
- Baseline: cache runs at 4X core frequency, pipelined cache. 4 cores @ 15MHz (650mV), cache @ 60MHz (700mV).
- Better single thread performance—Boosting: turn some cores off, speed up the rest. Cache frequency remains the same; cache un-pipelined. Faster response time, same throughput; the core sees a larger cache, hiding the longer DRAM latency. 1 core @ 60MHz (850mV), cache @ 60MHz (1V): 4X.
- Overclock: increase core voltage and frequency further; cache frequency must be increased. Even faster response time, increased throughput. 1 core @ 120MHz (1.3V), cache @ 120MHz (1.5V): 8X.
Single Thread Performance
Look at turning off cores and speeding up the remaining cores to gain faster response time.

[Graph of cluster performance (not measured – intuition).]

R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge. Near Threshold Computing: Overcoming Performance Degradation from Aggressive Voltage Scaling. Workshop on Energy-Efficient Design (WEED 2009), held at 36th Int. Symp. on Computer Architecture, Austin, TX, June 2009.
Boosting Clusters—scaled to 22nm
- Baseline: cache runs at 4X core frequency, pipelined cache. 4 cores @ 140MHz, cache @ 60MHz.
- Better single thread performance: turn some cores off, speed up the rest. Cache frequency remains the same; cache un-pipelined. Faster response time, same throughput; the core sees a larger cache, hiding the longer DRAM latency. 1 core @ 600MHz, cache @ 600MHz: 4X.
- Boost core voltage and frequency further: cache frequency must be increased. Even faster response time, increased throughput. 1 core @ 1.2GHz, cache @ 1.2GHz: 8X.
Technologies for Reducing Power
Near threshold operation
3D die stacking
Replacing DRAM with Flash memory
A Closer Look at Wafer-Level Stacking

[Cross-section: silicon substrate, STI (shallow trench isolation), gate poly, dielectric (SiO2/SiN), oxide, tungsten (W) contacts and vias, Al metal layers (M1-M5), Cu top metal (M6), and the "super-contact".]

Preparing a TSV—through-silicon via.

Source: Bob Patti, CTO, Tezzaron Semiconductor.
Next, stack a second wafer and thin it. FF: face-to-face bonding.

Source: Bob Patti, CTO, Tezzaron Semiconductor.
Then stack a third wafer (1st wafer: controller; 2nd wafer; 3rd wafer). FB: face-to-back bonding.

Source: Bob Patti, CTO, Tezzaron Semiconductor.
Finally, flip, thin, and add pads. This is the completed stack (1st wafer: controller; 2nd wafer; 3rd wafer).

Source: Bob Patti, CTO, Tezzaron Semiconductor.
Characteristics
- Very high bandwidth, low-latency, low-power buses possible
  - 10,000 vias / sq mm
  - Electrical characteristics: ~1fF and < 1Ω
- No I/O pads for inter-stack connections—low power. Consider a memory stack:
  - DDR3: ~40mW per pin; 1024 data pins → 40W; 4096 data pins → 160W
  - die on wafer: ~24uW per pin
- Pros / cons:
  - 3D interconnect failure < 0.1ppm
  - Heat—1 W/sq mm
  - KGD (known good die) may be a problem—foundry
- Different processes can be combined: DRAM / logic / analog / non-volatile memories
  - e.g. DRAM—split sense amps and drivers from the memory cells
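The I/O power comparison above is simple arithmetic; this sketch reproduces the slide's pin-power numbers:

```python
# Conventional DDR3 pins (~40 mW each) vs. die-on-wafer TSV
# connections (~24 uW each), per the figures on this slide.
def io_power_w(n_pins, watts_per_pin):
    return n_pins * watts_per_pin

ddr3_1024 = io_power_w(1024, 40e-3)   # ~40 W for 1024 data pins
ddr3_4096 = io_power_w(4096, 40e-3)   # ~160 W for 4096 data pins
tsv_4096 = io_power_w(4096, 24e-6)    # well under a watt via TSVs
print(ddr3_1024, ddr3_4096, tsv_4096)
```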
Centip3De—3D NTC Project

[Stack diagram: two DRAM layers, a DRAM sense/logic and bond-routing layer, then logic layers A and B joined by face-to-face (F2F) bonds.]

Centip3De design:
• 130nm, 7-layer 3D-stacked chip
• 128 ARM M3 cores
• 1.92 GOPS @ 130mW
• Taped out: Q1 2010
Stacking the Die

System configuration:
• 2-wafer = 16 clusters (64 cores)
• 4-wafer = 32 clusters (128 cores)
• DDR3 controller

[Diagram: core clusters of 16 ARM M3 cores each.]

Cluster configuration:
• 4 ARM M3 cores @ 15MHz (650mV)
• 1 kB instruction cache @ 60MHz (700mV)
• 8 kB data cache @ 60MHz (700mV)
• Cores connect via 3D to caches on the other layer

Estimated performance (Raytrace):
• 1.6 GOPS (0.8 GOPS on 2-wafer)
• 110mW (65mW on 2-wafer)
• 14.8 GOPS/W

Fair metric:
• Centip3De achieves 24 GOPS/W without DRAM
Design Scaling and Power Breakdowns—NTC Centip3De System (Raytracing benchmark)

130nm:
• 1.9 GOPS (3.8 GOPS in Boost), max 1 IPC per core
• 128 cores @ 15 MHz
• 130 mW
• 14.8 GOPS/W (5.5 in Boost)

Scaled to 22nm:
• ~600 GOPS (~1k GOPS in Boost), max 1 IPC per core
• 4,608 cores @ 140 MHz
• ~3W, ~200 GOPS/W
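The efficiency figures above are straightforward ratios; this sketch checks them (the 130nm result lands near the slide's quoted 14.8 GOPS/W, and the 22nm projection matches exactly):

```python
# GOPS/W sanity check for the Centip3De numbers on this slide.
def gops_per_watt(gops, watts):
    return gops / watts

ntc_130nm = gops_per_watt(1.9, 0.130)  # ~14.6; slide quotes 14.8 GOPS/W
scaled_22nm = gops_per_watt(600, 3.0)  # 200 GOPS/W, as quoted
print(round(ntc_130nm, 1), scaled_22nm)
```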
Technologies for Reducing Power
Near threshold operation
3D die stacking
Replacing DRAM with Flash memory
Background – NAND Flash overview
Dual mode SLC/MLC NAND Flash memory, bank organization:
- Single Level Cell (SLC): 1 bit/cell; 10^5 erases per block; 25 μs read, 200 μs write
- Multi Level Cell (MLC): 2 bits/cell; 10^4 erases per block; 50 μs read, 680 μs write
- Addressable read/write unit is the page; pages consist of 2048 bytes + 64 'spare' bytes (2112 / 4224 bytes)
- Erases 64 SLC or 128 MLC pages at a time (a block)
- Technology: less than 60nm
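The page and block geometry above can be worked out directly; a sketch using the slide's figures:

```python
# NAND geometry from the slide: 2048-byte pages plus 64 spare bytes,
# erased in blocks of 64 (SLC) or 128 (MLC) pages.
PAGE_DATA, PAGE_SPARE = 2048, 64

def block_bytes(pages_per_block):
    return pages_per_block * (PAGE_DATA + PAGE_SPARE)

slc_block = block_bytes(64)    # bytes erased per SLC block
mlc_block = block_bytes(128)   # bytes erased per MLC block
print(slc_block, mlc_block)
```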
Reducing Memory Power
- Flash is denser than DRAM
- Flash is cheaper than DRAM
- Flash is good for idle power optimization: ~1000× less power than DRAM
- Flash is not so good for a low-access-latency usage model: DRAM is still required for acceptable access latencies
- Flash "wears out" – 10,000/100,000 write/erase cycles

          Area/bit (μm²)  $/Gb  Active power  Idle power  Read latency  Write latency  Erase latency
  DRAM    0.015           3     495mW         15mW        55ns          55ns           N/A
  NAND    0.005           0.25  50mW          6μW         25μs          200μs          1.5ms
  PCM     0.068           ?     —             6μW         55ns          150ns          N/A

NAND Flash cost assumes 2-bit-per-cell MLC. DRAM is a 2 Gbit DDR3-1333 x8 chip. Flash power numbers are for a 2 Gbit SLC x8 chip. Area from ITRS Roadmap 2009.
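The idle-power gap behind the "~1000×" claim follows from the table's own numbers (which actually give ~2500×, consistent with the slide's order-of-magnitude statement):

```python
# Idle power: DRAM at 15 mW vs. NAND flash at 6 uW (table values).
dram_idle_w = 15e-3
nand_idle_w = 6e-6
ratio = dram_idle_w / nand_idle_w
print(f"NAND idles at ~{ratio:.0f}x less power than DRAM")
```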
A Case for Flash as Secondary Disk Cache
- Many server workloads use a large working set (100s of MBs to 10s of GB and even more)
  - The large working set is cached in main memory to maintain high throughput
  - A large portion of DRAM goes to the disk cache
- Many server applications are more read intensive than write intensive
- Flash memory consumes orders of magnitude less idle power than DRAM
- Use DRAM for recent and frequently accessed content, and Flash for less recent and infrequently accessed content
  - Client requests follow a spatially and temporally Zipf-like distribution, e.g. 90% of client requests are to 20% of files
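A Zipf-like popularity skew of that kind is easy to sketch; the parameters below (exponent, file count) are illustrative, not from the talk:

```python
# With file popularity ~ 1/rank^s, a small fraction of files absorbs
# most requests, which is what makes a small hot DRAM tier effective.
def fraction_of_requests(top_fraction, n_files=10000, s=1.0):
    weights = [1 / (r ** s) for r in range(1, n_files + 1)]
    top = sum(weights[: int(top_fraction * n_files)])
    return top / sum(weights)

# Share of requests hitting the most popular 20% of files:
print(f"{fraction_of_requests(0.20):.0%}")
```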
A Case for Flash as Secondary Disk Cache (SPECweb99)

[Figure: network bandwidth in Mbps (throughput) for MP4/MP8/MP12 configurations vs. disk cache access latency to 80% of files, swept from 12μs to 1600μs.]

An access latency of 100s of microseconds can be tolerated.

T. Kgil and T. Mudge. FlashCache: A NAND Flash memory file cache for low power web servers. Proc. Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'06), Seoul, S. Korea, Oct. 2006, pp. 103-112.
Overall Architecture

[Diagram. Baseline without FlashCache: processors, 1GB DRAM main memory, HDD controller, hard disk drive. FlashCache architecture: processors, 128MB DRAM serving as generic main memory + primary disk cache, 1GB Flash as secondary disk cache with its own Flash controller and DMA, tables used to manage the Flash memory (FCHT, FBST, FPST, FGST), then the HDD controller and hard disk drive.]
Overall Network Performance – Mbps (SPECweb99)

[Bar chart: network bandwidth (Mbps) for MP4/MP8/MP12 across configurations: DRAM 32MB + Flash 1GB, DRAM 64MB + Flash 1GB, DRAM 128MB + Flash 1GB, DRAM 256MB + Flash 1GB, DRAM 512MB + Flash 1GB, and DRAM 1GB.]

128MB DRAM + 1GB NAND Flash performs as well as 1GB DRAM while requiring only about 1/3 the die area (SLC Flash assumed).
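The area claim follows from the earlier table's area-per-bit figures: NAND/DRAM is exactly 1/3 per bit, and this back-of-envelope sketch (mine) shows the hybrid including its 128MB of DRAM comes to a bit under half the 1GB-DRAM area:

```python
# Die-area arithmetic using area/bit from the earlier table:
# DRAM 0.015 um^2/bit, SLC NAND 0.005 um^2/bit.
GBIT = 8 * 2 ** 30  # bits per gigabyte

def area_mm2(size_gb, um2_per_bit):
    return size_gb * GBIT * um2_per_bit / 1e6   # um^2 -> mm^2

dram_1gb = area_mm2(1, 0.015)
hybrid = area_mm2(0.125, 0.015) + area_mm2(1, 0.005)  # 128MB DRAM + 1GB NAND
print(round(hybrid / dram_1gb, 2))
```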
Overall Main Memory Power (SPECweb99)

[Bar chart: overall power (W), broken into read, write, and idle power: DDR2 1GB active = 2.5W; DDR2 1GB powerdown = 1.6W; DDR2 128MB + Flash 1GB = 0.6W.]

Flash memory consumes much less idle power than DRAM.
Concluding Remarks on Flash-for-DRAM
- DRAM clockless refresh reduces idle power
- Flash density continues to grow
  - Intel-Micron JV announced 25nm flash: 8GB die, 167 sq mm, 2 bits per cell
  - 3 bits/cell is coming soon
- PCRAM appears to be an interesting future alternative
- I predict single-level storage using some form of NV memory, with disks replacing tape for archival storage
Cluster Size and Boosting
Analysis:
• 2-die stack
• Fixed die size
• Fixed amount of cache per core
• Raytrace algorithm
System Architecture

[Diagram: two layers (A and B) joined by B-B interfaces. Each layer carries layer hubs, each serving four 4-core clusters, together with system control, clock, JTAG, system communication, memory-forward (Mem Fwd) links, and the DRAM channels.]
Cluster Architecture

[Diagram: a cluster clock generator derives, from a 60 MHz system clock, a 60 MHz cache clock and 15 MHz core clocks at 0/90/180/270-degree phase offsets. Four M3 cores on layer 1 connect via 3D integration and 32-bit AMBA-like buses to the cluster I-cache (1024b, 4-way), D-cache (8192b, 4-way), and MMIO/reset control on layer 2, with 128-bit AMBA-like buses to DRAM, plus system communication and JTAG in/out.]
With power and cooling becoming an increasingly costly part of the operating cost of a server, the old trend of striving for higher performance with little regard for power is over. Emerging semiconductor process technologies, multicore architectures, and new interconnect technology provide an avenue for future servers to become low power, compact, and possibly mobile. In our talk we examine three techniques for achieving low power: 1) near threshold operation; 2) 3D die stacking; and 3) replacing DRAM with Flash memory.
Solutions?
- Reduce Vdd: "Near Threshold Computing"
  - Drawbacks: slower, less reliable operation
  - But: parallelism suits some computations (more as time goes by); robustness techniques, e.g. Razor—in situ monitoring
- Cool chips (again): interesting developments in microfluidics
- Devices that operate at lower Vdd without performance loss
Low-Voltage Robustness
- VDD scaling reduces SRAM robustness
- Maintain robustness through device sizing and VTH selection
- Robustness measured using importance sampling
- In the NTC range the 6T cell is smaller

[Figure: logic-rule bitcell area (μm²) vs. VDD (300-1000 mV) for 6T and 8T cells, each with and without VTH selection.]
SRAM Designs
- HD-SRAM: differential write, single-ended read, asymmetric sizing. HD μ/σ = 12.1 / 1.16 vs. 6T μ/σ = 11.0 / 1.35.
- Crosshairs: skew VDD in the column and GND in the row to target failing cells; no bitcell changes. Skew hurts some cells.

[Figures: normalized row & column error rate vs. VDD and GND skew (0-100 mV) at 1.1V and 0.7V; histogram of chips by read + write margin (in VTH) for half-differential vs. differential designs; bitcell schematic with BLR/BLRW bitlines, WLR/WLRW wordlines, and split VDD1/VDD2, GND1/GND2 supplies.]
Evolution of Subthreshold Designs
(Unpublished results – ISSCC 2010 – do not disclose)

- Subliminal 1 design (2006): 0.13 μm CMOS; used to investigate the existence of Vmin; 2.60 μW/MHz
- Subliminal 2 design (2007): 0.13 μm CMOS; used to investigate process variation; 3.5 μW/MHz
- Phoenix 1 design (2008): 0.18 μm CMOS; used to investigate sleep current; 2.8 μW/MHz
- Phoenix 2 design (2010): 0.18 μm CMOS; commercial ARM M3 core; used to investigate energy harvesting and power management; 37.4 μW/MHz

[Die photos: processor and memory floorplans (IMEM, DMEM, core) with dimensions.]