Technologies for Reducing Power
Trevor Mudge, Bredt Family Professor of Engineering
Computer Science and Engineering
The University of Michigan, Ann Arbor
SAMOS X, July 18th 2010
ACAL – University of Michigan
Technologies for Reducing Power
Near threshold operation
3D die stacking
Replacing DRAM with Flash memory
Background: Moore's Law
- Density of components doubles, without increase in cost, every 2 years (ignoring NRE costs, etc.)
- 65➞45➞32➞22➞16➞11➞8 nm (= F nanometers)
- Intel has 32nm in production
Energy per clock cycle:

    E ≈ C·Vdd² + Ileak·Vdd/f

where Vdd is the supply voltage, f is the frequency, C is the switched capacitance, and Ileak is the leakage current. Vth is the "threshold" voltage at which the gate switches, e.g. Vth ≈ 300 mV and Vdd ≈ 1 V.
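The two terms of the energy relation above can be evaluated numerically; all parameter values in this sketch are illustrative, not taken from the talk:

```python
def energy_per_cycle(c_eff, vdd, i_leak, f):
    """Energy per clock cycle: dynamic C*Vdd^2 term plus leakage
    energy over one cycle (I_leak * Vdd / f)."""
    dynamic = c_eff * vdd ** 2
    leakage = i_leak * vdd / f
    return dynamic + leakage

# Illustrative values: 1 nF effective switched capacitance,
# Vdd = 1.0 V, 10 mA total leakage, 1 GHz clock.
e = energy_per_cycle(1e-9, 1.0, 10e-3, 1e9)
print(f"{e * 1e9:.3f} nJ/cycle")
```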
The Good Old Days—Dennard Scaling
If s is the linear dimension scaling factor (s ≈ √2):

  Device dimension   tox, L, W   1/s
  Voltage            V           1/s
  Current            I           1/s
  Capacitance        εA/tox      1/s
  Delay              VC/I        1/s
  Power              VI          1/s²
  Power density      VI/A        1
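The table's scaling factors can be tabulated mechanically; this short sketch (mine, not from the talk) reproduces the classic Dennard column:

```python
import math

# Classic Dennard scaling: each quantity per process generation,
# with linear scaling factor s (~sqrt(2)).
def dennard_scale(s=math.sqrt(2)):
    return {
        "dimension": 1 / s,
        "voltage": 1 / s,
        "current": 1 / s,
        "capacitance": 1 / s,
        "delay": 1 / s,            # VC/I = (1/s)(1/s)/(1/s)
        "power": 1 / s ** 2,       # VI = (1/s)(1/s)
        "power_density": 1.0,      # VI/A = (1/s^2)/(1/s^2)
    }

print(dennard_scale())
```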
Recent Trends
The emerging dilemma: more and more gates can fit on a die, but cooling constraints are restricting their use.

Circuit supply voltages are no longer scaling. Therefore, power doesn't decrease at the same rate that transistor count is increasing – energy density is skyrocketing!

    P ≈ C·Vdd²·A·f + Ileak·Vdd·A

where A = gate area (scaling 1/s²) and C = capacitance (scaling < 1/s). The dynamic term dominates; Vdd is stagnant while A and C are shrinking.
Impact on Dennard scaling
If s is the linear dimension scaling factor (s ≈ √2):

  Device dimension   tox, L, W   1/s
  Voltage            V           1/s ➞ 1
  Current            I           1/s
  Capacitance        εA/tox      1/s
  Delay              VC/I        1/s ➞ 1
  Power              VI          1/s² ➞ 1/s
  Power density      VI/A        1 ➞ s
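The post-Dennard column follows from holding voltage constant; a sketch of that derivation (mine, not from the talk):

```python
import math

# Redo the scaling table with Vdd no longer scaling (V -> 1, not 1/s).
def post_dennard_scale(s=math.sqrt(2)):
    v = 1.0              # voltage now stagnant
    i = 1 / s            # current still scales
    c = 1 / s            # capacitance still scales
    area = 1 / s ** 2
    delay = v * c / i            # = 1: no longer improves
    power = v * i                # = 1/s instead of 1/s^2
    density = power / area       # = s: grows every generation
    return delay, power, density

print(post_dennard_scale())
```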
Techniques for Reducing Power
- Near threshold operation—Vdd near Vth
- 3D die stacking
- Replacing DRAM with Flash memory
Today: Super-Vth, High Performance, Power Constrained

[Figure: normalized power, energy, and performance—energy/operation and log(delay) vs. supply voltage, from 0 through Vth to Vnom. The super-Vth region is marked with the Core i7: 3+ GHz, 40 mW/MHz.]

Energy per operation is the key metric for efficiency. Goal: same performance, low energy per operation.
Subthreshold Design

[Figure: energy/operation and log(delay) vs. supply voltage, marking the sub-Vth and super-Vth regions: roughly 16X lower energy per operation in sub-Vth, at 500–1000X the delay.]

Operating in the sub-threshold region gives us huge power gains at the expense of performance. OK for sensors!
Near-Threshold Computing (NTC)

[Figure: energy/operation and log(delay) vs. supply voltage, marking the sub-Vth, NTC, and super-Vth regions, annotated with ~10X and ~50-100X delay increases and ~6-8X and ~2-3X energy reductions.]

Near-Threshold Computing (NTC):
• 60-80X power reduction
• 6-8X energy reduction
• Invest a portion of the extra transistors from scaling to overcome barriers
Restoring performance
- Delay increases by 10X
- Computation requires N operations
- Break into N/10 parallel subtasks—execution time restored
- Total energy is still 8X less—operation count unchanged
- Power 80X less
- Predicated on being able to parallelize workloads
- Suitable for a subset of applications—as noted earlier:
  - streams of independent tasks—a server
  - data parallel—signal/image processing
- Important to have a solution for code that is difficult to parallelize—single thread performance
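The arithmetic above can be sketched directly; the 10X delay and 8X energy factors are the slide's NTC numbers, everything else is in relative units of the baseline core:

```python
# NTC trade-off sketch: per-op delay grows 10X, per-op energy falls
# 8X at near-threshold. One NTC core therefore draws 8*10 = 80X less
# power; splitting N ops into N/10 parallel subtasks restores the
# execution time while total energy (unchanged by parallelization)
# stays 8X lower.
def ntc_tradeoff(delay_factor=10, energy_factor=8):
    core_power = (1 / energy_factor) / delay_factor  # energy/op over time/op
    total_energy = 1 / energy_factor                 # whole computation
    return core_power, total_energy

core_power, total_energy = ntc_tradeoff()
print(f"per-core power: {1 / core_power:.0f}X less")
print(f"total energy:   {1 / total_energy:.0f}X less")
```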
Interesting consequences: SRAM

[Figures: normalized energy vs. VDD (0.2-1.2 V) for logic and SRAM, with the leakage, dynamic, and total components marked; normalized delay vs. VDD (0.0-1.2 V) for logic and SRAM.]

- SRAM has a lower activity rate than logic
- VDD for minimum energy operation (VMIN) is higher
- Logic naturally operates at a lower VMIN than SRAM—and slower
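The VMIN effect can be illustrated with a toy energy-per-operation model (the constants below are illustrative, not fitted to the talk's curves): dynamic energy falls as Vdd², while leakage energy grows as delay explodes near threshold, and the balance point moves up as activity drops.

```python
# Toy model: lower activity rate (SRAM-like) pushes the
# energy-optimal supply voltage (VMIN) higher.
def energy_per_op(vdd, vth=0.3, activity=1.0):
    dynamic = activity * vdd ** 2
    delay = vdd / max(vdd - vth, 1e-6) ** 1.5   # alpha-power-law-ish
    leakage = 0.01 * vdd * delay                # leak current * Vdd * time
    return dynamic + leakage

def vmin(activity):
    vs = [0.35 + 0.001 * i for i in range(900)]  # sweep 0.35-1.25 V
    return min(vs, key=lambda v: energy_per_op(v, activity=activity))

# Logic-like (activity 1.0) vs. SRAM-like (activity 0.1):
print(vmin(activity=1.0), vmin(activity=0.1))
```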
NTC—Opportunities and Challenges
Opportunities:
- New architectures
- Optimize processes to gain back some of the 10X delay
- 3D integration—fewer thermal restrictions

Challenges:
- Low voltage memory: new SRAM designs; robustness analysis at near-threshold
- Variation: Razor and other in-situ delay monitoring techniques; adaptive body biasing
- Performance loss: many-core designs to improve parallelism; core boosting to improve single thread performance
Proposed Parallel Architecture

[Diagram: n clusters, each with k cores (fcore, Vddcore, Vthcore) sharing a cache/SRAM run at k*fcore (Vddmem, Vthmem), connected through level converters to a 2nd-level memory; contrasted with a conventional design in which each core and its cache/SRAM run in a single (f0, Vdd0, Vth0) domain.]

1. R. Dreslinski, B. Zhai, T. Mudge, D. Blaauw, and D. Sylvester. An Energy Efficient Parallel Architecture Using Near Threshold Operation. 16th Int. Conf. on Parallel Architectures and Compilation Techniques (PACT), Romania, Sep. 2007, pp. 175-188.
2. B. Zhai, R. Dreslinski, D. Blaauw, T. Mudge, and D. Sylvester. Energy Efficient Near-threshold Chip Multi-processing. Int. Symp. on Low Power Electronics and Design (ISLPED), Aug. 2007, pp. 32-37.
Cluster Results (Cholesky benchmark, 230MHz equivalent performance)
- Baseline: single CPU @ 233MHz
- NTC 4-core, one core per L1: 53% avg. savings over baseline
- Clustered NTC, multiple cores per L1 (3 cores/cluster, 2 clusters): 74% avg. savings over baseline
New NTC Architectures
Recall, SRAM is run at a higher VDD than the cores with little energy penalty, so caches operate faster than cores. We can therefore introduce clustered architectures:
- Multiple cores share an L1
- The L1 is operated fast enough to satisfy all core requests in 1 cycle
- Cores see the view of a private single-cycle L1

Advantages (leading to lower power):
- Clustered sharing
- Less coherence/snoop traffic

Drawbacks (increased power):
- Core conflicts evicting L1 data (more misses)
- Additional bus/interconnect from cores to L1 (not as tightly coupled)

[Diagram: per-core private L1s on a bus/switched network to next-level memory, vs. clusters of four cores sharing an L1.]
Digression—Chip Makers' Response
- Exchanged cores for frequency: multi-/many-cores
- Risky behavior: "if we build it, they will come"
  - predicated on the solution to a tough problem—parallelizing software
- Multi-cores have only been successes in:
  - throughput environments—servers
  - heterogeneous environments—SoCs
  - data parallel applications
- Parallel processing is application specific—that's OK
  - treat parallel machines as attached processors
  - true in SoCs for some time—control plane / data plane separation

[Sketch: performance vs. frequency, replaced by performance vs. number of cores.]
Measured thread level parallelism—TLP

G. Blake, R. Dreslinski, T. Mudge (University of Michigan), and K. Flautner (ARM). Evolution of Thread-Level Parallelism in Desktop Applications. ISCA 2010, to appear.

Caveat: desktop applications.
Single thread performance: Boosting
- Baseline: cache runs at 4X core frequency, pipelined cache. 4 cores @ 15MHz (650mV), cache @ 60MHz (700mV).
- Better single thread performance—Boosting: turn some cores off, speed up the rest. Cache frequency remains the same; cache un-pipelined. Faster response time, same throughput; the core sees a larger cache, hiding the longer DRAM latency. 1 core @ 60MHz (850mV), cache @ 60MHz (1V): 4X.
- Overclock: increase core voltage and frequency further; cache frequency must be increased. Even faster response time, increased throughput. 1 core @ 120MHz (1.3V), cache @ 120MHz (1.5V): 8X.
Single Thread Performance
Look at turning off cores and speeding up the remaining cores to gain faster response time.

[Graph of cluster performance (not measured – intuition).]

R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge. Near Threshold Computing: Overcoming Performance Degradation from Aggressive Voltage Scaling. Workshop on Energy-Efficient Design (WEED 2009), held at 36th Int. Symp. on Computer Architecture, Austin, TX, June 2009.
Boosting Clusters—scaled to 22nm
- Baseline: cache runs at 4X core frequency, pipelined cache. 4 cores @ 140MHz, cache @ 60MHz.
- Better single thread performance: turn some cores off, speed up the rest. Cache frequency remains the same; cache un-pipelined. Faster response time, same throughput; the core sees a larger cache, hiding the longer DRAM latency. 1 core @ 600MHz, cache @ 600MHz: 4X.
- Boost core voltage and frequency further: cache frequency must be increased. Even faster response time, increased throughput. 1 core @ 1.2GHz, cache @ 1.2GHz: 8X.
Technologies for Reducing Power
Near threshold operation
3D die stacking
Replacing DRAM with Flash memory
A Closer Look at Wafer-Level Stacking

[Cross-section: silicon substrate, STI (shallow trench isolation), gate poly, dielectric (SiO2/SiN), oxide, tungsten (W) contacts and vias, Al metal layers (M1-M5), Cu top metal (M6), and the "super-contact".]

Preparing a TSV—through-silicon via.

Source: Bob Patti, CTO, Tezzaron Semiconductor.
Next, stack a second wafer and thin it. FF: face-to-face bonding.

Source: Bob Patti, CTO, Tezzaron Semiconductor.
Then stack a third wafer (1st wafer: controller; 2nd wafer; 3rd wafer). FB: face-to-back bonding.

Source: Bob Patti, CTO, Tezzaron Semiconductor.
Finally, flip, thin, and add pads. This is the completed stack (1st wafer: controller; 2nd wafer; 3rd wafer).

Source: Bob Patti, CTO, Tezzaron Semiconductor.
Characteristics
- Very high bandwidth, low-latency, low-power buses possible
  - 10,000 vias / sq mm
  - Electrical characteristics: ~1fF and < 1Ω
- No I/O pads for inter-stack connections—low power. Consider a memory stack:
  - DDR3: ~40mW per pin; 1024 data pins → 40W; 4096 data pins → 160W
  - die on wafer: ~24uW per pin
- Pros / cons:
  - 3D interconnect failure < 0.1ppm
  - Heat—1 W/sq mm
  - KGD (known good die) may be a problem—foundry
- Different processes can be combined: DRAM / logic / analog / non-volatile memories
  - e.g. DRAM—split sense amps and drivers from the memory cells
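The I/O power comparison above is simple arithmetic; this sketch reproduces the slide's pin-power numbers:

```python
# Conventional DDR3 pins (~40 mW each) vs. die-on-wafer TSV
# connections (~24 uW each), per the figures on this slide.
def io_power_w(n_pins, watts_per_pin):
    return n_pins * watts_per_pin

ddr3_1024 = io_power_w(1024, 40e-3)   # ~40 W for 1024 data pins
ddr3_4096 = io_power_w(4096, 40e-3)   # ~160 W for 4096 data pins
tsv_4096 = io_power_w(4096, 24e-6)    # well under a watt via TSVs
print(ddr3_1024, ddr3_4096, tsv_4096)
```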
Centip3De—3D NTC Project

[Stack diagram: two DRAM layers, a DRAM sense/logic and bond-routing layer, then logic layers A and B joined by face-to-face (F2F) bonds.]

Centip3De design:
• 130nm, 7-layer 3D-stacked chip
• 128 ARM M3 cores
• 1.92 GOPS @ 130mW
• Taped out: Q1 2010
Stacking the Die

System configuration:
• 2-wafer = 16 clusters (64 cores)
• 4-wafer = 32 clusters (128 cores)
• DDR3 controller

[Diagram: core clusters of 16 ARM M3 cores each.]

Cluster configuration:
• 4 ARM M3 cores @ 15MHz (650mV)
• 1 kB instruction cache @ 60MHz (700mV)
• 8 kB data cache @ 60MHz (700mV)
• Cores connect via 3D to caches on the other layer

Estimated performance (Raytrace):
• 1.6 GOPS (0.8 GOPS on 2-wafer)
• 110mW (65mW on 2-wafer)
• 14.8 GOPS/W

Fair metric:
• Centip3De achieves 24 GOPS/W without DRAM
Design Scaling and Power Breakdowns—NTC Centip3De System (Raytracing benchmark)

130nm:
• 1.9 GOPS (3.8 GOPS in Boost), max 1 IPC per core
• 128 cores @ 15 MHz
• 130 mW
• 14.8 GOPS/W (5.5 in Boost)

Scaled to 22nm:
• ~600 GOPS (~1k GOPS in Boost), max 1 IPC per core
• 4,608 cores @ 140 MHz
• ~3W, ~200 GOPS/W
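The efficiency figures above are straightforward ratios; this sketch checks them (the 130nm result lands near the slide's quoted 14.8 GOPS/W, and the 22nm projection matches exactly):

```python
# GOPS/W sanity check for the Centip3De numbers on this slide.
def gops_per_watt(gops, watts):
    return gops / watts

ntc_130nm = gops_per_watt(1.9, 0.130)  # ~14.6; slide quotes 14.8 GOPS/W
scaled_22nm = gops_per_watt(600, 3.0)  # 200 GOPS/W, as quoted
print(round(ntc_130nm, 1), scaled_22nm)
```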
Technologies for Reducing Power
Near threshold operation
3D die stacking
Replacing DRAM with Flash memory
Background – NAND Flash overview
Dual mode SLC/MLC NAND Flash memory, bank organization:
- Single Level Cell (SLC): 1 bit/cell; 10^5 erases per block; 25 μs read, 200 μs write
- Multi Level Cell (MLC): 2 bits/cell; 10^4 erases per block; 50 μs read, 680 μs write
- Addressable read/write unit is the page; pages consist of 2048 bytes + 64 'spare' bytes (2112 / 4224 bytes)
- Erases 64 SLC or 128 MLC pages at a time (a block)
- Technology: less than 60nm
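The page and block geometry above can be worked out directly; a sketch using the slide's figures:

```python
# NAND geometry from the slide: 2048-byte pages plus 64 spare bytes,
# erased in blocks of 64 (SLC) or 128 (MLC) pages.
PAGE_DATA, PAGE_SPARE = 2048, 64

def block_bytes(pages_per_block):
    return pages_per_block * (PAGE_DATA + PAGE_SPARE)

slc_block = block_bytes(64)    # bytes erased per SLC block
mlc_block = block_bytes(128)   # bytes erased per MLC block
print(slc_block, mlc_block)
```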
Reducing Memory Power
- Flash is denser than DRAM
- Flash is cheaper than DRAM
- Flash is good for idle power optimization: ~1000× less power than DRAM
- Flash is not so good for a low-access-latency usage model: DRAM is still required for acceptable access latencies
- Flash "wears out" – 10,000/100,000 write/erase cycles

          Area/bit (μm²)  $/Gb  Active power  Idle power  Read latency  Write latency  Erase latency
  DRAM    0.015           3     495mW         15mW        55ns          55ns           N/A
  NAND    0.005           0.25  50mW          6μW         25μs          200μs          1.5ms
  PCM     0.068           ?     —             6μW         55ns          150ns          N/A

NAND Flash cost assumes 2-bit-per-cell MLC. DRAM is a 2 Gbit DDR3-1333 x8 chip. Flash power numbers are for a 2 Gbit SLC x8 chip. Area from ITRS Roadmap 2009.
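The idle-power gap behind the "~1000×" claim follows from the table's own numbers (which actually give ~2500×, consistent with the slide's order-of-magnitude statement):

```python
# Idle power: DRAM at 15 mW vs. NAND flash at 6 uW (table values).
dram_idle_w = 15e-3
nand_idle_w = 6e-6
ratio = dram_idle_w / nand_idle_w
print(f"NAND idles at ~{ratio:.0f}x less power than DRAM")
```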
A Case for Flash as Secondary Disk Cache
- Many server workloads use a large working set (100s of MBs to 10s of GB and even more)
  - The large working set is cached in main memory to maintain high throughput
  - A large portion of DRAM goes to the disk cache
- Many server applications are more read intensive than write intensive
- Flash memory consumes orders of magnitude less idle power than DRAM
- Use DRAM for recent and frequently accessed content, and Flash for less recent and infrequently accessed content
  - Client requests follow a spatially and temporally Zipf-like distribution, e.g. 90% of client requests are to 20% of files
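A Zipf-like popularity skew of that kind is easy to sketch; the parameters below (exponent, file count) are illustrative, not from the talk:

```python
# With file popularity ~ 1/rank^s, a small fraction of files absorbs
# most requests, which is what makes a small hot DRAM tier effective.
def fraction_of_requests(top_fraction, n_files=10000, s=1.0):
    weights = [1 / (r ** s) for r in range(1, n_files + 1)]
    top = sum(weights[: int(top_fraction * n_files)])
    return top / sum(weights)

# Share of requests hitting the most popular 20% of files:
print(f"{fraction_of_requests(0.20):.0%}")
```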
A Case for Flash as Secondary Disk Cache (SPECweb99)

[Figure: network bandwidth in Mbps (throughput) for MP4/MP8/MP12 configurations vs. disk cache access latency to 80% of files, swept from 12μs to 1600μs.]

An access latency of 100s of microseconds can be tolerated.

T. Kgil and T. Mudge. FlashCache: A NAND Flash memory file cache for low power web servers. Proc. Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'06), Seoul, S. Korea, Oct. 2006, pp. 103-112.
Overall Architecture

[Diagram. Baseline without FlashCache: processors, 1GB DRAM main memory, HDD controller, hard disk drive. FlashCache architecture: processors, 128MB DRAM serving as generic main memory + primary disk cache, 1GB Flash as secondary disk cache with its own Flash controller and DMA, tables used to manage the Flash memory (FCHT, FBST, FPST, FGST), then the HDD controller and hard disk drive.]
Overall Network Performance – Mbps (SPECweb99)

[Bar chart: network bandwidth (Mbps) for MP4/MP8/MP12 across configurations: DRAM 32MB + Flash 1GB, DRAM 64MB + Flash 1GB, DRAM 128MB + Flash 1GB, DRAM 256MB + Flash 1GB, DRAM 512MB + Flash 1GB, and DRAM 1GB.]

128MB DRAM + 1GB NAND Flash performs as well as 1GB DRAM while requiring only about 1/3 the die area (SLC Flash assumed).
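The area claim follows from the earlier table's area-per-bit figures: NAND/DRAM is exactly 1/3 per bit, and this back-of-envelope sketch (mine) shows the hybrid including its 128MB of DRAM comes to a bit under half the 1GB-DRAM area:

```python
# Die-area arithmetic using area/bit from the earlier table:
# DRAM 0.015 um^2/bit, SLC NAND 0.005 um^2/bit.
GBIT = 8 * 2 ** 30  # bits per gigabyte

def area_mm2(size_gb, um2_per_bit):
    return size_gb * GBIT * um2_per_bit / 1e6   # um^2 -> mm^2

dram_1gb = area_mm2(1, 0.015)
hybrid = area_mm2(0.125, 0.015) + area_mm2(1, 0.005)  # 128MB DRAM + 1GB NAND
print(round(hybrid / dram_1gb, 2))
```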
Overall Main Memory Power (SPECweb99)

[Bar chart: overall power (W), broken into read, write, and idle power: DDR2 1GB active = 2.5W; DDR2 1GB powerdown = 1.6W; DDR2 128MB + Flash 1GB = 0.6W.]

Flash memory consumes much less idle power than DRAM.
Concluding Remarks on Flash-for-DRAM
- DRAM clockless refresh reduces idle power
- Flash density continues to grow
  - Intel-Micron JV announced 25nm flash: 8GB die, 167 sq mm, 2 bits per cell
  - 3 bits/cell is coming soon
- PCRAM appears to be an interesting future alternative
- I predict single-level storage using some form of NV memory, with disks replacing tape for archival storage
Cluster Size and Boosting
Analysis:
• 2-die stack
• Fixed die size
• Fixed amount of cache per core
• Raytrace algorithm
System Architecture

[Diagram: two layers (A and B) joined by B-B interfaces. Each layer carries layer hubs, each serving four 4-core clusters, together with system control, clock, JTAG, system communication, memory-forward (Mem Fwd) links, and the DRAM channels.]
Cluster Architecture

[Diagram: a cluster clock generator derives, from a 60 MHz system clock, a 60 MHz cache clock and 15 MHz core clocks at 0/90/180/270-degree phase offsets. Four M3 cores on layer 1 connect via 3D integration and 32-bit AMBA-like buses to the cluster I-cache (1024b, 4-way), D-cache (8192b, 4-way), and MMIO/reset control on layer 2, with 128-bit AMBA-like buses to DRAM, plus system communication and JTAG in/out.]
With power and cooling becoming an increasingly costly part of the operating cost of a server, the old trend of striving for higher performance with little regard for power is over. Emerging semiconductor process technologies, multicore architectures, and new interconnect technology provide an avenue for future servers to become low power, compact, and possibly mobile. In our talk we examine three techniques for achieving low power: 1) near threshold operation; 2) 3D die stacking; and 3) replacing DRAM with Flash memory.
Solutions?
- Reduce Vdd: "Near Threshold Computing"
  - Drawbacks: slower, less reliable operation
  - But: parallelism suits some computations (more as time goes by); robustness techniques, e.g. Razor—in situ monitoring
- Cool chips (again): interesting developments in microfluidics
- Devices that operate at lower Vdd without performance loss
Low-Voltage Robustness
- VDD scaling reduces SRAM robustness
- Maintain robustness through device sizing and VTH selection
- Robustness measured using importance sampling
- In the NTC range the 6T cell is smaller

[Figure: logic-rule bitcell area (μm²) vs. VDD (300-1000 mV) for 6T and 8T cells, each with and without VTH selection.]
SRAM Designs
- HD-SRAM: differential write, single-ended read, asymmetric sizing. HD μ/σ = 12.1 / 1.16 vs. 6T μ/σ = 11.0 / 1.35.
- Crosshairs: skew VDD in the column and GND in the row to target failing cells; no bitcell changes. Skew hurts some cells.

[Figures: normalized row & column error rate vs. VDD and GND skew (0-100 mV) at 1.1V and 0.7V; histogram of chips by read + write margin (in VTH) for half-differential vs. differential designs; bitcell schematic with BLR/BLRW bitlines, WLR/WLRW wordlines, and split VDD1/VDD2, GND1/GND2 supplies.]
Evolution of Subthreshold Designs
(Unpublished results – ISSCC 2010 – do not disclose)

- Subliminal 1 design (2006): 0.13 μm CMOS; used to investigate the existence of Vmin; 2.60 μW/MHz
- Subliminal 2 design (2007): 0.13 μm CMOS; used to investigate process variation; 3.5 μW/MHz
- Phoenix 1 design (2008): 0.18 μm CMOS; used to investigate sleep current; 2.8 μW/MHz
- Phoenix 2 design (2010): 0.18 μm CMOS; commercial ARM M3 core; used to investigate energy harvesting and power management; 37.4 μW/MHz

[Die photos: processor and memory floorplans (IMEM, DMEM, core) with dimensions.]