Novel Nonvolatile Memory Hierarchies to Realize "Normally-Off Mobile Processors"

ASP-DAC 2014

Shinobu Fujita, Kumiko Nomura, Hiroki Noguchi, Susumu Takeda, Keiko Abe

Toshiba Corporation, R&D Center, Advanced LSI Technology Laboratory

Acknowledgement: This work was partly supported by the Normally-off Computing PJ (NEDO) in Japan.


2

OUTLINE

Introduction: Normally-off (N-off) Processor (from Ver.0 to Ver.1)

Key Point 1: Advanced STT-MRAM

Key Point 2: Decrease in power for short CPU standby state by applying new memory cell design

Key Point 3: Power decrease for long CPU standby state by ultra-fast power gating

Conclusions: Towards N-off Ver.2

3

History of Concept on Normally-Off Computer

Normally-Off Computer Ver.0 (2001): proposed by K. Ando, AIST, Japan (FED journal Japan, 2001)

The same Ver.0 concept (based on MRAM) was presented by T. Kawahara, ASP-DAC 2011: "Nonvolatile memory and normally-off computing"

[Figure: two memory hierarchy diagrams (ALU/flip-flop, register files, L1 cache, L2 cache, main memory, storage), each marking which levels are volatile memory and which are non-volatile memory (MRAM).]

4

Rethink Normally-off Concept Ver.0

Attention:
- The active power (write power) of nonvolatile memory is very large!
- Nonvolatile memory is much slower than SRAM.
(CPU core power and performance are largely degraded by Ver.0!)

[Figure: power versus time across the memory hierarchy (ALU, register, L1 cache, L2 cache, main memory, storage), with regions labeled "Active power is dominant" and "Standby power is dominant".]

Ver.0 (an all-nonvolatile memory hierarchy) is not suitable for decreasing power.

5

History of Concept on Normally-Off Computer (2)

Ver.0 (2001): K. Ando, AIST, Japan (FED journal Japan, 2001)

Ver.0 (2011): T. Kawahara, ASP-DAC 2011

Ver.1 (2010): K. Abe, S. Fujita et al., Toshiba, SSDM 2010

[Figure: three memory hierarchy diagrams (ALU/flip-flop, register files, L1 cache, L2 cache, main memory, storage), each marking which levels are volatile memory and which are non-volatile memory (MRAM). Annotations: "CPU core power and performance is largely degraded!" (shown twice) and "Ultra low power applications such as Sensor Networks etc."]

6

Why nonvolatile L2, L3, LL Cache?

The capacity of cache memory in CPUs is increasing, which increases the standby power of processors!

<Background>
• Performance is increased not by raising the clock frequency.
• Multi-core.

[Figure: cache capacity (MB, axis ticks from 0.5 to 64) by year ('08 to '13) for server/workstation CPUs, desktop PC CPUs, notebook PC CPUs, and smartphone CPUs.]

[Figure: CPU block diagram with CPU core, registers, register file, L1 cache, and L2/L3 cache.]

More cache, more leakage.

7

Consumed Energy Caused by Leakage Power of Last Level Cache (L2$)

Especially for mobile processors, not standby power but leakage power is dominant!

(Evaluation from a one-day use case.)

[Figure: bar chart of energy consumed (%, 0 to 100) in the active state, the clock-gated state, and the power-gated state (except L2, which is in retention).]

8

[Figure: access speed (ns, 1 to 10000000) versus memory capacity (bit, 1M to 1T) for RAM and storage technologies: SRAM, DRAM, MRAM, STT-MRAM, FeRAM, PCM, ReRAM, NAND Flash, and HDD, grouped into "limited endurance" and "practically unlimited endurance".]

STT-MRAM is the best among nonvolatile memories, but its operation speed is slow and its power is high for cache memory.

9

"Dilemma of Nonvolatile Memory!"

Standby power is low, but with conventional STT-MRAM the active energy is far higher than that of SRAM.

[Figure: bar chart of energy consumed (%, 0 to 100) in the active state, the clock-gated state, and the power-gated state (L2 retention). Annotations: "General STT-MRAM: active power increases drastically!" and "General STT-MRAM: standby power decreases largely!"]

10

OUTLINE

Introduction: New Design Concept

Normally-off (N-off) Processor (from Ver.0 to Ver.1)

Key Point 1: Advanced STT-MRAM

Key Point 2: Decrease in power for short CPU standby state (in CPU active state) by applying new memory cell design

Key Point 3: Power decrease for long CPU standby state by ultra-fast power gating

Conclusions: Towards N-off Ver.2

11

Advanced STT-MRAM has been developed!

[Figure: access speed (ns, 1 to 10000000) versus memory capacity (bit, 1M to 1T) for SRAM, DRAM, MRAM, STT-MRAM, FeRAM, PCM, ReRAM, NAND Flash, and HDD, grouped into "limited endurance" and "practically unlimited endurance", with an added point for advanced p-STT-MRAM.]

12

Breakthrough by Toshiba's advanced STT-MRAM

[Figure: programming current (A, 1.0E-05 to 1.0E-02) versus programming time (axis labeled "nsec", ticks 1.E-10 to 1.E-07) for reported devices [1] to [7] and for Toshiba 2012 (advanced STT-MRAM, 28-30 nm). A line marks the breakeven point for the replacement of hp-SRAM in power of cache memory, separating the region where the power of cache memory is increased from the region where it is reduced. Arrows indicate "Power down" and "Higher speed".]

[1] Sony Corp., IEDM (2005).
[2] New York Univ., Applied Physics Letters 97, 242510 (2010).
[3] Cornell Univ., Applied Physics Letters 95, 012506 (2009).
[4] Minnesota Univ., J. Phys. D: Appl. Phys. 45, 025001 (2012).
[5] NEC Corp., Symposium on VLSI Circuits 7.3 (2012).
[6] IBM Corp., Appl. Phys. Lett. 98, 022501 (2011).
[7] TDK-Headway, Applied Physics Express 5, 093008 (2012).

13

Access Time Measurements

[Figure: pass/fail plot of voltage (V, 0.8 to 1.2) versus access time (ns, 3 to 6).]

[email protected] [email protected]

17.8mW@250MHz Read

STT-MRAM Test Chip: Embedded Memory Integration (by Toshiba N-off PJ)

[Figure: test chip layout with annotated dimensions (502.03 μm, 502.03 μm, 55.26 μm, 560.12 μm, 32.69 μm, 592.81 μm, 1059.32 μm), showing XDEC, READ & WRITE, driver, and SA blocks, 1K BL and 1K WL, and four 256R x 256C sub-arrays, each with a 4:1 MUX.]

Process: 65-nm CMOS process
Macro size: 0.628 mm2
Organization: 4K words x 256 bits = 1 Mb
Cell: 2T-2MTJ
Access time: < 4 ns

H. Noguchi et al., VLSI circuit symposium, 2013
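
A quick arithmetic check of the test-chip figures quoted above, as a minimal Python sketch. The bit count and density follow directly from the slide values. The energy-per-read estimate additionally assumes one full 256-bit read per 250 MHz cycle, which the slide does not state; the function and key names are illustrative only.

```python
# Figures quoted on the slide.
WORDS = 4 * 1024          # "4K words"
BITS_PER_WORD = 256       # "x 256 bits"
MACRO_AREA_MM2 = 0.628    # "Macro size 0.628 mm2"
READ_POWER_W = 17.8e-3    # "17.8mW@250MHz Read"
READ_FREQ_HZ = 250e6


def macro_figures():
    """Derive capacity, density, and a rough read-energy estimate."""
    total_bits = WORDS * BITS_PER_WORD
    megabits = total_bits / 2**20
    # ASSUMPTION (not on the slide): one full-width read every clock cycle.
    energy_per_read_pj = READ_POWER_W / READ_FREQ_HZ * 1e12
    return {
        "total_bits": total_bits,
        "megabits": megabits,
        "density_mb_per_mm2": megabits / MACRO_AREA_MM2,
        "energy_per_read_pj": energy_per_read_pj,
        "energy_per_bit_pj": energy_per_read_pj / BITS_PER_WORD,
    }


if __name__ == "__main__":
    for name, value in macro_figures().items():
        print(f"{name}: {value:.4g}")
```

The capacity works out to exactly 1,048,576 bits, which matches the "1 Mb" organization on the slide.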

14

High-speed STT-MRAM is NOT for high CPU performance, but for lower-power CPUs!

[Figure: average time per instruction (ns, 0 to 4) versus access time (ns, 0 to 30) for SRAM and STT-MRAM, annotated "< 3% @ 5 ns".]

[Figure: relative average power for a 1 MB L2 cache memory (0 to 4, HP-SRAM = 1) versus access time of STT-MRAM (ns), with regions marked "Higher Power" and "Lower Power".]
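
Why a slower L2 costs little performance can be illustrated with a textbook first-order stall model. This is a sketch under stated assumptions, not the simulation behind the slide: every parameter value below (base time per instruction, L2 accesses per instruction, L2 miss rate, main-memory latency, and the 2 ns SRAM reference latency) is hypothetical and not fitted to the plotted data.

```python
def time_per_instruction_ns(
    l2_latency_ns,
    base_ns=1.0,                # HYPOTHETICAL: time per instruction with no L2 stalls
    l2_accesses_per_inst=0.01,  # HYPOTHETICAL: L1 misses that reach the L2, per instruction
    l2_miss_rate=0.2,           # HYPOTHETICAL: fraction of L2 accesses that go to main memory
    dram_latency_ns=60.0,       # HYPOTHETICAL: main-memory latency
):
    """First-order model: base time plus average memory stall per instruction."""
    stall_ns = l2_accesses_per_inst * (l2_latency_ns + l2_miss_rate * dram_latency_ns)
    return base_ns + stall_ns


def slowdown_percent(l2_latency_ns, reference_latency_ns=2.0):
    """Relative increase in time per instruction versus a reference L2 latency."""
    ref = time_per_instruction_ns(reference_latency_ns)
    return 100.0 * (time_per_instruction_ns(l2_latency_ns) / ref - 1.0)


if __name__ == "__main__":
    for latency in (2.0, 5.0, 10.0, 30.0):
        print(latency, round(time_per_instruction_ns(latency), 3),
              round(slowdown_percent(latency), 2))
```

In this model the L1 cache filters most accesses, so the time per instruction grows only linearly and slowly with L2 latency.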

15

Development of "STT-MRAM-top Integration"

[Figure: cross-section image showing CMOS, the interconnect layers, and STT-MRAM on top.]

Conventional CMOS process (in-house fab, foundry, etc.)

Specific MRAM integration process

Pool-top construction (Marina Bay Sands Hotel)

To be presented at VLSI-TSA 2014.

16

OUTLINE

Introduction: New Design Concept

Normally-off (N-off) Processor (from Ver.0 to Ver.1)

Key Point 1: Advanced STT-MRAM

Key Point 2: Decrease in power for short CPU standby state (in CPU active state) by applying new memory cell design (normally-off type design)

Key Point 3: Power decrease for long CPU standby state by ultra-fast power gating

Conclusions: Towards N-off Ver.2

17

From "Normally-On Type Memory with Power Gating" to "Normally-Off Type Memory without Power Gating"

(1) SRAM and Nonvolatile SRAM without Power Gating

[Figure: two power-versus-time plots in which active periods are separated by short standby periods (10~30 ns) that still draw leakage power.]

[Figure: normally-on type cells, each with a leakage path: an SRAM cell (WL, BL, /BL) and a nonvolatile SRAM cell with two MTJs (2004 Toshiba).]

NV-SRAM for high speed! Power gating?

18

From "Normally-On Type Memory with Power Gating" to "Normally-Off Type Memory without Power Gating"

(1) SRAM and Nonvolatile SRAM with Power Gating

Normally-on type: the SRAM cell and the nonvolatile SRAM cell (2004 Toshiba, with two MTJs) each have a leakage path, so a power gating switch is needed.

The overhead of the power gating switch is very large! (Delay and power overhead also.)

(2) Normally-off Type Memory without Power Gating

New design, "normally-off type" (next page): no leakage path, no power gating switch.

[Figure: area comparison of SRAM, NV-SRAM, the power gating switch, and STT-MRAM, with a label "x2 ~ x4".]

The STT-MRAM cell is much smaller than the SRAM cell.
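
The argument against power gating for short standby periods can be made concrete with the usual break-even estimate: gating only pays off when the leakage energy saved during the idle period exceeds the energy spent switching off and on. The sketch below assumes that simple model; the transition energy and leakage power are hypothetical values chosen for illustration, not figures from the slides.

```python
def break_even_idle_ns(transition_energy_pj, leakage_power_mw):
    """Idle time at which saved leakage energy equals the gating overhead.

    1 mW equals 1 pJ/ns, so pJ divided by mW gives nanoseconds.
    """
    if leakage_power_mw <= 0:
        raise ValueError("leakage power must be positive")
    return transition_energy_pj / leakage_power_mw


def gating_saves_energy(idle_ns, transition_energy_pj, leakage_power_mw):
    """True if power gating for idle_ns saves more than it costs."""
    return idle_ns > break_even_idle_ns(transition_energy_pj, leakage_power_mw)


if __name__ == "__main__":
    # HYPOTHETICAL numbers: 5000 pJ to switch the array off and on, 50 mW leakage.
    energy_pj, leak_mw = 5000.0, 50.0
    print(break_even_idle_ns(energy_pj, leak_mw))        # 100.0
    print(gating_saves_energy(30.0, energy_pj, leak_mw))  # False
```

With these illustrative values, a 10~30 ns standby period is shorter than the break-even time, which is the situation the normally-off cell avoids by having no leakage path to gate.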

19

Various kinds of normally-off type memory cell designs using advanced p-STT-MRAM, presented by Toshiba:

(a) D-MRAM: K. Abe et al., IEDM 2012 (Toshiba)
(b) 3T-2MTJ: A. Kawasumi et al., IMW 2013 (Toshiba)
(c) 2T-2MTJ: H. Noguchi et al., VLSI Circuit 2013 (Toshiba)
(d) 4T-2MTJ: C. Tanaka et al., SSDM 2013 (Toshiba)

[Figure: circuit schematics of the four cells, with signal lines such as WBL, RBL, SL, /SL, RWL, WWL, SWL, WL, BL, and /BL, and the MTJ elements.]

As there are no leakage paths like those in SRAM, no power gating switch is needed in the memory arrays.

20

CPU Simulation (ARM core, Linux OS)

Processor benchmark sets: SPEC2006

Processor
  # of cores: 1
  Frequency: 1 GHz
  Issue width: 1 (out of order)
  ISA: ARMv7

Memory
  L1 cache: 32+32 kB, 4-way, 64 B line, write-back, 1 read/write port, 1 ns latency
  L2 cache: 1 MB, 8-way, 64 B line, write-back, 1 read/write port

Execution
  Warm-up: 1M inst.
  Execution: 10M inst.

Cell type: MTJ device write time / current
  D-MRAM (1MTJ-3T, this work): 3 ns / 50 uA (advanced p-MTJ)
  2MTJ-4T: 25 ns / 120 uA (reference p-MTJ)
  2MTJ-6T: 3 ns / 50 uA (advanced p-MTJ)
  SRAM (reference): -
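
The cache parameters in the simulation setup fix the cache geometry. A minimal Python sketch of that arithmetic follows; the class and field names are illustrative, and "32+32 kB" is read here as separate 32 kB instruction and data caches, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Size, associativity, and line size of one cache level."""
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def num_sets(self) -> int:
        # sets = capacity / (associativity * line size)
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def index_bits(self) -> int:
        # Number of address bits that select the set (power-of-two set counts).
        return (self.num_sets - 1).bit_length()

    @property
    def offset_bits(self) -> int:
        # Number of address bits that select the byte within a line.
        return (self.line_bytes - 1).bit_length()


# Values from the slide; L1 is ASSUMED to be split 32 kB I + 32 kB D.
L1 = CacheConfig(size_bytes=32 * 1024, ways=4, line_bytes=64)
L2 = CacheConfig(size_bytes=1024 * 1024, ways=8, line_bytes=64)

if __name__ == "__main__":
    print(L1.num_sets, L1.index_bits, L1.offset_bits)  # 128 7 6
    print(L2.num_sets, L2.index_bits, L2.offset_bits)  # 2048 11 6
```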

21

Results of Power of Cache Memory (Short standby state)

(Case study: (a) D-MRAM)

[Figure: normalized power of the 1 MB, 8-way L2 cache (axis ticks 0.1, 1, 10) for the benchmarks bzip2, gobmk, h264ref, xalancbmk, gromacs, namd, calculix, and lbm. Series: 2MTJ-6T, 2MTJ-4T, 1MTJ-3T (advanced p-MTJ, this work), and 1MTJ-3T (reference p-MTJ). Labels: "Normally-Off", "Normally-On without PG", "SRAM", "Worse than SRAM (Normally-on)", and "Better than SRAM (Normally-Off)".]

Normally-off memory using low-power, advanced p-STT-MRAM can reduce the cache power the most effectively.

22

Processor performance (Short standby state)

(Case study: normally-off STT-MRAM; SRAM case = 1)

[Figure: normalized instructions per cycle (IPC, a.u., 0 to 1) for the benchmarks bzip2, gobmk, h264ref, xalancbmk, gromacs, namd, calculix, and lbm. Series: 2MTJ-6T, 2MTJ-4T, 1MTJ-3T (advanced p-MTJ, this work), and 1MTJ-3T (reference p-MTJ).]

The normally-off memory cell design using advanced p-MTJ (STT-MRAM) has the best performance, comparable to that of SRAM.

23

OUTLINE

Introduction: New Design Concept

Normally-off (N-off) Processor (from Ver.0 to Ver.1)

Key Point 1: Advanced STT-MRAM

Key Point 2: Decrease in power for short CPU standby state (in CPU active state) by applying new memory cell design

Key Point 3: Power decrease for long CPU standby state by ultra-fast power gating

Conclusions: Towards N-off Ver.2

24

Conventional Power Gating (columns: Clock, L1-$, L2-$, Others, recovery time to active state)

Active state
Clock gated state: clock OFF; recovery ~1 ns
CPU core sleep (L2-cache retention): clock OFF, L1-$ OFF; recovery 10 μs
CPU core sleep (L2-cache decay): clock OFF, L1-$ OFF, L2-$ decay; recovery 20 μs
Deep power-down state (DPS): clock OFF, L1-$ OFF, L2-$ OFF, others OFF except state SRAM; recovery 100 μs

High-speed PG with NV-cache (columns: Clock, L1-$, L2-memory, L2-logic + peripheral, recovery time to active state)

Active state: L2-memory normally OFF
Clock gated state: clock OFF, L2-memory normally OFF; recovery ~1 ns
CPU core sleep; deep power-down state (DPS): OFF, L2-memory normally OFF; recovery ~10 ns
Deeper power-down state: all blocks OFF, L2-memory normally OFF; recovery ~100 ns

25

State transition policy for long time standby state: Conventional power gating (PG) vs. Ultra-fast PG with NV-L2-cache

[Figure: power (%, 0 to 100) versus idle time (ns, axis from 10 to 10^6), stepping down from the active state through the clock-gated state and the CPU core sleep state to the deep power-down state (DPS). Conventional PG: transitions marked at 20 μsec, 40 μsec, and 200 μsec, including an L2 cache decay step. Ultra-fast PG + nonvolatile L2 cache: transitions marked at 2 nsec, 20 nsec, and 200 nsec. Short idle times are labeled "Short standby state" and long ones "Long time standby state".]
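
The transition points marked in the figure are each twice a recovery time listed for the corresponding table of states (for example 10, 20, 100 versus 20, 40, 200). The Python sketch below assumes that "enter a state once the idle time reaches twice its recovery time" is the rule; this is an inference from the numbers, not a policy stated on the slides. It also assumes the conventional recovery times are in microseconds, and the state names and function are illustrative.

```python
NS = 1.0
US = 1000.0  # nanoseconds per microsecond

# (state name, recovery time in ns), ordered from shallowest to deepest.
# ASSUMPTION: conventional recovery times are microseconds.
CONVENTIONAL = [
    ("clock gated", 1 * NS),
    ("CPU core sleep (L2 retention)", 10 * US),
    ("CPU core sleep (L2 decay)", 20 * US),
    ("deep power-down", 100 * US),
]
ULTRA_FAST = [
    ("clock gated", 1 * NS),
    ("CPU core sleep / deep power-down", 10 * NS),
    ("deeper power-down", 100 * NS),
]


def choose_state(idle_ns, states, factor=2.0):
    """Return the deepest state whose threshold (factor * recovery) is met.

    ASSUMPTION: factor=2.0 is inferred from the slide's numbers.
    """
    chosen = "active"
    for name, recovery_ns in states:
        if idle_ns >= factor * recovery_ns:
            chosen = name
    return chosen


if __name__ == "__main__":
    for idle in (1.0, 50.0, 500.0, 50_000.0, 1_000_000.0):
        print(idle, "|", choose_state(idle, CONVENTIONAL),
              "|", choose_state(idle, ULTRA_FAST))
```

Under these assumptions, a 500 ns idle period only reaches the clock-gated state with conventional PG but already reaches the deepest state with ultra-fast PG and a nonvolatile L2 cache.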

26

[Figure: power-versus-time traces in which interrupts separate deep sleep states, for a) conventional PG (power gating) and b) ultra-fast PG + NV-cache.]

27

Case studies: Decrease in average power of processor by ultra-fast PG with nonvolatile-L2 cache.

Case 1: CPU active state dominant. Case 2: Moderate. Case 3: CPU idle state dominant.

[Figure: average power of processor (mW, 0 to 1000) for Case 1, Case 2, and Case 3, comparing conventional and NV-L2 cache. Labeled reductions: 29%, 66%, and 90%.]
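
Why the reduction grows as idle time dominates can be seen from a time-weighted average of per-state power. The sketch below is a generic duty-cycle model with hypothetical power values and state fractions. It does not reproduce the 29%, 66%, and 90% figures, and all names are illustrative.

```python
def average_power_mw(fractions, powers_mw):
    """Time-weighted average power; fractions must sum to 1."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("state fractions must sum to 1")
    return sum(fractions[state] * powers_mw[state] for state in fractions)


def reduction_percent(conventional_mw, proposed_mw):
    """Relative power reduction of the proposed design, in percent."""
    return 100.0 * (1.0 - proposed_mw / conventional_mw)


# HYPOTHETICAL per-state power (mW); not taken from the slides.
CONVENTIONAL_MW = {"active": 1000.0, "idle": 400.0}
NV_L2_MW = {"active": 900.0, "idle": 20.0}

if __name__ == "__main__":
    for label, active_share in (("active dominant", 0.9), ("idle dominant", 0.1)):
        fractions = {"active": active_share, "idle": 1.0 - active_share}
        conv = average_power_mw(fractions, CONVENTIONAL_MW)
        nv = average_power_mw(fractions, NV_L2_MW)
        print(label, round(conv, 1), round(nv, 1),
              round(reduction_percent(conv, nv), 1))
```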

28

Conclusion
• For HP mobile processors, we proposed N-off processor Ver.1: volatile L1 cache / nonvolatile L2 and LLC.
• To realize N-off processor Ver.1, the three key points are advanced STT-MRAM, normally-off type memory cell design, and ultra-fast power gating.
• By applying new memory cell designs without leakage paths, not only CPU standby power but also CPU active power has been effectively reduced.
• An average power reduction of 29 to 90% can be expected with little degradation of CPU performance.
• The shift of the N-off processor concept from Ver.1 to Ver.2 is in progress.

Normally-off Processor (Concept Ver.1), 2010

Normally-off Processor (Concept Ver.2), from 2012

[Figure: block diagrams of the two concepts, with the CPU core, registers, register file, L1 cache, and L2/L3 cache marked volatile or nonvolatile, and the labels "Rethink!" and "Nonvolatile/Volatile Hybrid". Example: big and LITTLE CPU cores with L2 (256 kB) SRAM/STT-MRAM, L2 (1 MB) STT-MRAM, and L3 (16 MB) high-density STT-MRAM.]

29

Thank you!