
NANDFlashSim: Intrinsic Latency Variation Aware NAND Flash Memory System Modeling and Simulation at Microarchitecture Level

Myoungsoo Jung (MJ), Ellis H. Wilson III, David Donofrio, John Shalf, Mahmut T. Kandemir

Agenda

• Revisiting NAND flash technology
• Advanced NAND flash operations
• NANDFlashSim
• Evaluation

Intrinsic Latency Variation

• Fowler-Nordheim tunneling
  – Making an electron channel
  – Voltage is applied over a certain threshold
• Incremental step pulse programming (ISPP)

[Figure: NAND cell cross-section with gate, floating gate, source, drain, and channel, alongside a cell distribution]

Intrinsic Latency Variation

• Each step of ISPP needs a different programming duration (latency)
• Latencies of the NAND flash memory fluctuate depending on the address of the pages in a block
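A minimal sketch of how such page-offset-dependent program latency could be tabulated; the geometry and latency values below are illustrative assumptions, not NANDFlashSim's actual timing model:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch only: program latency chosen by page offset within a block. In MLC
// parts the LSB page of a pair programs much faster than the MSB page, so a
// per-offset rule captures the ISPP-driven fluctuation better than a single
// constant latency.
constexpr std::size_t kPagesPerBlock = 128;        // assumed geometry
constexpr uint64_t kFastProgramNs = 300'000;       // illustrative LSB latency
constexpr uint64_t kSlowProgramNs = 1'300'000;     // illustrative MSB latency

uint64_t programLatencyNs(std::size_t pageOffset) {
    pageOffset %= kPagesPerBlock;  // offset of the page within its block
    // Assumption for this sketch: even offsets map to LSB (fast) pages and
    // odd offsets to MSB (slow) pages; real parts use vendor-specific pairings.
    return (pageOffset % 2 == 0) ? kFastProgramNs : kSlowProgramNs;
}
```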

NAND Flash Architecture

• Employing cache and data registers
• Multiple planes (memory array)
• Multiple dies

[Figure: NAND flash architecture showing the memory array (plane) with cache and data registers]

Density Trend

• Flash technology
  – Each cell is capable of storing multiple bits
  – Manufacturing feature size is scaling down
• So far, density has been increasing by two to four times every 2 years

[Figure: flash chip density (Gb) versus year, 2002-2012]

Density Trend

• Shrinking of the manufacturing feature size might be limited at around 12 nanometers
• Multi-die stack technology
  – Flash packages continue to scale up by employing multiple dies and planes

[Figure: density trend extended to multi-chip packages (MCP) - flash chip density (Gb) through 2012 alongside projected MCP density (Gb) through 2014, from 1Gb up to 32Gb/64Gb parts; MOSAID HLNAND (32Gb 16-die NAND stack)]

How does performance behavior change?

Advanced NAND Flash Operations

Legacy Operation

• An I/O operation splits into several operation stages

• Each stage should be appropriately handled by device drivers

[Figure: legacy operation command sequence over the NAND bus - cmd, addr, data, cmd, status check]

Cache Operation

• Cache mode operations use internal registers in an attempt to hide the performance overhead of data movements (sketched below)

[Figure: cache operation pipelining the transfers of data1 and data2]
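A first-order sketch of why cache mode helps, assuming a single bus-transfer latency and a single cell-program latency per page (both hypothetical values supplied by the caller); the simulator's real timing is far more detailed:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative model only: total time to program N pages with and without
// cache mode. Cache mode overlaps the bus transfer of page i+1 with the cell
// programming of page i through the cache register.
uint64_t legacyProgramNs(int pages, uint64_t busXferNs, uint64_t cellProgNs) {
    if (pages <= 0) return 0;
    return static_cast<uint64_t>(pages) * (busXferNs + cellProgNs);
}

uint64_t cacheProgramNs(int pages, uint64_t busXferNs, uint64_t cellProgNs) {
    if (pages <= 0) return 0;
    // The first transfer and the last program are exposed; every step in
    // between overlaps a transfer with a program.
    return busXferNs +
           static_cast<uint64_t>(pages - 1) * std::max(busXferNs, cellProgNs) +
           cellProgNs;
}
```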

Internal Data Move Mode

• Saves space and cycles when copying data
• Source and destination page addresses should be located in the same die (check sketched below)

[Figure: internal data move command sequence - cmd, src addr, dst addr, cmd; no data movement through the NAND interface]
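The eligibility rule above can be written as a one-line check; the PageAddress layout is a hypothetical stand-in, not a real part's address format:

```cpp
#include <cstdint>

// Hypothetical page address layout used only for this sketch; real parts
// define their own die/plane/block/page bit fields.
struct PageAddress {
    uint32_t die;
    uint32_t plane;
    uint32_t block;
    uint32_t page;   // page offset within the block
};

// Internal data move (copy-back) avoids any data movement through the NAND
// interface, but only when source and destination live on the same die.
bool canUseInternalDataMove(const PageAddress& src, const PageAddress& dst) {
    return src.die == dst.die;
}
```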

Multi-plane Mode Operation

• Two different pages can be served in parallel
• Addresses should indicate the same page offset in a block and the same die address, and should have different plane addresses (the plane addressing rule, sketched below)

[Figure: multi-plane command sequence - cmd, addr1, addr2, cmd, data1, data2, status check]
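The plane addressing rule can likewise be expressed as a simple predicate (same hypothetical PageAddress layout as in the previous sketch):

```cpp
#include <cstdint>

// Same hypothetical PageAddress layout as the earlier sketch.
struct PageAddress {
    uint32_t die;
    uint32_t plane;
    uint32_t block;
    uint32_t page;   // page offset within the block
};

// Plane addressing rule: two requests may be fused into one multi-plane
// operation only if they target the same die and the same page offset in
// their blocks, but different planes.
bool canFuseAsMultiPlane(const PageAddress& a, const PageAddress& b) {
    return a.die == b.die && a.page == b.page && a.plane != b.plane;
}
```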

Interleaved-Die Mode Operation

• Provides a way to take advantage of internal parallelism by interleaving NAND transactions across dies (sketched below)
• Scheduling of NAND transactions and bus arbitration are critical determinants of memory system performance

[Figure: interleaved-die command sequences (cmd, addr1, data1 and cmd, addr2, data2) issued to different dies]
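One simple way to generate such interleaving, sketched below, is a round-robin scheduler that spreads incoming NAND transactions across per-die queues; the names and structure are illustrative, not the simulator's scheduler:

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Minimal round-robin die interleaver, assumed for illustration. Transactions
// are spread across per-die queues so independent dies can work in parallel
// while sharing the NAND bus.
struct NandTransaction {
    std::size_t die = 0;   // target die, filled in by the interleaver
    // command, address, and payload would live here in a real model
};

class DieInterleaver {
public:
    explicit DieInterleaver(std::size_t dies) : queues_(dies ? dies : 1) {}

    void submit(NandTransaction t) {
        t.die = next_;
        queues_[next_].push(t);
        next_ = (next_ + 1) % queues_.size();   // round-robin across dies
    }

    std::queue<NandTransaction>& queueForDie(std::size_t die) {
        return queues_[die];
    }

private:
    std::vector<std::queue<NandTransaction>> queues_;
    std::size_t next_ = 0;
};
```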

Challenges

• Performance varies based on:
  – intrinsic latency variation characteristics
  – internal parallelism
  – the types of advanced flash operations
• Performance is affected by:
  – how the diverse advanced flash operations are handled
  – how NAND transactions are scheduled

Prior Simulation WorksPrior Simulation Works

Fl h b d S lid S Di k Si l i• Flash-based Solid State Disks Simulation– Tightly coupled to specific flash firmware

• Unaware of latency variation of NAND flash– Latency approximation model with constantsy pp

• Course-grain NAND command handling– In-order executionIn order execution

[Figure: prior simulation stack - an SSD simulator / I/O subsystem layering the app and flash firmware over a latency-approximation flash model]

NANDFlashSim

• Simulates and models NAND flash itself rather than flash firmware or SSDs
  – NANDFlashSim can be applied to diverse applications such as off-chip caches of a multi-core system and I/O subsystems of mobile systems
  – Multiple instances can be used for building SATA and PCIe-based SSDs

[Figure: NANDFlashSim positioned in the SSD simulator / I/O subsystem stack beneath the app and flash firmware layers, in place of the latency-approximation model]

NANDFlashSim

• Detailed timing model
• Awareness of intrinsic latency variation
  – Designed to be performance-variation-aware, employing the different page offsets within a physical block
• Reconfigurable microarchitecture (illustrated below)
  – Supports highly reconfigurable architectures in terms of multiple dies and planes
• Fine-grain NAND flash command handling
  – 16 combinations of advanced flash operations
  – Supports out-of-order execution
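For illustration only, the kind of geometry such a reconfigurable model exposes might be captured by a record like the following (field names are invented, not NANDFlashSim's real parameters):

```cpp
#include <cstdint>

// Hypothetical geometry record, shown only to illustrate the microarchitectural
// knobs a reconfigurable NAND model exposes.
struct NandGeometry {
    uint32_t dies           = 2;
    uint32_t planesPerDie   = 4;
    uint32_t blocksPerPlane = 2048;
    uint32_t pagesPerBlock  = 128;
    uint32_t pageSizeBytes  = 8192;
};

// Raw capacity implied by the configured geometry, in bytes.
uint64_t capacityBytes(const NandGeometry& g) {
    return static_cast<uint64_t>(g.dies) * g.planesPerDie *
           g.blocksPerPlane * g.pagesPerBlock * g.pageSizeBytes;
}
```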

High-level View

• A command set architecture with an individual state machine associated with it
• The host and NAND flash clock domains are separate
• All entries (controller, register, die, …) are updated at every cycle (sketched below)

[Figure: high-level organization - NAND flash I/O bus, logical units, and k*j blocks]
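A rough sketch of that cycle-driven update with two separate clock domains (the type and function names are invented for illustration):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of per-cycle updates with separate host and NAND clock
// domains; these are not NANDFlashSim's classes.
struct Clocked {
    virtual void tick() = 0;        // advance this entity by one of its cycles
    virtual ~Clocked() = default;
};

void runFor(uint64_t simTimeNs,
            const std::vector<Clocked*>& hostDomain, uint64_t hostPeriodNs,
            const std::vector<Clocked*>& nandDomain, uint64_t nandPeriodNs) {
    // Every registered entity (controller, register, die, ...) is updated on
    // each edge of its own clock; the two domains need not share a period.
    for (uint64_t t = 0; t < simTimeNs; ++t) {
        if (hostPeriodNs && t % hostPeriodNs == 0)
            for (Clocked* e : hostDomain) e->tick();
        if (nandPeriodNs && t % nandPeriodNs == 0)
            for (Clocked* e : nandDomain) e->tick();
    }
}
```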

Command Set Architecture

• Multi-stage operation
  – Stages are defined by common operations
  – CLE, ALE, TIR, TIN, TOR, TON, etc.
• Command chains
  – Define command sequences (sketched below)

[Figure: requests (Req.) driving command chains and latency variation generators]
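Sketching a command chain as an ordered list of stages walked by a small state machine, using the stage names above; the chain contents and types are illustrative, not the simulator's exact definitions:

```cpp
#include <cstddef>
#include <vector>

// Stage names taken from the slide (CLE, ALE, TIR, TIN, TOR, TON).
enum class Stage { CLE, ALE, TIR, TIN, TOR, TON, STATUS };

using CommandChain = std::vector<Stage>;

// Example chain for a legacy page program: command latch, address latch,
// data-in over the bus, program confirm, cell programming, status check.
const CommandChain kLegacyProgram = {
    Stage::CLE, Stage::ALE, Stage::TIR, Stage::CLE, Stage::TIN, Stage::STATUS};

// Walks a chain one stage at a time; advance() is called once the current
// stage's latency (drawn from the latency model) has elapsed.
struct ChainWalker {
    const CommandChain* chain = nullptr;
    std::size_t index = 0;

    bool done() const { return chain == nullptr || index >= chain->size(); }
    Stage current() const { return (*chain)[index]; }
    void advance() { ++index; }
};
```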

Evaluation Methodology

[Table: evaluated NAND flash devices from SAMSUNG, MICRON, and SK HYNIX]

Validation (Throughput)

[Figure: bandwidth (MB/s) under worst-case-based, typical-case-based, variation-aware-based, and MSIS models for legacy, cache, two-plane (2x), and 2x cache operations; panels cover write performance (single die, 4.1%~11.2% deviation), read performance (single die), and write performance (dual die); other annotated deviations are 12%, 2.1%, 1.6%, 3.8%, 83.1~170.9% (typical-case), 7.3%~42% (worst-case), and 5.3~9.4% (NANDFlashSim)]

Improvement from deviation: 48.7%~79.6% in the typical-case model, 44.4%~53.5% in the worst-case model

Validation (Latency)

Write-intensive workload (MSN Server, Financial OLTP)

Latencies of NANDFlashSim almost completely overlap with real product latencies (Hynix)

Read-intensive workload (Websearch, User)

Performance of Multiple Planes

• Write performance is significantly enhanced as the number of planes increases
  – Cell activities (TIN) can be executed in parallel
• Data movement (TOR) is a dominant factor in determining bandwidth

[Figure: write and read bandwidth (MB/s) on a single die for 2048-, 4096-, and 8192-byte transfers under legacy, cache-mode, two-, four-, six-, and eight-plane operation; larger transfer sizes couldn't take advantage of multiple planes well]

Performance of Multiple Dies

• Similar to multi-plane, write performance is improved by increasing the number of dies
• A multi-die architecture provides slightly worse performance than multi-plane

[Figure: write and read bandwidth (MB/s) versus package type - 1 (SDP), 2 (DDP), 4 (QDP), and 8 (ODP) dies - for legacy, cache-mode, two-plane, and two-plane cache-mode operations]

Multi-plane vs. Multi-die

• Under disk-friendly workloads
  – The performance of interleaved-die operation is 54.5% better than multi-plane operation on average
  – Interleaved-die operations have fewer addressing restrictions

[Figure: IOPS for the msnfs, web, fin, usr, and prn workloads as the number of planes (left) and dies (right) scales from 1 to 16]

Breakdown of Cycles

• While writes spend most of their cycles in the NAND flash cells themselves, reads spend at least 50.5% of the total time on data movement over the bus

[Figure: breakdown of cycles (ns) into bus activities (CLE, ALE, TIR, TOR, TON) and memory cell activities (TIN) for legacy, cache-mode, two-plane (2x), and 2x cache operations; bus-side shares of 3.5%~7.0% for writes versus 50.5%~66.9% for reads]

Page Migration Test

[Figure: total cycles (ns) and energy (uJ) versus migration block size (2, 4, 8, 16, and 32 blocks) for legacy, cache mode, internal data move, two-plane (2x), 2x cache mode, and 2x internal operations, broken down by ALE, CLE, TIR, TOR, TIN, TON, and BER stages; energy consumption depends on the number of active components]

Conclusion & Future Works

• A research vehicle for evaluating parallelism and architecture trends
  – Single instance
    • Integrating it into GEM5 and Simics
    • Plan to apply it to Green Flash and Xtensa of CoDEx
  – Multiple instances
    • We successfully built a multi-channel SSD framework with 1024 instances (~16384 dies, ~131072 planes)
• Open source project
  – Static/shared library
  – Standalone simulation

Q & A

• Download
  – http://www.cse.psu.edu/~mqj5086/nfs/
• Mailing list
  – [email protected]
• Thanks to
  – Dean Klein, Micron Technology, Inc.
  – Seung-hwan Song, University of Minnesota
  – Michael Kim, Corelinks
  – Kurt Lee, Corelinks
  – Leonard Ko, Corelinks
  – Yulwon Cho, Stanford University