[537] Flash
Tyler Harter
pages.cs.wisc.edu/~harter/537/lec-24.pdf

Flash vs. Disk

Disk Overview
I/O requires: seek, rotate, transfer

Inherently:
- not parallel (only one head)
- slow (mechanical)
- poor random I/O (locality around disk head)

Random requests take 10ms+

Flash Overview
Flash holds charge in cells. No moving parts!

Inherently parallel.

No seeks!

Disks vs. Flash: Performance
Throughput:
- disk: ~130 MB/s (sequential)
- flash: ~400 MB/s

Latency:
- disk: ~10 ms (one op)
- flash:
  - read: 10-50 us
  - program: 200-500 us
  - erase: 2 ms

(program and erase are both types of write; more on this later…)

Source: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44830.pdf

Disks vs. Flash: Capacity
An obvious question is why are we talking about spinning disks at all, rather than SSDs, which have higher IOPS and are the “future” of storage. The root reason is that the cost per GB remains too high, and more importantly that the growth rates in capacity/$ between disks and SSDs are relatively close (at least for SSDs that have sufficient numbers of program-erase cycles to use in data centers), so that cost will not change enough in the coming decade. We do make extensive use of SSDs, but primarily for high performance workloads and caching, and this helps disks by shifting seeks to SSDs.

~ Eric Brewer et al.

Flash Architecture

SLC: Single-Level Cell
[figure: a NAND cell holds charge; one charge threshold encodes a single bit (1 or 0)]

MLC: Multi-Level Cell
[figure: a NAND cell with four charge levels encodes two bits (00, 01, 10, 11)]

Single- vs. Multi-Level Cell
- SLC: expensive, robust
- MLC: cheap, sensitive

Wearout
Problem: flash cells wear out after being overwritten too many times.
- MLC: ~10K overwrites
- SLC: ~100K overwrites

Usage strategy: wear leveling.
- prevents some cells from wearing out while others are still fresh.
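The details vary by device, but the core idea can be sketched in a few lines; this is a hypothetical illustration (block numbers and erase counts are made up), not how any particular SSD implements it:

# Wear-leveling sketch: always reuse the free block that has been erased the
# fewest times, so no single block wears out far ahead of the others.
erase_counts = {0: 120, 1: 5, 2: 5000, 3: 42}   # block -> lifetime erase count
free_blocks = {0, 1, 3}                          # blocks currently free

def pick_block():
    """Choose the least-worn free block for the next write."""
    return min(free_blocks, key=lambda b: erase_counts[b])

def erase(block):
    """Erasing a block wears it a little more and returns it to the free pool."""
    erase_counts[block] += 1
    free_blocks.add(block)

print(pick_block())   # 1 (only 5 erases so far)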

Banks
Flash devices are divided into banks (aka planes).
Banks can be accessed in parallel.
[figure: Bank 0 through Bank 3; reads are issued to multiple banks and their data returns in parallel]

Flash Writes
Writing 0’s:
- fast, fine-grained [pages]
- called “program”

Writing 1’s:
- slow, coarse-grained [blocks]
- called “erase”

Each bank contains many “blocks”.

[figure: one block, drawn as eight pages, each erased to all 1s (1111 1111); one row is highlighted as “one page”]

Program: flips selected bits of one page from 1 to 0 (in the figure, a page goes from 1111 1111 to 1001 1111, then to 1001 1100; another page is programmed to 1110 0001). A program can only clear bits; it cannot turn a 0 back into a 1.

Erase: resets every page of the block back to 1111 1111.

APIs
       disk           flash
read   read sector    read page
write  write sector   program page (0’s)
                      erase block (1’s)

Flash Hierarchy
Bank/plane: 1024 to 4096 blocks
- banks accessed in parallel

Block: 64 to 256 pages
- unit of erase

Page: 2 to 8 KB
- unit of read and program
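For a sense of scale, here is a quick sketch with hypothetical values picked from the ranges above (not the geometry of any particular device):

blocks_per_bank = 2048    # somewhere in the 1024-4096 range
pages_per_block = 128     # somewhere in the 64-256 range
page_kb = 4               # somewhere in the 2-8 KB range

block_kb = pages_per_block * page_kb                     # unit of erase: 512 KB
bank_gb = blocks_per_bank * block_kb / (1024 * 1024)     # ~1 GB per bank/plane
print(block_kb, bank_gb)   # 512 1.0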

Disk vs. Flash Performance
Throughput:
- disk: ~130 MB/s (sequential)
- flash: ~400 MB/s

Latency:
- disk: ~10 ms (one op)
- flash:
  - read: 10-50 us
  - program: 200-500 us
  - erase: 2 ms

Working with File Systems

Traditional File Systems
[figure: the File System sits on top of a Storage Device]
Traditional API:
- read sector
- write sector
(not the same as the flash API)

Options
1. Build/use new file systems designed for flash
   - JFFS, YAFFS
   - a lot of work!

2. Build the traditional API over the flash API.
   - use FFS, LFS, whatever we want

Traditional API with Flash

read(addr):
    return flash_read(addr)

write(addr, data):
    block_copy = flash_read(all pages of block)
    modify block_copy with data
    flash_erase(block of addr)
    flash_program(all pages of block_copy)
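As a runnable version of this pseudocode, here is a minimal sketch using a hypothetical in-memory flash with 4 pages per block; the class and its names are illustrative, not a real device interface:

PAGES_PER_BLOCK = 4

class Flash:
    """Toy flash: pages are 8-character bit strings; erased pages are all 1s."""
    def __init__(self, num_blocks):
        self.pages = ["1" * 8 for _ in range(num_blocks * PAGES_PER_BLOCK)]

    def read(self, page):
        return self.pages[page]

    def program(self, page, data):
        # programming can only clear bits (1 -> 0), never set them
        assert all(new == "0" or old == "1"
                   for old, new in zip(self.pages[page], data))
        self.pages[page] = data

    def erase(self, block):
        start = block * PAGES_PER_BLOCK
        for p in range(start, start + PAGES_PER_BLOCK):
            self.pages[p] = "1" * 8

def sector_write(flash, addr, data):
    """Emulate a disk-style write-sector using read/erase/program."""
    block = addr // PAGES_PER_BLOCK
    first = block * PAGES_PER_BLOCK
    block_copy = [flash.read(p) for p in range(first, first + PAGES_PER_BLOCK)]
    block_copy[addr - first] = data          # modify the target page in memory
    flash.erase(block)                       # whole block back to all 1s
    for i, page in enumerate(block_copy):    # re-program every page
        flash.program(first + i, page)

flash = Flash(num_blocks=3)
sector_write(flash, 0, "00000001")
print(flash.read(0))   # 00000001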

[figure: memory above, flash below; the flash has three blocks of four pages each:
 block 0 = 0000 0011 1100 1111, block 1 = 1111 1111 1111 1111, block 2 = 0001 1111 1111 1111]

The FS wants to write 0001 over the first page of block 0:
1. read all pages of block 0 into memory     (memory: 0000 0011 1100 1111)
2. modify the target page in memory          (memory: 0001 0011 1100 1111)
3. erase block 0                             (flash block 0: 1111 1111 1111 1111)
4. program all pages of block 0 from memory  (flash block 0: 0001 0011 1100 1111)

Write Amplification
Random writes are extremely expensive!

Writing one 2KB page may cause:
- a read, erase, and program of a 256KB block.
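A quick back-of-the-envelope number for the example above:

page_kb = 2       # the write the file system asked for
block_kb = 256    # what actually gets read, erased, and re-programmed

print(block_kb / page_kb)   # 128.0: one small random write can move 128x its size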

Would FFS or LFS be better with flash?

File Systems over Flash
A copy-on-write FS may prevent some expensive random writes.

What about wear leveling? LFS won’t do this.

What if we want to use some other FS?

Flash Translation Layer

Better Solution
Add a copy-on-write layer between the FS and the flash.
This avoids the RMW (read-modify-write) cycle.

Translate logical device addresses to physical addresses.

FTL: Flash Translation Layer.

Should translation use math or a data structure?
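Here is a minimal sketch of the data-structure answer: a page-level mapping table that redirects every overwrite to a fresh page (a toy model with a dict standing in for the flash; illustrative only):

class PageMappedFTL:
    """Toy page-mapped FTL: logical pages are remapped on every write."""
    def __init__(self, num_physical_pages):
        self.store = {}                               # physical page -> data (stand-in for flash)
        self.mapping = {}                             # logical page -> physical page
        self.free = list(range(num_physical_pages))   # erased, programmable pages
        self.garbage = set()                          # invalid pages awaiting GC

    def read(self, logical):
        return self.store[self.mapping[logical]]

    def write(self, logical, data):
        phys = self.free.pop(0)          # always program a fresh page: no read-modify-write
        self.store[phys] = data
        old = self.mapping.get(logical)
        if old is not None:
            self.garbage.add(old)        # the old copy is now stale
        self.mapping[logical] = phys

ftl = PageMappedFTL(num_physical_pages=8)
ftl.write(0, "1101")
ftl.write(0, "0110")      # overwrite lands on a new physical page
print(ftl.read(0))        # 0110
print(ftl.garbage)        # {0}: the first copy must eventually be garbage collected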

Flash Translation Layer
[figure: 8 physical pages in two blocks
 (block 0 = 0001 0010 0011 0000, block 1 = 1001 1111 1111 1111),
 with a logical-to-physical map above]

Write 1101: instead of a read-modify-write of the old block, the FTL programs 1101 into a free page of block 1 and remaps the logical address to it. The old physical page now holds stale data and must eventually be garbage collected.

FTL
Could be implemented as a device driver or in firmware (usually the latter).

Where to store the mapping? SRAM.

Physical pages can be in three states:
- valid, invalid, free

States
[figure: page state machine]
- free -> valid: program
- valid -> invalid: relocate (the FTL has written a newer copy elsewhere) or TRIM
- invalid -> free: erase

Why is file delete problematic? After a delete, the device has no way to know the deleted pages are now garbage.

TRIM: SSD-aware file systems can tell the device that a page is garbage.
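The same state machine, written out as a tiny lookup table (illustrative; real FTLs track this per page in their own metadata):

from enum import Enum

class PageState(Enum):
    FREE = "free"
    VALID = "valid"
    INVALID = "invalid"

# (current state, operation) -> next state
TRANSITIONS = {
    (PageState.FREE, "program"):   PageState.VALID,
    (PageState.VALID, "relocate"): PageState.INVALID,   # FTL wrote a newer copy elsewhere
    (PageState.VALID, "trim"):     PageState.INVALID,   # FS declared the page garbage
    (PageState.INVALID, "erase"):  PageState.FREE,      # erasing the block frees it
}

print(TRANSITIONS[(PageState.FREE, "program")])   # PageState.VALID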

SSD Architecture
[figure: flash banks behind an FTL, with the mapping table held in SRAM; to the host, the SSD looks like a disk]

Problem: Big Mapping Table
Assume a 200GB device, 2KB pages, and 4-byte mapping entries.

SRAM needed: (200GB / 2KB) * 4 bytes = 400 MB.

Too big; SRAM is expensive!
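The arithmetic, spelled out (the slide rounds the result to 400 MB):

device_bytes = 200 * 10**9     # 200 GB device
page_bytes = 2 * 1024          # 2 KB pages
entry_bytes = 4                # one 4-byte mapping entry per page

entries = device_bytes // page_bytes
print(entries, entries * entry_bytes / 10**6)   # ~97.7 million entries, ~391 MB of SRAM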

Page Translations
[figure: 8 physical pages (block 0 = 0001 0010 0011 0000, block 1 = 1001 0101 1111 1111); each logical page has its own mapping entry]

2-Page Translations
[figure: the same flash, but each mapping entry now covers two consecutive pages, halving the table size]

Larger Mappings
Advantage: larger mappings decrease table size.

Disadvantage?

2-Page Translations
[figure: write 1011 to one page of a 2-page mapping unit]
Because the mapping covers two pages, the unmodified partner page (0101) must be copied to the new location along with the new data (1011), and both old pages become garbage.

Larger Mappings
Advantage: larger mappings decrease table size.

Disadvantages?
- more read-modify-write updates
- more garbage
- less flexibility for placement

Hybrid FTL
Use coarse-grained mapping for most (e.g., 95%) of the data. Map at the block level.

Use fine-grained mapping for recent data. Map at the page level.

Hybrid FTL
[figure: logical pages A B C D W X Y Z stored physically as A B C Z W X Y D;
 a coarse block map covers most pages, while the recently written pages are
 tracked in a fine-grained page map: logical 3 -> physical 7, logical 7 -> physical 3]
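A minimal sketch of the hybrid lookup, assuming 4 pages per block; the page-map entries mirror the figure, while the block map values are just illustrative:

PAGES_PER_BLOCK = 4

page_map = {3: 7, 7: 3}      # fine-grained: recently written logical pages
block_map = {0: 0, 1: 1}     # coarse-grained: one entry per logical block (values illustrative)

def translate(logical_page):
    if logical_page in page_map:               # recent data: exact page known
        return page_map[logical_page]
    block, offset = divmod(logical_page, PAGES_PER_BLOCK)
    return block_map[block] * PAGES_PER_BLOCK + offset   # everything else: block granularity

print(translate(3))   # 7  (page-map hit)
print(translate(1))   # 1  (block map: logical block 0 -> physical block 0, offset 1)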

Strategy: Log Blocks
Write changed pages to designated log blocks.

After log blocks become full, merge the changes with the old data.

Eventually garbage collect the old pages.

Merging

Merging technique depends on the I/O pattern.

Three merge types:
- full merge
- partial merge
- switch merge

Full merge example:
[figure: block 0 holds pages A B C D; block 1 is an empty log block; block 2 is empty]
- write D2: the new version of D is appended to the first free page of log block 1, tracked by an expensive page-level mapping
- eventually we need to get rid of these page-level mappings
- COPY: A, B, and C (from block 0) and D2 (from the log block) are copied into block 2
- block 0 and the log block now contain only garbage

Partial merge example:
[figure: block 0 holds pages A B C D; block 1 is an empty log block]
- write D2: this time D2 is placed in the last page of the log block, the same offset D occupies in block 0 (why was this a better pick?)
- A, B, and C are copied from block 0 into the first three pages of the log block, which becomes A B C D2
- only the unmodified pages were copied; block 0 now contains only garbage

Switch merge example:
[figure: block 0 holds pages A B C D; block 1 is an empty log block]
- write A2, then B2, then C2, then D2: the FS writes the whole block sequentially, and each new page lands in the log block at the same offset as the page it replaces
- the log block (A2 B2 C2 D2) simply becomes the new block; nothing is copied
- block 0 now contains only garbage

Merging
Merging technique depends on the I/O pattern.

Three merge types:
- full merge [copy unmodified and modified]
- partial merge [copy unmodified]
- switch merge [no copy]

Why not always do a partial or switch merge?

Summary
Flash is much faster than disk, but…

It is more expensive.

It is not a drop-in replacement beneath a file system without a complex layer (the FTL) emulating the hard-disk API.
