File Systems of Solid-State Drives
A Brief Teaching by Mark Veronda
Tuesday, November 4, 2008
Outline
• Solid-State Drives (SSD)
• History and Motivations
• Properties
• Implications
• Log-Structured (LS) File Systems
• What they are
• Why they are good for SSDs
• Implementations in Current Use
• Background
File Systems
• Storing
• Organizing
• Easy to find/access/allocate files
File Systems
• What have we been shown?
• Hierarchical Directory
• Unix and inodes
• Mainly stuff that is good for old-fashioned, slow, slow, slow disks
Block Devices
• Two (rough) Categories of I/O-devices
• Character Devices
• e.g., Keyboard
• Block Devices
• e.g., A hard-drive
Solid-State Drive
• Wikipedia Definition: “A data storage device that uses solid-state memory to store persistent data.”
• Aren’t magnets “solid-state?”
• No moving parts!
Outline
• Solid-State Drives (SSD)
• History and Motivations
• Properties
• Implications
✓ Background
• Log-Structured (LS) File Systems
• What they are
• Why they are good for SSDs
• Implementations in Current Use
History
• First Solid-State Drive?
• 1976!
• Magnetic Core Memory
• Yes, that’s why we call it “Core Dump” today
• An “SSD” from the Dataram Corporation would cost $95,000 (in 2008 dollars) for the amazing amount of 1 MB
History
• Bubble Memory
• Hyped by Intel in the early 1980s.
• Used in the Atari system
• Too expensive
History
• 1991
• DEC builds servers with SSDs
• 1995
• Sun begins to offer them as well
• 1999
• 11 manufacturers of Solid-State Drives
• 2001
• 22 manufacturers
Why Use SSD?
• No moving parts
• Reduced energy usage
• Better reliability
• Faster
Properties of an SSD
• Two types of SSDs in use today:
• Non-Volatile RAM
• NAND-Gate Flash memory
Non-Volatile RAM
• Treat RAM as a block device.
• Usually, a battery pack provides the non-volatility.
Non-Volatile RAM
• Pros:
• Speed
• Purely random I/O
• Cons:
• Expen$$$ive!!!
• Not the best way to ensure non-volatility
NAND-Gate
• NAND - Not AND
The Problem with NAND
• Floating-Gate Transistors
• Wear leveling...
Flash Memory (with NAND-gates)
• A flash memory package consists of:
• One or more dies (AKA: chips)
• Shares an 8-bit serial I/O bus
• Contains 8,192 blocks
• Organized as 4 planes of 2,048 blocks
• Each block has 64 4KB pages
• A page has an additional 128 bytes for metadata (ID and ECC)
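Taking the slide’s geometry at face value, a quick back-of-the-envelope calculation (a sketch; the per-die numbers are exactly the ones quoted above) gives the data capacity of a single die:

```python
# Flash package geometry quoted on the slide (per die).
planes_per_die = 4
blocks_per_plane = 2048
pages_per_block = 64
page_data_bytes = 4 * 1024     # 4KB of data per page
page_spare_bytes = 128         # metadata: ID and ECC

blocks_per_die = planes_per_die * blocks_per_plane        # 8,192 blocks
pages_per_die = blocks_per_die * pages_per_block
data_bytes_per_die = pages_per_die * page_data_bytes      # 2 GiB of data
spare_bytes_per_die = pages_per_die * page_spare_bytes    # 64 MiB of metadata
```

So the 8,192 blocks quoted above work out to 2 GiB of user data per die, plus 64 MiB of out-of-band metadata.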
[Diagram: an I/O serial connection feeds Die0 (Chip0), which contains Plane0–Plane3; each plane has a 4KB register and an array of blocks]
Flash Memory
• Reading:
• Fast
• Writing:
• Slow
• Changes bits from 1 to 0
• Done at the page level
• Erasing:
• Much slower
• Changes bits from 0 to 1
• Must be done at the block level
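A toy model makes the program/erase asymmetry concrete. This is a sketch, not real firmware: pages are single bytes, and `program`/`erase_block` are hypothetical names.

```python
ERASED = 0xFF  # an erased NAND cell reads as all 1s

def program(old: int, new: int) -> int:
    """A program operation can only flip bits from 1 to 0."""
    if new & ~old & 0xFF:
        raise ValueError("cannot flip a 0 back to 1; the block must be erased")
    return new

def erase_block(block: list) -> list:
    """Erasing resets every page in the block to all 1s (the slow operation)."""
    return [ERASED] * len(block)

block = erase_block([0] * 4)                  # a freshly erased 4-page "block"
block[0] = program(block[0], 0b10101010)      # fine: only 1 -> 0 transitions
try:
    block[0] = program(block[0], 0b11111111)  # would need 0 -> 1: not allowed
except ValueError:
    block = erase_block(block)                # must erase the whole block
```

This is why in-place updates are impossible on NAND: rewriting even one page of stale data forces an erase of the entire block that contains it.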
Another Problem with Flash
• Some quick statistics:
• Page read to register: 25 µs
• Write page from register: 200 µs
• Block erase: 1.5 ms
• Serial access to register (data bus): 100 µs
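These timings imply rough single-plane throughput ceilings. The sketch below assumes one page in flight at a time with no pipelining between the serial bus and the flash array, which is the simplest (and most pessimistic) model:

```python
PAGE_BYTES = 4 * 1024
T_READ  = 25e-6    # page read from array into the register
T_WRITE = 200e-6   # program a page from the register into the array
T_BUS   = 100e-6   # move one page over the serial data bus

# One read = array-to-register + register-to-host transfer.
read_mb_per_s = PAGE_BYTES / (T_READ + T_BUS) / 1e6     # ~32.8 MB/s
# One write = host-to-register transfer + program.
write_mb_per_s = PAGE_BYTES / (T_BUS + T_WRITE) / 1e6   # ~13.7 MB/s
```

Note that the 100 µs bus transfer, not the cell array, dominates read time, and that the 1.5 ms block erase is an additional cost that must be amortized over many page writes.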
[Figure diagram: Host Interface Logic (host interconnect, buffer manager), processor, RAM, and a flash demux/mux connecting the SSD controller to multiple flash packages]
Figure 3: SSD Logic Components
…address to physical flash location. The processor, buffer-manager, and multiplexer are typically implemented in a
discrete component such as an ASIC or FPGA, and data
flow between these logic elements is very fast. The pro-
cessor, and its associated RAM, may be integrated, as
is the case for simple USB flash-stick devices, or stan-
dalone as for designs with more substantial processing
and memory requirements.
As described in Section 2, flash packages export an
8-bit wide serial data interface with a similar number of
control pins. A 32GB SSD with 8 of the Samsung parts
would require 136 pins at the flash controller(s) just for
the flash components. With such a device, it might be
possible to achieve full interconnection between the flash
controller(s) and flash packages, but for larger configura-
tions this is not likely to remain feasible. For the mo-
ment, we assume full interconnection between data path,
control logic, and flash. We return to the issue of inter-
connect density in Section 3.3.
This paper is primarily concerned with the organiza-
tion of the flash array and the algorithms needed to man-
age mappings between logical disk and physical flash ad-
dresses. It is beyond the scope of this paper to tackle the
many important issues surrounding the design and layout
of SSD logic components.
3.1 Logical Block Map
As pointed out by Birrell et al. [2], the nature of NAND
flash dictates that writes cannot be performed in place as
on a rotating disk. Moreover, to achieve acceptable per-
formance, writes must be performed sequentially when-
ever possible, as in a log. Since each write of a single
logical-disk block address (LBA) corresponds to a write
of a different flash page, even the simplest SSD must
maintain some form of mapping between logical block
address and physical flash location. We assume that the
logical block map is held in volatile memory and recon-
structed from stable storage at startup time.
We frame the discussion of logical block maps us-
ing the abstraction of an allocation pool to think about
how an SSD allocates flash blocks to service write re-
quests. When handling a write request, each target log-
ical page (4KB) is allocated from a pre-determined pool
of flash memory. The scope of an allocation pool might
be as small as a flash plane or as large as multiple flash
packages. When considering the properties of allocation
pools, the following variables come to mind.
• Static map. A portion of each LBA constitutes a
fixed mapping to a specific allocation pool.
• Dynamic map. The non-static portion of a LBA is
the lookup key for a mapping within a pool.
• Logical page size. The size for the referent of a
mapping entry might be as large as a flash block
(256KB), or as small as a quarter-page (1KB).
• Page span. A logical page might span related pages
on different flash packages thus creating the poten-
tial for accessing sections of the page in parallel.
These variables are then bound by three constraints:
• Load balancing. Optimally, I/O operations should
be evenly balanced between allocation pools.
• Parallel access. The assignment of LBAs to phys-
ical addresses should interfere as little as possible
with the ability to access those LBAs in parallel. So,
for example if LBA0..LBAn are always accessed at
the same time, they should not be stored on a com-
ponent that requires each to be accessed in series.
• Block erasure. Flash pages cannot be re-written
without first being erased. Only fixed-size blocks of
contiguous pages can be erased.
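The static/dynamic split described above can be sketched in a few lines. Everything here (the pool count, the modulo split, dict-based maps) is illustrative, not the paper’s actual layout:

```python
# Hypothetical SSD with 4 allocation pools; the low bits of the LBA select the
# pool (static map), and the rest is a lookup key within the pool (dynamic map).
NUM_POOLS = 4

class AllocationPool:
    def __init__(self):
        self.next_free = 0   # next free physical page, allocated log-style
        self.table = {}      # dynamic map: key -> physical page

    def write(self, key: int) -> int:
        self.table[key] = self.next_free   # old page (if any) becomes garbage
        self.next_free += 1
        return self.table[key]

pools = [AllocationPool() for _ in range(NUM_POOLS)]

def write_lba(lba: int) -> tuple[int, int]:
    pool_id = lba % NUM_POOLS                       # static portion of the LBA
    page = pools[pool_id].write(lba // NUM_POOLS)   # dynamic portion
    return pool_id, page
```

Writing the same LBA twice lands on two different physical pages; the superseded page must eventually be reclaimed by a block erase, which is exactly the cleaning problem discussed below.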
SSD Internals
Current State of SSDs
• Two Manufacturers of note
• Samsung
• Texas Memory Systems
• Both offer proprietary “optimizations” to ensure wear-leveling.
Implications of SSD
• Ramón Cáceres, Fred Douglis, Kai Li, Brian Marsh (1993)
• Predicts that future (mobile) computers will consist of NV-RAM for running applications and a backing store of flash memory for persistent data.
Effects on File Systems
• No need to “cluster” data for locality
• Need to avoid writes to the same block
• File System cache unnecessary because data and metadata can be quickly accessed.
Outline
✓ Solid-State Drives (SSD)
✓ History and Motivations
✓ Properties
✓ Implications
✓ Background
• Log-Structured (LS) File Systems
• What they are
• Why they are good for SSDs
• Implementations in Current Use
Log-Structured File System
• Concept introduced (and implemented) by Rosenblum and Ousterhout in 1991
• Main premises:
• CPU speeds are increasing exponentially.
• RAM sizes are increasing exponentially.
• Hard-drive speeds have remained relatively constant.
• Conclusion?
• I/O will primarily consist of writes.
Motivating Example
• The BSD FFS (Fast File System), as of 1992, is terribly inefficient for writes.
• e.g., it takes 5 seeks to add one small file to a folder.
• What about for SSD with no seek-time?
• 5 locations will be rewritten
Log-Structured Overview
• Contains concepts similar to the file systems that we have seen.
• e.g., inodes and directories.
• The log is the only data structure.
• Perform write operations in a batch
• Pro: reduces number of seeks.
• Con: increases potential for data-loss.
Two Issues
• 1) How to retrieve information from the log?
• 2) How to manage free space so that there is always a large enough free extent to perform each write as a single operation?
(1) Information Retrieval
• With FFS :
• inode number --> disk address of inode.
• With a log-structured file system :
• Introduce new data structure: inode map.
Inode Map
• The inode map is divided into blocks that are written to the log.
• Fix (to a location on the disk) a checkpoint region of all the inode map blocks.
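A lookup then takes two hops: checkpoint region to inode-map block, inode-map block to inode. A sketch with dictionaries standing in for disk reads (the addresses and the entries-per-block constant are made up for illustration):

```python
ENTRIES_PER_MAP_BLOCK = 512

# Fixed checkpoint region: inode-map block number -> current disk address.
checkpoint = {0: 1000}
# Simulated disk: address -> inode-map block (inode number -> inode address).
disk = {1000: {7: 5040}}

def inode_address(inode_number: int) -> int:
    map_block = inode_number // ENTRIES_PER_MAP_BLOCK
    map_addr = checkpoint[map_block]        # always findable: fixed location
    return disk[map_addr][inode_number]     # the map block itself lives in the log
```

When an inode-map block is rewritten to the log, only its entry in the checkpoint region changes; the lookup path stays the same.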
Information Retrieval
[Excerpt from M. Rosenblum and J. K. Ousterhout, “The Design and Implementation of a Log-Structured File System”]
[Figure 1 diagram: the single log write performed by Sprite LFS vs. the scattered disk blocks written by Unix FFS for dir1/file1 and dir2/file2]
Fig. 1. A comparison between Sprite LFS and Unix FFS. This example shows the modified disk blocks written by Sprite LFS and Unix FFS when creating two single-block files named dir1/file1 and dir2/file2. Each system must write new data blocks and inodes for file1 and file2, plus new data blocks and inodes for the containing directories. Unix FFS requires ten nonsequential writes for the new information (the inodes for the new files are each written twice to ease recovery from crashes), while Sprite LFS performs the operations in a single large write. The same number of disk accesses will be required to read the files in the two systems. Sprite LFS also writes out new inode map blocks to record the new inode locations.
[Figure 2 diagram: a threaded log (old log end, new log end, old and new data blocks, previously deleted blocks) contrasted with copy and compact]
Fig. 2. Possible free space management solutions for log-structured file systems. In a log-structured file system, free space for the log can be generated either by copying the old blocks or by threading the log around the old blocks. The left side of the figure shows the threaded log approach where the log skips over the active blocks and overwrites blocks of files that have been deleted or overwritten. Pointers between the blocks of the log are maintained so that the log can be followed during crash recovery. The right side of the figure shows the copying scheme where log space is generated by reading the section of disk after the end of the log and rewriting the active blocks of that section along with the new data into the newly generated space.
be no faster than traditional file systems. The second alternative is to copy
live data out of the log in order to leave large free extents for writing. For
this paper we will assume that the live data is written back in a compacted
form at the head of the log; it could also be moved to another log-structured
file system to form a hierarchy of logs, or it could be moved to some totally
different file system or archive. The disadvantage of copying is its cost,
particularly for long-lived files; in the simplest case where the log works
circularly across the disk and live data is copied back into the log, all of the
long-lived files will have to be copied in every pass of the log across the disk.
Sprite LFS uses a combination of threading and copying. The disk is
divided into large fixed-size extents called segments. Any given segment is
always written sequentially from its beginning to its end, and all live data
must be copied out of a segment before the segment can be rewritten.
However, the log is threaded on a segment-by-segment basis; if the system
ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992.
(2) Free-Space
• Need to maintain large enough extents to write new data.
• Two approaches:
• Threading
• Copy and Compact
Threading
• Leave live data in place
• “Thread” the log through the free spaces between extents of live data.
• Drawback?
• Severe fragmentation
Copy and Compact
• Copy live data out of log
• Write the new data
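In miniature, copy and compact looks like this (purely illustrative; in a real LFS the `live` set comes from segment summary information, as described below):

```python
def copy_and_compact(segment: list, live: set, new_data: list) -> list:
    """Rewrite the live blocks of a cleaned segment, compacted together with
    new data, at the head of the log; the old segment then becomes free."""
    return [b for b in segment if b in live] + new_data
```

The dead blocks simply drop out, so each pass trades copying cost for large contiguous free extents.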
Free Space management
operation is complete, the segments that were read are marked as clean, and they can be used for new data or for additional cleaning.
As part of segment cleaning it must be possible to identify which blocks of each segment are live, so that they can be written out again. It must also be possible to identify the file to which each block belongs and the position of the block within the file; this information is needed in order to update the file’s inode to point to the new location of the block. Sprite LFS solves both of these problems by writing a segment summary block as part of each segment. The summary block identifies each piece of information that is written in the segment; for example, for each file data block the summary block contains the file number and block number for the block. Segments can contain multiple segment summary blocks when more than one log write is needed to fill the segment. (Partial-segment writes occur when the number of dirty blocks buffered in the file cache is insufficient to fill a segment.) Segment summary blocks impose little overhead during writing, and they are useful during crash recovery (see Section 4) as well as during cleaning.
Sprite LFS also uses the segment summary information to distinguish live blocks from those that have been overwritten or deleted. Once a block’s identity is known, its liveness can be determined by checking the file’s inode or indirect block to see if the appropriate block pointer still refers to this block. If it does, then the block is live; if it doesn’t, then the block is dead. Sprite LFS optimizes this check slightly by keeping a version number in the inode map entry for each file; the version number is incremented whenever the file is deleted or truncated to length zero. The version number combined with the inode number form a unique identifier (uid) for the contents of the file. The segment summary block records this uid for each block in the segment; if the uid of a block does not match the uid currently stored in the inode map when the segment is cleaned, the block can be discarded immediately without examining the file’s inode.
This approach to cleaning means that there is no free-block list or bitmap in Sprite. In addition to saving memory and disk space, the elimination of these data structures also simplifies crash recovery. If these data structures existed, additional code would be needed to log changes to the structures and restore consistency after crashes.
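The uid shortcut described above can be sketched as follows (the data shapes are illustrative dictionaries and tuples, not Sprite’s on-disk format):

```python
# Current version number per inode, as kept in the inode map.
inode_map_version = {3: 2, 9: 1}

def maybe_live(uid: tuple[int, int]) -> bool:
    """A block whose recorded (inode, version) uid no longer matches the inode
    map is certainly dead; a matching uid still needs the full inode check."""
    inode_number, version = uid
    return inode_map_version.get(inode_number) == version

# Segment summary uids for three blocks; the middle one was written before its
# file was deleted and re-created (version bumped from 1 to 2), so it is dead.
summary = [(3, 2), (3, 1), (9, 1)]
survivors = [uid for uid in summary if maybe_live(uid)]
```

Only the survivors are then verified against their inodes and copied out, so whole runs of stale blocks are discarded without any inode reads.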
3.4. Segment cleaning policies
Given the basic mechanism described above, four policy issues must be addressed:
(1) When should the segment cleaner execute? Some possible choices are for it to run continuously in background at low priority, or only at night, or only when disk space is nearly exhausted.
(2) How many segments should it clean at a time? Segment cleaning offers an opportunity to reorganize data on disk; the more segments cleaned at once, the more opportunities to rearrange.
(3) Which segments should be cleaned? An obvious choice is the ones that are most fragmented, but this turns out not to be the best choice.
(4) How should the live blocks be grouped when they are written out? One possibility is to try to enhance the locality of future reads, for example by grouping files in the same directory together into a single output segment. Another possibility is to sort the blocks by the time they were last modified and group blocks of similar age together into new segments; we call this approach age sort.
Pros (for SSD)
• Better write efficiency
• Provides wear-leveling
JFFS 1/2
• JFFS - Journalling Flash File System
• Perfect!
• Optimized for NOR-gate flash chips.
• Implements some functionality that is not necessary for NAND-gate chips.
NOR-Gate
• NOR - Not OR
• Low density
• Slower read/write speeds
• Expensive (compared to NAND)
YAFFS
• YAFFS - Yet Another Flash File System
• Specifically Designed for NAND
• Must be perfect!
• Right?
• Maximum file system size: 8 GB
✓ Vote
✓ Solid-State Drives (SSD)
✓ History and Motivations
✓ Properties
✓ Implications
✓ Background
✓ Log-Structured (LS) File Systems
✓ What they are
✓ Why they are good for SSDs
✓ Implementations in Current Use
(May the best File-System for this nation win!)
Thank you for Learning