Introduction to File Systems
File System Issues
• What is the role of files? What is the file abstraction?
• File naming. How do we find the file we want? Sharing files. Controlling access to files.
• Performance issues - how to deal with the bottleneck of disks? What is the “right” way to optimize file access?
Role of Files
• Persistence: long-lived data, kept for posterity on non-volatile storage media
• Semantically meaningful (memorable) names
Abstractions

[Figure: layers of abstraction for an address book (record for Duke CPS). User view: the address book. Application: file "addrfile". File system: fid, byte range → block #. Disk subsystem: device, block # → surface, cylinder, sector; bytes move between the layers.]
File Abstractions

• UNIX-like files
  – Sequence of bytes
  – Operations: open (create), close, read, write, seek
• Memory-mapped files
  – Sequence of bytes
  – Mapped into address space
  – Page fault mechanism does data transfer
• Named, possibly typed
Unix File Syscalls

int fd, num, success, bufsize;
char data[bufsize];
long offset, pos;

fd = open (filename, mode [, permissions]);
success = close (fd);
pos = lseek (fd, offset, mode);
num = read (fd, data, bufsize);
num = write (fd, data, bufsize);

Open modes: O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_APPEND, ...
Permissions: user, group, others each get rwx bits (e.g., 111 100 000 = rwx r-- ---).
lseek mode: relative to beginning, current position, or end of file.
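As a sketch, the same syscall sequence from Python's os module, which wraps these Unix calls almost directly (the temp file exists only for the demo):

```python
# open/write/lseek/read/close, mirroring the syscall interface above.
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"hello file systems")
pos = os.lseek(fd, 6, os.SEEK_SET)   # mode: relative to beginning of file
data = os.read(fd, 4)                # reads 4 bytes starting at offset 6
os.close(fd)
os.unlink(path)
```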
Memory Mapped Files

fd = open (somefile, consistent_mode);
pa = mmap(addr, len, prot, flags, fd, offset);

[Figure: a region of length len at address pa in the virtual address space (VAS) is mapped onto the file region of length len starting at fd + offset.]

prot: R, W, X, none
flags: Shared, Private, Fixed, Noreserve

Reading is performed by a load instruction.
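A minimal sketch of the same idea with Python's mmap module, a thin wrapper over the mmap syscall (Unix flags assumed; the temp file is illustrative). Once mapped, reads are loads on the region, and the page-fault mechanism does the data transfer:

```python
import mmap, os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"mapped bytes here")
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)   # map the whole file, read-only
first = m[0:6]                              # a memory access, not a read() syscall
m.close()
os.close(fd)
os.unlink(path)
```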
Nachos File Syscalls/Operations
Create(“zot”);

OpenFileId fd;
fd = Open(“zot”);
Close(fd);

char data[bufsize];
Write(data, count, fd);
Read(data, count, fd);

Limitations:
1. small, fixed-size files and directories
2. single disk with a single directory
3. stream files only: no seek syscall
4. file size is specified at creation time
5. no access control, etc.
Functions of File System
• Determine layout of files and metadata on disk in terms of blocks. Disk block allocation. Bad blocks.
• Handle read and write system calls
• Initiate I/O operations for movement of blocks to/from disk.
• Maintain buffer cache
File System Data Structures
[Figure: the per-process file pointer array in the process descriptor (entries 0–2 are stdin, stdout, stderr) points into the system-wide open file table (r-w position, mode), which points into the system-wide file descriptor table (in-memory copy of inode, ptr to on-disk inode), which locates the file data. Two open file table entries may hold different positions (pos) into the same file.]
UNIX Inodes
[Figure: an inode holds the file attributes plus block addresses: direct data block addresses, then single- (1), double- (2), and triple-indirect (3) pointers that reach the data blocks through one, two, or three levels of index blocks. This decouples meta-data from directory entries.]
File Sharing Between Parent/Child
main(int argc, char *argv[]) {
    char c;
    int fdrd, fdwt, fdpriv;

    if ((fdrd = open(argv[1], O_RDONLY)) == -1)
        exit(1);
    if ((fdwt = creat(argv[2], 0666)) == -1)
        exit(1);
    fork();
    if ((fdpriv = open(argv[3], O_RDONLY)) == -1)
        exit(1);
    while (TRUE) {
        if (read(fdrd, &c, 1) != 1)
            exit(0);
        write(fdwt, &c, 1);
    }
}
File System Data Structures
[Figure: the same structures as before, now with the forked process's descriptor added. Descriptors inherited across the fork share one open file table entry (shared r-w position, mode); a file opened after the fork gets its own open file table entry.]
Sharing Open File Instances
[Figure: parent and child process objects (user ID, process ID, process group ID, parent PID, signal state, siblings, children) each hold process file descriptors pointing to a shared entry in the system open file table; that entry carries the shared seek offset and points to the shared file (inode or vnode).]
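The shared seek offset is observable from user space. As a deterministic sketch, this uses dup() instead of fork(): a duplicated descriptor shares the same open file table entry, exactly as descriptors inherited across fork do, so a read through one advances the offset seen by the other:

```python
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"abcdefgh")
os.lseek(fd, 0, os.SEEK_SET)

fd2 = os.dup(fd)              # shares the r-w position with fd
part1 = os.read(fd, 4)        # advances the shared offset
part2 = os.read(fd2, 4)       # continues where fd left off
os.close(fd)
os.close(fd2)
os.unlink(path)
```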
Directory Subsystem
• Map filenames to fileids: open (create) syscall. Create kernel data structures.
• Maintain naming structure (unlink, mkdir, rmdir)
Pathname Resolution
[Figure: resolving “cps110/current/Proj/proj3”. Start at the index node of the working directory (cps110); its directory node maps the name “current” to an inode #; that inode's directory node maps “Proj” to an inode #; that directory node maps “proj3” to the inode (file attributes) of the proj3 data file.]
Access Patterns Along the Way
Proc → Dir Subsys:
  open(/foo/bar/file); read(fd,buf,sizeof(buf)); read(fd,buf,sizeof(buf)); read(fd,buf,sizeof(buf)); close(fd);
Dir Subsys → File Sys:
  read(rootdir); read(inode); read(foo); read(inode); read(bar); read(inode); then read(fd,buf,sizeof(buf)) three times
File Sys → Device Subsys:
  read(rootdir); read(inode); read(foo); read(inode); read(bar); read(inode); read(filedatablock);
Functions of Device Subsystem
In general, deal with device characteristics.
• Translate block numbers (the abstraction of the device shown to the file system) to physical disk addresses. Device-specific intelligent placement of blocks (subject to change with upgrades in technology).
• Schedule (reorder?) disk operations
What to do about Disks?
• Avoid them altogether! Caching
• Disk scheduling
  – Idea is to reorder outstanding requests to minimize seeks.
• Layout on disk
  – Placement to minimize disk overhead
• Build a better disk (or substitute)
  – Example: RAID
Disk Scheduling for Requests
• Minimize seek AND rotational delay
• Maintain a queue of requests on EACH cylinder
  – Sort each queue by sector (rotational position)
  – At one cylinder, process requests in rotational order to minimize rotational delay
• Move on to the next cylinder in the current direction once all requests on the current cylinder are processed; reverse direction only when no requests remain in that direction.
  – “Elevator” algorithm
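A minimal sketch of the elevator (SCAN) discipline above at the cylinder level; the rotational ordering within a cylinder is omitted for brevity, and the request numbers are illustrative:

```python
def elevator(requests, head, direction=+1):
    """Return the order in which cylinder requests are serviced."""
    pending = sorted(set(requests))
    order = []
    while pending:
        if direction > 0:
            batch = [c for c in pending if c >= head]
        else:
            batch = [c for c in pending if c <= head][::-1]
        if not batch:                 # nothing left this way: reverse
            direction = -direction
            continue
        order.extend(batch)
        head = batch[-1]
        pending = [c for c in pending if c not in batch]
    return order
```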
Redundant Array of Inexpensive Disks

• Parallel seeks and data transmission
• Each record is “smeared” across disks so that some bytes come from each disk
  – All disks positioned at the same cylinder, track
  – Add an extra redundant disk for error checking
• Reading one file is much faster if no seek is needed
• Error correction is possible, so re-reading on error is not needed
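The “smeared across disks plus one redundant disk” layout can be sketched as byte-level striping with an XOR parity disk (RAID-3 style, to name the technique; sizes and data are illustrative). Because parity is the XOR of the data bytes, any single lost disk can be rebuilt:

```python
def stripe(record, ndisks):
    """Spread record bytes round-robin over ndisks data disks + 1 parity disk."""
    disks = [bytearray() for _ in range(ndisks)]
    parity = bytearray()
    for i in range(0, len(record), ndisks):
        chunk = record[i:i + ndisks].ljust(ndisks, b"\0")
        for d in range(ndisks):
            disks[d].append(chunk[d])
        p = 0
        for b in chunk:
            p ^= b
        parity.append(p)
    return disks, parity

def recover(disks, parity, lost):
    """Rebuild the lost data disk: XOR parity with the surviving disks."""
    out = bytearray(parity)
    for d, disk in enumerate(disks):
        if d != lost:
            for i, b in enumerate(disk):
                out[i] ^= b
    return bytes(out)
```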
File Naming
Goals of File Naming

• Foremost function: to find files (e.g., in open()); map file name to file object.
• To store meta-data about files.
• To allow users to choose their own file names without undue name conflict problems.
• To allow sharing.
• Convenience: short names, groupings.
• To avoid implementation complications.
Possible Naming Structures

• Flat name space: one system-wide table
  – Unique naming with multiple users is hard; name conflicts.
  – Easy sharing, but a need for protection.
• Per-user name space
  – Protection by isolation, no sharing
  – Easy to avoid name conflicts
  – Associate each process with a directory used to resolve names, and allow the user to change this “current working directory” (cd).
Naming Structures
Naming network
• Component names: pathnames
  – Absolute pathnames: from a designated root
  – Relative pathnames: from a working directory
  – Each name carries how to resolve it.
• Could allow defining short names for files anywhere in the network. This might produce cycles, but it makes naming things more convenient.
Full Naming Network*
• /Jamie/lynn/project/D
• /Jamie/d
• /Jamie/lynn/jam/proj1/C
• A (relative from Terry)
• d (relative from Jamie)

[Figure: a naming network rooted at root, with nodes Jamie, Lynn/lynn, Terry, project, proj1, jam, TA, grp1 and files A, B, C, D, E, d; several different names reach the same objects. * not Unix]
Full Naming Network*

[Figure: the same naming network and example names as above, revisited. * Unix? Why?]
Meta-Data
• File size
• File type
• Protection: access control information
• History: creation time, last modification, last access
• Location of file: which device
• Location of individual blocks of the file on disk
• Owner of file
• Group(s) of users associated with file
Restricting to a Hierarchy
• Problems with full naming network
  – What does it mean to “delete” a file?
  – Meta-data interpretation
Operations on Directories (UNIX)
• link (oldpathname, newpathname): make an entry pointing to a file
• unlink (filename): remove the entry pointing to a file
• mknod (dirname, type, device): used (e.g., by the mkdir utility function) to create a directory (or named pipe, or special file)
• getdents (fd, buf, structsize): read directory entries
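A few of these operations can be exercised from Python as a sketch (the temp paths exist only for the demo). link() adds a second directory entry for an existing file; unlink() removes one entry, and the file survives as long as other entries remain:

```python
import os, tempfile

d = tempfile.mkdtemp()
old = os.path.join(d, "old")
new = os.path.join(d, "new")
with open(old, "w") as f:
    f.write("shared contents")

os.link(old, new)                 # second entry naming the same file
nlink = os.stat(old).st_nlink     # link count is now 2: two names, one inode
os.unlink(old)                    # remove one entry; the file lives on
with open(new) as f:
    survived = f.read()
```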
Reclaiming Storage
[Figure: a naming network (root, Jo, Jamie, Terry, project, proj1, jam, TA, grp1, joe, files A–E, d) after a series of unlinks, each marked X. What should be deallocated?]
Reclaiming Storage

[Figure: the same network and unlinks, with reference counts (1, 2, 3, ...) annotated on the surviving objects.]
Reference Counting?

[Figure: the same series of unlinks. With cycles in the naming network, reference counts of unreachable storage need not drop to zero, so counting alone is not enough.]
Garbage Collection
[Figure: mark-and-sweep over the naming network. Phase 1: mark reachable objects (*). Phase 2: collect unmarked storage.]
Restricting to a Hierarchy
• Problems with full naming network
  – What does it mean to “delete” a file?
  – Meta-data interpretation
• Eliminating cycles
  – allows use of reference counts for reclaiming file space
  – avoids garbage collection
Given: Naming Hierarchy (because of implementation issues)

[Figure: a Unix-style tree rooted at /, with bin (ls, sh), etc, tmp, usr (project, users), and vmunix; under project, packages is a mount point covering a volume whose root contains tex and emacs; leaves are files.]
A Typical Unix File Tree

[Figure: a tree rooted at /, with bin (ls, sh), etc, tmp, usr (project, users), and vmunix; /usr/project/packages/coveredDir is a mount point at which a volume's root contents (tex, emacs) become visible.]

File trees are built by grafting volumes from different devices or from network servers. Each volume is a set of directories and files; a host's file tree is the set of directories and files visible to processes on a given host. In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.

mount (coveredDir, volume)
  coveredDir: directory pathname
  volume: device specifier or network volume
The volume root contents become visible at pathname coveredDir, e.g. /usr/project/packages/coveredDir/emacs.
Reclaiming Convenience
• Symbolic links: indirect files. The filename maps not to a file object but to another pathname.
  – allows short aliases
  – slightly different semantics
• Search path rules
Unix File Naming (Hard Links)

[Figure: directory A maps rain → 32 and hail → 48; directory B maps wind → 18 and sleet → 48. The entries hail and sleet both name inode 48, whose link count = 2.]

A Unix file may have multiple names. Each directory entry naming the file is called a hard link. Each inode contains a reference count showing how many hard links name it.

link system call: link (existing name, new name)
  create a new name for an existing file; increment inode link count
unlink system call (“remove”): unlink (name)
  destroy the directory entry; decrement inode link count;
  if count = 0 and the file is not in active use, free blocks (recursively) and the on-disk inode
Unix Symbolic (Soft) Links

• Unix files may also be named by symbolic (soft) links.
  – A soft link is a file containing a pathname of some other file.

[Figure: directory A maps rain → 32 and hail → 48 (inode 48, link count = 1); directory B maps wind → 18 and sleet → 67; inode 67 is a symlink file whose contents are the pathname “../A/hail”.]

symlink system call: symlink (existing name, new name)
  allocate a new file (inode) with type symlink;
  initialize the file contents with the existing name;
  create a directory entry for the new file with the new name

The target of the link may be removed at any time, leaving a dangling reference. How should the kernel handle recursive soft links?

Convenience, but not performance!
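A sketch of this behavior from Python: the link stores a pathname, and removing the target leaves a dangling reference (the link itself still exists, but it resolves to nothing). Paths are temp files created for the demo:

```python
import os, tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "hail")
link = os.path.join(d, "sleet")
with open(target, "w") as f:
    f.write("precipitation")

os.symlink(target, link)          # the link file contains the pathname
via_link = open(link).read()      # resolves through the link

os.unlink(target)                 # target removed: the link now dangles
dangling = os.path.islink(link) and not os.path.exists(link)
```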
Soft vs. Hard Links

What's the difference in behavior?

[Figure, shown in steps: Terry, Lynn, and Jamie each hold a name for the same file under /. One of the names is then removed (X). With hard links, the file persists as long as any link remains; a soft link naming the removed path is left dangling.]
File Structure
After Resolving Long Pathnames
OPEN("/usr/faculty/carla/classes/cps110/spring02/lectures/lecture13.ppt",…)
Finally Arrive at File

• What do users seem to want from the file abstraction?
• What do these usage patterns mean for file structure and implementation decisions?
  – What operations should be optimized 1st?
  – How should files be structured?
  – Is there temporal locality in file usage?
  – How long do files really live?
Know your Workload!
• File usage patterns should influence design decisions. Do things differently depending on:
  – How large are most files? How long-lived? Read vs. write activity? Shared often?
  – Different levels “see” a different workload.
• Feedback loop: usage patterns observed today shape file system design and implementation, which in turn shapes future usage.
Generalizations from UNIX Workloads
• Standard disclaimers that you can't generalize… but anyway:
• Most files are small (fit into one disk block), although most bytes are transferred from longer files.
• Most opens are for read mode; most bytes transferred are by read operations.
• Accesses tend to be sequential and to cover 100% of the file.
More on Access Patterns
• There is significant reuse (re-opens): most opens go to files repeatedly opened, and quickly. Directory nodes and executables also exhibit good temporal locality.
  – Looks good for caching!
• Use of temp files is a significant part of file system activity in UNIX: very limited reuse, short lifetimes (less than a minute).
File Structure Implementation: Mapping File Blocks

• Contiguous
  – 1 block pointer; causes fragmentation; growth is a problem.
• Linked
  – each block points to the next block, the directory points to the first; OK for sequential access
• Indexed
  – index structure required; better for random access into the file.
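The indexed scheme can be sketched with an inode-like map of direct pointers plus a single indirect block; the sizes and block numbers here are illustrative, not those of any real file system. Random access needs no chain traversal:

```python
NDIRECT = 12
PTRS_PER_BLOCK = 1024

def block_for(inode, logical_block):
    """Map a file-relative block number to a disk block via the inode."""
    if logical_block < NDIRECT:
        return inode["direct"][logical_block]
    idx = logical_block - NDIRECT
    indirect = inode["indirect"]       # one extra disk read in reality
    return indirect[idx]               # direct random access from here

inode = {"direct": list(range(100, 112)),
         "indirect": list(range(5000, 5000 + PTRS_PER_BLOCK))}
```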
UNIX Inodes

[Figure: the UNIX inode again: file attributes, direct block addresses, then single- (1), double- (2), and triple-indirect (3) pointers reaching the data blocks; meta-data decoupled from directory entries.]
File Allocation Table (FAT)
[Figure: directory entries for Lecture.ppt, Pic.jpg, and Notes.txt each point to a starting entry in the FAT; each FAT entry holds the number of the file's next block, and each chain ends at an eof marker.]
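FAT-style linked allocation through a table can be sketched in a few lines: each FAT entry names the file's next block, and a sentinel marks end-of-file. The table contents below are illustrative:

```python
EOF = -1

def file_blocks(fat, start):
    """Follow the FAT chain from a file's starting block to eof."""
    blocks = []
    b = start
    while b != EOF:
        blocks.append(b)
        b = fat[b]
    return blocks

fat = {2: 5, 5: 7, 7: EOF,    # one file occupying blocks 2 -> 5 -> 7
       3: 4, 4: EOF}          # another file in blocks 3 -> 4
```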
Access Control for Files
• Access control lists - detailed list attached to file of users allowed (denied) access, including kind of access allowed/denied.
• UNIX RWX - owner, group, everyone
UNIX access control
• Each file carries its access control with it.

  rwx rwx rwx  + setuid bit
  (owner UID | group GID | everybody else)

  When the setuid bit is set, it allows a process executing the object to assume the UID of the owner temporarily: it enters the owner's domain (rights amplification).

• Owner has chmod, chgrp rights (granting, revoking)
More on Access Control Later
Files to Blocks: Allocation
• Clustering
• Log Structure
What to do about Disks?
• Avoid them altogether! Caching
• Disk scheduling
  – Idea is to reorder outstanding requests to minimize seeks.
• Layout on disk
  – Placement to minimize disk overhead
• Build a better disk (or substitute)
  – Example: RAID
Goal: Good Layout on Disk
• Placement to minimize disk overhead: can address both seek and rotational latency
• Cluster related things together (e.g. an inode and its data, inodes in same directory (ls command), data blocks of multi-block file, files in same directory)
• Sub-block allocation to reduce fragmentation for small files
• Log-Structured File Systems
Effect of Clustering

Access time = seek time + rotational delay + transfer time
  average seek time = 2 ms for an intra-cylinder-group seek, let's say
  rotational delay = 8 ms for a full rotation at 7200 RPM; average delay = 4 ms
  transfer time = 1 ms for an 8KB block at 8 MB/s

8KB blocks deliver about 15% of disk bandwidth.
64KB blocks/clusters deliver about 50% of disk bandwidth.
128KB blocks/clusters deliver about 70% of disk bandwidth.

Actual performance will likely be better with good disk layout, since most seek/rotate delays to read the next block/cluster will be “better than average”.
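These percentages follow from utilization = transfer / (seek + rotation + transfer). A quick check with the slide's parameters (2 ms seek, 4 ms average rotational delay, 8 MB/s transfer rate); the slide's figures are rounded:

```python
def utilization(block_kb, seek_ms=2.0, rot_ms=4.0, mb_per_s=8.0):
    transfer_ms = block_kb / mb_per_s    # 1 MB/s is 1 KB/ms, so this is in ms
    return transfer_ms / (seek_ms + rot_ms + transfer_ms)

u8, u64, u128 = utilization(8), utilization(64), utilization(128)
# roughly 0.14, 0.57, and 0.73: about 15%, 50-ish%, and 70% as on the slide
```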
Block Allocation to Disk Layout
• The level of indirection in the file block maps allows flexibility in file layout.
• “File system design is 99% block allocation.” [McVoy]
• Competing goals for block allocation:– allocation cost– bandwidth for high-volume transfers– efficient directory operations
• Goal: reduce disk arm movement and seek overhead.
• metric of merit: bandwidth utilization
FFS and LFS

Two different approaches to block allocation:
• Cylinder groups in the Fast File System (FFS) [McKusick81]
  – clustering enhancements [McVoy91], and improved cluster allocation [McKusick: Smith/Seltzer96]
  – FFS can also be extended with metadata logging [e.g., Episode]
• Log-Structured File System (LFS)
  – proposed in [Douglis/Ousterhout90]
  – implemented/studied in [Rosenblum91]
  – BSD port, sort of maybe: [Seltzer93]
  – extended with self-tuning methods [Neefe/Anderson97]
• Other approach: extent-based file systems
FFS Cylinder Groups

• FFS defines cylinder groups as the unit of disk locality, and it factors locality into allocation choices.
  – typical: thousands of cylinders, dozens of groups
  – Strategy: place “related” data blocks in the same cylinder group whenever possible.
    • seek latency is proportional to seek distance
  – Smear large files across groups: place a run of contiguous blocks in each group.
  – Reserve inode blocks in each cylinder group: this allows inodes to be allocated close to their directory entries and close to their data blocks (for small files).
FFS Allocation Policies

1. Allocate file inodes close to their containing directories.
   For mkdir, select a cylinder group with a more-than-average number of free inodes.
   For creat, place the inode in the same group as the parent.
2. Concentrate related file data blocks in cylinder groups.
   Most files are read and written sequentially.
   Place the initial blocks of a file in the same group as its inode.
     How should we handle directory blocks?
   Place adjacent logical blocks in the same cylinder group.
     Logical block n+1 goes in the same group as block n.
     Switch to a different group for each indirect block.
Allocating a Block

1. Try to allocate the rotationally optimal physical block after the previous logical block in the file.
   Skip rotdelay physical blocks between each logical block. (rotdelay is 0 on track-caching disk controllers.)
2. If not available, find another block at a nearby rotational position in the same cylinder group.
   We'll need a short seek, but we won't wait for the rotation.
   If not available, pick any other block in the cylinder group.
3. If the cylinder group is full, or we're crossing to a new indirect block, go find a new cylinder group.
   Pick a block at the beginning of a run of free blocks.
Small Files with Bigger Blocks

• Internal fragmentation in the file system blocks can waste significant space for small files.
  – E.g., 1KB files waste 87% of disk space (and bandwidth) in a naive file system with an 8KB block size.
  – Most files are small: one study [Irlam93] shows a median of 22KB.
• FFS solution: optimize small files for space efficiency.
  – Subdivide blocks into 2/4/8 fragments (or just frags).
  – Free block maps contain one bit for each fragment.
    • To determine if a block is free, examine the bits for all its fragments.
  – The last block of a small file is stored on fragment(s).
    • If multiple fragments, they must be contiguous.
Clustering in FFS

• Clustering improves bandwidth utilization for large files read and written sequentially.
  – Allocate clumps/clusters/runs of blocks contiguously; read/write the entire clump in one operation with at most one seek.
  – Typical cluster sizes: 32KB to 128KB.
• FFS can allocate contiguous runs of blocks “most of the time” on disks with sufficient free space.
  – This (usually) occurs as a side effect of setting rotdelay = 0.
• Newer versions may relocate files to clusters of contiguous storage if the initial allocation did not succeed in placing them well.
  – Must modify the buffer cache to group buffers together and read/write in contiguous clusters.
Log-Structured File Systems

• Assumption: the cache is effectively filtering out reads, so we should optimize for writes.
• Basic idea: manage the disk as an append-only log (subsequent writes involve minimal head movement).
• Data and meta-data (mixed) are accumulated in large segments and written contiguously.
• Reads work as in UNIX: once the inode is found, data blocks are located via the index.
• Cleaning is an issue: to produce contiguous free space, correct the fragmentation that develops over time.
• Claim: LFS can use 70% of disk bandwidth for writing, while Unix FFS can typically use only 5-10% because of seeks.
LFS logs

In LFS, all block and metadata allocation is log-based.
• LFS views the disk as “one big log” (logically).
• All writes are clustered and sequential/contiguous.
  – Intermingles metadata and blocks from different files.
• Data is laid out on disk in the order it is written.
• No-overwrite allocation policy: if an old block or inode is modified, write it to a new location at the tail of the log.
• LFS uses (mostly) the same metadata structures as FFS; only the allocation scheme is different.
  – Cylinder group structures and free block maps are eliminated.
  – Inodes are found by indirecting through a new map.
LFS Data Structures on Disk

• Inode: in log, same as FFS
• Inode map: in log; locates the position of each inode, version, time of last access
• Segment summary: in log; identifies the contents of the segment (file #, offset for each block in the segment)
• Segment usage table: in log; counts live bytes in the segment and last write time
• Checkpoint region: fixed location on disk; locates the blocks of the inode map, identifies the last checkpoint in the log
• Directory change log: in log; records directory operations to maintain consistency of ref counts in inodes
Structure of the Log
[Figure: the log grows from the fixed checkpoint region toward the clean region. Legend: inode map block, inode, directory node (D1), data blocks (File 1 block 1, File 2 block 2), and segment summary/usage/dirlog blocks.]
Writing the Log in LFS

1. LFS “saves up” dirty blocks and dirty inodes until it has a full segment (e.g., 1 MB).
   – Dirty inodes are grouped into block-sized clumps.
   – Dirty blocks are sorted by (file, logical block number).
   – Each log segment includes summary info and a checksum.
2. LFS writes each log segment in a single burst, with at most one seek.
   – Find a free segment “slot” on the disk, and write it.
   – Store a back pointer to the previous segment.
     • Logically the log is sequential, but physically it consists of a chain of segments, each large enough to amortize seek overhead.
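The append-only, no-overwrite discipline above can be sketched in a few lines; the in-memory `latest` map is an illustrative stand-in for the inode map, and segment boundaries are omitted. Superseded copies become dead space for the cleaner:

```python
class Log:
    def __init__(self):
        self.log = []       # the disk, viewed as one big log
        self.latest = {}    # (file, block) -> log position of the live copy

    def write(self, file, block, data):
        self.log.append((file, block, data))   # always append at the tail
        self.latest[(file, block)] = len(self.log) - 1

    def read(self, file, block):
        return self.log[self.latest[(file, block)]][2]

    def live_fraction(self):
        return len(self.latest) / len(self.log)
```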
Growth of the Log

write (file1, block1); creat (D1/file3); write (file3, block1)

[Figure: the new copies of file1 block 1, file3 block 1, directory D1, and the updated inode map blocks are appended at the log tail, ahead of the clean region.]
Death in the Log

write (file1, block1); creat (D1/file3); write (file3, block1)

[Figure: the same log; the superseded earlier copies of file1 block 1, D1, and their inode map blocks are now dead space within older segments.]
Writing the Log: the Rest of the Story
1. LFS cannot always delay writes long enough to accumulate a full segment; sometimes it must push a partial segment.
   – fsync, update daemon, NFS server, etc.
   – Directory operations are synchronous in FFS, and some must be in LFS as well to preserve failure semantics and ordering.
2. LFS allocation and write policies affect the buffer cache, which is supposed to be filesystem-independent.
   – Pin (lock) dirty blocks until the segment is written; dirty blocks cannot be recycled off the free chain as before.
   – Endow *indirect blocks with permanent logical block numbers suitable for hashing in the buffer cache.
Cleaning in LFS

What does LFS do when the disk fills up?
1. As the log is written, blocks and inodes written earlier in time are superseded (“killed”) by versions written later.
   – files are overwritten or modified; inodes are updated
   – when files are removed, blocks and inodes are deallocated
2. A cleaner daemon compacts the remaining live data to free up large hunks of free space suitable for writing segments.
   – look for segments with little remaining live data
     • benefit/cost analysis to choose segments
   – write the remaining live data to the log tail
   – can consume a significant share of bandwidth, and there are lots of cost/benefit heuristics involved.
Cleaning the Log

[Figure, three steps: (1) a segment holds dead copies of File 1 block 1 and D1 alongside live data (File 1 block 2, File 2); (2) the cleaner copies the live blocks File 1 block 2 and File 2 to the log tail; (3) the source segment is now clean and reusable.]
Cleaning Issues

• Must be able to identify which blocks are live.
• Must be able to identify the file to which each block belongs, in order to update the inode to the new location.
• The segment summary block contains this info:
  – File contents are associated with a uid (version # and inode #).
  – Inode entries contain a version # (incremented on truncate).
  – Compare the two to see if the inode still points to the block under consideration.
Policies

• When the cleaner cleans: threshold based
• How much: 10s of segments at a time, until the threshold is reached
• Which segments:
  – The most fragmented segment is not the best choice.
  – The value of free space in a segment depends on the stability of the live data (approximated by age).
  – Cost/benefit analysis:
      benefit = free space available (1-u) * age of youngest block
      cost = cost to read the segment + cost to move the live data
  – The segment usage table supports this.
• How to group live blocks
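The cost/benefit rule above can be sketched as follows, with u the fraction of the segment still live; the cost term (read one segment, rewrite u of it) and the numbers are illustrative. It shows why the most fragmented segment is not automatically the best choice: old, stable free space wins:

```python
def benefit_cost(u, age):
    """benefit = (1 - u) * age; cost = read whole segment + write live data."""
    benefit = (1.0 - u) * age
    cost = 1.0 + u          # 1 segment read, plus u of a segment rewritten
    return benefit / cost

def pick_segment(segments):
    """segments: list of (name, u, age); return the best one to clean."""
    return max(segments, key=lambda s: benefit_cost(s[1], s[2]))
```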
Recovering Disk Contents

• Checkpoints define consistent states.
  – A position in the log where all data structures are consistent.
  – The checkpoint region (fixed location) contains the addresses of all blocks of the inode map and segment usage table, plus a ptr to the last segment written.
    • There are actually 2 regions that alternate, in case a crash occurs while writing checkpoint region data.
• Roll-forward recovers beyond the last checkpoint.
  – Uses segment summary blocks at the end of the log: if we find new inodes, update the inode map found from the checkpoint.
  – Adjust utilizations in the segment usage table.
  – Restore consistency in ref counts within inodes and in directory entries pointing to those inodes, using the directory operation log (like an intentions list).
Recovery of the Log
[Figure: the log since the last checkpoint; roll-forward scans the segments (file and directory blocks, inode map blocks) written after the point recorded in the checkpoint region.]
Recovery in Unix
fsck
• Traverses the directory structure checking ref counts of inodes
• Traverses inodes and freelist to check block usage of all disk blocks
File Caching
• What to cache?
• How to manage the file buffer cache?
File Buffer Cache
[Figure: processes and the VM system sit above a file cache held in memory.]

What do we want to cache?
• inodes
• dir nodes
• whole files
• disk blocks

The answer will determine where in the request path the cache sits, and how. Virtual memory and the file buffer cache will compete for physical memory.
Access Patterns Along the Way
[The same open/read trace as before, now with a cache interposed between the process and the file system, filtering the requests it can serve.]
Why Are File Caches Effective?

1. Locality of reference: storage accesses come in clumps.
   – Spatial locality: if a process accesses data in block B, it is likely to reference other nearby data soon (e.g., the remainder of block B).
     Example: reading or writing a file one byte at a time.
   – Temporal locality: recently accessed data is likely to be used again.
2. Read-ahead: if we can predict what blocks will be needed soon, we can prefetch them into the cache.
   – Most files are accessed sequentially.
What to Cache? Locality in File Access Patterns (UNIX Workloads)

• Most files are small (often fitting into one disk block), although most bytes are transferred from longer files.
• Accesses tend to be sequential and to cover 100% of the file.
  – Spatial locality.
  – What happens when we cache a huge file?
• Most opens are for read mode; most bytes transferred are by read operations.
• There is significant reuse (re-opens): most opens go to files repeatedly opened, and quickly. Directory nodes and executables also exhibit good temporal locality.
  – Looks good for caching!
• Use of temp files is a significant part of file system activity in UNIX: very limited reuse, short lifetimes (less than a minute).
• Long absolute pathnames are common in file opens.
  – Name resolution can dominate performance. Why?
What to Cache? Locality in File Access Patterns
(continued)

[Figure: the file-system data structures from before (per-process file pointer array with stdin/stdout/stderr, system-wide open file table with r-w position and mode, system-wide file descriptor table with the in-memory inode copy, file data). A read(fd, buf, num) names a byte #, while the file buffer cache below, which shares memory with the VM system, is indexed by block #: a translation is needed.]

Caching File Blocks
Caching as “The Answer”

• Avoid the disk for as many file operations as possible.
• The cache acts as a filter for the requests seen by the disk.
• Reads are served best: reuse.
• Delayed writeback will avoid going to disk at all for temp files.

[Figure: process → file cache in memory → disk.]
How to Manage the I/O Cache?

Goal: maintain K slots in memory as a cache over a collection of m items on secondary storage (K << m).

1. What happens on the first access to each item?
   Fetch it into some slot of the cache, use it, and leave it there to speed up access if it is needed again later.
2. How to determine if an item is resident in the cache?
   Maintain a directory of items in the cache: a hash table. Hash on a unique identifier (tag) for the item (fully associative).
3. How to find a slot for an item fetched into the cache?
   Choose an unused slot, or select an item to replace according to some policy, and evict it from the cache, freeing its slot.
File Block Buffer Cache

HASH(vnode, logical block)

Most systems use a pool of buffers in kernel memory as a staging area for memory<->disk transfers. Buffers with valid data are retained in memory in a buffer cache or file cache. Each item in the cache is a buffer header pointing at a buffer. Blocks from different files may be intermingled in the hash chains.

System data structures hold pointers to buffers only when I/O is pending or imminent.
  – busy bit instead of refcount
  – most buffers are “free”

[Figure: hash chains of buffer headers, plus a free/inactive list running from head to tail.]
Handling Updates in the File Cache
1. Blocks may be modified in memory once they have been brought into the cache.
   Modified blocks are dirty and must (eventually) be written back.
   Write-back vs. write-through (104?)
2. Once a block is modified in memory, the write back to disk may not be immediate (synchronous).
   Delayed writes absorb many small updates with one disk write.
     How long should the system hold dirty data in memory?
   Asynchronous writes allow overlapping of computation and disk update activity (write-behind).
     Do the write call for block n+1 while the transfer of block n is in progress.
   Thus file caches also can improve performance for writes.
3. Knowing data gets to disk: force it, but you can't trust a plain “write” syscall - use fsync.
Mechanism for Cache Eviction/Replacement
• Typical approach: maintain an ordered free/inactive list of slots that are candidates for reuse.– Busy items in active use are not on the list.
• E.g., some in-memory data structure holds a pointer to the item.
• E.g., an I/O operation is in progress on the item.
– The best candidates are slots that do not contain valid items.• Initially all slots are free, and they may become free again as items
are destroyed (e.g., as files are removed).
– Other slots are listed in order of value of the items they contain.• These slots contain items that are valid but inactive: they are held
in memory only in the hope that they will be accessed again later.
Replacement Policy

• The effectiveness of a cache is determined largely by the policy for ordering slots/items on the free/inactive list. This defines the replacement policy.
• A typical cache replacement policy is LRU:
  – Assume hot items used recently are likely to be used again.
  – Move the item to the tail of the free list on every release.
  – The item at the front of the list is the coldest inactive item.
• Other alternatives:
  – FIFO: replace the oldest item.
  – MRU/LIFO: replace the most recently used item.
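The LRU free-list discipline above can be sketched with an ordered map as the recency list: every use moves the item to the tail, and eviction takes the head (the coldest item). The class and slot count are illustrative:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, nslots):
        self.nslots = nslots
        self.slots = OrderedDict()   # head = coldest, tail = hottest

    def access(self, tag):
        """Return True on a hit; on a miss, fetch and evict if full."""
        hit = tag in self.slots
        if hit:
            self.slots.move_to_end(tag)        # to the tail on every use
        else:
            if len(self.slots) >= self.nslots:
                self.slots.popitem(last=False) # evict coldest, at the front
            self.slots[tag] = True
        return hit
```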
[Figure: the buffer cache hash chains (HASH(vnode, logical block)) and the free/inactive list, ordered from head (coldest) to tail (hottest).]
Viewing Memory as a Unified I/O Cache

A key role of the I/O system is to manage the page/block cache for performance and reliability:
  – tracking cache contents and managing page/block sharing
  – choreographing movement to/from external storage
  – balancing competing uses of memory

Modern systems attempt to balance memory usage between the VM system and the file cache (a unified buffer cache):
  – Grow the file cache for file-intensive workloads.
  – Grow the VM page cache for memory-intensive workloads.
  – Support a consistent view of files across different styles of access.
Prefetching
• To avoid the access latency of moving the data in for that first cache miss.
• Prediction! “Guessing” what data will be needed in the future. How?
  – It's not for free: there are consequences of guessing wrong.
    Overhead: removal of useful stuff, disk bandwidth consumed.
Intrafile prediction

• Sequential access suggests prefetching block n+1 when n is requested.
• Upon seek (sequentiality is broken):
  – Stop prefetching
  – Detect a “stride” or pattern automatically
  – Depend on hints from the program
    • Compiler-generated “prefetch” statements
    • User supplied
  – How often is this issue relevant? Big files, nonsequential files, predictable accesses.
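The basic sequential rule above (prefetch n+1 while requests stay sequential, stop on a seek) can be sketched as:

```python
def prefetch_decisions(requests):
    """For each requested block, return the block to prefetch (or None)."""
    decisions = []
    prev = None
    for n in requests:
        if prev is None or n == prev + 1:
            decisions.append(n + 1)    # sequential: read ahead one block
        else:
            decisions.append(None)     # a seek broke sequentiality: stop
        prev = n
    return decisions
```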
Interfile prediction

• Related files: what does that mean?
  – Directory nodes that are ancestors of a cached object must be cached in order to resolve its pathname.
  – Detection of “file working sets”:
    • Trees representing program executions are constructed.
  – Capture semantic relationships among files in a “semantic distance” measure (the SEER system).