Introduction to File Systems
File System Issues
• What is the role of files? What is the file abstraction?
• File naming. How do we find the file we want? Sharing files. Controlling access to files.
• Performance issues - how to deal with the bottleneck of disks? What is the “right” way to optimize file access?
Role of Files
• Persistence: long-lived data, kept for posterity on non-volatile storage media
• Semantically meaningful (memorable) names
Abstractions

[Figure: layers of abstraction for an address book (record for Duke CPS). User view: the address book. Application: file "addrfile". File system: fid, byte range → block #. Disk subsystem: device, block # → surface, cylinder, sector; bytes move between the layers.]
File Abstractions

• UNIX-like files
  – Sequence of bytes
  – Operations: open (create), close, read, write, seek
• Memory-mapped files
  – Sequence of bytes
  – Mapped into address space
  – Page fault mechanism does data transfer
• Named, possibly typed
Unix File Syscalls

int fd, num, success, bufsize;
char data[bufsize];
long offset, pos;

fd = open (filename, mode [, permissions]);
success = close (fd);
pos = lseek (fd, offset, mode);
num = read (fd, data, bufsize);
num = write (fd, data, bufsize);

Open modes: O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_APPEND, ...
Permissions: user, group, others each get rwx bits (e.g., 111 100 000 = rwx r-- ---).
lseek mode: relative to beginning, current position, or end of file.
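As a sketch, the same syscall sequence from Python's os module, which wraps these Unix calls almost directly (the temp file exists only for the demo):

```python
# open/write/lseek/read/close, mirroring the syscall interface above.
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"hello file systems")
pos = os.lseek(fd, 6, os.SEEK_SET)   # mode: relative to beginning of file
data = os.read(fd, 4)                # reads 4 bytes starting at offset 6
os.close(fd)
os.unlink(path)
```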
Memory Mapped Files

fd = open (somefile, consistent_mode);
pa = mmap(addr, len, prot, flags, fd, offset);

[Figure: a region of length len at address pa in the virtual address space (VAS) is mapped onto the file region of length len starting at fd + offset.]

prot: R, W, X, none
flags: Shared, Private, Fixed, Noreserve

Reading is performed by a load instruction.
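A minimal sketch of the same idea with Python's mmap module, a thin wrapper over the mmap syscall (Unix flags assumed; the temp file is illustrative). Once mapped, reads are loads on the region, and the page-fault mechanism does the data transfer:

```python
import mmap, os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"mapped bytes here")
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)   # map the whole file, read-only
first = m[0:6]                              # a memory access, not a read() syscall
m.close()
os.close(fd)
os.unlink(path)
```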
Nachos File Syscalls/Operations
Create(“zot”);

OpenFileId fd;
fd = Open(“zot”);
Close(fd);

char data[bufsize];
Write(data, count, fd);
Read(data, count, fd);

Limitations:
1. small, fixed-size files and directories
2. single disk with a single directory
3. stream files only: no seek syscall
4. file size is specified at creation time
5. no access control, etc.
Functions of File System
• Determine layout of files and metadata on disk in terms of blocks. Disk block allocation. Bad blocks.
• Handle read and write system calls
• Initiate I/O operations for movement of blocks to/from disk.
• Maintain buffer cache
File System Data Structures
[Figure: the per-process file pointer array in the process descriptor (entries 0–2 are stdin, stdout, stderr) points into the system-wide open file table (r-w position, mode), which points into the system-wide file descriptor table (in-memory copy of inode, ptr to on-disk inode), which locates the file data. Two open file table entries may hold different positions (pos) into the same file.]
UNIX Inodes
[Figure: an inode holds the file attributes plus block addresses: direct data block addresses, then single- (1), double- (2), and triple-indirect (3) pointers that reach the data blocks through one, two, or three levels of index blocks. This decouples meta-data from directory entries.]
File Sharing Between Parent/Child
main(int argc, char *argv[]) {
    char c;
    int fdrd, fdwt, fdpriv;

    if ((fdrd = open(argv[1], O_RDONLY)) == -1)
        exit(1);
    if ((fdwt = creat(argv[2], 0666)) == -1)
        exit(1);
    fork();
    if ((fdpriv = open(argv[3], O_RDONLY)) == -1)
        exit(1);
    while (TRUE) {
        if (read(fdrd, &c, 1) != 1)
            exit(0);
        write(fdwt, &c, 1);
    }
}
File System Data Structures
[Figure: the same structures as before, now with the forked process's descriptor added. Descriptors inherited across the fork share one open file table entry (shared r-w position, mode); a file opened after the fork gets its own open file table entry.]
Sharing Open File Instances
[Figure: parent and child process objects (user ID, process ID, process group ID, parent PID, signal state, siblings, children) each hold process file descriptors pointing to a shared entry in the system open file table; that entry carries the shared seek offset and points to the shared file (inode or vnode).]
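The shared seek offset is observable from user space. As a deterministic sketch, this uses dup() instead of fork(): a duplicated descriptor shares the same open file table entry, exactly as descriptors inherited across fork do, so a read through one advances the offset seen by the other:

```python
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"abcdefgh")
os.lseek(fd, 0, os.SEEK_SET)

fd2 = os.dup(fd)              # shares the r-w position with fd
part1 = os.read(fd, 4)        # advances the shared offset
part2 = os.read(fd2, 4)       # continues where fd left off
os.close(fd)
os.close(fd2)
os.unlink(path)
```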
Directory Subsystem
• Map filenames to fileids: open (create) syscall. Create kernel data structures.
• Maintain naming structure (unlink, mkdir, rmdir)
Pathname Resolution
[Figure: resolving “cps110/current/Proj/proj3”. Start at the index node of the working directory (cps110); its directory node maps the name “current” to an inode #; that inode's directory node maps “Proj” to an inode #; that directory node maps “proj3” to the inode (file attributes) of the proj3 data file.]
Access Patterns Along the Way
Proc → Dir Subsys:
  open(/foo/bar/file); read(fd,buf,sizeof(buf)); read(fd,buf,sizeof(buf)); read(fd,buf,sizeof(buf)); close(fd);
Dir Subsys → File Sys:
  read(rootdir); read(inode); read(foo); read(inode); read(bar); read(inode); then read(fd,buf,sizeof(buf)) three times
File Sys → Device Subsys:
  read(rootdir); read(inode); read(foo); read(inode); read(bar); read(inode); read(filedatablock);
Functions of Device Subsystem
In general, deal with device characteristics.
• Translate block numbers (the abstraction of the device shown to the file system) to physical disk addresses. Device-specific intelligent placement of blocks (subject to change with upgrades in technology).
• Schedule (reorder?) disk operations
What to do about Disks?
• Avoid them altogether! Caching
• Disk scheduling
  – Idea is to reorder outstanding requests to minimize seeks.
• Layout on disk
  – Placement to minimize disk overhead
• Build a better disk (or substitute)
  – Example: RAID
Disk Scheduling for Requests
• Minimize seek AND rotational delay
• Maintain a queue of requests on EACH cylinder
  – Sort each queue by sector (rotational position)
  – At one cylinder, process requests in rotational order to minimize rotational delay
• Move on to the next cylinder in the current direction once all requests on the current cylinder are processed; reverse direction only when no requests remain in that direction.
  – “Elevator” algorithm
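A minimal sketch of the elevator (SCAN) discipline above at the cylinder level; the rotational ordering within a cylinder is omitted for brevity, and the request numbers are illustrative:

```python
def elevator(requests, head, direction=+1):
    """Return the order in which cylinder requests are serviced."""
    pending = sorted(set(requests))
    order = []
    while pending:
        if direction > 0:
            batch = [c for c in pending if c >= head]
        else:
            batch = [c for c in pending if c <= head][::-1]
        if not batch:                 # nothing left this way: reverse
            direction = -direction
            continue
        order.extend(batch)
        head = batch[-1]
        pending = [c for c in pending if c not in batch]
    return order
```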
Redundant Array of Inexpensive Disks

• Parallel seeks and data transmission
• Each record is “smeared” across disks so that some bytes come from each disk
  – All disks positioned at the same cylinder, track
  – Add an extra redundant disk for error checking
• Reading one file is much faster if no seek is needed
• Error correction is possible, so re-reading on error is not needed
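The “smeared across disks plus one redundant disk” layout can be sketched as byte-level striping with an XOR parity disk (RAID-3 style, to name the technique; sizes and data are illustrative). Because parity is the XOR of the data bytes, any single lost disk can be rebuilt:

```python
def stripe(record, ndisks):
    """Spread record bytes round-robin over ndisks data disks + 1 parity disk."""
    disks = [bytearray() for _ in range(ndisks)]
    parity = bytearray()
    for i in range(0, len(record), ndisks):
        chunk = record[i:i + ndisks].ljust(ndisks, b"\0")
        for d in range(ndisks):
            disks[d].append(chunk[d])
        p = 0
        for b in chunk:
            p ^= b
        parity.append(p)
    return disks, parity

def recover(disks, parity, lost):
    """Rebuild the lost data disk: XOR parity with the surviving disks."""
    out = bytearray(parity)
    for d, disk in enumerate(disks):
        if d != lost:
            for i, b in enumerate(disk):
                out[i] ^= b
    return bytes(out)
```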
File Naming
Goals of File Naming

• Foremost function: to find files (e.g., in open()); map file name to file object.
• To store meta-data about files.
• To allow users to choose their own file names without undue name conflict problems.
• To allow sharing.
• Convenience: short names, groupings.
• To avoid implementation complications.
Possible Naming Structures

• Flat name space: one system-wide table
  – Unique naming with multiple users is hard; name conflicts.
  – Easy sharing, but a need for protection.
• Per-user name space
  – Protection by isolation, no sharing
  – Easy to avoid name conflicts
  – Associate each process with a directory used to resolve names, and allow the user to change this “current working directory” (cd).
Naming Structures
Naming network
• Component names: pathnames
  – Absolute pathnames: from a designated root
  – Relative pathnames: from a working directory
  – Each name carries how to resolve it.
• Could allow defining short names for files anywhere in the network. This might produce cycles, but it makes naming things more convenient.
Full Naming Network*
• /Jamie/lynn/project/D
• /Jamie/d
• /Jamie/lynn/jam/proj1/C
• A (relative from Terry)
• d (relative from Jamie)

[Figure: a naming network rooted at root, with nodes Jamie, Lynn/lynn, Terry, project, proj1, jam, TA, grp1 and files A, B, C, D, E, d; several different names reach the same objects. * not Unix]
Full Naming Network*

[Figure: the same naming network and example names as above, revisited. * Unix? Why?]
Meta-Data
• File size
• File type
• Protection: access control information
• History: creation time, last modification, last access
• Location of file: which device
• Location of individual blocks of the file on disk
• Owner of file
• Group(s) of users associated with file
Restricting to a Hierarchy
• Problems with full naming network
  – What does it mean to “delete” a file?
  – Meta-data interpretation
Operations on Directories (UNIX)
• link (oldpathname, newpathname): make an entry pointing to a file
• unlink (filename): remove the entry pointing to a file
• mknod (dirname, type, device): used (e.g., by the mkdir utility function) to create a directory (or named pipe, or special file)
• getdents (fd, buf, structsize): read directory entries
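A few of these operations can be exercised from Python as a sketch (the temp paths exist only for the demo). link() adds a second directory entry for an existing file; unlink() removes one entry, and the file survives as long as other entries remain:

```python
import os, tempfile

d = tempfile.mkdtemp()
old = os.path.join(d, "old")
new = os.path.join(d, "new")
with open(old, "w") as f:
    f.write("shared contents")

os.link(old, new)                 # second entry naming the same file
nlink = os.stat(old).st_nlink     # link count is now 2: two names, one inode
os.unlink(old)                    # remove one entry; the file lives on
with open(new) as f:
    survived = f.read()
```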
Reclaiming Storage
[Figure: a naming network (root, Jo, Jamie, Terry, project, proj1, jam, TA, grp1, joe, files A–E, d) after a series of unlinks, each marked X. What should be deallocated?]
Reclaiming Storage

[Figure: the same network and unlinks, with reference counts (1, 2, 3, ...) annotated on the surviving objects.]
Reference Counting?

[Figure: the same series of unlinks. With cycles in the naming network, reference counts of unreachable storage need not drop to zero, so counting alone is not enough.]
Garbage Collection
[Figure: mark-and-sweep over the naming network. Phase 1: mark reachable objects (*). Phase 2: collect unmarked storage.]
Restricting to a Hierarchy
• Problems with full naming network
  – What does it mean to “delete” a file?
  – Meta-data interpretation
• Eliminating cycles
  – allows use of reference counts for reclaiming file space
  – avoids garbage collection
Given: Naming Hierarchy (because of implementation issues)

[Figure: a Unix-style tree rooted at /, with bin (ls, sh), etc, tmp, usr (project, users), and vmunix; under project, packages is a mount point covering a volume whose root contains tex and emacs; leaves are files.]
A Typical Unix File Tree

[Figure: a tree rooted at /, with bin (ls, sh), etc, tmp, usr (project, users), and vmunix; /usr/project/packages/coveredDir is a mount point at which a volume's root contents (tex, emacs) become visible.]

File trees are built by grafting volumes from different devices or from network servers. Each volume is a set of directories and files; a host's file tree is the set of directories and files visible to processes on a given host. In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.

mount (coveredDir, volume)
  coveredDir: directory pathname
  volume: device specifier or network volume
The volume root contents become visible at pathname coveredDir, e.g. /usr/project/packages/coveredDir/emacs.
Reclaiming Convenience
• Symbolic links: indirect files. The filename maps not to a file object but to another pathname.
  – allows short aliases
  – slightly different semantics
• Search path rules
Unix File Naming (Hard Links)

[Figure: directory A maps rain → 32 and hail → 48; directory B maps wind → 18 and sleet → 48. The entries hail and sleet both name inode 48, whose link count = 2.]

A Unix file may have multiple names. Each directory entry naming the file is called a hard link. Each inode contains a reference count showing how many hard links name it.

link system call: link (existing name, new name)
  create a new name for an existing file; increment inode link count
unlink system call (“remove”): unlink (name)
  destroy the directory entry; decrement inode link count;
  if count = 0 and the file is not in active use, free blocks (recursively) and the on-disk inode
Unix Symbolic (Soft) Links

• Unix files may also be named by symbolic (soft) links.
  – A soft link is a file containing a pathname of some other file.

[Figure: directory A maps rain → 32 and hail → 48 (inode 48, link count = 1); directory B maps wind → 18 and sleet → 67; inode 67 is a symlink file whose contents are the pathname “../A/hail”.]

symlink system call: symlink (existing name, new name)
  allocate a new file (inode) with type symlink;
  initialize the file contents with the existing name;
  create a directory entry for the new file with the new name

The target of the link may be removed at any time, leaving a dangling reference. How should the kernel handle recursive soft links?

Convenience, but not performance!
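A sketch of this behavior from Python: the link stores a pathname, and removing the target leaves a dangling reference (the link itself still exists, but it resolves to nothing). Paths are temp files created for the demo:

```python
import os, tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "hail")
link = os.path.join(d, "sleet")
with open(target, "w") as f:
    f.write("precipitation")

os.symlink(target, link)          # the link file contains the pathname
via_link = open(link).read()      # resolves through the link

os.unlink(target)                 # target removed: the link now dangles
dangling = os.path.islink(link) and not os.path.exists(link)
```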
Soft vs. Hard Links

What's the difference in behavior?

[Figure, shown in steps: Terry, Lynn, and Jamie each hold a name for the same file under /. One of the names is then removed (X). With hard links, the file persists as long as any link remains; a soft link naming the removed path is left dangling.]
File Structure
After Resolving Long Pathnames
OPEN("/usr/faculty/carla/classes/cps110/spring02/lectures/lecture13.ppt",…)
Finally Arrive at File

• What do users seem to want from the file abstraction?
• What do these usage patterns mean for file structure and implementation decisions?
  – What operations should be optimized 1st?
  – How should files be structured?
  – Is there temporal locality in file usage?
  – How long do files really live?
Know your Workload!
• File usage patterns should influence design decisions. Do things differently depending on:
  – How large are most files? How long-lived? Read vs. write activity? Shared often?
  – Different levels “see” a different workload.
• Feedback loop: usage patterns observed today shape file system design and implementation, which in turn shapes future usage.
Generalizations from UNIX Workloads
• Standard disclaimers that you can't generalize… but anyway:
• Most files are small (fit into one disk block), although most bytes are transferred from longer files.
• Most opens are for read mode; most bytes transferred are by read operations.
• Accesses tend to be sequential and to cover 100% of the file.
More on Access Patterns
• There is significant reuse (re-opens): most opens go to files repeatedly opened, and quickly. Directory nodes and executables also exhibit good temporal locality.
  – Looks good for caching!
• Use of temp files is a significant part of file system activity in UNIX: very limited reuse, short lifetimes (less than a minute).
File Structure Implementation: Mapping File Blocks

• Contiguous
  – 1 block pointer; causes fragmentation; growth is a problem.
• Linked
  – each block points to the next block, the directory points to the first; OK for sequential access
• Indexed
  – index structure required; better for random access into the file.
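The indexed scheme can be sketched with an inode-like map of direct pointers plus a single indirect block; the sizes and block numbers here are illustrative, not those of any real file system. Random access needs no chain traversal:

```python
NDIRECT = 12
PTRS_PER_BLOCK = 1024

def block_for(inode, logical_block):
    """Map a file-relative block number to a disk block via the inode."""
    if logical_block < NDIRECT:
        return inode["direct"][logical_block]
    idx = logical_block - NDIRECT
    indirect = inode["indirect"]       # one extra disk read in reality
    return indirect[idx]               # direct random access from here

inode = {"direct": list(range(100, 112)),
         "indirect": list(range(5000, 5000 + PTRS_PER_BLOCK))}
```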
UNIX Inodes

[Figure: the UNIX inode again: file attributes, direct block addresses, then single- (1), double- (2), and triple-indirect (3) pointers reaching the data blocks; meta-data decoupled from directory entries.]
File Allocation Table (FAT)
[Figure: directory entries for Lecture.ppt, Pic.jpg, and Notes.txt each point to a starting entry in the FAT; each FAT entry holds the number of the file's next block, and each chain ends at an eof marker.]
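FAT-style linked allocation through a table can be sketched in a few lines: each FAT entry names the file's next block, and a sentinel marks end-of-file. The table contents below are illustrative:

```python
EOF = -1

def file_blocks(fat, start):
    """Follow the FAT chain from a file's starting block to eof."""
    blocks = []
    b = start
    while b != EOF:
        blocks.append(b)
        b = fat[b]
    return blocks

fat = {2: 5, 5: 7, 7: EOF,    # one file occupying blocks 2 -> 5 -> 7
       3: 4, 4: EOF}          # another file in blocks 3 -> 4
```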
Access Control for Files
• Access control lists - detailed list attached to file of users allowed (denied) access, including kind of access allowed/denied.
• UNIX RWX - owner, group, everyone
UNIX access control
• Each file carries its access control with it.

  rwx rwx rwx  + setuid bit
  (owner UID | group GID | everybody else)

  When the setuid bit is set, it allows a process executing the object to assume the UID of the owner temporarily: it enters the owner's domain (rights amplification).

• Owner has chmod, chgrp rights (granting, revoking)
More on Access Control Later
Files to Blocks: Allocation
• Clustering
• Log Structure
What to do about Disks?
• Avoid them altogether! Caching
• Disk scheduling
  – Idea is to reorder outstanding requests to minimize seeks.
• Layout on disk
  – Placement to minimize disk overhead
• Build a better disk (or substitute)
  – Example: RAID
Goal: Good Layout on Disk
• Placement to minimize disk overhead: can address both seek and rotational latency
• Cluster related things together (e.g. an inode and its data, inodes in same directory (ls command), data blocks of multi-block file, files in same directory)
• Sub-block allocation to reduce fragmentation for small files
• Log-Structured File Systems
Effect of Clustering

Access time = seek time + rotational delay + transfer time
  average seek time = 2 ms for an intra-cylinder-group seek, let's say
  rotational delay = 8 ms for a full rotation at 7200 RPM; average delay = 4 ms
  transfer time = 1 ms for an 8KB block at 8 MB/s

8KB blocks deliver about 15% of disk bandwidth.
64KB blocks/clusters deliver about 50% of disk bandwidth.
128KB blocks/clusters deliver about 70% of disk bandwidth.

Actual performance will likely be better with good disk layout, since most seek/rotate delays to read the next block/cluster will be “better than average”.
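These percentages follow from utilization = transfer / (seek + rotation + transfer). A quick check with the slide's parameters (2 ms seek, 4 ms average rotational delay, 8 MB/s transfer rate); the slide's figures are rounded:

```python
def utilization(block_kb, seek_ms=2.0, rot_ms=4.0, mb_per_s=8.0):
    transfer_ms = block_kb / mb_per_s    # 1 MB/s is 1 KB/ms, so this is in ms
    return transfer_ms / (seek_ms + rot_ms + transfer_ms)

u8, u64, u128 = utilization(8), utilization(64), utilization(128)
# roughly 0.14, 0.57, and 0.73: about 15%, 50-ish%, and 70% as on the slide
```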
Block Allocation to Disk Layout
• The level of indirection in the file block maps allows flexibility in file layout.
• “File system design is 99% block allocation.” [McVoy]
• Competing goals for block allocation:– allocation cost– bandwidth for high-volume transfers– efficient directory operations
• Goal: reduce disk arm movement and seek overhead.
• metric of merit: bandwidth utilization
FFS and LFS

Two different approaches to block allocation:
• Cylinder groups in the Fast File System (FFS) [McKusick81]
  – clustering enhancements [McVoy91], and improved cluster allocation [McKusick: Smith/Seltzer96]
  – FFS can also be extended with metadata logging [e.g., Episode]
• Log-Structured File System (LFS)
  – proposed in [Douglis/Ousterhout90]
  – implemented/studied in [Rosenblum91]
  – BSD port, sort of maybe: [Seltzer93]
  – extended with self-tuning methods [Neefe/Anderson97]
• Other approach: extent-based file systems
FFS Cylinder Groups

• FFS defines cylinder groups as the unit of disk locality, and it factors locality into allocation choices.
  – typical: thousands of cylinders, dozens of groups
  – Strategy: place “related” data blocks in the same cylinder group whenever possible.
    • seek latency is proportional to seek distance
  – Smear large files across groups: place a run of contiguous blocks in each group.
  – Reserve inode blocks in each cylinder group: this allows inodes to be allocated close to their directory entries and close to their data blocks (for small files).
FFS Allocation Policies

1. Allocate file inodes close to their containing directories.
   For mkdir, select a cylinder group with a more-than-average number of free inodes.
   For creat, place the inode in the same group as the parent.
2. Concentrate related file data blocks in cylinder groups.
   Most files are read and written sequentially.
   Place the initial blocks of a file in the same group as its inode.
     How should we handle directory blocks?
   Place adjacent logical blocks in the same cylinder group.
     Logical block n+1 goes in the same group as block n.
     Switch to a different group for each indirect block.
Allocating a Block

1. Try to allocate the rotationally optimal physical block after the previous logical block in the file.
   Skip rotdelay physical blocks between each logical block. (rotdelay is 0 on track-caching disk controllers.)
2. If not available, find another block at a nearby rotational position in the same cylinder group.
   We'll need a short seek, but we won't wait for the rotation.
   If not available, pick any other block in the cylinder group.
3. If the cylinder group is full, or we're crossing to a new indirect block, go find a new cylinder group.
   Pick a block at the beginning of a run of free blocks.
Small Files with Bigger Blocks

• Internal fragmentation in the file system blocks can waste significant space for small files.
  – E.g., 1KB files waste 87% of disk space (and bandwidth) in a naive file system with an 8KB block size.
  – Most files are small: one study [Irlam93] shows a median of 22KB.
• FFS solution: optimize small files for space efficiency.
  – Subdivide blocks into 2/4/8 fragments (or just frags).
  – Free block maps contain one bit for each fragment.
    • To determine if a block is free, examine the bits for all its fragments.
  – The last block of a small file is stored on fragment(s).
    • If multiple fragments, they must be contiguous.
Clustering in FFS

• Clustering improves bandwidth utilization for large files read and written sequentially.
  – Allocate clumps/clusters/runs of blocks contiguously; read/write the entire clump in one operation with at most one seek.
  – Typical cluster sizes: 32KB to 128KB.
• FFS can allocate contiguous runs of blocks “most of the time” on disks with sufficient free space.
  – This (usually) occurs as a side effect of setting rotdelay = 0.
• Newer versions may relocate files to clusters of contiguous storage if the initial allocation did not succeed in placing them well.
  – Must modify the buffer cache to group buffers together and read/write in contiguous clusters.
Log-Structured File Systems

• Assumption: the cache is effectively filtering out reads, so we should optimize for writes.
• Basic idea: manage the disk as an append-only log (subsequent writes involve minimal head movement).
• Data and meta-data (mixed) are accumulated in large segments and written contiguously.
• Reads work as in UNIX: once the inode is found, data blocks are located via the index.
• Cleaning is an issue: to produce contiguous free space, correct the fragmentation that develops over time.
• Claim: LFS can use 70% of disk bandwidth for writing, while Unix FFS can typically use only 5-10% because of seeks.
LFS logs

In LFS, all block and metadata allocation is log-based.
• LFS views the disk as “one big log” (logically).
• All writes are clustered and sequential/contiguous.
  – Intermingles metadata and blocks from different files.
• Data is laid out on disk in the order it is written.
• No-overwrite allocation policy: if an old block or inode is modified, write it to a new location at the tail of the log.
• LFS uses (mostly) the same metadata structures as FFS; only the allocation scheme is different.
  – Cylinder group structures and free block maps are eliminated.
  – Inodes are found by indirecting through a new map.
LFS Data Structures on Disk

• Inode: in log, same as FFS
• Inode map: in log; locates the position of each inode, version, time of last access
• Segment summary: in log; identifies the contents of the segment (file #, offset for each block in the segment)
• Segment usage table: in log; counts live bytes in the segment and last write time
• Checkpoint region: fixed location on disk; locates the blocks of the inode map, identifies the last checkpoint in the log
• Directory change log: in log; records directory operations to maintain consistency of ref counts in inodes
Structure of the Log
[Figure: the log grows from the fixed checkpoint region toward the clean region. Legend: inode map block, inode, directory node (D1), data blocks (File 1 block 1, File 2 block 2), and segment summary/usage/dirlog blocks.]
Writing the Log in LFS

1. LFS “saves up” dirty blocks and dirty inodes until it has a full segment (e.g., 1 MB).
   – Dirty inodes are grouped into block-sized clumps.
   – Dirty blocks are sorted by (file, logical block number).
   – Each log segment includes summary info and a checksum.
2. LFS writes each log segment in a single burst, with at most one seek.
   – Find a free segment “slot” on the disk, and write it.
   – Store a back pointer to the previous segment.
     • Logically the log is sequential, but physically it consists of a chain of segments, each large enough to amortize seek overhead.
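The append-only, no-overwrite discipline above can be sketched in a few lines; the in-memory `latest` map is an illustrative stand-in for the inode map, and segment boundaries are omitted. Superseded copies become dead space for the cleaner:

```python
class Log:
    def __init__(self):
        self.log = []       # the disk, viewed as one big log
        self.latest = {}    # (file, block) -> log position of the live copy

    def write(self, file, block, data):
        self.log.append((file, block, data))   # always append at the tail
        self.latest[(file, block)] = len(self.log) - 1

    def read(self, file, block):
        return self.log[self.latest[(file, block)]][2]

    def live_fraction(self):
        return len(self.latest) / len(self.log)
```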
Growth of the Log

write (file1, block1); creat (D1/file3); write (file3, block1)

[Figure: the new copies of file1 block 1, file3 block 1, directory D1, and the updated inode map blocks are appended at the log tail, ahead of the clean region.]
Death in the Log

write (file1, block1); creat (D1/file3); write (file3, block1)

[Figure: the same log; the superseded earlier copies of file1 block 1, D1, and their inode map blocks are now dead space within older segments.]
Writing the Log: the Rest of the Story
1. LFS cannot always delay writes long enough to accumulate a full segment; sometimes it must push a partial segment.
   – fsync, update daemon, NFS server, etc.
   – Directory operations are synchronous in FFS, and some must be in LFS as well to preserve failure semantics and ordering.
2. LFS allocation and write policies affect the buffer cache, which is supposed to be filesystem-independent.
   – Pin (lock) dirty blocks until the segment is written; dirty blocks cannot be recycled off the free chain as before.
   – Endow *indirect blocks with permanent logical block numbers suitable for hashing in the buffer cache.
Cleaning in LFS

What does LFS do when the disk fills up?
1. As the log is written, blocks and inodes written earlier in time are superseded (“killed”) by versions written later.
   – files are overwritten or modified; inodes are updated
   – when files are removed, blocks and inodes are deallocated
2. A cleaner daemon compacts the remaining live data to free up large hunks of free space suitable for writing segments.
   – look for segments with little remaining live data
     • benefit/cost analysis to choose segments
   – write the remaining live data to the log tail
   – can consume a significant share of bandwidth, and there are lots of cost/benefit heuristics involved.
Cleaning the Log

[Figure, three steps: (1) a segment holds dead copies of File 1 block 1 and D1 alongside live data (File 1 block 2, File 2); (2) the cleaner copies the live blocks File 1 block 2 and File 2 to the log tail; (3) the source segment is now clean and reusable.]
Cleaning Issues

• Must be able to identify which blocks are live.
• Must be able to identify the file to which each block belongs, in order to update the inode to the new location.
• The segment summary block contains this info:
  – File contents are associated with a uid (version # and inode #).
  – Inode entries contain a version # (incremented on truncate).
  – Compare the two to see if the inode still points to the block under consideration.
Policies

• When the cleaner cleans: threshold based
• How much: 10s of segments at a time, until the threshold is reached
• Which segments:
  – The most fragmented segment is not the best choice.
  – The value of free space in a segment depends on the stability of the live data (approximated by age).
  – Cost/benefit analysis:
      benefit = free space available (1-u) * age of youngest block
      cost = cost to read the segment + cost to move the live data
  – The segment usage table supports this.
• How to group live blocks
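The cost/benefit rule above can be sketched as follows, with u the fraction of the segment still live; the cost term (read one segment, rewrite u of it) and the numbers are illustrative. It shows why the most fragmented segment is not automatically the best choice: old, stable free space wins:

```python
def benefit_cost(u, age):
    """benefit = (1 - u) * age; cost = read whole segment + write live data."""
    benefit = (1.0 - u) * age
    cost = 1.0 + u          # 1 segment read, plus u of a segment rewritten
    return benefit / cost

def pick_segment(segments):
    """segments: list of (name, u, age); return the best one to clean."""
    return max(segments, key=lambda s: benefit_cost(s[1], s[2]))
```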
Recovering Disk Contents

• Checkpoints define consistent states.
  – A position in the log where all data structures are consistent.
  – The checkpoint region (fixed location) contains the addresses of all blocks of the inode map and segment usage table, plus a ptr to the last segment written.
    • There are actually 2 regions that alternate, in case a crash occurs while writing checkpoint region data.
• Roll-forward recovers beyond the last checkpoint.
  – Uses segment summary blocks at the end of the log: if we find new inodes, update the inode map found from the checkpoint.
  – Adjust utilizations in the segment usage table.
  – Restore consistency in ref counts within inodes and in directory entries pointing to those inodes, using the directory operation log (like an intentions list).
Recovery of the Log
[Figure: the log since the last checkpoint; roll-forward scans the segments (file and directory blocks, inode map blocks) written after the point recorded in the checkpoint region.]
Recovery in Unix
fsck
• Traverses the directory structure checking ref counts of inodes
• Traverses inodes and freelist to check block usage of all disk blocks
File Caching
• What to cache?
• How to manage the file buffer cache?
File Buffer Cache
[Figure: processes and the VM system sit above a file cache held in memory.]

What do we want to cache?
• inodes
• dir nodes
• whole files
• disk blocks

The answer will determine where in the request path the cache sits, and how. Virtual memory and the file buffer cache will compete for physical memory.
Access Patterns Along the Way
[The same open/read trace as before, now with a cache interposed between the process and the file system, filtering the requests it can serve.]
Why Are File Caches Effective?

1. Locality of reference: storage accesses come in clumps.
   – Spatial locality: if a process accesses data in block B, it is likely to reference other nearby data soon (e.g., the remainder of block B).
     Example: reading or writing a file one byte at a time.
   – Temporal locality: recently accessed data is likely to be used again.
2. Read-ahead: if we can predict what blocks will be needed soon, we can prefetch them into the cache.
   – Most files are accessed sequentially.
What to Cache? Locality in File Access Patterns (UNIX Workloads)

• Most files are small (often fitting into one disk block), although most bytes are transferred from longer files.
• Accesses tend to be sequential and to cover 100% of the file.
  – Spatial locality.
  – What happens when we cache a huge file?
• Most opens are for read mode; most bytes transferred are by read operations.
• There is significant reuse (re-opens): most opens go to files repeatedly opened, and quickly. Directory nodes and executables also exhibit good temporal locality.
  – Looks good for caching!
• Use of temp files is a significant part of file system activity in UNIX: very limited reuse, short lifetimes (less than a minute).
• Long absolute pathnames are common in file opens.
  – Name resolution can dominate performance. Why?
What to Cache? Locality in File Access Patterns
(continued)

[Figure: the file-system data structures from before (per-process file pointer array with stdin/stdout/stderr, system-wide open file table with r-w position and mode, system-wide file descriptor table with the in-memory inode copy, file data). A read(fd, buf, num) names a byte #, while the file buffer cache below, which shares memory with the VM system, is indexed by block #: a translation is needed.]

Caching File Blocks
Caching as “The Answer”

• Avoid the disk for as many file operations as possible.
• The cache acts as a filter for the requests seen by the disk.
• Reads are served best: reuse.
• Delayed writeback will avoid going to disk at all for temp files.

[Figure: process → file cache in memory → disk.]
How to Manage the I/O Cache?

Goal: maintain K slots in memory as a cache over a collection of m items on secondary storage (K << m).

1. What happens on the first access to each item?
   Fetch it into some slot of the cache, use it, and leave it there to speed up access if it is needed again later.
2. How to determine if an item is resident in the cache?
   Maintain a directory of items in the cache: a hash table. Hash on a unique identifier (tag) for the item (fully associative).
3. How to find a slot for an item fetched into the cache?
   Choose an unused slot, or select an item to replace according to some policy, and evict it from the cache, freeing its slot.
File Block Buffer Cache

HASH(vnode, logical block)

Most systems use a pool of buffers in kernel memory as a staging area for memory<->disk transfers. Buffers with valid data are retained in memory in a buffer cache or file cache. Each item in the cache is a buffer header pointing at a buffer. Blocks from different files may be intermingled in the hash chains.

System data structures hold pointers to buffers only when I/O is pending or imminent.
  – busy bit instead of refcount
  – most buffers are “free”

[Figure: hash chains of buffer headers, plus a free/inactive list running from head to tail.]
Handling Updates in the File Cache
1. Blocks may be modified in memory once they have been brought into the cache.
   Modified blocks are dirty and must (eventually) be written back.
   Write-back vs. write-through (104?)
2. Once a block is modified in memory, the write back to disk may not be immediate (synchronous).
   Delayed writes absorb many small updates with one disk write.
     How long should the system hold dirty data in memory?
   Asynchronous writes allow overlapping of computation and disk update activity (write-behind).
     Do the write call for block n+1 while the transfer of block n is in progress.
   Thus file caches also can improve performance for writes.
3. Knowing data gets to disk: force it, but you can't trust a plain “write” syscall - use fsync.
Mechanism for Cache Eviction/Replacement
• Typical approach: maintain an ordered free/inactive list of slots that are candidates for reuse.– Busy items in active use are not on the list.
• E.g., some in-memory data structure holds a pointer to the item.
• E.g., an I/O operation is in progress on the item.
– The best candidates are slots that do not contain valid items.• Initially all slots are free, and they may become free again as items
are destroyed (e.g., as files are removed).
– Other slots are listed in order of value of the items they contain.• These slots contain items that are valid but inactive: they are held
in memory only in the hope that they will be accessed again later.
Replacement Policy

• The effectiveness of a cache is determined largely by the policy for ordering slots/items on the free/inactive list. This defines the replacement policy.
• A typical cache replacement policy is LRU:
  – Assume hot items used recently are likely to be used again.
  – Move the item to the tail of the free list on every release.
  – The item at the front of the list is the coldest inactive item.
• Other alternatives:
  – FIFO: replace the oldest item.
  – MRU/LIFO: replace the most recently used item.
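The LRU free-list discipline above can be sketched with an ordered map as the recency list: every use moves the item to the tail, and eviction takes the head (the coldest item). The class and slot count are illustrative:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, nslots):
        self.nslots = nslots
        self.slots = OrderedDict()   # head = coldest, tail = hottest

    def access(self, tag):
        """Return True on a hit; on a miss, fetch and evict if full."""
        hit = tag in self.slots
        if hit:
            self.slots.move_to_end(tag)        # to the tail on every use
        else:
            if len(self.slots) >= self.nslots:
                self.slots.popitem(last=False) # evict coldest, at the front
            self.slots[tag] = True
        return hit
```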
[Figure: the buffer cache hash chains (HASH(vnode, logical block)) and the free/inactive list, ordered from head (coldest) to tail (hottest).]
Viewing Memory as a Unified I/O Cache

A key role of the I/O system is to manage the page/block cache for performance and reliability:
  – tracking cache contents and managing page/block sharing
  – choreographing movement to/from external storage
  – balancing competing uses of memory

Modern systems attempt to balance memory usage between the VM system and the file cache (a unified buffer cache):
  – Grow the file cache for file-intensive workloads.
  – Grow the VM page cache for memory-intensive workloads.
  – Support a consistent view of files across different styles of access.
Prefetching
• To avoid the access latency of moving the data in for that first cache miss.
• Prediction! “Guessing” what data will be needed in the future. How?
  – It's not for free: there are consequences of guessing wrong.
    Overhead: removal of useful stuff, disk bandwidth consumed.
Intrafile prediction

• Sequential access suggests prefetching block n+1 when n is requested.
• Upon seek (sequentiality is broken):
  – Stop prefetching
  – Detect a “stride” or pattern automatically
  – Depend on hints from the program
    • Compiler-generated “prefetch” statements
    • User supplied
  – How often is this issue relevant? Big files, nonsequential files, predictable accesses.
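The basic sequential rule above (prefetch n+1 while requests stay sequential, stop on a seek) can be sketched as:

```python
def prefetch_decisions(requests):
    """For each requested block, return the block to prefetch (or None)."""
    decisions = []
    prev = None
    for n in requests:
        if prev is None or n == prev + 1:
            decisions.append(n + 1)    # sequential: read ahead one block
        else:
            decisions.append(None)     # a seek broke sequentiality: stop
        prev = n
    return decisions
```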
Interfile prediction

• Related files: what does that mean?
  – Directory nodes that are ancestors of a cached object must be cached in order to resolve its pathname.
  – Detection of “file working sets”:
    • Trees representing program executions are constructed.
  – Capture semantic relationships among files in a “semantic distance” measure (the SEER system).