
P.J.Braam/CMU -- 1

Linux Virtual File System

Peter J. Braam

Aims

• Present the data structures in Linux VFS

• Provide information about flow of control

• Describe methods and invariants needed to implement a new file system

• Illustrate with some examples

File access

History

• BSD implemented VFS for NFS; the aim was to dispatch to different filesystems

• VMS had an elaborate filesystem

• NT/Win95 have VFS type interfaces

• Newer systems integrate VM with buffer cache.

Linux Filesystems

• Media based

– ext2 - Linux native

– ufs - BSD

– fat - DOS FS

– vfat - win 95

– hpfs - OS/2

– minix - well….

– isofs - CDROM

– sysv - Sysv Unix

– hfs - Macintosh

– affs - Amiga Fast FS

– NTFS - NT’s FS

– adfs - Acorn-strongarm

• Network

– nfs

– Coda

– AFS - Andrew FS

– smbfs - LanManager

– ncpfs - Novell

• Special ones

– procfs - /proc

– umsdos - Unix in DOS

– userfs - redirector to user

Linux Filesystems (ctd)

• Forthcoming:

– devfs - device file system

– DFS - DCE distributed FS

• Varia:

– cfs - crypt filesystem

– cfs - cache filesystem

– ftpfs - ftp filesystem

– mailfs - mail filesystem

– pgfs - Postgres versioning file system

• Linux serves (unrelated to the VFS!)

– NFS - user & kernel

– Coda

– AppleShare - netatalk/CAP

– SMB - samba

– NCP - Novell

Linux is Obsolete

Andrew Tanenbaum

Usefulness

File access

Linux VFS

• Multiple interfaces build up VFS:

– files

– dentries

– inodes

– superblock

– quota

• VFS can do all caching & provides utility functions to FS

• FS provides methods to VFS; many are optional

User level file access

• Typical user level types and code:

– pathnames: "/myfile"

– file descriptors: fd = open("/myfile"…)

– attributes in struct stat: stat("/myfile", &mybuf), chmod, chown...

– offsets: write, read, lseek

– directory handles: DIR *dh = opendir("/mydir")

– directory entries: struct dirent *ent = readdir(dh)
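The user level types listed above can be exercised with a few POSIX calls. The following is a minimal sketch, not part of the original slides; the helper names `write_and_stat`, `offset_at_end` and `count_entries` are made up for illustration.

```c
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path, then return the size
 * reported by stat(), or -1 on any error. */
long write_and_stat(const char *path, const char *data, size_t len)
{
    struct stat st;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644); /* pathname -> fd */

    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) {   /* advances the file offset */
        close(fd);
        return -1;
    }
    close(fd);
    if (stat(path, &st) != 0)                     /* attributes in struct stat */
        return -1;
    return (long)st.st_size;
}

/* Hypothetical helper: seek to the end of the file and return that offset. */
long offset_at_end(const char *path)
{
    off_t end;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    end = lseek(fd, 0, SEEK_END);
    close(fd);
    return (long)end;
}

/* Hypothetical helper: count directory entries through a directory handle. */
int count_entries(const char *dirpath)
{
    struct dirent *ent;
    int n = 0;
    DIR *dh = opendir(dirpath);

    if (dh == NULL)
        return -1;
    while ((ent = readdir(dh)) != NULL)
        n++;
    closedir(dh);
    return n;
}
```

Each of these calls enters the kernel through the VFS, whichever file system holds the path.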

VFS

• Manages kernel level file abstractions in one format for all file systems

• Receives system call requests from user level (e.g. write, open, stat, link)

• Interacts with a specific file system based on mount point traversal

• Receives requests from other parts of the kernel, mostly from memory management

File system level

• Individual File Systems

– responsible for managing file & directory data

– responsible for managing meta-data: timestamps, owners, protection etc

– translates data between

• particular FS data: e.g. disk data, NFS data, Coda/AFS data

• VFS data: attributes etc in standard format

– e.g. nfs_getattr(….) returns attributes in VFS format, acquires attributes in NFS format to do so.
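The translation step can be pictured with a toy example. This is only a sketch under stated assumptions: `struct toyfs_fattr`, `struct vfs_attr`, the type constants and `toyfs_getattr` are all hypothetical stand-ins, not the kernel's or NFS's real definitions.

```c
#include <stdint.h>

/* Hypothetical FS-specific attribute record, e.g. as received from a peer. */
struct toyfs_fattr {
    uint32_t type;        /* FS-private file type code */
    uint32_t mode;        /* permission bits only */
    uint32_t uid, gid;
    uint32_t size;
    uint32_t mtime_sec;
};

/* Hypothetical stand-in for attributes in the standard VFS format. */
struct vfs_attr {
    unsigned int mode;    /* type bits and permission bits combined */
    unsigned int uid, gid;
    long size;
    long mtime;
};

#define TOYFS_TYPE_REG 1
#define TOYFS_TYPE_DIR 2
#define VFS_IFREG 0100000u
#define VFS_IFDIR 0040000u

/* Translate FS-format attributes into VFS format.
 * Returns 0 on success, -1 for a file type this sketch does not know. */
int toyfs_getattr(const struct toyfs_fattr *in, struct vfs_attr *out)
{
    unsigned int type_bits;

    switch (in->type) {
    case TOYFS_TYPE_REG: type_bits = VFS_IFREG; break;
    case TOYFS_TYPE_DIR: type_bits = VFS_IFDIR; break;
    default:             return -1;
    }
    out->mode  = type_bits | (in->mode & 07777u);
    out->uid   = in->uid;
    out->gid   = in->gid;
    out->size  = (long)in->size;
    out->mtime = (long)in->mtime_sec;
    return 0;
}
```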

Anatomy of stat system call

sys_stat(path, buf) {
    dentry = namei(path);                       /* establish VFS data */
    if ( dentry == NULL )
        return -ENOENT;

    inode = dentry->d_inode;
    rc = inode->i_op->i_permission(inode);      /* call into inode layer of filesystem */
    if ( rc ) {
        dput(dentry);                           /* release the dentry on the error path too */
        return -EPERM;
    }
    rc = inode->i_op->i_getattr(inode, buf);    /* call into inode layer of filesystem */
    dput(dentry);
    return rc;
}

Anatomy of fstatfs system call

sys_fstatfs(fd, buf) {                          /* for things like "df" */
    file = fget(fd);                            /* translate fd to VFS data structure */
    if ( file == NULL )
        return -EBADF;
    superb = file->f_dentry->d_inode->i_super;
    rc = superb->sb_op->sb_statfs(superb, buf); /* call into superblock layer of filesystem */
    fput(file);                                 /* drop the reference taken by fget */
    return rc;
}
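The dispatch pattern in these two system calls is a table of function pointers supplied by each file system. Below is a user space sketch of that idea; every type and function name in it (`toy_super`, `toy_fstatfs`, `diskfs_ops`, `netfs_ops`, ...) is hypothetical and far simpler than the kernel's structures.

```c
#include <errno.h>
#include <stddef.h>

struct toy_statfs { long f_bsize; long f_blocks; };

struct toy_super;
struct toy_super_ops {
    int (*statfs)(struct toy_super *sb, struct toy_statfs *buf);
};
struct toy_super { const struct toy_super_ops *s_op; long blocksize; long nblocks; };
struct toy_inode { struct toy_super *i_sb; };
struct toy_file  { struct toy_inode *f_inode; };

/* A media based FS knows its block count. */
static int diskfs_statfs(struct toy_super *sb, struct toy_statfs *buf)
{
    buf->f_bsize  = sb->blocksize;
    buf->f_blocks = sb->nblocks;
    return 0;
}

/* A network FS in this sketch reports the block count as unknown (-1). */
static int netfs_statfs(struct toy_super *sb, struct toy_statfs *buf)
{
    buf->f_bsize  = sb->blocksize;
    buf->f_blocks = -1;
    return 0;
}

const struct toy_super_ops diskfs_ops = { diskfs_statfs };
const struct toy_super_ops netfs_ops  = { netfs_statfs };

/* Generic layer: the same code serves every file system. */
int toy_fstatfs(struct toy_file *file, struct toy_statfs *buf)
{
    struct toy_super *sb;

    if (file == NULL)
        return -EBADF;
    sb = file->f_inode->i_sb;
    if (sb->s_op == NULL || sb->s_op->statfs == NULL)
        return -ENOSYS;               /* method is optional in this sketch */
    return sb->s_op->statfs(sb, buf); /* call into the file system */
}
```

The generic function never inspects which file system it is talking to; it only follows pointers from the file to the superblock and calls the method found there.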

Data structures

• VFS data structures for:

– VFS handle to the file: inode (BSD: vnode)

– User instantiated file handle: file (BSD: file)

– The whole filesystem: superblock (BSD: vfs)

– A name to inode translation: dentry

Shorthand method notation

• super block methods: sss_methodname

• inode methods: iii_methodname

• dentry methods: ddd_methodname

• file methods: fff_methodname

• instead of inode->i_op->lookup we write iii_lookup

namei

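[Figure: flow of control in namei, with VFS functions on one side and the FS methods they call on the other.]

struct dentry *namei(parent, name) {
    if (dentry = d_lookup(parent, name))
        /* cache hit. FS methods: ddd_hash(parent, name), ddd_revalidate(dentry) */
    else
        /* cache miss. FS method: iii_lookup(parent, name) */
}

struct inode *iget(ino, dev) {
    /* try cache else .. */
    /* FS method: sss_read_inode(…) */
}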

Superblocks

• Handle metadata only (attributes etc)

• Responsible for retrieving and storing metadata from the FS media or peers

• Struct superblocks hold things like:

– device, blocksize, dirty flags, list of dirty inodes

– super operations

– wait queue

– pointer to the root inode of this FS

Super Operations (sss_)

• Ops on Inodes:

– read_inode

– put_inode

– write_inode

– delete_inode

– clear_inode

– notify_change

• Superblock manips:

– read_super (mount)

– put_super (unmount)

– write_super (unmount)

– statfs (attributes)

Inodes

• Inodes are the VFS abstraction for the file

• Inode has operations (iii_methods)

• VFS maintains an inode cache, NOT the individual FS’s (compare NT, BSD etc)

• Inodes contain an FS specific area where:

– ext2 stores disk block numbers etc

– AFS would store the FID

• Extraordinary inode ops are good for dealing with stale NFS file handles etc.

What’s inside an inode - 1

• caching:

– list_head i_hash

– list_head i_list

– list_head i_dentry

– int i_count

• Identifies file:

– long i_ino

– int i_dev

• Usual stuff:

– {m,a,c}time

– {u,g}id

– mode

– size

– n_link

What’s inside an inode -2

• Which FS:

– superblock i_sb

– inode_ops i_op

• waiting:

– wait objects, semaphore, lock

• For mmap, networking:

– vm_area_struct

– pipe/socket info

– page information

• FS specific info (blockno’s, fids etc):

– union { ext2fs_inode_info i_ext2 nfs_inode_info i_nfs coda_inode_info i_coda ..} u
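The FS specific area can be modelled as a C union embedded in the generic structure. The sketch below is an illustration only: the member layouts and the explicit `fstype` tag are assumptions made for this example (the slide does not say how the valid member is identified).

```c
#include <stddef.h>

/* Hypothetical per-FS private data. */
struct toy_ext2_info { unsigned long i_data[15]; };  /* disk block numbers */
struct toy_coda_info { unsigned long fid[3]; };      /* file identifier */

enum toy_fstype { TOY_EXT2, TOY_CODA };

/* Generic inode with an FS specific area at the end. */
struct toy_inode {
    unsigned long   i_ino;
    enum toy_fstype fstype;     /* assumption: tells which union member is valid */
    union {
        struct toy_ext2_info ext2;
        struct toy_coda_info coda;
    } u;
};

/* Only ext2-style inodes have a first disk block; others yield 0 here. */
unsigned long toy_first_block(const struct toy_inode *inode)
{
    if (inode->fstype != TOY_EXT2)
        return 0;
    return inode->u.ext2.i_data[0];
}
```

Because the union is sized for its largest member, every inode in the cache has the same size regardless of which file system owns it.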

Inode state

• Inode can be on one or two lists:

– (hash & in_use) or (hash & dirty) or unused

– inode has a use count i_count

• Transitions

– unused -> hash: iget calls sss_read_inode

– dirty -> in_use: sss_write_inode

– hash -> unused: call on sss_clear_inode, but if i_nlink = 0: iput calls sss_delete_inode when i_count falls to 0
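The transitions above can be traced with a small state machine. This is a deliberately simplified sketch: all names are hypothetical, the FS methods are replaced by counters, and the inode is moved to "unused" as soon as the last reference is dropped, which compresses the hash -> unused step into iput.

```c
#include <stddef.h>

enum toy_state { TOY_UNUSED, TOY_IN_USE, TOY_DIRTY };

struct toy_inode {
    unsigned long  i_ino;
    int            i_count;   /* use count */
    int            i_nlink;   /* number of names referring to the file */
    enum toy_state state;
};

/* Counters standing in for the sss_ methods of a file system. */
struct toy_fs_calls { int read_inode, write_inode, clear_inode, delete_inode; };

void toy_iget(struct toy_inode *inode, struct toy_fs_calls *fs)
{
    if (inode->state == TOY_UNUSED) {
        fs->read_inode++;              /* unused -> hash: FS fills in the inode */
        inode->state = TOY_IN_USE;
    }
    inode->i_count++;
}

void toy_mark_dirty(struct toy_inode *inode)
{
    if (inode->state != TOY_UNUSED)
        inode->state = TOY_DIRTY;
}

void toy_sync(struct toy_inode *inode, struct toy_fs_calls *fs)
{
    if (inode->state == TOY_DIRTY) {
        fs->write_inode++;             /* dirty -> in_use */
        inode->state = TOY_IN_USE;
    }
}

void toy_iput(struct toy_inode *inode, struct toy_fs_calls *fs)
{
    if (inode->i_count == 0 || --inode->i_count > 0)
        return;                        /* still referenced elsewhere */
    if (inode->i_nlink == 0) {
        fs->delete_inode++;            /* no names left: remove the file */
    } else {
        toy_sync(inode, fs);           /* write back before dropping it */
        fs->clear_inode++;
    }
    inode->state = TOY_UNUSED;
}
```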

Inode Cache

[Figure: the inode cache and the transitions between its lists.]

Players:

• Inode_hashtable

• Dirty inodes

• Used inodes

• Unused inodes

• Fs storage

Numbered steps in the figure:

1. iget: if i_count>0 ++

2. iput: if i_count>1 --

3. free_inodes

4. syncing inodes

Arrow labels in the figure:

• sss_read_inode (iget)

• (mark_inode_dirty)

• sss_write_inode (sync one)

• sss_clear_inode (freeing inos) or sss_delete_inode (iput)

• media fs only

Sales

Red Hat Software sold 240,000 copies of Red Hat Linux in 1997 and expects to reach 400,000 in 1998.

Estimates of installed servers (InfoWorld):

– Linux: 7 million

– OS/2: 5 million

– Macintosh: 1 million

Inode operations (iii_)

• lookup: return inode

– calls iget

• creation/removal

– create

– link

– unlink

– symlink

– mkdir

– rmdir

– mknod

– rename

• symbolic links

– readlink

– follow link

• pages

– readpage, writepage, updatepage - read or write page. Generic for mediafs.

– bmap - return disk block number of logical block

• special operations

– revalidate - see dentry sect

– truncate

– permission

Dentry world

• Dentry is a name to inode translation structure

• Cached aggressively by VFS

• Eliminates lookups by FS & private caches

– timing on Coda FS: ls -lR 1000 files after priming cache

• linux 2.0.32: 7.2secs

• linux 2.1.92: 0.6secs

– disk fs: less benefit, NFS even more

• Negative entries!

• Namei is dramatically simplified

Inside dentries

• name

• pointer to inode

• pointer to parent dentry

• list head of children

• chains for lots of lists

• use count

Dentry associated lists

d_alias chainsplace: d_instantiateremove: dentry_iput

inode I_dentry list head

d_child chainsplace: d_allocremove: d_prune, d_invalidate, d_put

inode i_dentry list head

= d_inode pointer = d_parent pointer

dentry inode relationship dentry tree relationship

Legend: inode dentry

Dcache

[Figure: the dentry_hashtable (d_hash chains, with a list head per dhash(parent, name) value) and the unused dentries (d_lru chains). Entries arrive through namei, iii_lookup and d_add; they leave through prune, d_invalidate and d_drop.]

• namei tries cache: d_lookup

– ddd_compare

• Success: ddd_revalidate

– d_invalidate if fails

– proceed if success

• Failure: iii_lookup

– find inode

– iget

• sss_read_inode

– finish:

• d_add

– can give negative entry in dcache
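The lookup sequence above, including negative entries, can be sketched as a tiny cache in front of a file system lookup routine. Everything here is hypothetical and simplified: a fixed-size open-addressing table replaces the kernel's hash chains, there is no revalidation or pruning, and `toy_fs_lookup` stands in for iii_lookup.

```c
#include <stddef.h>
#include <string.h>

#define TOY_DCACHE_SIZE 64
#define TOY_NAME_MAX    32

struct toy_dentry {
    int           used;
    unsigned long parent;              /* inode number of the parent directory */
    char          name[TOY_NAME_MAX];
    long          ino;                 /* 0 marks a negative entry */
};

struct toy_dcache {
    struct toy_dentry slot[TOY_DCACHE_SIZE];
    int fs_lookups;                    /* how often the FS had to be asked */
};

typedef long (*toy_lookup_fn)(unsigned long parent, const char *name);

/* Sample FS: directory 1 contains only "myfile" with inode number 7. */
long toy_fs_lookup(unsigned long parent, const char *name)
{
    return (parent == 1 && strcmp(name, "myfile") == 0) ? 7 : 0;
}

static unsigned toy_dhash(unsigned long parent, const char *name)
{
    unsigned h = (unsigned)parent;

    while (*name)
        h = h * 31u + (unsigned char)*name++;
    return h % TOY_DCACHE_SIZE;
}

/* Returns the inode number for (parent, name), or 0 if there is no such name. */
long toy_namei(struct toy_dcache *dc, toy_lookup_fn fs_lookup,
               unsigned long parent, const char *name)
{
    unsigned start = toy_dhash(parent, name), i;
    struct toy_dentry *d = NULL;

    /* Cache probe: compare parent and name along the probe sequence. */
    for (i = 0; i < TOY_DCACHE_SIZE; i++) {
        d = &dc->slot[(start + i) % TOY_DCACHE_SIZE];
        if (!d->used)
            break;
        if (d->parent == parent && strcmp(d->name, name) == 0)
            return d->ino;             /* hit, possibly a negative entry */
    }

    dc->fs_lookups++;                  /* miss: ask the file system */
    if (i == TOY_DCACHE_SIZE || strlen(name) >= TOY_NAME_MAX)
        return fs_lookup(parent, name);   /* cannot cache this result */

    d->ino = fs_lookup(parent, name);  /* 0 is stored as a negative entry */
    d->parent = parent;
    strcpy(d->name, name);
    d->used = 1;
    return d->ino;
}
```

Looking up a missing name twice only reaches the file system once, which is the point of caching negative entries.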

Dentry methods

• ddd_revalidate: can force new lookup

• ddd_hash: compute hash value of name

• ddd_compare: are names equal?

• ddd_delete, ddd_put, ddd_iput: FS cleanup opportunity

Dentry particulars:

• ddd_hash and ddd_compare have to deal with extraordinary cases for msdos/vfat:

– case insensitive

– long and short filename pleasantries

• ddd_revalidate -- can force new lookup if inode not in use:

– used for NFS/SMBfs aging

– used for Coda/AFS callbacks
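For the case insensitive part, the hash and compare methods have to agree: names that compare equal must hash to the same value. The following is a minimal sketch of such a pair with hypothetical names and signatures; it handles ASCII case folding only and ignores the long/short filename issue.

```c
#include <ctype.h>
#include <stddef.h>

/* Hash a name after folding it to lower case, so "README" and "readme"
 * land on the same hash chain. */
unsigned long toy_ci_hash(const char *name, size_t len)
{
    unsigned long h = 0;
    size_t i;

    for (i = 0; i < len; i++)
        h = h * 31u + (unsigned long)tolower((unsigned char)name[i]);
    return h;
}

/* Return 0 when the two names are equal ignoring case, nonzero otherwise. */
int toy_ci_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t i;

    if (alen != blen)
        return 1;
    for (i = 0; i < alen; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 1;
    return 0;
}
```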

Dijkstra probably hates me

Linus Torvalds

Style

Memory mapping

• vm_area structure has

– vm_operations

– inode, addresses etc.

• vm_operations

– map, unmap

– swapin, swapout

– nopage -- read when page isn’t in VM

• mmap

– calls on iii_readpage

– keeps a use count on the inode until unmap
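Seen from user level, this machinery is what makes a plain memory access return file data. The sketch below uses only standard POSIX calls; the helper name `first_byte_via_mmap` is made up for illustration.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: return the first byte of a file by mapping it,
 * or -1 on error or for an empty file. */
int first_byte_via_mmap(const char *path)
{
    struct stat st;
    unsigned char *p;
    int fd, c;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    c = p[0];                  /* first touch may fault the page in */
    munmap(p, (size_t)st.st_size);
    return c;
}
```

The descriptor can be closed right after mmap because the mapping itself keeps the file referenced until munmap, which matches the use count noted on the slide.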