Filesystem Implementation
Tanenbaum Ch. 6
Silberschatz Ch. 11
Bovet Ch. 12, 18
2
Examples
• UNIX: UFS, based on FFS
• Windows:
  – Disk: FAT, FAT32 and NTFS
  – CD, DVD, floppy-disk ... filesystems
• Linux (40+): ext2, ext3, ...
• Distributed filesystems: NFS
A modern OS must support multiple types of filesystems (fs) concurrently!
3
Layered Approach
[Figure: layers from bottom to top]
• HW (disk)
• Interrupt handlers, device drivers
• Handles basic reading/writing of physical blocks
• Handles files (logical blocks → physical blocks)
• Handles metadata and directory structures
• Shared
• Content / operations on files
4
Virtual File Systems (VFS)
• VFS provides an object-oriented way of implementing filesystems
• VFS allows the same system call interface (the API) to be used for different types of filesystems
• Programs call the VFS interface rather than the interface of any specific type of filesystem
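The "object-oriented" idea can be sketched in plain C as a table of function pointers. This is only an illustration: `struct fs_ops`, `vfs_read` and the two toy filesystems are hypothetical names, not part of the Linux VFS.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical operation table: one function pointer per operation. */
struct fs_ops {
    const char *name;
    /* Copy up to len bytes of "file content" into buf; return bytes copied. */
    size_t (*read)(char *buf, size_t len);
};

static size_t copy_text(char *buf, size_t len, const char *text)
{
    size_t n = strlen(text);
    if (n > len)
        n = len;
    memcpy(buf, text, n);
    return n;
}

/* Two toy "filesystems" implementing the same interface differently. */
static size_t fat_read(char *buf, size_t len)  { return copy_text(buf, len, "from FAT"); }
static size_t ext2_read(char *buf, size_t len) { return copy_text(buf, len, "from ext2"); }

static const struct fs_ops fat_ops  = { "fat",  fat_read  };
static const struct fs_ops ext2_ops = { "ext2", ext2_read };

/* Generic entry point: the caller never names a concrete filesystem. */
size_t vfs_read(const struct fs_ops *ops, char *buf, size_t len)
{
    return ops->read(buf, len);
}
```

Calling `vfs_read(&fat_ops, buf, sizeof buf)` and `vfs_read(&ext2_ops, buf, sizeof buf)` goes through the same call site but ends up in different code, which is the role the operation tables play in the real VFS.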
5
Schematic View of VFS
Concurrent support of multiple filesystems
6
VFS and Linux
• VFS introduces a “common file model”
  – vnode: file representation structure
• “Implemented” by
  – FAT32, NTFS, ext2/3, AFS, NFS, ReiserFS …
• Linux
  – i-node object (a file)
  – file object (an open file)
  – superblock object (entire file system)
  – dentry object (directory entry)
7
The VFS Objects: Common File Model
Process → struct file → struct dentry → struct inode
• struct file: represents an open file in a process (from <fs.h>)
• struct dentry: represents a directory entry (from <dcache.h>)
• struct inode: represents a file in the filesystem (from <fs.h>)
• struct super_block: represents a filesystem

static ssize_t fifo_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
    struct inode *node = file->f_dentry->d_inode;
    unsigned int minor = iminor(node);   /* macro from /usr/include/linux/fs.h */
    …
}
8
Outline
• Implementing filesystems on disk
  – implementing files
    • contiguous allocation
    • linked-list allocation
    • file allocation table (FAT)
    • i-node
  – implementing directories
  – trade-offs and performance
• Look at some of the VFS objects for Linux
  – no complete listings
• Example of filesystem implementation
  – ext2, ext3
9
Filesystem Implementation
A possible file system layout
10
Files Consist of Blocks of Data
• Where to store/allocate blocks?
• How to find files/blocks?
• What is a good block size?
[Figure: logical addresses (blocks 1–12 of a file) mapped onto physical addresses (blocks) scattered across the disk]
11
Implementing Files (1)
(a) Contiguous allocation of disk space for 7 files
(b) State of the disk after files D and E have been removed
12
Contiguous Allocation
+ Finding files/blocks is easy
  – Offset + number of blocks
+ Excellent read performance
– Fragmentation
  – Compaction
  – Reuse of holes
– Need to know max file size when allocating
• Where could this allocation be useful?
• What is the standard alternative to static allocation in computer science (think arrays in C)?
13
Implementing Files (2)
- Storing a file as a linked list of disk blocks
- Directory contains a pointer to the first and last blocks
How much data can be stored in 10 blocks?
14
Linked List Allocation
+ No holes, no pre-allocation problem
+ Only the address of the first block needs to be stored
– Finding block n is expensive
  – Need to read all n-1 blocks prior to block n
– Size of the data area in a block is not a power of two (2^x)
  – The pointer is not data
• Both disadvantages can be removed by using a new data structure. Which one?
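Both costs can be made concrete with a small sketch. It assumes a toy in-memory "disk" where every block stores only the index of its successor; `struct block`, `NIL`, `linked_capacity` and `linked_find` are hypothetical names chosen for this illustration.

```c
#include <stddef.h>

#define NIL (-1)   /* end-of-chain marker (an assumption of this sketch) */

/* A toy disk block: the "next" pointer shares the block with the payload. */
struct block {
    int next;      /* index of the next block of the file, or NIL */
    /* payload omitted */
};

/* Bytes of file data that fit in nblocks blocks when each block
 * gives up ptr_size bytes to the next pointer. */
size_t linked_capacity(size_t nblocks, size_t block_size, size_t ptr_size)
{
    return nblocks * (block_size - ptr_size);
}

/* Return the disk block holding file block n (0-based) of the file that
 * starts at `first`, or NIL if the chain is shorter. *reads is set to the
 * number of blocks that had to be read just to follow pointers. */
int linked_find(const struct block *disk, int first, size_t n, size_t *reads)
{
    int cur = first;
    *reads = 0;
    while (cur != NIL && n > 0) {
        cur = disk[cur].next;   /* only block cur knows its successor */
        (*reads)++;
        n--;
    }
    return cur;
}
```

With 1 KB blocks and 4-byte pointers (values assumed for the example), `linked_capacity(10, 1024, 4)` gives 10200 bytes rather than 10240, and reaching file block n costs n pointer-chasing reads.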
15
Implementing Files – FAT
Idea: store the pointers in a table
• Fast random access
  – Table can be stored in RAM
• Full 2^x block size
This method is called FAT (File Allocation Table)
Disadvantage: table size
• 20 GB disk, block size 1 KB → 20M blocks → 80 MB (4-byte entries) or 60 MB (3-byte entries)
What can we do to reduce the storage requirement?
Example: file A occupies blocks 4 → 7 → 2 → 10 → 12
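A minimal sketch of both points on this slide, assuming the table is a plain `int` array in RAM and that `-1` marks the end of a chain; `FAT_EOF`, `fat_chain` and `fat_table_bytes` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define FAT_EOF (-1)   /* end-of-file marker (an assumption of this sketch) */

/* Follow a file's chain through the in-memory table. Visited block numbers
 * are stored in out (at most max of them); returns how many were stored.
 * No disk block is read while following the chain. */
size_t fat_chain(const int *fat, int first, int *out, size_t max)
{
    size_t n = 0;
    for (int b = first; b != FAT_EOF && n < max; b = fat[b])
        out[n++] = b;
    return n;
}

/* Size of the table in bytes: one entry per disk block. */
uint64_t fat_table_bytes(uint64_t disk_bytes, uint64_t block_bytes,
                         uint64_t entry_bytes)
{
    return disk_bytes / block_bytes * entry_bytes;
}
```

For file A, the table would hold `fat[4] = 7`, `fat[7] = 2`, `fat[2] = 10`, `fat[10] = 12` and `fat[12] = FAT_EOF`. Using binary units, `fat_table_bytes(20ULL << 30, 1024, 4)` equals `80ULL << 20`, matching the 80 MB figure above.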
16
FAT → i-nodes
• Do we actually need to have the whole table in memory all the time?
  – table size is proportional to disk size!
• Actually, only open files need to be there…
• Split the table into per-file tables, called i-nodes (index node)
17
Implementing Files (4)
An example i-node
Indirect block to handle large files
18
Indirect Addressing
An i-node with 3 levels of indirect blocks
19
Directories
• Opening a file:
  – locate root directory,
  – search for desired directory,
  – directory contains info to find file blocks on disk:
    • disk address (contiguous allocation)
    • number of first block (linked list)
    • i-node number
• Directory system: maps ASCII file name onto the information needed to open it
20
Implementing Directories (1)
Where to store file attributes?
(a) A simple directory
  – fixed-size entries (1 per file)
  – disk addresses and attributes in directory entry
(b) Directory in which each entry just refers to an i-node
21
Implementing Directories (2)
• Directories are files (i-node) with i-node pointers
• Directory systems should translate a name to a file (i-node)
  – dentry keeps this info in VFS
22
Locating /usr/ast/mbox
23
Shared Files
Storing attributes in i-node simplifies sharing
File system containing a shared file
24
Hard/Symbolic Links
• Hard links are actually the same file
  – share the same i-node
  – will be seen as the same file everywhere
    • same owner
    • same contents
    • same permissions
  – the i-node keeps a link counter
• Symbolic links are dereferenced
  – a special file
    • different owners/permissions
    • can cross filesystem boundaries
  – shortcuts in Windows, aliases in Mac
25
Shared Files
(a) Situation prior to linking
(b) After the link is created
(c) After the original owner removes the file
26
Check this under Linux. Execute as u1=user1, u2=user2 (make sure that u2 has write permissions)
1. u1: echo Hi > file-u1
2. u2: ln file-u1 file-u2
3. u2: ln -s file-u1 file-u2-s
4. u2: cat file-u2
5. u2: cat file-u2-s
6. u1: echo again >> file-u1
7. u1: rm file-u1
8. u2: cat file-u2
9. u2: cat file-u2-s
What is the output of lines 4, 5 and 8, 9? Why?
27
Mounting
[Figure: root directory / containing usr, bin, tmp and windows; a second directory tree containing Windows, Documents and Settings, and Temp]
• The directory i-node indicates that it is a mount point
28
Disk Space Management
Store files in fixed-size blocks. How big should the blocks be?
- Average file size is important
[Figure: plot against block size (bytes); all files are 2 KB]
29
Keeping Track of Free Blocks (1)
(a) Storing the free list on a linked list (32 bits / block)
(b) A bit map (1 bit per block, but for all blocks)
30
Keeping Track of Free Blocks (2)
• Bitmap size depends on disk and block size
• Linked list size depends on # free blocks
• Bitmaps are generally smaller
• Linked lists can use free blocks …
• Only one block of the linked list is needed in main memory
  – The others are read/written on demand
  – Problems? What happens if files are deleted?
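The size comparison can be checked with a few helper functions. This is a sketch under stated assumptions: bit value 1 means "in use", free-list entries are 32 bits wide as on the previous slide, and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit i is 1 if block i is in use, 0 if it is free (convention assumed here). */
void bitmap_set(uint8_t *map, size_t i)   { map[i / 8] |= (uint8_t)(1u << (i % 8)); }
void bitmap_clear(uint8_t *map, size_t i) { map[i / 8] &= (uint8_t)~(1u << (i % 8)); }
int  bitmap_test(const uint8_t *map, size_t i) { return (map[i / 8] >> (i % 8)) & 1; }

/* Bytes needed for a bitmap covering nblocks blocks: fixed by the disk
 * and block size, independent of how many blocks are free. */
size_t bitmap_bytes(size_t nblocks) { return (nblocks + 7) / 8; }

/* Bytes needed for a free list with one 32-bit entry per free block:
 * grows and shrinks with the number of free blocks. */
size_t freelist_bytes(size_t nfree) { return nfree * 4; }
```

For a disk of 20M blocks, `bitmap_bytes(20u * 1024 * 1024)` is 2621440 bytes (2.5 MB) no matter how full the disk is, while the list only beats that once fewer than 655360 blocks remain free.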
31
Keeping Track of Free Blocks (3)
(a) Almost-full block of pointers to free disk blocks in RAM
  - three blocks of pointers on disk
(b) Result of freeing a 3-block file
(c) Alternative strategy for handling 3 free blocks
  - shaded entries are pointers to free disk blocks
32
Quota
Quotas for keeping track of each user’s disk use
33
Backups
• Performing filesystem backups is essential for reliable systems
• Two types
  – Full
  – Incremental
• Typically a mixed algorithm is used
  – How to keep track of which files to save?
34
Backups
• A filesystem to be dumped
  – squares are directories, circles are files
  – shaded items have been modified since the last dump
  – each directory and file is labeled by its i-node number
[Figure label: file that has not changed]
35
Backups
• Commonly all modified files and the directories above them are stored
  – Can restore on another filesystem
  – Individual files can be restored from an incremental backup
• Bitmaps are used to find the modified i-nodes
36
Backups
4 phases of the algorithm
1. Recursively mark each dir and each modified i-node (a)
2. Recursively unmark non-modified dirs (b)
3. Dump all dirs (c)
4. Dump all modified i-nodes (d)
37
Outline
• Implementing filesystems on disk
  – Implementing files
    • contiguous allocation
    • linked-list allocation
    • file allocation table (FAT)
    • i-node
  – Implementing directories
  – trade-offs and performance
• Look at some of the VFS objects for Linux
  – no complete listings
• Example of filesystem implementation
  – ext2, ext3
38
The Common File Model
Process → struct file → struct dentry → struct inode
• struct file: represents an open file in a process (from <fs.h>)
• struct dentry: represents a directory entry (from <dcache.h>)
• struct inode: represents a file in the filesystem (from <fs.h>)
• struct super_block: represents a filesystem
39
task_struct (sched.h)
struct task_struct {
    volatile long state;
    struct thread_info *thread_info;
    atomic_t usage;
    unsigned long flags;
    unsigned long ptrace;
    int lock_depth;
    int prio, static_prio;
    struct list_head run_list;
    prio_array_t *array;
    unsigned long sleep_avg;
    long interactive_credit;
    […]
    /* file system info */
    int link_count, total_link_count;
    struct tty_struct *tty; /* NULL if no tty */
    /* ipc stuff */
    struct sysv_sem sysvsem;
    /* CPU-specific state of this task */
    struct thread_struct thread;
    /* filesystem information */
    struct fs_struct *fs;
    /* open file information */
    struct files_struct *files;
    /* namespace */
    struct namespace *namespace;
    /* signal handlers */
    struct signal_struct *signal;
    struct sighand_struct *sighand;
    […]
};

struct files_struct {
    atomic_t count;
    spinlock_t file_lock;
    int max_fds;
    int max_fdset;
    int next_fd;
    struct file ** fd; /* current fd array */
    fd_set *close_on_exec;
    fd_set *open_fds;
    fd_set close_on_exec_init;
    fd_set open_fds_init;
    struct file * fd_array[NR_OPEN_DEFAULT];
};

Remember:
• Each process is represented using a task_struct
• Keeps “a list” of open files
  – files_struct
40
File (fs.h)
struct file {
    struct list_head f_list;
    struct dentry *f_dentry;        /* directory entry for the file! */
    struct vfsmount *f_vfsmnt;
    struct file_operations *f_op;   /* set by the OS when file loaded from inode */
    atomic_t f_count;               /* file reference count */
    unsigned int f_flags;
    mode_t f_mode;
    loff_t f_pos;                   /* current file pointer (offset) */
    struct fown_struct f_owner;
    unsigned int f_uid, f_gid;
    int f_error;
    struct file_ra_state f_ra;
    unsigned long f_version;
    void *f_security;
    [..]
};

The file object:
• Created by the OS when a file is opened
• Does not exist on disk!
  – no “dirty” bit is needed
• Several processes can use the same file object
• Contains a list of pointers to operations on this file
41
Operations on Files
struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
    int (*readdir) (struct file *, void *, filldir_t);
    unsigned int (*poll) (struct file *, struct poll_table_struct *);
    int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, struct dentry *, int datasync);
    int (*aio_fsync) (struct kiocb *, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void __user *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
};
42
Dentry (Directory Entry)
• Dentry does not represent directories!
  – i-nodes represent directories
• Used in directory-related operations
  – e.g., pathname lookup: /users/aja/crap/exam.tex
  – 1 dentry and 1 i-node for each component
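To see what "each component" means for the example path, here is a small user-space sketch that splits a pathname the way a lookup walks it. `split_path` and `NAME_LEN` are hypothetical; the kernel's own lookup code is considerably more involved.

```c
#include <stddef.h>
#include <string.h>

#define NAME_LEN 32   /* arbitrary name-length limit for this sketch */

/* Split path into its components. Stores at most max names and returns
 * how many were stored. Over-long names are truncated to NAME_LEN - 1. */
size_t split_path(const char *path, char names[][NAME_LEN], size_t max)
{
    size_t count = 0;
    while (*path != '\0' && count < max) {
        while (*path == '/')               /* skip separators */
            path++;
        size_t full = strcspn(path, "/");  /* length of the next component */
        if (full == 0)
            break;                         /* trailing slash or end of path */
        size_t n = full < NAME_LEN - 1 ? full : NAME_LEN - 1;
        memcpy(names[count], path, n);
        names[count][n] = '\0';
        count++;
        path += full;
    }
    return count;
}
```

For `/users/aja/crap/exam.tex` this yields four names (users, aja, crap, exam.tex); a lookup resolves them one at a time, starting from the dentry of the root directory.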
43
Dentry Cache
• Dentry objects are created on the fly
  – time consuming!
  – inefficient
• dentry objects are often reused soon after creation
• Store dentry objects in a SW cache
  – the dentry cache (remember dcache.h)
44
Software Caches
• Frequently used (created/destroyed) objects are stored/allocated in SW caches
• Basically three caches exist in Linux
  – User mode memory (VM)
  – Slab allocator (common structures/objects)
  – Page cache (inodes, disk blocks)
• Disk caches (the Page Cache) are used to cache disk accesses (not VM pages!!)
  – Crucial to system performance!
  – Must also be part of the page replacement algorithm
• Bovet, Ch. 17
45
Dentry Cache
• Unused dentry objects are stored in a list
  – Allows easy LRU replacement
• A hash table (name → dentry)
  – Allows fast lookup
• Dentry states:
  – In use: used, and contains valid info
  – Unused: not used, but points to a valid i-node
  – Negative: the i-node does not exist, kept to speed up lookups
  – Free: contains no valid info (stored in the slab cache)
Can safely be deleted by the page replacement algorithm
46
dentry (dcache.h)
struct dentry {
    atomic_t d_count;
    unsigned long d_vfs_flags;      /* moved here to be on same cacheline */
    spinlock_t d_lock;              /* per dentry lock */
    struct inode * d_inode;         /* Where the name belongs to - NULL is negative */
    struct list_head d_lru;         /* LRU list */
    struct list_head d_child;       /* child of parent list */
    struct list_head d_subdirs;     /* our children */
    struct list_head d_alias;       /* inode alias list */
    unsigned long d_time;           /* used by d_revalidate */
    struct dentry_operations *d_op;
    struct super_block * d_sb;      /* The root of the dentry tree */
    unsigned int d_flags;
    int d_mounted;
    void * d_fsdata;                /* fs-specific data */
    struct rcu_head d_rcu;
    struct dcookie_struct * d_cookie; /* cookie, if any */
    unsigned long d_move_count;     /* to indicated moved dentry while lockless lookup */
    struct qstr * d_qstr;           /* quick str ptr used in lockless lookup and concurrent d_move */
    struct dentry * d_parent;       /* parent directory */
    struct qstr d_name;
    struct hlist_node d_hash;       /* lookup hash list */
    struct hlist_head * d_bucket;   /* lookup hash bucket */
    unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
} ____cacheline_aligned;

dentry:
• Associates the components of a pathname to their inodes
• Does not exist on disk
47
inode (fs.h)
struct inode {
    struct hlist_node i_hash;
    struct list_head i_list;
    struct list_head i_sb_list;
    struct list_head i_dentry;
    unsigned long i_ino;
    atomic_t i_count;
    umode_t i_mode;
    unsigned int i_nlink;
    uid_t i_uid;
    gid_t i_gid;
    dev_t i_rdev;
    loff_t i_size;
    struct timespec i_atime;
    struct timespec i_mtime;
    struct timespec i_ctime;
    unsigned int i_blkbits;
    unsigned long i_blksize;
    unsigned long i_version;
    unsigned long i_blocks;
    unsigned short i_bytes;
    spinlock_t i_lock;
    struct semaphore i_sem;
    struct inode_operations *i_op;
    struct file_operations *i_fop;
    struct super_block *i_sb;
    struct file_lock *i_flock;
    struct address_space *i_mapping;
    struct address_space i_data;
    struct dquot *i_dquot[MAXQUOTAS];
    /* These three should probably be a union */
    struct list_head i_devices;
    struct pipe_inode_info *i_pipe;
    struct block_device *i_bdev;
    struct cdev *i_cdev;
    int i_cindex;
    unsigned long i_dnotify_mask;
    struct dnotify_struct *i_dnotify;
    unsigned long i_state;
    unsigned int i_flags;
    unsigned char i_sock;
    atomic_t i_writecount;
    void *i_security;
    u32 i_generation;
    union {
        void *generic_ip;
    } u;
#ifdef __NEED_I_SIZE_ORDERED
    seqcount_t i_size_seqcount;
#endif
};

• i_op, i_fop: lists of operations supported on this file(system)
• i_mapping: structure with pointers to the page cache
• There is also an inode cache (inode.c)
48
inode_operations (fs.h)
struct inode_operations {
    int (*create) (struct inode *, struct dentry *, int, struct nameidata *);
    struct dentry * (*lookup) (struct inode *, struct dentry *, struct nameidata *);
    int (*link) (struct dentry *, struct inode *, struct dentry *);
    int (*unlink) (struct inode *, struct dentry *);
    int (*symlink) (struct inode *, struct dentry *, const char *);
    int (*mkdir) (struct inode *, struct dentry *, int);
    int (*rmdir) (struct inode *, struct dentry *);
    int (*mknod) (struct inode *, struct dentry *, int, dev_t);
    int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *);
    int (*readlink) (struct dentry *, char __user *, int);
    int (*follow_link) (struct dentry *, struct nameidata *);
    void (*truncate) (struct inode *);
    int (*permission) (struct inode *, int, struct nameidata *);
    int (*setattr) (struct dentry *, struct iattr *);
    int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
    int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
    ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
    ssize_t (*listxattr) (struct dentry *, char *, size_t);
    int (*removexattr) (struct dentry *, const char *);
};
49
struct address_space
• Stores pages in the page cache as a radix tree
  – Remember digital search trees (tries)?
• Allows fast lookup and sorting
  – Retrieve all dirty blocks
• Read more in:
  – http://lwn.net/Articles/175432/
  – Bovet, Ch. 15
50
super_block (fs.h)
struct super_block {
    struct list_head s_list;            /* Keep this first */
    dev_t s_dev;                        /* search index; _not_ kdev_t */
    unsigned long s_blocksize;
    unsigned long s_old_blocksize;
    unsigned char s_blocksize_bits;
    unsigned char s_dirt;
    unsigned long long s_maxbytes;      /* Max file size */
    struct file_system_type * s_type;
    struct super_operations * s_op;
    struct dquot_operations * dq_op;
    struct quotactl_ops * s_qcop;
    struct export_operations * s_export_op;
    unsigned long s_flags;
    unsigned long s_magic;
    struct dentry * s_root;
    struct rw_semaphore s_umount;
Used to store filesystem-specific information
This reflects VFS’s view of the fs!
51
    struct semaphore s_lock;
    int s_count;
    int s_syncing;
    int s_need_sync_fs;
    atomic_t s_active;
    void * s_security;
    struct list_head s_inodes;          /* all inodes */
    struct list_head s_dirty;           /* dirty inodes */
    struct list_head s_io;              /* parked for writeback */
    struct hlist_head s_anon;           /* anonymous dentries for (nfs) exporting */
    struct list_head s_files;
    struct block_device * s_bdev;
    struct list_head s_instances;
    struct quota_info s_dquot;          /* Diskquota specific options */
    char s_id[32];                      /* Informational name */
    struct kobject kobj;                /* anchor for sysfs */
    void * s_fs_info;                   /* Filesystem private info */
    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct semaphore s_vfs_rename_sem;  /* Kludge */
};
52
Outline
• Implementing filesystems on disk
  – Implementing files
    • contiguous allocation
    • linked-list allocation
    • file allocation table (FAT)
    • i-node
  – Implementing directories
  – trade-offs and performance
• Look at some of the VFS objects for Linux
  – no complete listings
• Example of filesystem implementation (Bovet, Ch. 18)
  – Ext2: popular and robust
  – Ext3: extended with journaling
53
Ext2
• Basic features
  – Native to Linux
  – Variable block size
  – “Related” blocks stored in Block Groups
  – Pre-allocates blocks to allow file growth
  – Supports fast symlinks
54
Ext2
Ext2 partition: Boot Block | Block Group 0 | … | Block Group n
Each block group (size in blocks in parentheses):
• Super Block (1): copy in every block group
• Block group descriptors (n): copy in every block group
• Data Block Bitmap (1): one bit for each block in the group
• inode Bitmap (1)
• inode Table (n)
• Data Blocks (n)
Number of block groups ≈ s/(8b), where s = partition size (in blocks) and b = block size (bytes)
A block group descriptor contains:
• pointer to block bitmap
• pointer to inode bitmap
• pointer to inode table
• free blocks count
• free inodes count
• directory count
• padding
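The s/(8b) estimate follows from the data block bitmap occupying exactly one block: b bytes hold 8b bits, so one group can cover at most 8b blocks. A sketch of the arithmetic, assuming s is counted in blocks and b in bytes (`blocks_per_group` and `block_groups` are hypothetical helper names):

```c
#include <stdint.h>

/* A one-block bitmap of b bytes has 8*b bits, one per block in the group. */
uint64_t blocks_per_group(uint32_t b)
{
    return 8ULL * b;
}

/* Number of block groups for a partition of s blocks, rounded up so that
 * a partial last group is counted. */
uint64_t block_groups(uint64_t s, uint32_t b)
{
    uint64_t per = blocks_per_group(b);
    return (s + per - 1) / per;
}
```

For an 8 GB partition with 4 KB blocks, s is 2097152 blocks and each group covers 32768 blocks (128 MB), so `block_groups(2097152, 4096)` gives 64.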
55
Disk vs. Memory Structs
• There needs to be a mapping
  – VFS ↔ disk structures
  – inode ↔ ext2_inode
  – superblock ↔ ext2_super_block
• Most structures are stored in the page cache
• Some operations are generic VFS and some are ext2-specific
56
Ext2
Disk data structure for an inode (fixed size: 128 bytes)
struct ext2_inode {
    __u16 i_mode;           /* File mode */
    __u16 i_uid;            /* Low 16 bits of Owner Uid */
    __u32 i_size;           /* Size in bytes */
    __u32 i_atime;          /* Access time */
    __u32 i_ctime;          /* Creation time */
    __u32 i_mtime;          /* Modification time */
    __u32 i_dtime;          /* Deletion Time */
    __u16 i_gid;            /* Low 16 bits of Group Id */
    __u16 i_links_count;    /* Links count */
    __u32 i_blocks;         /* Blocks count */
    __u32 i_flags;          /* File flags */
    union osd1;             /* OS dependent 1 */
    __u32 i_block[EXT2_N_BLOCKS]; /* Pointers to blocks */
    __u32 i_generation;     /* File version (for NFS) */
    __u32 i_file_acl;       /* File ACL */
    __u32 i_dir_acl;        /* Directory ACL */
    __u32 i_faddr;          /* Fragment address */
    union osd2;             /* OS dependent 2 */
};

• i_size: effective length of file
• i_blocks: #blocks allocated to file
• i_block: pointers to the blocks
• i_file_acl: pointer to extended attributes
57
File Size
• i_size and i_blocks do not always match
  – internal fragmentation in blocks
    • i_size < i_blocks*512
  – file “holes”
    • i_size > i_blocks*512

echo -n "S" | dd of=hole bs=1024 seek=6

• Creates a file (hole) with zeroes and an ‘S’
• Only one block is allocated
58
Ext2
• Ext2 supports the following file types
  – Unknown
  – Regular
  – Directory
    • Stores names and inode numbers in data blocks
  – Character and block devices, named pipes and sockets
    • Use no data blocks
  – Symbolic links
    • Stores filenames < 60 characters in the inode, else in a data block
    • Uses the i_block[EXT2_N_BLOCKS] field
59
Directory Blocks
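[Figure: layout of an Ext2 directory block. Each entry consists of the fields inode, rec_len, name_len, file_type and name (padded with \0); for example inode 21, rec_len 12, name_len 1, file_type 2, name ".". The entries shown are ".", "..", "home1", "usr" and "file".]
Example from Bovet, 2005, p. 749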
60
Ext2 – How to Find a Block
Finding the block number within a file is simple:
f div b, where b is the block size and f is the position in the file
The f-th character is at position f mod b within that block
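The div/mod rule as a small function. `struct file_pos` and `locate` are hypothetical names for this sketch; positions are counted from 0.

```c
#include <stdint.h>

/* Where a byte position falls: which file block, and where inside it. */
struct file_pos {
    uint64_t block;    /* f div b */
    uint32_t offset;   /* f mod b */
};

struct file_pos locate(uint64_t f, uint32_t b)
{
    struct file_pos p = { f / b, (uint32_t)(f % b) };
    return p;
}
```

For the dd example used in these slides, the ‘S’ is written at byte 6 * 1024 = 6144. With b = 1024 that is file block 6, offset 0; with b = 4096 it is file block 1, offset 2048.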
61
Ext2 – How to Find a Block
• Blocks on disk and blocks in files are not the same
[Figure: file block numbers 0–4 of one file mapped onto logical (disk) block numbers 0–9]
62
Ext2 – How to Find a Block
• The file block → disk block mapping is stored (partly) in the inode
• Remember the
__u32 i_block[EXT2_N_BLOCKS]; /* Pointers to blocks */
• Usually fifteen 32-bit words used as indexes
63
Ext2 – How to Find a Block
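[Figure: the fifteen i_block entries, numbered 0–14, for block size b (one block holds b/4 block numbers). Entries 0–11 point directly at file blocks 0–11. Entry 12 points to an indirect block covering file blocks 12 to (b/4)+11. Entry 13 points to a double-indirect block covering file blocks from (b/4)+12. Entry 14 points to a triple-indirect block covering file blocks from (b/4)^2+(b/4)+12; its first leaf block ends at (b/4)^2+2*(b/4)+11.]
Logical block number is 4 bytes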
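The boundaries in the figure can be turned into a function that tells which kind of i_block entry a given file block goes through. This is a sketch assuming 12 direct entries and 4-byte block numbers as stated above; `enum level`, `block_level` and `max_file_blocks` are hypothetical names, not the ext2 source.

```c
#include <stdint.h>

enum level { TOO_BIG = -1, DIRECT = 0, SINGLE = 1, DOUBLE = 2, TRIPLE = 3 };

#define N_DIRECT 12   /* i_block[0..11] are direct pointers */

/* Which level of indirection maps file block n when the block size is b? */
enum level block_level(uint64_t n, uint32_t b)
{
    uint64_t p = b / 4;              /* block numbers per block */
    if (n < N_DIRECT)
        return DIRECT;
    n -= N_DIRECT;
    if (n < p)
        return SINGLE;               /* file blocks 12 .. (b/4)+11 */
    n -= p;
    if (n < p * p)
        return DOUBLE;               /* up to (b/4)^2+(b/4)+11 */
    n -= p * p;
    if (n < p * p * p)
        return TRIPLE;
    return TOO_BIG;
}

/* Largest number of blocks this scheme can address for block size b. */
uint64_t max_file_blocks(uint32_t b)
{
    uint64_t p = b / 4;
    return N_DIRECT + p + p * p + p * p * p;
}
```

With b = 1024 a block holds 256 block numbers, so file block 267 is the last one reached through the single-indirect block and block 268 is the first one that needs the double-indirect block.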
64
Holes again
echo -n "S" | dd of=hole bs=1024 seek=6
For block size = 4096 bytes:
[Figure: the fifteen i_block entries 0–14; the file contents are zero bytes (\0) followed by the ‘S’, with byte positions 4096 and 6145 marked. The file is 6145 bytes long.]
65
Disk Space Management
• Want to avoid fragmentation
  – Block groups
• Management should be fast
  – SW caches
• Where to create a new inode?
  – For regular files: in the block group of its parent directory
  – Spread “root” directories in different block groups
  – Other nested directories in the same BG if not too full
66
Disk Space Management
• Where to allocate new blocks to a file
  – Near the already allocated blocks
  – In the same block group
  – Other block groups
• Pre-allocation
  – 8 blocks are (pre-)allocated
  – Released when a file is closed.
67
Disk Space Management
Data Block Allocation algorithm
1. Try to use an already (pre-)allocated block
2. Find a new block in the same block group
3. Consider all block groups
   1. Find a free “byte” (8 blocks)
   2. Find a free “bit” (1 block)
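The "byte first, then bit" search of step 3 can be sketched on a single block bitmap. This is a simplification under stated assumptions: one bitmap instead of a loop over all block groups, bit value 1 meaning "in use", and a hypothetical function name.

```c
#include <stddef.h>
#include <stdint.h>

/* Search one block bitmap (1 = in use, 0 = free).
 * Pass 1 looks for a completely free byte, i.e. 8 adjacent free blocks,
 * which leaves room for pre-allocation. Pass 2 settles for any free bit.
 * Returns the block index, or -1 if every block is in use. */
long find_free_block(const uint8_t *bitmap, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)          /* pass 1: a free "byte" */
        if (bitmap[i] == 0)
            return (long)(i * 8);
    for (size_t i = 0; i < nbytes; i++)          /* pass 2: a free "bit" */
        for (int bit = 0; bit < 8; bit++)
            if (!(bitmap[i] & (1u << bit)))
                return (long)(i * 8 + bit);
    return -1;
}
```

For the bitmap bytes {0xFF, 0x0F, 0x00} the first pass returns block 16, even though block 12 is also free, because a whole free byte keeps the file's future blocks together.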
68
Disk Space Management
69
Consistency
• What if page/block caches are not written to disk before a crash?
• Inconsistency = data was partially written
• How do we know if data is inconsistent?
70
Consistency
• File system states
  (a) consistent
  (b) missing block
  (c) duplicate block in free list
  (d) duplicate data block
Count blocks in:
• inodes
• Free list
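Given the two counters per block, the four states can be told apart mechanically. A minimal sketch with hypothetical names; the extra `USED_AND_FREE` case (a block that is both in a file and in the free list) is not on the slide and is added here only so that every combination is classified.

```c
enum block_state {
    CONSISTENT,      /* (a) exactly one reference in total */
    MISSING,         /* (b) in no file and not in the free list */
    DUP_FREE,        /* (c) more than once in the free list */
    DUP_DATA,        /* (d) referenced by more than one file block slot */
    USED_AND_FREE    /* extra case of this sketch, see above */
};

/* in_use: how often the block is referenced from inodes.
 * in_free: how often the block appears in the free list. */
enum block_state classify(unsigned in_use, unsigned in_free)
{
    if (in_use > 1)
        return DUP_DATA;
    if (in_free > 1)
        return DUP_FREE;
    if (in_use == 0 && in_free == 0)
        return MISSING;
    if (in_use == 1 && in_free == 1)
        return USED_AND_FREE;
    return CONSISTENT;
}
```

A checker would fill both counter tables by scanning all inodes and the free list, then call `classify` for every block number.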
71
Consistency Semantics
• Consistency semantics specify how multiple users are to access a shared file simultaneously
  – Unix file system (UFS) implements:
    • Writes to an open file are visible immediately to other users of the same open file
    • Sharing the file pointer to allow multiple users to read and write concurrently
  – AFS has session semantics
    • Writes are only visible to sessions starting after the file is closed
72
Ext3
• Ext3 is pretty much Ext2 + Journaling
• Ext2 works fine as long as the filesystem is cleanly unmounted (remember page cache)
• Consistency checks are expensive!
• Especially for large filesystems (think servers)!
73
Journaling
Idea: keep the most recent updates on disk
Each filesystem change:
1. Write the “change” (log) to disk (journal)
   • Called the “commit log”
2. After 1., write the change itself to disk, then discard the log entry
Consistency check (boot time)
• Crash before commit
  – Discard changes
• Crash after commit
  – Redo changes
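The boot-time rule can be sketched as a replay loop over the journal. This is a toy model, not the ext3 journal format: each entry changes one "block" represented by a single int, and `struct journal_entry` and `journal_recover` are hypothetical names.

```c
#include <stddef.h>

struct journal_entry {
    size_t block;     /* which disk block the change targets */
    int value;        /* new contents (one int stands in for a whole block) */
    int committed;    /* 1 once the complete entry reached the journal */
};

/* Replay the journal after a crash: redo every committed change and
 * ignore the rest. Returns the number of changes that were redone.
 * Redoing a change that already reached the disk is harmless, because
 * it writes the same value again. */
size_t journal_recover(const struct journal_entry *log, size_t n, int *disk)
{
    size_t redone = 0;
    for (size_t i = 0; i < n; i++) {
        if (!log[i].committed)
            continue;                        /* crash before commit: discard */
        disk[log[i].block] = log[i].value;   /* crash after commit: redo */
        redone++;
    }
    return redone;
}
```

Only the journal has to be scanned at boot, which is why recovery time no longer depends on the size of the filesystem.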
74
Ext3
Three alternatives
• Journaling (slow)
  – Log both data and metadata blocks
• Ordering (faster)
  – Log only metadata, but data is written before metadata (default mode)
• Writeback (fastest)
  – Only metadata is logged
75
End of filesystems… but please read the chapters in the suggested book(s)