FILE SYSTEMS
2016 Operating Systems Design
Euiseong Seo ([email protected])
File System Variations
¨ FAT (file allocation table) variants
  ¤ FAT12, FAT16, FAT32
  ¤ VFAT
  ¤ exFAT
¨ ext variants
  ¤ ext2
  ¤ ext3
  ¤ ext4
¨ NTFS
¨ UFS, HFS and so on…
ext2: Disk Data Structure
¨ Block groups
  ¤ Used, instead of the cylinder groups (CG) of FFS, for ease of management
  ¤ Each group contains several blocks
  ¤ Each block in the file system is either allocated or free
  ¤ The number of blocks is determined by the partition size and the block size
[Figure, from the embedded "Advanced Operating Systems" slides: the disk starts with a boot block, followed by block group 0 … block group n. Each block group holds a super block (1 block), group descriptors (n blocks), a data block bitmap (1 block), an inode bitmap (1 block), the inode table (n blocks), and data blocks (n blocks).]
ext2: Boot Block
¨ First 1,024 bytes of the disk
¨ Reserved for partition boot sectors
¨ Unused by the ext2 FS
ext2: Super Block
¨ Defined in struct ext2_super_block of "ext2_fs.h"
¨ Located at a fixed offset of 1,024 bytes from the start of the file system
¨ Copied at each block group boundary for backup
  ¤ If the copies differ, the file system is corrupted
¨ Information type
  ¤ Parameters determined when the file system was created; they cannot be changed afterwards
  ¤ Tunable parameters, which can always be changed
  ¤ Current file system state
ext2: Super Block Parameters
¨ Main parameters
Type   Field                 Description
__u32  s_inodes_count        Total number of inodes
__u32  s_blocks_count        File system size in blocks
__u32  s_r_blocks_count      Number of reserved blocks
__u32  s_free_blocks_count   Free blocks counter
__u32  s_free_inodes_count   Free inodes counter
__u32  s_first_data_block    Number of first useful block (1 for 1 KB blocks, 0 for larger blocks)
__u32  s_log_block_size      Block size, stored as log2(block size / 1,024)
__s32  s_log_frag_size       Fragment size
__u32  s_blocks_per_group    Number of blocks per group
__u32  s_frags_per_group     Number of fragments per group
__u32  s_inodes_per_group    Number of inodes per group
__u32  s_mtime               Time of last mount operation
__u32  s_wtime               Time of last write operation
__u16  s_mnt_count           Mount operations counter
__s16  s_max_mnt_count       Maximal mount count
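The fields listed above sit at the very start of the superblock, in the order shown, so they can be decoded with a fixed little-endian layout. The following is a minimal sketch, not the kernel's code: the helper names (parse_superblock_head, read_superblock_head) and the derived "block_size" key are introduced here for illustration, and the block size is assumed to be stored as a shift count applied to 1,024.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # fixed offset from the start of the file system

# Little-endian layout of the listed fields, in order:
# 7 x __u32, 1 x __s32, 5 x __u32, 1 x __u16, 1 x __s16 (56 bytes in total)
SUPERBLOCK_HEAD = struct.Struct("<7Ii5IHh")

FIELD_NAMES = (
    "s_inodes_count", "s_blocks_count", "s_r_blocks_count",
    "s_free_blocks_count", "s_free_inodes_count", "s_first_data_block",
    "s_log_block_size", "s_log_frag_size", "s_blocks_per_group",
    "s_frags_per_group", "s_inodes_per_group", "s_mtime", "s_wtime",
    "s_mnt_count", "s_max_mnt_count",
)


def parse_superblock_head(raw: bytes) -> dict:
    """Decode the first 56 bytes of an ext2 superblock (hypothetical helper)."""
    sb = dict(zip(FIELD_NAMES, SUPERBLOCK_HEAD.unpack_from(raw, 0)))
    # Assumption: the stored value is a shift count applied to 1,024 bytes.
    sb["block_size"] = 1024 << sb["s_log_block_size"]
    return sb


def read_superblock_head(image_path: str) -> dict:
    """Read the superblock head from a file system image (hypothetical helper)."""
    with open(image_path, "rb") as image:
        image.seek(SUPERBLOCK_OFFSET)
        return parse_superblock_head(image.read(SUPERBLOCK_HEAD.size))
```

Given a raw image file, read_superblock_head("disk.img") would return a dictionary keyed by the field names above; the image path is a placeholder.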
ext2: Super Block Parameters
¨ Parameters related to error handling (example)
  ¤ The correction task is left to an external utility, such as e2fsck
  ¤ s_state
    n If bit 0 is zero on an unmounted file system, the file system was not unmounted correctly
Bit    Value  Description
Bit 0  0      The partition is mounted
       1      The partition is unmounted
Bit 1  0      The kernel did not find any error (this does not mean that there is no error)
       1      An error was detected by the kernel
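The two s_state bits can be combined into a simple "does this file system need checking?" decision. The sketch below is an illustration under stated assumptions: the constant and function names are chosen here, and the real decision logic in e2fsck considers more than these two bits.

```python
VALID_FS = 0x0001  # bit 0: set when the file system was unmounted correctly
ERROR_FS = 0x0002  # bit 1: set when the kernel detected an error


def needs_check(s_state: int, currently_mounted: bool) -> bool:
    """Return True if the state bits suggest running a checker such as e2fsck."""
    if s_state & ERROR_FS:
        return True  # the kernel recorded an error
    # Bit 0 is expected to be clear while mounted; it only signals a problem
    # when the file system is not mounted at the moment.
    return not currently_mounted and not (s_state & VALID_FS)
```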
ext2: Group Descriptors
¨ Main features
  ¤ Summarizes the necessary information about a specific block group
¨ Main fields
Type   Field                  Description
__u32  bg_block_bitmap        Block number of block bitmap
__u32  bg_inode_bitmap        Block number of inode bitmap
__u32  bg_inode_table         Block number of first inode table block
__u16  bg_free_blocks_count   Number of free blocks in the group
__u16  bg_free_inodes_count   Number of free inodes in the group
__u16  bg_used_dirs_count     Number of directories in the group
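As a rough illustration of how these fields are used, the sketch below decodes one descriptor from a raw descriptor table and computes how many block groups a file system has. It assumes the classic 32-byte ext2 descriptor, of which the six listed fields occupy the first 18 bytes; the function names are hypothetical.

```python
import struct

GROUP_DESC_SIZE = 32                      # assumed on-disk size of one descriptor
GROUP_DESC_HEAD = struct.Struct("<3I3H")  # the six listed fields, little-endian

GROUP_DESC_FIELDS = (
    "bg_block_bitmap", "bg_inode_bitmap", "bg_inode_table",
    "bg_free_blocks_count", "bg_free_inodes_count", "bg_used_dirs_count",
)


def parse_group_descriptor(table: bytes, group: int) -> dict:
    """Decode the descriptor of block group `group` from the descriptor table."""
    values = GROUP_DESC_HEAD.unpack_from(table, group * GROUP_DESC_SIZE)
    return dict(zip(GROUP_DESC_FIELDS, values))


def block_group_count(s_blocks_count: int, s_first_data_block: int,
                      s_blocks_per_group: int) -> int:
    """Number of block groups: blocks after the first data block, rounded up."""
    usable = s_blocks_count - s_first_data_block
    return -(-usable // s_blocks_per_group)  # ceiling division
```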
ext2: Bitmaps
¨ Each bit in the block/inode bitmap indicates whether a specific block/inode in the group is used or free
ext2: Bitmaps
¨ Example format
  ¤ Each block/inode bitmap occupies a single block
  ¤ If the block size is 1,024 bytes:
    n The bitmap has room for 1,024 × 8 = 8,192 blocks in a block group
¨ Values
  ¤ 0: corresponding data block/inode is free
  ¤ 1: corresponding data block/inode is used
Block size (bytes)   Blocks covered by one block bitmap
1,024                8,192
2,048                16,384
4,096                32,768
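The bitmap arithmetic above can be sketched in a few lines. The helper names are hypothetical, and the bit order (bit i of the bitmap is bit i % 8 of byte i // 8, least significant bit first) is stated here as an assumption.

```python
def blocks_per_bitmap(block_size: int) -> int:
    """A one-block bitmap has one bit per block: block_size * 8 entries."""
    return block_size * 8


def is_used(bitmap: bytes, index: int) -> bool:
    """True if entry `index` is marked used (1), False if it is free (0)."""
    return bool(bitmap[index // 8] & (1 << (index % 8)))


def first_free(bitmap: bytes, limit: int):
    """Index of the first free entry below `limit`, or None if all are used."""
    for index in range(limit):
        if not is_used(bitmap, index):
            return index
    return None
```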
ext2: I-Node Table
¨ Each file/directory is allocated one inode
¨ If all the inodes are used, no new files can be created
¨ Each inode takes up 128 bytes (Ext4: 256 bytes)
  ¤ A 1,024-byte block contains 8 inodes
  ¤ A 4,096-byte block contains 32 inodes
  ¤ The total number of inodes in a block group is stored in the superblock variable s_inodes_per_group
¨ The goal is to place inodes and their related files in the same block group
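With a fixed inode size and s_inodes_per_group, an inode number can be turned into a position in the inode table of a particular block group. The sketch below assumes inode numbers start at 1 and uses a hypothetical helper name; it only does the arithmetic and does not read a disk.

```python
INODE_SIZE = 128  # bytes per on-disk inode in ext2


def locate_inode(ino: int, inodes_per_group: int, block_size: int,
                 inode_size: int = INODE_SIZE):
    """Return (block group, block within that group's inode table, byte offset)."""
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    index = ino - 1
    group = index // inodes_per_group          # which block group
    slot = index % inodes_per_group            # position inside its inode table
    inodes_per_block = block_size // inode_size
    return group, slot // inodes_per_block, (slot % inodes_per_block) * inode_size
```

For example, with 128 inodes per group and 1,024-byte blocks (8 inodes per block), inode 10 falls in group 0, in the second block of the inode table, at byte offset 128.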
ext2: I-Node Table
¨ Main fields
Type                   Field         Description
__u16                  i_mode        File type and access rights
__u16                  i_uid         Owner identifier
__u32                  i_size        File length in bytes
__u32                  i_atime       Time of last file access
__u32                  i_ctime       Time that inode last changed
__u32                  i_mtime       Time that file contents last changed
__u32                  i_dtime       Time of file deletion
__u16                  i_gid         Group identifier
__u16                  i_link_count  Hard links counter
__u32                  i_blocks      Number of data blocks of the file
__u32                  i_flags       File flags
union                  osd1          Specific operating system information
__u32[EXT2_N_BLOCKS]   i_block       Pointers to data blocks
ext2: I-Node Table
¨ Main fields
  ¤ i_mode: determines the inode type and access permission
Bits 15-12   File type
Bits 11-9    Execution override
Bits 8-6     Owner permission
Bits 5-3     Group permission
Bits 2-0     Other permission
¨ File type
Identifier  Value (hex)  Description
S_IFSOCK    C000         Socket
S_IFLNK     A000         Symbolic link
S_IFREG     8000         Regular file
S_IFBLK     6000         Block device
S_IFDIR     4000         Directory
S_IFCHR     2000         Character device
S_IFIFO     1000         FIFO
ext2: I-Node Table
¨ Main fields
  ¤ i_mode: execution override (bits 11-9)
Identifier  Value (hex)  Comment     Description
S_ISUID     0800         Set UID     Run the file with the owner's permissions
S_ISGID     0400         Set GID     Run the file with the group's permissions
S_ISVTX     0200         Sticky bit  Regular file: keep the program image in the swap area after execution
                                     Directory: a user who does not own a file may not delete it from this directory
ext2: I-Node Table
¨ Main fields
  ¤ i_mode: owner/group/other permission (three bits each)
Identifier  Value  Description
Read        0b100  Set read permission
Write       0b010  Set write permission
Execute     0b001  Set execute permission
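The three parts of i_mode (file type, execution override, permissions) can be separated with masks and shifts. The following sketch uses a hypothetical function name; the type constants follow the Linux definitions, in which S_IFLNK is 0xA000 and S_IFSOCK is 0xC000.

```python
FILE_TYPES = {
    0xC000: "socket",
    0xA000: "symbolic link",
    0x8000: "regular file",
    0x6000: "block device",
    0x4000: "directory",
    0x2000: "character device",
    0x1000: "FIFO",
}


def decode_mode(i_mode: int):
    """Split i_mode into (file type, override flags, 'rwxrwxrwx' string)."""
    file_type = FILE_TYPES.get(i_mode & 0xF000, "unknown")   # bits 15-12
    flags = {
        "setuid": bool(i_mode & 0x0800),                     # bit 11
        "setgid": bool(i_mode & 0x0400),                     # bit 10
        "sticky": bool(i_mode & 0x0200),                     # bit 9
    }
    perms = ""
    for shift in (6, 3, 0):                                  # owner, group, other
        bits = (i_mode >> shift) & 0b111
        perms += "r" if bits & 0b100 else "-"
        perms += "w" if bits & 0b010 else "-"
        perms += "x" if bits & 0b001 else "-"
    return file_type, flags, perms
```

For instance, decode_mode(0x81A4) describes a regular file with permissions rw-r--r-- and no override bits set.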
ext2: I-Node Table
¨ Hard links
  ¤ A single inode may be pointed to from several directories
    n In this case, there exist hard links to the file
  ¤ i_link_count keeps the number of hard links
    n Is incremented with each additional link
    n Is decremented when a link to the file is deleted
    n The inode is deallocated only when this number reaches zero
¨ Time and date
Field    Description
i_ctime  Time at which the inode was last changed
i_mtime  Time at which the file was last modified
i_atime  Time at which the file was last accessed
i_dtime  Time at which the inode was deallocated
ext2: I-Node Table
[Figure: the i_block array of an inode. i_block[0] to i_block[11] point directly to data blocks, i_block[12] points to an indirect block, i_block[13] to a double indirect block, and i_block[14] to a triple indirect block. With block size b (so b/4 block numbers fit in one block), the file block numbers covered are:
  Direct:           0 to 11
  Indirect:         12 to b/4 + 11
  Double indirect:  b/4 + 12 to (b/4)^2 + b/4 + 11
  Triple indirect:  (b/4)^2 + b/4 + 12 and above]
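The mapping from a file block number to the kind of pointer that reaches it follows directly from these ranges. The sketch below assumes 4-byte block numbers (b/4 per block) and twelve direct pointers; the function name is hypothetical.

```python
def classify_block(file_block: int, block_size: int):
    """Return (pointer kind, index within that kind's range) for a file block."""
    if file_block < 0:
        raise ValueError("file block numbers start at 0")
    per_block = block_size // 4        # block numbers that fit in one block
    if file_block < 12:
        return "direct", file_block
    file_block -= 12
    if file_block < per_block:
        return "indirect", file_block
    file_block -= per_block
    if file_block < per_block ** 2:
        return "double indirect", file_block
    file_block -= per_block ** 2
    if file_block < per_block ** 3:
        return "triple indirect", file_block
    raise ValueError("block number beyond the triple indirect range")
```

With 1,024-byte blocks, block 267 is the last one reached through the indirect block and block 268 is the first one that needs the double indirect block.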
ext2: I-Node Table
¨ Maximum file size
Block size (bytes)  Direct block  Indirect block  Double indirect block  Triple indirect block
1,024               12 KB         268 KB          64.26 MB               16.062 GB
2,048               24 KB         1.02 MB         513.02 MB              256.5 GB
4,096               48 KB         4.04 MB         4 GB                   ~4 TB
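The figures in the table are cumulative: each column is the largest file reachable when all pointers up to that level are used. A short sketch that reproduces them, assuming 12 direct pointers and 4-byte block numbers (the function name is hypothetical; other limits, such as field widths, can cap the practical file size lower):

```python
def max_file_sizes(block_size: int) -> dict:
    """Cumulative addressable file size in bytes at each indirection level."""
    per_block = block_size // 4
    levels = (
        ("direct", 12),
        ("indirect", per_block),
        ("double indirect", per_block ** 2),
        ("triple indirect", per_block ** 3),
    )
    sizes, blocks = {}, 0
    for name, count in levels:
        blocks += count                      # add the blocks this level reaches
        sizes[name] = blocks * block_size
    return sizes
```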
ext2: Disk Space Management
¨ Disk space management is required for the allocation and deallocation of data blocks and inodes
¨ If possible, it should avoid file fragmentation
  ¤ Fragmentation increases the average time of read operations
  ¤ Because the position of the disk head has to change frequently
  ¤ Thus, space management operations must complete as quickly as possible
ext2: Creation of I-Node
¨ ext2_new_inode()
  ¤ Creates a disk inode and returns the address of the inode object
    n If it fails, it returns NULL
  ¤ Selects the block group so that a new file is placed in the same group as its parent directory
    n Unrelated directories are distributed among the groups
  ¤ Parameters
    n dir: address of the inode object of the directory into which the new inode is inserted
    n mode: type of inode to create
ext2: Removal of I-Node
¨ ext2_free_inode()
  ¤ Removes the disk inode that is pointed to by the inode object
  ¤ Before the function is called
    n The kernel cleans up the internal data structures and the file data
    n Then, it removes the inode object from the inode hash table
  ¤ The function must be called after the file length has been set to zero, so that all data blocks are released
ext2: File Holes
¨ Used to prevent the waste of disk space
  ¤ For example, in database applications
¨ A hole is a part of a regular file that contains only null characters and is not stored in any data block
¨ Command to create a file with a hole at the front
  $ echo -n "X" | dd of=/tmp/hole bs=1024 seek=6
  ¤ /tmp/hole contains 6,145 characters
    n 6,144 null characters and 1 X character
    n The file takes only one data block
¨ Based on dynamic data block assignment
  ¤ A block is assigned to the file only when a process writes data to it
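The same effect can be reproduced from a program by seeking past the end of an empty file before writing. This is a small sketch with a hypothetical helper name; whether the hole is actually left unallocated depends on the underlying file system, so the code only checks the logical size and contents.

```python
import os
import tempfile


def make_file_with_hole(path: str, hole_bytes: int = 6144) -> os.stat_result:
    """Create a file whose first `hole_bytes` bytes are a hole, then one 'X'."""
    with open(path, "wb") as f:
        f.seek(hole_bytes)   # skip ahead without writing anything
        f.write(b"X")        # only this byte is written explicitly
    return os.stat(path)


def demo() -> int:
    """Create the file in a temporary directory and return its logical size."""
    with tempfile.TemporaryDirectory() as tmp:
        return make_file_with_hole(os.path.join(tmp, "hole")).st_size
```

On file systems that support holes, the st_blocks field of the returned stat result is typically much smaller than the logical size would suggest.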
ext2: Data Block (De)Assignment
¨ ext2_get_block()
  ¤ Looks for the block that holds the requested data of a regular file
  ¤ If there is no corresponding block, one is assigned automatically
  ¤ Called each time a read/write is requested on a regular file
¨ ext2_alloc_block()
  ¤ Looks for a free block in the Ext2 partition
  ¤ If necessary, also assigns the blocks used for indirect reference
¨ ext2_truncate()
  ¤ When a process deletes the file or sets its length to zero, all data blocks of the file are returned
  ¤ Takes the address of the file's inode object as a parameter
¨ ext2_free_blocks()
  ¤ Frees one or more adjacent data blocks of a block group
ext3: Overview
¨ ext3, or third extended filesystem, is a journaled file system that is commonly used with the Linux kernel
¨ It was merged into the mainline Linux kernel in 2001 (2.4.15)
¨ Its main advantage over ext2 is journaling
¨ ext3 adds the following features to ext2
  ¤ A journal
  ¤ Online file system growth
  ¤ Hash-tree indexing for larger directories
¨ An ext2 file system can be converted to an ext3 file system directly
  ¤ Without backup/restore
ext3: Journaling - Motivation
¨ Single I/O request may involve many disk writes
  ¤ Metadata (e.g., inode)
  ¤ File contents (data blocks)
[Figure (Computer System Lab., 2015-11-04): an application's request passes through the file system and the page cache to the disk, where it touches both metadata (e.g., the inode) and file contents (data blocks).]
ext3: Journaling - Motivation
¨ Disk writes are cached
¨ Example: write()
  ¤ Update the metadata of the file (timestamp, length…)
  ¤ Write the contents to the data blocks
[Figure: an application's "Write A" request passes through the page cache and the file system on its way to the disk blocks A, B, C.]
ext3: Journaling - Motivation
¨ What if a system failure occurs during the operation?
ext3: Journaling - Motivation
¨ After a system failure during the operation, data consistency cannot be guaranteed
¨ How can we successfully recover from these failures?
ext3: Journaling - Concept
¨ Uses the concept of a transaction
  ¤ A series of operations either all occur, or none occur
¨ Keeps track of changes not yet committed to the file system
  ¤ All writes are recorded as a "journal", which is stored in a "journal area"
[Figure: the write of block A is first recorded in the journal area on disk.]
ext3: Journaling - Concept
[Figure: "Commit" step. The copy of block A in the journal area is committed; the diagram shows it bracketed by markers labeled S and E.]
ext3: Journaling - Concept
¨ The file system periodically checkpoints committed journals to the original copies
[Figure: "Checkpoint" step. The committed copy of block A is written from the journal area to its original location on disk.]
ext3: Journaling - Concept
¨ Such a file system is easy to recover after a system failure
  ¤ Case 1: system failure during commit
    n The incomplete journal entry is discarded; the original copies are untouched
  ¤ Case 2: system failure during checkpoint
    n The committed journal entry is still intact, so it is replayed to the original copies
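The transaction, commit, checkpoint, and recovery steps can be illustrated with a toy in-memory model. This is a teaching sketch only, not how JBD is implemented: the class and method names are invented here, "disk" and "journal" are plain Python containers, and a crash is simulated simply by calling recover() at the chosen point.

```python
class ToyJournalFS:
    """Toy model: writes go to a journal first, then are checkpointed to disk."""

    def __init__(self, disk=None):
        self.disk = dict(disk or {})   # "original copies": block name -> data
        self.journal = []              # journal area: list of transactions

    def begin(self):
        tx = {"writes": {}, "committed": False}
        self.journal.append(tx)
        return tx

    def write(self, tx, block, data):
        tx["writes"][block] = data     # recorded in the journal only

    def commit(self, tx):
        tx["committed"] = True         # the transaction is now durable

    def checkpoint(self):
        # Copy committed transactions to their original locations, then drop them.
        for tx in self.journal:
            if tx["committed"]:
                self.disk.update(tx["writes"])
        self.journal = [tx for tx in self.journal if not tx["committed"]]

    def recover(self):
        # After a crash: discard uncommitted transactions, replay committed ones.
        self.journal = [tx for tx in self.journal if tx["committed"]]
        self.checkpoint()
```

In case 1 (crash before commit) recover() leaves the disk exactly as it was; in case 2 (crash in the middle of a checkpoint) replaying the committed transaction is harmless because writing the same block contents twice gives the same result.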
ext3: Journaling Implementation
¨ Design goals of ext3
  ¤ Backward and forward compatibility with ext2
  ¤ Existing ext2 partitions can be mounted as ext3
¨ JBD (Journal Block Device)
  ¤ A generic block device journaling layer
  ¤ Journaling in ext3 is done at the block level, not the structure level
  ¤ This keeps the journaling layer independent of the file system
[Figure: the file system issues transaction start/update/commit requests to JBD, which writes to the journal area on the block device.]
ext3: Journal Modes
¨ Ext3 supports three journaling modes:
¨ Journal
  ¤ Metadata and file contents are journaled
¨ Writeback
  ¤ Only metadata is journaled
¨ Ordered
  ¤ Only metadata is journaled, but file contents are guaranteed to be written to disk before the associated metadata is marked as committed in the journal
  ¤ Default option
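The difference between the three modes comes down to two questions: what goes into the journal, and whether file contents must reach the disk before the metadata commit. The table-like sketch below encodes just that; the dictionary layout and function name are illustrative, and the mode names match the values of the ext3 data= mount option (data=journal, data=ordered, data=writeback).

```python
JOURNAL_MODES = {
    # mode:      what is journaled,           data on disk before metadata commit?
    "journal":   {"journaled": ("metadata", "data"), "data_before_commit": True},
    "ordered":   {"journaled": ("metadata",),        "data_before_commit": True},
    "writeback": {"journaled": ("metadata",),        "data_before_commit": False},
}

DEFAULT_MODE = "ordered"


def metadata_may_outrun_data(mode: str) -> bool:
    """True if committed metadata may refer to file contents not yet on disk."""
    return not JOURNAL_MODES[mode]["data_before_commit"]
```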
ext4: Overview
¨ The new ext4 filesystem: current status and future plans
  ¤ 2007 Linux Symposium, Ottawa, Canada, July 27th - 30th
¨ Motivation
  ¤ 16 TB file system size limitation
  ¤ 32-bit block numbers: 4 KB × 2^32 = 16 TB
  ¤ 32,768 sub-directory limitation
  ¤ Performance limitation
¨ New features
  ¤ 48-bit block numbers: 4 KB × 2^48 = 1 EB
  ¤ Replacing indirect blocks with extents
  ¤ Optimized block allocation
  ¤ Performance optimization
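The size limits follow from multiplying the block size by the number of distinct block numbers. A one-function sketch (the function name is hypothetical, and TB and EB are taken as binary units, 2^40 and 2^60 bytes):

```python
def max_fs_bytes(block_number_bits: int, block_size: int = 4096) -> int:
    """Largest file system addressable with the given block number width."""
    return block_size * 2 ** block_number_bits


TB = 2 ** 40
EB = 2 ** 60
```

With 4 KB blocks, max_fs_bytes(32) equals 16 * TB and max_fs_bytes(48) equals 1 * EB.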
ext4: Changes
Changes of EXT File System
                      Ext2           Ext3            Ext4
Added to kernel       1993           2001 (2.4.15)   2006 (2.6.19); 2008 (2.6.28), stable
Max file size         16 GB ~ 2 TB   16 GB ~ 2 TB    16 GB ~ 16 TB
Max file system size  2 TB ~ 32 TB   2 TB ~ 32 TB    1 EB (1 exabyte = 1,024 PB)
Feature               Block group    Journaling      Extended mapping, multiblock allocation, delayed allocation

Block size  Max file size  Max file system size
1 KB        16 GB          2 TB
2 KB        256 GB         8 TB
4 KB        2 TB           16 TB
8 KB        2 TB           32 TB
  ¤ 4 KB × 2^32 = 16 TB
  ¤ 2 KB × 2^32 = 8 TB
ext4: Extended Mapping
¨ EXT2/3
  ¤ Indirect block maps are incredibly inefficient for large files
    n Double, triple indirect block mapping
    n One extra block read (and seek) every 1024 blocks
¨ EXT4
  ¤ An extent is a single descriptor for a range of contiguous blocks
    n Enhanced scalability: an efficient way to represent large files
  ¤ Better CPU utilization, fewer metadata I/Os
  ¤ Extents block mapping
ext4: Extended Mapping
¨ On-disk extents format
  ¤ Extent: represents a range of contiguous physical blocks
  ¤ 12-byte ext4_extent structure
    n Addresses a 1 EB file system (48-bit physical block number)
    n Max extent of 128 MB (16-bit extent length)
    n Addresses a 16 TB file size (32-bit logical block number)
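The 12-byte structure can be decoded with a fixed layout: a 32-bit logical block number, a 16-bit length, and a 48-bit physical block number split into a 16-bit high part and a 32-bit low part. The sketch below assumes the little-endian field order ee_block, ee_len, ee_start_hi, ee_start_lo; the helper name and the returned dictionary keys are chosen here for illustration.

```python
import struct

# ee_block (__le32), ee_len (__le16), ee_start_hi (__le16), ee_start_lo (__le32)
EXT4_EXTENT = struct.Struct("<IHHI")


def parse_extent(raw: bytes) -> dict:
    """Decode one 12-byte extent into logical start, length, physical start."""
    ee_block, ee_len, start_hi, start_lo = EXT4_EXTENT.unpack_from(raw, 0)
    return {
        "logical_block": ee_block,                     # first file block covered
        "length": ee_len,                              # number of blocks covered
        "physical_block": (start_hi << 32) | start_lo, # 48-bit disk block number
    }
```

One such descriptor replaces what would otherwise be one block pointer per block in the indirect scheme.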