An Introduction to Disk-Based Linux File Systems
Avishay Traeger
IBM Haifa Research Lab Internal Storage Course
October 2012
v1.3
Outline
The Basics
The Virtual File System (VFS)
File System Layout
Journaling
What does a disk-based file system do?
Provides structure to the array of bits residing on the disk:
  File and directory naming and hierarchy
  File access: open, read, write, seek, close, ...
  Knows how to map <file,offset> to <sector,offset>
  Tracks which sectors are used and which are “free”
  Access control
Extra features (e.g., improved reliability, snapshots, compression, encryption)
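The <file,offset> to <sector,offset> mapping can be sketched in a few lines. This is a minimal illustration, not how any particular file system stores its mapping: the block size, the sector size, and the flat `block_map` list are all assumptions made for the example.

```python
BLOCK_SIZE = 4096   # assumed file system block size, in bytes
SECTOR_SIZE = 512   # assumed disk sector size, in bytes


def map_offset(block_map, file_offset):
    """Translate a byte offset within a file to (sector, offset in sector).

    block_map[i] is the physical block number that holds logical block i
    of the file (a hypothetical, flattened view of the file's metadata).
    """
    logical_block, offset_in_block = divmod(file_offset, BLOCK_SIZE)
    physical_block = block_map[logical_block]
    disk_byte = physical_block * BLOCK_SIZE + offset_in_block
    return divmod(disk_byte, SECTOR_SIZE)
```

For example, with `block_map = [100, 37, 512]`, byte 5000 of the file lies in logical block 1, which is physical block 37, so `map_offset(block_map, 5000)` returns `(297, 392)`.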
Linux File System Types
Disk (ext2/3/4, xfs, btrfs, ntfs, vfat, etc.)
Network (nfs, cifs, afs, ceph, etc.)
Memory (ramfs, tmpfs, etc.)
Pseudo (proc, sysfs, etc.)
Stackable (ecryptfs, etc.)
Object store (exofs)
FUSE (Filesystem in USErspace): allows developers to implement file systems in userspace (easier to develop, slower to run)
...
Approximately 60 file systems currently in the Linux kernel!
Important Metadata Structures
Superblock: on-disk metadata for the entire file system
  Block size, pointer to the fs root directory, ...
Inode: on-disk metadata for a single file
  Inode number (unique ID), owners, timestamps, size, data block pointers, ...
Dentry: metadata for a directory entry, a single component of a path (not synced to disk)
File: open file structure (not synced to disk)
File → Dentry → Inode → Superblock
Each structure has associated operations that are implemented by each file system
Note: All Linux file system implementations have the above structures in memory, but not all have superblocks and inodes on disk (especially file systems not native to Linux/Unix, like FAT). These file systems must map their on-disk structures to the in-memory ones.
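The File → Dentry → Inode → Superblock chain can be pictured as four records that reference one another. The sketch below is only a simplified model: the field names are chosen for illustration and are not the kernel's actual structure members.

```python
from dataclasses import dataclass, field


@dataclass
class Superblock:
    block_size: int          # metadata for the whole file system
    root_ino: int            # inode number of the root directory


@dataclass
class Inode:
    ino: int                 # unique ID within the file system
    sb: Superblock           # file system this inode belongs to
    size: int = 0
    block_pointers: list = field(default_factory=list)


@dataclass
class Dentry:
    name: str                # a single path component
    inode: Inode


@dataclass
class File:
    dentry: Dentry           # open file: refers to the dentry it was opened by
    pos: int = 0             # current read/write position


def block_size_of(f: File) -> int:
    """Follow File -> Dentry -> Inode -> Superblock."""
    return f.dentry.inode.sb.block_size
```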
Directories
A directory is simply a special type of file
Can contain other files, directories, links, etc.
Each entry has an inode number and name
The file system knows how to find a file based on its inode number
What are the basic steps for performing a lookup on file /foo/bar?
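One way to answer the question: start at the root directory's inode, search it for 'foo', load the inode that entry names, then search that directory for 'bar'. The sketch below walks a path this way over a hypothetical in-memory inode table; the table layout and the root inode number are assumptions for the example (ext2 uses inode 2 for the root, other file systems differ).

```python
ROOT_INO = 2  # assumed root inode number

# Hypothetical inode table: inode number -> ("dir", {name: ino}) or ("file", data)
INODES = {
    2: ("dir", {"foo": 11}),
    11: ("dir", {"bar": 12}),
    12: ("file", b"hello"),
}


def lookup(path):
    """Resolve an absolute path to an inode number, one component at a time."""
    ino = ROOT_INO
    for name in path.strip("/").split("/"):
        if not name:                      # handles "/" itself
            continue
        kind, entries = INODES[ino]
        if kind != "dir":
            raise NotADirectoryError(name)
        if name not in entries:
            raise FileNotFoundError(name)
        ino = entries[name]               # directory entry gives the next inode
    return ino
```

Here `lookup("/foo/bar")` returns 12 after reading two directories, which is why deep paths cost more lookups than shallow ones unless the dentries are cached.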
Hard Links
Associate several names with one inode
When creating a link, increment the inode's reference count (refcount)
The inode and associated data will only be deleted when the refcount is zero
Can only be used within a single file system
Can only point to files, not directories; this prevents cycles in the directory tree
Not supported by all file systems
Figure: two dentries referencing a single inode, which points to the data blocks.
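The refcount behavior can be observed from user space. The sketch below assumes a Linux system whose temporary directory lives on a file system that supports hard links; the file names are arbitrary.

```python
import os
import tempfile


def hard_link_demo():
    """Create a file, hard-link it, remove the original name, and report what happened."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a")
        b = os.path.join(d, "b")
        with open(a, "w") as f:
            f.write("data")
        os.link(a, b)                                  # second name, same inode
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        links_after_link = os.stat(a).st_nlink         # refcount is now 2
        os.unlink(a)                                   # drop one name
        links_after_unlink = os.stat(b).st_nlink       # refcount back to 1
        with open(b) as f:
            survived = f.read()                        # data is still reachable
        return same_inode, links_after_link, links_after_unlink, survived
```

The data is freed only once the last name is removed, which matches the refcount rule above.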
Symbolic Links (Symlinks)
A special file that contains a file name
When the kernel encounters a symlink during a pathname lookup, it replaces the name of the link with its contents (the name of the target file) and restarts the pathname interpretation
Can point to files on another file system
Can point to any type of file (e.g., directory)
Can become a dangling pointer if the target file is deleted
Use more inodes than hard links (2 vs. 1)
Higher overhead than hard links for resolution
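The contrast with hard links can also be checked from user space. This sketch assumes a Linux system where the temporary directory supports symlinks; it shows that the link stores a name, has its own inode, and dangles once the target is removed.

```python
import os
import tempfile


def symlink_demo():
    """Create a symlink, then delete its target and check that the link dangles."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        link = os.path.join(d, "link")
        with open(target, "w") as f:
            f.write("data")
        os.symlink(target, link)
        stores_name = os.readlink(link) == target      # the link's content is a path
        # lstat examines the link itself; stat would follow it to the target
        distinct_inodes = os.lstat(link).st_ino != os.stat(target).st_ino
        os.unlink(target)
        # the link still exists, but following it now fails
        dangling = os.path.islink(link) and not os.path.exists(link)
        return stores_name, distinct_inodes, dangling
```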
Device Files
In Linux, devices can be accessed via special files, generally found under /dev
Two main types:
  Character: stream of bytes (keyboard, serial)
  Block: random access of blocks (hard disk, CD-ROM)
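The device type is recorded in the file's mode bits, so it can be read with an ordinary stat call. A small sketch using Python's standard `stat` module:

```python
import stat


def device_kind(mode):
    """Classify a file mode (st_mode) as a character device, block device, or neither."""
    if stat.S_ISCHR(mode):
        return "character"
    if stat.S_ISBLK(mode):
        return "block"
    return "not a device"
```

On a typical Linux system, `device_kind(os.stat("/dev/null").st_mode)` would be expected to report a character device, though which device files exist under /dev depends on the machine.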
The Virtual File System (VFS)
When we have so many file systems, we need to ensure that:
  User programs do not need to be file-system-aware
  File systems don't re-implement similar functionality
Solution: The VFS. A kernel layer that:
  Handles all system calls related to a standard Unix file system (all file systems have the same API)
  Handles generic activities (e.g., caching, readahead)
  Has generic file system “library” functions that can be used by any file system (e.g., fs/libfs.c)
Each specific file system implements a set of functions (operations vectors)
  Object-oriented programming in C
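The operations-vector idea is that generic code calls through a table of functions supplied by each file system. The kernel does this with C structs of function pointers; the sketch below models the same dispatch in Python with made-up file systems and a single hypothetical `read` operation.

```python
class FileSystemType:
    """An operations vector: a name plus the functions generic code may call."""

    def __init__(self, name, read):
        self.name = name
        self.read = read          # callable(path) -> bytes, supplied by the file system


# Two toy "file systems" (hypothetical, for illustration only)
def ramfs_read(path):
    return b"ram:" + path.encode()


def isofs_read(path):
    return b"iso:" + path.encode()


RAMFS = FileSystemType("ramfs", ramfs_read)
ISOFS = FileSystemType("isofs", isofs_read)


def vfs_read(fs, path):
    """Generic layer: knows nothing about the file system beyond its vector."""
    return fs.read(path)
```

The caller of `vfs_read` uses one API regardless of which implementation sits behind it, which is the property the VFS gives user programs.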
The Virtual File System (VFS)
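Figure: VFS architecture. In user space: Application. In the kernel: System call handler, VFS, Page Cache, the file systems ext3, isofs, and NFS, and below them Driver (Disk), Driver (CD-ROM), and Driver (Network). Also shown in the kernel: Scheduler, Memory management, Interrupt handler.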
Readahead
Takes advantage of the page cache
When a page is read, the VFS code may ask the file system to read the next several contiguous blocks
Hopefully, the next block read by the application will already be loaded into the page cache
Performed during:
  Sequential reads on files
  Directory reads
The VFS contains the logic to perform readahead effectively
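The effect of readahead on the number of disk requests can be estimated with a toy page cache. The sketch assumes a fixed readahead window and an unbounded cache, both simplifications; the kernel's actual logic adapts the window to the access pattern.

```python
def count_disk_requests(pages_read, readahead=0):
    """Count disk requests for a sequence of page reads with a fixed readahead window."""
    cache = set()
    requests = 0
    for page in pages_read:
        if page not in cache:
            requests += 1
            # one request fetches the missing page plus the next `readahead` pages
            cache.update(range(page, page + 1 + readahead))
    return requests
```

Reading 16 pages sequentially takes 16 requests without readahead and 4 with a window of 3, while a scattered pattern such as pages 0, 10, 20, 30 gains nothing from the extra pages.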
Example File System Operations
Figure: directory tree on ext3: / with subdirectories mnt, home, etc, …
Example File System Operations
Figure: ext3 tree (/ with subdirectories mnt, home, etc, …) with an xfs file system mounted at /home, containing avishay.
mount -t xfs /dev/sdb1 /home
The VFS mount operation:
1) Calls the xfs get_sb function to read the superblock from the partition
2) This function also reads the inode of the root directory
Note that performing a lookup on 'home' would have previously invoked ext3, but now it is xfs. Any files/directories in 'home' on ext3 will now be hidden by 'home' on xfs.
Example File System Operations
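Figure: ext3 tree (/ with subdirectories mnt, home, etc, …) with xfs mounted at /home (containing avishay) and isofs mounted at /mnt/cdrom (containing foo).
mount -t xfs /dev/sdb1 /home
mount -t isofs /dev/hdc1 /mnt/cdrom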
A similar sequence of events occurs here, this time mounting an isofs file system on a CD-ROM drive.
Example File System Operations
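Figure: ext3 tree (/ with subdirectories mnt, home, etc, …) with xfs mounted at /home (containing avishay/bar) and isofs mounted at /mnt/cdrom (containing foo).
mount -t xfs /dev/sdb1 /home
mount -t isofs /dev/hdc1 /mnt/cdrom
cp /mnt/cdrom/foo /home/avishay/bar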
Lookup operations will be performed on all 3 file systems. The copy operation will read from 'foo' (isofs) and write to 'bar' (xfs). The VFS determines which file system to invoke.
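How the right file system gets chosen for each path can be approximated by a longest-prefix match against the mount table from the example. This is a simplification made for illustration: the kernel crosses mount points during the component-by-component path walk rather than comparing strings.

```python
# Mount table from the example above: mount point -> file system type
MOUNTS = {"/": "ext3", "/home": "xfs", "/mnt/cdrom": "isofs"}


def fs_for(path):
    """Return the file system responsible for an absolute path (longest mount-point match)."""
    best = "/"
    for mount_point in MOUNTS:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) > len(best):
                best = mount_point
    return MOUNTS[best]
```

For the copy above, `fs_for("/mnt/cdrom/foo")` gives isofs for the read and `fs_for("/home/avishay/bar")` gives xfs for the write, while the leading components / and /mnt still belong to ext3.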
File System Layout
Some considerations:
  Minimize seeks between metadata and related data
  Minimize number of disk reads required to get to data
  Maximize readahead (sequential access)
  Recovery from disk corruption, power outage, etc.
  Management: fragmentation, compaction, etc.
Contiguous Allocation
Files are allocated contiguously on the disk
  Space for entire file must be requested in advance
  Search bit map or linked list to locate a space
Pros
  Fast sequential access
  Easy random access
Cons
  External fragmentation
  Hard to grow files: may have to move (large) files
  May need compaction
Figure: disk layout showing files A, B, C, D, and E, each stored contiguously.
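Searching the bitmap for a space can be sketched as a first-fit scan. First-fit is one possible policy chosen for the example, not necessarily what any given file system uses; the bitmap is a plain list where 1 means used and 0 means free.

```python
def first_fit(bitmap, n_blocks):
    """Return the start of the first run of n_blocks free blocks, or None.

    Assumes n_blocks >= 1.
    """
    run_start, run_len = 0, 0
    for i, used in enumerate(bitmap):
        if used:
            run_start, run_len = i + 1, 0   # run broken; restart after this block
        else:
            run_len += 1
            if run_len == n_blocks:
                return run_start
    return None
```

With `bitmap = [1, 0, 0, 1, 0, 0, 0, 1]`, a 3-block file fits at block 4, but a 4-block file cannot be placed even though 5 blocks are free. That is external fragmentation.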
Linked Files (Alto)
Each file is a linked list
  File header (like inode) points to first block on disk
  Each block points to the next
Pros
  Can grow files dynamically
  Free list is similar to a file
  No external fragmentation or need to move files
Cons
  Random access is slow: reaching block N requires following N pointers
  Even sequential access needs one seek per block
  Unreliable: losing one block means losing the rest
Figure: file header pointing to file block 1, with each block linked to the next through file block N.
File Allocation Table (FAT)
Table of “next pointers”, indexed by block
  Dentry points to 1st block of file
  Two copies of FAT, at the beginning of the volume
Pros
  Faster random access
  Cache the FAT and traverse it in memory
Cons
  FAT may be too large to cache, leading to long seeks
  Pointers for all files are interspersed in the FAT
  Need full table in memory, even for one file
Solution: indexed files
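Following a file's chain through the table can be sketched as below. The table is a plain list indexed by block number, and the end-of-chain marker is an arbitrary value chosen for the example rather than the on-disk FAT encoding.

```python
EOF = -1  # assumed end-of-chain marker for this sketch


def file_blocks(fat, first_block):
    """Follow the 'next' pointers from a file's first block to the end of its chain."""
    blocks = []
    block = first_block
    while block != EOF:
        blocks.append(block)
        block = fat[block]      # the table entry for a block names the next block
    return blocks


def block_for(fat, first_block, logical_index):
    """Random access still walks the chain, but in memory rather than on disk."""
    return file_blocks(fat, first_block)[logical_index]
```

Two files whose chains are 4 → 7 → 2 and 5 → 9 share one table, which illustrates why the pointers of all files end up interspersed.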
Single-Level Indexed Files
User declares maximum file size
A file header holds an array of pointers to disk blocks
Pros
  Random access is fast
  Better metadata caching than FAT
Cons
  Clumsy to grow beyond the limit
  Many seeks
Figure: file header holding pointers to disk blocks.
ext2: Block Groups
Disk layout: Boot block | Block group 0 | Block group 1 | ... | Block group n
Each block group: Superblock | Group descriptors | Data block bitmap | Inode bitmap | Inode table | Data blocks
Improved reliability
  Control structures are replicated
  Easy to recover the superblock
Improved performance
  Reduces the distance between the inodes and related data blocks
  It is possible to reduce the disk head seeks during I/O on files
ext2: Multi-Level Indexed Files
The inode contains 15 pointers:
  12 direct pointers
  13: 1-level indirect
  14: 2-level indirect
  15: 3-level indirect
Pros & Cons
  Favors small files
  Can grow
  Lots of seeking (somewhat limited by block groups)
ext3: same on-disk format plus journal (covered later)
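Figure: inode with pointers 1, 2, ..., 13, 14, 15. The direct pointers reference data blocks; pointers 13, 14, and 15 reach data blocks through one, two, and three levels of indirect blocks.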
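Why this scheme favors small files becomes clear when computing how many indirect blocks must be read to reach a given logical block. The sketch assumes 4KB blocks and 4-byte block pointers, so each indirect block holds 1024 pointers; neither value is stated on the slide.

```python
BLOCK_SIZE = 4096                     # assumed block size, in bytes
POINTER_SIZE = 4                      # assumed size of one block pointer, in bytes
PTRS = BLOCK_SIZE // POINTER_SIZE     # pointers per indirect block (1024 here)
N_DIRECT = 12                         # direct pointers in the inode


def indirection_level(logical_block):
    """Return 0 for a direct block, or 1..3 for the indirect level that covers it."""
    if logical_block < N_DIRECT:
        return 0
    remaining = logical_block - N_DIRECT
    for level in (1, 2, 3):
        capacity = PTRS ** level      # data blocks reachable through this level
        if remaining < capacity:
            return level
        remaining -= capacity
    raise ValueError("block is beyond the reach of the 15 pointers")


def max_addressable_blocks():
    """Blocks reachable through the pointer structure alone."""
    return N_DIRECT + PTRS + PTRS ** 2 + PTRS ** 3
```

Under these assumptions the first 48KB of a file (12 blocks) needs no extra reads, while blocks past roughly 4GB need three. The actual maximum file size may be lower than the pointer structure allows because of other limits.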
ext4/xfs/Btrfs: Extents & Trees
Extent: set of logically contiguous blocks within a file that are stored contiguously on disk
  Single ext4 extent: up to 128MB with 4KB block size
  Less metadata: only need to remember <1st logical block, # blocks, 1st physical block>
xfs and Btrfs store extents in B-tree variants
These are newer and very interesting Linux disk-based file systems and have become more “standard”
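Mapping a logical block through a list of extents can be sketched with a binary search over the <1st logical block, # blocks, 1st physical block> triples. A sorted Python list stands in here for the tree structures the real file systems use, and the sample extents are invented.

```python
from bisect import bisect_right

# Blocks in the largest extent implied by the figures above: 128MB / 4KB
MAX_EXTENT_BLOCKS = (128 * 2 ** 20) // 4096


def lookup_extent(extents, logical_block):
    """Map a logical block to a physical block, or None for a hole.

    extents: list of (first_logical, n_blocks, first_physical), sorted by first_logical.
    """
    starts = [e[0] for e in extents]
    i = bisect_right(starts, logical_block) - 1    # last extent starting at or before it
    if i >= 0:
        first_logical, n_blocks, first_physical = extents[i]
        if logical_block < first_logical + n_blocks:
            return first_physical + (logical_block - first_logical)
    return None
```

Three triples describe this 9-block sample file, where a per-block scheme would need nine pointers; `MAX_EXTENT_BLOCKS` works out to 32768.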
Log-structured File System
Will be covered separately tomorrow
File System Corruption
Some FS operations require multiple writes, which may not all complete (power failure, crash)
  The on-disk state will be invalid on next mount
Example: To write to a file, 3 main operations:
  1. Write data to disk block
  2. Update the free space map
  3. Update pointer from inode to block
With no help, detecting and recovering from errors requires examining all data structures
  In Linux, this is done by fsck (file system check)
  This was acceptable in the past, but takes too long for larger file systems
Journaling
Journal: a special file that logs the changes destined for the file system in a circular buffer
Idea: use a journal to log changes before they're committed to the file system to avoid metadata corruption
Examples: JFS/JFS2, ext3/4, XFS, ReiserFS
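Recovery with a journal can be sketched as replaying only those transactions whose commit record reached the log. The record format below is hypothetical and much simpler than any real on-disk journal; the "disk" is a dictionary from block name to contents.

```python
def recover(journal, disk):
    """Replay committed transactions from the journal onto the disk.

    journal: list of ("write", txid, block, value) or ("commit", txid) records,
    in the order they were logged.
    """
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, _, block, value = rec
            disk[block] = value            # redo the logged change
    return disk                            # uncommitted writes are simply ignored
```

If a crash happens after transaction 1 commits but before transaction 2 does, replay applies all of transaction 1 and none of transaction 2, so the structures stay consistent without scanning the whole disk.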
ext3 Journaling Modes
Writeback: Only metadata is journaled. Data is written independently. Preserves file system structure and avoids corruption, but files may contain stale data (like ext2 + fast fsck).
Ordered (default): Data written to disk before metadata transactions commit → no stale data blocks.
Journal: Journals all data and metadata, so data is written twice (same consistency guarantees as 'ordered', different performance).
References & Further Reading
References in this presentation refer to Linux 2.6.35: http://lxr.linux.no/#linux+v2.6.35/
Further reading:
  Linux Kernel Development (Love): Good for overview; 3rd edition recently published
    2nd edition: http://linuxkernel2.atw.hu/ (hopefully posted with the author's permission...)
  Understanding the Linux Kernel (Bovet & Cesati): Good for reference
  btrfs: http://lwn.net/Articles/342892/
Some of the content in these slides is taken from:
  http://www.cs.princeton.edu/courses/archive/fall09/cos318/lectures/FileLayout.pdf
  http://www.ntfs.com/fat-allocation.htm
  http://www.ibm.com/developerworks/library/l-journaling-filesystems/index.html
  Tel-Aviv University advanced storage course slides by Ronen Kat and Ohad Rodeh
  Various Wikipedia articles
  http://static.usenix.org/event/usenix05/tech/general/full_papers/prabhakaran/prabhakaran_html/main.html