Ext34 Disk Layout

  • 8/2/2019 Ext34 Disk Layout

    Kalpak Shah, Lustre Group

    Ext3/4 on-disk layout

    GEEP - geeksofpune.org

    AGENDA

    Layout of EXT3/4
    Essential on-disk data structures
    New features in ext4
        Extents, uninit_bg, nanosecond timestamps, 48-bit support, preallocation, mballoc, flex_bg, journal checksums
        Their effects on the on-disk layout
    Crash recovery
    Latest filesystem design layouts

    Basic layout of EXT2/3/4 partition

    All block groups are the same size and are stored sequentially.
    The superblock and group descriptors are duplicated in multiple block groups, as determined by the SPARSE_SUPER feature.
    Block sizes from 1KB up to 8KB are supported.
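    The slide does not spell out which groups carry the duplicates. As a sketch, assuming the usual sparse_super rule (backups in groups 0 and 1 and in groups whose number is a power of 3, 5 or 7), the check could look like this. The function names are illustrative and not taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if n is an exact power of base (n == 1 counts as base^0). */
static bool is_power_of(uint32_t n, uint32_t base)
{
    assert(base > 1);
    if (n == 0)
        return false;
    while (n % base == 0)
        n /= base;
    return n == 1;
}

/* Assumed sparse_super rule: groups 0, 1 and powers of 3, 5 and 7
 * hold a copy of the superblock and group descriptors. */
bool group_has_super_backup(uint32_t group)
{
    if (group <= 1)
        return true;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}
```

    Under this rule the first groups with a copy are 0, 1, 3, 5, 7, 9, 25, 27 and 49, so the number of copies grows only logarithmically with the filesystem size.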

    Creating an ext3 fs

        mkfs.ext3 -b 4096 -I 512 -i 8192 -J size=256 /dev/sda1

    Blocksize consideration (-b)
    Number of inodes and inode sizes (-i bytes per inode, -I inode size)
    Journal size (-J size, in MB)

    For example, consider an 8GB ext3 fs with a 4KB blocksize. In this case, each 4KB block bitmap describes 32K data blocks, that is, 128MB. Therefore 64 block groups will be present in this fs.
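    The arithmetic in this example can be checked with a few lines of C. This is only a sketch of the calculation; the function names are made up for illustration, and it assumes one bitmap block per group with one bit per block.

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block holds block_size * 8 bits, one bit per block in the group. */
uint64_t blocks_per_group(uint32_t block_size)
{
    assert(block_size >= 1024);
    return (uint64_t)block_size * 8;
}

/* Bytes of disk space described by a single block group. */
uint64_t group_size_bytes(uint32_t block_size)
{
    return blocks_per_group(block_size) * block_size;
}

/* Number of block groups, rounding up for a partial last group. */
uint64_t group_count(uint64_t fs_bytes, uint32_t block_size)
{
    uint64_t group_bytes = group_size_bytes(block_size);
    return (fs_bytes + group_bytes - 1) / group_bytes;
}
```

    For an 8GB filesystem with 4KB blocks, `blocks_per_group(4096)` gives 32768, `group_size_bytes(4096)` gives 128MB, and `group_count(8ULL << 30, 4096)` gives 64.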

    EXT3 superblock

    The ext3 superblock is stored in an ext3_super_block structure. Some important fields are listed here:

        s_inodes_count, s_blocks_count
        s_free_blocks_count, s_free_inodes_count
        s_inode_size
        s_blocks_per_group, s_inodes_per_group
        s_mnt_count, s_max_mnt_count
        s_feature_{compat, incompat, ro_compat}
        s_uuid, s_volume_name
        s_journal_inum, s_journal_dev
        s_state, s_errors
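    As a rough illustration of how some of these fields are read, the sketch below decodes a few of them from a raw 1024-byte superblock buffer (the primary superblock starts at byte offset 1024 of the partition). The byte offsets, the 0xEF53 magic value and the s_log_block_size field are assumptions taken from the ext2/3 on-disk format, not from the slide. `struct sb_summary` and `parse_superblock` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT3_SUPER_MAGIC 0xEF53u

struct sb_summary {
    uint32_t inodes_count, blocks_count;
    uint32_t free_blocks_count, free_inodes_count;
    uint32_t block_size, blocks_per_group, inodes_per_group;
    uint16_t magic, inode_size;
};

/* On-disk fields are little-endian regardless of host byte order. */
static uint16_t le16_at(const uint8_t *p, size_t off)
{
    return (uint16_t)(p[off] | (p[off + 1] << 8));
}

static uint32_t le32_at(const uint8_t *p, size_t off)
{
    return (uint32_t)p[off] | ((uint32_t)p[off + 1] << 8) |
           ((uint32_t)p[off + 2] << 16) | ((uint32_t)p[off + 3] << 24);
}

/* Returns 0 on success, -1 if the buffer does not look like a superblock. */
int parse_superblock(const uint8_t *sb, struct sb_summary *out)
{
    assert(sb != NULL && out != NULL);
    uint32_t log_block_size = le32_at(sb, 24);   /* s_log_block_size */

    out->magic = le16_at(sb, 56);                /* s_magic */
    if (out->magic != EXT3_SUPER_MAGIC || log_block_size > 6)
        return -1;
    out->inodes_count      = le32_at(sb, 0);     /* s_inodes_count */
    out->blocks_count      = le32_at(sb, 4);     /* s_blocks_count */
    out->free_blocks_count = le32_at(sb, 12);    /* s_free_blocks_count */
    out->free_inodes_count = le32_at(sb, 16);    /* s_free_inodes_count */
    out->block_size        = 1024u << log_block_size;
    out->blocks_per_group  = le32_at(sb, 32);    /* s_blocks_per_group */
    out->inodes_per_group  = le32_at(sb, 40);    /* s_inodes_per_group */
    out->inode_size        = le16_at(sb, 88);    /* s_inode_size */
    return 0;
}
```

    Decoding byte by byte avoids depending on the host's endianness or struct padding. A real tool would more likely use the headers shipped with e2fsprogs.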

    Group Descriptors

    Each block group has its own group descriptor, represented by the ext3_group_desc structure, which has these fields:
        bg_block_bitmap
        bg_inode_bitmap
        bg_inode_table
        bg_free_{blocks,inodes}_count
        bg_used_dirs_count

    Most fields are used by the inode/block allocator.
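    To show where a descriptor lives on disk, here is a small sketch that computes its byte offset in the primary group descriptor table. It assumes the table starts in the block right after the superblock's block, that each ext3 descriptor is 32 bytes, and that `first_data_block` comes from the superblock field s_first_data_block (a real field, but one not listed on the slides). The function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define EXT3_GROUP_DESC_SIZE 32u   /* assumed size of struct ext3_group_desc */

/* Byte offset of the descriptor for `group` in the primary descriptor table.
 * first_data_block is 1 for 1KB blocks and 0 for larger block sizes. */
uint64_t group_desc_offset(uint32_t block_size, uint32_t first_data_block,
                           uint32_t group)
{
    assert(block_size >= 1024);
    uint64_t table_start = ((uint64_t)first_data_block + 1) * block_size;
    return table_start + (uint64_t)group * EXT3_GROUP_DESC_SIZE;
}
```

    With 4KB blocks the table starts at byte 4096, so the descriptor of group 3 would sit at offset 4192. With 1KB blocks the table starts at byte 2048.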

    EXT3/4 inode

    The on-disk ext3/4 inode structure has these fields:
        i_mode, i_uid, i_gid
        i_size, i_blocks
        i_atime, i_mtime, i_dtime, i_ctime
        i_links_count
        i_block[EXT2_N_BLOCKS] (15 entries)
        i_version (for NFS)
        i_file_acl
        i_dir_acl (i_size_high)

    New in ext4:
        i_extra_isize
        i_size_hi, i_size_high, l_i_file_acl_high
        i_{ctime,mtime,atime,crtime}_extra
        i_version_hi
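    Inodes are stored in per-group inode tables (bg_inode_table in the group descriptor), so finding one on disk is a matter of integer arithmetic. The sketch below assumes the usual convention that inode numbers start at 1. `struct inode_location` and `locate_inode` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct inode_location {
    uint32_t group;         /* block group that owns the inode */
    uint32_t index;         /* slot within that group's inode table */
    uint64_t table_offset;  /* byte offset from the start of the inode table */
};

/* inodes_per_group and inode_size come from the superblock
 * (s_inodes_per_group, s_inode_size). */
struct inode_location locate_inode(uint32_t ino, uint32_t inodes_per_group,
                                   uint16_t inode_size)
{
    assert(ino >= 1 && inodes_per_group > 0);
    struct inode_location loc;
    loc.group = (ino - 1) / inodes_per_group;
    loc.index = (ino - 1) % inodes_per_group;
    loc.table_offset = (uint64_t)loc.index * inode_size;
    return loc;
}
```

    For example, with 8192 inodes per group and 256-byte inodes, inode 2 (the root directory) is slot 1 of group 0, at byte offset 256 in that group's inode table.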

    Directory layout

    EXT3/4 implements directories using a special kind of file whose data blocks store filenames along with corresponding inode numbers. Such data blocks basically contain structures of type ext3_dir_entry_2. This structure contains the following fields:
        Inode number
        Directory entry length
        Name length
        Filetype
        Name

    Directory entries are stored using 2-level hashing for fast retrieval.
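    As an illustration of the entry format, the sketch below does a plain linear scan of one directory block, ignoring the hash index. It assumes the ext3_dir_entry_2 layout of a 4-byte inode number, a 2-byte entry length, a 1-byte name length and a 1-byte file type followed by the name, all little-endian. `dir_block_lookup` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan one directory block for `name`.
 * Returns the inode number, or 0 if the name is absent or the block is corrupt. */
uint32_t dir_block_lookup(const uint8_t *block, size_t block_size,
                          const char *name)
{
    assert(block != NULL && name != NULL);
    size_t want = strlen(name);
    size_t off = 0;

    while (off + 8 <= block_size) {
        const uint8_t *e = block + off;
        uint32_t inode = (uint32_t)e[0] | ((uint32_t)e[1] << 8) |
                         ((uint32_t)e[2] << 16) | ((uint32_t)e[3] << 24);
        uint16_t rec_len = (uint16_t)(e[4] | (e[5] << 8));
        uint8_t name_len = e[6];           /* e[7] is the file type */

        /* Reject entries that would loop forever or run past the block. */
        if (rec_len < 8 || off + rec_len > block_size ||
            (size_t)8 + name_len > rec_len)
            return 0;
        /* inode 0 marks an unused entry. */
        if (inode != 0 && name_len == want && memcmp(e + 8, name, want) == 0)
            return inode;
        off += rec_len;                    /* entry length links to the next entry */
    }
    return 0;
}
```

    The entry length field is what makes deletion cheap: an entry can be removed by folding its length into the previous entry, without moving any other data in the block.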

    EXT4 features - EXTENTS

    Replaces the traditional indirect block mapping scheme, which causes high metadata overhead and poor performance with large files.
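    To make the overhead of the indirect scheme concrete, the sketch below reports how many levels of indirection are needed to reach a given logical block. It assumes the classic ext2/3 layout of the 15 i_block entries (12 direct pointers, then one single, one double and one triple indirect pointer) and 32-bit block numbers. `mapping_level` is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

#define EXT3_NDIR_BLOCKS 12u   /* direct pointers in i_block[] */

/* Returns 0 for a direct block, 1/2/3 for single/double/triple indirect,
 * or -1 if the logical block cannot be mapped at all. */
int mapping_level(uint64_t lblk, uint32_t block_size)
{
    assert(block_size >= 1024);
    uint64_t ptrs = block_size / 4;   /* 32-bit block numbers per indirect block */

    if (lblk < EXT3_NDIR_BLOCKS)
        return 0;
    lblk -= EXT3_NDIR_BLOCKS;
    if (lblk < ptrs)
        return 1;
    lblk -= ptrs;
    if (lblk < ptrs * ptrs)
        return 2;
    lblk -= ptrs * ptrs;
    if (lblk < ptrs * ptrs * ptrs)
        return 3;
    return -1;
}
```

    With 4KB blocks, everything past the first 48KB of a file needs at least one extra metadata block read, and everything past roughly 4MB needs two, which is the overhead that extents avoid for contiguous files.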

    An extent is a single descriptor that represents a range of contiguous blocks:

        struct ext4_extent {
            __le32 ee_block;    /* first logical block */
            __le16 ee_len;      /* no of blocks */
            __le16 ee_start_hi; /* high 16 bits of phy blk */
            __le32 ee_start_lo; /* low 32 bits of phy blk */
        };

    Extents tree leads to efficient lookups and improves performance on sequential IO as well as mail server workloads.

    Ext4 supports both extents and indirect mapping schemes, and files can be converted between the two formats.
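    A minimal sketch of how such a descriptor could be used to map a logical block to a physical one is shown below. It works on a host-endian copy of the fields. The convention that an ee_len above 32768 marks an uninitialized (preallocated) extent is an assumption based on the kernel's extent code, and the helper names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian copy of the on-disk ext4_extent fields. */
struct extent {
    uint32_t ee_block;
    uint16_t ee_len;
    uint16_t ee_start_hi;
    uint32_t ee_start_lo;
};

#define EXT_INIT_MAX_LEN 32768u   /* assumed: larger ee_len means uninitialized */

/* Combine the 16 high bits and 32 low bits into a 48-bit physical block. */
uint64_t extent_pblock(const struct extent *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}

bool extent_is_uninit(const struct extent *ex)
{
    return ex->ee_len > EXT_INIT_MAX_LEN;
}

uint32_t extent_len(const struct extent *ex)
{
    return extent_is_uninit(ex) ? ex->ee_len - EXT_INIT_MAX_LEN : ex->ee_len;
}

/* Physical block for lblk, or 0 if lblk lies outside this extent. */
uint64_t extent_map(const struct extent *ex, uint32_t lblk)
{
    assert(ex != NULL);
    if (lblk < ex->ee_block || lblk - ex->ee_block >= extent_len(ex))
        return 0;
    return extent_pblock(ex) + (lblk - ex->ee_block);
}
```

    One 12-byte descriptor covers up to 32768 contiguous blocks (128MB with 4KB blocks), where the indirect scheme would need one 4-byte pointer per block.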

    EXTENTS

    EXT4 features

    UNINIT_BG
        For very large filesystems, e2fsck times are starting to become unacceptable.
        The uninitialized block groups feature uses flags in the group descriptor to indicate whether the block group is initialized or not. e2fsck can simply skip block groups that are marked as uninitialized.
        The flags marking the block group uninitialized and the high watermark of in-use inodes are checksummed so that corruption can be detected.
        We have seen a 2-10x speedup for e2fsck in many cases.

    Nanosecond timestamp support
        Using the i_{atime, ctime, mtime, crtime}_extra fields.
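    The slide does not describe how the extra fields are encoded. As a sketch, assuming the layout used by the kernel (the low 2 bits of each 32-bit extra field extend the seconds counter beyond 32 bits, and the upper 30 bits hold nanoseconds), decoding and encoding could look like this. The function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define EPOCH_BITS 2u
#define EPOCH_MASK 3u

/* Full seconds value from the classic 32-bit field plus the extra field. */
int64_t ts_seconds(uint32_t base, uint32_t extra)
{
    /* The base field is a signed 32-bit count; epoch bits add multiples of 2^32. */
    return (int64_t)(int32_t)base + ((int64_t)(extra & EPOCH_MASK) << 32);
}

/* Nanosecond part stored in the upper 30 bits of the extra field. */
uint32_t ts_nanoseconds(uint32_t extra)
{
    return extra >> EPOCH_BITS;
}

/* Build an extra field from an epoch count (0..3) and nanoseconds. */
uint32_t ts_encode_extra(uint32_t epoch, uint32_t nsec)
{
    assert(epoch <= EPOCH_MASK && nsec < 1000000000u);
    return (nsec << EPOCH_BITS) | epoch;
}
```

    Thirty bits are enough for the nanosecond part because 10^9 is less than 2^30, which leaves the two spare bits for extending the seconds range.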

    EXT4 features

    Large FS support
        Ext3 uses 32-bit block numbers, so with a 4KB blocksize the filesystem is limited to a maximum size of 16TB.
        Ext4 uses 48-bit block numbers. All on-disk structures needed to be changed to support the 48-bit block number.

    Persistent preallocation (fallocate support)
        Apps such as large databases often write zeros to a file for guaranteed and contiguous space reservation.
        Ext4 improves this scenario by skipping the zero-out and marking the extents as uninitialized instead.
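    The size limits quoted on this slide follow directly from the width of the block number. The sketch below computes the maximum addressable size in TB (taken as 2^40 bytes) for a given block-number width and block size. `max_fs_tib` is a hypothetical helper, and the result is a purely arithmetic limit that ignores any other constraints in the tools or kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum addressable filesystem size in units of 2^40 bytes.
 * size = 2^block_bits * block_size, computed as
 * (block_size / 2^10) * 2^(block_bits - 30) to avoid 64-bit overflow. */
uint64_t max_fs_tib(unsigned block_bits, uint32_t block_size)
{
    assert(block_bits >= 30 && block_bits <= 48);
    assert(block_size >= 1024 && block_size % 1024 == 0);
    return (uint64_t)(block_size / 1024) << (block_bits - 30);
}
```

    `max_fs_tib(32, 4096)` gives 16, the ext3 limit above, while `max_fs_tib(48, 4096)` gives 1048576 TB, that is, 1EB.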

    EXT4 features

    Multi-block allocator
        Allocates multiple blocks at once using a buddy data structure.
        Includes inode and group preallocation.
        Includes special allocation modes for small files and GOAL blocks.

    flex_bg
        This feature groups metadata (inode bitmap, block bitmap and inode table) from a series of groups at the beginning of a flex group in order to improve performance during heavy metadata operations.
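    As a small illustration of the flex_bg grouping, the sketch below maps a block group to its flex group. It assumes the number of groups per flex group is a power of two whose log2 is kept in the superblock (the s_log_groups_per_flex field in ext4). The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the flex group that contains block group `group`. */
uint32_t flex_group_of(uint32_t group, uint8_t log_groups_per_flex)
{
    assert(log_groups_per_flex < 32);
    return group >> log_groups_per_flex;
}

/* First block group of a flex group; the packed bitmaps and inode
 * tables of all member groups are placed starting here. */
uint32_t flex_first_group(uint32_t flex, uint8_t log_groups_per_flex)
{
    assert(log_groups_per_flex < 32);
    return flex << log_groups_per_flex;
}
```

    With 16 groups per flex group (a log2 value of 4), block group 17 belongs to flex group 1, whose metadata is packed starting at block group 16.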

    Crash recovery - JBD/2

    First a copy of the blocks to be written is stored in the journal. Then, when the I/O transfer to the journal is completed (commit block is written), the blocks are written (replayed) in the filesystem.

    Journaling modes:
        Journal: All data and metadata is journaled.
        Ordered: Only metadata changes are journaled. Data blocks are written to disk before the metadata to avoid data corruption.
        Writeback: Only metadata is journaled, with no ordering between data and metadata writes. Fastest mode.

    Journal checksums
        All blocks in a transaction are checksummed and the checksum is stored in the commit header. While replaying the transaction (either by e2fsck or ext4), this checksum ensures that corrupt or partial transactions are not written to disk.
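    The replay decision described for journal checksums can be sketched as follows. This is a simplified model, not the JBD2 on-disk format: it uses a standard reflected CRC-32 as a stand-in for whatever checksum the journal actually stores, and `transaction_is_intact` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard reflected CRC-32 (polynomial 0xEDB88320). Pass 0 as the initial
 * value, or a previous result to continue over more data. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Recompute the checksum over every block of a transaction and compare it
 * with the value recorded at commit time. Replay only if they match. */
bool transaction_is_intact(const uint8_t *const *blocks, size_t nblocks,
                           size_t block_size, uint32_t committed_crc)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < nblocks; i++)
        crc = crc32_update(crc, blocks[i], block_size);
    return crc == committed_crc;
}
```

    A transaction whose commit block reached the disk before all of its data blocks would fail this check, so recovery would stop there instead of copying stale journal contents into the filesystem.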

    Latest filesystem design layouts

    Trees
        Latest filesystems like ZFS, Btrfs, Tux3 use indexed trees for efficient directory layouts, blocks, objects (inodes, EAs) and snapshots. With 64-bit or 128-bit pointers, we practically end all limits imposed on filesystems: number of inodes, EA sizes, number of files within directories.

    Checksumming
        All data/metadata is checksummed for early detection and possible correction.

    In-built VM
        Volume manager and filesystem are tightly coupled to take advantage of mirroring and RAID-like functionality.

    In-built encryption, compression

    QUESTIONS?