XFS In Rapid Development · Brian Foster, Carlos Maiolino, Chandra Seetharaman, Eric Sandeen, Jan Kara, Jeff Liu, Mark Tinguely, •


XFS In Rapid Development

Jeff Liu <[email protected]>


"We have many requests to provide a supported option for the XFS file system on Oracle Linux" – Oracle Linux Blog, Feb 28, 2013


About This Talk

• Introduction

- About XFS

- XFS Development Community

• How Fast XFS Is Going

- Kernel changes (> Linux 3.0)

- User space Programs

- XFS Test Suite

• Upcoming Features

- Kernel and user space

- Preview of the self describing metadata


About XFS

• Full 64-bit journaling file system

• Well known for high performance and scalability

• Maximum filesystem size/file size: 16 EiB/8 EiB

• Variable block sizes: 512 bytes to 64 KiB

• Freeze/thaw to support volume-level snapshots - xfs_freeze(8)

• Online filesystem/file defragmentation - xfs_fsr(8)

• Online filesystem resize – xfs_growfs(8)

• Internal log space/external log volume

• Realtime subvolume - provides very deterministic data rates suitable for media streaming applications

XFS Development Community

• Developers From Corporations

- SGI, Red Hat, Oracle, SUSE, IBM

• Main Contributors – In alphabetical order

Dave Chinner, Christoph Hellwig - Preeminent Individual Contributors

Brian Foster, Carlos Maiolino, Chandra Seetharaman, Eric Sandeen,

Jan Kara, Jeff Liu, Mark Tinguely, <leave the seat of honour open for you>

• Maintainer

Ben Myers @SGI

• Join us via the mailing list ([email protected]) and the #xfs IRC channel on irc.freenode.net

• Newcomers are always welcome!

How Fast XFS Is Going

The statistics of code changes between Linux v3.0 - v3.10-rc1 (Jul 21 2011 - May 11 2013)

[Figure: bar chart comparing Ext4 with JBD2, XFS, and Btrfs over Linux v3.0 ~ v3.10-rc1. Bar groups: files changed, insertions, deletions. Y-axis: the number of files changed, insertions and deletions, 0 to 50000.]

git diff --stat --minimal -C -M v3.0..v3.10-rc1 -- fs/[btrfs|xfs|ext4 with jbd2]


• XFS changes were made up of

- Improvements – performance/scalability improvements, code base refactoring

- New features – anything new

- Bug fixes

- Misc – trivial fixes, code style adjustments, dead code cleanups


[Figure: pie chart of the proportion of XFS kernel changes between Linux 3.0 and Linux 3.10-rc1, based on the number of patches. Categories: Improvement, New feature, Bug fix, Misc.]


[Figure: pie chart of the proportion of XFS kernel changes between Linux 3.0 and Linux 3.10-rc1, based on the lines changed (+/-). Categories: Improvement, New feature, Bug fix, Misc.]


• Xfsprogs v3.1.6 ~ v3.1.11 (Oct 11 2011 ~ May 09 2013)

- 15 Contributors

- 106 patches

$ git diff --stat --minimal -C -M v3.1.6 v3.1.11 | grep changed

108 files changed, 11113 insertions(+), 11418 deletions(-)


• XFS test suite - xfstests

- A generic test tool for Linux local filesystems

- 300+ test cases overall

- 170+ special test cases for XFS

• Test cases are well-organized for different filesystems

$ ls -l xfstests/tests/

btrfs/ ext4/ generic/ Makefile shared/ udf/ xfs/

Speeding Up Direct-IO R/W on High-IOPS Devices

• XFS inode locking modes, e.g. shared/exclusive

- The naming convention is inherited from SGI IRIX

- Equivalent to the read/write lock modes on Linux

• Issues faced before Linux 3.2

- The exclusive lock covers too wide a range

- Concurrent direct-IO reads are serialized on the page cache check

- Exclusive lock mode is used for direct-IO writes by default


• Solutions

- Use the shared lock for direct-IO reads; take the exclusive lock only if page invalidation is needed

- Use the shared lock for direct-IO writes by default; take the exclusive lock during IO submission only if extent allocation is required
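The shared-by-default locking described above can be sketched in user space. This is only an illustration under stated assumptions, not the kernel implementation: the IOLock and Inode classes, the dio_read/dio_write helpers and the modes trace are hypothetical names, and the lock makes no fairness guarantees.

```python
import threading


class IOLock:
    """Minimal shared/exclusive lock (illustration only, no fairness)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def lock_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def lock_excl(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def demote(self):
        """Turn a held exclusive lock into a shared one without a gap."""
        with self._cond:
            self._writer = False
            self._readers += 1
            self._cond.notify_all()

    def unlock(self):
        with self._cond:
            if self._writer:
                self._writer = False
            else:
                self._readers -= 1
            self._cond.notify_all()


class Inode:
    def __init__(self):
        self.iolock = IOLock()
        self.cached_pages = set()  # page indexes present in the page cache
        self.modes = []            # lock modes taken, recorded for the demo


def dio_read(inode, read_from_disk):
    """Direct-IO read: shared lock, exclusive only to invalidate pages."""
    inode.iolock.lock_shared()
    inode.modes.append("shared")
    if inode.cached_pages:
        inode.iolock.unlock()
        inode.iolock.lock_excl()
        inode.modes.append("exclusive")
        inode.cached_pages.clear()   # stands in for page invalidation
        inode.iolock.demote()
        inode.modes.append("shared")
    try:
        return read_from_disk()
    finally:
        inode.iolock.unlock()


def dio_write(inode, needs_allocation, submit):
    """Direct-IO write: shared lock unless extent allocation is required."""
    if needs_allocation:
        inode.iolock.lock_excl()
        inode.modes.append("exclusive")
    else:
        inode.iolock.lock_shared()
        inode.modes.append("shared")
    try:
        return submit()
    finally:
        inode.iolock.unlock()
```

With no cached pages and no allocation, every reader and writer in this sketch only ever takes the shared mode, so concurrent direct-IO requests do not serialize on the lock.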


FIO Scenario

Storage formatted with default options

Fio version 2.1

direct=1
rw=randrw
bs=4k
size=10G
# numjobs was also run with 20, 40 and 80
numjobs=10
runtime=120
thread
ioengine=psync

Simplified output of xfs_info(8)

Metadata: isize=256 agcount=4 agsize=937408 blks sectsz=512

Data: bsize=4096 blocks=3749632 sunit=0 swidth=0 blks

Log: internal bsize=4096 blocks=2560 version=2

[Figure: XFS read IOPS on a SATA3 SSD, vanilla 3.7.0 vs 2.6.39 in delaylog mode. X-axis: threads (10, 20, 40, 80). Y-axis: input/output operations per second, 0 to 14000. Series: 2.6.39, 3.7.0.]

[Figure: XFS write IOPS on a SATA3 SSD, vanilla 3.7.0 vs 2.6.39 in delaylog mode. X-axis: threads (10, 20, 40, 80). Y-axis: input/output operations per second, 0 to 14000. Series: 2.6.39, 3.7.0.]

Sync Story

• Improve concurrency for fsync(2) on files

- Unlock inode before the log force

• Optimizations for fsync(2) on directories

- Directories are only updated transactionally

- No file data needs to be flushed

- Disk caches do not have to be flushed except as part of a transaction commit

• Improved sync behavior in the face of aggressive dirtying

- XFS writes data out itself twice per filesystem sync, overriding the livelock protection in the core writeback code path

• The xfssyncd workqueue was removed; instead:

- New dedicated workqueue for inode reclaim

- New dedicated workqueue for log operations

- The remaining sync work is only the periodic log work controlled by the xfssyncd_centisecs sysctl

Efficient Sparse File Handling

• SEEK_DATA/SEEK_HOLE options to lseek(2)

- Derived from Solaris ZFS

- Neater call interface than FIEMAP ioctl(2)

• Use scenarios

- cp(1), GNU tar(1), etc...

- Virtual machine image (Xen, KVM) backup

- Sparse file detection
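A minimal sketch of how a copy or backup tool might walk the data regions of a sparse file with SEEK_DATA/SEEK_HOLE, here through Python's os.lseek wrapper around lseek(2). The helper name data_segments is hypothetical. On a file system without native support, Linux typically reports the whole file as a single data region, so the loop still terminates correctly.

```python
import errno
import os


def data_segments(fd):
    """Yield (start, end) byte ranges that hold data, skipping holes."""
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            # Next offset at or after `offset` that contains data.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as err:
            if err.errno == errno.ENXIO:  # only a hole remains up to EOF
                break
            raise
        # End of that data region; end of file counts as a hole.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end
        offset = end
```

A sparse-aware copy would then read and write only the returned ranges and recreate the holes by seeking or truncating in the destination, instead of reading zeros for every hole.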



• Refinement for unwritten extents

• Create a sparse file with unwritten extents mixed with data and holes

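#!/bin/bash

# Preallocate 10G of unwritten extents
xfs_io -F -f -c 'falloc 0 10G' /xfs/sparse

# Write 500m of data at offsets of 0, 30, 60, 90 and 120 MiB
for i in $(seq 0 30 120); do
    offset=$(($i * $((1 << 20))))
    xfs_io -c "pwrite $offset 500m" /xfs/sparse
done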


• Layout of the created sparse file

$ filefrag -v sparse
Filesystem type is: 58465342
File size of sparse is 10737418240 (2621440 blocks, blocksize 4096)
 ext  logical physical expected  length flags
   0        0 43547551           151040
   1   151040 43698591          1946111 unwritten
   2  2097151 43008572 45644702  524289 unwritten,eof
sparse: 2 extents found


[Figure: sparse file copy time with and without the unwritten extents refinement, via xfstests/seek_copy_test on a laptop with a normal SATA disk. Bars: Improved, Non-improved. Y-axis: time in seconds, 0 to 140.]

Quota Improvements

• XFS disk quota supports

- User quota

- Group quota

- Project quota – per directory quota (limit disk quota per directory)


• Why did searching tens of thousands of in-memory dquots scale badly?

- User/group/project dquots were stored in a global hash table shared between file systems

• A hash table is O(n) at worst for search/insert/delete, while a radix tree is O(k) at worst for insertion and deletion, where k is the key length

• Solutions

- Replace the global hash tables with a per-filesystem radix tree

- Replace the global dquot LRU lists with per-filesystem lists

- Remove the global xfs_Gqm structure
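To make the O(k) bound concrete, here is a small sketch of a fixed-depth radix tree keyed by a 32-bit ID, with one tree per file system instead of one shared table. It is not the kernel's radix tree: the RadixTree class, the 64-way fan-out and the mount point names are assumptions chosen for illustration.

```python
BITS_PER_LEVEL = 6                        # 64 slots per node (assumed)
FANOUT = 1 << BITS_PER_LEVEL
KEY_BITS = 32                             # IDs modelled as 32-bit integers
LEVELS = -(-KEY_BITS // BITS_PER_LEVEL)   # ceil(32 / 6) = 6 levels


class RadixTree:
    """Fixed-depth radix tree: every operation visits exactly LEVELS nodes."""

    def __init__(self):
        self.root = [None] * FANOUT

    @staticmethod
    def _slots(key):
        if not 0 <= key < (1 << KEY_BITS):
            raise ValueError("key out of range")
        # Split the key into 6-bit chunks, most significant chunk first.
        return [(key >> (level * BITS_PER_LEVEL)) & (FANOUT - 1)
                for level in reversed(range(LEVELS))]

    def insert(self, key, value):
        *inner, leaf = self._slots(key)
        node = self.root
        for slot in inner:
            if node[slot] is None:
                node[slot] = [None] * FANOUT
            node = node[slot]
        node[leaf] = value

    def lookup(self, key):
        node = self.root
        for slot in self._slots(key):
            node = node[slot]
            if node is None:
                return None
        return node

    def delete(self, key):
        """Remove and return the value; empty interior nodes are kept."""
        *inner, leaf = self._slots(key)
        node = self.root
        for slot in inner:
            node = node[slot]
            if node is None:
                return None
        value, node[leaf] = node[leaf], None
        return value


# One tree per file system, so a lookup never contends with other mounts.
user_dquots = {"/mnt/a": RadixTree(), "/mnt/b": RadixTree()}
user_dquots["/mnt/a"].insert(1000, "dquot for uid 1000")
```

The cost of a lookup depends only on the number of levels (the key length), not on how many dquots are cached, which is the property the slide contrasts with a hash table's worst case.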


Fighting With Process 8K Stack Space Limitation

• 8K process stack space for x86_64 in Linux 2.6 by default

- Every process has a dedicated kernel stack

- Kernel stacks have a fixed size and cannot be expanded as required

- Cannot be swapped

• Extreme stack use in the Linux VM/VFS call chain

• The old problems for XFS

- Significant issues with the amount of stack that allocation in XFS uses, especially in memory reclaim situations (writeback path)

• Buffer cache miss that triggers I/O vs CPU cache miss

• Solution

- Reduce stack usage in the allocation call chain, e.g. delayed allocation

- Move all allocations to a new worker thread, combined with a completion

- Avoid the context switch overhead if an allocation request comes in with a large stack
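The worker-plus-completion pattern above can be sketched in user space: the caller hands the deep call chain to a thread that starts with a fresh stack and blocks on a completion until the result is ready. The names run_on_worker and allocate, the "alloc-worker" thread and the stack_is_tight flag are hypothetical. The direct-call shortcut reflects one reading of the last bullet (skip the hand-off when enough stack is left), which is an interpretation rather than something the slide spells out.

```python
import queue
import threading

_work = queue.Queue()


def _worker():
    # Runs each queued call on this thread's own, nearly empty stack.
    while True:
        func, args, result, done = _work.get()
        try:
            result["value"] = func(*args)
        except Exception as exc:   # hand the failure back to the caller
            result["error"] = exc
        finally:
            done.set()             # the "completion" the caller waits on


threading.Thread(target=_worker, daemon=True, name="alloc-worker").start()


def run_on_worker(func, *args):
    """Queue func on the worker thread and block until it has finished."""
    result, done = {}, threading.Event()
    _work.put((func, args, result, done))
    done.wait()
    if "error" in result:
        raise result["error"]
    return result["value"]


def allocate(func, *args, stack_is_tight=True):
    """Only pay for the thread hand-off when the caller is short on stack."""
    if not stack_is_tight:
        return func(*args)
    return run_on_worker(func, *args)
```

The caller still waits synchronously, so the behaviour is unchanged; what changes is which stack the deep call chain consumes, at the price of two context switches per request.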


Bounds Checking Enabled XFS Kernel

• Alternative CONFIG_XFS_WARN Support

- Depends on XFS_FS && !XFS_DEBUG

- Converts ASSERT() checks to WARN_ON(1)

- Does not modify algorithms

- Does not cause kernel to panic on non-fatal errors

- Makes it easier to find strange "out of bounds" problems

- Already turned on in Fedora kernel-debug packages

• Suggest applying this feature for other Linux distributions with XFS support
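The difference between the two build modes can be modelled in a few lines. This is a user-space analogy only: make_assert and xfs_assert are hypothetical names, raising an exception stands in for the fatal behaviour of a debug build, and unlike a compiled-out kernel macro the condition is always evaluated here.

```python
import warnings


def make_assert(xfs_debug=False, xfs_warn=False):
    """Build an ASSERT-like checker for one (modelled) kernel configuration."""
    def xfs_assert(condition, message):
        if condition:
            return
        if xfs_debug:
            # Debug build: a failed check is fatal.
            raise AssertionError(message)
        if xfs_warn:
            # Warn build: report the failed check and keep running.
            warnings.warn(message, RuntimeWarning, stacklevel=2)
        # Neither option set: the check does nothing.
    return xfs_assert
```

For example, a checker built with make_assert(xfs_warn=True) reports a failed bounds check as a warning and lets the caller continue, while one built with make_assert(xfs_debug=True) stops at the first failure.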


Bounds Checking Enabled Kernel

• XFS with CONFIG_XFS_DEBUG

- A very effective aid for developers

- Weak points from a user perspective

. Significant overhead in production environments

. Changes the behavior of algorithms (such as allocation) to improve test coverage, e.g. xfs_alloc_ag_vextent_near()

. Intentionally panics the machine on non-fatal errors by design

• Only advisable for debugging purposes

Misc Changes

• Mount options

- The nodelaylog mode is removed; delaylog is the only mode (>= Linux 3.3)

- inode64 can be set on remount

- inode32 can be set on remount

• Speculative preallocation improvements

- Trim speculative preallocation near ENOSPC, near quota limits, and for sparse files

• Discontiguous buffers

- Virtually contiguous in the buffer, but non-contiguous on disk

Upcoming – Self Describing Metadata Preview


• XFS utilities for forensic analysis of the file system structures

- xfs_repair(8)

- xfs_db(8)

• Analyze the structure of 100TB to 1PB storage :(

• Primary concern for supporting PB-scale file systems

- Minimize the time and effort required for basic forensic analysis of the file system structures

Self Describing Metadata Preview

• Problems with the current metadata format

- The magic number is the only way to identify a metadata block

- AGFL, remote symlink and remote attribute blocks have no magic number


• Additional information needs to be recorded

- CRC32c validation

- Filesystem identifier

- The owner of a metadata block

- Logical sequence number (LSN) of the most recent transaction



• The typical on-disk structure

struct xfs_ondisk_hdr {
        __be32 magic;   /* magic number */
        __be32 crc;     /* CRC, not logged */
        uuid_t uuid;    /* filesystem identifier */
        __be64 owner;   /* parent object */
        __be64 blkno;   /* location on disk */
        __be64 lsn;     /* last modification in log, not logged */
};


• Additional information format

- According to the type of metadata blocks

• Runtime Validation

- Immediately after a successful read from disk

- Immediately prior to write IO submission
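The two validation points can be sketched against the header layout shown earlier (magic, crc, uuid, owner, blkno, lsn, all big-endian as on the slide). This is an illustration under assumptions, not the XFS code: the function names, the error strings and the convention of computing the CRC with the crc field zeroed are choices made here, and the owner check is omitted because it depends on the type of block.

```python
import struct

# magic, crc, uuid, owner, blkno, lsn (48 bytes, no padding)
HDR = struct.Struct(">II16sQQQ")
CRC_OFFSET = 4  # byte offset of the crc field within the header


def crc32c(data):
    """Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def block_crc(block):
    """CRC of the whole block with the crc field read as zero (assumption)."""
    return crc32c(block[:CRC_OFFSET] + b"\0" * 4 + block[CRC_OFFSET + 4:])


def make_block(magic, fs_uuid, owner, blkno, payload):
    """Build an unstamped block: header with crc=0 and lsn=0, then payload."""
    return HDR.pack(magic, 0, fs_uuid, owner, blkno, 0) + payload


def write_verify(block, lsn):
    """Just before write submission: stamp the LSN, then the CRC."""
    magic, _, fs_uuid, owner, blkno, _ = HDR.unpack_from(block)
    buf = bytearray(block)
    HDR.pack_into(buf, 0, magic, 0, fs_uuid, owner, blkno, lsn)
    struct.pack_into(">I", buf, CRC_OFFSET, block_crc(bytes(buf)))
    return bytes(buf)


def read_verify(block, magic, fs_uuid, blkno):
    """Right after a read: return None if valid, else the reason."""
    m, crc, u, _owner, b, _lsn = HDR.unpack_from(block)
    if m != magic:
        return "bad magic"
    if crc != block_crc(block):
        return "bad crc"
    if u != fs_uuid:
        return "wrong filesystem"
    if b != blkno:
        return "misplaced block"
    return None
```

Checking the block number catches a block that was written correctly but to the wrong place, and checking the filesystem identifier catches stale blocks left over from another file system, neither of which a checksum alone would detect.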



• Compatibility

- No forwards compatibility: old filesystems will not support the new disk format

- No backwards compatibility: old kernels and userspace will not be able to read the new format

- Kernels and userspace that support the new format will still work just fine with the old, non-CRC format

- Two different, incompatible disk formats must be supported from this point onwards

- No tools to convert the format of existing file systems?

Upcoming - Kernel

• Splitting project quota support from group quota support

- Currently the same quota inode is used for project and group quota

- Introduce a third quota inode for project quota so that both can be enabled at the same time

• Online shrink

- An initial patch set has already been posted to the mailing list; it depends on xfs_agstate(8) with kernel changes as well as xfs_reno(8)

- Challenge: moving internal log blocks

Upcoming – User space

• xfs_reno(8)

- Allows an inode64 filesystem to be converted to inode32

- The file system has to be mounted with inode32 in advance

• xfs_agstate(8)

- Allows an allocation group to be taken offline or brought back online

XFS's 20th birthday is coming in October :)

Thank you!


References

• http://xfs.org/index.php/XFS_status_update_for_2011

• http://xfs.org/index.php/XFS_status_update_for_2012

• http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html

• http://oss.sgi.com/archives/xfs/2013-04/msg00100.html

• http://lwn.net/Articles/476267/



• http://en.wikipedia.org/wiki/Hash_table


Acknowledgments

• Thanks to the following people, in alphabetical order, for reviewing this document and providing helpful comments: Ben Myers, Dave Chinner, Eric Sandeen, Mark Tinguely

• I would like to thank Christoph Hellwig for publishing XFS development status updates for every official Linux release from 2011 to 2012; those updates saved me a lot of time in following the progress over that period.
