
XFS In Rapid Development · Brian Foster, Carlos Maiolino, Chandra Seetharaman, Eric Sandeen, Jan Kara, Jeff Liu, Mark Tinguely, •



XFS In Rapid Development

Jeff Liu <[email protected]>


We have many requests to provide a supported option for the XFS file system on Oracle Linux – Oracle Linux Blog Feb 28, 2013


About This Talk

• Introduction

- About XFS

- XFS Development Community

• How Fast XFS Is Going

- Kernel changes (> Linux 3.0)

- User space Programs

- XFS Test Suite

• Upcoming Features

- Kernel and user space

- Preview of the self describing metadata


About XFS

• Full 64-bit journaling file system

• Well known for high performance and scalability

• Maximum filesystem size/file size: 16 EiB/8 EiB

• Variable block sizes: 512 bytes to 64 KB

• Freeze/Thaw to support volume level snapshot - xfs_freeze(8)

• Online filesystem/file defragmentation - xfs_fsr(8)

• Online filesystem resize – xfs_growfs(8)

• Internal log space/External log volume

• Realtime subvolume - provides very deterministic data rates suitable for media streaming applications


XFS Development Community

• Developers From Corporations

- SGI, Red Hat, Oracle, SUSE, IBM

• Main Contributors – In alphabetical order

Dave Chinner, Christoph Hellwig - Preeminent Individual Contributors

Brian Foster, Carlos Maiolino, Chandra Seetharaman, Eric Sandeen,

Jan Kara, Jeff Liu, Mark Tinguely, <leave the seat of honour open for you>

• Maintainer

Ben Myers @SGI

• Join us via Mailing list: [email protected] and IRC Channel: irc.freenode.net#xfs

• Newcomers are always welcome!


How Fast XFS Is Going

Code change statistics between Linux v3.0 and v3.10-rc1 (Jul 21 2011 to May 11 2013)

[Bar chart: Btrfs/Ext4 with JBD2/XFS, Linux v3.0 ~ v3.10-rc1. X-axis groups: files changed, insertions, deletions. Y-axis: the number of files changed, insertions and deletions (0 to 50000). Series: Ext4&JBD2, XFS, Btrfs.]

Data source: git diff --stat --minimal -C -M v3.0..v3.10-rc1 -- fs/[btrfs|xfs|ext4 with jbd2]


How Fast XFS Is Going

• XFS changes were made up of

- Improvements – performance/scalability improvements, code base refactoring

- New features – anything new

- Bug fixes

- Misc – trivial fix, code style adjustment, dead code cleanups


How Fast XFS Is Going

[Pie chart: the proportion of the XFS kernel changes between Linux 3.0 and Linux 3.10-rc1, based on the number of patches. Categories: Improvement, New feature, Bug fix, Misc.]


How Fast XFS Is Going

[Pie chart: the proportion of the XFS kernel changes between Linux 3.0 and Linux 3.10-rc1, based on the lines changed (+/-). Categories: Improvement, New feature, Bug fix, Misc.]


How Fast XFS Is Going

• Xfsprogs v3.1.6 ~ v3.1.11 (Oct 11 2011 ~ May 09 2013)

- 15 Contributors

- 106 patches

$ git diff --stat --minimal -C -M v3.1.6 v3.1.11 | grep changed

 108 files changed, 11113 insertions(+), 11418 deletions(-)


How Fast XFS Is Going

• XFS test suite - xfstests

- A generic test tool for Linux local filesystems

- 300+ test cases overall

- 170+ special test cases for XFS

• Test cases are well-organized for different filesystems

$ ls -l xfstests/tests/

btrfs/  ext4/  generic/  Makefile  shared/  udf/  xfs/  


Speedup Direct-IO R/W On High IOPS Devices

• XFS inode locking modes, e.g. shared/exclusive

- The naming convention is inherited from SGI IRIX

- Equivalent to the read/write lock modes on Linux

• Issues faced before Linux 3.2

- The exclusive lock range is too extensive

- Concurrent direct-IO reads are serialized on the page cache check

- Exclusive lock mode is used for direct-IO write by default


Speedup Direct-IO R/W On High IOPS Devices

• Solutions

- Use the shared lock for direct-IO reads; take the exclusive lock only if page invalidation is needed

- Use the shared lock for direct-IO writes by default; take the exclusive lock during IO submission only if extent allocation is required
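The locking rule in the two solution bullets can be written down as a small decision function. This is only a user-space sketch of the rule as stated on the slide, not the kernel code: the names `iolock_mode`, `dio_request`, and `dio_lock_mode` are hypothetical, and the real implementation makes this decision in several steps while the I/O is being set up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lock modes in the XFS naming: shared is like a read lock, exclusive like a write lock. */
enum iolock_mode { IOLOCK_SHARED, IOLOCK_EXCL };

/* Hypothetical description of one direct-IO request (illustration only). */
struct dio_request {
    bool is_write;          /* false: direct-IO read, true: direct-IO write */
    bool has_cached_pages;  /* page cache pages overlap the range, so invalidation is needed */
    bool needs_allocation;  /* the write would have to allocate extents */
};

/* Pick the lock mode following the two rules on the slide. */
enum iolock_mode dio_lock_mode(const struct dio_request *req)
{
    assert(req != NULL);

    if (!req->is_write)
        return req->has_cached_pages ? IOLOCK_EXCL : IOLOCK_SHARED;

    return req->needs_allocation ? IOLOCK_EXCL : IOLOCK_SHARED;
}
```

With this rule, concurrent reads and overwrites of already-allocated blocks all proceed under the shared mode, which is where the IOPS gain on fast devices comes from.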


Speedup Direct-IO R/W On High IOPS Devices

• FIO scenario: storage formatted with default options

- fio version 2.1

- direct=1, rw=randrw, bs=4k, size=10G, numjobs=10 (also run with 20, 40, 80), runtime=120, thread, ioengine=psync

• Simplified output of xfs_info(8)

Metadata: isize=256   agcount=4       agsize=937408 blks   sectsz=512

Data:     bsize=4096  blocks=3749632  sunit=0              swidth=0 blks

Log:      internal    bsize=4096      blocks=2560          version=2


Speedup Direct-IO R/W On High IOPS Devices

[Chart: XFS Read IOPS, SSD SATA3. Vanilla 3.7.0 vs 2.6.39 in delaylog mode. X-axis: threads (10, 20, 40, 80). Y-axis: input/output operations per second (0 to 14000). Series: 2.6.39, 3.7.0.]


Speedup Direct-IO R/W On High IOPS Devices

[Chart: XFS Write IOPS, SSD SATA3. Vanilla 3.7.0 vs 2.6.39 in delaylog mode. X-axis: threads (10, 20, 40, 80). Y-axis: input/output operations per second (0 to 14000). Series: 2.6.39, 3.7.0.]


Sync Story

• Improve concurrency for fsync(2) on files

- Unlock inode before the log force

• Optimizations for fsync(2) on directories

- Directories are only updated transactionally

- No file data needs to be flushed

- Does not have to flush disk caches except as part of a transaction commit

• Improved sync behavior in the face of aggressive dirtying

- XFS no longer writes data out itself twice per filesystem sync, which had overridden the livelock protection in the core writeback code path
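From user space, the directory case above is reached by calling fsync(2) on a file descriptor opened on the directory itself. The helper below is a minimal sketch of that call sequence, assuming Linux; the name `fsync_directory` is ours and does not come from the slides.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Ask the filesystem to make a directory's pending updates durable
 * (for example after creating or renaming a file in it).
 * Returns 0 on success, -1 on error with errno set.
 */
int fsync_directory(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    int saved_errno = errno;   /* close() must not clobber the fsync error */

    close(fd);
    errno = saved_errno;
    return rc;
}
```

Because directories are only updated through transactions, the filesystem can satisfy this call by forcing the log rather than writing back any file data.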


Sync Story

• The xfssyncd workqueue was removed; instead:

- New dedicated workqueue for inode reclaim

- New dedicated workqueue for log operations

- The sync work is now only the periodic log work, controlled by the xfssyncd_centisecs sysctl


Efficient Sparse File Handling

• SEEK_DATA/SEEK_HOLE options to lseek(2)

- Derived from Solaris ZFS

- Neater call interface than the FIEMAP ioctl(2)

• Use scenarios

- cp(1), GNU tar(1), etc.

- Virtual machine image (Xen, KVM) backup

- Sparse file detection
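A tool that copies or inspects a sparse file typically alternates SEEK_DATA and SEEK_HOLE to find each data segment and skip the holes between them. The function below is a minimal sketch of that loop, assuming Linux; the name `data_bytes_in_file` is hypothetical, and the fallback `#define` values are the Linux constants, used only if the libc headers do not expose them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only when the libc headers did not define them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Return the number of bytes that lie inside data segments of the file,
 * or -1 on error. A copy tool would copy [data, hole) where this sketch
 * only adds up the segment length.
 */
off_t data_bytes_in_file(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t total = 0;
    off_t pos = 0;

    for (;;) {
        /* Next offset >= pos that contains data; ENXIO means none is left. */
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0) {
            if (errno != ENXIO)
                total = -1;
            break;
        }

        /* End of this data segment (end of file counts as a hole). */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole <= data) {
            total = -1;
            break;
        }

        total += hole - data;
        pos = hole;
    }

    close(fd);
    return total;
}
```

On a filesystem without hole reporting, the whole file is reported as a single data segment, so the loop still terminates and simply returns the file size.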


Efficient Sparse File Handling

• Refinement for unwritten extents

• Create a sparse file with unwritten extents mixed with data and holes

#!/bin/bash

# Preallocate 10G of unwritten extents
xfs_io -F -f -c 'falloc 0 10G' /xfs/sparse

# Write 500m of data at offsets 0M, 30M, 60M, 90M and 120M
for i in $(seq 0 30 120); do
    offset=$((i * (1 << 20)))
    xfs_io -c "pwrite $offset 500m" /xfs/sparse
done


Efficient Sparse File Handling

• Layout of the created sparse file

$ filefrag -v sparse
Filesystem type is: 58465342
File size of sparse is 10737418240 (2621440 blocks, blocksize 4096)
 ext  logical  physical  expected   length  flags
   0        0  43547551             151040
   1   151040  43698591            1946111  unwritten
   2  2097151  43008572  45644702   524289  unwritten,eof
sparse: 2 extents found


Efficient Sparse File Handling

[Bar chart: with/without the unwritten extents refinement. Sparse file copy via xfstests/seek_copy_test on a laptop with a normal SATA disk. X-axis: Improved, Non-improved. Y-axis: time in seconds (0 to 140).]


Quota Improvements

• XFS disk quota supports

- User quota

- Group quota

- Project quota – per directory quota (limit disk quota per directory)


Quota Improvements

• Why does searching tens of thousands of in-memory dquots scale badly?

- User/Group/Project dquots are stored in a global hash table which is shared between file systems

• A hash table is O(n) in the worst case for search/insert/delete, while a radix tree is O(k) in the worst case for insertion and deletion, where k is the key length

• Solutions

- Replace the global hash tables with a per-filesystem radix tree

- Replace the global dquot LRU lists with per-filesystem lists

- Remove the global xfs_Gqm structure
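To make the O(k) bound concrete, here is a small fixed-depth radix tree keyed by a 32-bit quota id, with one tree instance per filesystem. It is an illustrative sketch only: the kernel uses its own generic radix tree implementation, and the names `quota_tree`, `quota_tree_insert`, and `quota_tree_lookup` are hypothetical. Node memory is never freed in this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RT_BITS   4                   /* key bits consumed per level */
#define RT_FANOUT (1u << RT_BITS)     /* 16 slots per node */
#define RT_LEVELS (32 / RT_BITS)      /* fixed depth: k = 8 steps for a 32-bit id */

struct rt_node {
    void *slot[RT_FANOUT];            /* child nodes, or items at the last level */
};

/* One tree per mounted filesystem instead of one shared global table. */
struct quota_tree {
    struct rt_node root;
};

/* Slot index for the given level, taking the most significant bits first. */
static unsigned rt_index(uint32_t id, int level)
{
    int shift = (RT_LEVELS - 1 - level) * RT_BITS;
    return (id >> shift) & (RT_FANOUT - 1);
}

/* Store item under id. Returns 0 on success, -1 if memory runs out. */
int quota_tree_insert(struct quota_tree *t, uint32_t id, void *item)
{
    assert(t != NULL);
    struct rt_node *n = &t->root;

    for (int level = 0; level < RT_LEVELS - 1; level++) {
        unsigned i = rt_index(id, level);
        if (n->slot[i] == NULL) {
            n->slot[i] = calloc(1, sizeof(struct rt_node));
            if (n->slot[i] == NULL)
                return -1;
        }
        n = n->slot[i];
    }
    n->slot[rt_index(id, RT_LEVELS - 1)] = item;
    return 0;
}

/* Return the item stored under id, or NULL. Always at most RT_LEVELS steps. */
void *quota_tree_lookup(const struct quota_tree *t, uint32_t id)
{
    assert(t != NULL);
    const struct rt_node *n = &t->root;

    for (int level = 0; level < RT_LEVELS - 1; level++) {
        n = n->slot[rt_index(id, level)];
        if (n == NULL)
            return NULL;
    }
    return n->slot[rt_index(id, RT_LEVELS - 1)];
}
```

The lookup cost depends only on the key width, not on how many dquots exist, and two filesystems never contend on the same structure.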


Fighting With Process 8K Stack Space Limitation

• 8K process stack space for x86_64 in Linux 2.6 by default

- Every process has a dedicated kernel stack

- Kernel stacks have a fixed size and cannot be expanded as required

- Kernel stacks cannot be swapped

• Extreme stack use in the Linux VM/VFS call chain

• The old problems for XFS

- Significant issues with the amount of stack that allocation in XFS uses, especially in the memory reclaim situations (writeback path)


Fighting With Process 8K Stack Space Limitation

• Buffer cache miss that triggers I/O vs CPU cache miss

• Solution

- Reduce stack usage in the allocation call chain, e.g. delayed allocation

- Move all allocations to a new worker thread, combined with a completion

- Avoid the context switch overhead if an allocation request comes in with a large stack


Bounds Checking Enabled XFS Kernel

• Alternative CONFIG_XFS_WARN Support

- Depends on XFS_FS && !XFS_DEBUG

- Converts ASSERT() checks to WARN_ON(1)

- Does not modify algorithms

- Does not cause kernel to panic on non-fatal errors

- Makes strange "out of bounds" problems easier to find

- Already turned on in Fedora kernel-debug packages

• Suggestion: enable this feature in other Linux distributions with XFS support


Bounds Checking Enabled Kernel

• XFS with CONFIG_XFS_DEBUG

- A very effective aid for developers

- Weak points from a user perspective

. Significant overhead in a production environment

. Changes the behavior of algorithms (such as allocation) to improve the test coverage, e.g. xfs_alloc_ag_vextent_near()

. Intentionally panics the machine on non-fatal errors by design

• Only advisable for debugging purposes
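The difference between the two configurations comes down to what a failed ASSERT() does. The following user-space analogue is a sketch under stated assumptions, not the kernel macros: `CHECK_MODE`, `check_failed`, `warn_count`, and `clamp_index` are hypothetical names, `abort()` stands in for a kernel panic, and a message on stderr stands in for WARN_ON(1).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MODE_OFF   0   /* production build: checks compiled out */
#define MODE_WARN  1   /* like CONFIG_XFS_WARN: report the failure and continue */
#define MODE_DEBUG 2   /* like CONFIG_XFS_DEBUG: stop on the first failure */

#ifndef CHECK_MODE
#define CHECK_MODE MODE_WARN
#endif

static_assert(CHECK_MODE >= MODE_OFF && CHECK_MODE <= MODE_DEBUG,
              "unknown CHECK_MODE");

static int warn_count;   /* number of failed checks reported so far */

static void check_failed(const char *expr, const char *file, int line)
{
    fprintf(stderr, "Assertion failed: %s, file: %s, line: %d\n",
            expr, file, line);
    warn_count++;
}

#if CHECK_MODE == MODE_DEBUG
#define ASSERT(expr) \
    ((expr) ? (void)0 : (check_failed(#expr, __FILE__, __LINE__), abort()))
#elif CHECK_MODE == MODE_WARN
#define ASSERT(expr) \
    ((expr) ? (void)0 : check_failed(#expr, __FILE__, __LINE__))
#else
#define ASSERT(expr) ((void)0)
#endif

/* The algorithm is identical in every mode; only the reporting differs. */
int clamp_index(int idx, int len)
{
    ASSERT(idx >= 0 && idx < len);
    if (idx < 0)
        return 0;
    if (idx >= len)
        return len - 1;
    return idx;
}
```

In the warning mode an out-of-bounds index is logged once and the caller still gets a usable result, which is what makes this mode acceptable on machines that must keep running.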


Misc Changes

• Mount options

- The nodelaylog option is removed; delaylog mode is always used (>= Linux 3.3)

- inode64 can be set on remount

- inode32 can be set on remount

• Speculative preallocation improvements

- Trimming the speculative preallocation near ENOSPC/quota limits/sparse file

• Discontiguous buffers

- Virtually contiguous in the buffers, but non-contiguous on disk


Upcoming – Self Describing Metadata Preview


Upcoming – Self Describing Metadata Preview

• XFS utilities for forensic analysis of the file system structures

- xfs_repair(8)

- xfs_db(8)

• Analyze the structure of 100TB to 1PB storage :(

• Primary concern for supporting PB scale file system

- Minimize the time and effort required for basic forensic analysis of the file system structures


Self Describing Metadata Preview

• Problems with the current metadata format

- A magic number is the only way to identify a metadata block

- AGFL, remote symlink and remote attribute blocks have no identifying magic number


Self Describing Metadata Preview

• Additional information needs to be recorded

- CRC32c validation

- Filesystem identifier

- The owner of a metadata block

- Log sequence number (LSN) of the most recent transaction


Self Describing Metadata Preview

• The typical on-disk structure

struct xfs_ondisk_hdr {
    __be32 magic;    /* magic number */
    __be32 crc;      /* CRC, not logged */
    uuid_t uuid;     /* filesystem identifier */
    __be64 owner;    /* parent object */
    __be64 blkno;    /* location on disk */
    __be64 lsn;      /* last modification in log, not logged */
};


Self Describing Metadata Preview

• Additional information format

- According to the type of metadata blocks

• Runtime Validation

- Immediately after a successful read from disk

- Immediately prior to write IO submission
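The two validation points can be sketched in user space as a pair of helpers: one stamps the checksum just before a block is written, the other checks the block right after it is read. This is a simplified illustration, not the kernel code: `demo_hdr`, `HDR_MAGIC`, `BLOCK_SIZE`, `block_stamp`, and `block_verify` are hypothetical, the header keeps fields in host byte order and omits the uuid and lsn, and the checksum is a plain bitwise CRC32c computed with the crc field zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 512u
#define HDR_MAGIC  0x44454D4Fu   /* hypothetical magic number for this sketch */

/* Simplified header: host byte order, no uuid or lsn. */
struct demo_hdr {
    uint32_t magic;
    uint32_t crc;     /* CRC32c of the block, computed with this field zeroed */
    uint64_t owner;
    uint64_t blkno;   /* where the block is supposed to live */
};

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    assert(data != NULL);
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Immediately prior to write submission: compute and store the checksum. */
void block_stamp(uint8_t *block)
{
    struct demo_hdr hdr;

    memcpy(&hdr, block, sizeof hdr);
    hdr.crc = 0;
    memcpy(block, &hdr, sizeof hdr);
    hdr.crc = crc32c(block, BLOCK_SIZE);
    memcpy(block, &hdr, sizeof hdr);
}

/* Immediately after a successful read: 0 if the block is valid, -1 otherwise. */
int block_verify(const uint8_t *block, uint64_t expected_blkno)
{
    uint8_t copy[BLOCK_SIZE];
    struct demo_hdr hdr;

    memcpy(copy, block, BLOCK_SIZE);
    memcpy(&hdr, copy, sizeof hdr);
    if (hdr.magic != HDR_MAGIC || hdr.blkno != expected_blkno)
        return -1;   /* wrong type of block, or a misdirected write */

    uint32_t stored = hdr.crc;
    hdr.crc = 0;
    memcpy(copy, &hdr, sizeof hdr);
    return crc32c(copy, BLOCK_SIZE) == stored ? 0 : -1;
}
```

Checking the block number as well as the checksum is what lets a reader detect a block that is internally consistent but was written to the wrong location.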


Self Describing Metadata Preview

• Compatibility

- No forwards compatibility: old filesystems will not support the new disk format

- No backwards compatibility: old kernels and userspace will not be able to read the new format

- Kernel and userspace that support the new format will still work just fine with the old format, which has no CRC checks

- Two different, incompatible disk formats are supported from this point onwards

- Tools to convert the format of an existing file system will not be provided?


Upcoming - Kernel

• Splitting project quota support from group quota support

- Same quota inode is used for project and group quota

- Introduce a third quota inode for project quota so that group and project quota can be enabled at the same time

• Online shrink

- An initial patch set has already been posted to the mailing list; it depends on xfs_agstate(8) with kernel changes as well as xfs_reno(8)

- Challenge: moving the internal log blocks


Upcoming – User space

• xfs_reno(8)

- Allows an inode64 filesystem to be converted to inode32

- The file system has to be mounted with inode32 in advance

• xfs_agstate(8)

- Allows an allocation group to be taken offline or brought back online


XFS's 20th birthday is coming in October :)

Thank you!


References

• http://xfs.org/index.php/XFS_status_update_for_2011

• http://xfs.org/index.php/XFS_status_update_for_2012

• http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html

• http://oss.sgi.com/archives/xfs/2013-04/msg00100.html

• http://lwn.net/Articles/476267/

• http://lwn.net/Articles/476263/

• http://lwn.net/Articles/84583/

• http://en.wikipedia.org/wiki/Hash_table


Acknowledgments

• Thanks to the following people, in alphabetical order, for reviewing this document and providing helpful comments: Ben Myers, Dave Chinner, Eric Sandeen, Mark Tinguely

• I would like to thank Christoph Hellwig for publishing the XFS development status for every official Linux release between 2011 and 2012; those updates saved me a lot of time in following the progress over that period.