Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
2
We have many requests to provide a supported option for the XFS file system on Oracle Linux – Oracle Linux Blog Feb 28, 2013
3
About This Talk
• Introduction
- About XFS
- XFS Development Community
• How Fast XFS Is Going
- Kernel changes (> Linux 3.0)
- User space Programs
- XFS Test Suite
• Upcoming Features
- Kernel and user space
- Preview of the self describing metadata
4
About XFS
• Full 64-bit journaling file system
• Well-known for high-performance and scalability
• Maximum filesystem size/file size: 16 EiB/8EiB
• Variable blocks sizes: 512 bytes to 64 KB
• Freeze/Thaw to support volume level snapshot - xfs_freeze(8)
• Online filesystem/file defragmentation - xfs_fsr(8)
• Online filesystem resize – xfs_growfs(8)
• Internal log space/External log volume
• Realtime subvolume - Provide very deterministic data rates suitable for media streaming applications
5
XFS Development Community
• Developers From Corporations
- SGI, Redhat, Oracle, SuSE, IBM
• Main Contributors – In alphabetical order
Dave Chinner, Christoph Hellwig - Preeminent Individual Contributors
Brian Foster, Carlos Maiolino, Chandra Seetharaman, Eric Sandeen,
Jan Kara, Jeff Liu, Mark Tinguely, <leave the seat of honour open for you>
• Maintainer
Ben Myers @SGI
• Join us via Mailing list: [email protected] and IRC Channel: irc.freenode.net#xfs
• Newcomers are always welcome!
6
How Fast XFS Is Going
The statistics of code changes between Linux v3.0 - v3.10-rc1 (Jul 21 2011 - May 11 2013)
Btrfs/Ext4 with JBD2/XFS
Files changed Insertions Deletions0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
git diff --stat --minimal -C -M v3.0..v3.10-rc1 -- fs/[btrfs|xfs|ext4 with jbd2]
Ext4&JBD2
XFS
Btrfs
Linux v3.0 ~ v3.10-rc1
Th
e n
um
be
r o
f fil
es
cha
ng
ed
, in
sert
ion
s a
nd
de
letio
ns
7
How Fast XFS Is Going
• XFS changes were made up of
- Improvements – performance/scalability improvements, code base refactoring
- New features – anything new
- Bug fixes
- Misc – trivial fix, code style adjustment, dead code cleanups
8
How Fast XFS Is Going
The proportion of the XFS kernel changes between Linux 3.0 to Linux 3.10-rc1
Based on the number of Patches
Improvement
New feature
Bug fix
Misc
9
How Fast XFS Is Going
The proportion of the XFS kernel changes between Linux 3.0 to Linux 3.10-rc1
Based on the lines (+/-)
ImprovementNew featureBug fixMisc
10
How Fast XFS Is Going
• Xfsprogs v3.1.6 ~ v3.1.11 (Oct 11 2011 ~ May 09 2013)
- 15 Contributors
- 106 patches
$ git diff stat minimal C M v3.1.6 v3.1.11 |grep changed
108 files changed, 11113 insertions(+), 11418 deletions()
11
How Fast XFS Is Going
• XFS test suite - xfstests
- A generic test tool for Linux local filesystems
- 300+ test cases overall
- 170+ special test cases for XFS
• Test cases are well-organized for different filesystems
$ ls l xfstests/tests/
btrfs/ ext4/ generic/ Makefile shared/ udf/ xfs/
12
Speedup Direct-IO R/W On High IOPS Devices
• XFS inode locking modes, e.g. shared/exclusive
- The name convention is inherited from SGI IRIX
- Equivalent is the read/write modes on Linux
• Issues faced before Linux 3.2
- Exclusive lock range is too extensive
- Concurrent direct-IO reads are serialized on page cache check up
- Exclusive lock mode is used for direct-IO write by default
13
Speedup Direct-IO R/W On High IOPS Devices
• Solutions
- Use shared lock for direct-IO read, take the exclusive mode if the page invalidation is needed
- Use shared lock for direct-IO writes by default, take the exclusive lock during IO submission if extent allocation is required
14
FIO Scenario Storage formated with default options
Fio version 2.1
Direct=1 rw=randrw bs=4k size=10G Numjobs=10 #[20,40,80] Runtime=120 Thread ioengine=psync
Simplified output of xfs_info(8)
Metadata: isize=256 agcount=4 agsize=937408 blks sectsz=512
Data: bsize=4096 blocks=3749632 sunit=0 swidth=0 blks
Log: internal bsize=4096 blocks=2560 version=2
Speedup Direct-IO R/W On High IOPS Devices
15
10 20 40 800
2000
4000
6000
8000
10000
12000
14000
XFS Read IOPS, SSD SATA3
Vanilla 3.7.0 vs 2.6.39 in delaylog mode
2.6.39
3.7.0
Threads
Inpu
t/O
utpu
t op
erat
ions
per
sec
ond
Speedup Direct-IO R/W On High IOPS Devices
16
10 20 40 800
2000
4000
6000
8000
10000
12000
14000
XFS Write IOPS, SSD SATA3
Vanilla 3.7.0 vs 2.6.39 in delaylog mode
2.6.39
3.7.0
Threads
Inp
ut/O
utp
ut o
pe
ratio
ns
pe
r se
con
d
Speedup Direct-IO R/W On High IOPS Devices
17
Sync Story
• Improve concurrency for fsync(2) on files
- Unlock inode before the log force
• Optimizations for fsync(2) on directories
- Directories are only updated transactionally
- No file data need to flush
- Does not have to flush disk caches except as part of a transaction commit
• Improved sync behavior in the face of aggressive dirtying
- Writes data out itself two times per filesystem sync that overriding the livelock protection in the core writeback code path
18
Sync Story
• Xfssyncd workqueue was removed, Instead
- New dedicated workqueue for inode reclaim
- New dedicated workqueue for log operation
- Now the sync work is periodic log work only for xfsyncd_centisecs sysctl
19
Efficient Sparse File Handing
• SEEK_DATA/SEEK_HOLE options to lseek(2)
- Derive from Solaris ZFS
- Neater call interface than FIEMAP ioctl(2)
• Use scenarios
- cp(1), GNU tar(1), etc...
- Virtual image(XEN, KVM) backup
- Sparse file detection
20
Efficient Sparse File Handing
• Refinement for unwritten extents
• Create a sparse file with unwritten extents mixed with data and holes
#!/bin/bash
xfs_io F f 'c falloc 0 10G' /xfs/sparse
for i in $(seq 0 30 120); dooffset=$(($i * $((1 << 20))))xfs_io "c pwrite $offset 500m" /xfs/sparse
done
21
Efficient Sparse File Handing
• Layout of the created sparse file
$ filefrag v sparse Filesystem type is: 58465342File size of sparse is 10737418240 (2621440 blocks, blocksize 4096) ext logical physical expected length flags 0 0 43547551 151040 1 151040 43698591 1946111 unwritten 2 2097151 43008572 45644702 524289 unwritten,eofsparse: 2 extents found
22
Efficient Sparse File Handing
Improved Non-improved0
20
40
60
80
100
120
140
With/Without unwritten extents refinement
Sparse file copy via xfstests/seek_copy_test on laptop with normal SATA diskT
ime
in s
econ
ds
23
Quota Improvements
• XFS disk quota supports
- User quota
- Group quota
- Project quota – per directory quota (limit disk quota per directory)
24
Quota Improvements
• Bad scalability for tens thousands of in-memory dquot searching, why?
- User/Group/Project dquots are stored at a global hash table which is shared between file systems
• Hash table at worst O(n) search/insert/delete while Radix tree at worst O(k) on insertion and deletion
• Solutions
- Replace global hash tables with per-filesystem radix tree
- Replace global dquot lru lists with per-filesystems
- Remove the global xfs_Gqm structure
25
Fighting With Process 8K Stack Space Limitation
• 8K process stack space for x86_64 in Linux 2.6 by default
- Every process has a dedicated kernel stack
- Kernel stacks are a fixed size, can not be expanded as required
- Can not be swapped
• Extreme stack use in the Linux VM/VFS call chain
• The old problems for XFS
- Significant issues with the amount of stack that allocation in XFS uses, especially in the memory reclaim situations (writeback path)
26
Fighting With Process 8K Stack Space Limitation
• Buffer cache miss that triggers I/O vs CPU cache miss
• Solution
- Alleviate stack allocation in allocation call chain, e.g. Delayed allocation
- Move all allocations to a new worker thread combine with a completion
- Avoid context switch overhead if an allocation request is comes in with
large stack
27
Bounds Checking Enabled XFS Kernel
• Alternative CONFIG_XFS_WARN Support
- Depends on XFS_FS && !XFS_DEBUG
- Converts ASSERT() checks to WARN_ON(1)
- Does not modify algorithms
- Does not cause kernel to panic on non-fatal errors
- Allow to find strange "out of bounds" problems more easily
- Already turned on Fedora kernel-debug packages
• Suggest applying this feature for other Linux distributions with XFS support
28
Bounds Checking Enabled Kernel
• XFS with CONFIG_XFS_DEBUG
- Very efficient buddy for developers
- Weak points from a user perspective
. Significant overhead in production environment
. Change the behavior of algorithms(such as allocation) to improve the test coverage, e.g. xfs_alloc_ag_vextent_near()
. Would intentionally panic the machine on non-fatal errors by design
• Only advisable to use for debugging purpose
29
Misc Changes
• Mount options
- Nodelaylog mode is removed, using delaylog mode by default ( >= Linux 3.3)
- Inode64 re-mountable
- Inode32 re-mountable
• Speculative preallocation improvements
- Trimming the speculative preallocation near ENOSPC/quota limits/sparse file
• Discontiguous buffers
- Virtually contiguous in the buffers, but non-contiguous on disk
30
Upcoming – Self Describing Metadata Preview
31
Upcoming – Self Describing Metadata Preview
• XFS utilities for forensic analysis of the file system structures
- xfs_repair(8)
- xfs_db(8)
• Analyze the structure of 100TB to 1PB storage :(
• Primary concern for supporting PB scale file system
- Minimize the time and effort required for basic forensic analysis of the file system structures
32
Self Describing Metadata Preview
• Problems with the current metadata format
- Magic number is the only way
- Lack of magic number identifying in AGFL, remote symlinks and remote attribute blocks
33
Self Describing Metadata Preview
• Additional information need to be recored
- CRC32c validation
- Filesystem identifier
- The owner of a metadata block
- Logical sequence number (LSN) of the most recent transaction
34
Self Describing Metadata Preview
• The typical on-disk structure●
struct xfs_ondisk_hdr { __be32 magic; /* magic number */ __be32 crc; /* CRC, not logged */ uuid_t uuid; /* filesystem identifier */ __be64 owner; /* parent object */ __be64 blkno; /* location on disk */ __be64 lsn; /* last modification in log, not logged */};
35
Self Describing Metadata Preview
• Additional information format
- According to the type of metadata blocks
• Runtime Validation
- Immediately after a successful read from disk
- Immediately prior to write IO submission
36
Self Describing Metadata Preview
• Compatibilities
- No forwards compatibility, old filesystem will not support the new disk format
- No backwards compatibility, old kernels and userspace will not be able to read the new format
- Kernel and userspace that support the new format will still work just fine with the old, Non-CRC check enabled format
- Support two different incompatible disk formats from this point onwards
- Will not provide tools to convert the format of existing file system?
37
Upcoming - Kernel
• Splitting project quota support from group quota support
- Same quota inode is used for project and group quota
- Introduce the 3rd quota inode for project quota so that they can be enabled
at the same time
• Online shrink
- Initial patch set has already been posted to mailing list, dependent on xfs_agstate(8) with kernel changes as well as xfs_reno(8)
- Challenge of internal log blocks moving
38
Upcoming – User space
• xfs_reno(8)
- Allows an inode64 filesystem to be converted to inode32
- The file system has to be mounted with inode32 in advance
• xfs_agstate(8)
- Allow to turn an allocation group offline or back to online
39
XFS's 20th birthday is coming in October :)
Thank you!
40
References
• http://xfs.org/index.php/XFS_status_update_for_2011
• http://xfs.org/index.php/XFS_status_update_for_2012
• http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html
• http://oss.sgi.com/archives/xfs/2013-04/msg00100.html
• http://lwn.net/Articles/476267/
• http://lwn.net/Articles/476263/
• http://lwn.net/Articles/84583/
• http://en.wikipedia.org/wiki/Hash_table
41
Acknowledgments
• Thanks you guys for reviewing this document with nice comments in alphabetical order: Ben Myers, Dave Chinner, Eric Sandeen, Mark Tinguely
• I would like to thank Christoph Hellwig for updating the XFS development status per every Linux official release between 2011 to 2012 as those updates saved me a lot of time to see the progress in that period.