ECE5658, Operating Systems Design, Fall 2019, Dongkun Shin ([email protected])
File Systems – F2FS
Dongkun Shin ([email protected])
Embedded Software Laboratory
Sungkyunkwan University
http://nyx.skku.ac.kr/
Log-Structured File System
• Assume the whole disk space as a big contiguous area
• Write all data sequentially
– An application's random I/O is converted to sequential I/O by LFS
• Frequent metadata updates are a key challenge in LFS
• Recovers quickly using checkpoints
Log-Structured File System
• Garbage Collection (Cleaner)
– Reuse of segments while writing
– A key challenge in LFS with snapshot
Critical Issues of LFS
• Wandering tree problem
• High cleaning overhead
LFS cleaning
• To make free segments
– The LFS cleaner copies valid blocks to another free segment
• Victim segment selection
– Utilization: how much is to be gained by cleaning
– Age: how likely the segment is to change soon anyway
• On-demand cleaning
– Overall performance decreases
• Background cleaning
– It does not affect the performance
[Figure: valid blocks A, B, C, D are copied from victim segments into a free segment (Segments 1 to 3)]
Hole-plugging
• Matthews et al. employed hole-plugging in LFS [Matthews et al., ACM OSR ’97]
• The cleaner copies valid blocks into the holes of other segments
• Incurs random writes
[Figure: valid blocks A, B, C, D of a victim segment are copied into holes in other segments (Segments 1 to 3)]
Slack Space Recycling (SSR)
• Directly recycles slack space to avoid on-demand cleaning
• Slack space is the invalid area in a used segment
• Incurs random writes
[Figure: SSR writes new blocks E, F, G, H from the segment buffer into the slack space of used segments (Segments 1 to 3)]
F2FS
• A mainlined filesystem (since kernel 3.8)
• Article: An f2fs teardown
– http://lwn.net/Articles/518988/
• Flash-Friendly File System
– Log-structured approach
– Various parameters for adjusting to the geometry of flash memory
• Developed by Samsung
https://www.sammobile.com/news/galaxy-note-10-uses-f2fs-not-ext4-file-system-whats-the-difference/
F2FS Features
• Flash Awareness
– Log-structured approach
– Adjust to the geometry of flash memory
– Align FS data structures to the FTL operation units.
F2FS Features
• Solved wandering tree problem
– Fixed Location Area and New Indexing Scheme (Over-writable)
– Avoiding Metadata Update Propagation
F2FS Features
• Solved High Cleaning Overhead of LFS
• Multi-head Logs and Hot/Cold Data Separation
– Write-time data separation → more chances to get a bimodal distribution of segment utilization
– Two different victim selection policies for foreground and background cleaning
• Automatic background cleaning
• Adaptive Write Policy for High Utilization
– Switches write policy to threaded logging at the right time (logging to FTL overprovision space)
– Graceful performance degradation at high utilization
File Structure
Disk Layout
• Superblock
– Basic information of the file system
– Disk layout parameters
– Pointer to valid check point
– 1 copy block
• Block (4KB)
• Segment (2MB)
– 512 blocks

             Random Write Area             Sequential Write Area
Over-write   Allow over-writes             No over-writes
Update       Checkpoint op                 File op / Checkpoint op
Area         Checkpoint Area,              Main Area
             Node Address Table,           - File Contents ( + Directory )
             Segment Information Table,    - File Nodes
             Segment Summary Area
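The block and segment sizes above imply simple arithmetic for locating a block inside the Main Area. The following is a minimal sketch only: `main_blkaddr` (the first block of the Main Area) and both helper names are hypothetical and are not taken from the F2FS source.

```c
#include <stdint.h>

#define BLOCK_SIZE      4096u                          /* 4KB block */
#define SEGMENT_SIZE    (2u * 1024u * 1024u)           /* 2MB segment */
#define BLOCKS_PER_SEG  (SEGMENT_SIZE / BLOCK_SIZE)    /* 512 blocks */

/* Segment number that holds blkaddr. main_blkaddr is a hypothetical
 * parameter naming the first block of the Main Area. */
static uint32_t blk_to_segno(uint32_t blkaddr, uint32_t main_blkaddr)
{
    return (blkaddr - main_blkaddr) / BLOCKS_PER_SEG;
}

/* Offset of blkaddr inside its segment (0..511). */
static uint32_t blk_to_offset(uint32_t blkaddr, uint32_t main_blkaddr)
{
    return (blkaddr - main_blkaddr) % BLOCKS_PER_SEG;
}
```

Both helpers assume `blkaddr >= main_blkaddr`; addresses in the Random Write Area are outside their scope.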
Disk Layout
• Sections (collection of segments; the number is configurable as a power of 2)
– Cleaning unit: one section at a time
– Aligned to the zone size
• Six open sections
– Hot / Warm / Cold for nodes and data
– File contents (data) are kept separate from indexing information (nodes)
– Hotness is based on various heuristics
• Zones: collection of sections (2MB)
– Keep the six open sections in different zones.
Segment Type
[Figure: segment types reachable from the NAT. Node blocks: Dir Inode, File Inode, Direct Node, Indirect Node. Data blocks: Directory Data, File Data (e.g. .jpg), small size write. Each type is labeled HOT, WARM, or COLD.]
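The temperature labels can be read as a classification rule applied at write time. The sketch below assumes the classification described in the F2FS design paper (directory blocks are hot, regular file blocks are warm, indirect nodes and multimedia or cleaner-moved data are cold). The struct, its fields, and the function name are hypothetical, and the small-write rule from later kernels is not modeled.

```c
#include <stdbool.h>

enum log_type {
    HOT_NODE, WARM_NODE, COLD_NODE,
    HOT_DATA, WARM_DATA, COLD_DATA
};

/* Hypothetical description of a block about to be written. */
struct block_info {
    bool is_node;            /* node block (index) vs. data block */
    bool is_indirect;        /* indirect node block */
    bool belongs_to_dir;     /* owned by a directory */
    bool cold_file;          /* e.g. multimedia extension such as .jpg */
    bool moved_by_cleaning;  /* block is being relocated by the cleaner */
};

/* Pick one of the six logs for a block. */
static enum log_type classify(const struct block_info *b)
{
    if (b->is_node) {
        if (b->is_indirect)
            return COLD_NODE;
        return b->belongs_to_dir ? HOT_NODE : WARM_NODE;
    }
    if (b->belongs_to_dir)
        return HOT_DATA;
    if (b->cold_file || b->moved_by_cleaning)
        return COLD_DATA;
    return WARM_DATA;
}
```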
Random Write Area (Metadata)
• Checkpoint (CP)
– File system information at the moment of the checkpoint
– Bitmaps for valid NAT/SIT sets, inode lists, summary entries of current active segments
• Node Address Table (NAT)
– Block address table for all the node blocks stored in the main area
– Inode number, pointer to block address of node block
• Segment Information Table (SIT)
– Segment information such as the valid block count and a bitmap for the validity of all the blocks
• Segment Summary Area (SSA)
– Summary entries which contain the owner information (back pointers to NAT-id, inode) of all the data and node blocks
Node Address Table (NAT)
• NAT structure
struct f2fs_nat_entry {
	__u8 version;		/* latest version of cached nat entry */
	__le32 ino;		/* inode number */
	__le32 block_addr;	/* block address */
} __packed;

struct f2fs_nat_block {
	struct f2fs_nat_entry entries[NAT_ENTRY_PER_BLOCK];
} __packed;
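To illustrate how the NAT translates a node id into a block address, here is a userspace sketch under stated assumptions: `nat_lookup` and `nat_update` are hypothetical helpers, the kernel types are replaced by `<stdint.h>` types, and endianness conversion is ignored. It is not the kernel's implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define F2FS_BLKSIZE 4096

/* Userspace stand-in for the packed on-disk entry shown above
 * (host byte order, no __le32 conversion). */
struct nat_entry {
    uint8_t  version;
    uint32_t ino;
    uint32_t block_addr;
} __attribute__((packed));

/* 4096 / 9 bytes per entry = 455 entries per NAT block. */
#define NAT_ENTRY_PER_BLOCK (F2FS_BLKSIZE / sizeof(struct nat_entry))

struct nat_block {
    struct nat_entry entries[NAT_ENTRY_PER_BLOCK];
};

/* Translate a node id into the current block address of that node block. */
static uint32_t nat_lookup(const struct nat_block *nat, uint32_t nid)
{
    return nat[nid / NAT_ENTRY_PER_BLOCK]
               .entries[nid % NAT_ENTRY_PER_BLOCK].block_addr;
}

/* When a node block is rewritten elsewhere in the log, only its NAT
 * entry changes; the blocks that refer to it by nid stay untouched. */
static void nat_update(struct nat_block *nat, uint32_t nid, uint32_t new_addr)
{
    nat[nid / NAT_ENTRY_PER_BLOCK]
        .entries[nid % NAT_ENTRY_PER_BLOCK].block_addr = new_addr;
}
```

Because parents store a node id rather than a block address, relocating a node block needs only one `nat_update` call, which is how the indirection avoids propagating updates up the index tree.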
Segment summary area (SSA)
• Summary block
– Has the information of one segment
– 512 (=ENTRIES_IN_SUM) summary entries
• Summary entry
– node id, node version, offset in node
– Data block: nid of its direct node
– Node block: nid of the node
– Referred to during cleaning and crash recovery
• Spare area
– Journal for NAT/SIT entries
struct f2fs_summary_block {
	struct f2fs_summary entries[ENTRIES_IN_SUM];
	struct f2fs_journal journal;
	struct summary_footer footer;
} __packed;
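The summary entry fields (node id, node version, offset in node) are what let the cleaner decide whether a data block is still live. The following sketch uses a deliberately tiny, hypothetical in-memory model (`struct fs_model`, `data_block_is_alive`, and the small array sizes are all assumptions) to show the cross-reference check.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADDRS_PER_NODE 8    /* illustrative; real node blocks hold far more */
#define MAX_NODES      16   /* illustrative model size */

/* Simplified summary entry: owner of one block in a segment. */
struct summary {
    uint32_t nid;           /* node id of the parent node */
    uint8_t  version;       /* node version when the block was written */
    uint16_t ofs_in_node;   /* slot inside the parent node */
};

struct node_block {
    uint32_t addr[ADDRS_PER_NODE];  /* block addresses of children */
};

/* Hypothetical in-memory model: NAT versions and node blocks by nid. */
struct fs_model {
    uint8_t nat_version[MAX_NODES];
    struct node_block nodes[MAX_NODES];
};

/* A data block is live only if its parent node still points at it. */
static bool data_block_is_alive(const struct fs_model *fs,
                                const struct summary *sum, uint32_t blkaddr)
{
    if (sum->nid >= MAX_NODES || sum->ofs_in_node >= ADDRS_PER_NODE)
        return false;
    if (fs->nat_version[sum->nid] != sum->version)
        return false;   /* the node id was reused since the write */
    return fs->nodes[sum->nid].addr[sum->ofs_in_node] == blkaddr;
}
```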
Cleaning
• Cleaning process
– Reclaim obsolete data scattered across the whole storage to make new empty log space
– Get victim segments by referencing the segment usage table
– Load the parent index structures of the data in the victim segment, identified from segment summary blocks
– Move valid data after checking their cross-references
• Goal
– Hide cleaning latencies from users
– Reduce the amount of valid data to be moved
– Move data quickly
• Issues
– Hot and cold data separation
– Victim selection policy
Cleaning
• Six active logs for static hot and cold data separation
• Supports a background cleaning process using a kernel thread
• Background Cleaning (BG Cleaning)
– Triggers the cleaning job when the system is idle
– Cost-benefit policy
• Foreground Cleaning (FG Cleaning)
– Triggered when there are not enough free segments to serve VFS calls
– Greedy policy

Policy                     Cost-benefit                       Greedy
Selecting victim segment   According to the number of valid   Having the smallest number
                           blocks and the segment age         of valid blocks
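The two victim selection policies can be sketched as follows. The cost-benefit score here is the classic LFS formula, (1 - u) × age / (1 + u), where u is segment utilization; it is used as an assumption for illustration, and the exact computation in F2FS may differ. `struct seg_info` and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCKS_PER_SEG 512

/* Hypothetical per-segment information (what the SIT would provide). */
struct seg_info {
    uint32_t valid_blocks;  /* number of valid blocks in the segment */
    uint64_t mtime;         /* last modification time */
};

/* Greedy: the segment with the fewest valid blocks. Requires n >= 1. */
static size_t pick_greedy(const struct seg_info *segs, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (segs[i].valid_blocks < segs[best].valid_blocks)
            best = i;
    return best;
}

/* Classic LFS cost-benefit score: higher is a better victim. */
static double cost_benefit(const struct seg_info *s, uint64_t now)
{
    double u = (double)s->valid_blocks / BLOCKS_PER_SEG;
    double age = (double)(now - s->mtime);
    return (1.0 - u) * age / (1.0 + u);
}

/* Cost-benefit: the segment with the highest score. Requires n >= 1. */
static size_t pick_cost_benefit(const struct seg_info *segs, size_t n,
                                uint64_t now)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (cost_benefit(&segs[i], now) > cost_benefit(&segs[best], now))
            best = i;
    return best;
}
```

With a recently written segment holding 100 valid blocks and an old segment holding 120, greedy picks the former while cost-benefit prefers the latter, since the older segment is less likely to lose more blocks by waiting.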
GC Condition
• Background GC triggering condition
1. Operated by a kernel thread
2. GC is not currently being conducted.
3. There are enough invalid blocks.
• Adjust sleep time (3min ~ 6min) based on # of invalid blocks
4. IO subsystem is idle, checked by the # of writeback pages.
5. IO subsystem is idle, checked by the # of requests in bdev's request list.
• Foreground GC triggering condition
– When there are not enough free segments to serve VFS calls.
• Free Sections ≤ Reserved Sections + Dirty node
• $\text{Reserved Sections} = \left(\frac{1}{\text{overprovision ratio}} + 5\right) \times \text{segs\_per\_sec}$ (Default = 25)
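The foreground GC condition can be checked with a few lines of arithmetic. This is a sketch of the slide's formula only, not the kernel's code: the function names are hypothetical, and the overprovision ratio is passed as a percentage (5 means 0.05), so 1 / ratio becomes 100 / percent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reserved sections per the formula above. ovp_percent is the
 * overprovision ratio in percent and must be nonzero. */
static uint32_t reserved_sections(uint32_t ovp_percent, uint32_t segs_per_sec)
{
    return (100 / ovp_percent + 5) * segs_per_sec;
}

/* Foreground GC is needed when free sections fall to or below the
 * reserved sections plus the sections needed for dirty nodes. */
static bool need_foreground_gc(uint32_t free_secs, uint32_t reserved_secs,
                               uint32_t dirty_node_secs)
{
    return free_secs <= reserved_secs + dirty_node_secs;
}
```

With a 5% overprovision ratio and one segment per section, `reserved_sections(5, 1)` evaluates to 25, matching the default quoted above.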
Segment Allocation
• Copy-and-compaction scheme (LFS Alloc)
– Good for sequential write performance
– Causes cleaning overhead under high utilization
• Threaded log scheme (SSR Alloc)
– No cleaning process is needed
– Causes random writes
LFS Alloc - Allocating a free segment
SSR Alloc - Allocating a used segment
[Figure legend: FREE, VALID, INVALID]
Segment Allocation
• Adaptive logging
– Normally, copy-and-compaction is adopted
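– If there is not enough free space, the policy is dynamically changed to threaded logging
[Figure: allocation policy as free space shrinks]
– Above the over-provisioning threshold: LFS Alloc
– Between the over-provisioning threshold and the Reserved Section threshold: SSR Alloc + LFS Alloc (warm node)
– Below the Reserved Section threshold: LFS Alloc + SSR Alloc + FG Cleaning
– At idle time: BG Cleaning process
– Warm node segments are NOT selected as SSR victims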
SSR Condition
• Exception
– Warm Node
• For recovery
• LFS Allocation
– Utilization == 100%
• In-place update (Linux 3.8)
FG
When there are not enough free segments to serve VFS calls
- Free Sections ≤ Reserved Sections + Dirty Node
- Reserved Sections → depends on the overprovision ratio (Default = 25)
SSR
The SSR condition is checked before allocating a segment
- Free Sections ≤ Overprovision Sections
- Default overprovision ratio = 5% (format option)
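The SSR condition and the warm node exception can be combined into a small decision function. This is a sketch under stated assumptions: `struct space_state`, `choose_alloc`, and the enum are hypothetical names, and the in-place update case at 100% utilization is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

enum alloc_mode {
    ALLOC_LFS,  /* allocate a free segment (copy-and-compaction) */
    ALLOC_SSR   /* reuse slack space in a used segment (threaded log) */
};

/* Hypothetical snapshot of the free-space state. */
struct space_state {
    uint32_t free_secs;  /* currently free sections */
    uint32_t ovp_secs;   /* overprovision sections */
};

/* Choose the allocation scheme for the next segment. */
static enum alloc_mode choose_alloc(const struct space_state *s,
                                    bool is_warm_node)
{
    /* Exception: warm node logs always use LFS allocation so that
     * they stay sequential for recovery. */
    if (is_warm_node)
        return ALLOC_LFS;
    /* SSR condition: free sections <= overprovision sections. */
    if (s->free_secs <= s->ovp_secs)
        return ALLOC_SSR;
    return ALLOC_LFS;
}
```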
Adaptive Write Policy
• Experiment
– Embedded system with eMMC 12GB partition
– Iozone random write tests on several 1GB files
• Results
– Sustained performance is improved by adaptive write.
Performance
Block Trace
Performance Drop under Aging
F2FS version-up history
4.0 ~ 4.19
4.0, 4.1
• batched trim
– submits split discard commands
– To avoid long latency due to huge trim commands
• Trim invokes checkpoint
• Add an optional rb-tree based extent cache
– an improvement over the original extent info cache
– An extent maps between contiguous logical addresses and physical addresses
– Lower memory consumption
• Enable inline data by default
– Small (< 3.4KB) files can be stored directly in the inode
4.3
• ioctl F2FS_IOC_GARBAGE_COLLECT
– lets users trigger a cleaning job explicitly
• Enhance multithread performance
– Protect submit_bio operation with writepages mutex lock
– writepages mutex lock serializes all block address allocation and page submitting pairs from different inodes
• Introduce a shrinker
– reclaims memory consumed by a number of in-memory f2fs data structures
4.5, 4.7
• data_flush mount option
– enables data flushing before checkpoint in order to persist the data of regular files and symlinks
– Flushes all dirty pages as well as write-submitted data
• /proc entry to show the valid block bitmap so that users can be aware of fragmentation
• Speedup fallocate
– improves the expand_inode speed in fallocate by allocating as many data blocks as possible within a single locked node page
4.8
• Add lazytime mount option
– keeps atime updates only in a file's in-memory inode until there is some other reason to write the inode to disk, or until the inode itself is being pushed out of memory
• Add discard/nodiscard mount option
– Enable/disable real-time discard in f2fs. If discard is enabled, f2fs issues discard/TRIM commands when a segment is cleaned.
• Support the move_range ioctl
– moves a range of data blocks from one file to another
• mode=lfs mount option
– Turns IPU/SSR off
• Enable flush_merge option by default
– Merges concurrent cache_flush commands as much as possible to eliminate redundant command issues.
– Recommended when the underlying device handles the cache_flush command relatively slowly.
4.9
• Support async discard
– discard commands are issued, and their completions (endio) are waited for, in batches to improve performance.
4.11
• inline_xattr/noinline_xattr
– stores xattr entries in each inode
• Support IO alignment for DATA and NODE writes
– fills dummy blocks in write bios
– avoids the dummy-page padding that the FTL would otherwise perform in order to close partially written MLC or TLC pages
– BIO_MAX_PAGES = 256
• Add a kernel thread to issue discard commands asynchronously
xattr : http://man7.org/linux/man-pages/man7/xattr.7.html
4.12, 4.13
• Enable small discard by default
– 4K granularity small discard
– discard_granularity in sysfs
• Controls the granularity of discard command size.
• Discard commands are issued if the size is larger than the given granularity.
• Its unit size is 4KB, and 4 (=16KB) is set by default.
• The maximum value is 128 (=512KB).
• Write small-sized IO to hot log
– Split small and large IOs in order to get more consecutive big writes
– Small-sized IO threshold: min_hot_blocks in sysfs
• Add ioctl to do GC with a target block address
– f2fs_ioc_gc_range() moves blocks located in the given range.
4.14
• Support F2FS_IOC_FS{GET,SET}XATTR
• Support inode checksum
– Inode read verifies checksum
• Introduce gc_urgent mode for background GC
– via sysfs
– gc_urgent = 0 [default]
– gc_urgent = 1: the background thread starts to do GC at the given gc_urgent_sleep_time interval.
GC-related sysfs files
• gc_urgent
• gc_urgent_sleep_time
– controls sleep time for gc_urgent. 500 ms is set by default.
• gc_min_sleep_time, gc_max_sleep_time
– control the minimum/maximum sleep time for the garbage collection thread.
• gc_no_gc_sleep_time
– controls the default sleep time for the garbage collection thread.
• gc_idle: selects the victim policy for garbage collection.
– gc_idle = 0 (default) disables this option. (Adaptive)
– gc_idle = 1: Cost Benefit, gc_idle = 2: Greedy
• reclaim_segments
– If the number of prefree segments > reclaim_segments, f2fs tries to conduct a checkpoint to turn the prefree segments into free segments.
– By default, 5% of the total # of segments.
A prefree segment has only pre-invalid blocks and invalid blocks.
A checkpoint changes prefree segments into free segments.
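The reclaim_segments rule reduces to one comparison against a threshold derived from the total segment count. A minimal sketch, with hypothetical function names and the 5% default taken from the text above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Default threshold: 5% of the total number of segments. */
static uint32_t default_reclaim_segments(uint32_t total_segs)
{
    return (uint32_t)((uint64_t)total_segs * 5 / 100);
}

/* A checkpoint is attempted once prefree segments exceed the threshold,
 * which turns those prefree segments into free segments. */
static bool should_checkpoint(uint32_t prefree_segs, uint32_t reclaim_segments)
{
    return prefree_segs > reclaim_segments;
}
```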
4.15
• Support flexible inline xattr size
– expands the inline xattr size flexibly according to the user's requirement.
• Export SSR allocation threshold in sysfs
– min_ssr_sections
4.16
• sysfs interface readdir_ra
– enables/disables readahead of inode blocks during readdir
• Add an ioctl to disable GC for a specific file
– useful when the user wants to keep the file's block map unchanged
• Add F2FS_IOC_PRECACHE_EXTENTS ioctl
– precaches extent info like ext4
– eliminates synchronous waiting for mapping info
4.17
• Add block allocation policies to pass down write hints given by the user
– whint_mode=off (default). F2FS only passes down WRITE_LIFE_NOT_SET
– whint_mode=user-based. F2FS tries to pass down hints given by users
– whint_mode=fs-based. F2FS passes down hints with its policy.
• Expose extension_list to user and introduce hot file extension
– Via sysfs
write hints
Write-hint Policy
-----------------
1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.

2) whint_mode=user-based. F2FS tries to pass down hints given by users.

User                  F2FS          Block
----                  ----          -----
                      META          WRITE_LIFE_NOT_SET
                      HOT_NODE      "
                      WARM_NODE     "
                      COLD_NODE     "
ioctl(COLD)           COLD_DATA     WRITE_LIFE_EXTREME
extension list        "             "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA     WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA      WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA     WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "             "
WRITE_LIFE_MEDIUM     "             "
WRITE_LIFE_LONG       "             "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA     WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA      WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA     WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "             WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "             WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "             WRITE_LIFE_LONG

3) whint_mode=fs-based. F2FS passes down hints with its policy.

User                  F2FS          Block
----                  ----          -----
                      META          WRITE_LIFE_MEDIUM
                      HOT_NODE      WRITE_LIFE_NOT_SET
                      WARM_NODE     "
                      COLD_NODE     WRITE_LIFE_NONE
ioctl(COLD)           COLD_DATA     WRITE_LIFE_EXTREME
extension list        "             "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA     WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA      WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA     WRITE_LIFE_LONG
WRITE_LIFE_NONE       "             "
WRITE_LIFE_MEDIUM     "             "
WRITE_LIFE_LONG       "             "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA     WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA      WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA     WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "             WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "             WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "             WRITE_LIFE_LONG

Hint values: short, medium, long, extreme; not_set, none
write hints
f2fs_io_type_to_rw_hint()

whint_mode == WHINT_MODE_USER
Data   Hot    WRITE_LIFE_SHORT
       Warm   WRITE_LIFE_NOT_SET
       Cold   WRITE_LIFE_EXTREME
Node          WRITE_LIFE_NOT_SET
Meta          WRITE_LIFE_NOT_SET

whint_mode == WHINT_MODE_FS
Data   Hot    WRITE_LIFE_SHORT
       Warm   WRITE_LIFE_LONG
       Cold   WRITE_LIFE_EXTREME
Node   Hot    WRITE_LIFE_NOT_SET
       Warm   WRITE_LIFE_NOT_SET
       Cold   WRITE_LIFE_NONE
Meta          WRITE_LIFE_MEDIUM
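The two tables for f2fs_io_type_to_rw_hint() can be expressed as a single mapping function. The sketch below uses its own illustrative enums (`LIFE_*`, `IO_*`, `TEMP_*`, `WHINT_*`) rather than the kernel's constants, and `io_type_to_hint` is a hypothetical name; it mirrors the tables, not the kernel source.

```c
enum whint_mode { WHINT_OFF, WHINT_USER, WHINT_FS };
enum io_type    { IO_DATA, IO_NODE, IO_META };
enum temp       { TEMP_HOT, TEMP_WARM, TEMP_COLD };
enum life_hint  {
    LIFE_NOT_SET, LIFE_NONE, LIFE_SHORT,
    LIFE_MEDIUM, LIFE_LONG, LIFE_EXTREME
};

/* Map (mode, I/O type, temperature) to the hint passed to the block layer. */
static enum life_hint io_type_to_hint(enum whint_mode mode,
                                      enum io_type type, enum temp temp)
{
    if (mode == WHINT_USER) {
        if (type == IO_DATA && temp == TEMP_HOT)
            return LIFE_SHORT;
        if (type == IO_DATA && temp == TEMP_COLD)
            return LIFE_EXTREME;
        return LIFE_NOT_SET;        /* warm data, all nodes, meta */
    }
    if (mode == WHINT_FS) {
        switch (type) {
        case IO_DATA:
            if (temp == TEMP_HOT)
                return LIFE_SHORT;
            if (temp == TEMP_COLD)
                return LIFE_EXTREME;
            return LIFE_LONG;       /* warm data */
        case IO_NODE:
            return temp == TEMP_COLD ? LIFE_NONE : LIFE_NOT_SET;
        case IO_META:
            return LIFE_MEDIUM;
        }
    }
    return LIFE_NOT_SET;            /* whint_mode=off */
}
```

The only differences between the two modes are warm data (not set versus long), cold nodes (not set versus none), and meta (not set versus medium).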
4.17
• Add mount option for segment allocation policy
– "alloc_mode=reuse"
• always allocates segments starting from segment 0 so that segments are reassigned
• useful for small-sized eMMC parts
– "alloc_mode=default"
• heap-based
• nowait aio support
– non-blocking AIO
• Support large NAT bitmap
– allows a larger number of nodes
4.18, 4.19
• Enable -o discard by default
– enable real-time discard by default
– Issue discard at every segment cleaning
• Add fsync_mode=nobarrier for non-atomic files
– No flush command
• fsync_mode=%s
– Controls the policy of fsync. Currently supports posix, strict, and nobarrier.
– posix mode (default)
• fsync follows POSIX semantics and does a light operation to improve filesystem performance.
– strict mode
• fsync is heavy and behaves in line with xfs, ext4 and btrfs, so xfstests generic/342 passes, but performance regresses.
– nobarrier is based on posix
• does not issue a flush command for non-atomic files, like the "nobarrier" mount option.