Joonwon Lee joon@kaist.ac.kr File System



Long-term Information Storage

1. Must store large amounts of data

2. Information stored must survive the termination of the process using it

3. Multiple processes must be able to access the information concurrently


File Structure

• Three kinds of files

– byte sequence

– record sequence

– tree


File Types

(a) An executable file (b) An archive


File Access

• Sequential access

– read all bytes/records from the beginning

– cannot jump around, but can rewind or back up

– convenient when the medium was magnetic tape

• Random access

– bytes/records read in any order

– essential for database systems

– read can be …

• move file marker (seek), then read or …

• read and then move file marker
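The two access styles can be sketched in a few lines. This is only an illustration: the fixed record size, the helper names, and the scratch file are assumptions made for the example, not part of the slides.

```python
import os
import tempfile

RECORD_SIZE = 8  # assumed fixed record length in bytes


def read_record(f, index):
    """Random access: move the file marker (seek), then read one record."""
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE)


def read_all_sequential(f):
    """Sequential access: rewind, then read records in order from the start."""
    f.seek(0)
    records = []
    while True:
        chunk = f.read(RECORD_SIZE)
        if not chunk:
            break
        records.append(chunk)
    return records


# Build a small scratch file holding four fixed-size records.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"record-0record-1record-2record-3")
    path = tmp.name

with open(path, "rb") as f:
    third = read_record(f, 2)            # jump straight to record 2
    everything = read_all_sequential(f)  # rewind and read all four

os.remove(path)
```

Each read also advances the file marker, so a seek followed by consecutive reads covers both read variants listed above.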


Memory-Mapped Files

• map() and unmap()

– map a file onto a portion of the address space

– read() and write() are replaced with memory operations

• implementation

– page tables map the file like ordinary pages

– same sharing/protection as pages

• issues

– interaction between the file system and VM when two processes access the same file via different methods
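As a rough illustration, Python's standard mmap module exposes the same idea: once the file is mapped, reads and writes become slice operations on memory. The scratch file and its contents are assumptions made for the example.

```python
import mmap
import os
import tempfile

# Create a scratch file to map.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello file system")
    path = tmp.name

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file
    first_word = bytes(mm[0:5])    # a read is now a memory load
    mm[0:5] = b"HELLO"             # a write is now a memory store
    mm.flush()                     # push dirty pages back to the file
    mm.close()

# A second access path (ordinary read) sees the stored bytes.
with open(path, "rb") as f:
    on_disk = f.read()

os.remove(path)
```

The final read() goes through the ordinary file interface, which is the kind of mixed access the "issues" bullet refers to.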


File System Implementation

A possible file system layout


Implementing Files (1)

(a) Contiguous allocation of disk space for 7 files

(b) State of the disk after files D and E have been removed


Implementing Files (2)

Storing a file as a linked list of disk blocks


Implementing Files (3)

Linked list allocation using a file allocation table in RAM
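A minimal sketch of how a file's blocks can be recovered from such a table. The table size, the block numbers, and the end-of-chain marker are made up for the example.

```python
END_OF_CHAIN = -1  # assumed marker for the last block of a file


def file_blocks(fat, first_block):
    """Follow the chain in the in-memory table, starting at first_block."""
    blocks = []
    block = first_block
    while block != END_OF_CHAIN:
        blocks.append(block)
        block = fat[block]  # the table entry holds the next block number
    return blocks


# Hypothetical 16-entry table; one file starts at block 4.
fat = [None] * 16
fat[4], fat[7], fat[2], fat[10], fat[12] = 7, 2, 10, 12, END_OF_CHAIN

chain = file_blocks(fat, 4)
```

Because the whole chain lives in RAM, finding the n-th block of a file needs no disk accesses, only n table lookups.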


Implementing Files (4)

An example i-node


Implementing Directories (1)

(a) A simple directory: fixed-size entries, with disk addresses and attributes in the directory entry

(b) Directory in which each entry just refers to an i-node


Implementing Directories (2)

• Two ways of handling long file names in a directory

– (a) In-line

– (b) In a heap


Shared Files (1)

File system containing a shared file


Shared Files (2)

(a) Situation prior to linking

(b) After the link is created

(c) After the original owner removes the file


Disk Space Management (1)

• Dark line (left-hand scale) gives the data rate of a disk

• Dotted line (right-hand scale) gives the disk space efficiency

• All files are 2 KB

• Horizontal axis: block size


Disk Space Management (2)

(a) Storing the free list on a linked list

(b) A bit map
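The bit map variant can be sketched as follows. The class name, the first-fit allocation policy, and the 20-block disk are assumptions made for illustration.

```python
class BlockBitmap:
    """One bit per disk block: 0 = free, 1 = in use."""

    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.bits = bytearray((n_blocks + 7) // 8)

    def is_used(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def allocate(self):
        """First-fit: return the lowest-numbered free block and mark it used."""
        for block in range(self.n_blocks):
            if not self.is_used(block):
                self.bits[block // 8] |= 1 << (block % 8)
                return block
        raise OSError("no free blocks")

    def free(self, block):
        """Clear the block's bit so it can be allocated again."""
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF


bitmap = BlockBitmap(20)
a = bitmap.allocate()
b = bitmap.allocate()
bitmap.free(a)
c = bitmap.allocate()  # reuses the block that was just freed
```

A 20-block disk needs only 3 bytes of bitmap, which is why the bit map is usually the more compact of the two representations.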


Disk Space Management (3)

(a) Almost-full block of pointers to free disk blocks in RAM, with three blocks of pointers on disk

(b) Result of freeing a 3-block file

(c) Alternative strategy for handling 3 free blocks; shaded entries are pointers to free disk blocks


Disk Space Management (4)

Quotas for keeping track of each user’s disk use


fsck() - blocks

• File system states

(a) consistent

(b) missing block (block 2 in the figure): add it to the free list

(c) duplicate block in the free list (block 4): can happen only when the free list is a linked list, not a bit map

(d) duplicate data block (block 5): deleting either file would leave the block both in use and free

– solution: copy the data to a new block
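The block check can be sketched with two counters per block, one for appearances in files and one for appearances in the free list. The function name, the report strings, and the sample disk are assumptions for this example; a real fsck works on on-disk structures.

```python
def check_blocks(n_blocks, files, free_list):
    """Classify blocks using per-block 'in use' and 'free' counters.

    files: dict mapping file name -> list of block numbers
    free_list: list of block numbers recorded as free
    """
    in_use = [0] * n_blocks
    free = [0] * n_blocks
    for blocks in files.values():
        for b in blocks:
            in_use[b] += 1
    for b in free_list:
        free[b] += 1

    report = {}
    for b in range(n_blocks):
        if in_use[b] == 0 and free[b] == 0:
            report[b] = "missing block"           # case (b)
        elif in_use[b] == 0 and free[b] > 1:
            report[b] = "duplicate in free list"  # case (c)
        elif in_use[b] > 1:
            report[b] = "duplicate data block"    # case (d)
    return report


# Hypothetical 8-block disk reproducing the three inconsistent states.
files = {"a": [0, 1, 5], "b": [5, 6]}
free_list = [3, 4, 4, 7]
report = check_blocks(8, files, free_list)
```

A consistent file system yields an empty report: every block has exactly one 1 across the two counters.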


fsck() - files

• examines the directory system

– keeps a counter per file

– a hard link makes a file appear in multiple directories

• compares the counter values with the link count in the i-node

– link count higher than the counter: the i-node will never be deleted

– counter higher than the link count: a file that is still linked can be deleted
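A small sketch of the file check: walk every directory, count the entries that point at each i-node, and compare with the stored link count. The data layout and the message strings are assumptions for illustration.

```python
def check_link_counts(directories, inode_link_count):
    """Compare directory-entry counts with the link count stored per i-node.

    directories: dict mapping directory path -> {entry name: i-node number}
    inode_link_count: dict mapping i-node number -> stored link count
    """
    found = {}
    for entries in directories.values():
        for inode in entries.values():
            found[inode] = found.get(inode, 0) + 1

    problems = {}
    for inode, stored in inode_link_count.items():
        counted = found.get(inode, 0)
        if stored > counted:
            problems[inode] = "link count too high: i-node is never freed"
        elif stored < counted:
            problems[inode] = "link count too low: file can be freed while linked"
    return problems


# Hypothetical tree: i-node 10 is hard-linked from two directories.
directories = {
    "/": {"a": 10, "b": 11},
    "/home": {"a2": 10, "c": 12},
}
inode_link_count = {10: 1, 11: 2, 12: 1}
problems = check_link_counts(directories, inode_link_count)
```

In both cases the usual repair is to set the stored link count to the value found by the directory walk.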


Buffer Cache

The block cache data structures


File System Performance (2)

• i-nodes placed at the start of the disk

• Disk divided into cylinder groups

– each with its own blocks and i-nodes


The MS-DOS File System (1)

The MS-DOS directory entry


The MS-DOS File System (2)

• Maximum partition size for different block sizes

• The empty boxes represent forbidden combinations


The Windows 98 File System (1)

• The extended MS-DOS directory entry used in Windows 98

• The 32-bit block address is split across two fields

• A file with a long name has two names

– My Document = MYDOCU~1


The Windows 98 File System (2)

An example of how a long name is stored in Windows 98


The UNIX V7 File System (1)

A UNIX V7 directory entry


The UNIX V7 File System (2)

A UNIX i-node


The UNIX V7 File System (3)

The steps in looking up /usr/ast/mbox
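The lookup can be sketched as a walk that alternates between i-nodes and directory contents, one path component at a time. The i-node numbers, the root i-node number, and the in-memory dictionaries below are illustrative assumptions, not the on-disk V7 format.

```python
ROOT_INODE = 1  # assumed i-node number of the root directory


def lookup(inodes, path):
    """Resolve an absolute path to an i-node number, one component at a time."""
    current = ROOT_INODE
    for name in (part for part in path.split("/") if part):
        node = inodes[current]
        if node["type"] != "dir":
            raise NotADirectoryError(name)
        if name not in node["entries"]:
            raise FileNotFoundError(path)
        current = node["entries"][name]  # entry maps a name to an i-node number
    return current


# Hypothetical i-node table for /usr/ast/mbox.
inodes = {
    1: {"type": "dir", "entries": {"bin": 4, "usr": 6}},
    4: {"type": "dir", "entries": {}},
    6: {"type": "dir", "entries": {"ast": 26}},
    26: {"type": "dir", "entries": {"mbox": 60}},
    60: {"type": "file"},
}

mbox_inode = lookup(inodes, "/usr/ast/mbox")
```

Each loop iteration stands for two disk reads in a real system: one for the directory's i-node and one for the directory block it points to.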


DEMOS (Cray-1)

• in the normal case, contiguous allocation

• flexibility for non-contiguous allocation

• file header

– table of base and size (10 entries)

– each block group (a group of blocks) is contiguous on disk
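A sketch of how a (base, size) table translates a logical block number into a disk block number. The function name and the sample header are assumptions for illustration.

```python
def logical_to_disk_block(header, logical_block):
    """Map a file-relative block number to a disk block number.

    header: list of (base, size) block groups in file order (at most 10).
    """
    remaining = logical_block
    for base, size in header:
        if remaining < size:
            return base + remaining  # the block lies inside this group
        remaining -= size            # skip past this group
    raise IndexError("block beyond end of file")


# Hypothetical file made of three block groups.
header = [(100, 4), (300, 2), (50, 8)]

first = logical_to_disk_block(header, 0)
fifth = logical_to_disk_block(header, 4)
last = logical_to_disk_block(header, 13)
```

In the common case of a single block group the loop finishes on its first iteration, which matches the "normally contiguous" design goal.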


DEMOS (2)

– if a file needs more than 10 block groups, set a flag in the file header: BIGFILE (max 10GB)

• each block group then contains pointers to block groups

– pros & cons

• + easy to find free block groups (small bitmap)

• + free areas merge automatically

• - when the disk comes close to full

– no long runs of blocks (fragmentation)

– CPU overhead to find a free block

– the disk should keep some space in reserve

• experience suggests that about 10% is a good reserve


Transactions in File System

• reliability from unreliable components

• concepts

– atomicity: all or nothing

– durability: once it happens, it is there

– serializability: transactions appear to happen one by one


Transactions in File System (2)

• Motivation

– File Systems have lots of data structures

• bitmap for free blocks

• directory

• file header

• indirect blocks

• data blocks

– for performance reasons, all must be cached

• read requests are easy

• what about writes?


Transactions in File System

• Write to cache

– write through: the cache does not help writes

– write back: data can be lost on a crash

• A single file operation usually involves multiple updates

– what happens if a crash occurs between updates?

– e.g. 1: move a file between directories

• delete file from old directory

• add file to new directory

– e.g. 2: create a new file

• allocate space on disk for header, data

• write new header to disk

• add the new file to the directory


Transactions in File System

• Unix approach (ad hoc)

– meta-data consistency

• synchronous write-through

• multiple updates are done in a specific order

• after a crash, the fsck program fixes up anything in progress

– file created, but not yet in a directory => delete the file

– blocks allocated, but not in the bitmap => update the bitmap

– user data consistency

• write back to disk every 30 seconds or on user request

• no guarantee that blocks are written to disk in any particular order

– no support for transactions

• a user may want multiple file operations done as a unit


Transactions in File System

• Write-ahead logging

– Almost all file systems built since 1985 use write-ahead logging

• Windows NT, Solaris, OSF, etc.

– mechanism

• operation

– write all changes in a transaction to the log

– send file changes to disk

– reclaim log space

• on a crash, read the log

– if the log isn't complete, make no change

– if the log is completely written, apply all changes to disk

– if the log is empty, there is nothing to do: all updates have reached disk
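The recovery rule can be sketched as a redo pass over the log. The record format, the use of a commit record to mean "the log for this transaction is completely written", and the sample disk contents are all assumptions made for the example.

```python
def recover(log, disk):
    """Replay a write-ahead log after a crash.

    log: list of ("write", txn, block, data) and ("commit", txn) records
    disk: dict mapping block name -> contents, updated in place
    """
    # A transaction counts as completely logged only if its commit is present.
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, block, data = rec
            disk[block] = data  # redo; applying it twice is harmless
    return disk


# Hypothetical crash: transaction 1 (move a file) is fully logged,
# transaction 2 (create a file) was cut off before its commit record.
disk = {"old_dir": ["f"], "new_dir": [], "bitmap": "0110"}
log = [
    ("write", 1, "old_dir", []),
    ("write", 1, "new_dir", ["f"]),
    ("commit", 1),
    ("write", 2, "bitmap", "0111"),
]
recover(log, disk)
```

An empty log leaves the disk untouched, which corresponds to the last case above.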


Log-Structured File Systems

• Idea

– write data once

– the log is the only copy of the data

– as you modify disk blocks, store them in the log

– put everything (data blocks, file headers, etc.) in the log

• Data fetch

– if you need to get data from disk, get it from the log

– keep a map in memory

• tells you where everything is

• the map should also be in the log for crash recovery
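A toy model of the idea: every write appends to the log, and an in-memory map records where the latest version of each item lives. The class and its rebuild-by-scanning recovery are simplifying assumptions; the slides instead call for saving the map in the log.

```python
class TinyLog:
    """Append-only log plus an in-memory map to the latest version of each key."""

    def __init__(self):
        self.log = []  # list of (key, value), only ever appended to
        self.map = {}  # key -> position of its newest version in the log

    def write(self, key, value):
        self.log.append((key, value))  # never overwrite in place
        self.map[key] = len(self.log) - 1

    def read(self, key):
        return self.log[self.map[key]][1]

    def rebuild_map(self):
        """Recover the map by scanning the log; later entries win."""
        self.map = {key: pos for pos, (key, _) in enumerate(self.log)}


fs = TinyLog()
fs.write("inode7", "v1")
fs.write("data3", "x")
fs.write("inode7", "v2")  # the old version stays in the log as garbage
latest = fs.read("inode7")

fs.map = {}               # simulate losing the in-memory map in a crash
fs.rebuild_map()
after_crash = fs.read("inode7")
```

The stale "v1" entry is exactly the kind of hole that later forces garbage collection.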


Log-Structured File Systems

• Advantage

– all writes are sequential

– no seeks, except for reads, which can be handled by the cache

• caches are getting bigger

• in the extreme case, disk I/O is needed only for writes, which are sequential

– but the same problems as contiguous allocation arise

• many files are deleted within their first 5 minutes

• need garbage collection

• if the disk fills up, there is a problem

– keep the disk under-utilized


Log-Structured File Systems

• Mechanism

– Issues for implementing the log

• how to retrieve information from the log

• enough free space for the log

– Cache file changes, and write them sequentially to the disk in a single operation

• fast writes

– Information retrieval

• inode map at a fixed checkpoint region

– indices to inodes contained in the write

– most of them are cached in memory

• fast reads


Log Examples

• In FFS, each inode is at a fixed location on disk

– an index into the inode set is sufficient to find it

• In LFS, a map is needed to locate an inode, since inodes are mixed with data in the log

[Figure: disk layouts. LFS (log): data, i-node, dir, i-node, data, i-node, dir, i-node, map. FFS: i-node, i-node, data, dir, i-node, i-node, dir, data.]


Log-Structured File Systems

• Space management

– holes left by deleting files

– threading

• use the dispersed holes like a linked list

• fragmentation will get worse

– copying

• copy a file out of the log to leave a large hole

• expensive, especially for long-lived files


Segment of LFS

• Concept

– clean segments are linked (threading)

– segments with holes may be copied into a clean segment

– collect long-lived files into the same segment

• Cleaning Policy

– when? low watermark for clean segments

– how many segments? high watermark

– which segments? most fragmented

– how to group files?

• files in the same directory

• aging sort: sort by the last modification time
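The when / how many / which questions can be sketched as one selection function. Several simplifications are assumed here: "most fragmented" is read as "least live data", each cleaned segment is counted as one new clean segment, and the space needed to copy live data is ignored. The names and numbers are hypothetical.

```python
def segments_to_clean(segments, low_watermark, high_watermark):
    """Pick segments to clean.

    segments: dict mapping segment id -> fraction of live data (0.0 = clean)
    Cleaning starts when clean segments drop below low_watermark and aims
    to bring the number of clean segments up to high_watermark.
    """
    clean = sum(1 for live in segments.values() if live == 0.0)
    if clean >= low_watermark:
        return []  # when? not yet
    dirty = sorted(
        (seg for seg, live in segments.items() if live > 0.0),
        key=lambda seg: segments[seg],  # which? least live data first
    )
    return dirty[: high_watermark - clean]  # how many? up to the high mark


# Hypothetical segment usage table.
usage = {0: 0.0, 1: 0.9, 2: 0.2, 3: 0.5, 4: 0.0, 5: 0.1}
victims = segments_to_clean(usage, low_watermark=3, high_watermark=4)
```

Cleaning nearly empty segments first means little live data has to be copied for each segment reclaimed.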


Recovery

• checkpoints and roll-forward (NOT roll-back!)

– possible since all the file operations are in the log

• checkpoint

– a point in the log at which all file system structures are consistent and complete

– contains

• addresses of the inode maps

• segment usage table

• current time

– checkpoint region

• contains the checkpoint

• placed at a specific location on disk


Recovery (2)

• operation

– 1. write out all modified information to disk

– 2. write out the checkpoint region

• on a crash,

– roll forward through the operations logged after the last checkpoint

– if the crash occurs while writing a checkpoint,

• keep the old checkpoint

• need two checkpoint regions


Roll-Forward

• Recover as much information as possible

• if the segment summary block shows

– a new inode: the data blocks must have been written before it, so just update the inode map

– data blocks without an inode: ignore them, since we don't know whether the data blocks are complete

• Each inode has a counter indicating how many directories refer to it

– the reference counter may be updated but the directory not written

– the directory may be written but the reference counter not updated

– solution: employ a special write-ahead log for directory changes


Informed Prefetching ..

• Prefetching

– memory prefetching (to cache memory)

– disk prefetching (to memory buffer)

• disk latency is orders of magnitude larger

• Pros & cons of prefetching

– reduces latency when the prefetched data is accessed

– file cache may be wasted if the prefetched data is unused

– difficult to know when the prefetched data will be used

– interference with other cached data and virtual memory is difficult to understand

• Assumptions

– disk parallelism is underutilized

– applications provide hints


Limits of RAID

• RAID increases disk throughput when the workload can be processed in parallel

– very large accesses

– multiple concurrent accesses

• Many real I/O workloads are not parallel

– get a byte from a file

– think

– get another byte from (the same or another) file

– access only a single disk at a time


Real I/O Workload

• Recent trends

– faster CPUs generate I/O requests more often

– programs favor larger data objects

– file cache hit ratio is more important than before

• Most of the workload is “read”

– writes can be done behind, in parallel (e.g. Linux)

– processes are blocked on “read”

– most access patterns are predictable

• Let’s use the predictability as “hints”


Overview

• Application discloses its future resource requirements

– the system makes the final decisions, not applications

• Disclosing hints are issued through ioctl

– file specifier

• file name or file descriptor

– pattern specifier

• sequential

• list of <offset, length>

• What to do with the disclosing hints

– parallelize the I/O requests for RAID

– keep the data in the cache

– schedule the disk to reduce seek time


Informed Cache Manager


A System Model

• total execution time $T = N_{I/O}(T_{CPU} + T_{I/O})$

– number of I/Os × (time between I/Os + I/O time)

• $T_{I/O}$ is $T_{hit}$ on a cache hit and $T_{miss}$ on a cache miss

• $T_{miss} = T_{hit} + T_{driver} + T_{disk}$

– $T_{disk}$: latency of the disk fetch

– $T_{driver}$: buffer allocation, queueing at the driver, and interrupt service
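A worked instance of the model, reading the I/O time as the hit time for accesses served from the cache and the miss time for the rest (an interpretation assumed here). All timing values are made-up numbers in microseconds.

```python
def execution_time(n_hits, n_misses, t_cpu, t_hit, t_driver, t_disk):
    """Total time: every I/O costs t_cpu of computation plus its service time."""
    t_miss = t_hit + t_driver + t_disk  # a miss pays the hit cost plus the fetch
    return n_hits * (t_cpu + t_hit) + n_misses * (t_cpu + t_miss)


# Hypothetical workload: 100 I/Os, 90 of them cache hits.
total = execution_time(
    n_hits=90, n_misses=10,
    t_cpu=1000, t_hit=250, t_driver=500, t_disk=15000,
)
all_hits = execution_time(100, 0, 1000, 250, 500, 15000)
```

With these numbers the 10 misses account for more than half of the total time, which is why reducing stalls on misses is the focus of the rest of the model.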


Benefit of a buffer

• $T_{stall}(x)$: read stall time when there are x buffers for x prefetches

• $T_{pf}(x)$: service time for a hinted read when there are x buffers

• benefit of using one more buffer

[Figure: timeline from the point where the application issues a hint to the point where the application accesses the data, spanning x buffers]


Stall time $T_{stall}(x)$

• before the x-th request is generated, at least $x(T_{CPU} + T_{hit} + T_{driver})$ of CPU time elapses (all cache hits, no stall)

[Figure: two timelines from "prefetch issued" to "prefetched data is accessed". In the first, $x(T_{CPU} + T_{hit} + T_{driver})$ covers $T_{disk}$ and there is no stall. In the second, it is shorter than $T_{disk}$ and the remainder is the stall time.]


Prefetch Horizon

• prefetch horizon $P(T_{CPU})$: the distance at which $T_{stall}$ becomes zero, i.e., there is no need to prefetch beyond this point

• stall time $T_{stall}(x)$ is bounded by $T_{disk} - x(T_{CPU} + T_{hit} + T_{driver})$

– this bound overestimates the stall time
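Setting the bound to zero gives a horizon of roughly the disk latency divided by the per-access time. A small sketch under that reading; the function names and the microsecond values are assumptions for illustration.

```python
import math


def stall_bound(x, t_cpu, t_hit, t_driver, t_disk):
    """Upper bound on stall time when prefetching x accesses ahead."""
    return max(0, t_disk - x * (t_cpu + t_hit + t_driver))


def prefetch_horizon(t_cpu, t_hit, t_driver, t_disk):
    """Smallest prefetch distance at which the bound reaches zero."""
    return math.ceil(t_disk / (t_cpu + t_hit + t_driver))


# Hypothetical timings in microseconds.
timings = dict(t_cpu=1000, t_hit=250, t_driver=500, t_disk=15000)

horizon = prefetch_horizon(**timings)
bound_before = stall_bound(horizon - 1, **timings)
bound_at = stall_bound(horizon, **timings)
```

A faster CPU shrinks the per-access time and so pushes the horizon further out, which is why the horizon is written as a function of the CPU time.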


What really happens …

• 3 buffers are assumed

• so,


Benefit of a single buffer

• When used for prefetching

• When used for demand miss


Model Verification

• The model underestimates the stall time due to

– neglecting disk contention

– variation in disk service time (queueing effect)

• overall, it is a good estimator


Cost of Shrinking LRU buffer cache

• hit ratio: $H(n)$ for a file cache with n buffers

• service time

$T_{LRU}(n) = H(n)\,T_{hit} + (1 - H(n))\,T_{miss}$

• cost of taking a buffer from the file cache

$\Delta T_{LRU}(n) = T_{LRU}(n-1) - T_{LRU}(n) = (H(n) - H(n-1))(T_{miss} - T_{hit})$

– $H(n)$ varies with workload

– need dynamic run-time monitoring
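The two formulas can be checked numerically. The hit-ratio curve and the timings below are invented for the example; in practice the hit ratio would come from run-time monitoring.

```python
def t_lru(h, t_hit, t_miss):
    """Average service time for a cache with hit ratio h."""
    return h * t_hit + (1 - h) * t_miss


def shrink_cost(hit_ratio, n, t_hit, t_miss):
    """Extra service time per access when the cache goes from n to n-1 buffers."""
    return (hit_ratio[n] - hit_ratio[n - 1]) * (t_miss - t_hit)


# Hypothetical hit-ratio curve: hit_ratio[n] is the hit ratio with n buffers.
hit_ratio = [0.0, 0.5, 0.75, 0.875]
t_hit, t_miss = 250, 15750  # microseconds

cost = shrink_cost(hit_ratio, 3, t_hit, t_miss)
direct = t_lru(hit_ratio[2], t_hit, t_miss) - t_lru(hit_ratio[3], t_hit, t_miss)
```

Both routes give the same number, and the flattening curve shows why the last buffers of a large cache are the cheapest ones to give up.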


Cost for Ejecting a Prefetched Block

• cost is paid when the ejected block is accessed again later

– if the block stays in the cache, it would be $T_{hit}$

• cost when that block is prefetched x accesses in advance

– ejection frees one block for y-x accesses

– increase in service time per access is

[Figure: timeline marking the eject, prefetch, and reaccess points, the distances y and x, and the region affected by eviction]


Local Value Estimates


Seeking Global Optimum

• Normalization of each estimate: LRU, hinted prefetch

– multiply each by its usage rate

– unhinted demand access rate × LRU cache estimate ($T_{LRU}$)

– access rate to the hinted sequence × ($T_{pf}$)

• When a manager needs a new block

– each estimator selects its least valuable block

• hint: the block that is accessed furthest in the future

• LRU: the block at the bottom of the LRU stack

– the manager selects the least valuable block

– compare the benefit with the cost of the least valuable block


After 4 Years,

• Providing hints is too much of a burden on programmers

• Automatic hint generation is desired

– there is idle CPU time when a program blocks for I/O

– speculative execution can provide hints for future I/O accesses

• Approaches made

– a kernel thread performs the speculative execution

– this speculating thread shares the address space

• Issues

– run time overhead

– incorrectness

• may affect the correctness of the results

• incorrect hints may waste I/O bandwidth


Ensuring Program Correctness

• Software copy-on-write

– prevents code/data distortion

– for each new write to a memory region, make a copy

– insert code at every load/store to check whether it accesses a copied region

• software fault isolation

– code is inserted into a copy of the code (shadow code)

• original code is not changed, so no overhead for normal execution

• Generates no system calls

– system state is not changed by the speculative execution

• Signal handler

– catches all exceptions that may disturb normal execution


Generating Correct and Timely Hints

• Problems

– the speculating thread may lag behind, generating stale hints

– the speculating thread may stray from the execution path

• How to detect the problems

– the original thread checks the hint log prepared by the speculating thread

– if it is wrong, the original thread prepares a copy of its register set and sets a flag

– when the speculating thread is invoked, it checks the flag

• if set, it restarts using the saved register set