
Strata: A Cross Media File System

1

Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson

Let’s build a fast server

2

Requirements

• Small updates (1 KB) dominate

• Dataset scales up to 10 TB

• Updates must be crash consistent

NoSQL store, Database, File server, Mail server …

Storage diversification

3

             Latency   $/GB
DRAM         100 ns    8.6
NVM (soon)   300 ns    4.0
SSD          10 us     0.25
HDD          10 ms     0.02

(Top of the table: better performance; bottom: higher capacity)

• NVM is byte-addressable: cache-line granularity IO
• SSD: large erasure blocks need to be written sequentially; random writes suffer a 5~6x slowdown due to GC [FAST '15]

A fast server on today's file system

4

[Diagram: Application → kernel file system (NOVA [FAST '16, SOSP '17]) → NVM]

• Small updates (1 KB) dominate
• Dataset scales up to 10 TB
• Updates must be crash consistent

Small, random IO is slow!

[Chart: 1 KB write, IO latency (us); 91% of the time is spent in kernel code rather than writing to the device]

NVM is so fast that the kernel is the bottleneck

A fast server on today's file system

5

[Diagram: Application → kernel file system → NVM]

• Small updates (1 KB) dominate
• Dataset scales up to 10 TB
• Updates must be crash consistent

Need huge capacity, but NVM alone is too expensive! ($40K for 10 TB)

For low-cost capacity with high performance, must leverage multiple device types

A fast server on today's file system

6

[Diagram: Application → kernel file system → block-level caching → NVM, SSD, HDD]

• Small updates (1 KB) dominate
• Dataset scales up to 10 TB
• Updates must be crash consistent

• Block-level caching manages data in blocks, but NVM is byte-addressable
• Extra level of indirection

[Chart: 1 KB write, IO latency (us), NOVA vs. block-level caching]

Block-level caching is too slow

A fast server on today's file system

7

• Small updates (1 KB) dominate
• Dataset scales up to 10 TB
• Updates must be crash consistent

[Chart: crash vulnerabilities (0-10) per application: SQLite, HDFS, ZooKeeper, LevelDB, HSQLDB, Mercurial, Git; data from Pillai et al., OSDI 2014]

Applications struggle for crash consistency
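To make the vulnerability counts concrete, the sketch below shows the carefully ordered POSIX update sequence (temporary file, fsync, rename, directory fsync) that applications are expected to follow on today's file systems; the file names are illustrative.

```c
/* Crash-consistent file update as applications must implement it on
 * today's file systems: write a temporary file, flush it, rename it
 * over the original, then flush the directory. Skipping or reordering
 * any fsync() opens one of the crash-vulnerability windows above. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int update_file(const char *path, const char *tmp,
                       const void *buf, size_t len)
{
    int fd = open(tmp, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) < 0) { close(fd); return -1; }  /* persist data first    */
    close(fd);
    if (rename(tmp, path) < 0) return -1;         /* atomic replacement    */
    int dfd = open(".", O_RDONLY);                /* persist the dir entry */
    if (dfd >= 0) { fsync(dfd); close(dfd); }
    return 0;
}

int main(void)
{
    const char *msg = "new contents\n";
    return update_file("data.db", "data.db.tmp", msg, strlen(msg)) ? 1 : 0;
}
```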

Problems in today’s file systems

8

• Kernel mediates every operation
  NVM is so fast that the kernel is the bottleneck

• Tied to a single type of device: NVM (soon), SSD, HDD
  For low-cost capacity with high performance, must leverage multiple device types

• Aggressive caching in DRAM, write to the device only when you must (fsync)
  Applications struggle for crash consistency

Strata: A Cross Media File System

9

Performance: especially small, random IO
• Fast user-level device access

Low-cost capacity: leverage NVM, SSD & HDD
• Transparent data migration across different storage media
• Efficiently handle device IO properties

Simplicity: intuitive crash consistency model
• In-order, synchronous IO
• No fsync() required

Strata: main design principle

10

LibFS: log operations to NVM at user-level
• Performance: kernel bypass, but private
• Simplicity: intuitive crash consistency

KernelFS: digest and migrate data in kernel
• Apply log operations to shared data
• Coordinate multi-process accesses

Outline

• LibFS: Log operations to NVM at user-level
  • Fast user-level access
  • In-order, synchronous IO

• KernelFS: Digest and migrate data in kernel
  • Asynchronous digest
  • Transparent data migration
  • Shared file access

• Evaluation

11

Log operations to NVM at user-level

12

[Diagram: unmodified application → POSIX API → Strata: LibFS → private operation log in NVM (kernel bypass); file operations (data & metadata): creat, write, rename, …]

• Fast writes
  • Directly access fast NVM
  • Sequentially append data
  • Cache-line granularity
  • Blind writes

• Crash consistency
  • On crash, kernel replays the log
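As a rough illustration of user-level logging, here is a minimal sketch of appending a write record to a private operation log with cache-line granularity persistence, assuming x86 flush instructions; the record layout, helper names, and the DRAM stand-in for NVM are all hypothetical, not Strata's actual format.

```c
/* Illustrative per-process operation-log append with cache-line
 * granularity persistence; a DRAM buffer stands in for the NVM log. */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

struct log_entry {        /* hypothetical record header */
    uint32_t opcode;      /* e.g. 1 = write */
    uint32_t inode;
    uint64_t offset;
    uint32_t len;
    uint32_t valid;       /* written last to commit the record */
};

static char   log_area[1 << 20];   /* stand-in for the NVM log region */
static size_t log_tail;

static void persist(const void *p, size_t n)
{
    for (size_t i = 0; i < n; i += CACHELINE)
        _mm_clflush((const char *)p + i);   /* flush each cache line */
    _mm_sfence();                           /* order the flushes */
}

static void log_write(uint32_t inode, uint64_t off, const void *buf, uint32_t len)
{
    struct log_entry *e = (struct log_entry *)(log_area + log_tail);
    char *payload = (char *)(e + 1);

    memcpy(payload, buf, len);              /* blind, sequential append */
    persist(payload, len);

    e->opcode = 1; e->inode = inode; e->offset = off; e->len = len;
    e->valid = 1;                           /* commit flag goes in last */
    persist(e, sizeof *e);

    size_t rec = sizeof *e + len;           /* advance to the next slot */
    log_tail += (rec + CACHELINE - 1) & ~(size_t)(CACHELINE - 1);
}

int main(void) { log_write(7, 0, "hello", 5); return 0; }
```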

Intuitive crash consistency

13

[Diagram: unmodified application → POSIX API → Strata: LibFS → private operation log in NVM (kernel bypass)]

Synchronous IO
• When each system call returns:
  • Data/metadata is durable
  • In-order update
  • Atomic write
  • Limited size (log size)
• fsync() is a no-op

Fast synchronous IO: NVM and kernel-bypass
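Given these semantics, the multi-step fsync/rename pattern shown earlier collapses to plain writes; a minimal sketch, assuming a hypothetical Strata mount point:

```c
/* Under Strata's synchronous, in-order, atomic write semantics, the
 * update is already durable when write() returns and fsync() is a
 * no-op. The mount point and file name are illustrative. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *msg = "new contents\n";
    int fd = open("/mnt/strata/data.db", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;
    write(fd, msg, strlen(msg));   /* durable and ordered on return */
    close(fd);                     /* no fsync() needed before this */
    return 0;
}
```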

Outline

• LibFS: Log operations to NVM at user-level
  • Fast user-level access
  • In-order, synchronous IO

• KernelFS: Digest and migrate data in kernel
  • Asynchronous digest
  • Transparent data migration
  • Shared file access

• Evaluation

14

Digest data in kernel

15

[Diagram: application → POSIX API → Strata: LibFS private operation log (NVM) → digest by Strata: KernelFS → NVM shared area]

• Visibility: make the private log visible to other applications
• Data layout: turn write-optimized format into read-optimized format (extent tree)
• Large, batched IO
• Coalesce log

Digest optimization: Log coalescing

16

SQLite, mail server: crash-consistent update using write-ahead logging

Private operation log (LibFS):
  Create journal file → Write data to journal file → Write data to database file → Delete journal file

Digest eliminates unneeded work by removing temporary durable writes.

NVM shared area (KernelFS) after digest:
  Write data to database file

Throughput optimization: log coalescing saves IO while digesting
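A minimal sketch of the coalescing idea, not Strata's code: within one digest batch, any file that is created and then deleted (such as the journal above) never needs to reach the shared area, so only the database write survives.

```c
/* Illustrative coalescing pass over one digest batch: writes to a file
 * that is created and deleted within the batch (e.g. the journal file)
 * are dropped before anything reaches the shared area. */
#include <stdio.h>

enum op { CREAT, WRITE, UNLINK };

struct entry { enum op op; int inode; int dead; };

static void coalesce(struct entry *log, int n)
{
    for (int i = 0; i < n; i++) {
        if (log[i].op != UNLINK) continue;
        /* Drop every earlier entry for an inode unlinked in this batch. */
        for (int j = 0; j < i; j++)
            if (log[j].inode == log[i].inode) log[j].dead = 1;
        log[i].dead = 1;
    }
}

int main(void)
{
    struct entry log[] = {          /* the update sequence on the slide   */
        { CREAT,  2, 0 },           /* create journal file     (inode 2)  */
        { WRITE,  2, 0 },           /* write data to journal              */
        { WRITE,  1, 0 },           /* write data to database  (inode 1)  */
        { UNLINK, 2, 0 },           /* delete journal file                */
    };
    coalesce(log, 4);
    for (int i = 0; i < 4; i++)     /* only the database write survives   */
        if (!log[i].dead) printf("digest entry %d (inode %d)\n", i, log[i].inode);
    return 0;
}
```

Running it prints only the surviving database write, mirroring the digest result on the slide.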

Digest and migrate data in kernel

17-18

[Diagram: application → Strata: LibFS private operation log (NVM) → digest → NVM shared area → SSD shared area → HDD shared area, managed by Strata: KernelFS]

• Low-cost capacity
  • KernelFS migrates cold data to lower layers

• Handle device IO properties
  • Migrate 1 GB blocks
  • Avoid SSD garbage collection overhead

Resembles a log-structured merge (LSM) tree
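A minimal sketch of the migration policy under stated assumptions (the extent bookkeeping and thresholds are invented for illustration): cold extents are grouped and pushed to the SSD area in large units so the device sees sequential IO.

```c
/* Illustrative migration sketch (not Strata's code): KernelFS groups
 * cold extents and pushes them to the SSD area in large sequential
 * units (1 GB on the slide) to avoid SSD garbage-collection overhead. */
#include <stdint.h>
#include <stdio.h>

#define MIGRATE_UNIT (1ULL << 30)   /* 1 GB migration granularity */

struct extent { uint64_t len; uint64_t last_access; int layer; /* 0=NVM, 1=SSD */ };

static void migrate_cold(struct extent *ext, int n, uint64_t now, uint64_t cold_age)
{
    uint64_t batch = 0;
    for (int i = 0; i < n && batch < MIGRATE_UNIT; i++) {
        if (ext[i].layer != 0 || now - ext[i].last_access < cold_age)
            continue;                       /* keep hot data in NVM       */
        ext[i].layer = 1;                   /* sequential append to SSD   */
        batch += ext[i].len;                /* accumulate one large unit  */
    }
    printf("migrated %llu MB to SSD\n", (unsigned long long)(batch >> 20));
}

int main(void)
{
    struct extent ext[] = {
        { 512ULL << 20, 10, 0 },    /* cold: migrated     */
        { 512ULL << 20, 95, 0 },    /* hot: stays in NVM  */
        { 512ULL << 20, 20, 0 },    /* cold: migrated     */
    };
    migrate_cold(ext, 3, 100, 50);
    return 0;
}
```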

Read: hierarchical search

19

[Diagram: application reads traverse the Strata: LibFS log and the Strata: KernelFS shared areas]

Search order:
  1. Private operation log (log data)
  2. NVM shared area (NVM data)
  3. SSD shared area (SSD data)
  4. HDD shared area (HDD data)
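A toy sketch of the hierarchical read path (the data structures are invented for illustration): the lookup walks the layers in search order and the first hit returns the newest copy.

```c
/* Illustrative read path: look up a block in the private log first,
 * then in the NVM, SSD, and HDD shared areas, in that order. */
#include <stdint.h>
#include <stdio.h>

enum layer { LOG_AREA, NVM_AREA, SSD_AREA, HDD_AREA, NUM_LAYERS };

struct block { int inode; uint64_t off; const char *data; };

/* one block per layer in this toy index; newer copies sit higher up */
static struct block layers[NUM_LAYERS] = {
    { 1, 0, "inode 1: not-yet-digested update in the private log" },
    { 1, 0, "inode 1: older copy in the NVM shared area" },
    { 2, 0, "inode 2: cold data migrated to the SSD shared area" },
    { 3, 0, "inode 3: coldest data in the HDD shared area" },
};

static const char *strata_read(int inode, uint64_t off)
{
    for (int l = LOG_AREA; l < NUM_LAYERS; l++)   /* search order 1..4 */
        if (layers[l].inode == inode && layers[l].off == off)
            return layers[l].data;                /* first hit wins */
    return NULL;   /* not found in any layer */
}

int main(void)
{
    puts(strata_read(1, 0));   /* served from the private log */
    puts(strata_read(2, 0));   /* served from the SSD shared area */
    return 0;
}
```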

Shared file access

20

• Leases grant access rights to applications [SOSP '89]
  • Required for files and directories
  • Function like a lock, but revocable
  • Exclusive writer, shared readers

• On revocation, LibFS digests leased data
  • Private data is made public before losing the lease

• Leases serialize concurrent updates
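A minimal sketch of the lease discipline described above (the API is hypothetical, not Strata's): one exclusive writer or many shared readers, with a revocation forcing the holder to digest its private log first.

```c
/* Illustrative lease state machine: one exclusive writer or many
 * shared readers per file; a conflicting request revokes the current
 * holder, which must digest its private log before giving the lease up. */
#include <stdio.h>

enum lease { FREE, SHARED_READ, EXCL_WRITE };

struct file_lease { enum lease state; int readers; };

static void revoke(struct file_lease *l, int inode)
{
    printf("inode %d: revoking lease, holder digests its private log\n", inode);
    l->state = FREE;
    l->readers = 0;
}

static void acquire_read(struct file_lease *l, int inode)
{
    if (l->state == EXCL_WRITE)          /* conflicts with the writer  */
        revoke(l, inode);
    l->state = SHARED_READ;
    l->readers++;                        /* readers share the lease    */
}

static void acquire_write(struct file_lease *l, int inode)
{
    if (l->state != FREE)                /* conflicts with any holder  */
        revoke(l, inode);
    l->state = EXCL_WRITE;               /* exclusive writer           */
}

int main(void)
{
    struct file_lease l = { FREE, 0 };
    acquire_write(&l, 1);                /* writer gets the lease      */
    acquire_read(&l, 1);                 /* reader forces a revocation */
    return 0;
}
```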

Outline

• LibFS: Log operations to NVM at user-level
  • Fast user-level access
  • In-order, synchronous IO

• KernelFS: Digest and migrate data in kernel
  • Asynchronous digest
  • Transparent data migration
  • Shared file access

• Evaluation

21

Experimental setup

• 2x Intel Xeon E5-2640 CPU, 64 GB DRAM
• 400 GB NVMe SSD, 1 TB HDD
• Ubuntu 16.04 LTS, Linux kernel 4.8.12
• Emulated NVM
  • Use 40 GB of DRAM
  • Performance model [Y. Zhang et al., MSST 2015]
  • Throttle latency & throughput in software

22
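A minimal sketch of the software throttling idea, assuming simple delay injection (the paper's emulator follows the cited MSST 2015 performance model; this only shows the basic mechanism):

```c
/* Emulating extra NVM write latency in software by padding each
 * DRAM-backed "NVM" operation with a busy-wait; the target latency
 * and padding value here are illustrative. */
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* spin until the emulated device latency has elapsed */
static void emulate_nvm_delay(uint64_t start_ns, uint64_t extra_ns)
{
    while (now_ns() - start_ns < extra_ns)
        ;   /* busy-wait: DRAM write + delay approximates NVM latency */
}

int main(void)
{
    uint64_t t = now_ns();
    /* ... perform the DRAM-backed write here ... */
    emulate_nvm_delay(t, 200);   /* e.g. pad ~100 ns DRAM up to ~300 ns NVM */
    return 0;
}
```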

Evaluation questions

23

• Latency:

• Does Strata efficiently support small, random writes?

• Does asynchronous digest have an impact on latency?

• Throughput:

• Strata writes data twice (logging and digesting). Can Strata sustain high throughput?

• How well does Strata perform when managing data across storage layers?

Related work

24

• NVM file systems
  • PMFS [EuroSys '14]: in-place update file system
  • NOVA [FAST '16]: log-structured file system
  • EXT4-DAX: NVM support for EXT4

• SSD file systems
  • F2FS [FAST '15]: log-structured file system

Microbenchmark: write latency

25

• Strata logs to NVM
• Compare to NVM kernel file systems: PMFS, NOVA, EXT4-DAX
• Strata, NOVA
  • In-order, synchronous IO
  • Atomic write
• PMFS, EXT4-DAX
  • No atomic write

[Chart: write latency (us) vs. IO size (128 B, 1 KB, 4 KB, 16 KB) for Strata, PMFS, NOVA, EXT4-DAX]

Avg.: 26% better; tail: 43% better

Latency: persistent RPC

26

• Foundation of most servers
• Persist RPC data before sending ACK to client
• RPC over RDMA
  • 40 Gb/s InfiniBand NIC
• For small IO (1 KB)
  • 25% slower than no-persist
  • 35% faster than PMFS, 7x faster than EXT4-DAX

[Chart: latency (us) vs. RPC size (1 KB, 4 KB, 64 KB) for Strata, PMFS, NOVA, EXT4-DAX, and no-persist]
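A minimal sketch of the persist-before-ACK pattern (the handler and send routine are hypothetical): the server only acknowledges after the payload is durable, which on Strata is already true when write() returns.

```c
/* Persist-before-ACK handler sketch: the RPC payload must be durable
 * before the server acknowledges. On Strata the write() is durable on
 * return; on other file systems an explicit fsync() would be required. */
#include <fcntl.h>
#include <unistd.h>

/* stand-in for the RDMA/network send in a real server */
static void send_ack(int client) { (void)client; }

static void handle_rpc(int log_fd, int client, const void *payload, size_t len)
{
    write(log_fd, payload, len);   /* durable on return under Strata        */
    /* fsync(log_fd); */           /* needed on PMFS/EXT4-DAX, a no-op here  */
    send_ack(client);              /* only ACK once the data is durable      */
}

int main(void)
{
    int fd = open("rpc.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0) return 1;
    handle_rpc(fd, 0, "request-1", 9);
    close(fd);
    return 0;
}
```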

Latency: LevelDB

27

• LevelDB on NVM
  • Key size: 16 B
  • Value size: 1 KB
  • 300,000 objects
  • Workload causes asynchronous digests

• Fast user-level logging
  • Random write: 25% better than PMFS
  • Random read: tied with PMFS

[Chart: latency (us) for write sync., write seq., write rand., overwrite, and read rand. on Strata, PMFS, NOVA, EXT4-DAX]

IO latency is not impacted by asynchronous digest

Evaluation questions

28

• Latency:

• Does Strata efficiently support small, random writes?

• Does asynchronous digest have an impact on latency?

• Throughput:

• Strata writes data twice (logging and digesting). Can Strata sustain high throughput?

• How well does Strata perform when managing data across storage layers?

Throughput: Varmail

29

Mail server workload from Filebench
• Using only NVM
• 10,000 files
• Read/write ratio is 1:1
• Write-ahead logging: create journal file → write data to journal → write data to database file → delete journal file

Log coalescing (digest eliminates unneeded work by removing temporary durable writes): only the write to the database file remains.

Log coalescing eliminates 86% of log entries, saving 14 GB of IO

[Chart: throughput (op/s) for Strata, PMFS, NOVA, EXT4-DAX; Strata is 29% better]

No kernel file system has both low latency and high throughput:
• PMFS: better latency
• NOVA: better throughput

Strata achieves both low latency and high throughput

Throughput: data migration

30

File server workload from Filebench
• Working set starts in NVM, grows to SSD and HDD
• Read/write ratio is 1:2

Compared systems:
• User-level migration: LRU at whole-file granularity, treating each file system as a black box (NVM: NOVA, SSD: F2FS, HDD: EXT4)
• Block-level caching: Linux LVM cache, formatted with F2FS

[Chart: average throughput (ops/s) for Strata, user-level migration, and block-level caching]

• 2x faster than block-level caching
• 22% faster than user-level migration
• Cross-layer optimization: placing hot metadata in faster layers

Conclusion

Source code is available at https://github.com/ut-osa/strata

31

Server applications need fast, small random IO on vast datasets with intuitive crash consistency

Strata, a cross-media file system, addresses these concerns:

Performance: low latency, high throughput
• Novel split of LibFS and KernelFS
• Fast user-level access

Low-cost capacity: leverage NVM, SSD & HDD
• Asynchronous digest
• Transparent data migration with large, sequential IO

Simplicity: intuitive crash consistency model
• In-order, synchronous IO

Backup

32

Device management overhead

33

For example, SSD random writes:

[Chart: SSD throughput (MB/s) vs. SSD utilization (0.1 to 1.0) for working set sizes of 64 MB to 1024 MB]

• 5~6x throughput difference caused by hardware GC
• Sequential writes avoid management overhead
• SSD and HDD prefer large, sequential IO