NVRAMOS 2019
Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs
Jinkyu Jeong, Sungkyunkwan University (SKKU)
Source: Gyusun Lee et al., "Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs," USENIX ATC '19
Storage Performance Trends
2
• Emerging ultra-low latency SSDs deliver I/Os in a few µs
Source: R. E. Bryant and D. R. O'Hallaron, Computer Systems: A Programmer's Perspective, Second Edition, Pearson Education, Inc., 2015
[Figure: access latency (ns) vs. year (1985–2017), log scale, for SRAM, DRAM, ULL-SSD (Samsung Z-SSD, Intel Optane SSD), SSD, and HDD]
Overhead of Kernel I/O Stack
3
• Low-latency SSDs expose the overhead of the kernel I/O stack
[Figure: normalized latency of read and write (+fsync) on SATA SSD, NVMe SSD, Z-SSD, and Optane SSD, broken down into device, kernel, and user portions]
Synchronous I/O vs. Asynchronous I/O
4
Our idea: apply the asynchronous I/O concept to the I/O stack itself
[Figure: CPU/device timelines. Synchronous I/O: computation A completes before device I/O B, so total latency is A plus B. Asynchronous I/O: A is split into A' and A''; A'' is independent of B and overlaps with it, shortening total latency (see the sketch below).]
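To make the timing argument concrete, here is a small user-space C illustration of the two patterns in the figure. The sleep durations are arbitrary stand-ins for A', A'', and B; this only illustrates the overlap, it is not AIOS code.

/* Illustration only: B runs on a "device" thread so that A'' can overlap it. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *device_io(void *arg)             /* B: simulated device I/O */
{
    (void)arg;
    usleep(7);                                /* ~7 µs, as in the talk   */
    return NULL;
}

static void cpu_work_needed_for_submit(void)  { usleep(2); }  /* A'  */
static void cpu_work_independent_of_io(void)  { usleep(3); }  /* A'' */

int main(void)
{
    pthread_t io;

    /* Synchronous pattern: total latency = A' + A'' + B */
    cpu_work_needed_for_submit();
    cpu_work_independent_of_io();
    pthread_create(&io, NULL, device_io, NULL);
    pthread_join(io, NULL);

    /* Asynchronous pattern: total latency = A' + max(A'', B),
     * because A'' runs while the I/O is in flight. */
    cpu_work_needed_for_submit();
    pthread_create(&io, NULL, device_io, NULL);
    cpu_work_independent_of_io();
    pthread_join(io, NULL);

    puts("done");
    return 0;
}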
Read Path Overview
5
[Figure: read path timelines. Vanilla read path: sys_read() performs I/O stack operations, then the device I/O, then further I/O stack operations before returning to user. Proposed read path: the I/O stack operations run asynchronously, overlapped with the device I/O, yielding a latency reduction.]
Write Path Overview
[Figure: vanilla write path. sys_write() performs a buffered write entirely on the CPU and returns to user without issuing any device I/O.]
Write Path Overview
7
[Figure: fsync path timelines. Vanilla fsync path: sys_fsync() alternates I/O stack operations with three device I/Os (data block, journal block, and flush & commit block), each waited on in turn before returning to user. Proposed fsync path: the I/O stack operations are overlapped with the device I/Os, yielding a latency reduction.]
Agenda
8
• Read path
  − Analysis of vanilla read path
  − Proposed read path
• Light-weight block I/O layer
• Write path
  − Analysis of vanilla write path
  − Proposed write path
• Evaluation
• Conclusion
Analysis of Vanilla Read Path
9
[Figure: timeline from sys_read() to return-to-user; total latency 12.82µs, of which the device I/O takes 7.26µs]
CPU-side costs, by layer (page cache, file system, block layer, device driver, interrupt handler):
• Page cache lookup: 0.30µs
• Page allocation: 0.19µs
• Page cache insertion: 0.33µs
• LBA retrieval: 0.09µs
• BIO submission: 0.72µs
• DMA mapping: 0.29µs
• NVMe I/O submission: 0.37µs
• Context switch: 0.95µs
• DMA unmapping: 0.23µs
• Request completion: 0.81µs
• Copy-to-user: 0.21µs
• Context switch: 0.95µs
Page Allocation / DMA Mapping
10
[Figure: in the vanilla read path, page allocation (0.19µs) and DMA mapping (0.29µs) are performed on the CPU before the I/O is submitted to the device (device I/O: 7.26µs, completed by interrupt).]
Asynchronous Page Allocation / DMA Mapping
11
• DMA-mapped page pool
[Figure: each core (core 0 … core N) keeps a pool of 64 pre-DMA-mapped 4KB pages; the read path takes pages from this pool instead of allocating and DMA-mapping on the critical path.]
Asynchronous Page Allocation / DMA Mapping
12
• DMA-mapped page pool (see the sketch below)
[Figure: with the pool, the on-path cost becomes a 0.016µs page-pool allocation, replacing the 0.19µs page allocation and 0.29µs DMA mapping; consumed pages are later refilled into the pool.]
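A per-core pool of pre-DMA-mapped pages can be sketched in C roughly as follows. The structure and function names (aios_page_pool, aios_page_pool_get) and the refill policy are assumptions for illustration, not the released AIOS code.

/* Sketch of a per-core pool of pre-DMA-mapped 4KB pages. Names are
 * assumptions; refill and error unwinding are omitted. */
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/dma-mapping.h>
#include <linux/percpu.h>
#include <linux/errno.h>

#define POOL_PAGES 64                        /* per-core pool size from the slide */

struct pooled_page {
    struct page *page;
    dma_addr_t   dma_addr;                   /* mapped once, reused across I/Os   */
};

struct aios_page_pool {
    struct pooled_page entries[POOL_PAGES];
    int                top;                  /* per-core stack, no lock needed    */
};

static DEFINE_PER_CPU(struct aios_page_pool, aios_page_pools);

/* Fill one core's pool up front: page allocation and DMA mapping are paid
 * here, not on the read path's critical path. */
static int aios_page_pool_init(struct device *dev, struct aios_page_pool *pool)
{
    int i;

    for (i = 0; i < POOL_PAGES; i++) {
        struct page *p = alloc_page(GFP_KERNEL);

        if (!p)
            return -ENOMEM;
        pool->entries[i].page = p;
        pool->entries[i].dma_addr =
            dma_map_page(dev, p, 0, PAGE_SIZE, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, pool->entries[i].dma_addr))
            return -EIO;
    }
    pool->top = POOL_PAGES;
    return 0;
}

/* On the read path: an O(1) pop replaces alloc_page() + dma_map_page(). */
static struct pooled_page *aios_page_pool_get(void)
{
    struct aios_page_pool *pool = this_cpu_ptr(&aios_page_pools);

    return pool->top > 0 ? &pool->entries[--pool->top] : NULL;
}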
Page Cache Insertion
13
[Figure: in the vanilla read path, page cache insertion (0.33µs) is performed on the CPU before the I/O is submitted (device I/O: 7.26µs).]
• Vanilla flow: on a cache lookup miss, the new page is first inserted into the page cache tree; only after the insertion succeeds is the I/O request made and the page read waited on. A concurrent lookup for the same file index then hits the already-inserted page.
• Inserting before the I/O prevents duplicated I/O requests for the same file index, but it puts the page cache lookup and page cache tree extension overheads on the critical path.
Lazy Page Cache Insertion
14
[Figure: with lazy insertion, the page cache insertion (0.35µs) overlaps with the device I/O (7.26µs) instead of preceding the I/O submission.]
• New flow: on a cache lookup miss, the I/O request is made immediately, and the cache insertion is attempted lazily while the I/O is in flight.
• If the lazy insertion fails because another thread already inserted a page for the same file index, the page is freed; this produces a duplicated I/O request, but it happens with extremely low frequency.
• The page cache lookup and tree extension overheads are thereby moved off the critical path (see the sketch below).
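A rough C sketch of the lazy-insertion flow follows. add_to_page_cache_lru() and find_get_page() are real kernel helpers; the aios_* names and the page pool are assumptions for illustration.

/* Sketch of the lazy-insertion read path (not the released AIOS code). */
#include <linux/pagemap.h>
#include <linux/gfp.h>

static struct page *aios_read_page(struct address_space *mapping, pgoff_t index)
{
    struct page *page = aios_page_pool_get_page();   /* pre-DMA-mapped page  */

    aios_submit_read(mapping, index, page);          /* device I/O is now in
                                                      * flight                */

    if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL) == 0) {
        /* Common case: the insertion finished while the device was busy. */
        aios_wait_read(page);
        return page;
    }

    /* Rare race (extremely low frequency): another thread inserted a page
     * for this index first, so our I/O turns out to be a duplicate. Drop
     * our copy and use the cached page instead. */
    aios_wait_read(page);
    aios_page_pool_put_page(page);
    return find_get_page(mapping, index);            /* caller still checks
                                                      * that it is up to date */
}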
DMA Unmapping
15
[Figure: in the vanilla read path, DMA unmapping (0.23µs) is performed on the completion side, after the device I/O (7.26µs) and its interrupt.]
Lazy DMA Unmapping
16
• Implementation
  − Delays DMA unmapping until the system is idle or is waiting for another I/O request (see the sketch below)
  − Extended version of the deferred protection scheme in Linux [ASPLOS'16]
  − Optionally disabled for safety
[Figure: the lazy DMA unmapping (0.35µs) is taken off the completion path of the 7.26µs device I/O.]
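One possible shape for the deferred unmapping, sketched with standard kernel list and DMA APIs. The function names and the per-CPU list are assumptions, and the safety aspects (IOMMU protection window, preemption, list initialization) are glossed over.

/* Sketch of lazy DMA unmapping: completed mappings are queued and only
 * unmapped when the CPU is idle or waiting for another I/O. */
#include <linux/list.h>
#include <linux/dma-mapping.h>
#include <linux/percpu.h>
#include <linux/slab.h>

struct deferred_unmap {
    struct list_head node;
    struct device   *dev;
    dma_addr_t       addr;
    size_t           len;
};

static DEFINE_PER_CPU(struct list_head, deferred_unmaps);

/* Called from the I/O completion path instead of dma_unmap_page(). */
static void aios_defer_dma_unmap(struct device *dev, dma_addr_t addr, size_t len)
{
    struct deferred_unmap *d = kmalloc(sizeof(*d), GFP_ATOMIC);

    if (!d) {                                /* fall back to immediate unmap */
        dma_unmap_page(dev, addr, len, DMA_FROM_DEVICE);
        return;
    }
    d->dev  = dev;
    d->addr = addr;
    d->len  = len;
    list_add(&d->node, this_cpu_ptr(&deferred_unmaps));
}

/* Called while the CPU is idle or waiting for another I/O to complete. */
static void aios_drain_deferred_unmaps(void)
{
    struct list_head *head = this_cpu_ptr(&deferred_unmaps);
    struct deferred_unmap *d, *tmp;

    list_for_each_entry_safe(d, tmp, head, node) {
        dma_unmap_page(d->dev, d->addr, d->len, DMA_FROM_DEVICE);
        list_del(&d->node);
        kfree(d);
    }
}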
Remaining Overheads in the Proposed Read Path
17
[Figure: after the asynchronous optimizations, the remaining on-path costs are BIO submission (0.72µs) and NVMe I/O submission (0.37µs) before the 7.26µs device I/O, and request completion (0.81µs) after the interrupt.]
Linux Multi-queue Block I/O Layer
19
• Structure conversion
  − Merge the bio with a pending request via I/O merging
  − Assign a new tag & request and convert from the bio
• Multi-queue structure
  − Software staging queue (SW queue): supports I/O scheduling and reordering
  − Hardware dispatch queue (HW queue): delivers the I/O request to the device driver
• Multiple dynamic memory allocations
  − bio (block layer)
  − NVMe iod, scatter/gather list, NVMe PRP* list (device driver)
[Figure: Linux multi-queue block layer. submit_bio() passes a bio (LBA, length, page(s)) through per-core SW queues and HW queues; the bio is converted to a tagged request (length, bio(s)), and the device driver builds an iod, sg_list, and prp_list to form the NVMe command placed on the NVMe queue pairs.]
*PRP: physical region page
Linux Multi-queue Block I/O Layer
20
• Structure conversion
  − Inefficiency of I/O merging [Zhang, OSDI'18]: merging is a feature mainly useful for low-performance storage devices
• Multi-queue structure
• Multiple dynamic memory allocations
[Figure: same multi-queue block layer diagram as the previous slide]
Linux Multi-queue Block I/O Layer
21
• Structure conversion
  − Inefficiency of I/O merging [Zhang, OSDI'18]: merging is a feature mainly useful for low-performance storage devices
• Multi-queue structure
  − Inefficiency of I/O scheduling for low-latency SSDs [Saxena, ATC'10] [Xu, SYSTOR'15]: the default configuration is the noop scheduler
  − Bypass the multi-queue structure [Zhang, OSDI'18]
  − Device-side I/O scheduling [Peter, OSDI'14] [Joshi, HotStorage'17]
• Multiple dynamic memory allocations
[Figure: same multi-queue block layer diagram as slide 19]
Light-weight Block I/O Layer
22
• Light-weight bio (lbio) structure
  − Contains only the essential arguments needed to make an NVMe I/O request
  − Eliminates unnecessary structure conversions and allocations
• Per-CPU lbio pool
  − Supports lockless lbio object allocation
  − Supports the tagging function (see the sketch below)
• Single dynamic memory allocation
  − NVMe PRP* list (device driver)
[Figure: light-weight block layer. submit_lbio() takes an lbio (LBA, length, prp_list, page(s), dma_addr(s)) from the per-CPU lbio pool (cores 0 … n), which also provides the tag; the device driver turns it directly into an NVMe command on the NVMe queue pairs, with pages drawn from the DMA-mapped page pool.]
*PRP: physical region page
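A minimal sketch of what an lbio and its per-CPU pool could look like. The field layout, pool size, and helper names are assumptions chosen to illustrate the lockless allocation and index-as-tag idea, not the published implementation.

/* Sketch of the lbio structure and its per-CPU pool (illustrative only). */
#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/bitops.h>
#include <linux/completion.h>
#include <linux/smp.h>

#define LBIO_POOL_SIZE  256            /* per-core; assumed value            */
#define LBIO_MAX_PAGES  32             /* enough for a 128KB request         */

struct lbio {
    sector_t          lba;
    unsigned int      length;
    struct page      *pages[LBIO_MAX_PAGES];
    dma_addr_t        dma_addrs[LBIO_MAX_PAGES];
    __le64           *prp_list;        /* the single remaining dynamic
                                        * allocation (NVMe PRP list)         */
    struct completion done;
};

struct lbio_pool {
    struct lbio   lbios[LBIO_POOL_SIZE];
    unsigned long used[BITS_TO_LONGS(LBIO_POOL_SIZE)];
};

static DEFINE_PER_CPU(struct lbio_pool, lbio_pools);

/* Lockless allocation: the (cpu, index) pair identifies the lbio and is
 * reused as the NVMe command tag, so no shared tag map is needed. */
static struct lbio *lbio_get(unsigned int *tag)
{
    struct lbio_pool *pool = get_cpu_ptr(&lbio_pools);
    int idx;

    do {
        idx = find_first_zero_bit(pool->used, LBIO_POOL_SIZE);
        if (idx >= LBIO_POOL_SIZE) {
            put_cpu_ptr(&lbio_pools);
            return NULL;               /* pool exhausted; caller falls back  */
        }
    } while (test_and_set_bit(idx, pool->used));

    *tag = smp_processor_id() * LBIO_POOL_SIZE + idx;
    put_cpu_ptr(&lbio_pools);
    return &pool->lbios[idx];
}

/* Completion side: the tag maps straight back to the owning pool slot. */
static void lbio_put(unsigned int tag)
{
    struct lbio_pool *pool = per_cpu_ptr(&lbio_pools, tag / LBIO_POOL_SIZE);

    clear_bit(tag % LBIO_POOL_SIZE, pool->used);
}

Because each core owns its own array, allocation takes no lock, and a completion handler can locate and release the lbio with simple arithmetic on the tag.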
Read Path Comparison
23
[Figure: proposed read path before vs. after applying the light-weight block I/O layer, around the same 7.26µs device I/O. Before: BIO submission 0.72µs plus NVMe I/O submission 0.37µs on the submit side and request completion 0.81µs on the interrupt side. After: LBIO submission 0.13µs and LBIO completion 0.65µs.]
Read Path Comparison
24
[Figure: end-to-end sys_read() latency. Vanilla read path: 12.82µs total; proposed read path: 10.10µs total, both including the same 7.26µs device I/O.]
Analysis of Vanilla Fsync Path (Ext4 Ordered Mode)
26
[Figure: sys_fsync() timeline across the calling CPU, the jbd2 thread, and the device. The device performs three serialized I/Os, submitted as data block, journal block, and flush & commit block, with per-I/O durations of 12.73µs, 12.57µs, and 10.72µs shown in the figure.]
CPU-side costs:
• Data writeback: 5.68µs
• jbd2 call: 0.80µs
• Journal block preparation: 5.55µs (allocating buffer pages, allocating journal area blocks, checksum computation, …)
• Commit block preparation: 2.15µs
Proposed Fsync Path (Ext4 Ordered Mode)
27
[Figure: first stage of the proposed sys_fsync() timeline. The data block I/O (11.37µs) is submitted, and jbd2 is woken before the caller waits for it, so journal block preparation on the jbd2 thread overlaps with the data block I/O.]
• Data writeback: 4.18µs (reduced by applying the light-weight block I/O layer)
• jbd2 call: 0.78µs (jbd2 is woken before the data block wait)
• Journal block preparation: 5.21µs
• Commit block preparation: 1.90µs
Proposed Fsync Path (Ext4 Ordered Mode)
28
[Figure: complete proposed sys_fsync() timeline. The data block and journal block I/Os are submitted and overlapped with the CPU-side work; after the data & journal block I/O wait, the flush & commit block is dispatched (0.04µs) and the commit block I/O completes before returning to user. The three device I/Os shown take 11.37µs, 10.61µs, and 10.44µs.]
• Data writeback: 4.18µs (reduced by applying the light-weight block I/O layer)
• jbd2 call: 0.78µs (jbd2 is woken before the data block wait)
• Journal block preparation: 5.21µs
• Commit block preparation: 1.90µs
A sketch of this reordered flow follows.
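The reordering can be summarized in a short sketch. Every helper below is a hypothetical wrapper around the steps named in the slides, not a real kernel or AIOS function.

/* Sketch of the reordered fsync flow in Ext4 ordered mode. */
#include <linux/fs.h>

int aios_fsync(struct file *file)
{
    struct inode *inode = file_inode(file);

    submit_data_writeback(inode);     /* data block I/O starts (lbio path)     */
    wake_up_jbd2_commit(inode);       /* jbd2 prepares journal blocks now,
                                       * overlapped with the data block I/O    */

    wait_for_data_writeback(inode);   /* data block I/O completes              */
    wait_for_journal_write(inode);    /* journal block I/O completes           */
    dispatch_flush_and_commit(inode); /* flush & commit block: 0.04µs dispatch */
    wait_for_commit(inode);           /* commit block I/O, then return to user */
    return 0;
}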
Fsync Path Comparison (Ext4 Ordered Mode)
29
[Figure: end-to-end sys_fsync() latency. Vanilla fsync path: 53.94µs total, with the data, journal, and flush & commit block I/Os serialized. Proposed fsync path: 38.03µs total, with the I/O stack work and journal preparation overlapped with the device I/Os.]
Experimental Setup
31
• Server: Dell R730
• OS: Ubuntu 16.04.4
• Base kernel: Linux 5.0.5
• CPU: Intel Xeon E5-2640v3, 2.6GHz, 8 cores
• Memory: 32GB DDR4
• Storage devices: Z-SSD (Samsung SZ985 800GB), Optane SSD (Intel Optane 905P 960GB)
• Workloads: synthetic micro-benchmark (FIO), real-world workload (RocksDB DBbench)
FIO Performance (Random Read)
32
[Figure, left: random read latency (µs) vs. block size, single thread, for Vanilla, AIOS, and AIOS-poll; up to 23% latency reduction, with a 7.6µs point annotated.]
[Figure, right: random read IOPS (k) vs. thread count (1–128), 4KB block size, for Vanilla and AIOS; up to 26% IOPS improvement.]
FIO Performance (Random Write+Fsync, Ext4 Ordered)
33
[Figure, left: random write+fsync latency (µs) vs. block size, single thread, for Vanilla and AIOS; up to 26% latency reduction.]
[Figure, right: random write+fsync IOPS (k) vs. thread count (1–16), 4KB block size, for Vanilla and AIOS; up to 27% IOPS improvement.]
RocksDB Performance
34
• DBbench readrandom: 64GB dataset
• DBbench fillsync: 16GB dataset
[Figure, left: readrandom throughput (k OP/s) vs. thread count (1–16) for Vanilla and AIOS; up to 27% performance improvement, with per-thread-count gains ranging from 10.7% to 27.2%.]
[Figure, right: fillsync throughput (k OP/s) vs. thread count (1–16) for Vanilla and AIOS; up to 44% performance improvement, with per-thread-count gains ranging from 23.6% to 44.3%.]
Conclusion
35
• Asynchronous I/O stack
  − Applies the asynchronous I/O concept to the kernel I/O stack itself
  − Overlaps computation with I/O to reduce total I/O latency
• Light-weight block I/O layer
  − Provides low-latency block I/O services for low-latency NVMe SSDs
• Performance evaluation
  − Achieves single-digit microsecond I/O latency on Optane SSD
  − Achieves significant latency reduction and performance improvement on real-world workloads
Source code: https://github.com/skkucsl/aios
Comparison with User-level Direct Access Approach
36
[Figure: random read latency (µs) vs. block size (4KB–128KB), with the kernel path broken down into user+device, copy-to-user, and kernel time, compared against user-level direct access (SPDK, user+device only).]
Q&A
• Thank you
37
Acknowledgment
38
• This talk is supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2017R1C1B2007273).
Extra Slides
Preloading Extent Tree
40
• Extent status tree (in-memory structure)
  − Contains the LBA mapping info for file blocks (composed of extent objects)
  − An extent lookup miss incurs an additional I/O request to read the mapping block
• Control & data plane approach [OSDI'14]
  − Preloads the entire mapping info into memory in the control plane (e.g., at file open), as sketched below
  − Selectively applied to files requiring low-latency access
[Figure: with the preloaded extent tree, LBA retrieval on the read path drops from 0.09µs to 0.07µs (device I/O: 7.26µs).]
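A sketch of the control-plane preloading step, with a hypothetical lookup_and_cache_extent() helper standing in for populating ext4's extent status tree; the real AIOS implementation may differ.

/* Sketch of extent-tree preloading at file open (illustrative only). */
#include <linux/fs.h>
#include <linux/kernel.h>

struct extent_span {
    unsigned long lblk;     /* first logical block of the extent */
    unsigned long len;      /* number of blocks it covers        */
};

/* Called once per latency-sensitive file, e.g. at open time. */
static void preload_extent_tree(struct inode *inode)
{
    unsigned long nr_blocks =
        DIV_ROUND_UP(i_size_read(inode), i_blocksize(inode));
    unsigned long lblk = 0;

    while (lblk < nr_blocks) {
        struct extent_span span;

        /* Resolve the extent covering lblk and cache it in memory, so the
         * data-plane read path never misses on an extent lookup. */
        if (lookup_and_cache_extent(inode, lblk, &span) < 0)
            break;
        lblk = span.lblk + span.len;
    }
}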
CDF of the Block I/O Submission Latency
41
• Block I/O submission latency: the time from allocating an object (bio or lbio) to dispatching an I/O command to the device
[Figure: CDFs of block I/O submission latency for 4KB and 32KB random reads]
Read Performance Analysis with Opt. Levels
42
[Figure: normalized read latency vs. block size (4KB–128KB) as optimizations are applied cumulatively: Vanilla, + Preload, + MQ-bypass, + LBIO, + Async-page, + Async-DMA.]
• Preload: preloading the extent tree
• MQ-bypass: bypassing the SW and HW queues
• LBIO: light-weight block I/O layer
• Async-page: asynchronous page allocation, lazy page cache insertion
• Async-DMA: asynchronous DMA mapping, lazy DMA unmapping
Write(+fsync) Performance Analysis with Opt. Levels
43
[Figure: normalized write(+fsync) latency vs. block size (4KB–128KB) for Vanilla, + LBIO, and + AIOS.]
• LBIO: light-weight block I/O layer, with synchronous page allocation / DMA mapping
• AIOS: proposed fsync path
CPU Usage Breakdown
44
• RocksDB DBbench readrandom / fillsync using 8 threads
[Figure: normalized CPU usage of Vanilla vs. AIOS for readrandom and fillsync, broken down into idle, I/O wait, kernel, and user time. AIOS shows effective overlapping between I/O and kernel operations, reduced I/O wait & kernel CPU time, and reduced overall runtime.]
Memory Overheads on Asynchronous I/O Stack
46
Object                              | Linux block layer | Light-weight block layer
Statically allocated objects (per-core):
  request pool                      | 412KB             | -
  lbio array                        | -                 | 192KB
  Free page pool                    | -                 | 256KB
128KB block request:
  (l)bio + (l)bio_vec               | 648B              | 704B
  request                           | 384B              | -
  iod                               | 974B              | -
  prp_list                          | 4096B             | 4096B
  Total                             | 6104B             | 4800B