NVRAMOS 2019
Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs
Jinkyu Jeong, Sungkyunkwan University (SKKU)
Source: Gyusun Lee et al., "Asynchronous I/O Stack: A Low-latency Kernel I/O Stack for Ultra-Low Latency SSDs," USENIX ATC '19
Storage Performance Trends
2
• Emerging ultra-low latency SSDs deliver I/Os in a few µs
Source: R. E. Bryant and D. R. O'Hallaron, Computer Systems: A Programmer's Perspective, Second Edition, Pearson Education, Inc., 2015
[Figure: access latency (ns) vs. year (1985–2017), log scale, for SRAM, DRAM, ULL-SSD (Samsung Z-SSD, Intel Optane SSD), SSD, and HDD]
Overhead of Kernel I/O Stack
3
• Low-latency SSDs expose the overhead of the kernel I/O stack
[Figure: normalized latency of read and write (+fsync) on SATA SSD, NVMe SSD, Z-SSD, and Optane SSD, broken down into device, kernel, and user portions]
Synchronous I/O vs. Asynchronous I/O
4
Our idea: apply the asynchronous I/O concept to the I/O stack itself
[Figure: CPU/device timelines. Synchronous I/O: computation A completes before device I/O B, so total latency is A plus B. Asynchronous I/O: A is split into A' and A''; A'' is independent of B and overlaps with it, shortening total latency (see the sketch below).]
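To make the timing argument concrete, here is a small user-space C illustration of the two patterns in the figure. The sleep durations are arbitrary stand-ins for A', A'', and B; this only illustrates the overlap, it is not AIOS code.

/* Illustration only: B runs on a "device" thread so that A'' can overlap it. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *device_io(void *arg)             /* B: simulated device I/O */
{
    (void)arg;
    usleep(7);                                /* ~7 µs, as in the talk   */
    return NULL;
}

static void cpu_work_needed_for_submit(void)  { usleep(2); }  /* A'  */
static void cpu_work_independent_of_io(void)  { usleep(3); }  /* A'' */

int main(void)
{
    pthread_t io;

    /* Synchronous pattern: total latency = A' + A'' + B */
    cpu_work_needed_for_submit();
    cpu_work_independent_of_io();
    pthread_create(&io, NULL, device_io, NULL);
    pthread_join(io, NULL);

    /* Asynchronous pattern: total latency = A' + max(A'', B),
     * because A'' runs while the I/O is in flight. */
    cpu_work_needed_for_submit();
    pthread_create(&io, NULL, device_io, NULL);
    cpu_work_independent_of_io();
    pthread_join(io, NULL);

    puts("done");
    return 0;
}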
Read Path Overview
5
[Figure: read path timelines. Vanilla read path: sys_read() performs I/O stack operations, then the device I/O, then further I/O stack operations before returning to user. Proposed read path: the I/O stack operations run asynchronously, overlapped with the device I/O, yielding a latency reduction.]
Write Path Overview
[Figure: vanilla write path. sys_write() performs a buffered write entirely on the CPU and returns to user without issuing any device I/O.]
Write Path Overview
7
[Figure: fsync path timelines. Vanilla fsync path: sys_fsync() alternates I/O stack operations with three device I/Os (data block, journal block, and flush & commit block), each waited on in turn before returning to user. Proposed fsync path: the I/O stack operations are overlapped with the device I/Os, yielding a latency reduction.]
Agenda
8
• Read path
  − Analysis of vanilla read path
  − Proposed read path
• Light-weight block I/O layer
• Write path
  − Analysis of vanilla write path
  − Proposed write path
• Evaluation
• Conclusion
Analysis of Vanilla Read Path
9
[Figure: timeline from sys_read() to return-to-user; total latency 12.82µs, of which the device I/O takes 7.26µs]
CPU-side costs, by layer (page cache, file system, block layer, device driver, interrupt handler):
• Page cache lookup: 0.30µs
• Page allocation: 0.19µs
• Page cache insertion: 0.33µs
• LBA retrieval: 0.09µs
• BIO submission: 0.72µs
• DMA mapping: 0.29µs
• NVMe I/O submission: 0.37µs
• Context switch: 0.95µs
• DMA unmapping: 0.23µs
• Request completion: 0.81µs
• Copy-to-user: 0.21µs
• Context switch: 0.95µs
Page Allocation / DMA Mapping
10
[Figure: in the vanilla read path, page allocation (0.19µs) and DMA mapping (0.29µs) are performed on the CPU before the I/O is submitted to the device (device I/O: 7.26µs, completed by interrupt).]
Asynchronous Page Allocation / DMA Mapping
11
• DMA-mapped page pool
[Figure: each core (core 0 … core N) keeps a pool of 64 pre-DMA-mapped 4KB pages; the read path takes pages from this pool instead of allocating and DMA-mapping on the critical path.]
Asynchronous Page Allocation / DMA Mapping
12
• DMA-mapped page pool (see the sketch below)
[Figure: with the pool, the on-path cost becomes a 0.016µs page-pool allocation, replacing the 0.19µs page allocation and 0.29µs DMA mapping; consumed pages are later refilled into the pool.]
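A per-core pool of pre-DMA-mapped pages can be sketched in C roughly as follows. The structure and function names (aios_page_pool, aios_page_pool_get) and the refill policy are assumptions for illustration, not the released AIOS code.

/* Sketch of a per-core pool of pre-DMA-mapped 4KB pages. Names are
 * assumptions; refill and error unwinding are omitted. */
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/dma-mapping.h>
#include <linux/percpu.h>
#include <linux/errno.h>

#define POOL_PAGES 64                        /* per-core pool size from the slide */

struct pooled_page {
    struct page *page;
    dma_addr_t   dma_addr;                   /* mapped once, reused across I/Os   */
};

struct aios_page_pool {
    struct pooled_page entries[POOL_PAGES];
    int                top;                  /* per-core stack, no lock needed    */
};

static DEFINE_PER_CPU(struct aios_page_pool, aios_page_pools);

/* Fill one core's pool up front: page allocation and DMA mapping are paid
 * here, not on the read path's critical path. */
static int aios_page_pool_init(struct device *dev, struct aios_page_pool *pool)
{
    int i;

    for (i = 0; i < POOL_PAGES; i++) {
        struct page *p = alloc_page(GFP_KERNEL);

        if (!p)
            return -ENOMEM;
        pool->entries[i].page = p;
        pool->entries[i].dma_addr =
            dma_map_page(dev, p, 0, PAGE_SIZE, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, pool->entries[i].dma_addr))
            return -EIO;
    }
    pool->top = POOL_PAGES;
    return 0;
}

/* On the read path: an O(1) pop replaces alloc_page() + dma_map_page(). */
static struct pooled_page *aios_page_pool_get(void)
{
    struct aios_page_pool *pool = this_cpu_ptr(&aios_page_pools);

    return pool->top > 0 ? &pool->entries[--pool->top] : NULL;
}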
Page Cache Insertion
13
[Figure: in the vanilla read path, page cache insertion (0.33µs) is performed on the CPU before the I/O is submitted (device I/O: 7.26µs).]
• Vanilla flow: on a cache lookup miss, the new page is first inserted into the page cache tree; only after the insertion succeeds is the I/O request made and the page read waited on. A concurrent lookup for the same file index then hits the already-inserted page.
• Inserting before the I/O prevents duplicated I/O requests for the same file index, but it puts the page cache lookup and page cache tree extension overheads on the critical path.
Lazy Page Cache Insertion
14
[Figure: with lazy insertion, the page cache insertion (0.35µs) overlaps with the device I/O (7.26µs) instead of preceding the I/O submission.]
• New flow: on a cache lookup miss, the I/O request is made immediately, and the cache insertion is attempted lazily while the I/O is in flight.
• If the lazy insertion fails because another thread already inserted a page for the same file index, the page is freed; this produces a duplicated I/O request, but it happens with extremely low frequency.
• The page cache lookup and tree extension overheads are thereby moved off the critical path (see the sketch below).
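A rough C sketch of the lazy-insertion flow follows. add_to_page_cache_lru() and find_get_page() are real kernel helpers; the aios_* names and the page pool are assumptions for illustration.

/* Sketch of the lazy-insertion read path (not the released AIOS code). */
#include <linux/pagemap.h>
#include <linux/gfp.h>

static struct page *aios_read_page(struct address_space *mapping, pgoff_t index)
{
    struct page *page = aios_page_pool_get_page();   /* pre-DMA-mapped page  */

    aios_submit_read(mapping, index, page);          /* device I/O is now in
                                                      * flight                */

    if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL) == 0) {
        /* Common case: the insertion finished while the device was busy. */
        aios_wait_read(page);
        return page;
    }

    /* Rare race (extremely low frequency): another thread inserted a page
     * for this index first, so our I/O turns out to be a duplicate. Drop
     * our copy and use the cached page instead. */
    aios_wait_read(page);
    aios_page_pool_put_page(page);
    return find_get_page(mapping, index);            /* caller still checks
                                                      * that it is up to date */
}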
DMA Unmapping
15
[Figure: in the vanilla read path, DMA unmapping (0.23µs) is performed on the completion side, after the device I/O (7.26µs) and its interrupt.]
Lazy DMA Unmapping
16
• Implementation
  − Delays DMA unmapping until the system is idle or is waiting for another I/O request (see the sketch below)
  − Extended version of the deferred protection scheme in Linux [ASPLOS'16]
  − Optionally disabled for safety
[Figure: the lazy DMA unmapping (0.35µs) is taken off the completion path of the 7.26µs device I/O.]
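One possible shape for the deferred unmapping, sketched with standard kernel list and DMA APIs. The function names and the per-CPU list are assumptions, and the safety aspects (IOMMU protection window, preemption, list initialization) are glossed over.

/* Sketch of lazy DMA unmapping: completed mappings are queued and only
 * unmapped when the CPU is idle or waiting for another I/O. */
#include <linux/list.h>
#include <linux/dma-mapping.h>
#include <linux/percpu.h>
#include <linux/slab.h>

struct deferred_unmap {
    struct list_head node;
    struct device   *dev;
    dma_addr_t       addr;
    size_t           len;
};

static DEFINE_PER_CPU(struct list_head, deferred_unmaps);

/* Called from the I/O completion path instead of dma_unmap_page(). */
static void aios_defer_dma_unmap(struct device *dev, dma_addr_t addr, size_t len)
{
    struct deferred_unmap *d = kmalloc(sizeof(*d), GFP_ATOMIC);

    if (!d) {                                /* fall back to immediate unmap */
        dma_unmap_page(dev, addr, len, DMA_FROM_DEVICE);
        return;
    }
    d->dev  = dev;
    d->addr = addr;
    d->len  = len;
    list_add(&d->node, this_cpu_ptr(&deferred_unmaps));
}

/* Called while the CPU is idle or waiting for another I/O to complete. */
static void aios_drain_deferred_unmaps(void)
{
    struct list_head *head = this_cpu_ptr(&deferred_unmaps);
    struct deferred_unmap *d, *tmp;

    list_for_each_entry_safe(d, tmp, head, node) {
        dma_unmap_page(d->dev, d->addr, d->len, DMA_FROM_DEVICE);
        list_del(&d->node);
        kfree(d);
    }
}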
Remaining Overheads in the Proposed Read Path
17
[Figure: after the asynchronous optimizations, the remaining on-path costs are BIO submission (0.72µs) and NVMe I/O submission (0.37µs) before the 7.26µs device I/O, and request completion (0.81µs) after the interrupt.]
Linux Multi-queue Block I/O Layer
19
• Structure conversion
  − Merge the bio with a pending request via I/O merging
  − Assign a new tag & request and convert from the bio
• Multi-queue structure
  − Software staging queue (SW queue): supports I/O scheduling and reordering
  − Hardware dispatch queue (HW queue): delivers the I/O request to the device driver
• Multiple dynamic memory allocations
  − bio (block layer)
  − NVMe iod, scatter/gather list, NVMe PRP* list (device driver)
[Figure: Linux multi-queue block layer. submit_bio() passes a bio (LBA, length, page(s)) through per-core SW queues and HW queues; the bio is converted to a tagged request (length, bio(s)), and the device driver builds an iod, sg_list, and prp_list to form the NVMe command placed on the NVMe queue pairs.]
*PRP: physical region page
Linux Multi-queue Block I/O Layer
20
• Structure conversion
  − Inefficiency of I/O merging [Zhang, OSDI'18]: merging is a feature mainly useful for low-performance storage devices
• Multi-queue structure
• Multiple dynamic memory allocations
[Figure: same multi-queue block layer diagram as the previous slide]
Linux Multi-queue Block I/O Layer
21
• Structure conversion
  − Inefficiency of I/O merging [Zhang, OSDI'18]: merging is a feature mainly useful for low-performance storage devices
• Multi-queue structure
  − Inefficiency of I/O scheduling for low-latency SSDs [Saxena, ATC'10] [Xu, SYSTOR'15]: the default configuration is the noop scheduler
  − Bypass the multi-queue structure [Zhang, OSDI'18]
  − Device-side I/O scheduling [Peter, OSDI'14] [Joshi, HotStorage'17]
• Multiple dynamic memory allocations
[Figure: same multi-queue block layer diagram as slide 19]
Light-weight Block I/O Layer
22
• Light-weight bio (lbio) structure
  − Contains only the essential arguments needed to make an NVMe I/O request
  − Eliminates unnecessary structure conversions and allocations
• Per-CPU lbio pool
  − Supports lockless lbio object allocation
  − Supports the tagging function (see the sketch below)
• Single dynamic memory allocation
  − NVMe PRP* list (device driver)
[Figure: light-weight block layer. submit_lbio() takes an lbio (LBA, length, prp_list, page(s), dma_addr(s)) from the per-CPU lbio pool (cores 0 … n), which also provides the tag; the device driver turns it directly into an NVMe command on the NVMe queue pairs, with pages drawn from the DMA-mapped page pool.]
*PRP: physical region page
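A minimal sketch of what an lbio and its per-CPU pool could look like. The field layout, pool size, and helper names are assumptions chosen to illustrate the lockless allocation and index-as-tag idea, not the published implementation.

/* Sketch of the lbio structure and its per-CPU pool (illustrative only). */
#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/bitops.h>
#include <linux/completion.h>
#include <linux/smp.h>

#define LBIO_POOL_SIZE  256            /* per-core; assumed value            */
#define LBIO_MAX_PAGES  32             /* enough for a 128KB request         */

struct lbio {
    sector_t          lba;
    unsigned int      length;
    struct page      *pages[LBIO_MAX_PAGES];
    dma_addr_t        dma_addrs[LBIO_MAX_PAGES];
    __le64           *prp_list;        /* the single remaining dynamic
                                        * allocation (NVMe PRP list)         */
    struct completion done;
};

struct lbio_pool {
    struct lbio   lbios[LBIO_POOL_SIZE];
    unsigned long used[BITS_TO_LONGS(LBIO_POOL_SIZE)];
};

static DEFINE_PER_CPU(struct lbio_pool, lbio_pools);

/* Lockless allocation: the (cpu, index) pair identifies the lbio and is
 * reused as the NVMe command tag, so no shared tag map is needed. */
static struct lbio *lbio_get(unsigned int *tag)
{
    struct lbio_pool *pool = get_cpu_ptr(&lbio_pools);
    int idx;

    do {
        idx = find_first_zero_bit(pool->used, LBIO_POOL_SIZE);
        if (idx >= LBIO_POOL_SIZE) {
            put_cpu_ptr(&lbio_pools);
            return NULL;               /* pool exhausted; caller falls back  */
        }
    } while (test_and_set_bit(idx, pool->used));

    *tag = smp_processor_id() * LBIO_POOL_SIZE + idx;
    put_cpu_ptr(&lbio_pools);
    return &pool->lbios[idx];
}

/* Completion side: the tag maps straight back to the owning pool slot. */
static void lbio_put(unsigned int tag)
{
    struct lbio_pool *pool = per_cpu_ptr(&lbio_pools, tag / LBIO_POOL_SIZE);

    clear_bit(tag % LBIO_POOL_SIZE, pool->used);
}

Because each core owns its own array, allocation takes no lock, and a completion handler can locate and release the lbio with simple arithmetic on the tag.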
Read Path Comparison
23
[Figure: proposed read path before vs. after applying the light-weight block I/O layer, around the same 7.26µs device I/O. Before: BIO submission 0.72µs plus NVMe I/O submission 0.37µs on the submit side and request completion 0.81µs on the interrupt side. After: LBIO submission 0.13µs and LBIO completion 0.65µs.]
Read Path Comparison
24
[Figure: end-to-end sys_read() latency. Vanilla read path: 12.82µs total; proposed read path: 10.10µs total, both including the same 7.26µs device I/O.]
Analysis of Vanilla Fsync Path (Ext4 Ordered Mode)
26
[Figure: sys_fsync() timeline across the calling CPU, the jbd2 thread, and the device. The device performs three serialized I/Os, submitted as data block, journal block, and flush & commit block, with per-I/O durations of 12.73µs, 12.57µs, and 10.72µs shown in the figure.]
CPU-side costs:
• Data writeback: 5.68µs
• jbd2 call: 0.80µs
• Journal block preparation: 5.55µs (allocating buffer pages, allocating journal area blocks, checksum computation, …)
• Commit block preparation: 2.15µs
Proposed Fsync Path (Ext4 Ordered Mode)
27
[Figure: first stage of the proposed sys_fsync() timeline. The data block I/O (11.37µs) is submitted, and jbd2 is woken before the caller waits for it, so journal block preparation on the jbd2 thread overlaps with the data block I/O.]
• Data writeback: 4.18µs (reduced by applying the light-weight block I/O layer)
• jbd2 call: 0.78µs (jbd2 is woken before the data block wait)
• Journal block preparation: 5.21µs
• Commit block preparation: 1.90µs
Proposed Fsync Path (Ext4 Ordered Mode)
28
[Figure: complete proposed sys_fsync() timeline. The data block and journal block I/Os are submitted and overlapped with the CPU-side work; after the data & journal block I/O wait, the flush & commit block is dispatched (0.04µs) and the commit block I/O completes before returning to user. The three device I/Os shown take 11.37µs, 10.61µs, and 10.44µs.]
• Data writeback: 4.18µs (reduced by applying the light-weight block I/O layer)
• jbd2 call: 0.78µs (jbd2 is woken before the data block wait)
• Journal block preparation: 5.21µs
• Commit block preparation: 1.90µs
A sketch of this reordered flow follows.
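The reordering can be summarized in a short sketch. Every helper below is a hypothetical wrapper around the steps named in the slides, not a real kernel or AIOS function.

/* Sketch of the reordered fsync flow in Ext4 ordered mode. */
#include <linux/fs.h>

int aios_fsync(struct file *file)
{
    struct inode *inode = file_inode(file);

    submit_data_writeback(inode);     /* data block I/O starts (lbio path)     */
    wake_up_jbd2_commit(inode);       /* jbd2 prepares journal blocks now,
                                       * overlapped with the data block I/O    */

    wait_for_data_writeback(inode);   /* data block I/O completes              */
    wait_for_journal_write(inode);    /* journal block I/O completes           */
    dispatch_flush_and_commit(inode); /* flush & commit block: 0.04µs dispatch */
    wait_for_commit(inode);           /* commit block I/O, then return to user */
    return 0;
}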
Fsync Path Comparison (Ext4 Ordered Mode)
29
[Figure: end-to-end sys_fsync() latency. Vanilla fsync path: 53.94µs total, with the data, journal, and flush & commit block I/Os serialized. Proposed fsync path: 38.03µs total, with the I/O stack work and journal preparation overlapped with the device I/Os.]
Experimental Setup
31
• Server: Dell R730
• OS: Ubuntu 16.04.4
• Base kernel: Linux 5.0.5
• CPU: Intel Xeon E5-2640v3, 2.6GHz, 8 cores
• Memory: 32GB DDR4
• Storage devices: Z-SSD (Samsung SZ985 800GB), Optane SSD (Intel Optane 905P 960GB)
• Workloads: synthetic micro-benchmark (FIO), real-world workload (RocksDB DBbench)
FIO Performance (Random Read)
32
[Figure, left: random read latency (µs) vs. block size, single thread, for Vanilla, AIOS, and AIOS-poll; up to 23% latency reduction, with a 7.6µs point annotated.]
[Figure, right: random read IOPS (k) vs. thread count (1–128), 4KB block size, for Vanilla and AIOS; up to 26% IOPS improvement.]
FIO Performance (Random Write+Fsync, Ext4 Ordered)
33
[Figure, left: random write+fsync latency (µs) vs. block size, single thread, for Vanilla and AIOS; up to 26% latency reduction.]
[Figure, right: random write+fsync IOPS (k) vs. thread count (1–16), 4KB block size, for Vanilla and AIOS; up to 27% IOPS improvement.]
RocksDB Performance
34
• DBbench readrandom: 64GB dataset
• DBbench fillsync: 16GB dataset
[Figure, left: readrandom throughput (k OP/s) vs. thread count (1–16) for Vanilla and AIOS; up to 27% performance improvement, with per-thread-count gains ranging from 10.7% to 27.2%.]
[Figure, right: fillsync throughput (k OP/s) vs. thread count (1–16) for Vanilla and AIOS; up to 44% performance improvement, with per-thread-count gains ranging from 23.6% to 44.3%.]
Conclusion
35
• Asynchronous I/O stack
  − Applies the asynchronous I/O concept to the kernel I/O stack itself
  − Overlaps computation with I/O to reduce total I/O latency
• Light-weight block I/O layer
  − Provides low-latency block I/O services for low-latency NVMe SSDs
• Performance evaluation
  − Achieves single-digit microsecond I/O latency on Optane SSD
  − Achieves significant latency reduction and performance improvement on real-world workloads
Source code: https://github.com/skkucsl/aios
Comparison with User-level Direct Access Approach
36
[Figure: random read latency (µs) vs. block size (4KB–128KB), with the kernel path broken down into user+device, copy-to-user, and kernel time, compared against user-level direct access (SPDK, user+device only).]
Q&A
• Thank you
37
Acknowledgment
38
• This talk is supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2017R1C1B2007273).
Extra Slides
Preloading Extent Tree
40
• Extent status tree (in-memory structure)
  − Contains the LBA mapping info for file blocks (composed of extent objects)
  − An extent lookup miss incurs an additional I/O request to read the mapping block
• Control & data plane approach [OSDI'14]
  − Preloads the entire mapping info into memory in the control plane (e.g., at file open), as sketched below
  − Selectively applied to files requiring low-latency access
[Figure: with the preloaded extent tree, LBA retrieval on the read path drops from 0.09µs to 0.07µs (device I/O: 7.26µs).]
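A sketch of the control-plane preloading step, with a hypothetical lookup_and_cache_extent() helper standing in for populating ext4's extent status tree; the real AIOS implementation may differ.

/* Sketch of extent-tree preloading at file open (illustrative only). */
#include <linux/fs.h>
#include <linux/kernel.h>

struct extent_span {
    unsigned long lblk;     /* first logical block of the extent */
    unsigned long len;      /* number of blocks it covers        */
};

/* Called once per latency-sensitive file, e.g. at open time. */
static void preload_extent_tree(struct inode *inode)
{
    unsigned long nr_blocks =
        DIV_ROUND_UP(i_size_read(inode), i_blocksize(inode));
    unsigned long lblk = 0;

    while (lblk < nr_blocks) {
        struct extent_span span;

        /* Resolve the extent covering lblk and cache it in memory, so the
         * data-plane read path never misses on an extent lookup. */
        if (lookup_and_cache_extent(inode, lblk, &span) < 0)
            break;
        lblk = span.lblk + span.len;
    }
}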
CDF of the Block I/O Submission Latency
41
• Block I/O submission latency: the time from allocating an object (bio or lbio) to dispatching an I/O command to the device
[Figure: CDFs of block I/O submission latency for 4KB and 32KB random reads]
Read Performance Analysis with Opt. Levels
42
[Figure: normalized read latency vs. block size (4KB–128KB) as optimizations are applied cumulatively: Vanilla, + Preload, + MQ-bypass, + LBIO, + Async-page, + Async-DMA.]
• Preload: preloading the extent tree
• MQ-bypass: bypassing the SW and HW queues
• LBIO: light-weight block I/O layer
• Async-page: asynchronous page allocation, lazy page cache insertion
• Async-DMA: asynchronous DMA mapping, lazy DMA unmapping
Write(+fsync) Performance Analysis with Opt. Levels
43
[Figure: normalized write(+fsync) latency vs. block size (4KB–128KB) for Vanilla, + LBIO, and + AIOS.]
• LBIO: light-weight block I/O layer, with synchronous page allocation / DMA mapping
• AIOS: proposed fsync path
CPU Usage Breakdown
44
• RocksDB DBbench readrandom / fillsync using 8 threads
[Figure: normalized CPU usage of Vanilla vs. AIOS for readrandom and fillsync, broken down into idle, I/O wait, kernel, and user time. AIOS shows effective overlapping between I/O and kernel operations, reduced I/O wait & kernel CPU time, and reduced overall runtime.]
Memory Overheads on Asynchronous I/O Stack
46
Object                              | Linux block layer | Light-weight block layer
Statically allocated objects (per-core):
  request pool                      | 412KB             | -
  lbio array                        | -                 | 192KB
  Free page pool                    | -                 | 256KB
128KB block request:
  (l)bio + (l)bio_vec               | 648B              | 704B
  request                           | 384B              | -
  iod                               | 974B              | -
  prp_list                          | 4096B             | 4096B
  Total                             | 6104B             | 4800B