How to Teach an Old File System Dog New Object Store Tricks
USENIX HotStorage ’18
Eunji Lee1, Youil Han1, Suli Yang2, Andrea C. Arpaci-Dusseau2, Remzi H. Arpaci-Dusseau2
1 Chungbuk National University   2 University of Wisconsin-Madison
Data-Service Platforms

• Layering
  • Abstracts away underlying details
  • Reuses existing software
  • Agility in development, operation, and maintenance
  • Portability, extensibility, ease of development
  • But often at odds with efficiency

Ecosystem of Data-Service Platforms

• Local file system: the bottom layer of modern storage platforms
[Figure: three representative stacks, each bottoming out in a local file system (Ext4, XFS, Btrfs):
  • Distributed data store (Dynamo, MongoDB) → key-value store (RocksDB, BerkeleyDB) → local file system
  • Distributed data store (HBase, BigTable) → distributed file system (HDFS, GFS) → local file system
  • Distributed data store (Ceph) → object store daemon (Ceph OSD) → local file system]
Local File System

• Not intended to serve as an underlying storage engine
• Mismatch between the two layers
  • System-wide optimization: ignores the demands of individual applications
  • Little control over file system internals: applications suffer degraded QoS
  • Lack of required operations: no atomic update, no data movement or reorganization, no additional user-level metadata
• Result: out-of-control and sub-optimal performance
Current Solutions

• Bypass the file system
  • Key-value store, object store, database
  • But this relinquishes the benefits of the file system
• Extend the file system interfaces
  • Add new features to the POSIX APIs
  • Slow and conservative evolution: file systems favor stable maintenance over application-specific optimizations (the Ext2/3/4 line, the "old dog", dates back to 1993)
Our Approach

• Use a file system as it is, but in a different manner!
• Design patterns for user-level data platforms
  • Take advantage of the file system
  • Minimize the negative effects of mismatches
Contents

• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion
Data-Service Platform Taxonomy

What is the best way to store objects atop a file system?
• Mapping: “object as a file”
• Packing: “multiple objects in a file”
Case Study: Ceph

• Backend object store engines
  • FileStore: mapping
  • KStore: packing
  • BlueStore

[Figure: Ceph architecture. RGW, RBD, and CephFS run atop RADOS; each OSD runs a backend object store (FileStore, KStore, or BlueStore) over a local file system and the storage device.]
Mapping vs. Packing

[Figure: FileStore (mapping) keeps an object store plus a log, storing each object as its own file (“object as a file”). KStore (packing) appends multiple objects into shared log files organized as an LSM-tree (“multiple objects in a file”). Both patterns are sketched in code below.]
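To make the two patterns concrete, here is a minimal C++ sketch of how each turns an object write into file system calls; the function names and on-disk layout are illustrative assumptions, not Ceph's actual code:

#include <fcntl.h>
#include <unistd.h>
#include <map>
#include <string>

// Mapping: "object as a file" -- every object gets its own file/inode.
ssize_t write_mapped(const std::string& oid, const void* buf, size_t len) {
    int fd = open(("objects/" + oid).c_str(),
                  O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return -1;
    ssize_t n = pwrite(fd, buf, len, 0);
    fsync(fd);                 // per-object metadata is journaled by the FS
    close(fd);
    return n;
}

struct Extent { off_t off; size_t len; };

// Packing: "multiple objects in a file" -- objects are appended to a
// shared log file and found again through a user-level index.
ssize_t write_packed(int log_fd, off_t& tail, const std::string& oid,
                     const void* buf, size_t len,
                     std::map<std::string, Extent>& index) {
    ssize_t n = pwrite(log_fd, buf, len, tail);
    if (n < 0) return -1;
    index[oid] = {tail, len};  // one shared inode; index maps oid -> extent
    tail += n;
    return n;
}

Mapping pays file system metadata cost per object; packing pays index maintenance and, in an LSM-tree, compaction cost. That trade-off drives the measurements that follow.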
Experimental Setup

• Ceph 12.0.1
• Amazon EC2 cluster: Intel Xeon quad-core, 32 GB DRAM, 2 × 256 GB SSDs, Ubuntu Server 16.04
• File system: XFS (recommended for Ceph)
• Backends: FileStore, KStore
• Benchmark: RADOS
• Metrics: IOPS, throughput, write traffic
Performance: Small Writes (4 KB)

• KStore outperforms FileStore by 1.5× in average IOPS
• Cause: write amplification by file system metadata in FileStore

[Figure: average IOPS (KStore 1.5× FileStore at 4 KB) and write traffic broken down into Original, Logging, Compaction, and Filesystem components. Amplification relative to the original traffic: KStore 4.4× and FileStore 8.8× at 4 KB; KStore 3.2× and FileStore 2.1× at 1 MB. Original write traffic: KStore 864 MB (4 KB) and 2.4 GB (1 MB); FileStore 332 MB (4 KB) and 3.8 GB (1 MB).]
Performance: Large Writes (1 MB)

• FileStore outperforms KStore by 1.6× in average IOPS
• Cause: write amplification by LSM-tree compaction in KStore

[Figure: the same average IOPS and write traffic breakdown; at 1 MB, KStore's amplification (3.2×) exceeds FileStore's (2.1×), driven by the Compaction component.]
Performance: Logging

• File systems lack atomic update support, so both backends must log writes
• The double-write penalty of logging halves bandwidth for large writes

[Figure: the write traffic breakdown again, highlighting the Logging component in both FileStore and KStore.]
QoS: FileStore

• Periodic flushes of buffered I/O, entangled with transactions in the page cache, make throughput fluctuate

[Figure: write traffic and background-write throughput over time on XFS for 4 KB and 1 MB writes; FileStore's throughput repeatedly spikes and stalls.]
QoS: KStore

• Frequent compaction of the user-level cache (write amplification by merge) keeps throughput consistently poor (about 40 MB/s for 4 KB writes)

[Figure: write traffic and background-write throughput over time on XFS for 4 KB and 1 MB writes; KStore's throughput stays uniformly low.]
Summary

• Performance penalties of file systems
  • Small objects suffer severe write amplification caused by file system metadata
  • Large writes are sensitive to the extra write traffic of logging (in both architectures) and of frequent compaction (in the packing architecture)
  • Buffered I/O and the out-of-control flush mechanism of file systems make it challenging to support QoS
Contents

• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion
SwimStore

• Shadowing with Immutable Metadata store
• Provides consistently excellent performance for all object sizes atop a file system
SwimStore: Strategy 1. In-File Shadowing

• Problems to solve: file system metadata overhead, double-write penalty, performance fluctuation, compaction cost
• Instead of logging, write the new version A' of an object to fresh space within the same file using direct I/O, then update the index entry (key, offset, length); a sketch follows below

[Figure: a container file holding objects A and B; an update writes A' beside them rather than logging and overwriting A in place.]
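A minimal sketch of the in-file shadowing write path, assuming updates are redirected to fresh space in the container and published by an index update; names and layout are illustrative, not SwimStore's actual implementation:

#include <fcntl.h>
#include <unistd.h>
#include <map>
#include <string>

struct Extent { off_t off; size_t len; };
std::map<std::string, Extent> index_;  // key -> live location in container
off_t tail_ = 0;                       // next free offset in the container

// Shadow an update inside the container file: never overwrite A in place;
// write A' to fresh space, then switch the index entry. The old extent of
// A becomes garbage to recycle (see Strategy 3). With direct I/O the data
// hits the device once -- no separate log, hence no double write.
// (Buffer alignment required by O_DIRECT is omitted for brevity.)
int shadow_write(int container_fd, const std::string& key,
                 const void* buf, size_t len) {
    off_t off = tail_;
    if (pwrite(container_fd, buf, len, off) != (ssize_t)len) return -1;
    tail_ += len;
    index_[key] = {off, len};  // visibility flips atomically at the index
    return 0;
}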
SwimStore: Strategy 1. In-File Shadowing (cont.)

[Figure: FileStore logs A' to a raw device and applies A to the file later with asynchronous buffered I/O; SwimStore writes A' into the file directly with synchronous direct I/O. Concern: user-facing latency increases!]
SwimStore

• File system access is slower than raw device access
  • File system metadata (e.g., inode, allocation bitmap)
  • Transaction entanglement

[Figure: a synchronous direct write of A' into a file drags file system metadata (m) into the transaction.]
SwimStore: Strategy 2. Metadata-Immutable Container

• Create a container file and allocate its space in advance, so subsequent writes never modify file system metadata; a sketch follows below

[Figure: normalized 4 KB write latency: per-file 1.0, single file 0.86, raw device 0.4, metadata-immutable container 0.4; the metadata-immutable container matches raw-device latency.]
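One plausible way to realize "allocate space in advance" with standard POSIX calls; a sketch of the idea, not necessarily the paper's exact mechanism. After this setup, data writes (e.g., with O_DIRECT) no longer touch the inode or allocation bitmap:

#include <fcntl.h>
#include <unistd.h>

// Preallocate a container file once, so its inode and allocation bitmap
// never change afterwards: later object writes are pure data writes and
// drag no file system metadata into the critical path.
int create_container(const char* path, off_t bytes) {
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) return -1;
    if (posix_fallocate(fd, 0, bytes) != 0) {  // fix size and blocks now
        close(fd);
        return -1;
    }
    fsync(fd);  // persist the metadata once; it is immutable from here on
    return fd;
}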
SwimStore: Strategy 3. Hole-Punching with Buddy-like Allocation

• The shadowing technique requires recycling the space of obsolete data
• Opportunities
  (+) The file system offers a practically infinite logical address space
  (+) The file system provides physical space reclamation via punch-hole

[Figure: punching holes in the container file releases the physical blocks behind obsolete extents.]
• Problem: holes that are too small severely fragment the space

[Figure: logical vs. physical address space; a new object ends up scattered across many small holes.]
• Solution: buddy-like size classes (2^0, 2^1, …, 2^n); see the sketch below
  • Hole-punching for large holes
  • GC for small holes
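On Linux, this policy maps naturally onto fallocate's punch-hole mode. A sketch, with an illustrative threshold and a hypothetical enqueue_for_gc helper:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for fallocate(2) and the FALLOC_FL_* flags
#endif
#include <fcntl.h>
#include <sys/types.h>

void enqueue_for_gc(off_t off, size_t len);  // hypothetical: copy-and-compact later

// Recycle the extent of an obsolete object version. Large holes are
// returned to the file system immediately via punch-hole; small holes are
// left to GC so the container does not fragment into tiny unusable holes.
int recycle_extent(int container_fd, off_t off, size_t len) {
    const size_t kPunchThreshold = 64 * 1024;  // illustrative cutoff
    if (len >= kPunchThreshold) {
        // KEEP_SIZE: reclaim physical blocks without shrinking the file
        return fallocate(container_fd,
                         FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         off, (off_t)len);
    }
    enqueue_for_gc(off, len);
    return 0;
}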
SwimStore Architecture

• Container file pool for object data
• Metadata (indexing, attributes, etc.) kept in an LSM-tree (LevelDB); a sketch of the index follows below
• Intent log (metadata, checksum)
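The slide names LevelDB as the metadata store; a minimal sketch of how an object's location might be recorded there. The key-to-value encoding shown is an assumption, not SwimStore's on-disk format:

#include <leveldb/db.h>
#include <cstdint>
#include <string>

// Record an object's location in the LSM-tree. Only small key/location
// pairs flow through LevelDB; bulk object data stays in container files,
// which keeps compaction traffic small.
bool put_location(leveldb::DB* db, const std::string& key,
                  uint32_t container_id, uint64_t off, uint64_t len) {
    std::string val = std::to_string(container_id) + ":" +
                      std::to_string(off) + ":" +
                      std::to_string(len);
    return db->Put(leveldb::WriteOptions(), key, val).ok();
}

Keeping bulk data out of the LSM-tree is the design point: the tree compacts only tiny location records, avoiding the merge-driven write amplification seen in KStore.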
Contents

• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion
Experimental Setup

• Ceph 12.0.1; SwimStore implemented in about 12K lines of C++
• Amazon EC2 cluster: Intel Xeon quad-core, 32 GB DRAM, 2 × 256 GB SSDs, Ubuntu Server 16.04
• File system: XFS (recommended for Ceph)
• Backends: FileStore, KStore, BlueStore, SwimStore
• Benchmark: RADOS
• Metrics: IOPS, throughput, write traffic
Performance Evaluation: IOPS

• Small writes: SwimStore is 2.5× better than FileStore, 1.6× better than BlueStore, and 1.1× better than KStore
• Large writes: SwimStore is 1.8× better than FileStore and 3.1× better than KStore

[Figure: IOPS for 4 KB to 1 MB objects, normalized to FileStore (absolute FileStore rates: 1454, 1472, 881, 243, and 67 ops/s), for FileStore, BlueStore, SwimStore, and KStore.]
Performance Evaluation: Write Traffic

[Figure: write traffic for 4 KB to 1 MB objects, normalized to SwimStore (absolute SwimStore traffic: 1129 MB, 3020 MB, 5370 MB, 7342 MB, and 7342 MB), for FileStore, BlueStore, and KStore.]
Contents

• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion
Conclusion

• Explored design patterns for building an object store atop a local file system
• SwimStore: a new backend object store
  • In-file shadowing
  • Metadata-immutable container
  • Hole-punching with buddy-like allocation
• Provides high performance with little performance variation
• Retains all the benefits of the file system
Thank you