
How to Teach an Old File System Dog New Object Store Tricks

USENIX HotStorage ’18

Eunji Lee1, Youil Han1, Suli Yang2, Andrea C. Arpaci-Dusseau2, Remzi H. Arpaci-Dusseau2

1 Chungbuk National University
2 University of Wisconsin–Madison

Data-service Platforms
• Layering
  • Abstract away underlying details
  • Reuse of existing software
  • Agility: development, operation, and maintenance
• Often at odds with efficiency

Ecosystem of Data-Service Platforms
• Local File System
  • Bottom layer of modern storage platforms
  • Portability, extensibility, ease of development

Example stacks, each sitting on a local file system (Ext4, XFS, BtrFS):
• Key-value: Distributed Data Store (Dynamo, MongoDB) → Key-value Store (RocksDB, BerkeleyDB) → Local File System
• Distributed file system: Distributed Data Store (HBase, BigTable) → Distributed File System (HDFS, GFS) → Local File System
• Object store: Distributed Data Store (Ceph) → Object Store Daemon (Ceph) → Local File System

Local File System
• Not intended to serve as an underlying storage engine
• Mismatch between the two layers
  • System-wide optimization: ignores demands from individual applications
  • Little control over file system internals: applications suffer degraded QoS
  • Lack of required operations: no atomic operation (a common workaround is sketched below), no data movement or reorganization, no additional user-level metadata

Result: out-of-control and sub-optimal performance
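Because POSIX offers no multi-block atomic update, applications that keep data in a file system typically emulate atomicity themselves, e.g., with a temp-file-and-rename dance or a write-ahead log. A minimal sketch of the rename pattern follows; it is illustrative only (the path handling is simplified and not taken from the paper):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <string>

// Emulating an atomic object update over plain POSIX: write the new version
// to a temporary file, persist it, then atomically swap it into place.
bool atomic_update(const std::string& path, const void* buf, size_t len) {
    std::string tmp = path + ".tmp";
    int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    bool ok = ::pwrite(fd, buf, len, 0) == static_cast<ssize_t>(len)
           && ::fsync(fd) == 0;                    // persist the new version first
    ::close(fd);
    if (!ok) { ::unlink(tmp.c_str()); return false; }
    if (std::rename(tmp.c_str(), path.c_str()) != 0) return false;  // atomic swap
    int dfd = ::open(".", O_RDONLY);               // also persist the rename itself
    if (dfd >= 0) { ::fsync(dfd); ::close(dfd); }  // (assumes path is in the CWD)
    return true;
}

Every update pays for an extra file, extra metadata, and extra fsync calls, which is exactly the kind of mismatch quantified later in the talk.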

Current Solutions
• Bypass the file system
  • Key-value store, object store, database
  • But relinquishes file system benefits
• Extend file system interfaces
  • Add new features to the POSIX APIs
  • Slow and conservative evolution: stable maintenance is valued over application-specific optimizations
  (Cartoon: "Name: Ext2/3/4, Birth: 1993")

Our Approach
• Use a file system as it is, but in a different manner!
• Design patterns for a user-level data platform
  • Take advantage of the file system
  • Minimize the negative effects of mismatches

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion

Data-service Platform Taxonomy
• Packing: "multiple objects in a file"
• Mapping: "object as a file"

What is the best way to store objects atop a file system?

Case Study: Ceph
• Backend object store engines
  • FileStore: mapping
  • KStore: packing
  • BlueStore

[Figure: Ceph architecture. Clients (RGW, RBD, CephFS) sit on RADOS; each OSD stores data through a backend object store (FileStore, KStore, or BlueStore) on top of a local file system and the storage device.]

Mapping vs. Packing

[Figure: FileStore (mapping), "object as a file": each object is stored in its own file (File A, File B, ...), with a separate log for consistency. KStore (packing), "multiple objects in a file": objects are appended to a shared object-store log and indexed by an LSM tree.]
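The two write paths can be sketched as follows. This is illustrative code, not Ceph's; the function names and the std::map index are stand-ins (KStore actually uses an LSM tree):

#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <map>

// Mapping ("object as a file"): every object key becomes its own file.
void put_mapping(const std::string& dir, const std::string& key,
                 const void* buf, size_t len) {
    std::string path = dir + "/" + key;
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    ::pwrite(fd, buf, len, 0);
    ::fsync(fd);
    ::close(fd);
}

// Packing ("multiple objects in a file"): objects are appended to a shared
// log file and located through an index.
struct Extent { off_t off; size_t len; };
std::map<std::string, Extent> packed_index;    // stand-in for the LSM index

void put_packing(int log_fd, const std::string& key,
                 const void* buf, size_t len) {
    off_t off = ::lseek(log_fd, 0, SEEK_END);  // append position
    ::pwrite(log_fd, buf, len, off);
    ::fsync(log_fd);
    packed_index[key] = {off, len};            // index entry: key -> (offset, length)
}

Mapping pays file system metadata costs per object; packing pays for maintaining and compacting the index. That trade-off is exactly what the following measurements expose.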

Experimental Setup
• Ceph 12.01
• Amazon EC2 clusters: Intel Xeon quad-core, 32 GB DRAM, 2× 256 GB SSD, Ubuntu Server 16.04
• File system: XFS (recommended for Ceph)
• Backends: FileStore, KStore
• Benchmark: rados
• Metrics: IOPS, throughput, write traffic

Performance: Small Writes (4KB)
• KStore performs better than FileStore by 1.5x
• FileStore suffers write amplification caused by file system metadata

[Figure: Average IOPS and Write Traffic Breakdown for 4KB and 1MB writes (ratio w.r.t. original write; components: original, logging, compaction, file system). Write amplification: KStore 4.4x vs. FileStore 8.8x at 4KB; KStore 3.2x vs. FileStore 2.1x at 1MB. Original write traffic: KStore 864 MB (4KB) and 2.4 GB (1MB); FileStore 332 MB (4KB) and 3.8 GB (1MB). At 4KB, KStore delivers 1.5x the IOPS of FileStore.]

Performance: Large Writes (1MB)
• FileStore outperforms KStore by 1.6x
• KStore suffers write amplification caused by compaction

[Figure: the same Average IOPS and Write Traffic Breakdown charts as above. At 1MB, FileStore delivers 1.6x the IOPS of KStore; KStore's 3.2x write amplification at 1MB is dominated by compaction.]

Performance: Logging
• Lack of atomic update support in file systems
• Double-write penalty of logging
• Halves bandwidth for large writes (a sketch follows the figure below)

[Figure: Write Traffic Breakdown for 4KB and 1MB writes (ratio w.r.t. original write). In both backends the logging component adds a second copy of every write, which halves effective bandwidth for large writes.]
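The double-write penalty comes from using a log to obtain the atomicity the file system does not provide: every byte is written once to the log and once to its final location, so large writes see roughly half of the device bandwidth. A minimal, illustrative sketch (not FileStore or KStore code):

#include <unistd.h>
#include <cstddef>

// Write-ahead logging for atomicity: the same payload is written twice.
void logged_write(int log_fd, int data_fd, off_t log_off, off_t data_off,
                  const void* buf, size_t len) {
    ::pwrite(log_fd, buf, len, log_off);    // 1st copy: journal, for atomicity
    ::fsync(log_fd);                        // commit point
    ::pwrite(data_fd, buf, len, data_off);  // 2nd copy: final location
    ::fsync(data_fd);
}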

QoS: FileStore

[Figure: FileStore write traffic and throughput over time (XFS; 4KB and 1MB writes). Background write traffic arrives in bursts and foreground throughput fluctuates heavily: dirty data in the page cache is flushed periodically by buffered I/O, and transactions become entangled with the flush.]

QoS: KStore

[Figure: KStore write traffic and throughput over time (XFS; 4KB and 1MB writes). Throughput is consistently poor (about 40 MB/s for 4KB writes): data passes through a user-level cache to storage, and frequent compaction causes write amplification by merging.]

Summary: performance penalties of file systems
• Small objects suffer seriously from write amplification caused by file system metadata.
• Large writes are sensitive to increased write traffic: from logging in both architectures, and from frequent compaction in the packing architecture.
• Buffered I/O and the out-of-control flush mechanism in file systems make it challenging to support QoS.

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion

SwimStore
• Shadowing with Immutable Metadata Store
• Provides consistently excellent performance for all object sizes while running over a file system

SwimStore
• Strategy 1. In-file shadowing (sketched below)

Problems to solve (from the analysis): file system metadata overhead, double-write penalty, performance fluctuation, compaction cost.

[Figure: a new version A' of object A is written with direct I/O to a fresh location inside the same container file, which acts as a log holding multiple objects (A, A', B); an index entry (key, offset, length) then points to the current version.]
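A sketch of in-file shadowing follows. It is illustrative rather than the SwimStore source: the bump allocator, the alignment handling, and the std::map index are stand-ins.

#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>
#include <string>
#include <map>

// The new version A' of an object is written with direct I/O to a fresh offset
// inside the same container file; only the index entry is switched, so no data
// is ever overwritten in place and no separate log write is needed.
struct Extent { off_t off; size_t len; };
std::map<std::string, Extent> object_index;   // stand-in for the real index
off_t alloc_cursor = 0;                       // simplified shadow allocator
constexpr size_t kAlign = 4096;               // direct I/O alignment

void put_shadowed(int container_fd, const std::string& key,
                  const void* buf, size_t len) {
    size_t alen = (len + kAlign - 1) / kAlign * kAlign;   // O_DIRECT needs alignment
    void* abuf = nullptr;
    if (posix_memalign(&abuf, kAlign, alen) != 0) return;
    std::memset(abuf, 0, alen);
    std::memcpy(abuf, buf, len);

    off_t off = alloc_cursor;                 // shadow location, never in place
    alloc_cursor += alen;
    ::pwrite(container_fd, abuf, alen, off);  // container_fd opened with O_DIRECT
    std::free(abuf);

    // The old extent for 'key' (if any) becomes obsolete space, recycled by Strategy 3.
    object_index[key] = {off, len};
}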

SwimStore
• Strategy 1. In-file shadowing (continued)

[Figure: FileStore vs. SwimStore write paths. FileStore logs the new version A' to a raw-device log and then applies it to the file with asynchronous buffered I/O. SwimStore instead writes A' into the file with synchronous direct I/O, which raises the concern that user-facing latency increases.]

File System

SwimStore • File system access is slower than raw device access

• File system metadata (e.g., inode, allocation bitmap, etc.)• Transaction entanglement

22

File

Synchronous Direct I/O

A

A’

File System

m m m m

SwimStore
• Strategy 2. Metadata-Immutable Container
  • Create a container file and allocate its space in advance (sketched below)

[Figure: 4KB write latency, normalized. Per-file: 1.0; single file: 0.86; raw device: 0.4; metadata-immutable container: 0.4. With pre-allocated, immutable metadata, writing into the container file matches raw-device latency.]
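A sketch of Strategy 2 under stated assumptions: the container file is created and fully allocated once, so later object writes reuse existing blocks and do not dirty the inode size, allocation bitmaps, or extent maps on the data path. The size parameter is hypothetical, and depending on the file system a real implementation may also need to pre-zero the file so that unwritten-to-written extent conversion does not reintroduce metadata updates.

#include <fcntl.h>
#include <unistd.h>

// Open (or create) a metadata-immutable container: allocate all space up front
// so that subsequent in-place writes never change file system metadata.
int open_container(const char* path, off_t container_bytes) {
    int fd = ::open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (::posix_fallocate(fd, 0, container_bytes) != 0) {   // one-time allocation
        ::close(fd);
        return -1;
    }
    ::fsync(fd);   // persist the (now immutable) metadata once
    return fd;
}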

SwimStore
• Strategy 3. Hole-punching with Buddy-like Allocation
  • The shadowing technique requires recycling of obsolete data space

SwimStore
• Strategy 3. Hole-punching with Buddy-like Allocation (continued)
  • Opportunities:
    (+) the file system offers an effectively infinite logical address space
    (+) the file system provides physical space reclamation via hole-punching (see the sketch below)
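On Linux, the punch-hole facility referred to above is exposed through fallocate(2); a minimal sketch of reclaiming an obsolete extent:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // fallocate(2) is Linux-specific
#endif
#include <fcntl.h>
#include <linux/falloc.h>      // FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE
#include <unistd.h>

// Free the physical blocks behind an obsolete extent while keeping the file's
// logical size (and every other offset) intact.
int punch_hole(int fd, off_t off, off_t len) {
    return ::fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len);
}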

SwimStore
• Strategy 3. Hole-punching with Buddy-like Allocation (continued)
  • But too-small holes severely fragment the space

[Figure: logical vs. physical address layout after punching many small holes; a new object ends up scattered across fragmented physical space.]

SwimStore
• Strategy 3. Hole-punching with Buddy-like Allocation (continued)
  • Obsolete extents are grouped into power-of-two size classes (2^0, 2^1, ..., 2^n)
  • Hole-punching for large holes; GC for small holes (see the sketch below)
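A sketch of the punch-or-GC decision, under the assumption that obsolete extents are bucketed by power-of-two size class; the cut-off below is hypothetical:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <linux/falloc.h>
#include <unistd.h>
#include <vector>

constexpr off_t kPunchThreshold = 64 * 1024;          // hypothetical cut-off

struct Extent { off_t off; off_t len; };
std::vector<std::vector<Extent>> size_classes(32);    // class i holds extents around 2^i bytes

int size_class(off_t len) {
    int c = 0;
    while ((off_t(1) << (c + 1)) <= len) ++c;         // largest power of two <= len
    return c;
}

void recycle(int container_fd, Extent e) {
    if (e.len >= kPunchThreshold) {
        // Large hole: hand the blocks back to the file system immediately.
        ::fallocate(container_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                    e.off, e.len);
    } else {
        // Small hole: remember it by size class; background GC later migrates
        // live neighbours and punches one larger, coalesced region instead.
        size_classes[size_class(e.len)].push_back(e);
    }
}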

SwimStore
• Architecture

[Figure: SwimStore architecture. A pool of container files holds object data; metadata (indexing, attributes, etc.) is kept in an LSM tree (LevelDB); an intent log records metadata and checksums for consistency.]
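How these pieces might fit together on a single put is sketched below. This is a hedged reconstruction, not the SwimStore source: the ordering, the member names, and the string encodings are assumptions; only the use of LevelDB for the index follows the slide.

#include <unistd.h>
#include <cstdint>
#include <string>
#include <leveldb/db.h>

// One plausible write path: data into a container file (in-file shadowing),
// an intent record (metadata + checksum) into the intent log, and the new
// location published through the LSM-tree index.
struct SwimLikeStore {
    int container_fd;      // preallocated container file (Strategy 2)
    int intent_log_fd;     // small sequential log of metadata + checksums
    leveldb::DB* index;    // key -> (offset, length) and attributes
    off_t cursor = 0;      // simplified shadow allocator

    void put(const std::string& key, const void* buf, size_t len, uint32_t checksum) {
        off_t off = cursor; cursor += len;               // fresh (shadow) offset
        ::pwrite(container_fd, buf, len, off);           // data into the container
        std::string intent = key + "@" + std::to_string(off) + "+" +
                             std::to_string(len) + "#" + std::to_string(checksum);
        ::pwrite(intent_log_fd, intent.data(), intent.size(), 0);  // appended in real code
        ::fsync(intent_log_fd);                          // commit point
        std::string loc = std::to_string(off) + "+" + std::to_string(len);
        index->Put(leveldb::WriteOptions(), key, loc);   // publish the new location
    }
};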

Contents
• Motivation
• Problem Analysis
• SwimStore
• Performance Evaluation
• Conclusion

Experimental Setup
• Ceph 12.01; SwimStore implemented in roughly 12K lines of C++
• Amazon EC2 clusters: Intel Xeon quad-core, 32 GB DRAM, 2× 256 GB SSD, Ubuntu Server 16.04
• File system: XFS (recommended for Ceph)
• Backends: FileStore, KStore, BlueStore, SwimStore
• Benchmark: rados
• Metrics: IOPS, throughput, write traffic

Performance Evaluation: IOPS

[Figure: IOPS for 4KB, 16KB, 64KB, 256KB, and 1MB writes, normalized to FileStore, for FileStore, BlueStore, SwimStore, and KStore. Absolute rates annotated on the chart: 1454, 1472, 881, 243, and 67 ops/s at 4KB, 16KB, 64KB, 256KB, and 1MB, respectively.]
• Small writes: SwimStore is 2.5x better than FileStore, 1.6x better than BlueStore, and 1.1x better than KStore
• Large writes: SwimStore is 1.8x better than FileStore and 3.1x better than KStore

Performance Evaluation: Write Traffic

[Figure: write traffic for 4KB, 16KB, 64KB, 256KB, and 1MB writes, normalized to SwimStore, for FileStore, BlueStore, SwimStore, and KStore. Absolute volumes annotated on the chart: 1129 MB, 3020 MB, 5370 MB, 7342 MB, and 7342 MB at 4KB, 16KB, 64KB, 256KB, and 1MB, respectively.]

Contents
• Motivation
• Problem Analysis
• Solution
• Performance Evaluation
• Conclusion

Conclusion
• Explored design patterns for building an object store atop a local file system
• SwimStore: a new backend object store
  • In-file shadowing
  • Metadata-immutable container
  • Hole-punching with buddy-like allocation
• Provides high performance with little performance variation
• Retains all the benefits of the file system


Thank you
