Content
1. Ceph Architecture
2. Ceph Components
3. Performance Evaluation
4. Ceph Demo
5. Conclusion
Ceph Architecture
What is Ceph?
Ceph is a distributed file system that provides excellent performance, scalability, and reliability.
Features
Decoupled data and metadata
Dynamic distributed metadata management
Reliable autonomic distributed object storage
Ceph Architecture
Object-based Storage
Traditional storage stack (entirely inside the operating system): Applications → System Call Interface → File System → Logical Block Interface → Block I/O Manager → Hard Drive.
Object-based storage stack: the file system is split into a File System Client Component, which stays in the operating system beneath the System Call Interface, and a File System Storage Component, which moves onto the Object-based Storage Device together with the Logical Block Interface and Block I/O Manager.
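The practical difference is the interface the device exposes. Below is a minimal sketch (plain Python, not Ceph code) contrasting block addressing with object addressing; the names BlockDevice and ObjectStore are illustrative assumptions.

```python
# Minimal sketch (not Ceph code) contrasting the two interfaces.
# BlockDevice and ObjectStore are hypothetical names.

class BlockDevice:
    """Traditional storage: the host file system addresses fixed-size blocks."""
    def __init__(self, block_size=4096, num_blocks=1024):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def write_block(self, lba, data):
        assert len(data) == self.block_size
        self.blocks[lba] = data

    def read_block(self, lba):
        return self.blocks[lba]


class ObjectStore:
    """Object-based storage: the device manages its own layout;
    callers address variable-size named objects."""
    def __init__(self):
        self.objects = {}

    def write(self, object_id, data):
        self.objects[object_id] = bytes(data)

    def read(self, object_id):
        return self.objects[object_id]


if __name__ == "__main__":
    osd = ObjectStore()
    osd.write("inode-123.0", b"hello ceph")
    print(osd.read("inode-123.0"))
```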
Ceph Components
The system consists of four parts: clients, a metadata server (MDS) cluster, an object storage (OSD) cluster, and a cluster monitor.
Clients perform metadata operations against the MDS cluster and file I/O directly against the object storage cluster; the MDS cluster uses metadata I/O to the OSD cluster to persist its state.
Ceph Components
Client Operation
A client sends an open request to the metadata cluster and receives back a capability together with the inode number, file size, and striping strategy.
Reads and writes then go directly to the object storage cluster; CRUSH is used to map each placement group (PG) to OSDs.
On close, the client returns the capability and reports the details of its reads/writes to the metadata cluster, which is responsible for capability management. (A sketch of this exchange follows below.)
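The exchange above can be summarized in a few lines of code. This is a hedged sketch under assumed names (Capability, MetadataCluster, and ObjectStorageCluster are illustrative, not Ceph's actual interfaces):

```python
# Hedged sketch of the client/MDS/OSD exchange described above.
# All classes and fields are illustrative, not Ceph's actual API.
from dataclasses import dataclass

@dataclass
class Capability:
    inode: int
    size: int
    stripe_unit: int      # striping strategy handed out by the MDS
    may_read: bool
    may_write: bool

class MetadataCluster:
    def open(self, path, mode):
        # MDS resolves the path and grants a capability.
        return Capability(inode=123, size=0, stripe_unit=4 << 20,
                          may_read=True, may_write="w" in mode)

    def close(self, cap, new_size):
        # Client returns the capability and reports the new file size.
        print(f"inode {cap.inode} closed, size={new_size}")

class ObjectStorageCluster:
    def __init__(self):
        self.objects = {}
    def write(self, inode, stripe_index, data):
        # CRUSH would map (inode, stripe) -> PG -> OSDs; here we just store it.
        self.objects[(inode, stripe_index)] = data

if __name__ == "__main__":
    mds, osds = MetadataCluster(), ObjectStorageCluster()
    cap = mds.open("/data/file", "w")          # open request -> capability
    osds.write(cap.inode, 0, b"payload")       # file I/O goes straight to OSDs
    mds.close(cap, new_size=7)                 # close request reports details
```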
Ceph Components
Client Synchronization
POSIX semantics vs. relaxed consistency: when a file is opened by multiple clients and at least one of them writes, I/O becomes synchronous, which is a performance killer.
Solution: the HPC extensions to POSIX.
The default is consistency/correctness; optionally, the extensions relax consistency for both data and metadata.
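As a rough illustration of that decision, the sketch below assumes an O_LAZY-style open flag (the lazy-I/O flag from the proposed POSIX HPC extensions) that lets an application opt out of synchronous I/O; the constant value and the function are invented:

```python
# Sketch of the consistency decision described above. O_LAZY stands in
# for the lazy-I/O flag of the POSIX HPC extensions; its value here is
# illustrative, not a real constant.
O_LAZY = 0x10000

def io_mode(num_writers, num_readers, flags):
    shared_write = num_writers > 0 and (num_readers + num_writers) > 1
    if shared_write and not (flags & O_LAZY):
        return "synchronous"   # default: correctness, every op hits the OSDs
    return "buffered"          # relaxed: client may cache reads and writes

if __name__ == "__main__":
    print(io_mode(num_writers=1, num_readers=3, flags=0))        # synchronous
    print(io_mode(num_writers=1, num_readers=3, flags=O_LAZY))   # buffered
```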
Ceph Components
Namespace Operations
Ceph optimizes for the most common metadata access scenario (readdir followed by stat), but by default "correct" behavior is provided at some cost.
Example of that cost: a stat operation on a file opened by multiple writers.
Applications for which coherent behavior is unnecessary can use the extensions instead; a sketch of the readdir/stat fast path follows.
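Below is a small sketch of the readdir-followed-by-stat optimization under assumed names (FakeMDS, DirClient, and readdir_plus are illustrative, not Ceph's client API):

```python
# Sketch (not Ceph's client API) of the readdir-followed-by-stat
# optimization: the readdir reply carries the attributes of every entry,
# so the stat calls that usually follow are served from the client cache.
class FakeMDS:
    """Stand-in for the metadata cluster, purely for illustration."""
    def __init__(self, tree):
        self.tree = tree                        # {dir_path: {name: attrs}}
    def readdir_plus(self, path):
        return list(self.tree[path].items())    # names *and* attributes
    def stat(self, path):
        d, _, name = path.rpartition("/")
        return self.tree[d][name]

class DirClient:
    def __init__(self, mds):
        self.mds = mds
        self.attr_cache = {}
    def readdir(self, path):
        entries = self.mds.readdir_plus(path)   # one round trip
        for name, attrs in entries:
            self.attr_cache[f"{path}/{name}"] = attrs
        return [name for name, _ in entries]
    def stat(self, path):
        # Fast (relaxed) path: answered locally when the preceding
        # readdir already cached the attributes.
        if path in self.attr_cache:
            return self.attr_cache[path]
        return self.mds.stat(path)              # otherwise, a round trip

if __name__ == "__main__":
    mds = FakeMDS({"/logs": {"a.log": {"size": 10}, "b.log": {"size": 20}}})
    c = DirClient(mds)
    for name in c.readdir("/logs"):
        print(name, c.stat(f"/logs/{name}"))    # no extra MDS round trips
```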
Ceph Components
Metadata Storage
Advantages of per-MDS journals (eventually pushed to the OSDs):
Sequential updates are more efficient and reduce the re-write workload.
The on-disk storage layout can be optimized for future read access.
Failure recovery is easier, since the journal can simply be rescanned.
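A minimal sketch of that journaling idea, with an invented MetadataJournal class standing in for the real MDS journal:

```python
# Hedged sketch of a per-MDS metadata journal: updates are appended
# sequentially, lazily pushed to the object store, and rescanned to
# recover state after a crash. The structure is illustrative, not Ceph's.
class MetadataJournal:
    def __init__(self, osd_store):
        self.entries = []           # in-memory tail of the journal
        self.osd_store = osd_store  # durable store (list of flushed segments)

    def append(self, update):
        # Sequential append: cheap, and repeated updates to the same
        # inode can be coalesced before they ever reach the OSDs.
        self.entries.append(update)

    def flush(self):
        # Eventually push the journal segment to the OSD cluster.
        self.osd_store.append(list(self.entries))
        self.entries.clear()

    @staticmethod
    def recover(osd_store):
        # Failure recovery: rescan every flushed segment in order.
        state = {}
        for segment in osd_store:
            for inode, attrs in segment:
                state[inode] = attrs   # later entries win
        return state

if __name__ == "__main__":
    store = []
    journal = MetadataJournal(store)
    journal.append((1, {"size": 0}))
    journal.append((1, {"size": 4096}))          # re-write stays in the journal
    journal.flush()
    print(MetadataJournal.recover(store))        # {1: {'size': 4096}}
```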
Ceph Components
Dynamic Subtree Partitioning
Cached metadata is adaptively distributed across a set of MDS nodes along the directory hierarchy.
Each MDS measures the popularity of its metadata, and busy subtrees are migrated; migration preserves locality. (See the sketch below.)
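The sketch below shows the load-balancing idea only; the rebalance function, counters, and threshold are invented for illustration and are not Ceph's actual balancer:

```python
# Sketch of the idea behind dynamic subtree partitioning: each MDS tracks
# per-subtree popularity and hands whole subtrees to a less loaded peer.
def rebalance(mds_load, threshold=1.5):
    """mds_load: {mds_id: {subtree_path: popularity}} -> list of migrations."""
    totals = {m: sum(subs.values()) for m, subs in mds_load.items()}
    avg = sum(totals.values()) / len(totals)
    migrations = []
    for mds, total in totals.items():
        if total > threshold * avg:
            # Move the hottest whole subtree (preserving locality) to the
            # currently least loaded MDS.
            victim = max(mds_load[mds], key=mds_load[mds].get)
            target = min(totals, key=totals.get)
            migrations.append((victim, mds, target))
    return migrations

if __name__ == "__main__":
    load = {"mds0": {"/home": 900, "/tmp": 50}, "mds1": {"/var": 100}}
    print(rebalance(load))   # e.g. [('/home', 'mds0', 'mds1')]
```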
Ceph Components
Traffic Control for Metadata Access
Challenge: partitioning can balance the workload, but it cannot deal with hot spots or flash crowds.
Ceph's solution: heavily read directories are selectively replicated across multiple nodes to distribute load, while directories that are very large or experiencing a heavy write workload have their contents hashed by file name across the cluster (see the sketch below).
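A toy routing policy capturing both cases; the field names, thresholds, and hash scheme are assumptions, not Ceph's implementation:

```python
# Sketch of the metadata-routing policy described above.
import hashlib
import random

def mds_for_lookup(dirent, filename, mds_ids):
    """Pick which MDS a client should ask for a dentry."""
    if dirent["read_hot"]:
        # Heavily read directory: replicated, any replica can answer.
        return random.choice(dirent["replicas"])
    if dirent["huge"] or dirent["write_hot"]:
        # Very large or write-heavy directory: contents hashed by file
        # name across the whole cluster.
        h = int(hashlib.sha1(filename.encode()).hexdigest(), 16)
        return mds_ids[h % len(mds_ids)]
    return dirent["authority"]   # normal case: one authoritative MDS

if __name__ == "__main__":
    d = {"read_hot": False, "write_hot": True, "huge": False,
         "replicas": [0, 1], "authority": 0}
    print(mds_for_lookup(d, "output-0421.dat", mds_ids=[0, 1, 2, 3]))
```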
CRUSH
CRUSH(x) → (osd1, osd2, osd3)
Inputs: x is the placement group, plus the hierarchical cluster map and the placement rules.
Output: a list of OSDs.
Advantages: anyone can calculate an object's location, and the cluster map is only infrequently updated.
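The sketch below is not the real CRUSH algorithm; it is a deterministic hash-based stand-in that demonstrates the key property, namely that any party holding the cluster map computes the same OSD list without consulting a central table:

```python
# Not the real CRUSH algorithm: a minimal deterministic stand-in showing
# that placement can be computed anywhere from (pg_id, cluster_map) alone.
import hashlib

def place(pg_id, cluster_map, replicas=3):
    """Map a placement group to `replicas` distinct OSDs from cluster_map."""
    chosen = []
    attempt = 0
    while len(chosen) < replicas:
        key = f"{pg_id}:{attempt}".encode()
        h = int(hashlib.sha1(key).hexdigest(), 16)
        osd = cluster_map[h % len(cluster_map)]
        if osd not in chosen:          # keep replicas on distinct OSDs
            chosen.append(osd)
        attempt += 1
    return chosen

if __name__ == "__main__":
    cluster_map = [f"osd.{i}" for i in range(8)]
    # Client and OSD compute the same answer independently.
    print(place("pg-0x1a", cluster_map))
    print(place("pg-0x1a", cluster_map))
```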
Conclusion
Strengths: easy scalability to petabyte capacity, high performance under varying workloads, strong reliability.
Weaknesses: the MDS and OSDs are implemented in user space; the primary replica may become a bottleneck under heavy write load; N-way replication lacks storage efficiency.
References
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, and Darrell D. E. Long, "Ceph: A Scalable, High-Performance Distributed File System," OSDI '06: 7th USENIX Symposium on Operating Systems Design and Implementation.
M. Tim Jones, "Ceph: A Linux petabyte-scale distributed file system," IBM developerWorks, online document.
Technical talk presented by Sage Weil at LCA 2010.
Sage Weil's PhD dissertation, "Ceph: Reliable, Scalable, and High-Performance Distributed Storage" (PDF).
"CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data" (PDF) and "RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters" (PDF) discuss two of the most interesting aspects of the Ceph file system.
"Building a Small Ceph Cluster" gives instructions for building a Ceph cluster along with tips for distribution of assets.
"Ceph: Distributed Network File System," KernelTrap.