Ceph Day NYC: The Future of CephFS


Future of CephFS

Sage Weil

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

[Stack diagram: apps sit on LIBRADOS and RADOSGW, hosts/VMs on RBD, clients on CEPH FS, all on top of RADOS]

Finally, let's talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing, except that the same drive could be shared by everyone you've ever met (and everyone they've ever met).

[Diagram: a client sends metadata operations to the MDS cluster and reads/writes file data directly to RADOS]

Remember all that metadata we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter the MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.

[Diagram: three MDS daemons]

There are multiple MDSs!

Metadata Server

Manages metadata for a POSIX-compliant shared filesystem
Directory hierarchy

File metadata (owner, timestamps, mode, etc.)

Stores metadata in RADOS

Does not serve file data to clients

Only required for shared filesystem

If you aren't running Ceph FS, you don't need to deploy metadata servers.
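For example, a cluster that isn't serving Ceph FS simply has no ceph-mds daemons registered. A quick way to check is the MDS status command (the sample output is illustrative and its format varies by release):

$ ceph mds stat          # e.g. "e5: 1/1/1 up {0=a=up:active}" when one MDS is active; a pure RBD/RGW cluster reports none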

legacy metadata storage

a scaling disaster: name → inode → block list → data

no inode table locality

fragmentation
inode table
directory

many seeks

difficult to partition

[Diagram: a conventional tree (/usr, /etc, /var, /home with vmlinuz, passwd, mtab, hosts, lib, include, bin) stored against a separate inode table]

ceph fs metadata storage

block lists unnecessary

inode table mostly useless
APIs are path-based, not inode-based

no random table access, sloppy caching

embed inodes inside directories
good locality, prefetching

leverage key/value object (see the sketch below)

[Diagram: the same tree stored as Ceph FS directory objects, with inodes (1, 100, 102) embedded in their parent directories]
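As a rough sketch of what "leverage key/value object" means in practice, a directory's entries live as key/value pairs on that directory's RADOS object in the metadata pool. The pool name, object name, and key format below are assumptions based on a default deployment of this era, not taken from the slides:

$ rados -p metadata listomapkeys 1.00000000     # assumed name of the root directory's object
etc_head
home_head
usr_head
var_head
vmlinuz_head

Each key is a directory entry, and its value carries the embedded inode, which is what provides the locality and prefetching mentioned above.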

controlling metadata io

view ceph-mds as cache
reduce reads
dir+inode prefetching

reduce writes
consolidate multiple writes

large journal or log
stripe over objects

two tiers
journal for short term

per-directory for long term

fast failure recovery

[Diagram: journal (short term) and per-directory storage (long term)]

one tree

three metadata servers

??

So how do you have one tree and multiple servers?

load distribution

coarse (static subtree)
preserve locality

high management overhead

fine (hash)
always balanced

less vulnerable to hot spots

destroy hierarchy, locality

can a dynamic approach capture benefits of both extremes?

[Diagram: a spectrum from static subtree (good locality) through hash directories to hash files (good balance)]

If there's just one MDS (which is a terrible idea), it manages metadata for the entire tree.

When the second one comes along, it will intelligently partition the work by taking a subtree.

When the third MDS arrives, it will attempt to split the tree again.

Same with the fourth.

DYNAMIC SUBTREE PARTITIONING

An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it's called dynamic subtree partitioning.

scalable
arbitrarily partition metadata

adaptive
move work from busy to idle servers
replicate hot metadata

efficient
hierarchical partition preserves locality

dynamic
daemons can join/leave
take over for failed nodes

dynamic subtree partitioning

Dynamic partitioning

many directories

same directory

Failure recovery

Metadata replication and availability

Metadata cluster scaling

client protocol

highly stateful
consistent, fine-grained caching

seamless hand-off between ceph-mds daemons
when client traverses hierarchy

when metadata is migrated between servers

direct access to OSDs for file I/O

an example

mount -t ceph 1.2.3.4:/ /mnt
    3 ceph-mon RT
    2 ceph-mds RT (1 ceph-mds to ceph-osd RT)

cd /mnt/foo/bar
    2 ceph-mds RT (2 ceph-mds to ceph-osd RT)

ls -al
    open
    readdir
        1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
    stat each file
    close

cp * /tmp
    N ceph-osd RT

[Legend: round trips (RT) to ceph-mon, ceph-mds, and ceph-osd]

recursive accounting

ceph-mds tracks recursive directory stats
file sizes

file and directory counts

modification time

virtual xattrs present full stats

efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
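The same recursive stats are also exposed per directory as virtual extended attributes, so they can be read directly without walking the tree (the mount point and directory below are illustrative):

$ getfattr -n ceph.dir.rbytes /mnt/ceph/pomceph      # recursive total size in bytes
$ getfattr -n ceph.dir.rfiles /mnt/ceph/pomceph      # recursive file count
$ getfattr -n ceph.dir.rctime /mnt/ceph/pomceph      # most recent ctime anywhere below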

snapshots

volume or subvolume snapshots unusable at petabyte scale
snapshot arbitrary subdirectories

simple interface
hidden '.snap' directory

no special tools

$ mkdir foo/.snap/one          # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776             # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one          # remove snapshot

multiple client implementations

Linux kernel client
mount -t ceph 1.2.3.4:/ /mnt

export (NFS), Samba (CIFS)

ceph-fuse

libcephfs.so
your app

Samba (CIFS)

Ganesha (NFS)

Hadoop (map/reduce)

[Diagram: kernel client; ceph-fuse, your app, Samba (SMB/CIFS), Ganesha (NFS), and Hadoop all linking libcephfs]
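For example, the FUSE client can be used wherever the kernel client is unavailable; the monitor address and mountpoint here are illustrative:

$ sudo ceph-fuse -m 1.2.3.4:6789 /mnt/cephfs     # mount via libcephfs + FUSE
$ sudo fusermount -u /mnt/cephfs                 # unmount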

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

[Stack diagram: apps sit on LIBRADOS and RADOSGW, hosts/VMs on RBD, clients on CEPH FS, all on top of RADOS]

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

NEARLY AWESOME

AWESOME

AWESOME

AWESOME

AWESOME

Ceph FS is feature-complete, but it still lacks the testing, quality assurance, and benchmarking work we feel is needed before we can recommend it for production use.

Path forward

Testing
Various workloads

Multiple active MDSs

Test automation
Simple workload generator scripts

Bug reproducers

Hacking
Bug squashing

Long-tail features

Integrations
Ganesha, Samba, *stacks

hard links?

rare

useful locality properties
intra-directory

parallel inter-directory

on miss, file objects provide per-file backpointers
degenerates to log(n) lookups

optimistic read complexity
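As a reminder of the user-visible semantics this has to support (paths here are illustrative), a hard link is simply a second directory entry, possibly in another directory, pointing at the same inode:

$ ln /mnt/cephfs/dir1/file /mnt/cephfs/dir2/link
$ stat -c '%i' /mnt/cephfs/dir1/file /mnt/cephfs/dir2/link   # prints the same inode number twice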

what is journaled

lots of state
journaling is expensive up-front, cheap to recover

non-journaled state is cheap, but complex (and somewhat expensive) to recover

yes
client sessions

actual fs metadata modifications

no
cache provenance

open files

lazy flush
client modifications may not be durable until fsync() or visible to another client
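This is the usual POSIX contract: an application that needs durability must ask for it explicitly, for example with fsync()/fdatasync(), or from the shell (path illustrative):

$ dd if=/dev/zero of=/mnt/cephfs/scratch bs=1M count=4 conv=fsync   # data is fsync'd before dd exits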
