Future of CephFS
Sage Weil
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
(diagram: apps sit atop LIBRADOS and RADOSGW, a host/VM atop RBD, and a client atop CEPH FS)
Finally, let's talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing, except that the same drive could be shared by everyone you've ever met (and everyone they've ever met).
(diagram: a CLIENT sends metadata requests to the MDS cluster and reads/writes file data directly to RADOS)
Remember all that metadata we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter the MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
(diagram: three metadata servers)
There are multiple MDSs!
Metadata Server
Manages metadata for a POSIX-compliant shared filesystem
Directory hierarchy
File metadata (owner, timestamps, mode, etc.)
Stores metadata in RADOS
Does not serve file data to clients
Only required for shared filesystem
If you aren't running Ceph FS, you don't need to deploy metadata servers.
legacy metadata storage
a scaling disaster
name → inode → block list → data
no inode table locality
fragmentation
inode table
directory
many seeks
difficult to partition
(diagram: a conventional tree of usr, etc, var, home, vmlinuz, passwd, mtab, hosts, lib, include, bin, backed by a separate inode table)
ceph fs metadata storage
block lists unnecessary
inode table mostly useless
APIs are path-based, not inode-based
no random table access, sloppy caching
embed inodes inside directories
good locality, prefetching
leverage key/value object
(diagram: directory objects 1, 100, and 102, each embedding the inodes for its entries: usr, etc, var, home, vmlinuz, passwd, mtab, hosts, lib, include, bin)
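The idea of embedding inodes inside their directory can be sketched as a key/value map per directory object, as in a RADOS omap. This is a toy model, not Ceph's actual on-disk encoding; the record fields are illustrative.

```python
# Sketch: a directory stored as one key/value object, mapping each
# dentry name to an embedded inode record. Fields are illustrative.

def make_inode(ino, mode, size=0):
    """A minimal inode record embedded directly in its directory."""
    return {"ino": ino, "mode": mode, "size": size}

# One directory == one key/value object: dentry name -> embedded inode.
root = {
    "usr":     make_inode(2, "drwxr-xr-x"),
    "etc":     make_inode(3, "drwxr-xr-x"),
    "vmlinuz": make_inode(4, "-rw-r--r--", size=5 << 20),
}

# Reading the directory object fetches every embedded inode at once, so
# an `ls -l`-style listing needs no per-file inode table lookups.
listing = [(name, rec["ino"], rec["mode"]) for name, rec in sorted(root.items())]
```

Because the inodes travel with the directory, one object read prefetches exactly the metadata a directory listing will touch next.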
controlling metadata io
view ceph-mds as cache
reduce reads
dir+inode prefetching
reduce writes
consolidate multiple writes
large journal or log
stripe over objects
two tiers
journal for short term
per-directory for long term
fast failure recovery
journal
directories
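The write-consolidation point above can be sketched as follows: updates accumulate in the short-term journal, and only the newest version of each item ever reaches long-term per-directory storage. The data model here is made up for illustration.

```python
# Sketch of write consolidation: repeated journal updates to the same
# metadata item are coalesced, so only the latest version is flushed
# to long-term per-directory storage. Illustrative only.

def consolidate(journal_events):
    """Keep only the newest update per inode number."""
    latest = {}
    for ino, update in journal_events:
        latest[ino] = update          # later entries overwrite earlier ones
    return latest

events = [(1, "mtime=t0"), (2, "size=4096"), (1, "mtime=t1"), (1, "mtime=t2")]
flushed = consolidate(events)
# Three updates to inode 1 collapse into a single long-term write.
```

This is why a large journal pays off: the longer an update can linger there, the more chances it has to be overwritten before it costs a second-tier write.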
one tree
three metadata servers
??
So how do you have one tree and multiple servers?
load distribution
coarse (static subtree)
preserve locality
high management overhead
fine (hash)
always balanced
less vulnerable to hot spots
destroy hierarchy, locality
can a dynamic approach capture benefits of both extremes?
static subtree
hash directories
hash files
good locality
good balance
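The two extremes can be sketched in a few lines. Both routing functions below are toy illustrations (the assignment table and hashing scheme are assumptions, not Ceph's actual algorithms): static subtrees keep siblings together on one server, while hashing spreads directories evenly at the cost of locality.

```python
# Sketch of the two partitioning extremes. Names are illustrative.
import hashlib

def static_subtree(path, assignment):
    """Coarse: route by top-level directory (good locality, can imbalance)."""
    top = path.lstrip("/").split("/", 1)[0]
    return assignment[top]

def hash_dirs(path, n_servers):
    """Fine: hash the parent directory (balanced, destroys locality)."""
    parent = path.rsplit("/", 1)[0] or "/"
    digest = hashlib.md5(parent.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_servers

assignment = {"usr": 0, "etc": 1, "var": 2, "home": 2}
# With static subtrees, siblings stay together on one MDS:
a = static_subtree("/usr/include/stdio.h", assignment)
b = static_subtree("/usr/bin/ls", assignment)
```

Notice that `hash_dirs` sends `/usr/bin/ls` and `/usr/include/stdio.h` to (likely) different servers even though a client walking `/usr` will touch both: balance bought by destroying hierarchy.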
If there's just one MDS (which is a terrible idea), it manages metadata for the entire tree.
When the second one comes along, it will intelligently partition the work by taking a subtree.
When the third MDS arrives, it will attempt to split the tree again.
Same with the fourth.
DYNAMIC SUBTREE PARTITIONING
An MDS can actually take just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it's called dynamic subtree partitioning.
scalable
arbitrarily partition metadata
adaptive
move work from busy to idle servers
replicate hot metadata
efficient
hierarchical partition preserves locality
dynamic
daemons can join/leave
take over for failed nodes
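The adaptive part can be sketched as a toy balancer: the busiest server exports its hottest subtree to the idlest one. The load numbers and the data model are made up for illustration; the real MDS balancer is far more involved.

```python
# Toy balancer illustrating dynamic subtree partitioning: the busiest
# MDS exports its hottest subtree to the idlest MDS. Illustrative only.

def rebalance(mds_load):
    """mds_load: {mds_id: {subtree_path: load}} -> one migration or None."""
    totals = {m: sum(subs.values()) for m, subs in mds_load.items()}
    busiest = max(totals, key=totals.get)
    idlest = min(totals, key=totals.get)
    if totals[busiest] == totals[idlest]:
        return None                               # already balanced
    hot = max(mds_load[busiest], key=mds_load[busiest].get)
    mds_load[idlest][hot] = mds_load[busiest].pop(hot)
    return (hot, busiest, idlest)

load = {0: {"/home": 80, "/usr": 10}, 1: {"/var": 5}}
move = rebalance(load)
```

Because whole subtrees migrate, locality survives each move, which is exactly what distinguishes this from per-directory hashing.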
dynamic subtree partitioning
Dynamic partitioning
many directories
same directory
Failure recovery
Metadata replication and availability
Metadata cluster scaling
client protocol
highly stateful
consistent, fine-grained caching
seamless hand-off between ceph-mds daemons
when client traverses hierarchy
when metadata is migrated between servers
direct access to OSDs for file I/O
an example
mount -t ceph 1.2.3.4:/ /mnt
  3 ceph-mon RT
  2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
cd /mnt/foo/bar
  2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
ls -al
  open, readdir
  1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
  stat each file
  close
cp * /tmp
  N ceph-osd RT
ceph-mon
ceph-mds
ceph-osd
recursive accounting
ceph-mds tracks recursive directory stats
file sizes
file and directory counts
modification time
virtual xattrs present full stats
efficient
$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
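What the MDS is maintaining here can be sketched as a bottom-up aggregation: each directory carries the total size, file count, and newest mtime of everything beneath it, so one stat answers "how big is this subtree?". The tree model below is an illustrative assumption, not Ceph's internal representation.

```python
# Sketch of recursive directory accounting: aggregate size, file count,
# and newest mtime over an entire subtree. Illustrative model only.

def rstats(node):
    """node: {"files": {name: (size, mtime)}, "dirs": {name: child_node}}
    Returns (recursive_bytes, recursive_files, recursive_mtime)."""
    rbytes = sum(size for size, _ in node["files"].values())
    rfiles = len(node["files"])
    rmtime = max((m for _, m in node["files"].values()), default=0)
    for child in node["dirs"].values():
        cb, cf, cm = rstats(child)
        rbytes, rfiles, rmtime = rbytes + cb, rfiles + cf, max(rmtime, cm)
    return rbytes, rfiles, rmtime

tree = {"files": {"a": (100, 5)},
        "dirs": {"sub": {"files": {"b": (900, 9)}, "dirs": {}}}}
# rstats(tree) -> (1000, 2, 9)
```

The MDS keeps these aggregates up to date incrementally as metadata changes, which is why exposing them through virtual xattrs is cheap.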
snapshots
volume or subvolume snapshots unusable at petabyte scale
snapshot arbitrary subdirectories
simple interface
hidden '.snap' directory
no special tools
$ mkdir foo/.snap/one        # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776           # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one        # remove snapshot
multiple client implementations
Linux kernel client
mount -t ceph 1.2.3.4:/ /mnt
export (NFS), Samba (CIFS)
ceph-fuse
libcephfs.so
your app
Samba (CIFS)
Ganesha (NFS)
Hadoop (map/reduce)
(diagram: the kernel client; ceph-fuse atop libcephfs; your app, Samba, Ganesha, and Hadoop each linking libcephfs to serve SMB/CIFS, NFS, and map/reduce)
RADOS: AWESOME
LIBRADOS: AWESOME
RBD: AWESOME
RADOSGW: AWESOME
CEPH FS: NEARLY AWESOME
Ceph FS is feature-complete but still lacks the testing, quality assurance, and benchmarking work we feel it needs to recommend it for production use.
Path forward
Testing
Various workloads
Multiple active MDSs
Test automation
Simple workload generator scripts
Bug reproducers
Hacking
Bug squashing
Long-tail features
Integrations
Ganesha, Samba, *stacks
hard links?
rare
useful locality properties
intra-directory
parallel inter-directory
on miss, file objects provide per-file backpointers
degenerates to log(n) lookups
optimistic read complexity
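The backpointer idea can be sketched as follows: on a cache miss for a hard-linked inode, the file's object records an ancestor chain that the MDS walks to find the authoritative dentry. The data structures below are illustrative assumptions, not Ceph's actual encoding.

```python
# Sketch of per-file backpointers: resolve an inode to a path by
# walking the ancestor chain stored with the file object. Each hop is
# one lookup, so a chain of depth d costs O(d) lookups (log-like in
# tree size), matching the "degenerates to log(n) lookups" claim.

def resolve(ino, backpointers, directories):
    """backpointers: {ino: [(parent_dir_ino, dentry_name), ...]} from
    root downward; directories: {dir_ino: {name: child_ino}}."""
    path = []
    for dir_ino, name in backpointers[ino]:
        assert name in directories[dir_ino], "stale backpointer"
        path.append(name)
    return "/" + "/".join(path)

backpointers = {42: [(1, "home"), (10, "alice"), (11, "notes.txt")]}
directories = {1: {"home": 10}, 10: {"alice": 11}, 11: {"notes.txt": 42}}
# resolve(42, ...) -> "/home/alice/notes.txt"
```

The optimistic part is that the cached chain is tried first and only re-walked from the root if a hop turns out to be stale.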
what is journaled
lots of state
journaling is expensive up-front, cheap to recover
non-journaled state is cheap, but complex (and somewhat expensive) to recover
yes:
client sessions
actual fs metadata modifications
no:
cache provenance
open files
lazy flush
client modifications may not be durable until fsync(), or visible to another client
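That durability contract is the standard POSIX one, so it can be demonstrated with plain file I/O: buffered writes may sit in the client cache, and only fsync() forces them out. This runs on any POSIX-ish filesystem, CephFS included.

```python
# Minimal demonstration of the fsync() durability contract: a write is
# only guaranteed durable (and visible to other clients) after fsync.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.write(fd, b"important bytes")   # may linger in the client cache
os.fsync(fd)                       # only now may it be considered durable
os.close(fd)
```

Applications that care about crash consistency on CephFS must call fsync() at their commit points, exactly as on a local filesystem.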