Philippe [email protected]
Awesome distributed storage system
Ceph-History
• At the beginning (2006): part of Sage Weil's Ph.D. research at the University of California, Santa Cruz
• After graduation (2007): open-sourced, with 3 main developers
• 2011: S. Weil created Inktank Storage to provide professional services and support for Ceph (~60 developers)
• April 2014: Red Hat acquired Inktank ($175 million)
Open Source Project
• www.ceph.com
• www.github.com/ceph
• 9th release (Infernalis, 11/2015)
Ceph-Target
Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability.
• Designed for commodity hardware
• Software-based
• Open source
• Self managing / healing
• Self balancing
• Painless scaling
• No SPOF
• Object Storage (S3, Swift)
• Block Storage
• File System (POSIX)
[Diagram: CAP theorem triangle (Consistency, Availability, Partition tolerance)]
Architecture outline
RBD
• RADOS Block Device
– Thin provisioning
– Snapshot / Clone
– Can be used as an OpenStack Cinder backend
– Can be used by libvirt
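As an illustration, a minimal sketch using the Python librbd bindings (python-rados / python-rbd) to create and write an RBD image; the pool name 'rbd', the image name 'demo' and the size are placeholders, not values from the slides:

import rados, rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # connect with the default client config
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                        # the pool backing the images

rbd.RBD().create(ioctx, 'demo', 4 * 1024**3)             # 4 GiB thin-provisioned image
with rbd.Image(ioctx, 'demo') as image:
    image.write(b'hello from rbd', 0)                    # write at offset 0
    image.create_snap('first-snapshot')                  # snapshot support mentioned above

ioctx.close()
cluster.shutdown()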
CephFS
• File System
– POSIX compliant (legacy)
– Network file system
– Plugin for Hadoop (HDFS alternative)
– Kernel client (>= 2.6.34, 2010) or FUSE
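Assuming the python-cephfs (libcephfs) bindings are available, a minimal sketch of a client using CephFS directly; the directory and file names are placeholders:

import os
import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                                         # attach to the file system

fs.mkdir('/demo', 0o755)                           # directory metadata is handled by the MDS
fd = fs.open('/demo/hello.txt', os.O_CREAT | os.O_WRONLY, 0o644)
fs.write(fd, b'hello cephfs', 0)                   # file data goes to the data pool
fs.close(fd)

fs.unmount()
fs.shutdown()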
RGW
• RADOS Gateway: HTTP REST gateway for the RADOS object store
– AWS S3 compliant
– OpenStack Swift compliant
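Because the S3 API is exposed, a standard S3 client can talk to RGW directly. A minimal sketch with boto3; the endpoint URL, bucket name and credentials are placeholders for whatever the gateway hands out:

import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',   # placeholder RGW endpoint
    aws_access_key_id='ACCESS_KEY',               # placeholder credentials
    aws_secret_access_key='SECRET_KEY',
)

s3.create_bucket(Bucket='demo-bucket')
s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'stored in RADOS via RGW')
print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())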
What services can easily be built with Ceph?
• A Dropbox-like service (Ceph RGW + ownCloud)
• A block volume provider (Ceph RBD)
• An NFS-like service (CephFS)
All with the same Ceph cluster
A Ceph cluster
Ceph-Concept
• Object Storage Daemons (OSD): store the objects
• Monitor servers (MON): watch over the storage network, maintain the group membership, ensure consistency (strong consistency)
• Metadata Servers (MDS): store the file system structure
• A service uses a Pool that is composed of Placement Groups
– a placement group is a storage space distributed over n OSDs
• CRUSH map: defines placement rules
• Replication vs. Erasure Code
• Cache tiering
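To make the concept concrete, a minimal sketch using the Python librados bindings: a client connects to the cluster and does a put/get on an object in a pool. The pool name 'demo-pool' and the object name are placeholders and the pool is assumed to exist already:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')    # monitors are read from the config
cluster.connect()

ioctx = cluster.open_ioctx('demo-pool')                   # I/O context on an existing pool
ioctx.write_full('greeting', b'hello ceph')               # CRUSH decides which PG/OSDs hold it
print(ioctx.read('greeting'))                              # read it back

ioctx.close()
cluster.shutdown()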
Entities
[Diagram: a Ceph client on the client host performs CRUD operations on a Pool; the Pool is made of placement groups PG0, PG1, PG2, …, PGn, and each placement group maps to a set of OSDs (e.g. PG0 → OSD0a, OSD0b, OSD0c) located on the OSD hosts]
Pool Type: Resiliency
• Replicated
– each PG is composed of n OSDs
– one OSD is designated as Primary
– I/O is done on the Primary
– each object is copied to the other OSDs by the Primary (strong consistency: ack after copy)
• Erasure Code
– a pool can have an erasure code profile (parameters k, m)
– each PG is composed of k+m OSDs
– one OSD is designated as Primary
– I/O is done on the Primary (encode and decode)
– each object is encoded into k+m chunks by the Primary and then spread over the k+m OSDs (strong consistency: ack after creation)
– the default erasure code library is jerasure; other libraries can be loaded dynamically as plugins (a toy illustration of the k+m idea follows)
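To illustrate the k+m idea only (this is not Ceph's actual jerasure/Reed-Solomon coding), here is a toy Python sketch with k data chunks and a single XOR parity chunk (m = 1): any one missing chunk can be rebuilt from the surviving k.

from functools import reduce

def encode(data: bytes, k: int):
    """Split data into k equal chunks and append one XOR parity chunk (m = 1)."""
    size = -(-len(data) // k)                       # ceiling division
    chunks = [data[i*size:(i+1)*size].ljust(size, b'\0') for i in range(k)]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)
    return chunks + [parity]

def rebuild(chunks, lost_index):
    """Recover the chunk at lost_index by XOR-ing all the surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != lost_index and c is not None]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)

pieces = encode(b'an object stored in an erasure coded pool', k=4)
pieces[2] = None                                    # simulate losing one OSD's chunk
print(rebuild(pieces, 2))                           # the lost chunk is reconstructed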
Erasure code overview
Object Storage Device (OSD)
[Diagram: the OSD daemon runs on top of a local file system (xattrs), which runs on top of a disk]
An OSD is primary for some objects:
– responsible for resiliency
– responsible for coherency
– responsible for re-balancing
– responsible for recovery
An OSD is secondary for other objects:
– under the control of the primary
– capable of becoming primary
Atomic transactions: put, get, delete, …
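As an illustration of the atomic-transaction model, the librados Python bindings let a client bundle several mutations on one object into a single write operation that the primary OSD applies atomically. A sketch; 'demo-pool', the object name and the key/value pairs are placeholders:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('demo-pool')

# bundle mutations: the whole operation is applied atomically by the primary OSD
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('owner',), (b'alice',))     # key/value metadata on the object
    ioctx.operate_write_op(op, 'greeting')

ioctx.set_xattr('greeting', 'lang', b'en')          # xattrs, stored via the backing file system
ioctx.remove_object('greeting')                     # delete is atomic as well

ioctx.close()
cluster.shutdown()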
Object Placement
CRUSH: Controlled Replication Under Scalable Hashing
A pseudo-random, deterministic data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster.
This avoids the need for an index server to coordinate reads and writes.
Based on:
• OSD weights
• placement rules
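Conceptually (a simplified sketch, not Ceph's real rjenkins hash or the actual CRUSH algorithm), every client can compute an object's location on its own: hash the object name to a placement group, then map the PG onto OSDs using the shared map, so no index server lookup is needed.

import hashlib

PG_NUM = 64                                    # placement groups in the pool
OSDS = [f'osd.{i}' for i in range(8)]          # toy cluster
REPLICAS = 3

def object_to_pg(name: str) -> int:
    """Hash the object name to a placement group (Ceph uses rjenkins, not md5)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % PG_NUM

def pg_to_osds(pg: int) -> list:
    """Stand-in for CRUSH: deterministically pick REPLICAS distinct OSDs for a PG."""
    start = pg % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(REPLICAS)]

pg = object_to_pg('greeting')
print(pg, pg_to_osds(pg))                      # every client computes the same answer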
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

host ceph-osd-ssd-server-1 {
    id -1
    alg straw
    hash 0
    item osd.0 weight 1.00
    item osd.1 weight 1.00
}
host ceph-osd-ssd-server-2 {
    id -2
    alg straw
    hash 0
    item osd.2 weight 1.00
    item osd.3 weight 1.00
}
host ceph-osd-platter-server-1 {
    id -3
    alg straw
    hash 0
    item osd.4 weight 1.00
    item osd.5 weight 1.00
}
host ceph-osd-platter-server-2 {
    id -4
    alg straw
    hash 0
    item osd.6 weight 1.00
    item osd.7 weight 1.00
}
root platter {
    id -5
    alg straw
    hash 0
    item ceph-osd-platter-server-1 weight 2.00
    item ceph-osd-platter-server-2 weight 2.00
}
root ssd {
    id -6
    alg straw
    hash 0
    item ceph-osd-ssd-server-1 weight 2.00
    item ceph-osd-ssd-server-2 weight 2.00
}

rule data {
    ruleset 0
    type replicated
    min_size 2
    max_size 2
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule platter {
    ruleset 3
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd {
    ruleset 4
    type replicated
    min_size 0
    max_size 4
    step take ssd
    step chooseleaf firstn 0 type host
    step emit
}
OSDs, buckets, and rules
Cache Tiering
Two modes: write-back and read-only
Monitor
• Maintains cluster state and history
– MON map
– OSD map
– PG map
– CRUSH map
– MDS map
• Every Ceph client has a list of MON addresses
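For illustration, a librados client can be pointed at the monitors explicitly instead of relying only on ceph.conf (a minimal sketch; the addresses are placeholders, and the conf overrides are applied on top of the usual config file):

import rados

# the client only needs to reach the monitors; everything else is learned from the maps
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                      conf={'mon_host': '10.0.0.1,10.0.0.2,10.0.0.3'})
cluster.connect()
print(cluster.get_fsid())        # cluster id, obtained from the monitors
cluster.shutdown()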
Dependability
• Monitors use a consensus algorithm (Paxos) to maintain the maps
– all the MONs share the same map view (strong consistency)
– based on a quorum (majority): if half or more of the monitors crash or become unreachable, the cluster is no longer available
Metadata Server (MDS)
• Stores the metadata of CephFS (permission bits, ACLs, ownership, …)
• The data is stored in a Ceph pool (not locally)
• Caches the metadata
• Provides high availability of metadata (multiple MDSs)
MDS
An adaptive metadata cluster architecture based on Dynamic Subtree Partitioning, which intelligently distributes responsibility for managing the file system directory hierarchy among the available MDSs in the cluster.
Ceph-Status
• Portal: www.ceph.com
• Code: https://github.com/ceph/ceph
• Versions
– Infernalis (11/2015)
– Hammer (04/2015)
– Giant (10/2014)
– Firefly (05/2014)
– Emperor (11/2013)
– Dumpling (08/2013)
– Cuttlefish (05/2013)
– Bobtail (01/2013)
– Argonaut (07/2012)
• License: LGPL v2.1, BSD, MIT, Apache 2, …
• On March 19, 2010, Linus Torvalds merged the Ceph client into Linux kernel version 2.6.34
• Active contributors: ~120
• Very active community
inkScope is a Ceph visualization and admin interface
Open source: https://github.com/inkscope/inkscope
Version 1.3 (23/12/2015)
Architecture