Page 1: Awesome distributed storage system

Philippe Raipin, [email protected]

Awesome distributed storage system

Page 2: Awesome distributed storage system

Ceph-History

• At the beginning (2006): part of Sage Weil's Ph.D. research at the University of California, Santa Cruz
• After graduation (2007): open-sourced, with 3 main developers
• 2011: S. Weil created Inktank Storage to provide professional services and support for Ceph (~60 developers)
• April 2014: Red Hat acquired Inktank ($175 million)

Page 3: Awesome distributed storage system

Open Source Project

• www.ceph.com
• www.github.com/ceph

• 9th release (Infernalis, 11/2015)

Page 4: Awesome distributed storage system

Ceph-Target

Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability.

• Designed for commodity hardware
• Software-based
• Open source
• Self-managing / self-healing
• Self-balancing
• Painless scaling
• No SPOF
• Object storage (S3, Swift)
• Block storage
• File system (POSIX)

[Diagram: CAP theorem trade-off]

Page 5: Awesome distributed storage system

Architecture outline

Page 6: Awesome distributed storage system

RBD

• RADOS Block Device
– Thin provisioning
– Snapshot / Clone
– Can be used as an OpenStack Cinder backend
– Can be used by libvirt
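
A minimal sketch of those features through the Python rados/rbd bindings (python3-rados, python3-rbd); the config path, the 'rbd' pool and the image name are assumptions for illustration:

import rados
import rbd

# Connect with the default admin config/keyring (assumed paths).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')              # assumes a pool named 'rbd' exists

# Thin provisioning: creating a 1 GiB image consumes no space until data is written.
rbd.RBD().create(ioctx, 'demo-image', 1 * 1024**3)

image = rbd.Image(ioctx, 'demo-image')
image.write(b'hello from librbd', 0)           # write at offset 0
image.create_snap('first-snap')                # point-in-time snapshot
image.close()

ioctx.close()
cluster.shutdown()

OpenStack Cinder and libvirt/QEMU drive the same librbd library underneath.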

Page 7: Awesome distributed storage system

CephFS

• File System
– POSIX compliant (legacy)
– Network file system
– Plugin for Hadoop (HDFS alternative)
– Kernel client (>= 2.6.34, 2010) or FUSE
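
Besides the kernel and FUSE clients, the file system can also be reached through the libcephfs Python binding; a small sketch assuming python3-cephfs is installed and the cluster has an MDS (paths and names are placeholders, and method details may vary slightly between releases):

import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                                      # mount the CephFS root

fs.mkdir('/demo', 0o755)
fd = fs.open('/demo/hello.txt', 'w', 0o644)     # create and open for writing
fs.write(fd, b'metadata handled by the MDS', 0) # data goes to OSDs, metadata to the MDS
fs.close(fd)

fs.unmount()
fs.shutdown()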

Page 8: Awesome distributed storage system

RGW

• RADOS Gateway: an HTTP REST gateway to the RADOS object store
– AWS S3 compliant
– OpenStack Swift compliant
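
Since the gateway speaks the S3 dialect, any stock S3 client can be pointed at it; a sketch with boto3, where the endpoint URL and the radosgw user's keys are placeholders:

import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',  # placeholder RGW endpoint
    aws_access_key_id='RGW_ACCESS_KEY',          # placeholder radosgw user credentials
    aws_secret_access_key='RGW_SECRET_KEY',
)

s3.create_bucket(Bucket='demo-bucket')
s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'object served by RADOS')
print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())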

Page 9: Awesome distributed storage system

What services can easily be built with Ceph?

• A Dropbox-like service (Ceph RGW + OwnCloud)
• A volume provider (Ceph RBD)
• An NFS-like service (CephFS)

All with the same Ceph cluster

Page 10: Awesome distributed storage system

a Ceph cluster

Page 11: Awesome distributed storage system

Ceph-Concept

• Object storage daemons (OSD): store the objects
• Monitor servers (MON): watch over the storage network, maintain the group membership and ensure (strong) consistency
• Metadata servers (MDS): store the file system structure
• A service uses a Pool, which is composed of Placement Groups
– a placement group is a storage space distributed over n OSDs
• CRUSH map: defines the placement rules
• Replication vs. Erasure Code
• Cache tiering
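
A quick way to see these pieces from a client is the librados Python binding (python3-rados); the config path is the usual default and nothing else is assumed:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()                    # contacts the MONs and fetches the cluster maps

print(cluster.list_pools())          # every pool is backed by placement groups
print(cluster.get_cluster_stats())   # raw capacity, usage and object count

cluster.shutdown()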

Page 12: Awesome distributed storage system

Entities

[Diagram] A Ceph client performs CRUD operations on a Pool; the Pool is composed of Placement Groups (PG0 … PGn), and each PG is mapped onto a set of OSDs (e.g., OSD0a, OSD0b, OSD0c) running on the OSD hosts.

Page 13: Awesome distributed storage system

Pool Type / Resiliency

• Replicated
– each PG is composed of n OSDs
– one OSD is designated as Primary
– I/O is done on the Primary
– each object is copied to the other OSDs by the Primary (strong consistency: ack after copy)

• Erasure Code
– a pool can have an erasure code profile (parameters k, m)
– each PG is composed of k+m OSDs
– one OSD is designated as Primary
– I/O is done on the Primary (encode and decode)
– each object is encoded into k+m chunks by the Primary and then spread over the k+m OSDs (strong consistency: ack after creation)
– the default erasure code library is jerasure; other libraries can be loaded dynamically (plugins)
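
To make the trade-off concrete: with a profile of k=8, m=4, each object is split into 8 data chunks plus 4 coding chunks stored on 12 different OSDs; the pool survives the loss of any 4 of them while consuming only (k+m)/k = 1.5x the raw space, whereas 3-way replication consumes 3x the raw space to survive 2 losses.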

Page 14: Awesome distributed storage system

Erasure code overview

Page 15: Awesome distributed storage system

Object Storage Device (OSD)

[Diagram] OSD stack: OSD daemon → local file system (xattr) → disk

The OSD is primary for some objects:
– responsible for resiliency
– responsible for coherency
– responsible for re-balancing
– responsible for recovery

The OSD is secondary for some objects:
– under the control of the primary
– capable of becoming primary

Atomic transactions: put, get, delete, …
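
Those primitives are exactly what librados hands to clients; a short python3-rados sketch (the pool and object names are placeholders), including an extended attribute, which the OSD stores via its backing file system (the xattr layer above):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')             # placeholder pool name

ioctx.write_full('doc-1', b'payload')            # put: atomically replaces the whole object
ioctx.set_xattr('doc-1', 'owner', b'philippe')   # per-object extended attribute
print(ioctx.read('doc-1'))                       # get
print(ioctx.get_xattr('doc-1', 'owner'))
ioctx.remove_object('doc-1')                     # delete

ioctx.close()
cluster.shutdown()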

Page 16: Awesome distributed storage system

Object Placement

Page 17: Awesome distributed storage system

CRUSH: Controlled Replication Under Scalable Hashing

A pseudo-random, deterministic data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster.

This avoids the need for an index server to coordinate reads and writes.

Based on:
• OSD weights
• rules
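
The placement pipeline is: object name → PG (hash of the name folded onto the pool's pg_num) → CRUSH(PG, cluster map, rule) → ordered set of OSDs. A deliberately simplified Python sketch of the first step; Ceph actually uses its own rjenkins hash and straw buckets, so this is illustrative only, not the real algorithm:

import hashlib

def object_to_pg(obj_name: str, pg_num: int) -> int:
    # Hash the object name and fold it onto the pool's placement groups.
    # CRUSH then maps (pool, pg) to n OSDs using the weights and rules below;
    # every client computes the same answer, so no index server is consulted.
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return h % pg_num

print(object_to_pg('greeting', pg_num=128))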

Page 18: Awesome distributed storage system

OSDs (devices):

device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

Buckets:

host ceph-osd-ssd-server-1 {
    id -1
    alg straw
    hash 0
    item osd.0 weight 1.00
    item osd.1 weight 1.00
}
host ceph-osd-ssd-server-2 {
    id -2
    alg straw
    hash 0
    item osd.2 weight 1.00
    item osd.3 weight 1.00
}
host ceph-osd-platter-server-1 {
    id -3
    alg straw
    hash 0
    item osd.4 weight 1.00
    item osd.5 weight 1.00
}
host ceph-osd-platter-server-2 {
    id -4
    alg straw
    hash 0
    item osd.6 weight 1.00
    item osd.7 weight 1.00
}
root platter {
    id -5
    alg straw
    hash 0
    item ceph-osd-platter-server-1 weight 2.00
    item ceph-osd-platter-server-2 weight 2.00
}
root ssd {
    id -6
    alg straw
    hash 0
    item ceph-osd-ssd-server-1 weight 2.00
    item ceph-osd-ssd-server-2 weight 2.00
}

Rules:

rule data {
    ruleset 0
    type replicated
    min_size 2
    max_size 2
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule platter {
    ruleset 3
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd {
    ruleset 4
    type replicated
    min_size 0
    max_size 4
    step take ssd
    step chooseleaf firstn 0 type host
    step emit
}
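
Reading the example: a replicated pool attached to ruleset 4 ("ssd") only places data under the "ssd" root, and "step chooseleaf firstn 0 type host" picks each replica on a different host, so losing an entire server never takes out all copies of a placement group.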

Page 19: Awesome distributed storage system

Cache Tiering

Two modes: write-back and read-only

Page 20: Awesome distributed storage system

Monitor

• Maintains the cluster state and history
– Mon Map
– OSD Map
– PG Map
– CRUSH Map
– MDS Map

• Every Ceph client has a list of monitor addresses
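
That list usually comes from /etc/ceph/ceph.conf, but it can also be handed to the client directly; a hedged python3-rados sketch where the monitor addresses and keyring path are placeholders:

import rados

cluster = rados.Rados(
    rados_id='admin',
    conf=dict(
        mon_host='192.168.1.10,192.168.1.11,192.168.1.12',  # placeholder MON addresses
        keyring='/etc/ceph/ceph.client.admin.keyring',       # placeholder keyring path
    ),
)
cluster.connect()
print(cluster.get_fsid())    # cluster id, as reported by the monitors
cluster.shutdown()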

Page 21: Awesome distributed storage system

Dependability

• Monitors use a consensus algorithm (Paxos) to maintain the maps
– all the monitors share the same map view (strong consistency)
– availability is quorum-based: if half or more of the monitors crash or become unreachable, the cluster is no longer available
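
Concretely, a quorum requires a strict majority: 3 monitors tolerate 1 failure and 5 tolerate 2, while an even count adds no extra fault tolerance, which is why monitors are deployed in odd numbers.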

Page 22: Awesome distributed storage system

Metadata Server (MDS)

• Stores the CephFS metadata (permission bits, ACLs, ownership, …)
• The metadata is stored in a Ceph pool (not locally)
• Caches the metadata
• Provides high availability of metadata (multiple MDSs)

Page 23: Awesome distributed storage system

MDS

An adaptive metadata cluster architecture based on Dynamic Subtree Partitioning, which adaptively and intelligently distributes responsibility for the file system directory hierarchy among the available MDSs in the MDS cluster.

Page 24: Awesome distributed storage system

Ceph-Status

• Portal: www.ceph.com
• Code: https://github.com/ceph/ceph
• Versions
– Infernalis (11/2015)
– Hammer (04/2015)
– Giant (10/2014)
– Firefly (05/2014)
– Emperor (11/2013)
– Dumpling (08/2013)
– Cuttlefish (05/2013)
– Bobtail (01/2013)
– Argonaut (07/2012)

• License: LGPL v2.1, BSD, MIT, Apache 2, …
• On March 19, 2010, Linus Torvalds merged the Ceph client into Linux kernel version 2.6.34.
• Active contributors: ~120
• Very active community

Page 25: Awesome distributed storage system

inkScope is a Ceph visualization and admin interface

Open source: https://github.com/inkscope/inkscope
Version 1.3 (23/12/2015)

Page 26: Awesome distributed storage system

Architecture

Page 27: Awesome distributed storage system
Page 28: Awesome distributed storage system
Page 29: Awesome distributed storage system