Page 1: Awesome distributed storage system

Philippe Raipin, [email protected]

Awesome distributed storage system

Page 2: Awesome distributed storage system

Ceph-History

• At the beginning (2006): part of Sage Weil's Ph.D. research at the University of California, Santa Cruz
• After graduation (2007): open-sourced, with 3 main developers
• 2011: S. Weil created Inktank Storage to provide professional services and support for Ceph (~60 developers)
• April 2014: Red Hat acquired Inktank ($175 million)

Page 3: Awesome distributed storage system

Open Source Project

• www.ceph.com
• www.github.com/ceph

• 9th release (Infernalis, 11/2015)

Page 4: Awesome distributed storage system

Ceph-Target

Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability.

• Designed for commodity hardware
• Software-based
• Open source
• Self-managing / self-healing
• Self-balancing
• Painless scaling
• No SPOF
• Object storage (S3, Swift)
• Block storage
• File system (POSIX)

[Diagram: CAP theorem trade-off]

Page 5: Awesome distributed storage system

Architecture outline

Page 6: Awesome distributed storage system

RBD

• RADOS Block Device
– Thin provisioning
– Snapshot / Clone
– Can be used as an OpenStack Cinder backend
– Can be used by libvirt
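
A minimal sketch of those features through the Python rados/rbd bindings (python3-rados, python3-rbd); the config path, the 'rbd' pool and the image name are assumptions for illustration:

import rados
import rbd

# Connect with the default admin config/keyring (assumed paths).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')              # assumes a pool named 'rbd' exists

# Thin provisioning: creating a 1 GiB image consumes no space until data is written.
rbd.RBD().create(ioctx, 'demo-image', 1 * 1024**3)

image = rbd.Image(ioctx, 'demo-image')
image.write(b'hello from librbd', 0)           # write at offset 0
image.create_snap('first-snap')                # point-in-time snapshot
image.close()

ioctx.close()
cluster.shutdown()

OpenStack Cinder and libvirt/QEMU drive the same librbd library underneath.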

Page 7: Awesome distributed storage system

CephFS

• File System
– POSIX compliant (legacy)
– Network file system
– Plugin for Hadoop (HDFS alternative)
– Kernel client (>= 2.6.34, 2010) or FUSE
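
Besides the kernel and FUSE clients, the file system can also be reached through the libcephfs Python binding; a small sketch assuming python3-cephfs is installed and the cluster has an MDS (paths and names are placeholders, and method details may vary slightly between releases):

import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                                      # mount the CephFS root

fs.mkdir('/demo', 0o755)
fd = fs.open('/demo/hello.txt', 'w', 0o644)     # create and open for writing
fs.write(fd, b'metadata handled by the MDS', 0) # data goes to OSDs, metadata to the MDS
fs.close(fd)

fs.unmount()
fs.shutdown()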

Page 8: Awesome distributed storage system

RGW

• RADOS Gateway: an HTTP REST gateway to the RADOS object store
– AWS S3 compliant
– OpenStack Swift compliant
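
Since the gateway speaks the S3 dialect, any stock S3 client can be pointed at it; a sketch with boto3, where the endpoint URL and the radosgw user's keys are placeholders:

import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',  # placeholder RGW endpoint
    aws_access_key_id='RGW_ACCESS_KEY',          # placeholder radosgw user credentials
    aws_secret_access_key='RGW_SECRET_KEY',
)

s3.create_bucket(Bucket='demo-bucket')
s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'object served by RADOS')
print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())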

Page 9: Awesome distributed storage system

What services can easily be built with Ceph?

• A Dropbox-like service (Ceph RGW + OwnCloud)
• A volume provider (Ceph RBD)
• An NFS-like service (CephFS)

All with the same Ceph cluster

Page 10: Awesome distributed storage system

a Ceph cluster

Page 11: Awesome distributed storage system

Ceph-Concept

• Object storage daemons (OSD): store the objects
• Monitor servers (MON): watch over the storage network, maintain the group membership and ensure (strong) consistency
• Metadata servers (MDS): store the file system structure
• A service uses a Pool, which is composed of Placement Groups
– a placement group is a storage space distributed over n OSDs
• CRUSH map: defines the placement rules
• Replication vs. Erasure Code
• Cache tiering
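
A quick way to see these pieces from a client is the librados Python binding (python3-rados); the config path is the usual default and nothing else is assumed:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()                    # contacts the MONs and fetches the cluster maps

print(cluster.list_pools())          # every pool is backed by placement groups
print(cluster.get_cluster_stats())   # raw capacity, usage and object count

cluster.shutdown()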

Page 12: Awesome distributed storage system

Entities

[Diagram] A Ceph client performs CRUD operations on a Pool; the Pool is composed of Placement Groups (PG0 … PGn), and each PG is mapped onto a set of OSDs (e.g., OSD0a, OSD0b, OSD0c) running on the OSD hosts.

Page 13: Awesome distributed storage system

Pool Type / Resiliency

• Replicated
– each PG is composed of n OSDs
– one OSD is designated as Primary
– I/O is done on the Primary
– each object is copied to the other OSDs by the Primary (strong consistency: ack after copy)

• Erasure Code
– a pool can have an erasure code profile (parameters k, m)
– each PG is composed of k+m OSDs
– one OSD is designated as Primary
– I/O is done on the Primary (encode and decode)
– each object is encoded into k+m chunks by the Primary and then spread over the k+m OSDs (strong consistency: ack after creation)
– the default erasure code library is jerasure; other libraries can be loaded dynamically (plugins)
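
To make the trade-off concrete: with a profile of k=8, m=4, each object is split into 8 data chunks plus 4 coding chunks stored on 12 different OSDs; the pool survives the loss of any 4 of them while consuming only (k+m)/k = 1.5x the raw space, whereas 3-way replication consumes 3x the raw space to survive 2 losses.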

Page 14: Awesome distributed storage system

Erasure code overview

Page 15: Awesome distributed storage system

Object Storage Device (OSD)

[Diagram] OSD stack: OSD daemon → local file system (xattr) → disk

The OSD is primary for some objects:
– responsible for resiliency
– responsible for coherency
– responsible for re-balancing
– responsible for recovery

The OSD is secondary for some objects:
– under the control of the primary
– capable of becoming primary

Atomic transactions: put, get, delete, …
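
Those primitives are exactly what librados hands to clients; a short python3-rados sketch (the pool and object names are placeholders), including an extended attribute, which the OSD stores via its backing file system (the xattr layer above):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')             # placeholder pool name

ioctx.write_full('doc-1', b'payload')            # put: atomically replaces the whole object
ioctx.set_xattr('doc-1', 'owner', b'philippe')   # per-object extended attribute
print(ioctx.read('doc-1'))                       # get
print(ioctx.get_xattr('doc-1', 'owner'))
ioctx.remove_object('doc-1')                     # delete

ioctx.close()
cluster.shutdown()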

Page 16: Awesome distributed storage system

Object Placement

Page 17: Awesome distributed storage system

CRUSH: Controlled Replication Under Scalable Hashing

A pseudo-random, deterministic data distribution algorithm that efficiently and robustly distributes object replicas across a heterogeneous, structured storage cluster.

This avoids the need for an index server to coordinate reads and writes.

Based on:
• OSD weights
• rules
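
The placement pipeline is: object name → PG (hash of the name folded onto the pool's pg_num) → CRUSH(PG, cluster map, rule) → ordered set of OSDs. A deliberately simplified Python sketch of the first step; Ceph actually uses its own rjenkins hash and straw buckets, so this is illustrative only, not the real algorithm:

import hashlib

def object_to_pg(obj_name: str, pg_num: int) -> int:
    # Hash the object name and fold it onto the pool's placement groups.
    # CRUSH then maps (pool, pg) to n OSDs using the weights and rules below;
    # every client computes the same answer, so no index server is consulted.
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return h % pg_num

print(object_to_pg('greeting', pg_num=128))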

Page 18: Awesome distributed storage system

OSDs (devices):

device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

Buckets:

host ceph-osd-ssd-server-1 {
    id -1
    alg straw
    hash 0
    item osd.0 weight 1.00
    item osd.1 weight 1.00
}
host ceph-osd-ssd-server-2 {
    id -2
    alg straw
    hash 0
    item osd.2 weight 1.00
    item osd.3 weight 1.00
}
host ceph-osd-platter-server-1 {
    id -3
    alg straw
    hash 0
    item osd.4 weight 1.00
    item osd.5 weight 1.00
}
host ceph-osd-platter-server-2 {
    id -4
    alg straw
    hash 0
    item osd.6 weight 1.00
    item osd.7 weight 1.00
}
root platter {
    id -5
    alg straw
    hash 0
    item ceph-osd-platter-server-1 weight 2.00
    item ceph-osd-platter-server-2 weight 2.00
}
root ssd {
    id -6
    alg straw
    hash 0
    item ceph-osd-ssd-server-1 weight 2.00
    item ceph-osd-ssd-server-2 weight 2.00
}

Rules:

rule data {
    ruleset 0
    type replicated
    min_size 2
    max_size 2
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule platter {
    ruleset 3
    type replicated
    min_size 0
    max_size 10
    step take platter
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd {
    ruleset 4
    type replicated
    min_size 0
    max_size 4
    step take ssd
    step chooseleaf firstn 0 type host
    step emit
}
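
Reading the example: a replicated pool attached to ruleset 4 ("ssd") only places data under the "ssd" root, and "step chooseleaf firstn 0 type host" picks each replica on a different host, so losing an entire server never takes out all copies of a placement group.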

Page 19: Awesome distributed storage system

Cache Tiering

Two modes: write-back and read-only

Page 20: Awesome distributed storage system

Monitor

• Maintains the cluster state and history
– Mon Map
– OSD Map
– PG Map
– CRUSH Map
– MDS Map

• Every Ceph client has a list of monitor addresses
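
That list usually comes from /etc/ceph/ceph.conf, but it can also be handed to the client directly; a hedged python3-rados sketch where the monitor addresses and keyring path are placeholders:

import rados

cluster = rados.Rados(
    rados_id='admin',
    conf=dict(
        mon_host='192.168.1.10,192.168.1.11,192.168.1.12',  # placeholder MON addresses
        keyring='/etc/ceph/ceph.client.admin.keyring',       # placeholder keyring path
    ),
)
cluster.connect()
print(cluster.get_fsid())    # cluster id, as reported by the monitors
cluster.shutdown()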

Page 21: Awesome distributed storage system

Dependability

• Monitors use a consensus algorithm (Paxos) to maintain the maps
– all the monitors share the same map view (strong consistency)
– availability is quorum-based: if half or more of the monitors crash or become unreachable, the cluster is no longer available
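
Concretely, a quorum requires a strict majority: 3 monitors tolerate 1 failure and 5 tolerate 2, while an even count adds no extra fault tolerance, which is why monitors are deployed in odd numbers.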

Page 22: Awesome distributed storage system

Metadata Server (MDS)

• Stores the CephFS metadata (permission bits, ACLs, ownership, …)
• The metadata is stored in a Ceph pool (not locally)
• Caches the metadata
• Provides high availability of metadata (multiple MDSs)

Page 23: Awesome distributed storage system

MDS

An adaptive metadata cluster architecture based on Dynamic Subtree Partitioning, which adaptively and intelligently distributes responsibility for the file system directory hierarchy among the available MDSs in the MDS cluster.

Page 24: Awesome distributed storage system

Ceph-Status

• Portal: www.ceph.com
• Code: https://github.com/ceph/ceph
• Versions
– Infernalis (11/2015)
– Hammer (04/2015)
– Giant (10/2014)
– Firefly (05/2014)
– Emperor (11/2013)
– Dumpling (08/2013)
– Cuttlefish (05/2013)
– Bobtail (01/2013)
– Argonaut (07/2012)

• License: LGPL v2.1, BSD, MIT, Apache 2, …
• On March 19, 2010, Linus Torvalds merged the Ceph client into Linux kernel version 2.6.34.
• Active contributors: ~120
• Very active community

Page 25: Awesome distributed storage system

inkScope is a Ceph visualization and admin interface

Open source: https://github.com/inkscope/inkscope
Version 1.3 (23/12/2015)

Page 26: Awesome distributed storage system

Architecture

Page 27: Awesome distributed storage system
Page 28: Awesome distributed storage system
Page 29: Awesome distributed storage system