October 2014 overview of the Ceph distributed storage system architecture, integration with OpenStack, and plans for future development.
14 OCT 2014, Colorado OpenStack Meetup
HISTORICAL TIMELINE

2004: Project starts at UCSC
2006: Open source
2010: Mainline Linux kernel
2011: OpenStack integration
MAY 2012: Launch of Inktank
2012: CloudStack integration
SEPT 2012: Production-ready Ceph
2013: Xen integration
OCT 2013: Inktank Ceph Enterprise launch
FEB 2014: RHEL-OSP certification
APR 2014: Inktank acquired by Red Hat

10 years in the making
Copyright © 2014 by Inktank
OPENSTACK USER SURVEY, 05/2014

(chart: Ceph adoption among OpenStack deployments, broken out by Dev/QA, Proof of Concept, and Production)
A STORAGE REVOLUTION

PROPRIETARY HARDWARE + PROPRIETARY SOFTWARE + SUPPORT & MAINTENANCE
becomes
STANDARD HARDWARE + OPEN SOURCE SOFTWARE + ENTERPRISE PRODUCTS & SERVICES

(diagram: stacks of computer and disk nodes under both models)
ARCHITECTURE
ARCHITECTURAL COMPONENTS

RGW: A web services gateway for object storage, compatible with S3 and Swift
RBD: A reliable, fully distributed block device with cloud platform integration
CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management
LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

(diagram: APP, HOST/VM, and CLIENT access the stack from the top; everything sits on RADOS)
OBJECT STORAGE DAEMONS

(diagram: four OSDs, each running on top of a local filesystem (btrfs, xfs, ext4, zfs?) on its own disk; three monitors (M) alongside)
RADOS CLUSTER

(diagram: an application talking to a RADOS cluster made up of many OSD nodes plus a small set of monitors (M))
RADOS COMPONENTS

OSDs:
- 10s to 10,000s in a cluster
- One per disk (or one per SSD, RAID group, ...)
- Serve stored objects to clients
- Intelligently peer for replication & recovery

Monitors (M):
- Maintain cluster membership and state
- Provide consensus for distributed decision-making
- Small, odd number
- Do not serve stored objects to clients
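The "small, odd number" of monitors follows from majority quorum: a consensus group of n monitors stays available only while a majority survives, so a fourth monitor raises the quorum size without raising failure tolerance. A minimal sketch of the arithmetic (illustrative only, not Ceph code):

```python
def quorum_size(n_monitors: int) -> int:
    """Smallest majority of n monitors."""
    return n_monitors // 2 + 1

def failures_tolerated(n_monitors: int) -> int:
    """Monitors that can fail while a majority still remains."""
    return n_monitors - quorum_size(n_monitors)

for n in (1, 3, 4, 5):
    print(f"{n} monitors: quorum {quorum_size(n)}, "
          f"tolerates {failures_tolerated(n)} failure(s)")
```

Note that 3 and 4 monitors both tolerate exactly one failure, which is why odd cluster sizes are preferred.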
WHERE DO OBJECTS LIVE?

(diagram: an application holding an object, with no obvious way to know which node in the cluster should store it)
A METADATA SERVER?

(diagram: the application (1) asks a central lookup service where the object lives, then (2) contacts that location)
CALCULATED PLACEMENT

(diagram: the application computes placement itself from static name ranges: A-G, H-N, O-T, U-Z)
EVEN BETTER: CRUSH!

(diagram: objects are hashed into placement groups (PGs), and CRUSH maps each PG onto nodes in the cluster)
CRUSH IS A QUICK CALCULATION

(diagram: the client computes an object's location in the RADOS cluster directly; no lookup table is consulted)
CRUSH: DYNAMIC DATA PLACEMENT

CRUSH: pseudo-random placement algorithm
- Fast calculation, no lookup
- Repeatable, deterministic
- Statistically uniform distribution
- Stable mapping: limited data migration on change
- Rule-based configuration: infrastructure topology aware, adjustable replication, weighting
CRUSH

pg = hash(object name) % num_pg
OSDs = CRUSH(pg, cluster state, rule set)
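The two-step mapping can be sketched in plain Python. The first step is the real modulo hash; the second is only a stand-in for CRUSH (a deterministic hash ranking over a hypothetical flat OSD list), since real CRUSH also consults the cluster map, topology rules, and weights:

```python
import hashlib

def pg_for_object(name: str, num_pg: int) -> int:
    # Step 1: hash(object name) % num_pg
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % num_pg

def osds_for_pg(pg: int, osds: list, replicas: int = 3) -> list:
    # Step 2 (stand-in for CRUSH): deterministically rank OSDs for this PG.
    # Any client computes the same acting set with no lookup service.
    ranked = sorted(osds,
                    key=lambda osd: hashlib.md5(f"{pg}:{osd}".encode()).hexdigest())
    return ranked[:replicas]

pg = pg_for_object("myobject", num_pg=128)
acting_set = osds_for_pg(pg, osds=list(range(10)))
```

Because both steps are pure functions of the object name and cluster state, the answer is repeatable and requires no central directory.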
ACCESSING A RADOS CLUSTER

(diagram: an application links LIBRADOS, which talks to the RADOS cluster over a socket to move objects)
LIBRADOS: RADOS ACCESS FOR APPS

LIBRADOS:
- Direct access to RADOS for applications
- C, C++, Python, PHP, Java, Erlang
- Direct access to storage nodes
- No HTTP overhead
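With the Python binding that ships with Ceph, a write looks roughly like the sketch below. The pool name and conf path are placeholders, and the calls need a live cluster, so they are wrapped in a function rather than run at import time:

```python
def write_object(pool: str, name: str, data: bytes,
                 conffile: str = "/etc/ceph/ceph.conf") -> None:
    """Write one object to a RADOS pool via librados (requires a running cluster)."""
    import rados  # Python binding shipped with Ceph

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()                      # authenticate and fetch cluster maps
    try:
        ioctx = cluster.open_ioctx(pool)   # I/O context bound to one pool
        try:
            ioctx.write_full(name, data)   # store the whole object in one call
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

The client talks straight to the OSDs that CRUSH selects; there is no proxy or HTTP layer in the path.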
THE RADOS GATEWAY

(diagram: applications speak REST to RADOSGW instances; each RADOSGW uses LIBRADOS over a socket into the RADOS cluster)
RADOSGW MAKES RADOS WEBBY

RADOSGW:
- REST-based object storage proxy
- Uses RADOS to store objects
- API supports buckets and accounts
- Usage accounting for billing
- Compatible with S3 and Swift applications
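Because RGW speaks the S3 API, stock S3 clients work against it unchanged. A boto3 sketch, where the endpoint, credentials, and bucket name are all placeholders:

```python
def put_via_rgw(endpoint: str, access_key: str, secret_key: str,
                bucket: str, key: str, body: bytes) -> None:
    """Upload an object through RADOSGW using its S3-compatible API."""
    import boto3  # any stock S3 client can be pointed at RGW

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint,              # the RGW host, not AWS
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key=key, Body=body)
```

Swift clients work the same way against RGW's Swift-compatible endpoint.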
STORING VIRTUAL DISKS

(diagram: a VM runs on a hypervisor; LIBRBD on the hypervisor presents the VM a virtual disk stored in the RADOS cluster)
SEPARATE COMPUTE FROM STORAGE

(diagram: two hypervisors, each with LIBRBD; because the disk image lives in the RADOS cluster, the VM can move between them)
KERNEL MODULE FOR MAX FLEXIBLE!

(diagram: a Linux host maps an RBD image directly via the krbd kernel module, no hypervisor required)
RBD STORES VIRTUAL DISKS

RADOS BLOCK DEVICE:
- Storage of disk images in RADOS
- Decouples VMs from host
- Images are striped across the cluster (pool)
- Snapshots
- Copy-on-write clones
- Support in:
  - Mainline Linux kernel (2.6.39+)
  - Qemu/KVM; native Xen coming soon
  - OpenStack, CloudStack, Nebula, Proxmox
RBD SNAPSHOTS

- Export snapshots to geographically dispersed data centers: institute disaster recovery
- Export incremental snapshots: minimize network bandwidth by only sending changes
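The bandwidth saving of incremental export is easy to see with a toy model: treat an image as numbered blocks and ship only the blocks that changed since the previous snapshot. This is a simplification of what RBD does at the block level, not its actual diff format:

```python
def incremental_diff(old_snap: dict, new_snap: dict) -> dict:
    """Blocks to ship: changed or new blocks, plus None markers for deleted ones."""
    diff = {}
    for block, data in new_snap.items():
        if old_snap.get(block) != data:
            diff[block] = data
    for block in old_snap:
        if block not in new_snap:
            diff[block] = None          # block discarded since the old snapshot
    return diff

def apply_diff(snap: dict, diff: dict) -> dict:
    """Replay a diff at the remote site to reproduce the new snapshot."""
    out = dict(snap)
    for block, data in diff.items():
        if data is None:
            out.pop(block, None)
        else:
            out[block] = data
    return out

snap1 = {0: b"aaaa", 1: b"bbbb", 2: b"cccc"}
snap2 = {0: b"aaaa", 1: b"BBBB", 3: b"dddd"}
diff = incremental_diff(snap1, snap2)   # only the changes travel over the wire
```

The remote site, already holding snap1, can reconstruct snap2 from the diff alone.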
SEPARATE METADATA SERVER

(diagram: a Linux host's kernel client sends metadata operations to the metadata servers and file data directly to the RADOS cluster)
SCALABLE METADATA SERVERS

METADATA SERVER:
- Manages metadata for a POSIX-compliant shared filesystem
  - Directory hierarchy
  - File metadata (owner, timestamps, mode, etc.)
- Stores metadata in RADOS
- Does not serve file data to clients
- Only required for the shared filesystem
CALAMARI
CALAMARI ARCHITECTURE

(diagram: Calamari runs a master on the admin node; a minion on every node of the Ceph storage cluster, including the monitors, reports back to it)
USE CASES
WEB APPLICATION STORAGE

(diagram: a web application's app servers talk S3/Swift to Ceph Object Gateway (RGW) instances backed by the Ceph Storage Cluster (RADOS))
MULTI-SITE OBJECT STORAGE

(diagram: two sites, US-EAST and EU-WEST; each web application's app server talks to its local Ceph Object Gateway (RGW) in front of its own Ceph Storage Cluster)
ARCHIVE / COLD STORAGE

(diagram: the application writes into a replicated cache pool layered over an erasure-coded backing pool, both in one Ceph Storage Cluster)
ERASURE CODING

REPLICATED POOL:
- Full copies of stored objects
- Very high durability
- Quicker recovery

ERASURE CODED POOL:
- One copy plus parity
- Cost-effective durability
- Expensive recovery

(diagram: the replicated pool stores whole copies of the object; the erasure coded pool stores data chunks 1-4 plus parity chunks X and Y)
ERASURE CODING: HOW DOES IT WORK?

(diagram: the object is split into data chunks 1-4 and parity chunks X and Y, each stored on a different OSD in the erasure coded pool)
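The idea behind the data chunks and parity chunks can be shown with single-parity XOR. This is deliberately simpler than the configurable k+m Reed-Solomon codes Ceph actually uses, which can survive the loss of more than one chunk:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(obj: bytes, k: int) -> list:
    """Split into k equal data chunks and append one XOR parity chunk (m=1)."""
    size = -(-len(obj) // k)                      # ceiling division
    obj = obj.ljust(k * size, b"\0")              # pad so chunks are equal-sized
    chunks = [obj[i * size:(i + 1) * size] for i in range(k)]
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return chunks + [parity]

def recover(chunks: list) -> list:
    """Rebuild one missing chunk: it is the XOR of all surviving chunks."""
    missing = chunks.index(None)
    survivors = [c for c in chunks if c is not None]
    rebuilt = survivors[0]
    for c in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, c)
    out = list(chunks)
    out[missing] = rebuilt
    return out

stored = encode(b"hello ceph!!", k=4)             # 4 data chunks + 1 parity
damaged = stored[:2] + [None] + stored[3:]        # the OSD holding chunk 3 dies
repaired = recover(damaged)
```

Recovery here requires reading every surviving chunk, which is why the slide calls erasure-coded recovery expensive compared with copying a single replica.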
CACHE TIERING

(diagram: writeback mode: the Ceph client reads and writes against the cache tier, which reads from and writes back to the replicated backing pool)
CACHE TIERING

(diagram: read-only mode: the client's writes go straight to the replicated backing pool, while reads are served from the cache tier)
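The two modes differ only in which tier the client's operations touch. A toy model, with plain dicts standing in for the cache and backing pools (not Ceph's tiering agent, which also handles promotion and eviction policy):

```python
class WritebackTier:
    """Client I/O hits the cache; the backing pool is filled and flushed lazily."""
    def __init__(self):
        self.cache = {}
        self.backing = {}

    def write(self, key, val):
        self.cache[key] = val              # write absorbed by the cache tier

    def read(self, key):
        if key not in self.cache:          # miss: promote from the backing pool
            self.cache[key] = self.backing[key]
        return self.cache[key]

    def flush(self):
        self.backing.update(self.cache)    # dirty objects written back

class ReadOnlyTier(WritebackTier):
    """Writes bypass the cache and hit the backing pool directly."""
    def write(self, key, val):
        self.backing[key] = val
        self.cache.pop(key, None)          # drop any stale cached copy
```

Writeback suits hot read/write data; read-only suits data written once and read often, since the cache never holds the sole copy of an update.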
WEBSCALE APPLICATIONS

(diagram: app servers use the native protocol (LIBRADOS) directly against the Ceph Storage Cluster (RADOS), with no gateway in the path)
ARCHIVE / COLD STORAGE

(diagram: a replicated cache pool over an erasure-coded backing pool, spanning Ceph Storage Clusters at Site A and Site B)
DATABASES ON CEPH BLOCK DEVICE (RBD)

(diagram: MySQL / MariaDB on Linux hosts, using the kernel RBD driver to speak the native protocol to the Ceph Storage Cluster (RADOS))
WHAT ABOUT CEPH AND OPENSTACK?
CEPH AND OPENSTACK

(diagram: OpenStack's Keystone, Cinder, Glance, and Nova sit above the RADOS cluster: Cinder and Glance use LIBRBD, hypervisors attach volumes via LIBRBD, and the Swift API is served by RADOSGW over LIBRADOS)
OPENSTACK ADDITIONS

JUNO:
- Enable cloning for RBD-backed ephemeral disks

KILO:
- Volume migration from one backend to another
- Implement proper snapshotting for Ceph-based ephemeral disks
- Improve backup in Cinder
FUTURE CEPH ROADMAP

Planned across the Giant, Hammer, and I-releases:
- RBD df
- Object versioning
- RBD mirroring
- Calamari
- Alternative web server for RGW
- Object expiration
- Performance improvements (in every release)
NEXT STEPS: WHAT NOW?

Getting Started with Ceph
- Read about the latest version of Ceph: http://ceph.com/docs
- Deploy a test cluster using ceph-deploy: http://ceph.com/qsg
- Deploy a test cluster on the AWS free tier using Juju: http://ceph.com/juju
- Ansible playbooks for Ceph: https://www.github.com/alfredodeza/ceph-ansible

Getting Involved with Ceph
- Most discussion happens on the mailing lists ceph-devel and ceph-users. Join or view archives at http://ceph.com/list
- IRC is a great place to get help (or help others!): #ceph and #ceph-devel. Details and logs at http://ceph.com/irc
- Download the code: http://www.github.com/ceph
- The tracker manages bugs and feature requests. Register and start looking around at http://tracker.ceph.com
- Doc updates and suggestions are always welcome. Learn how to contribute docs at http://ceph.com/docwriting
THANK YOU!
Ian Colle, Global Director of Software Engineering
@ircolle