SCALING STORAGE WITH CEPH
Ross Turk, Inktank
me
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
IN THE BEGINNING (Magic Madzik, Flickr / CC BY 2.0)
EARLY INFORMATION STORAGE (Chico.Ferreira, Flickr / CC BY 2.0)
WRITING > CAVE PAINTINGS (kevingessner, Flickr / CC BY-SA 2.0)
x1000 = x1
PEOPLE BEGIN WRITING A LOT (Moyan_Brenn, Flickr / CC BY-ND 2.0)
WRITING IS TIME-CONSUMING (trekkyandy, Flickr / CC BY 2.0)
THE INDUSTRIALIZATION OF WRITING (FateDenied, Flickr / CC BY 2.0)
x1000 = x1
tape + magnet = magnetic tape
STORAGE BECOMES MECHANICAL (Erik Pitti, Wikipedia / CC BY-ND 2.0)
[Diagram: HUMAN, COMPUTER, TAPE, alongside HUMAN, ROCK and HUMAN, INK, PAPER]
COMPUTERS NEED PEOPLE TO WORK (USDAgov, Flickr / CC BY 2.0)
[Diagram: HUMAN, COMPUTER, TAPE, with the tape holding binary data]
THROUGHPUT BECOMES IMPORTANT (Zane Luke, Flickr / CC BY-ND 2.0)
LAZ0R B3AMS CHANGE EVERYTHING!! (Jeff Kubina, Flickr / CC-BY-SA 2.0)
HARD DRIVES ARE TOTALLY BETTER
[Chart labels: amazing spinny hard drives; sucky stupid tape; slow]
EVERYTHING GETS MESSY (Rob!, Flickr / CC BY 2.0)
[Diagram: a directory tree with named nodes (aa, ab, ac, ba, bb, bc, da, db, dc) and binary data at the leaves]
file
owner: rturk
created: aug12
last viewed: aug17
size: 42025
perms: 644
[Diagram: the same directory tree, with data (0110) attached to node db]
WE OUTGROW THE HARD DRIVE (Mr. T in DC, Flickr / CC BY 2.0)
[Diagram: HUMAN, COMPUTER, and a growing number of DISKs]
[Diagram: several HUMANs sharing one (COMPUTER) with many DISKs]
[Diagram: a crowd of HUMANs (actually more like this…)]
[Diagram: HUMANs using many COMPUTERs, each with its own DISK]
[Diagram: the directory tree, crossed out]
object
pace: quick
driver: frog
license: expired
expression: agog
[binary data]
[Diagrams: an APP, and later several VMs, using a cluster of COMPUTER + DISK nodes]
STORAGE THROUGHOUT HISTORY
Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is.
[Timeline: Painting, Writing, Computers, Shared storage, Distributed storage, Cloud computing, Ceph]
[Diagram: HUMANs using a cluster of COMPUTER + DISK nodes; each node is drawn as DC in later diagrams]
STORAGE APPLIANCES (Michael Moll, Wikipedia / CC BY-SA 2.0)
6.4 MILLION SQFT OF FACTORIES (Dude94111, Flickr / CC BY 2.0)
TECHNOLOGY IS A COMMODITY (RaeAllen, Flickr / CC-BY 2.0)
COMMODITY PRICES FLUCTUATE
[Chart: time axis from May-07 to May-12]
Hardware Appliances are Mysterious Black Boxes (Abode of Chaos, Flickr / CC BY 2.0)
[Diagram: HUMAN [DEVELOPER] !! and a cluster of DC nodes; C++ on the nodes: X]
THE WORLD NEEDS AN OPEN STORAGE TECHNOLOGY THAT SCALES
SAGE WEIL
Co-founder of DreamHost
Inventor of Ceph
CEO of Inktank
OPEN SOURCE
philosophy design
OPEN SOURCE SPREADS IDEAS (orchidgalore, Flickr / CC BY 2.0)
OPEN SOURCE
COMMUNITY-FOCUSED
philosophy design
WE ARE SMARTER TOGETHER (rturk, LinkedIn InMap)
CEPH BELONGS TO ALL OF US (wackybadger, Flickr / CC BY 2.0)
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
philosophy design
CEPH IS BUILT TO SCALE
Too much for a book
Too much for a drive
Too much for a computer
Too much for a room
Ceph
Too much for a cave
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
philosophy design
ARIOLIMAX CALIFORNICUS (aroid, Flickr / CC BY 2.0)
THE OCTOPUS (A METAPHOR)
I love speaking in metaphors.
[Diagram labels: single point of failure; highly-available, replicated]
THE BEEHIVE (ANOTHER METAPHOR) (blumenbiene, Flickr / CC BY 2.0)
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
SOFTWARE BASED
philosophy design
[Diagram: a cluster of DC nodes; C++ on the nodes: ✔]
OPEN SOURCE
COMMUNITY-FOCUSED
SCALABLE
NO SINGLE POINT OF FAILURE
SOFTWARE BASED
SELF-MANAGING
philosophy design
DISKS = JUST TINY RECORD PLAYERS (jon_a_ross, Flickr / CC BY 2.0)
[Diagram: D x 1 MILLION = 55 times / day]
IT ALL STARTED WITH A DREAM
+
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
[Diagram: each OSD sits on a filesystem (btrfs, xfs, or ext4) on top of a DISK; monitors (M) run alongside the OSDs; a HUMAN looks on]
Monitors:
- Maintain cluster map
- Provide consensus for distributed decision-making
- Must have an odd number
- These do not serve stored objects to clients

OSDs:
- One per disk (recommended)
- At least three in a cluster
- Serve stored objects to clients
- Intelligently peer to perform replication tasks
- Supports object classes
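
The odd-number guidance for monitors follows from majority voting. The sketch below assumes the monitors reach consensus by strict majority; the function names are illustrative and not part of Ceph. It shows that adding a fourth monitor to a three-monitor cluster does not let the cluster survive any more failures.

```python
def quorum_size(num_monitors: int) -> int:
    """Smallest strict majority of num_monitors."""
    if num_monitors < 1:
        raise ValueError("need at least one monitor")
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors: int) -> int:
    """How many monitors can fail while a majority can still be formed."""
    return num_monitors - quorum_size(num_monitors)


if __name__ == "__main__":
    # Columns: monitors, quorum size, failures tolerated
    for n in range(1, 8):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption an even count only raises the quorum size, which is why odd counts such as 3 or 5 are the usual choice.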
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
[Diagram: an APP linked with LIBRADOS talks natively to the cluster and its monitors (M)]

LIBRADOS:
- Provides direct access to RADOS for applications
- C, C++, Python, PHP, Java
- No HTTP overhead
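
As a rough illustration of what direct access looks like from Python, the sketch below writes one object and reads it back through the `rados` binding that ships with Ceph. The pool name, object name, and configuration path are assumptions supplied by the caller, and the function needs a reachable cluster to do anything useful.

```python
def store_and_fetch(pool, name, data, conffile="/etc/ceph/ceph.conf"):
    """Write one object into a pool and read it back.

    `pool`, `name`, and `conffile` are placeholders chosen by the caller;
    nothing here is specific to the deck's cluster.
    """
    import rados  # Ceph's Python binding, imported lazily so the module loads without it

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)  # I/O context bound to one pool
        try:
            ioctx.write_full(name, data)  # replace the whole object
            return ioctx.read(name)       # reads up to the default length
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

A call such as `store_and_fetch("data", "greeting", b"hello")` would talk to the cluster directly, with no HTTP gateway in between.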
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
[Diagram: one APP talks REST to RADOSGW, which sits on LIBRADOS; another APP uses LIBRADOS natively; both reach the cluster and its monitors (M)]

RADOS Gateway:
- REST-based interface to RADOS
- Supports buckets, accounting
- Compatible with S3 and Swift applications
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
[Diagram: a VM in a VIRTUALIZATION CONTAINER uses LIBRBD on top of LIBRADOS]
[Diagram: a VM and two CONTAINERs, each with LIBRBD and LIBRADOS]
[Diagram: a HOST uses KRBD (KERNEL MODULE)]

RADOS Block Device:
- Storage of virtual disks in RADOS
- Allows decoupling of VMs and containers
- Live migration!
- Images are striped across the cluster
- Boot support in QEMU, KVM, and OpenStack Nova
- Mount support in the Linux kernel
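
To make "striped across the cluster" concrete, here is a minimal sketch that maps a byte offset in a virtual disk to the object that would hold it. It assumes a fixed object size of 4 MiB and the simplest possible layout, where consecutive chunks go to consecutive objects. The object naming scheme is hypothetical and used only for illustration.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size (4 MiB)


def object_name(image_name: str, index: int) -> str:
    # Hypothetical naming scheme, for illustration only.
    return f"{image_name}.{index:016x}"


def locate(image_name: str, offset: int, object_size: int = OBJECT_SIZE):
    """Map a byte offset in a virtual disk to (object name, offset in object)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    index, within = divmod(offset, object_size)
    return object_name(image_name, index), within


def objects_for_range(image_name, offset, length, object_size=OBJECT_SIZE):
    """List the objects touched by a read or write of `length` bytes."""
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return [object_name(image_name, i) for i in range(first, last + 1)]
```

Because each object is placed independently, a large sequential read touches many objects and therefore many storage nodes at once.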
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
[Diagram: a CLIENT with separate data (0110) and metadata paths into the cluster]

Metadata Server:
- Manages metadata for a POSIX-compliant shared filesystem
  - Directory hierarchy
  - File metadata (owner, timestamps, mode, etc.)
- Stores metadata in RADOS
- Does not serve file data to clients
- Only required for shared filesystem
WHAT MAKES CEPH UNIQUE?
HOW DO YOU FIND YOUR KEYS? (azmeen, Flickr / CC BY 2.0)
[Diagram: APP ?? and a cluster of DC nodes]
[Diagram: an APP looks up F* in a table of name ranges (A-G, H-N, O-T, U-Z) spread across the DC nodes]
I ALWAYS PUT MY KEYS ON THE HOOK (vitamindave, Flickr / CC BY 2.0)
[Diagram: APP and a cluster of DC nodes]
DEAR DIARY: KEYS = IN THE KITCHEN (Barnaby, Flickr / CC BY 2.0)
HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?
THE ANSWER: CRUSH!! (pasukaru76, Flickr / CC SA 2.0)
[Diagram: objects are hashed into groups (pg), and each group is mapped onto nodes in the cluster]
hash(object name) % num pg
CRUSH(pg, cluster state, rule set)

CRUSH:
- Pseudo-random placement algorithm
- Ensures even distribution
- Repeatable, deterministic
- Rule-based configuration
  - Replica count
  - Infrastructure topology
  - Weighting
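
The two-step lookup on the slide can be sketched in a few lines. The first function follows the slide's `hash(object name) % num pg` directly. The second is NOT the real CRUSH algorithm: it swaps in weighted rendezvous hashing as a stand-in that shares the listed properties (pseudo-random, repeatable, weighted), and it reduces "cluster state" and "rule set" to a weight table and a replica count. All names are illustrative.

```python
import hashlib
import math


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_name: str, num_pg: int) -> int:
    """Step 1 from the slide: hash(object name) % num pg."""
    return stable_hash(object_name) % num_pg


def place_pg(pg: int, osd_weights: dict, replicas: int) -> list:
    """Step 2 stand-in: weighted rendezvous hashing, not the real CRUSH.

    `osd_weights` maps a node name to a positive weight.
    """
    def score(osd):
        # Map the hash to a float strictly between 0 and 1.
        u = ((stable_hash(f"{pg}:{osd}") >> 12) + 0.5) / 2**52
        return -osd_weights[osd] / math.log(u)

    ranked = sorted(osd_weights, key=score, reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    weights = {f"osd.{i}": 1.0 for i in range(6)}
    pg = object_to_pg("my-object", 64)
    print(pg, place_pg(pg, weights, replicas=3))
```

Any client holding the same weight table computes the same answer, so no lookup service sits in the data path, which is the point of the "keys on the hook" metaphor.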
[Diagram: a CLIENT ?? locating data in the cluster]
[Diagram: a VM in a VIRTUALIZATION CONTAINER using LIBRBD and LIBRADOS, with monitors (M)]
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
[Diagram: instant copy. 144 + 0 + 0 + 0 + 0 = 144]
[Diagram: write. A CLIENT writes to one copy: 144 + 4 = 148]
[Diagram: read. A CLIENT reads through the same copy: 144 + 4 = 148]
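
The counts on these slides (144 before and after the instant copy, 148 after four writes) are what copy-on-write cloning produces. The toy model below is a sketch under that assumption, not RBD's implementation: a clone starts empty and points at its parent, writes land only in the clone, and reads fall through to the parent for anything the clone has not written. It writes whole objects only, which sidesteps partial-object copies.

```python
class Image:
    """A toy image: object index -> data, with an optional parent image."""

    def __init__(self, parent=None):
        self.parent = parent
        self.objects = {}

    def clone(self):
        """Instant copy: the clone stores nothing and only points at its parent."""
        return Image(parent=self)

    def write(self, index, data):
        """Writes land in this image only; the parent is never modified."""
        self.objects[index] = data

    def read(self, index):
        """Read from this image if present, else fall through to the parent chain."""
        image = self
        while image is not None:
            if index in image.objects:
                return image.objects[index]
            image = image.parent
        return None  # never written anywhere in the chain


def stored_objects(*images):
    """Total number of objects actually stored across the given images."""
    return sum(len(image.objects) for image in images)
```

Cloning a 144-object base image four times still stores 144 objects; writing 4 objects to one clone brings the total to 148, and the other clones keep reading the base data.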
HOW DO YOU MANAGE DIRECTORY HIERARCHY WITHOUT A SINGLE POINT OF FAILURE?
FILESYSTEMS REQUIRE METADATA (Barnaby, Flickr / CC BY 2.0)
[Diagram: a CLIENT, its data (0110), and the metadata servers]
one tree, three metadata servers ??
DYNAMIC SUBTREE PARTITIONING
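
The slide names the technique without detail. As a rough illustration of the idea that directory subtrees are handed from busy metadata servers to idle ones as load shifts, here is a toy rebalancing step. It is not Ceph's balancer; the data layout, the load measure (a request count per subtree), and the function names are all assumptions made for the sketch.

```python
def total_load(subtrees: dict) -> int:
    """Sum of the request counts of the subtrees a server owns."""
    return sum(subtrees.values())


def rebalance_once(servers: dict) -> bool:
    """Move one subtree from the busiest server to the idlest, if that helps.

    `servers` maps a server name to {subtree path: recent request count}.
    Returns True if a subtree was moved.
    """
    busiest = max(servers, key=lambda s: total_load(servers[s]))
    idlest = min(servers, key=lambda s: total_load(servers[s]))
    gap = total_load(servers[busiest]) - total_load(servers[idlest])
    # Only a subtree lighter than the gap narrows the imbalance when moved.
    candidates = {p: n for p, n in servers[busiest].items() if 0 < n < gap}
    if not candidates:
        return False
    path = max(candidates, key=candidates.get)
    servers[idlest][path] = servers[busiest].pop(path)
    return True


if __name__ == "__main__":
    servers = {
        "mds.a": {"/home": 50, "/var": 30, "/tmp": 20},
        "mds.b": {},
        "mds.c": {},
    }
    while rebalance_once(servers):
        pass
    print(servers)
```

Starting from one server that owns the whole tree, repeated steps spread the subtrees so that no single server carries all of the metadata load.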
AND NOW BACKPEDALING
ALMOST EVERYTHING WORKS
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
APP APP HOST/VM CLIENT
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
[Diagram labels: CEPH FS is NEARLY AWESOME; RADOS, LIBRADOS, RBD, and RADOSGW are AWESOME]
LAN SCALE!!*
*OR REALLY REALLY SCARY FAST WAN
CEPH AND CLOUDSTACK (tableatny, Flickr / CC BY 2.0)
RBD SUPPORT IN CLOUDSTACK
- Allows storage of virtual disks inside RADOS
- Works with KVM only right now
- No snapshots yet
- Upcoming in CloudStack 4
- More information can be found on the mailing list (ceph-devel / incubator-cloudstack-dev): http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505