SCALING STORAGE WITH CEPH Ross Turk, Inktank

Build A Cloud Day - Chicago

Page 1: Build A Cloud Day - Chicago

SCALING STORAGE WITH CEPH

Ross Turk, Inktank

Page 2: Build A Cloud Day - Chicago

WHO?

Ross Turk
VP Community, Inktank

[email protected] @rossturk

inktank.com | ceph.com

Page 3: Build A Cloud Day - Chicago
Page 4: Build A Cloud Day - Chicago
Page 5: Build A Cloud Day - Chicago

me

Page 6: Build A Cloud Day - Chicago

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 7: Build A Cloud Day - Chicago

IN THE BEGINNING
Magic Madzik, Flickr / CC BY 2.0

Page 8: Build A Cloud Day - Chicago

EARLY INFORMATION STORAGE
Chico.Ferreira, Flickr / CC BY 2.0

Page 9: Build A Cloud Day - Chicago

WRITING > CAVE PAINTINGS
kevingessner, Flickr / CC BY-SA 2.0

Page 10: Build A Cloud Day - Chicago

x1000

==x1

Page 11: Build A Cloud Day - Chicago

PEOPLE BEGIN WRITING A LOT
Moyan_Brenn, Flickr / CC BY-ND 2.0

Page 12: Build A Cloud Day - Chicago

WRITING IS TIME-CONSUMING
trekkyandy, Flickr / CC BY 2.0

Page 13: Build A Cloud Day - Chicago

THE INDUSTRIALIZATION OF WRITING
FateDenied, Flickr / CC BY 2.0

Page 14: Build A Cloud Day - Chicago

x1000

==x1

+magnet =tape magnetic tape

Page 15: Build A Cloud Day - Chicago

STORAGE BECOMES MECHANICAL
Erik Pitti, Wikipedia / CC BY-ND 2.0

Page 16: Build A Cloud Day - Chicago

[Diagram: HUMAN and ROCK; HUMAN, INK, and PAPER; HUMAN, COMPUTER, and TAPE]

Page 17: Build A Cloud Day - Chicago

COMPUTERS NEED PEOPLE TO WORK
USDAgov, Flickr / CC BY 2.0

Page 18: Build A Cloud Day - Chicago

HUMAN

COMPUTER TAPE

Page 19: Build A Cloud Day - Chicago

11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010 01010110 01010011

==

Page 20: Build A Cloud Day - Chicago

THROUGHPUT BECOMES IMPORTANT
Zane Luke, Flickr / CC BY-ND 2.0

Page 21: Build A Cloud Day - Chicago

LAZ0R B3AMS CHANGE EVERYTHING!!
Jeff Kubina, Flickr / CC-BY-SA 2.0

Page 22: Build A Cloud Day - Chicago

HARD DRIVES ARE TOTALLY BETTER

amazing spinny hard drives sucky stupid tapeslow

Page 23: Build A Cloud Day - Chicago

EVERYTHING GETS MESSY
Rob!, Flickr / CC BY 2.0

Page 24: Build A Cloud Day - Chicago

[Diagram: blocks of bits filed under the labels aa, ab, ac, ba, bb, bc, da, db, dc]

Page 25: Build A Cloud Day - Chicago

owner: rturk
created: aug12
last viewed: aug17
size: 42025
perms: 644

11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010

file

Page 26: Build A Cloud Day - Chicago

[Diagram: the same labeled blocks of bits, with a new block 0110 added]

Page 27: Build A Cloud Day - Chicago

WE OUTGROW THE HARD DRIVE
Mr. T in DC, Flickr / CC BY 2.0

Page 28: Build A Cloud Day - Chicago

[Diagram: three HUMANs share one COMPUTER attached to seven DISKs]

Page 29: Build A Cloud Day - Chicago

[Diagram: a crowd of HUMANs shares one (COMPUTER) attached to twelve DISKs]

(actually more like this…)

Page 30: Build A Cloud Day - Chicago

[Diagram: three HUMANs in front of a cluster of COMPUTER/DISK pairs]

Page 31: Build A Cloud Day - Chicago

[Diagram: the same labeled blocks of bits, crossed out (X)]

Page 32: Build A Cloud Day - Chicago

pace: quick
driver: frog
license: expired
expression: agog

11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010

object

Page 33: Build A Cloud Day - Chicago

[Diagram: an APP in front of a cluster of COMPUTER/DISK pairs]

Page 34: Build A Cloud Day - Chicago

[Diagram: a cluster of COMPUTER/DISK pairs]

Page 35: Build A Cloud Day - Chicago

[Diagram: three VMs in front of a cluster of COMPUTER/DISK pairs]

Page 36: Build A Cloud Day - Chicago

STORAGE THROUGHOUT HISTORY
Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is.

Painting
Writing
Computers
Shared storage
Distributed storage
Cloud computing
Ceph

Page 37: Build A Cloud Day - Chicago

[Diagram: three HUMANs in front of a cluster of COMPUTER/DISK pairs]

Page 38: Build A Cloud Day - Chicago

[Diagram: a cluster of COMPUTER/DISK pairs]

Page 39: Build A Cloud Day - Chicago

[Diagram: the same cluster, with each COMPUTER/DISK pair abbreviated to DC]

Page 40: Build A Cloud Day - Chicago

[Diagram: three HUMANs in front of a cluster of DC nodes]

Page 41: Build A Cloud Day - Chicago

STORAGE APPLIANCES
Michael Moll, Wikipedia / CC BY-SA 2.0

Page 42: Build A Cloud Day - Chicago

6.4 MILLION SQFT OF FACTORIES
Dude94111, Flickr / CC BY 2.0

Page 43: Build A Cloud Day - Chicago

TECHNOLOGY IS A COMMODITY
RaeAllen, Flickr / CC-BY 2.0

Page 44: Build A Cloud Day - Chicago

COMMODITY PRICES FLUCTUATE

May-07 May-08 May-09 May-10 May-11 May-12

Page 45: Build A Cloud Day - Chicago

Hardware Appliances are Mysterious Black Boxes
Abode of Chaos, Flickr / CC BY 2.0

Page 46: Build A Cloud Day - Chicago

[Diagram: a cluster of DC nodes and a HUMAN [DEVELOPER] !!]

Page 47: Build A Cloud Day - Chicago

[Diagram: a cluster of DC nodes, one of them labeled C++]

Page 48: Build A Cloud Day - Chicago

[Diagram: a cluster of DC nodes, one of them labeled C++ and crossed out (X)]

Page 49: Build A Cloud Day - Chicago

THE WORLD NEEDS AN OPEN STORAGE TECHNOLOGY THAT SCALES

Page 50: Build A Cloud Day - Chicago

SAGE WEIL

Co-founder of DreamHost

Inventor of Ceph

CEO of Inktank

Page 51: Build A Cloud Day - Chicago

OPEN SOURCE

philosophy design

Page 52: Build A Cloud Day - Chicago

OPEN SOURCE SPREADS IDEAS
orchidgalore, Flickr / CC BY 2.0

Page 53: Build A Cloud Day - Chicago

OPEN SOURCE

COMMUNITY-FOCUSED

philosophy design

Page 54: Build A Cloud Day - Chicago

WE ARE SMARTER TOGETHER
rturk, Linkedin Inmap

Page 55: Build A Cloud Day - Chicago

CEPH BELONGS TO ALL OF US
wackybadger, Flickr / CC BY 2.0

Page 56: Build A Cloud Day - Chicago

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

philosophy design

Page 57: Build A Cloud Day - Chicago

CEPH IS BUILT TO SCALE

Too much for a book

Too much for a drive

Too much for a computer

Too much for a room

Ceph

Too much for a cave

Page 58: Build A Cloud Day - Chicago

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

philosophy design

Page 59: Build A Cloud Day - Chicago

ARIOLIMAX CALIFORNICUS
aroid, Flickr / CC BY 2.0

Page 60: Build A Cloud Day - Chicago

THE OCTOPUS (A METAPHOR)
I love speaking in metaphors.

single point of failure

highly-available

replicated

Page 61: Build A Cloud Day - Chicago

THE BEEHIVE (ANOTHER METAPHOR)
blumenbiene, Flickr / CC BY 2.0

Page 62: Build A Cloud Day - Chicago

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

philosophy design

Page 63: Build A Cloud Day - Chicago

[Diagram: a cluster of DC nodes, one of them labeled C++]

Page 64: Build A Cloud Day - Chicago

[Diagram: a cluster of DC nodes, one of them labeled C++ with a check mark (✔)]

Page 65: Build A Cloud Day - Chicago

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

SELF-MANAGING

philosophy design

Page 66: Build A Cloud Day - Chicago

DISKS = JUST TINY RECORD PLAYERS
jon_a_ross, Flickr / CC BY 2.0

Page 67: Build A Cloud Day - Chicago

D

55 times / day

=D

DD

x 1 MILLION

DD

DD

Page 68: Build A Cloud Day - Chicago
Page 69: Build A Cloud Day - Chicago

IT ALL STARTED WITH A DREAM

Page 70: Build A Cloud Day - Chicago

+

Page 71: Build A Cloud Day - Chicago

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 72: Build A Cloud Day - Chicago

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 73: Build A Cloud Day - Chicago

[Diagram: five DISKs, each with a file system (FS: btrfs, xfs, or ext4) and an OSD on top, plus three monitors (M)]

Page 74: Build A Cloud Day - Chicago

M

M

M

HUMAN

Page 75: Build A Cloud Day - Chicago

Monitors:
  Maintain cluster map
  Provide consensus for distributed decision-making
  Must have an odd number
  These do not serve stored objects to clients

OSDs:
  One per disk (recommended)
  At least three in a cluster
  Serve stored objects to clients
  Intelligently peer to perform replication tasks
  Supports object classes
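
The "odd number" advice for monitors follows from majority voting: the monitors can only agree on cluster state while more than half of them can reach each other. A minimal sketch of that arithmetic, assuming a plain majority quorum; the function names are illustrative and not part of Ceph:

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority still remains."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Print monitor count, quorum size, and failures tolerated.
    for n in range(1, 8):
        print(n, quorum_size(n), tolerated_failures(n))
```

Going from three monitors to four does not let the cluster survive any more failures than before, which is why an even count buys nothing.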

Page 76: Build A Cloud Day - Chicago

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 77: Build A Cloud Day - Chicago

LIBRADOS

M

M

M

APP

native

Page 78: Build A Cloud Day - Chicago

LIBRADOS
  Provides direct access to RADOS for applications
  C, C++, Python, PHP, Java
  No HTTP overhead
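
As a rough illustration of what "direct access" looks like from Python, here is a small sketch using the `rados` binding that ships with Ceph. The config path, pool name, and object name in the usage note are placeholders, and the helper `roundtrip` is a made-up name for this example; it needs a reachable cluster to actually run.

```python
def roundtrip(conffile: str, pool: str, name: str, data: bytes) -> bytes:
    """Write one object into a RADOS pool and read it straight back."""
    # Imported lazily so this sketch can be loaded on a machine without Ceph.
    import rados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)  # I/O context bound to one pool
        try:
            ioctx.write_full(name, data)  # replace the whole object
            return ioctx.read(name, len(data))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

A call might look like `roundtrip("/etc/ceph/ceph.conf", "data", "greeting", b"hello")`, assuming a pool named `data` exists.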

Page 79: Build A Cloud Day - Chicago

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 80: Build A Cloud Day - Chicago

M

M

M

native

REST

APP

LIBRADOS

RADOSGWLIBRADOS

RADOSGW

APP

Page 81: Build A Cloud Day - Chicago

RADOS Gateway:
  REST-based interface to RADOS
  Supports buckets, accounting
  Compatible with S3 and Swift applications
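
Because the gateway speaks the S3 dialect, an ordinary S3 client can be pointed at it. The sketch below uses `boto3` as one such client; the endpoint URL, credentials, and bucket name are all placeholders you would replace with your own gateway's values, and `s3_roundtrip` is a name invented for this example.

```python
def s3_roundtrip(endpoint_url: str, access_key: str, secret_key: str,
                 bucket: str, key: str, body: bytes) -> bytes:
    """Store one object through an S3-compatible endpoint and fetch it back."""
    # Imported lazily so the sketch loads even where boto3 is not installed.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,  # the gateway, not Amazon
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    s3.create_bucket(Bucket=bucket)
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```

The only gateway-specific part is `endpoint_url`; everything else is what an application written for S3 would already be doing.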

Page 82: Build A Cloud Day - Chicago

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

Page 83: Build A Cloud Day - Chicago

M

M

M

VM

LIBRADOS

LIBRBD

VIRTUALIZATION CONTAINER

Page 84: Build A Cloud Day - Chicago

LIBRADOS

M

M

M

LIBRBD

CONTAINER

LIBRADOS

LIBRBD

CONTAINERVM

Page 85: Build A Cloud Day - Chicago

LIBRADOS

M

M

M

KRBD (KERNEL MODULE)

HOST

Page 86: Build A Cloud Day - Chicago

RADOS Block Device:
  Storage of virtual disks in RADOS
  Allows decoupling of VMs and containers
  Live migration!
  Images are striped across the cluster
  Boot support in QEMU, KVM, and OpenStack Nova
  Mount support in the Linux kernel
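
"Striped across the cluster" means a virtual disk is cut into fixed-size objects, each of which can land on a different storage node. A minimal sketch of the offset arithmetic, assuming 4 MiB objects (the usual RBD default) and the simplest layout, where object N holds bytes N×4 MiB onward; the function names are illustrative only:

```python
from typing import List, Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size: 4 MiB


def locate(offset: int, object_size: int = OBJECT_SIZE) -> Tuple[int, int]:
    """Map a byte offset in the image to (object number, offset inside it)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_touched(offset: int, length: int,
                    object_size: int = OBJECT_SIZE) -> List[int]:
    """List the object numbers that an I/O of `length` bytes at `offset` hits."""
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))
```

A write that straddles an object boundary touches two objects, and so may be served by two different sets of disks.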

Page 87: Build A Cloud Day - Chicago

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 88: Build A Cloud Day - Chicago

M

M

M

CLIENT

0110

data

metadata

Page 89: Build A Cloud Day - Chicago

Metadata Server
  Manages metadata for a POSIX-compliant shared filesystem
    Directory hierarchy
    File metadata (owner, timestamps, mode, etc.)
  Stores metadata in RADOS
  Does not serve file data to clients
  Only required for shared filesystem

Page 90: Build A Cloud Day - Chicago

WHAT MAKES CEPH

UNIQUE?

Page 91: Build A Cloud Day - Chicago

HOW DO YOU FIND YOUR KEYS?
azmeen, Flickr / CC BY 2.0

Page 92: Build A Cloud Day - Chicago

[Diagram: an APP with question marks (??) in front of a cluster of DC nodes]

Page 93: Build A Cloud Day - Chicago

[Diagram: an APP looking up F* in a cluster of DC nodes divided into the ranges A-G, H-N, O-T, U-Z]

Page 94: Build A Cloud Day - Chicago

I ALWAYS PUT MY KEYS ON THE HOOK
vitamindave, Flickr / CC BY 2.0

Page 95: Build A Cloud Day - Chicago

[Diagram: an APP in front of a cluster of DC nodes]

Page 96: Build A Cloud Day - Chicago

DEAR DIARY: KEYS = IN THE KITCHEN
Barnaby, Flickr / CC BY 2.0

Page 97: Build A Cloud Day - Chicago

HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?

Page 98: Build A Cloud Day - Chicago

THE ANSWER: CRUSH!!
pasukaru76, Flickr / CC SA 2.0

Page 99: Build A Cloud Day - Chicago

10 10 01 01 10 10 01 11 01 10

10 10 01 01 10 10 01 11 01 10

hash(object name) % num pg

CRUSH(pg, cluster state, rule set)
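
The two lines above describe a two-step lookup: hash the object name into a placement group (pg), then compute which storage nodes hold that pg. The sketch below mirrors that shape only. SHA-256 stands in for Ceph's own hash, and the second step uses rendezvous hashing as a stand-in for CRUSH, which is a different and far richer algorithm. All names here are hypothetical.

```python
import hashlib
from typing import List, Sequence


def stable_hash(text: str) -> int:
    """Hash that gives the same answer on every machine and every run."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(object_name: str, num_pgs: int) -> int:
    """Step 1: hash(object name) % num pg."""
    return stable_hash(object_name) % num_pgs


def pg_to_osds(pg: int, osds: Sequence[str], replicas: int) -> List[str]:
    """Step 2 (stand-in for CRUSH): rank OSDs by a per-(pg, osd) hash."""
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]


def locate(object_name: str, num_pgs: int, osds: Sequence[str],
           replicas: int = 3) -> List[str]:
    """Any client can compute an object's location without asking anyone."""
    return pg_to_osds(object_to_pg(object_name, num_pgs), osds, replicas)
```

The point of the exercise is that there is no lookup table: every client that knows the cluster layout computes the same answer on its own.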

Page 100: Build A Cloud Day - Chicago

10 10 01 01 10 10 01 11 01 10

10 10 01 01 10 10 01 11 01 10

Page 101: Build A Cloud Day - Chicago

CRUSH
  Pseudo-random placement algorithm
  Ensures even distribution
  Repeatable, deterministic
  Rule-based configuration
    Replica count
    Infrastructure topology
    Weighting
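
To make "pseudo-random", "deterministic", and "weighting" a little more concrete, here is a toy placement function built on weighted rendezvous hashing. This is not CRUSH and ignores topology and rules entirely; it only shows how a hash-based choice can be repeatable while still giving a device with twice the weight roughly twice the data. Device names and weights are made up.

```python
import hashlib
import math
from collections import Counter
from typing import Dict


def unit_hash(text: str) -> float:
    """Deterministically map text to a float strictly between 0 and 1."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    raw = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (raw + 0.5) / 2**52


def pick(pg: int, weights: Dict[str, float]) -> str:
    """Choose one device for a pg; heavier devices win proportionally more."""
    return max(
        weights,
        key=lambda dev: -weights[dev] / math.log(unit_hash(f"{pg}:{dev}")),
    )


def shares(weights: Dict[str, float], num_pgs: int) -> Dict[str, float]:
    """Fraction of pgs that each device ends up holding."""
    counts = Counter(pick(pg, weights) for pg in range(num_pgs))
    return {dev: counts[dev] / num_pgs for dev in weights}
```

With weights of 2, 1, and 1, the first device should end up with about half of the placement groups and the other two with about a quarter each.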

Page 102: Build A Cloud Day - Chicago

CLIENT

??

Page 103: Build A Cloud Day - Chicago
Page 104: Build A Cloud Day - Chicago
Page 105: Build A Cloud Day - Chicago

CLIENT

??

Page 106: Build A Cloud Day - Chicago

LIBRADOS

M

M

M

VM

LIBRBD

VIRTUALIZATION CONTAINER

Page 107: Build A Cloud Day - Chicago

HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?

Page 108: Build A Cloud Day - Chicago

144 0 0 0 0

instant copy

= 144

Page 109: Build A Cloud Day - Chicago

4144

CLIENT

write

write

write

= 148

write

Page 110: Build A Cloud Day - Chicago

4144

CLIENTread

read

read

= 148
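
The last three slides can be read as copy-on-write cloning: the copy starts out empty and refers back to its parent (144 units stored), writes land only in the copy (144 + 4 = 148), and reads fall through to the parent for anything the copy has not overwritten. A toy model of that reading, with invented class and method names; it leaves out snapshots and everything else a real image needs:

```python
from typing import Dict, Optional


class Image:
    """A block image that may be layered on top of a parent image."""

    def __init__(self, blocks: Optional[Dict[int, bytes]] = None,
                 parent: Optional["Image"] = None) -> None:
        self.blocks: Dict[int, bytes] = dict(blocks or {})
        self.parent = parent

    def clone(self) -> "Image":
        """Instant copy: no blocks are duplicated, only a parent link is kept."""
        return Image(parent=self)

    def write(self, index: int, data: bytes) -> None:
        """Writes always go to this image, never to the parent."""
        self.blocks[index] = data

    def read(self, index: int) -> bytes:
        """Reads use the local block if present, otherwise ask the parent."""
        if index in self.blocks:
            return self.blocks[index]
        if self.parent is not None:
            return self.parent.read(index)
        return b""  # never written anywhere

    def stored_blocks(self) -> int:
        """Blocks this image itself occupies, excluding the parent's."""
        return len(self.blocks)
```

A thousand clones of one golden image cost almost nothing until the VMs start writing, which is the "instantly and efficiently" part.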

Page 111: Build A Cloud Day - Chicago

HOW DO YOU MANAGE DIRECTORY HIERARCHY WITHOUT A SINGLE POINT OF FAILURE?

Page 112: Build A Cloud Day - Chicago

FILESYSTEMS REQUIRE METADATA
Barnaby, Flickr / CC BY 2.0

Page 113: Build A Cloud Day - Chicago

M

M

M

CLIENT

0110

Page 114: Build A Cloud Day - Chicago

M

M

M

Page 115: Build A Cloud Day - Chicago

one tree

three metadata servers

??

Page 116: Build A Cloud Day - Chicago
Page 117: Build A Cloud Day - Chicago
Page 118: Build A Cloud Day - Chicago
Page 119: Build A Cloud Day - Chicago
Page 120: Build A Cloud Day - Chicago

DYNAMIC SUBTREE PARTITIONING
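
The idea behind the name is that one directory tree is carved into subtrees, each owned by a metadata server, and the carving is redone as load shifts. The sketch below is only a toy: it hands whole subtrees to the least-loaded server in order of how busy they are, and recomputes from scratch when the load changes. It is not how Ceph's metadata servers actually balance, and the paths, loads, and server names are invented.

```python
import heapq
from typing import Dict, List


def partition(subtree_load: Dict[str, float], servers: List[str]) -> Dict[str, str]:
    """Assign each subtree to a server, busiest subtrees first."""
    # Min-heap of (total load so far, server name).
    heap = [(0.0, server) for server in servers]
    heapq.heapify(heap)
    assignment: Dict[str, str] = {}
    busiest_first = sorted(subtree_load.items(), key=lambda kv: (-kv[1], kv[0]))
    for path, load in busiest_first:
        total, server = heapq.heappop(heap)  # least-loaded server right now
        assignment[path] = server
        heapq.heappush(heap, (total + load, server))
    return assignment
```

Run it again after `/tmp` becomes the hot directory and the same tree gets split differently across the same three servers.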

Page 121: Build A Cloud Day - Chicago

AND NOW BACKPEDALING

Page 122: Build A Cloud Day - Chicago

ALMOST EVERYTHING WORKS

Page 123: Build A Cloud Day - Chicago

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

NEARLY AWESOME

AWESOME

AWESOME

AWESOME

AWESOME

Page 124: Build A Cloud Day - Chicago

LAN SCALE!!*

*OR REALLY REALLY SCARY FAST WAN

Page 125: Build A Cloud Day - Chicago

CEPH AND CLOUDSTACK
tableatny, Flickr / CC BY 2.0

Page 126: Build A Cloud Day - Chicago

RBD SUPPORT IN CLOUDSTACK

  Allows storage of virtual disks inside RADOS
  Works with KVM only right now
  No snapshots yet
  Upcoming in CloudStack 4
  More information can be found on the mailing list:
  ceph-devel / incubator-cloudstack-dev: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505

Page 127: Build A Cloud Day - Chicago

QUESTIONS?

Ross Turk
VP Community, Inktank

[email protected] @rossturk

inktank.com | ceph.com