Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)


Ceph Object Storage at Spreadshirt

How we started

July 2015

Jens Hadlich, Chief Architect
Ansgar Jazdzewski, System Engineer

Ceph Berlin Meetup

About Spreadshirt


Spread it with Spreadshirt

A global e-commerce platform for everyone to create, sell and buy ideas on clothing and accessories across many points of sale.
•  12 languages, 11 currencies
•  19 markets
•  150+ shipping regions
•  community of >70,000 active sellers
•  € 72M revenue (2014)
•  >3.3M items shipped (2014)

Object Storage at Spreadshirt

•  Our main use case
   –  Store and read primarily user-generated content, mostly images
•  Some tens of terabytes (TB) of data
•  2 typical sizes:
   –  a few dozen KB
   –  a few MB
•  Up to 50,000 uploads per day
•  Read > Write


Object Storage at Spreadshirt

•  "Never change a running system"?
   –  Current solution (from our early days):
      •  Big storage, well-branded vendor
      •  Lots of files / directories / sharding
   –  Problems:
      •  Regular UNIX tools are unusable in practice
      •  Not designed for "the cloud" (e.g. replication is an issue)
      •  Performance bottlenecks
   –  Challenges:
      •  Growing number of users → more content
      •  Build a truly global platform (multiple regions and data centers)


Ceph

•  Why Ceph?
   –  Vendor independent
   –  Open source
   –  Runs on commodity hardware
   –  Local installation for minimal latency
   –  Existing knowledge and experience
   –  S3-API
      •  Simple bucket-to-bucket replication
   –  A good fit also for < Petabyte
   –  Easy to add more storage
   –  (Can be used later for block storage)
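Because RadosGW exposes the S3 API, standard S3 client libraries work against it largely unchanged. A minimal sketch using boto3; the endpoint URL, credentials, bucket and key below are placeholders, not Spreadshirt's actual setup:

    # Talk to a RadosGW endpoint with a plain S3 client (sketch).
    # Endpoint, credentials, bucket and key are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.internal:7480",   # hypothetical RadosGW address
        aws_access_key_id="RGW_ACCESS_KEY",
        aws_secret_access_key="RGW_SECRET_KEY",
    )

    s3.create_bucket(Bucket="images")
    s3.put_object(Bucket="images", Key="designs/example.png", Body=b"<image bytes>")
    obj = s3.get_object(Bucket="images", Key="designs/example.png")
    print(len(obj["Body"].read()), "bytes read back")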


Ceph Object Storage Architecture


Overview
[Architecture diagram: clients talk HTTP (S3 or Swift API) to the Ceph Object Gateway; behind it sits RADOS (reliable autonomic distributed object store) with a lot of nodes and disks (OSDs) plus Monitors, connected through a public network and a separate cluster network.]

Ceph Object Storage Architecture


A little more detailed

[Architecture diagram: the client talks HTTP (S3 or Swift API) to the Ceph Object Gateway (RadosGW), which uses librados to access RADOS. An odd number of Monitors forms the quorum. Each OSD node has some SSDs (for journals) and more HDDs as JBOD (no RAID). Public network: 1G; cluster network: 10G (the more the better).]
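RadosGW is itself just a librados client; applications can also talk to RADOS directly. A small sketch using the python-rados bindings, assuming a readable /etc/ceph/ceph.conf, a usable keyring and a pool named "testpool" (the pool name is made up):

    # Direct RADOS access through the python-rados bindings (sketch).
    # Assumes /etc/ceph/ceph.conf, a usable keyring and a pool called "testpool".
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("testpool")      # pool name is an assumption
        ioctx.write_full("hello-object", b"hello rados")
        print(ioctx.read("hello-object"))           # -> b'hello rados'
        ioctx.close()
    finally:
        cluster.shutdown()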

Ceph Object Storage at Spreadshirt


Initial Setup
[Deployment diagram: clients reach the cluster over HTTP (S3 or Swift API) via HAProxy; a RadosGW runs on each node, and 3 of the nodes also run Monitors. Cluster nodes: 3 x SSD (journal / index) and 9 x HDD (data) on xfs. Networking: 2 x 1G public network, 2 x 10G cluster network (OSD replication).]

Ceph Object Storage at Spreadshirt


Initial Setup
•  Hardware configuration
   –  5 x Dell PowerEdge R730xd
      •  Intel Xeon E5-2630v3, 2.4 GHz, 8C/16T
      •  64 GB RAM
      •  9 x 4 TB NLSAS HDD, 7.2K
      •  3 x 200 GB SSD Mixed Use
      •  2 x 120 GB SSD for boot & Ceph Monitors (LevelDB)
      •  2 x 1 Gbit + 4 x 10 Gbit network


Performance – First smoke tests

Ceph Object Storage Performance


First smoke tests

•  How fast is RadosGW?
   –  Response times (read / write)
      •  Average?
      •  Percentiles (P99)?
   –  Throughput?
   –  Compared to AWS S3?
•  A first (very minimalistic) test setup
   –  3 VMs (KVM), all with RadosGW, Monitor and 1 OSD
      •  2 cores, 4 GB RAM, 1 OSD each (15 GB + 5 GB), SSD, 10G network between nodes, HAProxy (round-robin), LAN, HTTP
   –  No further optimizations
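In essence, such a smoke test just fires parallel requests with small objects and records per-request latencies. A rough sketch of a write test client (endpoint, credentials and bucket name are placeholders; a read test would use get_object instead):

    # Rough S3 smoke test: parallel 4 KB PUTs, report avg/P99 latency and requests/s (sketch).
    # Endpoint, credentials and bucket name are placeholders; the bucket must already exist.
    import os, time, threading
    import boto3

    ENDPOINT = "http://haproxy.example.internal"   # hypothetical load balancer in front of the RadosGWs
    THREADS, REQUESTS, SIZE = 16, 1000, 4 * 1024
    latencies, lock = [], threading.Lock()

    def worker(n):
        s3 = boto3.client("s3", endpoint_url=ENDPOINT,
                          aws_access_key_id="KEY", aws_secret_access_key="SECRET")
        payload = os.urandom(SIZE)
        for i in range(REQUESTS):
            t0 = time.time()
            s3.put_object(Bucket="bench", Key="bench/%d-%d" % (n, i), Body=payload)
            with lock:
                latencies.append(time.time() - t0)

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(THREADS)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start

    latencies.sort()
    print("avg  %.1f ms" % (1000 * sum(latencies) / len(latencies)))
    print("p99  %.1f ms" % (1000 * latencies[int(0.99 * len(latencies)) - 1]))
    print("%.0f requests/s" % (len(latencies) / elapsed))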

Ceph Object Storage Performance


First smoke tests

•  How fast is RadosGW?
   –  Random read and write
   –  Object size: 4 KB
•  Results: pretty promising!
   –  E.g. 16 parallel threads, read:
      •  Avg 9 ms
      •  P99 49 ms
      •  > 1,300 requests/s

Ceph Object Storage Performance


First smoke tests

•  Compared to Amazon S3?
   –  Comparing apples and oranges (unfair, but interesting)
      •  http vs. https, LAN vs. WAN etc.
•  Response times
   –  Random read, object size: 4 KB, 4 parallel threads, client location: Leipzig

                  Ceph S3 (Test)   AWS S3 eu-central-1   AWS S3 eu-west-1
    Location      Leipzig          Frankfurt             Ireland
    Avg           6 ms             25 ms                 56 ms
    P99           47 ms            128 ms                374 ms
    Requests/s    405              143                   62


Performance – Now with the final hardware

Ceph Object Storage Performance


Now with the final hardware

•  How fast is RadosGW?
   –  Random read and write
   –  Object size: 4 KB
•  Results:
   –  E.g. 16 parallel threads, read:
      •  Avg 4 ms
      •  P99 43 ms
      •  > 2,800 requests/s

Ceph Object Storage Performance


Now with the final hardware

[Chart: average response times in ms (4 KB object size) vs. number of client threads (1, 2, 4, 8, 16, 32), one series for read and one for write.]

Ceph Object Storage Performance


Now with the final hardware

[Chart: read response times in ms (4 KB object size) vs. number of client threads (1, 2, 4, 8, 16, 32 and 32+32 with two clients), series for avg and P99.]

Ceph Object Storage Performance


Now with the final hardware

[Chart: read requests/s vs. number of client threads (1, 2, 4, 8, 16, 32 and 32+32 with two clients), series for 4 KB and 128 KB object sizes.]

1 client / 8 threads: 1G network almost saturated at ~115 MB/s

2 clients: 1G network saturated again; but scale out works :-)
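A back-of-the-envelope check of where the client link tops out, based on the ~115 MB/s observed above (the request-rate figures here are derived, not measured):

    1 Gbit/s ≈ 125 MB/s raw, roughly 115 MB/s of usable payload after TCP/HTTP overhead
    115 MB/s ÷ 128 KB per object ≈ 900 requests/s  → 128 KB reads are capped by the client's 1G link
    115 MB/s ÷ 4 KB per object ≈ 29,000 requests/s → 4 KB reads are limited by request rate, not bandwidth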


Monitoring

Monitoring


Grafana rulez :-)


Global availability

Global Availability


•  1 Ceph cluster per data center

•  S3 bucket-to-bucket replication

•  Multiple regions, local delivery
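For local delivery, each application instance would simply use the S3 endpoint of its own data center while bucket-to-bucket replication keeps the clusters in sync. A trivial sketch of per-region endpoint selection (region names and endpoint hosts are invented for illustration):

    # Pick the S3 endpoint of the local data center (sketch; all values are hypothetical).
    import os
    import boto3

    ENDPOINTS = {
        "eu": "http://rgw.eu.example.internal:7480",
        "na": "http://rgw.na.example.internal:7480",
    }

    region = os.environ.get("DATACENTER", "eu")   # set per deployment, e.g. via configuration management
    s3 = boto3.client("s3", endpoint_url=ENDPOINTS[region],
                      aws_access_key_id="KEY", aws_secret_access_key="SECRET")
    # Reads and writes stay in the local cluster; bucket-to-bucket replication
    # copies objects to the other region asynchronously.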


Currently open issues / operational tasks

Open issues / operational tasks


•  Backup
   –  s3fs-fuse too slow
   –  Set up another Ceph cluster?
•  Security
   –  Users
   –  ACLs
•  Migration of old data
   –  Upload all existing files via script (see the sketch below)
   –  Use the old system as fallback / in parallel
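A migration script can be as simple as walking the existing file tree and uploading every file, keeping the relative path as the object key. A sketch (source directory, endpoint, credentials and bucket name are placeholders):

    # Upload an existing file tree into an S3 bucket (sketch).
    # SOURCE_DIR, endpoint, credentials and bucket name are placeholders.
    import os
    import boto3

    SOURCE_DIR = "/data/legacy-images"
    BUCKET = "images"

    s3 = boto3.client("s3", endpoint_url="http://rgw.example.internal:7480",
                      aws_access_key_id="KEY", aws_secret_access_key="SECRET")

    for root, _dirs, files in os.walk(SOURCE_DIR):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, SOURCE_DIR)    # keep the relative path as the object key
            s3.upload_file(path, BUCKET, key)          # handles multipart uploads for large files
            print("uploaded", key)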

Open issues / operational tasks


•  Replication
   –  Test-drive radosgw-agent
   –  s3cmd? Custom tool? (see the sketch below)
   –  Metadata (users)
   –  Data
•  Performance?
•  Bucket notification
   –  Currently unsupported by RadosGW
   –  Build a custom solution?
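A custom replication tool could, at its simplest, list the source bucket and copy any objects missing on the destination cluster. A naive sketch (endpoints, credentials and bucket name are placeholders; radosgw-agent or s3cmd would be the off-the-shelf alternatives, and this handles neither deletes nor user metadata):

    # Naive bucket-to-bucket replication between two clusters (sketch).
    # Endpoints, credentials and bucket name are placeholders.
    import boto3
    from botocore.exceptions import ClientError

    src = boto3.client("s3", endpoint_url="http://rgw.dc1.example.internal:7480",
                       aws_access_key_id="KEY", aws_secret_access_key="SECRET")
    dst = boto3.client("s3", endpoint_url="http://rgw.dc2.example.internal:7480",
                       aws_access_key_id="KEY", aws_secret_access_key="SECRET")
    BUCKET = "images"

    for page in src.get_paginator("list_objects").paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            try:
                dst.head_object(Bucket=BUCKET, Key=key)          # already replicated, skip
                continue
            except ClientError:
                pass
            body = src.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            dst.put_object(Bucket=BUCKET, Key=key, Body=body)
            print("replicated", key)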

Open issues / operational tasks


•  Scrubbing
•  Rebuild

To be continued ...


Thank You!
jns@spreadshirt.com
ajaz@spreadshirt.com