Transcript
Page 1: Ceph at Spreadshirt (June 2016)


Ceph at Spreadshirt June 2016

Ansgar Jazdzewski, System Engineer

Jens Hadlich, Chief Architect

Page 2: Ceph at Spreadshirt (June 2016)


Agenda

•  Introduction

• Setup

• Performance

• Geo-Replication & Backup

• Lessons learned

Page 3: Ceph at Spreadshirt (June 2016)


Introduction

Page 4: Ceph at Spreadshirt (June 2016)


About Spreadshirt

Spread it with Spreadshirt

A global e-commerce platform for everyone to create, sell and buy ideas on clothing and accessories across many points of sale.

•  Founded in 2002, based in Leipzig
•  550+ employees around the world
•  Global shipping (180+ countries)
•  Community of 70,000+ active sellers
•  € 85M revenue (2015)
•  3.6M shipped items (2015)

Page 5: Ceph at Spreadshirt (June 2016)


Why do we need object storage?

Main use case: user-generated content (vector & pixel images)
•  Some 10s of terabytes (TB) of data
•  2 typical sizes:
   •  A few MB (originals)
   •  A few KB (post-processed)
•  50,000 uploads per day at peak
•  Currently 35M+ objects
•  Read > write

Page 6: Ceph at Spreadshirt (June 2016)


Previous approach

NFS
•  Well-branded vendor
•  Millions of files and directories
•  Sharding per user / user group
•  Same shares mounted on almost every server

Page 7: Ceph at Spreadshirt (June 2016)


Previous approach

NFS
•  Problems
   •  Regular UNIX tools became unusable
   •  Not designed for “the cloud” (e.g. replication designed for LAN)
   •  Partitions can only grow, not shrink
   •  Performance bottlenecks
•  Challenges
   •  Growing number of users → more content
   •  Build a truly global platform (multiple regions and data centers)

Page 8: Ceph at Spreadshirt (June 2016)


Why Ceph?

• Vendor independent

• Open source

• Runs on commodity hardware (big or small)

• Local installation for minimal network latency

• Existing knowledge and experience

• S3 API

• Easy to add more storage

Page 9: Ceph at Spreadshirt (June 2016)


The Ceph offering

• Object storage

• Block storage

• File system

→ we (currently) focus on object storage (Ceph Object Gateway)

Page 10: Ceph at Spreadshirt (June 2016)


Setup

Page 11: Ceph at Spreadshirt (June 2016)


General Ceph Object Storage Architecture

[Architecture diagram: clients talk HTTP (S3 or Swift API) to the Ceph Object Gateway, which stores data in RADOS (reliable autonomic distributed object store): three monitors and a lot of nodes and disks (OSDs), connected by a public network and a cluster network.]

Page 12: Ceph at Spreadshirt (June 2016)


General Ceph Object Storage Architecture

[Architecture diagram, more detail: the client talks HTTP (S3 or Swift API) to the Ceph Object Gateway (RadosGW), which uses librados to reach RADOS (reliable autonomic distributed object store). An odd number of monitors (for quorum) and many OSD nodes sit on a 10G public network; OSD replication runs over a separate 10G cluster network (more is better). Each OSD node holds some SSDs (for journals) and more HDDs as JBOD (no RAID).]

Page 13: Ceph at Spreadshirt (June 2016)


Ceph Object Storage at Spreadshirt

[Deployment diagram: clients reach the cluster over HTTP (S3 API) via the S3 frontend network. Every cluster node runs RadosGW and s3gw-haproxy; three nodes also run a monitor. Each node has 2 x SSD (OS / LevelDB), 3 x SSD (journal / index) and 9 x HDD (data), formatted with XFS. s3gw-haproxy distributes requests to the RadosGW instances over the S3 backend network (LB w/ consistent hashing); OSD replication uses a dedicated cluster network.]

Page 14: Ceph at Spreadshirt (June 2016)


Ceph Object Storage at Spreadshirt

[The same deployment diagram with network sizing: 2 x 1G and 2 x 10G links, the cluster network (OSD replication) on 2 x 10G and the S3 backend network (LB w/ consistent hashing) on 2 x 10G; in this setup the S3 backend network = cluster network.]
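
One way to express the consistent-hashing load balancing in HAProxy (a minimal sketch, not the actual s3gw-haproxy configuration; hashing on the URI and the server names/ports are assumptions):

backend radosgw
    # hash the request URI (bucket/key) so a given object always hits the same RadosGW
    balance uri
    # consistent hashing: only a small share of keys move when a server goes away
    hash-type consistent
    server stor01 10.0.0.1:7480 check
    server stor02 10.0.0.2:7480 check
    server stor03 10.0.0.3:7480 check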

Page 15: Ceph at Spreadshirt (June 2016)


Ceph Object Storage at Spreadshirt

• Hardware configuration
  •  5 x Dell PowerEdge R730xd
  •  Intel Xeon E5-2630v3, 2.4 GHz, 8C/16T
  •  64 GB RAM
  •  9 x 4 TB NLSAS HDD, 7.2K
  •  3 x 200 GB SSD Mixed Use
  •  2 x 120 GB SSD for Boot & Ceph Monitors (LevelDB)
  •  2 x 1 Gbit + 4 x 10 Gbit NW
• Same for each data center (EU, NA)

Page 16: Ceph at Spreadshirt (June 2016)


One node in detail

[Node diagram: disks are split into OS + LevelDB (SSD), OSD journal + bucket.index (SSD) and regular OSDs (HDD); the node also runs RadosGW, a Monitor (on 3 of the 5 nodes), s3gw-haproxy, collectd, keepalived and cephy.]

Component      What?
keepalived     High-availability
collectd       Metrics (Grafana)
s3gw-haproxy   Geo-Replication (explained later)
Redis          Geo-Replication (explained later)
cephy          Geo-Replication (explained later)
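
How keepalived provides the high availability is deployment-specific; a minimal sketch under assumptions (interface, router id and VIP are placeholders, not our actual configuration) is a VRRP instance that floats an S3 frontend address between the nodes:

vrrp_instance s3_frontend {
    # all nodes start as BACKUP; the highest priority wins the election
    state BACKUP
    priority 100
    advert_int 1
    # placeholder frontend interface and router id (must match on all nodes)
    interface bond0
    virtual_router_id 51
    # placeholder S3 frontend VIP
    virtual_ipaddress {
        192.0.2.10/24
    }
}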

Page 17: Ceph at Spreadshirt (June 2016)


Ceph OSD tree

ID  WEIGHT    TYPE NAME            UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
-5    1.27487 root ssd
-4    0.25497     host stor05_ssd
 9    0.08499         osd.9             up   1.00000           1.00000
...
-7    0.25497     host stor04_ssd
21    0.08499         osd.21            up   1.00000           1.00000
...
-9    0.25497     host stor03_ssd
33    0.08499         osd.33            up   1.00000           1.00000
...
-11   0.25497     host stor02_ssd
45    0.08499         osd.45            up   1.00000           1.00000
...
-13   0.25497     host stor01_ssd
57    0.08499         osd.57            up   1.00000           1.00000
...
-3  163.79997 root hdd
-2   32.75999     host stor05_hdd
 0    3.64000         osd.0             up   1.00000           1.00000
 1    3.64000         osd.1             up   1.00000           1.00000
 2    3.64000         osd.2             up   1.00000           1.00000
 3    3.64000         osd.3             up   1.00000           1.00000
...
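
The listing above is what ceph osd tree prints (truncated here); note the two CRUSH roots, ssd and hdd, which the index-on-SSD rule shown later builds on.

# Show the CRUSH hierarchy of roots, hosts and OSDs
ceph osd tree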

Page 18: Ceph at Spreadshirt (June 2016)


Performance

Page 19: Ceph at Spreadshirt (June 2016)


Why does latency matter?

Results of some early performance smoke tests*:

                          Ceph S3 (test setup)   AWS S3 eu-central-1   AWS S3 eu-west-1
Location                  Leipzig                Frankfurt             Ireland
Response time (average)   6 ms                   25 ms                 56 ms
Response time (p99)       47 ms                  128 ms                374 ms
Requests per second       405                    143                   62

* Random read, object size 4KB, 4 parallel threads, location: Leipzig

Page 20: Ceph at Spreadshirt (June 2016)


Performance

• Test setup
  •  “Giant” release
  •  Index pools still on HDDs
• Test result
  •  E.g. random read, 16 parallel threads, 4KB object size
  •  Response time:
     •  4 ms (average)
     •  43 ms (p99)
  •  ~2,800 requests/s

Page 21: Ceph at Spreadshirt (June 2016)


Performance: read vs. write

[Chart: average response times in ms (4k object size) for read and write, over 1, 2, 4, 8, 16 and 32 client threads; y-axis 0-350 ms.]

Write performance is much better with index pools on SSD.

Page 22: Ceph at Spreadshirt (June 2016)


Performance: read

[Chart: read response times in ms (4k object size), average and p99, over 1, 2, 4, 8, 16, 32 and 32+32 client threads; y-axis 0-50 ms.]

Page 23: Ceph at Spreadshirt (June 2016)


Performance: horizontal scaling

[Chart: read requests/s for 4k and 128k object sizes, over 1, 2, 4, 8, 16, 32 and 32+32 client threads; y-axis 0-10,000 requests/s.]

1 client / 8 threads: 1G network almost saturated at ~117 MB/s

2 clients: 1G network saturated again, but scale-out works :)

Page 24: Ceph at Spreadshirt (June 2016)


Geo-Replication & Backup

Page 25: Ceph at Spreadshirt (June 2016)


Geo-Replication & Backup

Why build a custom solution?
•  The Ceph-provided radosgw-agent offers only (very) basic functionality:

“For all buckets of a given user, synchronize all objects sequentially.”

Page 26: Ceph at Spreadshirt (June 2016)


Geo-Replication & Backup

Our requirements
•  All data globally available
•  50M+ objects
•  Near real time (NRT)
•  S3 users / keys / buckets are pretty static

Page 27: Ceph at Spreadshirt (June 2016)


Geo-Replication & Backup

Our requirements
•  All data globally available
•  50M+ objects
•  Near real time (NRT)
•  S3 users / keys / buckets are pretty static

Backup strategy
•  No dedicated backup → replication
•  No restore → fail-over to next alive cluster

Page 28: Ceph at Spreadshirt (June 2016)


Geo-Replication & Backup

Our requirements
•  All data globally available
•  50M+ objects
•  Near real time (NRT)
•  S3 users / keys / buckets are pretty static

Backup strategy
•  No dedicated backup → replication
•  No restore → fail-over to next alive cluster

We need bucket notifications!

Page 29: Ceph at Spreadshirt (June 2016)


Geo-Replication

“Spreadshirt Cloud Object Storage”
•  Ceph Object Storage (RadosGW) + bucket notifications + replication agent:
   •  s3gw-haproxy* (custom version of HAProxy, written in C)
   •  Redis
   •  cephy tool suite (written in Go)
•  Support from Heinlein for implementation

* https://github.com/spreadshirt/s3gw-haproxy

Page 30: Ceph at Spreadshirt (June 2016)


cephy

A tool suite for working with Ceph Object Storage through its S3 API:

•  cephy: basic file system operations between files/directories and objects/buckets
•  cephy-sync: synchronizes two or more buckets (single source, multiple targets) using notifications from cephy-listen or cephy-queue
•  cephy-listen: creates sync notifications from S3 operations on a RadosGW socket
•  cephy-queue: generates sync notifications for all objects inside a bucket

Page 31: Ceph at Spreadshirt (June 2016)


Replication in Action

[Sequence diagram: a client and Site A, which runs s3gw-haproxy, RadosGW and cephy.]

Page 32: Ceph at Spreadshirt (June 2016)


Replication in Action

[Step 1: the client sends PUT bucket/key over HTTP (S3 API); s3gw-haproxy receives it.]

Page 33: Ceph at Spreadshirt (June 2016)


Replication in Action

[Step 2: s3gw-haproxy forwards the PUT bucket/key to RadosGW.]

Page 34: Ceph at Spreadshirt (June 2016)


Replication in Action

[Step 3: the response (header status code 20x) travels back through s3gw-haproxy to the client.]

Page 35: Ceph at Spreadshirt (June 2016)


Replication in Action

[Step 4: for a successful write (response status code 20x), s3gw-haproxy publishes a notification to Redis:]

LPUSH s3notifications:bucket {
    "event": "s3:ObjectCreated:Put",
    "objectKey": "key"
}
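
The same flow in redis-cli terms (the list name s3notifications:bucket and the JSON payload come from the slide; consuming with BRPOP is an assumption about how a consumer such as cephy-sync might read the list):

# publish a notification the way s3gw-haproxy does
redis-cli LPUSH s3notifications:bucket '{"event":"s3:ObjectCreated:Put","objectKey":"key"}'

# a consumer can block-pop notifications from the other end of the list
redis-cli BRPOP s3notifications:bucket 0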

Page 36: Ceph at Spreadshirt (June 2016)


Replication in Action

[Step 5: cephy consumes the notification and replicates the object to all given targets.]

Page 37: Ceph at Spreadshirt (June 2016)


Replication in Action

[The same flow, now with a second site: Site B also runs s3gw-haproxy, RadosGW and cephy and acts as a replication target.]

Page 38: Ceph at Spreadshirt (June 2016)


Replication in Action

[Steps 6 and 7: cephy at Site A PUTs bucket/key to Site B (through Site B's s3gw-haproxy to its RadosGW) with the request header X-Notifications: False; no notifications are published for replicated objects.]
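
In HTTP terms, the replicated write is an ordinary S3 PUT plus that one header (a sketch; the host is a placeholder and the V2 Authorization value is abbreviated):

PUT /bucket/key HTTP/1.1
Host: s3.site-b.example
X-Notifications: False
Authorization: AWS <access-key>:<signature>
Content-Length: <size>

<object data>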

Page 39: Ceph at Spreadshirt (June 2016)


Geo-replication

Roadmap
•  More servers / locations / replication targets
•  100M+ objects
•  Global gateway
•  “Object inventory”
•  More intelligent scrubbing
•  Location awareness
•  Easily bootstrap a new cluster
•  Statistics
•  Potentially make cephy open source

Page 40: Ceph at Spreadshirt (June 2016)


Lessons learned

Page 41: Ceph at Spreadshirt (June 2016)


Lessons learned

• Grafana is awesome!

Page 42: Ceph at Spreadshirt (June 2016)


Lessons learned

• Kibana is awesome, too! RadosGW is quite chatty btw.

Page 43: Ceph at Spreadshirt (June 2016)


Lessons learned

• Upgrades can fail
  •  Build scripts to get a dummy setup and test it in QA
  •  Story of a zombie cluster

Page 44: Ceph at Spreadshirt (June 2016)


Lessons learned

• Move index data to SSD*

ceph osd crush add-bucket ssd root
ceph osd crush add-bucket stor01_ssd host
ceph osd crush move stor01_ssd root=ssd
ceph osd crush rule create-simple ssd_rule ssd host
ceph osd pool set .eu-cgn.rgw.buckets.index crush_ruleset 1

* https://www.sebastien-han.fr/blog/2014/01/13/ceph-managing-crush-with-the-cli/

CRUSH rule numbers are tricky.
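
A quick sanity check around the last command (a minimal sketch; ssd_rule is the rule created above, and on this pre-Luminous release crush_ruleset expects the numeric ruleset, so look it up rather than assuming it is 1):

# Show the rule; its "ruleset" number is what crush_ruleset expects
ceph osd crush rule dump ssd_rule

# Verify which ruleset the index pool now uses
ceph osd pool get .eu-cgn.rgw.buckets.index crush_ruleset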

Page 45: Ceph at Spreadshirt (June 2016)


Lessons learned

• Know your libraries and tools
  •  Pain points:
     •  Ceph Object Gateway (before Jewel) only supports V2 signatures
     •  Official SDKs from AWS: Java, Go, …
     •  Fun with Python Boto2/3
     •  Custom (other than AWS) S3 endpoints might not be supported or even ignored/overridden (see the sketch below)
     •  Beware of file system wrappers (e.g. Java FileSystemProvider)
        •  They do funny things that you don't necessarily expect,
           e.g. DELETE requests when copying files with REPLACE_EXISTING option
        •  Can cause many more requests than you might think
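
For example, pointing a client at a custom RadosGW endpoint with V2 signatures; this sketch uses s3cmd rather than one of the SDKs named above, and the host name is a placeholder:

# Talk to a RadosGW endpoint instead of AWS, forcing V2 signatures
s3cmd --host=s3.example.internal \
      --host-bucket='%(bucket)s.s3.example.internal' \
      --signature-v2 \
      ls s3://some-bucket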

Page 46: Ceph at Spreadshirt (June 2016)


Lessons learned

• Plan your scrubbing
  •  35.2M objects / 4,400 PGs = 8,000 files per scrub task
  •  Distribute deep scrubs over a longer timeframe:
     osd deep scrub interval = 2592000 (30 days)
  •  Watch it closely (and see the sketch below):
     ceph pg dump pgs -f json | jq .[].last_deep_scrub_stamp | tr -d '"' | awk '{print $1}' | sort -n | uniq -c
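
If the distribution looks skewed, the oldest PGs can also be deep-scrubbed by hand (a sketch building on the command above; ceph pg deep-scrub is a standard command, and picking the 10 oldest PGs is just an example):

# Deep-scrub the 10 PGs with the oldest last_deep_scrub_stamp
ceph pg dump pgs -f json \
  | jq -r '.[] | "\(.last_deep_scrub_stamp) \(.pgid)"' \
  | sort | head -10 | awk '{print $NF}' \
  | while read pg; do ceph pg deep-scrub "$pg"; done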

Page 47: Ceph at Spreadshirt (June 2016)


Questions?

Page 48: Ceph at Spreadshirt (June 2016)


Thank You

[email protected]

[email protected]

Page 49: Ceph at Spreadshirt (June 2016)


Links

• http://ceph.com/

• https://github.com/spreadshirt/s3gw-haproxy

• http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html

• http://redis.io/

• http://www.haproxy.org/

• https://www.sebastien-han.fr/blog/2014/01/13/ceph-managing-crush-with-the-cli/