Building AuroraObjects – Ceph Day Frankfurt


Wido den Hollander, 42on.com


Building AuroraObjects

Who am I?

● Wido den Hollander (1986)

● Co-owner and CTO of PCextreme B.V., a Dutch hosting company

● Ceph trainer and consultant at 42on B.V.

● Part of the Ceph community since late 2009

– Wrote the Apache CloudStack integration

– libvirt RBD storage pool support

– PHP and Java bindings for librados

PCextreme?

● Founded in 2004

● Medium-sized ISP in the Netherlands

● 45,000 customers

● Started as a shared hosting company

● Datacenter in Amsterdam

What is AuroraObjects?

● Under the name “Aurora”, my hosting company PCextreme B.V. offers two services:

– AuroraCompute, a CloudStack-based public cloud backed by Ceph's RBD

– AuroraObjects, a public object store using Ceph's RADOS Gateway

● AuroraObjects is a public RADOS Gateway service (S3 only) running in production

The RADOS Gateway (RGW)

● Serves objects using either Amazon's S3 or OpenStack's Swift protocol

● All objects are stored in RADOS; the gateway is just an abstraction layer between HTTP/S3 and RADOS
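
Because the gateway speaks plain S3, any stock S3 client works against it. As a minimal sketch (the hostname below is a placeholder, not our actual endpoint), an s3cmd configuration only needs to point at the gateway:

# ~/.s3cfg excerpt – rgw.example.com is a placeholder for the RGW endpoint
host_base   = rgw.example.com
host_bucket = %(bucket)s.rgw.example.com
access_key  = <your access key>
secret_key  = <your secret key>

After that, for example, s3cmd ls s3://mybucket talks to RGW exactly as it would to Amazon S3.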

The RADOS Gateway

Our ideas

● We wanted to cache frequently accessed objects using Varnish

– Only possible with anonymous clients

● SSL should be supported

● Storage shared between the Compute and Objects services

● 3x replication

Varnish

● A caching reverse HTTP proxy

– Very fast: up to 100k requests/s

– Configurable using the Varnish Configuration Language (VCL)

– Used by Facebook and eBay

● Not a part of Ceph, but can be used with the RADOS Gateway
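
To make the “anonymous clients only” point concrete, a minimal vcl_recv sketch (Varnish 3 syntax, illustrative only and not our production VCL) could cache anonymous GET/HEAD requests and pass everything that carries S3 credentials straight to RGW:

# Illustrative sketch: cache only anonymous GET/HEAD requests,
# pass anything with an Authorization header or a signed query string.
sub vcl_recv {
    if (req.request != "GET" && req.request != "HEAD") {
        return (pass);
    }
    if (req.http.Authorization || req.url ~ "(?i)[?&]Signature=") {
        return (pass);
    }
    return (lookup);
}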

The Gateways

● SuperMicro 1U

– AMD Opteron 6200 series CPU

– 128GB RAM

● 20Gbit LACP trunk

● 4 nodes

● Varnish runs locally with RGW on each node

– Uses the RAM to cache objects

The Ceph cluster

● SuperMicro 2U chassis

– AMD Opteron 4334 CPU

– 32GB RAM

– Intel S3500 80GB SSD for the OS

– Intel S3700 200GB SSD for journaling

– 6x Seagate 3TB 7200RPM drives for OSDs

● 2Gbit LACP trunk

● 18 nodes

● ~320TB of raw storage

Our problems

● When we cache objects in Varnish, they don't show up in the usage accounting of the RGW

– The HTTP request never reaches RGW

● When an object changes we have to purge all caches to maintain cache consistency

– A user might change an ACL or modify an object with a PUT request

● We wanted to make cached requests cheaper than non-cached requests

Our solution: Logstash

● All requests go from Varnish into Logstash and then into ElasticSearch

– From ElasticSearch we do the usage accounting

● When Logstash sees a PUT, POST or DELETE request, it makes a local request which sends out a multicast to all other RGW nodes to purge that specific object

● We also store bucket storage usage in ElasticSearch so we have an average over the month
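
The purge handler itself is not shown on the slides; as a rough sketch, each node's Varnish could accept such an invalidation with a standard purge pattern like the one below (Varnish 3 syntax, illustrative only; the ACL range is a placeholder for the gateway nodes):

# Illustrative purge handler: trusted peers may invalidate one object.
acl purgers {
    "localhost";
    "10.0.0.0"/24;    # placeholder for the RGW/Varnish nodes
}

sub vcl_recv {
    if (req.request == "PURGE") {
        if (!client.ip ~ purgers) {
            error 405 "Not allowed";
        }
        ban("req.http.host == " + req.http.host + " && req.url == " + req.url);
        error 200 "Purged";
    }
}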


LogStash and ElasticSearch

● varnishncsa → logstash → redis → elasticsearch

input {
    pipe {
        command => "/usr/local/bin/varnishncsa.logstash"
        type => "http"
    }
}

● And we simply execute varnishncsa:

varnishncsa -F '%{VCL_Log:client}x %{VCL_Log:proto}x %{VCL_Log:authorization}x %{Bucket}o %m %{Host}i %U %b %s %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x'
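
The redis and elasticsearch legs of the pipeline are not shown on the slide; a minimal Logstash sketch (hostnames and key names are placeholders, not our production configuration) would be a shipper writing to Redis and an indexer reading from it:

# Shipper (runs next to Varnish/RGW): push parsed events into Redis.
output {
    redis {
        host      => "redis.example.com"   # placeholder
        data_type => "list"
        key       => "logstash"
    }
}

# Indexer: pull the events from Redis and store them in ElasticSearch.
input {
    redis {
        host      => "redis.example.com"   # placeholder
        data_type => "list"
        key       => "logstash"
    }
}
output {
    elasticsearch {
        host => "es.example.com"           # placeholder
    }
}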

%{Bucket}o?

● With %{<header>}o you can display the value of the response header <header>:

– %{Server}o: Apache 2

– %{Content-Type}o: text/html

● We patched RGW (the patch is in master) so that it can optionally return the bucket name in the response:

200 OK

Connection: close

Date: Tue, 25 Feb 2014 14:42:31 GMT

Server: AuroraObjects

Content-Length: 1412

Content-Type: application/xml

Bucket: "ceph"

X-Cache-Hit: No

● Setting 'rgw expose bucket = true' in ceph.conf makes RGW return the Bucket header
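
In ceph.conf this option goes into the RGW client section, for example (the section name is an assumption and depends on how your gateway instance is named):

[client.radosgw.gateway]    ; example section name – adjust to your RGW instance
    rgw expose bucket = true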

Usage accounting

● We only query RGW for storage usage and also store that in ElasticSearch

● ElasticSearch is used for all traffic accounting

– Allows us to differentiate between cached and non-cached traffic

Back to Ceph: CRUSHMap

● A good CRUSHMap design should reflect the physical topology of your Ceph cluster

– All machines have a single power supply

– The datacenter has an A and a B power circuit

● We use an STS (Static Transfer Switch) to create a third power circuit

● With CRUSH we store each replica on a different power circuit

– When a circuit fails, we lose only one third of the Ceph cluster and still have two of the three replicas

– Each power circuit has its own switching / network

The CRUSHMap

type 7 powerfeed

host ceph03 {
    alg straw
    hash 0
    item osd.12 weight 1.000
    item osd.13 weight 1.000
    ..
}

powerfeed powerfeed-a {
    alg straw
    hash 0
    item ceph03 weight 6.000
    item ceph04 weight 6.000
    ..
}

root ams02 {
    alg straw
    hash 0
    item powerfeed-a
    item powerfeed-b
    item powerfeed-c
}

rule powerfeed {
    ruleset 4
    type replicated
    min_size 1
    max_size 3
    step take ams02
    step chooseleaf firstn 0 type powerfeed
    step emit
}

The CRUSHMap

Testing the CRUSHMap

● With crushtool you can test your CRUSHMap

● $ crushtool -c ceph.zone01.ams02.crushmap.txt -o /tmp/crushmap

● $ crushtool -i /tmp/crushmap --test --rule 4 --num-rep 3 --show-statistics

● This shows you the result of the CRUSHMap:

rule 4 (powerfeed), x = 0..1023, numrep = 3..3

CRUSH rule 4 x 0 [36,68,18]

CRUSH rule 4 x 1 [21,52,67]

..

CRUSH rule 4 x 1023 [30,41,68]

rule 4 (powerfeed) num_rep 3 result size == 3: 1024/1024

● Manually verify those locations are correct

A summary

● We cache anonymously accessed objects with Varnish

– Allows us to process thousands of requests per second

– Saves us I/O on the OSDs

● We use LogStash and ElasticSearch to store all requests and do usage accounting

● With CRUSH we store each replica on a different power circuit

Resources

● LogStash: http://www.logstash.net/

● ElasticSearch: http://www.elasticsearch.net/

● Varnish: http://www.varnish-cache.org/

● CRUSH: http://ceph.com/docs/master/

● E-Mail: wido@42on.com

● Twitter: @widodh
