
OPENSTACK AT 99.999% AVAILABILITY WITH CEPH

Danny Al-Gaaf (Deutsche Telekom)
Deutsche OpenStack Tage 2016 - Cologne

Overview

● Motivation
● Availability and SLAs
● Data centers
  ○ Setup and failure scenarios
● OpenStack and Ceph
  ○ Architecture and Critical Components
  ○ HA setup
  ○ Quorum?
● OpenStack and Ceph == HA?
  ○ Failure scenarios
  ○ Mitigation
● Conclusions


Motivation

NFV Cloud @ Deutsche Telekom

● Data center design
  ○ Backend DCs
    ■ Few, but classic DCs
    ■ High SLAs for infrastructure and services
    ■ For private/customer data and services
  ○ Frontend DCs
    ■ Small, but many
    ■ Close to the customer
    ■ Lower SLAs, can fail at any time
    ■ NFVs:
      ● Spread over many FDCs
      ● Failures are handled by the services, not the infrastructure
● Run telco core services on OpenStack/KVM/Ceph

Availability

High Availability

● Continuous system availability in case of component failures

● Which availability?
  ○ Server
  ○ Network
  ○ Data center
  ○ Cloud
  ○ Application/Service

● End-to-end availability is the most interesting

Availability   Downtime/year   Classification
99.9%          8.76 hours      high availability
99.99%         52.6 minutes    very high availability
99.999%        5.26 minutes    highest availability
99.9999%       0.526 minutes   disaster tolerant
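The downtime column above follows directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year:

```python
# Downtime per year implied by an availability percentage
# (assumes a 365-day year, matching the table above).

HOURS_PER_YEAR = 365 * 24

def downtime_per_year(availability_pct):
    """Yearly downtime in hours for a given availability in percent."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR

for pct in (99.9, 99.99, 99.999, 99.9999):
    hours = downtime_per_year(pct)
    print(f"{pct}% -> {hours:.3f} h/year ({hours * 60:.2f} min)")
```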

High Availability

● Calculation
  ○ Each component contributes to the service availability (see the sketch below)
    ■ Infrastructure
    ■ Hardware
    ■ Software
    ■ Processes
  ○ Likelihood of disaster and failure scenarios
  ○ Model can get very complex
  ○ Hard to get all the numbers required
● SLAs
  ○ ITIL (IT Infrastructure Library)
  ○ Planned maintenance may be excluded, depending on the SLA
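To make the calculation concrete, a minimal sketch of how component availabilities combine in series and with redundancy; the component values below are illustrative assumptions, not measured figures:

```python
# End-to-end availability estimation (illustrative numbers only).

def series(*availabilities):
    """Components in series: the service works only if all of them work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(availability, replicas):
    """Redundant components: the service fails only if all replicas fail."""
    return 1.0 - (1.0 - availability) ** replicas

# Hypothetical chain: network and power in series with a pool of
# three redundant servers (any single server is sufficient).
network = 0.9999
power = 0.9999
servers = parallel(0.999, replicas=3)

print(f"server pool: {servers:.7f}")
print(f"end to end:  {series(network, power, servers):.7f}")
```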


Data centers

Failure scenarios

● Power outage
  ○ External
  ○ Internal
  ○ Backup UPS/generator
● Network outage
  ○ External connectivity
  ○ Internal
    ■ Cables
    ■ Switches, routers
● Failure of:
  ○ Cooling
  ○ Server or component
  ○ Software services


Failure scenarios

● Human error
  ○ Misconfiguration
  ○ Accidents
  ○ Emergency power-off
  ○ Often the leading cause of outages
● Disaster
  ○ Fire
  ○ Flood
  ○ Earthquake
  ○ Plane crash
  ○ Nuclear accident


Data Center Tiers


Mitigation

● Identify potential SPoFs
● Use redundant components
● Careful planning
  ○ Network design (external / internal)
  ○ Power management (external / internal)
  ○ Fire suppression
  ○ Disaster management
  ○ Monitoring
● 5-nines on the DC/HW level is hard to achieve
  ○ Tier IV is often too expensive (compared with Tier III or III+)
  ○ Even Tier IV does not provide 5-nines
  ○ Requires an HA concept on the cloud and application level


Example: Network

● Spine/leaf architecture
● Redundant
  ○ DC-R
  ○ Spine switches
  ○ Leaf switches (ToR)
  ○ OAM switches
  ○ Firewall
● Server
  ○ Redundant NICs
  ○ Redundant power lines and supplies


Ceph and OpenStack

Architecture: Ceph


Architecture: Ceph Components

● OSDs
  ○ 10s - 1000s per cluster
  ○ One per device (HDD/SSD/RAID group, SAN, …)
  ○ Store objects
  ○ Handle replication and recovery
● MONs
  ○ Maintain cluster membership and state
  ○ Use the Paxos protocol to establish quorum consensus
  ○ Small, lightweight
  ○ Odd number
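A minimal client-side sketch using the python-rados bindings, assuming python-rados is installed and /etc/ceph/ceph.conf plus a keyring point at a reachable cluster; the MONs hand out the cluster maps, the OSDs then serve the actual object I/O:

```python
import rados

# Connect as a client: contacts the MONs first to fetch the cluster maps.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    print("fsid: ", cluster.get_fsid())
    print("pools:", cluster.list_pools())
    print("stats:", cluster.get_cluster_stats())
finally:
    cluster.shutdown()
```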


Architecture: Ceph and OpenStack


HA - Critical Components

Which services need to be HA?

● Control plane
  ○ Provisioning, management
  ○ API endpoints and services
  ○ Admin nodes
  ○ Control nodes
● Data plane
  ○ Steady states
  ○ Storage
  ○ Network


HA Setup

● Stateless services
  ○ No dependency between requests
  ○ After the reply, no further attention is required
  ○ API endpoints (e.g. nova-api, glance-api, ...) or nova-scheduler
● Stateful services
  ○ An action typically consists of multiple requests
  ○ Subsequent requests depend on the results of earlier requests
  ○ Databases, RabbitMQ


OpenStack HA


Quorum?

● Required to decide which cluster partition/member is primary, to prevent data/service corruption
● Examples:
  ○ Databases
    ■ MariaDB / Galera, MongoDB, Cassandra
  ○ Pacemaker/Corosync
  ○ Ceph monitors
    ■ Paxos
    ■ Odd number of MONs required
    ■ At least 3 MONs for HA, simple majority (2:3, 3:5, 4:7, …) (see the check below)
    ■ Without quorum:
      ● No changes to cluster membership (e.g. adding new MONs/OSDs)
      ● Clients can't connect to the cluster
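A small check of the current MON quorum, sketched by shelling out to the stock ceph CLI (assumes the CLI and admin credentials are available on the host; field names follow the JSON output of `ceph quorum_status`):

```python
import json
import subprocess

# Ask the cluster for its quorum status and report who is in quorum.
out = subprocess.check_output(["ceph", "quorum_status", "--format", "json"])
status = json.loads(out)

mons = [m["name"] for m in status["monmap"]["mons"]]
in_quorum = status["quorum_names"]

print("monitors:   ", mons)
print("in quorum:  ", in_quorum)
print("leader:     ", status["quorum_leader_name"])
print("majority ok:", len(in_quorum) > len(mons) // 2)
```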


OpenStack and Ceph == HA?

SPoF

● OpenStack HA
  ○ No SPoF assumed
● Ceph
  ○ No SPoF assumed
  ○ Availability of RBDs is critical to VMs
  ○ Availability of RadosGW can easily be managed via HAProxy
● What about failures at a higher level?
  ○ Data center cores or fire compartments
  ○ Network
    ■ Physical
    ■ Misconfiguration
  ○ Power

Setup - Two Rooms


Failure scenarios - FC fails


Failure scenarios - FC fails


Failure scenarios - Split brain


● Ceph
  ○ Quorum selects B
  ○ Storage in A stops
● OpenStack HA
  ○ Selects B
● VMs in B still running
● Best-case scenario

Failure scenarios - Split brain


● Ceph
  ○ Quorum selects B
  ○ Storage in A stops
● OpenStack HA
  ○ Selects A
● VMs in A and B stop working
● Worst-case scenario

Other issues

● Replica distribution
  ○ Two-room setup:
    ■ 2 or 3 replicas carry the risk of having only one replica left
    ■ Would require 4 replicas (2:2)
      ● Reduced performance
      ● Increased traffic and costs
  ○ Alternative: erasure coding
    ■ Reduced performance, less space required
● Spare capacity
  ○ The remaining room requires spare capacity to restore
  ○ Depends on
    ■ Failure/restore scenario
    ■ Replication vs. erasure coding (see the sketch below)
  ○ Costs
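A small sketch of the raw-capacity overhead behind these points; the erasure-coding profile (k=4, m=2) is only an example, not a recommendation:

```python
# Raw storage needed per usable TB: replication vs. erasure coding.

def replication_factor(replicas):
    # Every object is stored 'replicas' times.
    return replicas

def ec_factor(k, m):
    # Each object is split into k data chunks plus m coding chunks.
    return (k + m) / k

for label, factor in [
    ("3x replication", replication_factor(3)),
    ("4x replication (2:2 across two rooms)", replication_factor(4)),
    ("erasure coding k=4, m=2 (example)", ec_factor(4, 2)),
]:
    print(f"{label:40s} -> {factor:.2f} TB raw per usable TB")
```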

Mitigation - Three FCs


● Third FC/failure zone hosting all services

● Usually higher costs

● More resistant to failures

● Better replica distribution

● More east/west traffic

Mitigation - Quorum Room


● Most DCs have backup rooms

● Only a few servers needed to host quorum-related services

● Less cost intensive

● Mitigates split brain between FCs (see the sketch below)
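A toy sketch of why the quorum room helps: with MONs spread over both FCs plus the quorum room, losing any single room still leaves a majority. The 2+2+1 placement is an example assumption, not a prescription:

```python
# Which single-room failures still leave a MON majority?
# Example placement: 2 MONs in FC A, 2 in FC B, 1 in the quorum room Q.
placement = {"A": 2, "B": 2, "Q": 1}
total = sum(placement.values())

for room, lost in placement.items():
    remaining = total - lost
    has_quorum = remaining > total // 2
    print(f"room {room} down: {remaining}/{total} MONs left "
          f"-> quorum {'kept' if has_quorum else 'LOST'}")
```

With only two rooms (e.g. a 3+2 split), the room holding the larger half always takes the quorum down with it, which is exactly the split-brain case shown earlier.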

Mitigation - Applications: First Rule


Mitigation - Applications: Third Rule


Mitigation - Applications: Third Rule


Mitigation - Applications: Pets vs Cattle


Mitigation - Failure-tolerant applications


● The DC tier level is not the most relevant factor

● Applications must build their own cluster mechanisms on top of the DC
  → increases the service availability significantly

● Data replication must be done across multiple regions

● In case of a disaster, traffic goes to the remaining DCs

Mitigation - Federated Object Stores


● Use object storage for persistent data
● Synchronize and replicate across multiple DCs, sync in background

Open issues:
● Replication of databases
● Applications:
  ○ Need to support object storage (see the sketch below)
  ○ Need to support regions/zones
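A minimal sketch of an application persisting its state through the S3 API that radosgw exposes; the endpoint, bucket and credentials are placeholders, and multi-site replication of the bucket to the other DCs is assumed to run in the background:

```python
import boto3

# Talk to the local radosgw endpoint via the S3-compatible API.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",   # hypothetical RGW endpoint
    aws_access_key_id="ACCESS_KEY",               # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="app-state")
s3.put_object(Bucket="app-state", Key="orders/42.json", Body=b'{"state": "paid"}')
print(s3.get_object(Bucket="app-state", Key="orders/42.json")["Body"].read())
```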

Mitigation - Outlook

● “Compute follows Storage”
  ○ Use RBDs as fencing devices in the OpenStack HA setup
  ○ Extend Ceph MONs
    ■ Include information about physical placement, similar to the CRUSH map
    ■ Enable the HA setup to monitor/query quorum decisions and map them to the physical layout
● Passive standby Ceph MONs to ease deployment of MONs if quorum fails
  ○ http://tracker.ceph.com/projects/ceph/wiki/Passive_monitors
● Generic quorum service/library?

Conclusions

Conclusions

● OpenStack and Ceph provide HA if carefully planned
  ○ Be aware of potential failure scenarios!
  ○ All quorum decisions must be in sync
  ○ A third room must be used
  ○ Replica distribution and spare capacity must be considered
  ○ Ceph needs more extensive quorum information
● The target for five 9s is E2E
  ○ Five 9s on the data center level is very expensive
  ○ NO PETS, NO PETS, NO PETS !!!
  ○ Distribute applications or services over multiple DCs


Get involved!

● Ceph
  ○ https://ceph.com/community/contribute/
  ○ [email protected]
  ○ IRC (OFTC):
    ■ #ceph
    ■ #ceph-devel
● OpenStack
  ○ Cinder, Glance, Manila, ...


[email protected]

dalgaaf

blog.bisect.de

@dannnyalgaaf

linkedin.com/in/dalgaaf

xing.com/profile/Danny_AlGaaf

Danny Al-Gaaf
Senior Cloud Technologist

Q&A - THANK YOU!