Ceph @ CSC: How Do We Do It
#whoami
Karan Singh
System Specialist, Cloud Storage
CSC – IT Center for Science, FINLAND
• Author of Learning Ceph – Packt Publishing, 2015
• Author of Ceph Cookbook – Packt Publishing, 2016
• Technical reviewer for Mastering Ceph – Packt Publishing, 2016
• www.ksingh.co.in - Tune in for my blogs
CSC-IT Center For Science
• Founded in 1971
• Finnish non-profit organization, funded by the Ministry of Education
• Connected Finland to the Internet in 1988
• Most powerful academic computing facility in the Nordics
• ISO 27001:2013 certified
• Public cloud offering: Pouta Cloud Services
More Information
o https://www.csc.fi/
o https://research.csc.fi/cloud-computing
CSC Cloud Offering
• Pouta Cloud Service [IaaS]
o cPouta – public cloud, general purpose
o ePouta – public cloud, purpose-built for sensitive data
• Built using OpenStack
• Uses upstream OpenStack packages, no distribution
• Storage: both Ceph and non-Ceph
Our Need for Ceph
• To build our own storage – not to buy a black box
• Software-defined, runs on commodity hardware
• Unified – block, object, (file)
• Tightly integrates with OpenStack
• Open source, no vendor lock-in
• Scalable and highly available
Our Need for Ceph
• Remove the SPOF for storage in OpenStack
• OpenStack alone is too complex – let's make it a bit less so
o By using Ceph for storage needs
• To stay up to date with the community
o Ceph is the most used storage backend for OpenStack
• Need for object storage
Storage Complexity
[Diagram: today's fragmented setup – storage for Cinder from an enterprise array (LUNs behind Gateway-1 / Gateway-2), storage for Nova instances on the local disks of the compute nodes, storage for Glance on NFS, spread across the OpenStack compute and controller nodes]
This is why we chose Ceph
• One storage to rule them all
• Goes hand in hand with OpenStack (see the config sketch below)
• Supports instance live migration and copy-on-write (CoW) clones
• Bonus for using Ceph
o OpenStack Manila (shared filesystem) – on the way
http://www.slideshare.net/ircolle/what-is-a-ceph-and-why-do-i-care-openstack-storage-colorado-openstack-meetup-october-14-2014
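A minimal sketch of the kind of Glance/Cinder/Nova-to-RBD wiring referred to above, assuming the commonly documented pool names (images, volumes, vms) and client users (glance, cinder) – not CSC's actual configuration; section names vary a bit by OpenStack release:

# glance-api.conf
default_store = rbd
rbd_store_pool = images
rbd_store_user = glance
rbd_store_ceph_conf = /etc/ceph/ceph.conf
show_image_direct_url = True          # lets Cinder/Nova make CoW clones of Glance images

# cinder.conf
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>

# nova.conf
[libvirt]
images_type = rbd                     # boot instances from RBD, enables live migration
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>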
Ceph Infrastructure
Production Cluster
• 10 x HP DL380
o E5-2450, 8c, 2.10 GHz
o 24 GB memory
o 12 x 3 TB SATA
o 2 x 40 GbE
• Ceph Firefly 0.80.8
• CentOS 6.6 (kernel 3.10.69)
• 360 TB raw

Test Cluster
• 5 x HP DL380
o E5-2450, 8c, 2.10 GHz
o 24 GB memory
o 12 x 3 TB SATA
o 2 x 40 GbE
• Ceph Hammer 0.94.3
• CentOS 6.6 (kernel 3.10.69)
• 180 TB raw

Development Cluster
• 4 x HP SL4540
o 2 x E5-2470, 8c, 2.30 GHz
o 192 GB memory
o 60 x 4 TB SATA
o 2 x 10 GbE
• Ceph Hammer 0.94.3
• CentOS 6.6 (kernel 3.10.69)
• 960 TB raw
ePouta Cloud Service
Ceph Infrastructure (cont.)
Pre-Production Cluster
• 4 x HP SL4540
o 2 x E5-2470, 8c, 2.30 GHz
o 192 GB memory
o 60 x 4 TB SATA
o 2 x 10 GbE
• Object Storage Service
• Ceph Firefly 0.80.10
• CentOS 6.5 (kernel 2.6.32)
• 240 OSDs / 870 TB available

cPouta Cloud Service

Fujitsu Eternus CD10000
• 4 x Primergy RX300 S8
o 2 x E5-2640, 8c, 2.00 GHz
o 128 GB memory
o 1 x 10 GbE / 1 x 40 GbE
o 15 x 900 GB SAS 2.5" 10K
o 1 x 800 GB Fusion ioDrive2 PCIe SSD
• 4 x Eternus JX40 JBOD
o 24 x 900 GB SAS 2.5" 10K
• Ceph Firefly 0.80.7
• CentOS 6.6 (kernel 3.10.42)
• 156 OSDs / 126 TB available
Proof of Concept
Our Toolkit for Ceph
• OS deployment, package mgmt.
o Spacewalk
• Ansible
o End-to-end system configuration
o Network, kernel, packages, OS tuning, NTP, metric collection, monitoring, central logging, etc.
o Entire Ceph deployment
o System / Ceph administration
• Performance metrics & dashboards
o collectd, Graphite, Grafana
• Monitoring and log management
o Opsview, ELK stack
• Version control
o Git, GitHub
Live Demo
Near Future
• CSC Espoo DC [ePouta cloud storage]
o Next 8-12 months → 3 PB raw
o Introduce a storage POD layout for scalability & a better failure domain
o Dedicated monitor nodes
o SSD journals
o Erasure coding (sketch below)
• CSC Kajaani DC [cPouta cloud storage]
o Early next year → add new capacity ~850 TB (total capacity ~1.8 PB raw)
o Enable full support for OpenStack (Nova, Glance, Cinder, Swift)
o Erasure coding
• Miscellaneous
o Multi-DC replication [Espoo – Kajaani]
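A rough sketch of what the erasure-coding plan could look like on Hammer – purely illustrative; the profile name, k/m values and PG count are assumptions, not CSC's actual plan:

# define an erasure-code profile: 4 data + 2 coding chunks, host as failure domain
ceph osd erasure-code-profile set ecprofile k=4 m=2 ruleset-failure-domain=host
ceph osd erasure-code-profile get ecprofile

# create an erasure-coded pool that uses the profile
ceph osd pool create ecpool 1024 1024 erasure ecprofile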
Long Term
Build a Ceph environment that is:
• Multi-petabyte (~10 PB usable)
• Hyper-scalable
• Multi-rack fault tolerant

Storage PODs
• Design on paper currently
• Still evaluating the best approach
• Interested to know what others are doing
Disks, Nodes, Racks
[Diagram: disks grouped into storage nodes, storage nodes grouped into racks]
More Racks … Hyper Scale
[Diagram: many racks together forming one Ceph cluster]
How to manage effectively?
Storage POD
• A storage POD is a group of racks
• Ease of management in a hyper-scale environment
• Scalable, modular design
• Can sustain multi-rack failure
• CRUSH failure domain changes required (see the CRUSH rule sketch below)
o Primary copy → one POD
o Secondary & tertiary copies → the other two PODs
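A hedged sketch of how the POD failure domain could be expressed in CRUSH – the bucket type name "pod" and the rule below are illustrative, not the final CSC design:

# decompile the CRUSH map, add a 'pod' bucket type between 'rack' and 'root',
# group the rack buckets under pod buckets, then add a rule along these lines
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

rule pod_replicated {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type pod          # pick as many distinct PODs as replicas
    step chooseleaf firstn 1 type host     # one host (and one OSD) per chosen POD
    step emit
}
# the first POD chosen ends up holding the primary copy, the others the replicas

# recompile and inject the modified map
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new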
Storage POD in Action
[Diagram: one Ceph cluster spanning POD-1, POD-2 and POD-3, each POD holding one rack]
Scaling Up – Multi Rack
[Diagram: the same three PODs, each now grown to multiple racks within the one Ceph cluster]
Scaling Up … Even More Racks
[Diagram: the PODs keep growing rack by rack within the same Ceph cluster]
Scaling Up … Several PODs
[Diagram: additional PODs added alongside the existing ones, all in one Ceph cluster]
Some Recommendations
• Monitor nodes
o Use dedicated monitor nodes, avoid sharing them with OSDs
o Use SSDs for the Ceph monitor LevelDB store
• OSD nodes
o Avoid overloading your SSD journals, you might not get what you expect (see the journal sketch below)
o Node preference:
o #1 Thin node (10-16 disks)
o #2 Thick node (16-30 disks)
o #3 Fat node (> 30 disks)
o If using fat nodes, use several of them
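A minimal sketch of putting OSD journals on a shared SSD with ceph-disk (Firefly/Hammer era); the device names and journal size are examples only:

# ceph.conf – size of each journal partition, in MB
[osd]
osd_journal_size = 10240

# prepare an OSD with data on a SATA disk and its journal on the shared SSD;
# ceph-disk carves a new journal partition out of the SSD each time
ceph-disk prepare /dev/sdd /dev/sdb
ceph-disk activate /dev/sdd1

# rule of thumb: don't hang more journals on one SSD than its sustained write
# bandwidth can feed, otherwise the SSD itself becomes the bottleneck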
Operational Experience
• Use dedicated disks for the OS, OSD data & OSD journal (journal can be shared)
• Plan your requirements well, choose the PG count wisely for a production cluster (sizing sketch below)
o Increasing the PG count is one of the most intensive operations
o Decreasing the PG count is not allowed
• Ceph version upgrades / rolling upgrades work like a charm
• For thick and fat OSD nodes, tune the kernel:
o kernel.pid_max=4194303
o kernel.threads-max=200000
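A hedged example of the usual PG sizing rule of thumb (total PGs ≈ OSDs × 100 / replica count, rounded up to a power of two) and of growing a pool later; the pool name and numbers are illustrative:

# 120 OSDs, 3 replicas: 120 * 100 / 3 = 4000 → round up to 4096 PGs
ceph osd pool create volumes 4096 4096

# the PG count can only be increased, never decreased, and each step triggers
# heavy data movement – raise pg_num first, then pgp_num
ceph osd pool set volumes pg_num 8192
ceph osd pool set volumes pgp_num 8192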
Operational Experience
• If you are seeing blocked ops / slow OSDs / slow requests, don't worry, you are not alone
o ceph health detail -> find the OSD -> find the node -> check EVERYTHING on that node -> mark it out (commands sketched below)
o If the problem is on most of the nodes -> check the NETWORK
o Interface errors, MTU, configuration, network blocking, architecture, switch logs, removing an interface, bonding
o Even a cable change worked for us (a switch firmware upgrade left the old cable type unsupported)
• Tune CRUSH for optimal parameters
o # ceph osd crush tunables optimal
o Caution: this will trigger a lot of data movement
• Ceph recovery/backfilling can starve your clients of I/O, you may want to throttle it down:
ceph tell osd.\* injectargs '--osd_recovery_max_active 1 --osd_recovery_max_single_start 1 --osd_recovery_op_priority 50 --osd_recovery_max_chunk 1048576 --osd_recovery_threads 1 --osd_max_backfills 1 --osd_backfill_scan_min 4 --osd_backfill_scan_max 8'
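A sketch of the blocked-request hunt described above, using standard Ceph and OS commands; OSD id 42 is a placeholder:

# see which requests are blocked and on which OSDs
ceph health detail | grep -i blocked

# map an OSD id to its host and its location in the CRUSH map
ceph osd find 42

# on that node: check disks, controller, memory, network counters, dmesg, SMART, ...
# if the disk or node is at fault, take the OSD out so its data re-replicates elsewhere
ceph osd out 42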
[Graph slide: the incident timeline, annotated #1 and #2, ending in Health OK]
Operational Experience
• Increasing the filestore max_sync and min_sync values helped to a certain extent (ceph.conf sketch below)
o filestore_max_sync_interval = 140
o filestore_min_sync_interval = 100
• A firmware upgrade on the network switches, together with replacing the physical network cables, finally fixed the issue
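A minimal sketch of where those values would live and how they could be applied without restarting the OSDs; the values are the ones quoted above, everything else is illustrative:

# /etc/ceph/ceph.conf
[osd]
filestore_max_sync_interval = 140
filestore_min_sync_interval = 100

# apply on the running OSDs without a restart
ceph tell osd.\* injectargs '--filestore_max_sync_interval 140 --filestore_min_sync_interval 100'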
Advice: always check your network TWICE!!!
THANK YOU