Ceph @ CSC: How Do We Do It
#whoami
Karan Singh
System Specialist, Cloud Storage
CSC – IT Center for Science, FINLAND
• Author of Learning Ceph – Packt Publishing, 2015
• Author of Ceph Cookbook – Packt Publishing, 2016
• Technical reviewer for Mastering Ceph – Packt Publishing, 2016
• www.ksingh.co.in - Tune in for my blogs
CSC-IT Center For Science
• Founded in 1971
• Finnish non-profit organization, funded by the Ministry of Education
• Connected Finland to the Internet in 1988
• Most powerful academic computing facility in the Nordics
• ISO 27001:2013 certified
• Public cloud offering: Pouta Cloud Services
More Information
o https://www.csc.fi/
o https://research.csc.fi/cloud-computing
CSC Cloud Offering
• Pouta Cloud Service [IaaS]
o cPouta – public cloud, general purpose
o ePouta – public cloud, purpose-built for sensitive data
• Built using OpenStack
• Uses upstream OpenStack packages, no distribution
• Storage: both Ceph and non-Ceph
Our Need for Ceph
• To build our own storage – not to buy a black box
• Software-defined, runs on commodity hardware
• Unified – block, object, (file)
• Tightly integrates with OpenStack
• Open source, no vendor lock-in
• Scalable and highly available
Our Need for Ceph
• Remove the SPOF for storage in OpenStack
• OpenStack alone is too complex – let's make it a bit less so
o By using Ceph for storage needs
• To stay up to date with the community
o Ceph is the most used storage backend for OpenStack
• Need for object storage
Storage Complexity
[Diagram: today's fragmented setup – storage for Cinder from an enterprise array (LUNs behind Gateway-1 / Gateway-2), storage for Nova instances on the local disks of the compute nodes, storage for Glance on NFS, spread across the OpenStack compute and controller nodes]
This is why we chose Ceph
• One storage to rule them all
• Goes hand in hand with OpenStack (see the config sketch below)
• Supports instance live migration and copy-on-write (CoW) clones
• Bonus for using Ceph
o OpenStack Manila (shared filesystem) – on the way
http://www.slideshare.net/ircolle/what-is-a-ceph-and-why-do-i-care-openstack-storage-colorado-openstack-meetup-october-14-2014
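A minimal sketch of the kind of Glance/Cinder/Nova-to-RBD wiring referred to above, assuming the commonly documented pool names (images, volumes, vms) and client users (glance, cinder) – not CSC's actual configuration; section names vary a bit by OpenStack release:

# glance-api.conf
default_store = rbd
rbd_store_pool = images
rbd_store_user = glance
rbd_store_ceph_conf = /etc/ceph/ceph.conf
show_image_direct_url = True          # lets Cinder/Nova make CoW clones of Glance images

# cinder.conf
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>

# nova.conf
[libvirt]
images_type = rbd                     # boot instances from RBD, enables live migration
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>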
Ceph Infrastructure
Production Cluster
• 10 x HP DL380
o E5-2450, 8c, 2.10 GHz
o 24 GB memory
o 12 x 3 TB SATA
o 2 x 40 GbE
• Ceph Firefly 0.80.8
• CentOS 6.6 (kernel 3.10.69)
• 360 TB raw

Test Cluster
• 5 x HP DL380
o E5-2450, 8c, 2.10 GHz
o 24 GB memory
o 12 x 3 TB SATA
o 2 x 40 GbE
• Ceph Hammer 0.94.3
• CentOS 6.6 (kernel 3.10.69)
• 180 TB raw

Development Cluster
• 4 x HP SL4540
o 2 x E5-2470, 8c, 2.30 GHz
o 192 GB memory
o 60 x 4 TB SATA
o 2 x 10 GbE
• Ceph Hammer 0.94.3
• CentOS 6.6 (kernel 3.10.69)
• 960 TB raw
ePouta Cloud Service
Ceph Infrastructure (cont.)
Pre-Production Cluster
• 4 x HP SL4540
o 2 x E5-2470, 8c, 2.30 GHz
o 192 GB memory
o 60 x 4 TB SATA
o 2 x 10 GbE
• Object Storage Service
• Ceph Firefly 0.80.10
• CentOS 6.5 (kernel 2.6.32)
• 240 OSDs / 870 TB available

cPouta Cloud Service

Fujitsu Eternus CD10000
• 4 x Primergy RX300 S8
o 2 x E5-2640, 8c, 2.00 GHz
o 128 GB memory
o 1 x 10 GbE / 1 x 40 GbE
o 15 x 900 GB SAS 2.5" 10K
o 1 x 800 GB Fusion ioDrive2 PCIe SSD
• 4 x Eternus JX40 JBOD
o 24 x 900 GB SAS 2.5" 10K
• Ceph Firefly 0.80.7
• CentOS 6.6 (kernel 3.10.42)
• 156 OSDs / 126 TB available
Proof of Concept
Our Toolkit for Ceph
• OS deployment, package mgmt.
o Spacewalk
• Ansible
o End-to-end system configuration
o Network, kernel, packages, OS tuning, NTP, metric collection, monitoring, central logging, etc.
o Entire Ceph deployment
o System / Ceph administration
• Performance metrics & dashboards
o collectd, Graphite, Grafana
• Monitoring and log management
o Opsview, ELK stack
• Version control
o Git, GitHub
Live Demo
Near Future
• CSC Espoo DC [ePouta cloud storage]
o Next 8-12 months → 3 PB raw
o Introduce a storage POD layout for scalability & a better failure domain
o Dedicated monitor nodes
o SSD journals
o Erasure coding (sketch below)
• CSC Kajaani DC [cPouta cloud storage]
o Early next year → add new capacity ~850 TB (total capacity ~1.8 PB raw)
o Enable full support for OpenStack (Nova, Glance, Cinder, Swift)
o Erasure coding
• Miscellaneous
o Multi-DC replication [Espoo – Kajaani]
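A rough sketch of what the erasure-coding plan could look like on Hammer – purely illustrative; the profile name, k/m values and PG count are assumptions, not CSC's actual plan:

# define an erasure-code profile: 4 data + 2 coding chunks, host as failure domain
ceph osd erasure-code-profile set ecprofile k=4 m=2 ruleset-failure-domain=host
ceph osd erasure-code-profile get ecprofile

# create an erasure-coded pool that uses the profile
ceph osd pool create ecpool 1024 1024 erasure ecprofile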
Long Term
Build a Ceph environment that is:
• Multi-petabyte (~10 PB usable)
• Hyper-scalable
• Multi-rack fault tolerant

Storage PODs
• Design on paper currently
• Still evaluating the best approach
• Interested to know what others are doing
Disks, Nodes, Racks
[Diagram: disks grouped into storage nodes, storage nodes grouped into racks]
More Racks … Hyper Scale
[Diagram: many racks together forming one Ceph cluster]
How to manage effectively?
Storage POD
• A storage POD is a group of racks
• Ease of management in a hyper-scale environment
• Scalable, modular design
• Can sustain multi-rack failure
• CRUSH failure domain changes required (see the CRUSH rule sketch below)
o Primary copy → one POD
o Secondary & tertiary copies → the other two PODs
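A hedged sketch of how the POD failure domain could be expressed in CRUSH – the bucket type name "pod" and the rule below are illustrative, not the final CSC design:

# decompile the CRUSH map, add a 'pod' bucket type between 'rack' and 'root',
# group the rack buckets under pod buckets, then add a rule along these lines
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

rule pod_replicated {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type pod          # pick as many distinct PODs as replicas
    step chooseleaf firstn 1 type host     # one host (and one OSD) per chosen POD
    step emit
}
# the first POD chosen ends up holding the primary copy, the others the replicas

# recompile and inject the modified map
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new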
Storage POD in Action
[Diagram: one Ceph cluster spanning POD-1, POD-2 and POD-3, each POD holding one rack]
Scaling Up – Multi Rack
[Diagram: the same three PODs, each now grown to multiple racks within the one Ceph cluster]
Scaling Up … Even More Racks
[Diagram: the PODs keep growing rack by rack within the same Ceph cluster]
Scaling Up … Several PODs
[Diagram: additional PODs added alongside the existing ones, all in one Ceph cluster]
Some Recommendations
• Monitor nodes
o Use dedicated monitor nodes, avoid sharing them with OSDs
o Use SSDs for the Ceph monitor LevelDB store
• OSD nodes
o Avoid overloading your SSD journals, you might not get what you expect (see the journal sketch below)
o Node preference:
o #1 Thin node (10-16 disks)
o #2 Thick node (16-30 disks)
o #3 Fat node (> 30 disks)
o If using fat nodes, use several of them
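A minimal sketch of putting OSD journals on a shared SSD with ceph-disk (Firefly/Hammer era); the device names and journal size are examples only:

# ceph.conf – size of each journal partition, in MB
[osd]
osd_journal_size = 10240

# prepare an OSD with data on a SATA disk and its journal on the shared SSD;
# ceph-disk carves a new journal partition out of the SSD each time
ceph-disk prepare /dev/sdd /dev/sdb
ceph-disk activate /dev/sdd1

# rule of thumb: don't hang more journals on one SSD than its sustained write
# bandwidth can feed, otherwise the SSD itself becomes the bottleneck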
Operational Experience
• Use dedicated disks for the OS, OSD data & OSD journal (journal can be shared)
• Plan your requirements well, choose the PG count wisely for a production cluster (sizing sketch below)
o Increasing the PG count is one of the most intensive operations
o Decreasing the PG count is not allowed
• Ceph version upgrades / rolling upgrades work like a charm
• For thick and fat OSD nodes, tune the kernel:
o kernel.pid_max=4194303
o kernel.threads-max=200000
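A hedged example of the usual PG sizing rule of thumb (total PGs ≈ OSDs × 100 / replica count, rounded up to a power of two) and of growing a pool later; the pool name and numbers are illustrative:

# 120 OSDs, 3 replicas: 120 * 100 / 3 = 4000 → round up to 4096 PGs
ceph osd pool create volumes 4096 4096

# the PG count can only be increased, never decreased, and each step triggers
# heavy data movement – raise pg_num first, then pgp_num
ceph osd pool set volumes pg_num 8192
ceph osd pool set volumes pgp_num 8192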
Operational Experience
• If you are seeing blocked ops / slow OSDs / slow requests, don't worry, you are not alone
o ceph health detail -> find the OSD -> find the node -> check EVERYTHING on that node -> mark it out (commands sketched below)
o If the problem is on most of the nodes -> check the NETWORK
o Interface errors, MTU, configuration, network blocking, architecture, switch logs, removing an interface, bonding
o Even a cable change worked for us (a switch firmware upgrade left the old cable type unsupported)
• Tune CRUSH for optimal parameters
o # ceph osd crush tunables optimal
o Caution: this will trigger a lot of data movement
• Ceph recovery/backfilling can starve your clients of I/O, you may want to throttle it down:
ceph tell osd.\* injectargs '--osd_recovery_max_active 1 --osd_recovery_max_single_start 1 --osd_recovery_op_priority 50 --osd_recovery_max_chunk 1048576 --osd_recovery_threads 1 --osd_max_backfills 1 --osd_backfill_scan_min 4 --osd_backfill_scan_max 8'
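A sketch of the blocked-request hunt described above, using standard Ceph and OS commands; OSD id 42 is a placeholder:

# see which requests are blocked and on which OSDs
ceph health detail | grep -i blocked

# map an OSD id to its host and its location in the CRUSH map
ceph osd find 42

# on that node: check disks, controller, memory, network counters, dmesg, SMART, ...
# if the disk or node is at fault, take the OSD out so its data re-replicates elsewhere
ceph osd out 42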
[Graph slide: the incident timeline, annotated #1 and #2, ending in Health OK]
Operational Experience
• Increasing the filestore max_sync and min_sync values helped to a certain extent (ceph.conf sketch below)
o filestore_max_sync_interval = 140
o filestore_min_sync_interval = 100
• A firmware upgrade on the network switches, together with replacing the physical network cables, finally fixed the issue
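A minimal sketch of where those values would live and how they could be applied without restarting the OSDs; the values are the ones quoted above, everything else is illustrative:

# /etc/ceph/ceph.conf
[osd]
filestore_max_sync_interval = 140
filestore_min_sync_interval = 100

# apply on the running OSDs without a restart
ceph tell osd.\* injectargs '--filestore_max_sync_interval 140 --filestore_min_sync_interval 100'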
Advice: always check your network TWICE!!!
THANK YOU