
Scaling Ceph at CERN - Ceph Day Frankfurt



Dan van der Ster, CERN


1. Scaling Ceph at CERN
Dan van der Ster ([email protected])
Data and Storage Service Group | CERN IT Department

2. CERN's Mission and Tools
- CERN studies the fundamental laws of nature: Why do particles have mass? What is our universe made of? Why is there no antimatter left? What was matter like right after the Big Bang?
- The Large Hadron Collider (LHC): built in a 27km long tunnel, ~200m underground; dipole magnets operated at -271C (1.9K); particles do ~11,000 turns/sec, 600 million collisions/sec.
- Detectors: four main experiments, each the size of a cathedral.
- DAQ systems processing petabytes/sec.

3. Big Data at CERN
- Physics data on CASTOR/EOS: LHC experiments produce ~10GB/s, 25PB/year.
- User data on OpenAFS & DFS: home directories for 30k users, physics analysis development, project spaces for applications.
- Service data on AFS/NFS: databases, admin applications.
- Tape archival with CASTOR/TSM: raw physics outputs, desktop/server backups.

  Service   Size     Files
  OpenAFS   290TB    2.3B
  CASTOR    89.0PB   325M
  EOS       20.1PB   160M

4. IT Evolution at CERN
Cloudifying CERN's IT infrastructure...
- Centrally-managed and uniform hardware: no more service-specific storage boxes.
- OpenStack VMs for most services: building for 100k nodes (mostly for batch processing).
- Attractive desktop storage services: huge demand for a local Dropbox, Google Drive.
- Remote data centre in Budapest: more rack space and power, plus disaster recovery.

...which brings new storage requirements:
- Block storage for OpenStack VMs: images and volumes.
- Backend storage for existing and new services: AFS, NFS, OwnCloud, Data Preservation, ...
- Regional storage: use of our new data centre in Hungary.
- Failure tolerance, data checksumming, easy to operate, security, ...

5. Ceph at CERN

6. 12 racks of disk server quads
(Wiebalck / van der Ster -- Building an organic block storage service at CERN with Ceph)

7. Our 3PB Ceph Cluster
5 monitors:
- Dual Intel Xeon L5640, 24 threads incl. HT
- Dual 1Gig-E NICs (only one connected)
- 2x 2TB Hitachi system disks (RAID-1 mirror)
- 1x 240GB OCZ Deneva 2 for /var/lib/ceph/mon
- 48GB RAM

47 disk servers / 1128 OSDs:
- Dual Intel Xeon E5-2650, 32 threads incl. HT
- Dual 10Gig-E NICs (only one connected)
- 24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
- 3x 2TB Hitachi system disks (triple mirror)
- 64GB RAM

  # df -h /mnt/ceph
  Filesystem   Size  Used  Avail  Use%  Mounted on
  xxx:6789:/   3.1P  173T  2.9P   6%    /mnt/ceph

8. Use-Cases Being Evaluated
1. Images and volumes for OpenStack
2. S3 storage for Data Preservation / public dissemination
3. Physics data storage for archival and/or analysis
#1 is moving into production; #2 and #3 are more exploratory at the moment.

9. OpenStack Volumes & Images
- Glance: using RBD for ~3 months now. The only issue was to increase ulimit -n above 1024 (10k is good).
- Cinder: testing with close colleagues. 126 Cinder volumes attached today; 56TB used.
- Growing number of volumes/images. Usual traffic is ~50-100MB/s with current usage (~idle).

10. RBD for OpenStack Volumes
- Before general availability, we need to test and enable qemu IOPS/bandwidth throttling; otherwise VMs with many IOs can disrupt other users.
- One ongoing issue is that a few clients are getting an (infrequent) segfault of qemu during a VM reboot. It happens on VMs with many attached RBDs, and it is difficult to get a complete (16GB) core dump.

11. CASTOR & XRootD/EOS
- Exploring a RADOS backend for these two HEP-developed file systems: a gateway model, similar to S3 via RADOSGW.
- CASTOR needs raw throughput performance (to feed many tape drives at 250MB/s each); striped reads/writes across many OSDs are important.
- XRootD/EOS may benefit from the highly scalable namespace to store O(billion) objects. Bonus: XRootD also offers HTTP/WebDAV with X509/Kerberos, possibly even FUSE mountable.
- Developments are in early stages.

12. Operations & Lessons Learned

13. Configuration and Deployment
- Dumpling 0.67.7, fully Puppet-ized: automated server deployment, automated OSD replacement.
- Very few custom ceph.conf options:

  mon osd down out interval = 900
  osd pool default size = 3
  osd pool default min size = 1
  osd pool default pg num = 1024
  osd pool default pgp num = 1024
  osd pool default flag hashpspool = true
  osd max backfills = 1
  osd recovery max active = 1

- Experimenting with the filestore wbthrottle: we find that disabling it completely gives better IOPS performance. But don't do this!!!

14. Cluster Activity

15. General Comments
- In these ~7 months of running the cluster, there have been very few problems: no outages, no data losses/corruptions, no unfixable performance issues. It behaves well during stress tests.
- But now we're starting to get real/varied/creative users, and this brings up many interesting issues...
- "No amount of stress testing can prepare you for real users" - Unknown
- (Point being: don't take the next slides to be too negative; I'm just trying to give helpful advice ;)

16. Latency & Slow Requests
- The best latency we can achieve is 20-40ms: slow SATA disks, no SSDs. It is hard to justify SSDs in a multi-PB cluster, but one could in a smaller, limited use-case cluster (e.g. Cinder-only).
- Latency can increase dramatically with heavy usage: don't mix latency-bound and throughput-bound users on the same OSDs.
- Local processes scanning the disks can hurt performance: add /var/lib/ceph to the updatedb PRUNEPATHS.
- If you have slow disks like us, you need to understand your disk IO scheduler. E.g. deadline prefers reads over writes: writes are given a 5-second deadline vs. 500ms for reads!
- Scrubbing!
- Kernel tuning: vm.* sysctls, dirty page flushing, memory reclaiming. Something is flushing the buffers, blocking the OSD processes.
- Slow requests: monitor them, eliminate them.
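The advice above to monitor slow requests can be sketched as a small parser over the health output. This is a hypothetical sketch: the sample text below imitates the Dumpling-era `ceph health detail` wording, which varies between releases, so treat the line format (and the OSD names) as assumptions, not the exact output of any given cluster.

```python
import re

# Hypothetical sample of `ceph health detail` output; the exact wording
# differs between Ceph releases, so this format is an assumption.
SAMPLE = """\
HEALTH_WARN 3 requests are blocked > 32 sec; 2 osds have slow requests
3 ops are blocked > 32.768 sec
1 ops are blocked > 32.768 sec on osd.17
2 ops are blocked > 32.768 sec on osd.203
"""

def slow_requests_by_osd(health_detail: str) -> dict:
    """Return {osd_name: blocked_op_count} parsed from health detail text."""
    counts = {}
    for line in health_detail.splitlines():
        m = re.match(r"(\d+) ops are blocked > [\d.]+ sec on (osd\.\d+)", line)
        if m:
            counts[m.group(2)] = counts.get(m.group(2), 0) + int(m.group(1))
    return counts

print(slow_requests_by_osd(SAMPLE))
```

Feeding such a summary into a monitoring system makes it easy to spot whether slow requests cluster on a few OSDs (a sick disk) or spread across many (an overloaded cluster).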
17. Life with 250 million objects
- Recently, a user decided to write 250 million 1kB objects. Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster being full of RBD images, at least in terms of number of objects.
- It worked: no big problems from holding this many objects.
- Tested single OSD failure: ~7 hours to backfill, including a double-backfill glitch that we're trying to understand.
- But now we want to clean up, and it is not trivial to remove 250M objects! `rados rmpool` generated quite a load when we rm'd a 3-million-object pool (some OSDs were temporarily marked down). Probably due to a mistake in our wbthrottle tuning.

18. Other backfilling issues
- During a backfilling event (draining a whole server), we started observing repeated monitor elections.
- Caused by the mons' LevelDBs being so active that the local SATA disks couldn't keep up. When a mon falls behind, it calls an election. Could be due to LevelDB compaction.
- We moved /var/lib/ceph/mon to SSDs: no more elections during backfilling.
- Avoid double backfilling when taking an OSD out of service: start with `ceph osd crush rm`!! If you mark the OSD out first, then crush rm it, you will compute a new CRUSH map twice, i.e. backfill twice.

19. Fun with CRUSH
- CRUSH is simple yet powerful, so it is tempting to play with the cluster layout.
- But once you have non-zero amounts of data, significant CRUSH changes will lead to massive data movements, which create extra disk load and may disrupt users. Early CRUSH planning is crucial!
- A network switch is a failure domain, so we should configure CRUSH to replicate across switches, right? But (assuming we don't have a private cluster network) that would send all replication traffic via the switch uplinks: bottleneck! Unclear tradeoff between uptime and performance.

20. CRUSH & Data distribution
- CRUSH may give your cluster an uneven data distribution: an OSD's used space will scale with the number of PGs assigned to it.
- After you have designed your cluster, created your pools, and started adding data, check the PG and volume distributions.
- reweight-by-utilization is useful to iron out an uneven PG distribution.
- The hashpspool flag is also important if you have many active pools.
- [Chart: number of OSDs having N PGs, for pool = volumes]

21. RBD Reliability with 3 Replicas
- RBD devices are chunked across thousands of objects: a full 1TB volume is composed of 250,000 4MB objects.
- If any single object is lost, the whole RBD can be considered corrupted (obviously, it depends which blocks are lost!). If you lose an entire PG, you can consider all RBDs to be lost/corrupted.
- Our incorrect & irrational fears: any simultaneous triple disk failure in the cluster would lead to objects being lost and somehow all RBDs would be corrupted; and as we add OSDs to the cluster, the data gets spread wider, so the chances of RBD data loss increase.
- But this is wrong!! The only triple disk failures that can lead to data loss are those combinations actively used by PGs. So having e.g. 4096 PGs for RBDs means that only 4096 combinations out of the ~10^9 possible combinations matter:

  P_loss ≈ N_PGs * (P_diskfailure)^3 / 3!

- We use 4 replicas for the RBD volumes, but this is probably overkill.

22. Trust your clients
- There is no server-side per-client throttling: a few nasty clients can overwhelm an OSD, leading to slow requests for everyone.
- When you have a high load / slow requests, it is not
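The point in slide 20 that an OSD's used space scales with its PG count can be illustrated with a toy simulation. This is not CRUSH itself: it approximates CRUSH's pseudo-random placement with uniform random sampling, and the cluster shape (1128 OSDs, 4096 PGs, 3 replicas) is taken from the slides.

```python
import random
from collections import Counter

# Toy model, not CRUSH: place each PG's replicas on uniformly random OSDs
# and observe how unevenly PGs (and hence data) pile up per OSD.
random.seed(42)
N_OSDS, N_PGS, REPLICAS = 1128, 4096, 3

pgs_per_osd = Counter()
for pg in range(N_PGS):
    for osd in random.sample(range(N_OSDS), REPLICAS):
        pgs_per_osd[osd] += 1

counts = [pgs_per_osd.get(osd, 0) for osd in range(N_OSDS)]
mean = sum(counts) / N_OSDS
print(f"mean PGs/OSD: {mean:.1f}, min: {min(counts)}, max: {max(counts)}")
```

The fullest OSD ends up holding noticeably more PGs than the mean, which is exactly the skew that checking the distribution and running reweight-by-utilization is meant to catch.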
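The loss estimate in slide 21 can be checked numerically. A minimal sketch, using the cluster shape from the slides; the per-disk failure probability here is an illustrative assumption, not a figure from the talk.

```python
from math import comb, factorial

# Numbers from the slides; P_DISK is an illustrative assumption.
N_OSDS = 1128
N_PGS = 4096
REPLICAS = 3
P_DISK = 1e-3  # assumed probability a given disk is down at any moment

# All possible 3-disk combinations vs. the ones actually used by PGs:
total_triples = comb(N_OSDS, REPLICAS)
print(f"possible 3-disk combinations: {total_triples:.3e}")
print(f"fraction that can cause loss: {N_PGS / total_triples:.2e}")

# The slide's estimate: P_loss ~= N_PGs * P_diskfailure^3 / 3!
p_loss = N_PGS * P_DISK**REPLICAS / factorial(REPLICAS)
print(f"estimated P(data loss): {p_loss:.2e}")
```

Only a tiny fraction of the hundreds of millions of possible triples is actively used by PGs, which is why adding OSDs (spreading data wider) does not by itself increase the chance of RBD data loss.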
