Designing for High Performance Ceph at Scale
April 26, 2016
James Saint-Rossy - Principal Storage Engineer, Comcast
John Benton - Consulting Systems Engineer, WWT
Today’s Agenda
• Our Lab/Production Environment
• Holistic Architecture
• Strategies for Benchmarking
• Performance Bottlenecks/Lessons Learned
• Tuning Tips and Tricks
Our Typical Node Configuration

Storage Node
• 72 x 6 TB SATA 7.2K HDDs
• 3 x 1.6 TB PCIe NVMe (journals)
• 2 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (12 cores)
• 256 GB of RAM
• Dual-port 40 GbE NIC

Mon/RGW Node
• 2 x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
• 32 GB of RAM
• Dual-port 10 GbE NIC
• ...Nothing special
Lab/Production Environment Layout
Holistic Architecture

Customer Requirements
- IOPS / read-write mix / object size ...
- How much replication
- Which APIs

Cost
- Hardware cost / support cost / operational cost

Failure Domain
- Servers / racks / rows, etc.

Data Center Constraints
- Space / power / thermal

Operational Complexity
- Complex hardware configs
Holistic Architecture Cont’d

Journals
- Colocated?
- SSD vs NVMe?
Strategies for Benchmarking

Tools
- fio for block (see the sketch below)
- COSBench for object

IOPS Isn’t Everything
- 1,000 workers may give you 30% more IOPS, but at the cost of 600% higher latency

Verify Published Stats With Benchmarks
- ... Always

Verify Scale-Out
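A minimal sketch of the kind of fio sweep meant above (not taken from the talk): it runs 4K random writes against an RBD image at several queue depths and prints IOPS next to mean completion latency, so the IOPS-vs-latency trade-off is visible. The pool, image, and client names are placeholders, and it assumes fio was built with the rbd ioengine.

```python
#!/usr/bin/env python3
"""Sweep fio queue depths against an RBD image and print IOPS vs. mean
completion latency. Pool/image/client names are placeholders; requires
fio built with the rbd ioengine and reachable Ceph credentials."""
import json
import subprocess

POOL, IMAGE, CLIENT = "bench", "bench-img", "admin"   # hypothetical names

for iodepth in (1, 8, 32, 128):
    cmd = [
        "fio", "--name=randwrite", "--ioengine=rbd",
        f"--pool={POOL}", f"--rbdname={IMAGE}", f"--clientname={CLIENT}",
        "--rw=randwrite", "--bs=4k", f"--iodepth={iodepth}",
        "--runtime=60", "--time_based", "--output-format=json",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    wr = json.loads(out.stdout)["jobs"][0]["write"]
    # fio 3.x reports completion latency in ns, older releases in us
    clat = wr.get("clat_ns") or wr.get("clat", {})
    unit = "ns" if "clat_ns" in wr else "us"
    print(f"iodepth {iodepth:>3}: {wr['iops']:8.0f} IOPS, "
          f"mean clat {clat.get('mean', 0):.0f} {unit}")
```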
Performance - TCMalloc
• As cluster size increased, %SYS was increasingly taxed
• System profiling revealed up to 50% of CPU resources used by TCMalloc
• The library can be tuned to give its thread caches more memory; this was good for nearly a 50% performance increase
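The usual knob here is the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable, set in the OSD's environment (for example /etc/sysconfig/ceph or /etc/default/ceph, depending on distro). The exact value used in this deployment isn't stated, so the following is only a verification sketch that each running ceph-osd actually picked the variable up.

```python
#!/usr/bin/env python3
"""Report whether each running ceph-osd has the TCMalloc thread-cache
variable in its environment. Run as root; reads /proc/<pid>/environ."""
import os

VAR = b"TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            if f.read().strip() != "ceph-osd":
                continue
        with open(f"/proc/{pid}/environ", "rb") as f:
            env = dict(e.split(b"=", 1) for e in f.read().split(b"\0") if b"=" in e)
        print(f"pid {pid}: {VAR.decode()}={env.get(VAR, b'<not set>').decode()}")
    except OSError:
        continue  # process exited or environ not readable
```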
Modern PC Architecture
Performance - Inter-node data flow
OSD Data Workflow
"complicated situation" by bandinisonfire is licensed under CC BY-NC-SA 2.0
Performance - NUMA
• The bigger and faster the data node, the bigger the bottleneck potential
• We tuned several areas to avoid unnecessary trips across the QPI bus
• To map everything you must (see the sketch below):
  - Map CPU cores to sockets
  - Map PCIe devices to sockets
  - Map storage disks (and journals) to the associated HBA
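A rough sketch (my own illustration, not from the talk) of collecting those mappings from Linux sysfs: CPU lists per NUMA node, the NUMA node owning each PCIe function (NICs, NVMe journals, HBAs), and the PCIe path behind each disk.

```python
#!/usr/bin/env python3
"""Walk sysfs to build the three NUMA maps: CPU cores per node, the node
owning each PCIe function, and the PCIe path behind each disk."""
import glob
import os

# 1. CPU cores per NUMA node
for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "cpulist")) as f:
        print(f"{os.path.basename(node)}: cpus {f.read().strip()}")

# 2. NUMA node of every PCIe function (NICs, NVMe journals, HBAs);
#    -1 means the platform did not report a node
for dev in sorted(glob.glob("/sys/bus/pci/devices/*/numa_node")):
    with open(dev) as f:
        print(f"{dev.split('/')[-2]}: numa_node {f.read().strip()}")

# 3. Which PCIe device (HBA or NVMe controller) each disk hangs off of
for blk in sorted(glob.glob("/sys/block/*")):
    print(f"{os.path.basename(blk)}: {os.path.realpath(blk)}")
```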
NUMA - IRQs
Pin all soft IRQs for all I/O devices to their associated NUMA node
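A minimal sketch of that pinning for a single NIC, assuming MSI/MSI-X interrupts and that irqbalance has been stopped so it does not undo the affinity; the interface name is a placeholder.

```python
#!/usr/bin/env python3
"""Pin a NIC's MSI/MSI-X interrupts to the CPUs local to its NUMA node.
IFACE is a placeholder; stop irqbalance first or it will re-spread them."""
import os

IFACE = "ens1f0"                      # hypothetical 40GbE interface
devdir = f"/sys/class/net/{IFACE}/device"

with open(os.path.join(devdir, "local_cpulist")) as f:
    local_cpus = f.read().strip()     # e.g. "0-11,24-35"

for irq in sorted(os.listdir(os.path.join(devdir, "msi_irqs")), key=int):
    try:
        with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
            f.write(local_cpus)
        print(f"IRQ {irq} -> cpus {local_cpus}")
    except OSError as err:
        print(f"IRQ {irq}: {err}")    # a few IRQs cannot be re-pinned
```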
NUMA - Mount Points
Align mount points so that the OSD and its journal are on the same NUMA node
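A sketch, assuming the default /var/lib/ceph/osd mount layout, that reports which NUMA node sits behind each OSD data mount; the same numa_node_of() helper can be pointed at a journal device to check that both land on the same node.

```python
#!/usr/bin/env python3
"""Report the NUMA node behind each Ceph OSD data mount (default
/var/lib/ceph/osd layout) so data/journal placement can be compared."""
import os
import re

def numa_node_of(disk):
    """Walk up the sysfs tree from /sys/block/<disk> to find numa_node."""
    path = os.path.realpath(f"/sys/block/{disk}")
    while path != "/sys":
        candidate = os.path.join(path, "numa_node")
        if os.path.exists(candidate):
            with open(candidate) as f:
                return f.read().strip()
        path = os.path.dirname(path)
    return "unknown"

with open("/proc/mounts") as f:
    for line in f:
        dev, mountpoint = line.split()[:2]
        if not mountpoint.startswith("/var/lib/ceph/osd/"):
            continue
        disk = re.sub(r"p?\d+$", "", os.path.basename(dev))  # sdb1 -> sdb
        if not os.path.exists(f"/sys/block/{disk}"):
            disk = os.path.basename(dev)   # device was not a partition
        print(f"{mountpoint}: {dev} on NUMA node {numa_node_of(disk)}")
```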
NUMA - OSD Processes
Pin OSD processes to the NUMA node associated with the storage they control
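A minimal sketch of pinning one already-running OSD with Python's os.sched_setaffinity; the PID and NUMA node are placeholders, and in practice the affinity would be set at service start (for example with numactl or a systemd CPUAffinity= setting) rather than after the fact.

```python
#!/usr/bin/env python3
"""Pin one running ceph-osd to the CPUs of a single NUMA node.
OSD_PID and NODE are placeholders; in production set the affinity at
service start (numactl or systemd CPUAffinity=) instead."""
import os

OSD_PID = 12345   # hypothetical ceph-osd pid
NODE = 0          # NUMA node that owns this OSD's disk and journal

def parse_cpulist(text):
    """Expand a cpulist such as '0-11,24-35' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

with open(f"/sys/devices/system/node/node{NODE}/cpulist") as f:
    cpus = parse_cpulist(f.read())

os.sched_setaffinity(OSD_PID, cpus)   # needs root or same user as the OSD
print(f"pinned pid {OSD_PID} to node {NODE} cpus {sorted(cpus)}")
```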
Performance - General Tips
• Use the latest vendor drivers
  - We have seen 30% improvements over stock drivers
• OS tuning focused on increasing threads, file handles, etc.
• Jumbo frames help, particularly on the cluster network
• Watch for flow control issues with 40 GbE network adapters
• Scan for failing (but perhaps not completely failed) disks (a sketch follows)
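For that last bullet, a sketch that flags drives showing early SMART symptoms. It assumes smartmontools is installed and plain SATA devices (drives behind some HBAs need extra smartctl -d options), and the watched attribute names vary by vendor.

```python
#!/usr/bin/env python3
"""Flag SATA drives showing early SMART symptoms (reallocated or pending
sectors). Assumes smartmontools; drives behind some HBAs need -d flags."""
import glob
import subprocess

WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

for dev in sorted(glob.glob("/dev/sd[a-z]*")):
    if dev[-1].isdigit():
        continue                     # skip partitions, keep whole disks
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCH and fields[9].isdigit():
            if int(fields[9]) > 0:
                print(f"{dev}: {fields[1]} raw value {fields[9]}")
```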
"Question" by alphageek is licensed under CC BY-NC-SA 2.0
Performance - Mons
• Mons are generally a glorified TFTP server, and you can get away with 1+2 for redundancy
• That is, until they aren’t...
• In certain situations, such as cluster rebalancing or deleting a pool with a lot of PGs, a single CPU core on *ALL* mons can become jammed up; they start evicting each other and mayhem ensues
• How to fix this: