Designing for High Performance Ceph at Scale
April 26, 2016
James Saint-Rossy - Principal Storage Engineer, Comcast
John Benton - Consulting Systems Engineer, WWT
Today’s Agenda
• Our Lab/Production Environment
• Holistic Architecture
• Strategies for Benchmarking
• Performance Bottlenecks/Lessons Learned
• Tuning Tips and Tricks
Our Typical Node Configuration

Storage Node
• 72 x 6 TB SATA 7.2K HDDs
• 3 x 1.6 TB PCIe NVMe (journals)
• 2 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (12 cores)
• 256 GB of RAM
• Dual-port 40 GbE NIC

Mon/RGW Node
• 2 x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
• 32 GB of RAM
• Dual-port 10 GbE NIC
• ...Nothing special
Lab/Production Environment Layout
Holistic Architecture

Customer Requirements
- IOPS / read-write mix / object size ...
- How much replication
- Which APIs

Cost
- Hardware cost / support cost / operational cost

Failure Domain
- Servers / racks / rows, etc.

Data Center Constraints
- Space / power / thermal

Operational Complexity
- Complex hardware configs
Holistic Architecture Cont’d

Journals
- Colocated?
- SSD vs NVMe?
Strategies for Benchmarking

Tools
- fio for block (see the sketch below)
- COSBench for object

IOPS Isn’t Everything
- 1,000 workers may give you 30% more IOPS, but at the cost of 600% higher latency

Verify Published Stats With Benchmarks
- ... Always

Verify Scale-Out
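A minimal sketch of the kind of fio sweep meant above (not taken from the talk): it runs 4K random writes against an RBD image at several queue depths and prints IOPS next to mean completion latency, so the IOPS-vs-latency trade-off is visible. The pool, image, and client names are placeholders, and it assumes fio was built with the rbd ioengine.

```python
#!/usr/bin/env python3
"""Sweep fio queue depths against an RBD image and print IOPS vs. mean
completion latency. Pool/image/client names are placeholders; requires
fio built with the rbd ioengine and reachable Ceph credentials."""
import json
import subprocess

POOL, IMAGE, CLIENT = "bench", "bench-img", "admin"   # hypothetical names

for iodepth in (1, 8, 32, 128):
    cmd = [
        "fio", "--name=randwrite", "--ioengine=rbd",
        f"--pool={POOL}", f"--rbdname={IMAGE}", f"--clientname={CLIENT}",
        "--rw=randwrite", "--bs=4k", f"--iodepth={iodepth}",
        "--runtime=60", "--time_based", "--output-format=json",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    wr = json.loads(out.stdout)["jobs"][0]["write"]
    # fio 3.x reports completion latency in ns, older releases in us
    clat = wr.get("clat_ns") or wr.get("clat", {})
    unit = "ns" if "clat_ns" in wr else "us"
    print(f"iodepth {iodepth:>3}: {wr['iops']:8.0f} IOPS, "
          f"mean clat {clat.get('mean', 0):.0f} {unit}")
```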
Performance - TCMalloc
• As cluster size increased, %SYS was increasingly taxed
• System profiling revealed up to 50% of CPU resources used by TCMalloc
• The library can be tuned to give its thread caches more memory; this was good for nearly a 50% performance increase
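The usual knob here is the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable, set in the OSD's environment (for example /etc/sysconfig/ceph or /etc/default/ceph, depending on distro). The exact value used in this deployment isn't stated, so the following is only a verification sketch that each running ceph-osd actually picked the variable up.

```python
#!/usr/bin/env python3
"""Report whether each running ceph-osd has the TCMalloc thread-cache
variable in its environment. Run as root; reads /proc/<pid>/environ."""
import os

VAR = b"TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            if f.read().strip() != "ceph-osd":
                continue
        with open(f"/proc/{pid}/environ", "rb") as f:
            env = dict(e.split(b"=", 1) for e in f.read().split(b"\0") if b"=" in e)
        print(f"pid {pid}: {VAR.decode()}={env.get(VAR, b'<not set>').decode()}")
    except OSError:
        continue  # process exited or environ not readable
```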
Modern PC Architecture
Performance - Inter-node data flow
OSD Data Workflow
"complicated situation" by bandinisonfire is licensed under CC BY-NC-SA 2.0
Performance - NUMA
• The bigger and faster the data node, the bigger the bottleneck potential
• We tuned several areas to avoid unnecessary trips across the QPI bus
• To map everything you must (see the sketch below):
  - Map CPU cores to sockets
  - Map PCIe devices to sockets
  - Map storage disks (and journals) to the associated HBA
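A rough sketch (my own illustration, not from the talk) of collecting those mappings from Linux sysfs: CPU lists per NUMA node, the NUMA node owning each PCIe function (NICs, NVMe journals, HBAs), and the PCIe path behind each disk.

```python
#!/usr/bin/env python3
"""Walk sysfs to build the three NUMA maps: CPU cores per node, the node
owning each PCIe function, and the PCIe path behind each disk."""
import glob
import os

# 1. CPU cores per NUMA node
for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "cpulist")) as f:
        print(f"{os.path.basename(node)}: cpus {f.read().strip()}")

# 2. NUMA node of every PCIe function (NICs, NVMe journals, HBAs);
#    -1 means the platform did not report a node
for dev in sorted(glob.glob("/sys/bus/pci/devices/*/numa_node")):
    with open(dev) as f:
        print(f"{dev.split('/')[-2]}: numa_node {f.read().strip()}")

# 3. Which PCIe device (HBA or NVMe controller) each disk hangs off of
for blk in sorted(glob.glob("/sys/block/*")):
    print(f"{os.path.basename(blk)}: {os.path.realpath(blk)}")
```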
NUMA - IRQs
Pin all soft IRQs for all I/O devices to their associated NUMA node
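A minimal sketch of that pinning for a single NIC, assuming MSI/MSI-X interrupts and that irqbalance has been stopped so it does not undo the affinity; the interface name is a placeholder.

```python
#!/usr/bin/env python3
"""Pin a NIC's MSI/MSI-X interrupts to the CPUs local to its NUMA node.
IFACE is a placeholder; stop irqbalance first or it will re-spread them."""
import os

IFACE = "ens1f0"                      # hypothetical 40GbE interface
devdir = f"/sys/class/net/{IFACE}/device"

with open(os.path.join(devdir, "local_cpulist")) as f:
    local_cpus = f.read().strip()     # e.g. "0-11,24-35"

for irq in sorted(os.listdir(os.path.join(devdir, "msi_irqs")), key=int):
    try:
        with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
            f.write(local_cpus)
        print(f"IRQ {irq} -> cpus {local_cpus}")
    except OSError as err:
        print(f"IRQ {irq}: {err}")    # a few IRQs cannot be re-pinned
```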
NUMA - Mount Points
Align mount points so that the OSD and its journal are on the same NUMA node
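A sketch, assuming the default /var/lib/ceph/osd mount layout, that reports which NUMA node sits behind each OSD data mount; the same numa_node_of() helper can be pointed at a journal device to check that both land on the same node.

```python
#!/usr/bin/env python3
"""Report the NUMA node behind each Ceph OSD data mount (default
/var/lib/ceph/osd layout) so data/journal placement can be compared."""
import os
import re

def numa_node_of(disk):
    """Walk up the sysfs tree from /sys/block/<disk> to find numa_node."""
    path = os.path.realpath(f"/sys/block/{disk}")
    while path != "/sys":
        candidate = os.path.join(path, "numa_node")
        if os.path.exists(candidate):
            with open(candidate) as f:
                return f.read().strip()
        path = os.path.dirname(path)
    return "unknown"

with open("/proc/mounts") as f:
    for line in f:
        dev, mountpoint = line.split()[:2]
        if not mountpoint.startswith("/var/lib/ceph/osd/"):
            continue
        disk = re.sub(r"p?\d+$", "", os.path.basename(dev))  # sdb1 -> sdb
        if not os.path.exists(f"/sys/block/{disk}"):
            disk = os.path.basename(dev)   # device was not a partition
        print(f"{mountpoint}: {dev} on NUMA node {numa_node_of(disk)}")
```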
NUMA - OSD Processes
Pin OSD processes to the NUMA node associated with the storage they control
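A minimal sketch of pinning one already-running OSD with Python's os.sched_setaffinity; the PID and NUMA node are placeholders, and in practice the affinity would be set at service start (for example with numactl or a systemd CPUAffinity= setting) rather than after the fact.

```python
#!/usr/bin/env python3
"""Pin one running ceph-osd to the CPUs of a single NUMA node.
OSD_PID and NODE are placeholders; in production set the affinity at
service start (numactl or systemd CPUAffinity=) instead."""
import os

OSD_PID = 12345   # hypothetical ceph-osd pid
NODE = 0          # NUMA node that owns this OSD's disk and journal

def parse_cpulist(text):
    """Expand a cpulist such as '0-11,24-35' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

with open(f"/sys/devices/system/node/node{NODE}/cpulist") as f:
    cpus = parse_cpulist(f.read())

os.sched_setaffinity(OSD_PID, cpus)   # needs root or same user as the OSD
print(f"pinned pid {OSD_PID} to node {NODE} cpus {sorted(cpus)}")
```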
Performance - General Tips
• Use the latest vendor drivers
  - We have seen 30% improvements over stock drivers
• OS tuning focused on increasing threads, file handles, etc.
• Jumbo frames help, particularly on the cluster network
• Watch for flow control issues with 40 GbE network adapters
• Scan for failing (but perhaps not completely failed) disks (a sketch follows)
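For that last bullet, a sketch that flags drives showing early SMART symptoms. It assumes smartmontools is installed and plain SATA devices (drives behind some HBAs need extra smartctl -d options), and the watched attribute names vary by vendor.

```python
#!/usr/bin/env python3
"""Flag SATA drives showing early SMART symptoms (reallocated or pending
sectors). Assumes smartmontools; drives behind some HBAs need -d flags."""
import glob
import subprocess

WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

for dev in sorted(glob.glob("/dev/sd[a-z]*")):
    if dev[-1].isdigit():
        continue                     # skip partitions, keep whole disks
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCH and fields[9].isdigit():
            if int(fields[9]) > 0:
                print(f"{dev}: {fields[1]} raw value {fields[9]}")
```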
"Question" by alphageek is licensed under CC BY-NC-SA 2.0
Performance - Mons
• Mons are generally a glorified TFTP server, and you can get away with 1+2 for redundancy
• That is, until they aren’t...
• In certain situations, such as cluster rebalancing or deleting a pool with a lot of PGs, a single CPU core on *ALL* mons can become jammed up; they start evicting each other and mayhem ensues
• How to fix this: