
Designing for High Performance Ceph at Scale


Page 1: Designing for High Performance Ceph at Scale

Designing for High Performance Ceph at Scale

April 26, 2016

James Saint-Rossy - Principal Storage Engineer, Comcast
John Benton - Consulting Systems Engineer, WWT

Page 2: Designing for High Performance Ceph at Scale

Today’s Agenda

• Our Lab/Production Environment
• Holistic Architecture
• Strategies for Benchmarking
• Performance Bottlenecks/Lessons Learned
• Tuning Tips and Tricks


Page 3: Designing for High Performance Ceph at Scale

Our Typical Node Configuration

Storage Node
• 72 x 6TB SATA 7.2K HDDs
• 3 x 1.6TB PCIe NVMe (journals)
• 2 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (12 cores)
• 256 GB of RAM
• Dual-port 40GbE NIC

Mon/RGW Node
• 2 x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
• 32 GB of RAM
• Dual-port 10GbE NIC
• …Nothing special


Page 4: Designing for High Performance Ceph at Scale

Lab/Production Environment Layout


Page 5: Designing for High Performance Ceph at Scale

Holistic Architecture

Customer Requirements
- IOPS / Read-Write Mix / Object Size …
- How Much Replication
- Which APIs

Cost
- HW Cost / Support Cost / Operational Cost

Failure Domain
- Servers / Racks / Rows, etc.

Data Center Constraints
- Space / Power / Thermal

Operational Complexity
- Complex Hardware Configs


Page 6: Designing for High Performance Ceph at Scale

Holistic Architecture Cont’d

Journals
- Colocated?
- SSD vs. NVMe?


Page 7: Designing for High Performance Ceph at Scale

Strategies for Benchmarking

Tools
- fio for block
- COSBench for object

IOPS Isn’t Everything
- 1,000 workers may give you 30% more IOPS, but at the cost of 600% higher latency (see the sketch after this list)

Verify Published Stats With Benchmarks
- … Always

Verify Scale-Out
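
To make the IOPS-vs-latency trade-off concrete, here is a minimal queue-depth sweep sketch. It assumes fio was built with RBD support and that a scratch image named `bench` exists in pool `rbd` (both hypothetical names); the JSON field names match recent fio releases and may differ on older ones.

```python
#!/usr/bin/env python3
"""Sweep fio queue depths to expose the IOPS-vs-latency trade-off."""
import json
import subprocess

for iodepth in (1, 4, 16, 64, 256):
    out = subprocess.run(
        ["fio", "--name=sweep", "--ioengine=rbd", "--pool=rbd",
         "--rbdname=bench", "--clientname=admin",        # hypothetical names
         "--rw=randread", "--bs=4k", f"--iodepth={iodepth}",
         "--runtime=60", "--time_based", "--output-format=json"],
        check=True, capture_output=True, text=True).stdout
    read = json.loads(out)["jobs"][0]["read"]
    lat_ms = read["clat_ns"]["mean"] / 1e6   # older fio reports clat in usec
    print(f"iodepth={iodepth:4d}  iops={read['iops']:10.0f}  mean clat={lat_ms:.2f} ms")
```

Plotting IOPS against mean latency across the sweep makes the knee of the curve obvious, which is usually a better sizing target than the peak-IOPS number alone.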


Page 8: Designing for High Performance Ceph at Scale

Performance - TCMalloc
• As cluster size increased, %SYS was increasingly taxed
• System profiling revealed up to 50% of CPU resources used by TCMalloc
• The library can be tuned to use more memory for its thread caches; this was good for nearly a 50% increase (see the sketch below)
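
The knob in question is TCMalloc's own TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable (the default thread-cache budget is 32 MB). How it gets set varies by distro and Ceph release, so a quick way to confirm it actually reached the daemons is to read it back from procfs. A minimal sketch, run as root:

```python
#!/usr/bin/env python3
"""Read back the TCMalloc thread-cache setting from running ceph-osd daemons."""
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            if f.read().strip() != "ceph-osd":
                continue
        with open(f"/proc/{pid}/environ", "rb") as f:
            env = dict(kv.split(b"=", 1)
                       for kv in f.read().split(b"\0") if b"=" in kv)
    except OSError:
        continue  # process exited, or not running as root
    cache = env.get(b"TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES")
    print(f"pid {pid}: thread cache = "
          f"{cache.decode() if cache else '<unset: 32MB default>'}")
```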


Page 9: Designing for High Performance Ceph at Scale

Modern PC Architecture


Page 10: Designing for High Performance Ceph at Scale

Performance - Inter-node data flow


Page 11: Designing for High Performance Ceph at Scale

OSD Data Workflow


"complicated situation" by bandinisonfire is licensed under CC BY-NC-SA 2.0


Page 12: Designing for High Performance Ceph at Scale

Performance - NUMA
• The bigger and faster the data node, the bigger the bottleneck potential
• We tuned several areas to avoid unnecessary trips across the QPI bus
• To map everything you must (see the sketch below):
  • Map CPU cores to sockets
  • Map PCIe devices to sockets
  • Map storage disks (and journals) to the associated HBA
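
All three mappings can be read straight out of sysfs. A minimal inventory sketch (device names are whatever the host exposes; some platforms report -1 for devices with no NUMA locality):

```python
#!/usr/bin/env python3
"""Inventory CPU, PCIe, and block-device NUMA locality from sysfs."""
import glob
import os

# CPU core -> physical socket
for cpu in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
    with open(f"{cpu}/topology/physical_package_id") as f:
        print(f"{os.path.basename(cpu)} -> socket {f.read().strip()}")

# PCIe device (NICs, HBAs, NVMe) -> NUMA node
for path in glob.glob("/sys/bus/pci/devices/*/numa_node"):
    with open(path) as f:
        print(f"{path.split('/')[-2]} -> node {f.read().strip()}")

# Block device -> NUMA node, by walking up to the device's PCI ancestor
for blk in sorted(glob.glob("/sys/block/*")):
    path = os.path.realpath(blk)
    while path.startswith("/sys/"):
        node_file = os.path.join(path, "numa_node")
        if os.path.exists(node_file):
            with open(node_file) as f:
                print(f"{os.path.basename(blk)} -> node {f.read().strip()}")
            break
        path = os.path.dirname(path)
```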


Page 13: Designing for High Performance Ceph at Scale

NUMA - IRQs

Pin all soft IRQs for all IO devices to their associated NUMA node
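
A minimal sketch of the pinning step, run as root, assuming a NIC named eth0 (an example name) and irqbalance disabled so it does not undo the affinity:

```python
#!/usr/bin/env python3
"""Pin a device's MSI-X IRQs to the CPUs of its local NUMA node."""
import glob
import os

dev = "eth0"  # example NIC name

# NUMA node local to the device, then that node's CPU list
with open(f"/sys/class/net/{dev}/device/numa_node") as f:
    node = int(f.read())
with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
    cpulist = f.read().strip()               # e.g. "0-11,24-35"

# Every MSI-X vector the device owns
for irq_path in glob.glob(f"/sys/class/net/{dev}/device/msi_irqs/*"):
    irq = os.path.basename(irq_path)
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(cpulist)
    print(f"IRQ {irq} -> CPUs {cpulist}")
```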


Page 14: Designing for High Performance Ceph at Scale

NUMA - Mount Points

Align mount points so that the OSD and journal are on the same NUMA node
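
A quick way to audit this is to resolve each OSD's data and journal devices to their NUMA nodes and compare. A minimal sketch; the OSD_DEVICES table is a hypothetical mapping you would fill in from your own layout:

```python
#!/usr/bin/env python3
"""Check that each OSD's data disk and its journal share a NUMA node."""
import os

OSD_DEVICES = {0: ("sdb", "nvme0n1"), 1: ("sdc", "nvme1n1")}  # hypothetical layout

def numa_node(dev):
    """Walk up from /sys/block/<dev> to the first ancestor with numa_node."""
    path = os.path.realpath(f"/sys/block/{dev}")
    while path.startswith("/sys/"):
        node_file = os.path.join(path, "numa_node")
        if os.path.exists(node_file):
            with open(node_file) as f:
                return int(f.read())
        path = os.path.dirname(path)
    return None

for osd, (data, journal) in sorted(OSD_DEVICES.items()):
    d, j = numa_node(data), numa_node(journal)
    status = "aligned" if d == j else "MISALIGNED"
    print(f"osd.{osd}: {data}=node{d} {journal}=node{j} -> {status}")
```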


Page 15: Designing for High Performance Ceph at Scale

NUMA - OSD Processes

Pin OSD processes to the NUMA node associated with the storage they control
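
A minimal sketch of the pinning, run as root. The OSD_NODE table (OSD id to NUMA node) is a hypothetical mapping you would derive from the disk/HBA inventory above; it uses the Linux-only os.sched_setaffinity:

```python
#!/usr/bin/env python3
"""Pin running ceph-osd processes to their storage's NUMA node."""
import os

OSD_NODE = {0: 0, 1: 0, 2: 1, 3: 1}  # hypothetical: osd id -> NUMA node

def node_cpus(node):
    """Expand /sys/.../nodeN/cpulist (e.g. '0-5,12-17') into a CPU set."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = set()
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/cmdline") as f:
            cmd = f.read().split("\0")
    except OSError:
        continue
    if not cmd or not cmd[0].endswith("ceph-osd"):
        continue
    for flag in ("--id", "-i"):               # ceph-osd is started with its id
        if flag in cmd and cmd.index(flag) + 1 < len(cmd):
            osd_id = int(cmd[cmd.index(flag) + 1])
            if osd_id in OSD_NODE:
                os.sched_setaffinity(int(pid), node_cpus(OSD_NODE[osd_id]))
                print(f"pinned osd.{osd_id} (pid {pid}) to node {OSD_NODE[osd_id]}")
```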


Page 16: Designing for High Performance Ceph at Scale

Performance - General Tips
• Use latest vendor drivers
  - We have seen 30% improvements over stock drivers
• OS tuning focused on increasing threads, file handles, etc.
• Jumbo frames help, particularly on the cluster network
• Watch for flow control issues with 40GbE network adapters
• Scan for failing (but perhaps not completely failed) disks (see the sketch below)
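
For the last point, a minimal scan sketch using smartmontools (run as root). It flags drives whose overall health check fails or that have started reallocating sectors, often the "failing but not yet failed" state that quietly drags down latency:

```python
#!/usr/bin/env python3
"""Flag SATA drives that are failing SMART health or reallocating sectors."""
import glob
import re
import subprocess

for dev in sorted(glob.glob("/dev/sd[a-z]*")):
    if dev[-1].isdigit():
        continue  # whole disks only, skip partitions
    out = subprocess.run(["smartctl", "-H", "-A", dev],
                         capture_output=True, text=True).stdout
    failed = "PASSED" not in out
    m = re.search(r"Reallocated_Sector_Ct.*?(\d+)\s*$", out, re.MULTILINE)
    realloc = int(m.group(1)) if m else 0
    if failed or realloc:
        print(f"{dev}: health={'FAIL' if failed else 'ok'}, "
              f"reallocated sectors={realloc}")
```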


Page 17: Designing for High Performance Ceph at Scale


"Question" by alphageek is licensed under CC BY-NC-SA 2.0

Page 18: Designing for High Performance Ceph at Scale


Page 19: Designing for High Performance Ceph at Scale

Performance - Mons
• Mons are generally a glorified TFTP server, and you can get away with 1+2 for redundancy
• That is, until they aren’t…
• In certain situations, like cluster rebalancing or deleting a pool with a lot of PGs, a single CPU on *ALL* mons will become jammed up. They start evicting each other and mayhem ensues.

• How to fix this:
