
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture


Page 1: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Ceph All-Flash Array Design Based on NUMA Architecture

QCT (Quanta Cloud Technology)
Marco Huang, Technical Manager

Becky Lin, Program Manager

Page 2: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Agenda

• All-flash Ceph and Use Cases
• QCT QxStor All-flash Ceph for IOPS
• QCT Lab Environment Overview & Detailed Architecture
• Importance of NUMA and Proof Points


Page 3: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

QCT Powers Most of Cloud Services
Global Tier 1 Hyperscale Datacenters, Telcos and Enterprises


• QCT (Quanta Cloud Technology) is a subsidiary of Quanta Computer

• Quanta Computer is a Fortune Global 500 company with over $32B in revenue

Page 4: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Why All-flash Storage?

• Falling flash prices: Flash prices fell as much as 75% over the 18 months leading up to mid-2016 and the trend continues. “TechRepublic: 10 storage trends to watch in 2016”

• Flash is 10x cheaper than DRAM: with persistence and high capacity. “NetApp”

• Flash is 100x cheaper than disk: pennies per IOPS vs. dollars per IOPS. “NetApp”

• Flash is 1000x faster than disk: latency drops from milliseconds to microseconds. “NetApp”


• Flash performance advantage: HDDs have an advantage in $/GB, while flash has an advantage in $/IOPS. “TechTarget: Hybrid storage arrays vs. all-flash arrays: A little flash or a lot?”

• NVMe-based storage trend: 60% of enterprise storage appliances will have NVMe bays by 2020. “G2M Research”

Require sub-millisecond latency
Need performance-optimized storage for mission-critical apps

Flash capacity gains while the price drops


Page 5: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

All-flash Ceph Use Cases


Page 6: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

QCT QxStor Red Hat Ceph Storage Edition
Optimized for workloads

Throughput Optimized: QxStor RCT-200 (D51PH-1ULH) and QxStor RCT-400 (T21P-4U)
• RCT-200: densest 1U Ceph building block, smaller failure domain; 3x SSD S3710 journal, 12x 7.2krpm HDD; obtain best throughput & density at once
• RCT-400: scale at high scale, 700TB; 2x 2x NVMe P3700 journal, 2x 35x HDD
• Use case: block or object storage, video, audio, image, streaming media, big data
• 3x replication

Cost/Capacity Optimized: QxStor RCC-400 (T21P-4U)
• Maximize storage capacity: highest density, 560TB* raw capacity per chassis; 2x 35x HDD
• Use case: object storage, archive, backup, enterprise Dropbox
• Erasure coding
* Optional model, one MB per chassis, can support 620TB raw capacity

IOPS Optimized: QxStor RCI-300 (D51BP-1U)
• All-flash design, lowest latency; 4x P3520 2TB or 4x P3700 1.6TB
• Use case: database, HPC, mission-critical applications
• 2x replication

Page 7: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

QCT QxStor RCI-300
All-Flash Ceph Design for I/O-Intensive Workloads

SKU 1: All-flash Ceph, the best IOPS SKU
• Ceph Storage Server: D51BP-1U
• CPU: 2x E5-2995 v4 or above
• RAM: 128GB
• NVMe SSD: 4x P3700 1.6TB
• NIC: 10GbE dual port or 40GbE dual port

SKU 2: All-flash Ceph, IOPS/capacity balanced SKU (best TCO as of today)
• Ceph Storage Server: D51BP-1U
• CPU: 2x E5-2980 v4 or higher core count
• RAM: 128GB
• NVMe SSD: 4x P3520 2TB
• NIC: 10GbE dual port or 40GbE dual port

NUMA Balanced Ceph Hardware

Highest IOPS & Lowest Latency

Optimized Ceph & HW Integration for IOPS intensive workloads


Page 8: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture


NVMe: Best-in-Class IOPS, Lower/Consistent Latency

Lowest Latency of Standard Storage Interfaces

[Chart: IOPS, 4K random workloads (100% read, 70% read, 0% read), PCIe/NVMe vs. SAS 12Gb/s]

3x better IOPS vs. SAS 12Gb/s. For the same number of CPU cycles, NVMe delivers over 2x the IOPS of SAS.

Gen1 NVMe has 2 to 3x better Latency Consistency vs SAS

Test and System Configurations: PCI Express* (PCIe*)/NVM Express* (NVMe) Measurements made on Intel® Core™ i7-3770S system @ 3.1GHz and 4GB Mem running Windows* Server 2012 Standard O/S, Intel PCIe/NVMe SSDs, data collected by IOmeter* tool. SAS Measurements from HGST Ultrastar* SSD800M/1000M (SAS), SATA S3700 Series. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Source: Intel Internal Testing.

Page 9: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

QCT Lab Environment Overview

[Diagram: 10 client systems (Client 1 ... Client 10) access RBD block storage through LIBRADOS/RADOS on a storage cluster of 5 Ceph nodes (Ceph 1 ... Ceph 5) plus monitors; clients reach the cluster over a 10GbE public network, and the Ceph nodes use a separate 10GbE cluster network]

Page 10: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Detailed System Architecture in QCT Lab

• 5-node all-NVMe Ceph cluster: QuantaGrid D51BP-1U, dual-socket Intel Xeon E5 (88 HT), 128GB DDR4 per node
– RHEL 7.3 (kernel 3.10), Red Hat Ceph Storage 2.1
– 4x NVMe drives per node, 16 Ceph OSDs per node (4 OSD partitions per NVMe)
– 20x 2TB P3520 SSDs and 80 OSDs in total
– 2x replication, 19TB effective capacity, tests run at 82% cluster fill level
• 10x client systems: dual-socket Intel Xeon E5 (88 HT), 128GB DDR4
– Ceph RBD clients running Docker containers
– DB containers (krbd, Percona DB Server): 16 vCPUs, 32GB RAM, 200GB RBD volume, 100GB MySQL dataset, InnoDB buffer cache 25GB (25%)
– Sysbench containers: 16 vCPUs, 32GB RAM, FIO 2.8, Sysbench 0.5
• Networking: 10GbE public network (clients to cluster), 10GbE cluster network (OSD replication)

Page 11: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Benchmark Methodology

• I/O baseline (raw disk, FIO): determine maximum server I/O backplane bandwidth
• Network baseline (NIC, iPerf): ensure consistent network bandwidth between all nodes
• Bare-metal RBD baseline (librbd, FIO via CBT): use the FIO RBD engine to test performance through librbd
• Docker container OLTP baseline (Percona DB + Sysbench, Sysbench OLTP): establish the number of workload-driver VMs desired per client

Benchmark criteria:
1. Default: stock ceph.conf
2. Software-level tuning: tuned ceph.conf
3. Software + NUMA CPU pinning: tuned ceph.conf + NUMA CPU pinning
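For reference, a single librbd data point of the kind CBT drives in the bare-metal RBD baseline can be reproduced from the command line roughly as below. The pool, image and client names are illustrative assumptions; the actual sweeps were generated from the CBT YAML file in the appendix.

```bash
# Hypothetical stand-alone equivalent of one CBT librbdfio data point:
# 4K random reads against an existing RBD image via the FIO RBD engine.
# Pool/image/client names are placeholders, not taken from the deck.
fio --name=rbd-4k-randread \
    --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
    --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
    --direct=1 --time_based --runtime=300 --ramp_time=100 \
    --norandommap --log_avg_msec=250
```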

Page 12: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Configuring All-flash Ceph
System Tuning for Low-latency Workloads

• Use faster media for journals and metadata
• Use recent Linux kernels
– blk-mq support packs big performance gains with NVMe media
– optimizations for non-rotational media
• Use tuned where available
– adaptive latency performance tuning [2]
• Virtual memory, network and storage tweaks
– use commonly recommended VM and network settings [1-4]
– enable rq_affinity and read-ahead for NVMe devices (sketched below)
• BIOS and CPU performance governor settings
– disable C-states and enable Turbo Boost
– use the “performance” CPU governor

[1] https://wiki.mikejung.biz/Ubuntu_Performance_Tuning
[2] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/tuned-adm.html
[3] http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
[4] https://www.suse.com/documentation/ses-4/singlehtml/book_storage_admin/book_storage_admin.html
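The storage and CPU items above can be applied roughly as follows. The specific values (rq_affinity mode, read-ahead size, tuned profile) are assumptions for this sketch rather than settings quoted from the deck, and the C-state/Turbo Boost policy itself is configured in the BIOS.

```bash
# Illustrative low-latency host tuning; values are assumptions, not the deck's exact settings.
# Block-layer settings for each NVMe namespace:
for dev in /sys/block/nvme*n1; do
  echo 2   > "$dev/queue/rq_affinity"     # complete I/Os on the submitting CPU
  echo 128 > "$dev/queue/read_ahead_kb"   # modest read-ahead for random workloads
done

# CPU side: latency-oriented tuned profile and the "performance" governor.
tuned-adm profile latency-performance
cpupower frequency-set -g performance
```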

Page 13: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Configuring All-flash Ceph
Ceph Tunables

• objecter_inflight_ops: default 1024, tuned 102400. The objecter is responsible for sending requests to the OSDs; objecter_inflight_ops and objecter_inflight_op_bytes tell the objecter to throttle outgoing ops according to a budget (values based on experiments in the Dumpling timeframe).
• objecter_inflight_op_bytes: default 104857600, tuned 1048576000. Byte-based counterpart of the throttle above.
• ms_dispatch_throttle_bytes: default 104857600, tuned 1048576000. Throttles the dispatched message size for the simple messenger (values based on experiments in the Dumpling timeframe).
• filestore_queue_max_ops: default 50, tuned 5000. filestore_queue_max_ops and filestore_queue_max_bytes throttle in-flight ops for the filestore.
• filestore_queue_max_bytes: default 104857600, tuned 1048576000. These throttles are checked before sending ops to the journal, so if the filestore does not get enough budget for the current op, the OSD op thread will be blocked.
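The persistent home for these values is ceph.conf (the full tuned file is reproduced in the appendix). For quick experiments they can also be injected into running OSDs; the command below is a hedged sketch of that workflow, not a step shown in the deck.

```bash
# Illustrative runtime injection of two tuned values from the list above;
# updating ceph.conf and restarting the OSDs is the persistent approach.
ceph tell osd.* injectargs '--filestore_queue_max_ops=5000 --filestore_queue_max_bytes=1048576000'
```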


Page 14: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Configuring All-flash Ceph
Ceph Tunables

• filestore_max_sync_interval: default 5, tuned 10. Controls the interval (in seconds) at which the sync thread flushes data from memory to disk. By default the filestore writes data to the page cache and the sync thread is responsible for flushing it to disk, after which journal entries can be trimmed. Note that a large filestore_max_sync_interval can cause performance spikes.
• filestore_op_threads: default 2, tuned 6. Controls the number of filesystem operation threads that execute in parallel. If the storage backend is fast enough and has enough queues to support parallel operations, it is recommended to increase this parameter, given there is enough CPU available.
• osd_op_threads: default 2, tuned 32. Controls the number of threads servicing Ceph OSD Daemon operations; setting this to 0 disables multi-threading. Increasing this number may increase the request processing rate if the storage backend is fast enough and there is enough CPU available.


Page 15: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Configuring All-flash Ceph
Ceph Tunables

• journal_queue_max_ops: default 300, tuned 3000. journal_queue_max_ops and journal_queue_max_bytes throttle in-flight ops for the journal; if the journal does not get enough budget for the current op, it blocks the OSD op thread.
• journal_queue_max_bytes: default 33554432, tuned 1048576000.
• journal_max_write_entries: default 100, tuned 1000. journal_max_write_entries and journal_max_write_bytes throttle the ops or bytes for every journal write; tweaking these two parameters may be helpful for small writes.
• journal_max_write_bytes: default 10485760, tuned 1048576000.


Page 16: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Multi-partitioned NVMe SSDs

• Leverage the latest Intel NVMe technology to reach high performance and bigger capacity at lower $/GB
– Intel DC P3520 2TB raw performance: 375K read IOPS, 26K write IOPS
• By using multiple OSD partitions, Ceph performance scales linearly
– Reduces lock contention within a single OSD process
– Lower latency at all queue depths, with the biggest impact on random reads
• Introduces the concept of multiple OSDs on the same physical device
– Conceptually similar CRUSH map data placement rules to managing disks in an enclosure

[Diagram: one NVMe SSD partitioned into four partitions, each backing its own OSD (OSD 1-4)]
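A minimal sketch of carving one NVMe device into four OSD data partitions follows. The device name and the use of four equal GPT partitions are assumptions for illustration; each OSD's filestore journal can then live on its own small partition or as a file inside the data partition (the appendix ceph.conf sets osd journal size = 10240), and the actual cluster was deployed with Ansible per the appendix.

```bash
# Illustrative only: split one NVMe SSD into 4 equal GPT partitions, one per OSD.
# The device name is an assumption, not taken from the deck.
DEV=/dev/nvme0n1
parted -s "$DEV" mklabel gpt
for i in 0 1 2 3; do
  parted -s "$DEV" mkpart "osd-data-$i" "$((i * 25))%" "$(((i + 1) * 25))%"
done
parted -s "$DEV" print   # verify the four partitions before creating OSDs on them
```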

Page 17: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Multi-partitioned NVMe SSDs

[Chart: Multiple OSDs per device comparison, 4K random read latency vs. IOPS, 5 nodes with 20/40/80 OSDs, comparing 1, 2 and 4 OSDs per NVMe]

[Chart: Single-node CPU utilization comparison, 4K random read at QD=32, 4/8/16 OSDs, comparing 1, 2 and 4 OSDs per NVMe]

These measurements were done on a Ceph node based on Intel P3700 NVMe SSDs but are equally applicable to other NVMe devices.

Page 18: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Performance Testing Results
4K 100% Random Read

[Chart: 4K random read latency vs. IOPS, IODepth scaling 4-128, 5 nodes, 10 clients x 10 RBD volumes, Red Hat Ceph Storage 2.1, default vs. tuned configuration]

• Tuned configuration: ~1.34M IOPS at ~1ms average latency (QD=16), rising to ~1.57M IOPS at ~4ms
• Roughly 200% improvement in IOPS and latency over the default ceph.conf

Page 19: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Performance Testing Results
4K 100% Random Write, 70/30 OLTP Mix

[Chart: latency vs. IOPS for 100% write and 70/30 read/write mixes, IODepth scaling 4-128, 5 nodes, 10 clients x 10 RBD volumes, Red Hat Ceph Storage 2.1]

• ~450K 70/30 OLTP IOPS at ~1ms average latency (QD=4)
• ~165K write IOPS at ~2ms average latency (QD=4)

Page 20: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

NUMA Considerations

• NUMA-balance network and storage devices across CPU sockets
• Bind I/O devices to the local CPU socket (IRQ pinning)
• Align OSD data and journals to the same NUMA node
• Pin OSD processes to the local CPU socket (NUMA node pinning)

[Diagram: two CPU sockets (NUMA node 0 and NUMA node 1) connected by QPI, each with local memory, NICs and storage; Ceph OSDs run on the socket local to their devices, avoiding remote accesses across QPI]
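A rough sketch of the IRQ and process pinning described above is shown below. The device name, OSD id and the use of numactl/taskset are illustrative assumptions; the deck does not show its actual pinning scripts.

```bash
# Illustrative NUMA pinning, not the deck's actual scripts.
NODE0_CPUS=$(cat /sys/devices/system/node/node0/cpulist)   # e.g. 0-21,44-65

# IRQ pinning: steer interrupts of a socket-0 NVMe device to socket-0 CPUs.
for irq in $(awk '/nvme0/ {sub(":", "", $1); print $1}' /proc/interrupts); do
  echo "$NODE0_CPUS" > "/proc/irq/$irq/smp_affinity_list"
done

# Process pinning: restrict an already running OSD (id assumed) to socket 0 ...
taskset -a -c -p "$NODE0_CPUS" "$(pgrep -f 'ceph-osd.*--id 3')"

# ... or launch it under numactl so its memory is also allocated locally.
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -f --cluster ceph --id 3
```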

Page 21: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

NUMA-Balanced Config on QCT QuantaGrid D51BP-1U

[Diagram: QuantaGrid D51BP-1U with two CPU sockets connected by QPI, each socket with local RAM, 4 NVMe drive slots (PCIe Gen3 x4 per drive) and 1 NIC slot (PCIe Gen3 x8); Ceph OSDs 1-8 run on CPU 0 and Ceph OSDs 9-16 on CPU 1, local to their NVMe devices and NIC]
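One way to confirm this kind of balanced placement on a live system is to read each PCI device's NUMA node from sysfs, as in the hypothetical check below (not a step shown in the deck).

```bash
# Illustrative check: print the NUMA node of every NVMe controller and Ethernet NIC.
for dev in /sys/bus/pci/devices/*; do
  case "$(cat "$dev/class")" in
    0x010802) kind="NVMe" ;;   # PCI class code: NVM Express controller
    0x020000) kind="NIC"  ;;   # PCI class code: Ethernet controller
    *) continue ;;
  esac
  echo "$kind $(basename "$dev") -> NUMA node $(cat "$dev/numa_node")"
done
numactl --hardware   # CPU and memory layout of the two sockets
```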

Page 22: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Performance Testing Results
Latency Improvements after NUMA Optimizations

[Chart: 70/30 4K OLTP performance before vs. after NUMA balancing, latency vs. IOPS, IODepth scaling 4-128, 5 nodes, 10 clients x 10 RBD volumes, Red Hat Ceph Storage 2.1, software-tuned vs. software-tuned + NUMA CPU pinning]

With NUMA balancing at QD=8:
• 40% better IOPS
• 100% better average latency
• 15-20% better 90th percentile latency
• 10-15% better 99th percentile latency

Page 23: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

• All-NVMe Ceph enables high-performance workloads
• NUMA-balanced architecture
• Small footprint (1U), lower overall TCO
• Million IOPS with very low latency


Page 24: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

For Other Information…

Visit www.QCT.io for QxStor Red Hat Ceph Storage Edition:
• Reference Architecture: Red Hat Ceph Storage on QCT Servers
• Datasheet: QxStor Red Hat Ceph Storage
• Solution Brief: QCT and Intel Hadoop Over Ceph Architecture
• Solution Brief: Deploying Red Hat Ceph Storage on QCT Servers
• Solution Brief: Containerized Ceph for On-Demand, Hyperscale Storage


Page 25: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

Appendix

Page 26: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

# Please do not change this file directly since it is managed by Ansible and will be overwritten

[global]
fsid = 7e191449-3592-4ec3-b42b-e2c4d01c0104
max open files = 131072
crushtool = /usr/bin/crushtool
debug_lockdep = 0/1
debug_context = 0/1
debug_crush = 1/1
debug_buffer = 0/1
debug_timer = 0/0
debug_filer = 0/1
debug_objecter = 0/1
debug_rados = 0/5
debug_rbd = 0/5
debug_ms = 0/5
debug_monc = 0/5
debug_tp = 0/5
debug_auth = 1/5
debug_finisher = 1/5
debug_heartbeatmap = 1/5
debug_perfcounter = 1/5
debug_rgw = 1/5
debug_asok = 1/5
debug_throttle = 1/1
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_mon = 0/0
debug_paxos = 0/0
osd_crush_chooseleaf_type = 0
filestore_xattr_use_omap = true
osd_pool_default_size = 1
osd_pool_default_min_size = 1

Configuration Detail – ceph.conf (1/2)

Page 27: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

rbd_cache = true
mon_compact_on_trim = false
log_to_syslog = false
log_file = /var/log/ceph/$name.log
mutex_perf_counter = true
throttler_perf_counter = false
ms_nocrc = true

[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor
rbd_cache = true
rbd_cache_writethrough_until_flush = false

[mon]
[mon.qct50]
host = qct50
# we need to check if monitor_interface is defined in the inventory per host or if it's set in a group_vars file
mon addr = 10.5.15.50
mon_max_pool_pg_num = 166496
mon_osd_max_split_count = 10000

[osd]
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog"
osd journal size = 10240
cluster_network = 10.5.16.0/24
public_network = 10.5.15.0/24
filestore_queue_max_ops = 5000
osd_client_message_size_cap = 0
objecter_inflight_op_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
filestore_wbthrottle_enable = True
filestore_fd_cache_shards = 64
objecter_inflight_ops = 1024000
filestore_max_sync_interval = 10
filestore_op_threads = 16
osd_pg_object_context_cache_count = 10240
journal_queue_max_ops = 3000
filestore_odsync_write = True
journal_queue_max_bytes = 10485760000
journal_max_write_entries = 1000
filestore_queue_committing_max_ops = 5000
journal_max_write_bytes = 1048576000
filestore_fd_cache_size = 10240
osd_client_message_cap = 0
journal_dynamic_throttle = True
osd_enable_op_tracker = False

Configuration Detail – ceph.conf (2/2)

Page 28: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture

cluster:
  head: "root@qct50"
  clients: ["root@qct50", "root@qct51", "root@qct52", "root@qct53", "root@qct54",
            "root@qct55", "root@qct56", "root@qct57", "root@qct58", "root@qct59"]
  osds: ["root@qct62", "root@qct63", "root@qct64", "root@qct65", "root@qct66"]
  mons: ["root@qct50"]
  osds_per_node: 16
  fs: xfs
  mkfs_opts: -f -i size=2048 -n size=64k
  mount_opts: -o inode64,noatime,logbsize=256k
  conf_file: /etc/ceph/ceph.conf
  ceph.conf: /etc/ceph/ceph.conf
  iterations: 1
  rebuild_every_test: False
  tmp_dir: "/tmp/cbt"
  clusterid: 7e191449-3592-4ec3-b42b-e2c4d01c0104
  use_existing: True
  pool_profiles:
    replicated:
      pg_size: 8192
      pgp_size: 8192
      replication: 2
benchmarks:
  librbdfio:
    rbdadd_mons: "root@qct50:6789"
    rbdadd_options: "noshare"
    time: 300
    ramp: 100
    vol_size: 8192
    mode: ['randread']
    numjobs: 1
    use_existing_volumes: False
    procs_per_volume: [1]
    volumes_per_client: [10]
    op_size: [4096]
    concurrent_procs: [1]
    iodepth: [4, 8, 16, 32, 64, 128]
    osd_ra: [128]
    norandommap: True
    cmd_path: '/root/cbt_packages/fio/fio'
    log_avg_msec: 250
    pool_profile: 'replicated'

Configuration Detail - CBT YAML File
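For context, CBT consumes a YAML file like the one above; a typical invocation looks roughly like the following, with the archive directory and file name as placeholders rather than paths from the deck.

```bash
# Illustrative CBT run using the YAML above; paths are placeholders.
./cbt.py --archive=/tmp/cbt-results ./all-nvme-randread.yaml
```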

Page 29: Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture


www.QCT.io

Looking for an innovative cloud solution?

Come to QCT, who else?