SLES Performance Enhancements for Large NUMA Systems

© 2014 Hewlett-Packard Development Company, L.P.The information contained herein is subject to change without notice.

SLES Performance Enhancements for Large NUMA Systems

Scott Norton – HP

Davidlohr Bueso – SUSE Labs (formerly HP)

SUSECon Orlando, FL

November 2014

© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Agenda

1. Why focus on locking?

2. Proof of concept

3. Methodology

4. Performance enhancements for 3.0.101-0.15-default

5. Performance enhancements for 3.0.101-0.30-default

6. Performance enhancements for 3.0.101-0.30-bigsmp

7. 3.0.101-0.30-bigsmp vs. 3.0.101-0.8-default

8. Performance enhancements for SLES 12

9. Wrap-up

2


Why Focus on Locking?


Cache Line Contention in Large NUMA Systems

• Lock contention (particularly spinning lock contention) is the primary, and probably worst, cause of cache line contention

• Cache line contention does have a “cost” associated with NUMA systems, but it is not the same “cost” that you experience with local vs. remote memory latency in NUMA systems

• However, it’s not only about lock contention

Cache line contention can also come from sharing cache lines due to poor data structure layout – two fields in a data structure that are accessed by completely different processes/threads, but end up in the same cache line

Worst case: an unrelated and frequently accessed field occupies the same cache line as a heavily contended lock

Other atomic operations, such as atomic-add, can also generate cache line contention

Additionally, the processor’s cache prefetch mechanism may also cause false cache line contention

4


Questions Driving Investigation

• How is performance impacted when cache line contention is:

• contained within a socket?

• spread among multiple sockets?

• using <n> cores in a socket vs <n> cores from multiple sockets?

• using all cores in 4-socket, 8-socket, and 16-socket systems?

5


Demonstrating Cache Line Contention Effects

• Test program to show the cost of cache line contention in large NUMA systems:

• Bind threads (1 per core) to specified cores. Memory is allocated from a specific node.

• Once the threads are synchronized, perform a tight loop doing spin_lock/spin_unlock 1,000,000 times. This generates an extreme amount of cache line contention. The spinlock implementation was taken from a Linux 3.0 based kernel.

• Based on the number of threads and the loop iteration count we can calculate the average number of “operations per second per CPU” when <N> CPUs are involved in the cache line contention.

• This is not a real-world test. While this is a micro-benchmark, it does show the effects of cache line contention so that real code can be written with cache line contention in mind.

• Test systems:

• 4-sockets/ 60-cores/120-threads Intel Xeon E7-4890 v2 1-TB

• 8-sockets/120-cores/240-threads Intel Xeon E7-2890 v2 6-TB

• 16-sockets/240-cores/480-threads Intel Xeon E7-2890 v2 12-TB

6


Contention within a Socket: Increasing Core Count

Performance degrades smoothly as more cores are involved in cache line contention

Execution Nodes

Memory node

Sockets Used

Cores used

Seconds Ops per Sec per

Core

% decrease from 2-core

% decrease from

previous

Node 1 Node 1 1-socket

2-cores 1.704489 5,866,861 0.0% 0.0%

3-cores 2.783121 3,593,088 38.8% 38.8%

4-cores 4.012157 2,492,425 57.5% 30.6%

5-cores 5.506802 1,815,936 69.0% 27.1%

6-cores 7.110453 1,406,380 76.0% 22.6%

7-cores 7.834159 1,276,461 78.2% 9.2%

8-cores 10.054136 994,616 83.0% 22.1%

9-cores 11.185041 894,051 84.8% 10.1%

10-cores 13.508867 740,255 87.4% 17.2%

11-cores 14.839633 673,871 88.5% 9.0%

12-cores 16.490477 606,411 89.7% 10.0%

13-cores 19.138960 522,494 91.1% 13.8%

14-cores 20.704514 482,986 91.8% 7.6%

7


Contention across 2-Sockets

• All 30 cores in 2 nodes/sockets participate in the cache line contention:

Execution Nodes

Memory node

Sockets Used

Cores used Seconds

Ops per Sec per

Core

% decrease from

1-socket

Node 0 Node 1 1-socket 15-cores 2.107396 474,519 0.0%

Nodes 0-1

Node 1 2-socket 30-cores

14.450938 69,200 85.4%

Nodes 1-2 14.897306 67,126 86.0%

Nodes 2-3 21.742537 45,993 90.4%

• There are two interesting points here:

1) There is a huge drop in performance when going from 15-cores on1-socket to 30-cores on 2-sockets

2) There is a smaller drop in performance when the lock’s memory location is completely remote from the sockets involved in cache line contention (nodes 1-2 vs. nodes 2-3)

8


Contention across 2-Sockets: Increasing Core Count

• We can see that the huge drop in performance occurs once we add a single core from the second socket.

• This is due to the need to go through QPI to handle the cache-to-cache traffic to resolve the cache line contention.

• This is a significant drop in performance when going through QPI.

Execution Nodes

Memory node

Sockets Used

Cores used Seconds

Ops per Sec per

Core

% decrease from 2-core

% decrease from

previous

Node 0 Node 1 1-socket

13-cores 1.649242 606,339 92.7% 9.1%

14-cores 1.905878 524,693 93.7% 13.5%

15-cores 1.649242 482,435 94.2% 8.1%

Nodes 0-1 Node 1 2-sockets

16-cores 1.905878 129,309 98.4% 73.2%

17-cores 8.348480 119,782 98.6% 7.4%

18-cores 8.264046 121,006 98.5% -1.0%

30-cores 15.146260 66,023 99.2% 8.5%

• Add one core at a time, filling node/socket-0 first, then filling node/socket-1:

9


Contention across 2-Sockets: Round-Robin vs. Fill-First• Contention measured across 2, 8, and 14 cores

• Cores spread among 2-sockets (round-robin) vs. all cores in one socket (fill-first):

Execution Nodes

Memory Node

Socketsused

Cores per Socket used

Cores used


Core

Node 1 Node 1 1-socket FF 2-cores 2-cores 0.120395 8,305,993

Node 0-1

Node 1 2-sockets RR 1-core 2-cores

0.314462 3,180,034

Node 1-2 0.305783 3,270,293

Node 2-3 0.453627 2,204,454

Node 1 Node 1 1-socket FF 8-cores 8-cores 1.018527 981,810

Node 0-1

Node 1 2-sockets RR 4-cores 8-cores

3.351590 298,366

Node 1-2 3.390266 294,962

Node 2-3 5.354243 186,768


Node 0-1

Node 1 2-sockets RR 7-cores 14-cores

6.214167 160,923

Node 1-2 6.275140 159,359

Node 2-3 9.471300 105,582

1) Numa effect is visible when memory is remote

2) Best performance when all cores are in one socket

10


Contention across 4-Sockets: Round-Robin vs. Fill-First

• Contention measured across 4, 8, and 12 cores

• Cores spread (round-robin) among 4-sockets vs. all cores in one socket (fill-first)

Execution Nodes

Memory Node

Socketsused


Cores used


Core


Node 0-3 Node 1 4-sockets RR 1-core 4-cores 1.491732 670,362


Node 0-3 Node 1 4-sockets RR 2-cores 8-cores 5.421381 184,455



Node 0-3 Node 1 4-sockets FF 15-cores 60-cores 49.786041 20,086

• Cache line contention is clearly better when all the contention is contained within a single socket.

• For the same core count, performance degrades as more sockets are involved in the contention

11



• Contention measured across 8, 16 and 24 cores

• Cores spread (round-robin) among 8-sockets vs. all cores in two sockets (fill-first):

Execution Nodes

Memory Node

Socketsused


Cores used


Core








• Cache line contention is clearly better when all the contention is contained within as few sockets as possible.

12



• Contention measured across 16, 32 and 64 cores

• Cores spread (round-robin) among 16-sockets vs. all cores in 1/2/4 sockets (fill-first):

Execution Nodes

Memory Node

Sockets used Cores per Socket used

Cores used


Core








• Cache line contention is clearly better when all the contention is contained within as few sockets as possible.

13


• On a 4-socket/60-core system you have a 25% chance that any two random cores participating in the same cache line contention are on the same socket

• On an 8-socket/120-core system this is reduced to a 12.5% chance

• With a 16-socket/240-core system you have only a 6.25% chance

Execution Nodes

Memory Node

Sockets used Cores per Socket used

Cores used

Seconds Ops per Sec per Core



Node 1 Node 1 1-sockets FF 8-cores 8-cores 1.185326 843,650




Inter- vs Intra-Cache Line Contention Probability

14


Why do we care so much?

• Many applications scale based on the number of CPUs available. For example, one or two worker threads per CPU.

• However, many applications today have been tuned for 4-socket/40-core and 8-socket/80-core Westmere platforms.

• Going from 40- or 80-cores to 240-cores (16-sockets) is a major jump.

• Scaling based only on the number of CPUs is likely to introduce significant lock and cache line contention inside the Linux kernel.

• As seen in the previous slides, the impact of cache line contention gets significantly worse as more sockets and cores are added into the system – this is a major concern when dealing with 8- and 16-socket platforms.

• This has led us to pursue minimizing cache line contention within Linux kernel locking primitives.

15


Proof of Concept


The AIM7 fserver workload* scales poorly on 8s/80core NUMA platform with a 2.6 based kernel

* The workload was run with ramfs.

Background

17


From the perf -g output, we find most of the CPU cycles are spent in file_move() and file_kill().

40 Users (4000 jobs)

+ 9.40% reaim reaim [.] add_int

+ 6.07% reaim libc-2.12.so [.] strncat…..- 1.68% reaim [kernel.kallsyms] [k] _spin_lock

- _spin_lock

+ 50.36% lookup_mnt

+ 7.45% __d_lookup

+ 6.71% file_move

+ 5.16% file_kill

+ 2.46% handle_pte_fault

Proportion of file_move() = 1.68% * 6.71% = 0.11%

Proportion of file_kill() = 1.68% * 5.16% = 0.09 %

Proportion of file_move() + file+kill() = 0.20%

400 users (40,000 jobs)

- 79.53% reaim [kernel.kallsyms] [k] _spin_lock

- _spin_lock

+ 34.28% file_move

+ 34.20% file_kill

+ 19.94% lookup_mnt

+ 8.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner

+ 0.86% reaim [kernel.kallsyms] [k] _spin_lock_irqsave

+ 0.63% reaim reaim [.] add_long

Proportion of file_move() = 79.53% * 34.28% = 27.26%

Proportion of file_kill() = 79.53% * 34.20% = 27.20%

Proportion of file_move() + file+kill() = 54.46%

Analysis (1-2)

This is significant spinlock contention!

18


We use the ORC tool to monitor the coherency controller results

(ORC is a platform dependent tool from HP that reads performance counters in the XNC node controllers)

Coherency Controller Transactions Sent to Fabric Link (PRETRY number)

Socket Agent 10users 40users 400users

0 0 17,341 36,782 399,670,585

0 8 36,905 45,116 294,481,463

1 0 0 0 49,639

1 8 0 0 25,720

2 0 0 0 1,889

2 8 0 0 1,914

3 0 0 0 3,020

3 8 0 0 3,025

4 1 45 122 1,237,589

4 9 0 110 1,224,815

5 1 0 0 26,922

5 9 0 0 26,914

6 1 0 0 2,753

6 9 0 0 2,854

7 1 0 0 6,971

7 9 0 0 6,897

PRETRY indicates the associated read needs to be re-issued.

We can see that when users increase, PRETRY on socket 0 increases rapidly.

There is serious cache line contention on socket 0 with 400 users. Many jobs are waiting for the memory location on socket 0 which contains the spinlock.

PRETRY number on socket 0:400 users = 400M + 294M = 694M

Analysis (2-2)

19


• Code snippet from the 2.6 based kernel for file_move() and file_kill():

extern spinlock_t files_lock;

#define file_list_lock() spin_lock(&files_lock);

#define file_list_unlock() spin_unlock(&files_lock);

void file_move(struct file *file,

struct list_head *list)

{

if (!list) return;

file_list_lock();

list_move(&file->f_u.fu_list, list);

file_list_unlock();

}

void file_kill(struct file *file)

{

if (!list_empty(&file->f_u.fu_list)) {

file_list_lock();

list_del_init(&file->f_u.fu_list);

file_list_unlock();

}

}

Removing Cache Line Contention

20

• Contention on this global spinlock is the cause of all the cache line contention

• We developed a prototype MCS/Queued spinlock to see its effect on cache line traffic

• MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word

• Implementation is available in the back-up slides

• No efforts were made to make this a finer grained lock


Comparing the performance of the new kernel (blue line) vs. the original kernel (red line)

Prototype Benchmark Results

2.4x improvement in throughput with the MCS/Queued spinlock prototype! 21


• The proportion of time for the functions file_move() and file_kill() is now small in the 400 users case when using an MCS/Queued spinlock (dropped from 54.46% to 2.38%)

• The functions lookup_mnt() and __mutex_lock_slowpath() now take most of the time.

400 users(40000 jobs)

44.71% reaim [kernel.kallsyms] [k] _spin_lock

-60.94%-- lookup_mnt

….

22.13% reaim [kernel.kallsyms] [k] mutex_spin_on_owner

-96.16%-- __mutex_lock_slowpath

……

1.19% reaim [kernel.kallsyms] [k] file_kill

1.19% reaim [kernel.kallsyms] [k] file_move

Proportion of lookup_mnt() = 27.2%

Proportion of __mutex_lock_slowpath() = 21.3%

Proportion of file_move() + file+kill() 2.38%

Prototype Analysis (1-2)

perf –g output of the kernel with MCS/Queued spinlock prototype:

22


Coherency controller results of the kernel with the MCS/Queued spinlock

Coherency Controller Transactions Sent to Fabric Link (PRETRY number)

Socket Agent 10users 40users 400users

0 0 18,216 24,560 83,720,570

0 8 37,307 42,307 43,151,386

1 0 0 0 0

1 8 0 0 0

2 0 0 0 0

2 8 0 0 0

3 0 0 0 0

3 8 0 0 0

4 1 52 222 16,786

4 9 28 219 10,068

5 1 0 0 0

5 9 0 0 0

6 1 0 0 0

6 9 0 0 0

7 1 0 0 0

7 9 0 0 0

We can see that as users increase, PRETRY in socket 0 also increases – but it is significantly lower than the kernel without the MCS/Queued lock.

The PRETRY number for socket 0:400 users = 84M + 43M = 127M.

This value is about 1/5 of the original kernel (694M).

This shows the MCS/Queued spinlock algorithm reduces the PRETRY traffic that occurs in file_move() and file_kill() significantly even though we still have the same contention on the spinlock.

Prototype Analysis (2-2)

23


• The MCS/Queued spinlock improved the throughput of large systems just by minimizing the inter-socket cache line traffic generated by the locking algorithm.

• The MCS/Queued spinlock did not reduce the amount of contention on the actual lock. We have the same number of spinners contending for the lock. No code changes were made to reduce lock contention.

• However, the benchmark throughput improved from ~160,000 to ~390,000 jobs per minute due to the reduced inter-socket cache-to-cache traffic.

• System time spent spinning on the lock dropped from 54% to 2%.

• Lock algorithms can play a huge factor in the performance of large-scale systems

• The impact of heavy lock contention on a 240-core system is much more severe than the impact of heavy lock contention on a 40-core system

• This is not a substitute for reducing lock contention… Reducing lock contention is still the best solution, but attention to lock algorithms that deal with contention *is* extremely important and can yield significant improvements.

Proof of Concept - Conclusions

24


Methodology


Benchmarks and Tools Used

• SPECjbb2005 / SPECjbb2013

• AIM7

• Swingbench OLTP

• Swingbench DSS

• Ebizzy

• Perf-bench

• Kernbench

• Hackbench

• Futextest

• IOzone

• fio

• SLOB (Silly Little Oracle Benchmark)

• Customer benchmarks and apps

26

• perf

• ORC/ORC2 (HP internal tool)

• LiKI (HP internal tool)

• lockstat

• numatop

• slabtop

• ftrace, strace

• top, sar, vmstat, powertop

SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).


Development Process

• Identify areas to improve performance and reduce cache line contention

• Determine if an upstream patch already fixes the problem.

• If no upstream patches exist:

• Develop fix

• Validate performance improvement

• Submit the patches upstream and work through the upstream process to get them accepted

• Depending on upstream feedback this may be an iterative process until the changes are accepted

• Back-port patches to the SLES kernels

• Validate performance improvement

• Submit back-ported patches along with performance justification to SUSE for inclusion in SLES

• SUSE delivers a PTF kernel to HP containing the performance enhancements

• HP performs a final performance validation of the PTF kernels, sometimes with a customer

• Changes put into the next SLES Maintenance Update for all SUSE customers to use.

27


SUSE / HP Collaboration

• Bugzillas were filed for each set of patches

• Technical discussions with SUSE Performance engineers

• Discussions on the design of the patch, the back-port of the patch, the performance validation data, kABI impacts, etc

• Engaged SUSE performance engineers in the analysis of the problem as needed

• Sometimes restructured the back-ported patch to work better with SLES 11sp3 after consultation with SUSE

• For some patches submitted upstream had initial review and feedback from SUSE performance engineers prior to the upstream submittal

• The recognition of customer benefit and the flexibility of both companies really made all this possible.

28


Performance Enhancements

SLES 11sp3 Kernel Version:

3.0.101-0.15-default


Idle Balance

• One of the problems with significant lock contention on blocking locks (such as a mutex) is that as more processes block on the mutex there is less to run – this causes the idle balancer to take processes from a different CPU’s run queue. This in turn causes even further cache issues.

• Ensure that we don’t attempt an idle balance operation when it takes longer to do the balancing than the time the cpu would be idle

• We do this by keeping track of the maximum time spent in idle balance for each scheduler domain and skipping idle balance if max-time-to-balance > avg_idle for this CPU

• Max-time-to-balance is decayed at a rate of about 1% per second

• Improve the accuracy of the average CPU idle duration.

• Previously the average CPU idle duration was over estimated resulting in too much idle balancing

30

93,609

18,600

Java Operations with 16-sockets / 240-cores / 480-threads

before changes

after changes


Mutex Atomics Reduction

• Reduce the number of atomic operations when trying to acquire a mutex

− Entry into the mutex lock slow path will cause 2 atomic _xchg instructions to be issued

− With the atomic decrement in the fast path a total of 3 atomic read-modify-write instructions are issued in rapid succession

− Causes lots of cache bouncing when many tasks are trying to acquire the mutex simultaneously

− By using atomic_read() to check the value before calling atomic_xchg() or atomic_cmpxchg()we avoid unnecessary cache coherency traffic

• Also known as compare-compare-and-swap.

− For example:

if ((atomic_read(&lock->count) == 1) && (atomic_cmpxchg(&lock->count, 1, 0) == 1))

31

137,370

93,609


before changes

after changes


XFS Small Block Size Performance

• A customer acceptance benchmark demonstrated really poor performance with XFS for 4k and 16k block sizes (sometimes 64k) for initial-writes as well as over-writes for multithreaded applications.

• Further investigation identified a set of patches already developed for the upstream Linux kernel revision 3.4

• The primary patch introduces per filesystem I/O completion workqueues (as opposed to global workqueues)

• Allows concurrency on the workqueues - blocking on one inode does not block others on a different inode.

• These patches were back-ported to SLES 11sp3 (and by default now part of 11sp4 and 12)

• Improved synchronous 16k initial-write performance from 1.2 MB/s to 138 MB/s

• Improved asynchronous 16k initial-write performance from 14 MB/s to 141 MB/s

• Also improves 16k over-write performance as well as 4k initial-write and over-write performance.

32


Futex Hash Size and Wait Queues

• Hash Bucket Size

− Originally 256 hash buckets for the whole system

• 256 * nCPUs cache line aligned hash buckets

• Fewer collisions and more spinlocks leading to more parallel futex call handling.

− Distribute the table among NUMA nodes instead of a single one.

− The perfect hash size will of course have one to one hash-bucket:futex ratio.

• Lockless Wait Queue Size

− A common misuse of futexes is to make FUTEX_WAKE calls when there are no waiters.

− In FUTEX_WAKE, there’s no reason to take the hb->lock if we already know the list is empty and thus one to wake up.

− Use an independent atomic counter to keep track of the list size.

− This can drastically reduce contention on the hash bucket spinlock.

33

0.10%

43.71%

% System time spinning on hb->lock for a large database

before changes

after changes


Futex: Larger NUMA-aware Hash Table

0

200,000

400,000

600,000

800,000

1,000,000

1,200,000

1,400,000

512 256 128 80 64 32

Op

era

tio

ns

pe

r se

con

d

threads

Futex Hash Table Scaling

baseline

aligned buckets

large numa-awaretable

aligned buckets pluslarge numa-awaretable

34


3.0.101-0.15-default: Perf Improvements

• Performance measurements were done with a popular Java based workload (a higher number of operations is better)

35

6,502

67,447

0

10,000

20,000

30,000

40,000

50,000

60,000

70,000

80,000

Java Operations

16-sockets / 240-cores / 480-threads / 12-TB

3.0.101-0.8-default

3.0.101-0.15-default

Can’t g

et va

lid r

esu

lt at 1

6-sock

ets




3.0.101-0.30-default


Mutex Optimizations

• Unlock a mutex without acquiring the wait lock

− Workloads with high amounts of mutex contention would spend significant time spinning on the mutex’s internal waiter lock which then delays the mutex from getting unlocked.

− Changed the mutex unlock path to unlock the mutex before acquiring the internal waiter lock to deal with any waiters.

− Delays in acquiring the waiter lock will not prevent others from acquiring the mutex.

• Mutex slowpath optimizations

− When a lock can’t be acquired and a thread enters the mutex slowpath it put’s itself on the wait list and tries one last time to acquire the mutex.

− Changed the order and attempted the acquisition of the mutex first

− If acquired the we do not have to remove the thread from the waiter list

37

156,912

109,933


before changes

after changes


Lockref

• Reference counts are normally used to track the lifecycle of data structures.− A reference count of zero means the structure is unused and is free to be released

− A positive reference count indicates how many tasks are actively referencing the structure

− When embedded into a data structure, it is not uncommon to acquire a lock just to increment or decrement the reference count variable. Under load, this lock can become heavily contended.

• The lockref patch introduces a new mechanism for a lockless atomic update of a spinlock protected reference count. − Bundle a 4-byte spinlock and a 4-byte reference count into a single 8-byte word that can be

updated atomically while no one is holding the lock.

• The VFS layer makes heavy use of reference counts for dentry operations.− Workloads that generate lots of filesystem activity can be bottlenecked by the spinlock

contention on the dentry reference count update.

− The dentry operations were modified to make use of the lockref patch to resolve this contention by doing reference count updates without taking a lock.

38

0.01%

83.74%

% time spinning on dentry lock: AIM-7 short workload

before changes

after changes


Lockref and Dentry Benchmark Results

~ 7x improvement in throughput (JPM)!

39

0

1,000,000

2,000,000

3,000,000

4,000,000

5,000,000

6,000,000

7,000,000

8,000,000

10

20

30

40

50

60

70

80

90

100

200

300

400

500

600

700

800

900

100

0

110

0

120

0

130

0

140

0

150

0

160

0

170

0

180

0

190

0

200

0

Job

s P

er

Min

ute

Simulated Users

AIM7 short workload: 8-socket / 80-core / 1-TB

Pre-lockref

changes

Post-Lockref

Changes


SELinux Bit-Map Management

• Heavy use of the ls command results in a significant amount of CPU time being spent in the mls_level_isvalid() kernel function.

• Replaced the inefficient implementation of the mls_level_isvalid() function in the multi-level security (MLS) policy module of SELinux with a performance optimized version.

− More efficient bit-map management

• The CPU time spent in this code path is reduced from 8.95% to 0.12% in the AIM-7 high_systime workload

40

0.12%

8.95%

Changes in system time for the mls_level_isvalid() code path

before changes

after changes


Hugetlb: Parallel Page Faults

• The kernel originally serialized hugetlb page faults, handling a single fault at a time.

− Workloads with large working sets backed-by hugepages (i.e.: databases or KVM guests) can especially suffer from painful startup times due to this.

− Protection from spurious OOM errors under conditions of low availability of free hugepages.

− This problem is specific to hugepages because it is normal to want to use every single hugepage in the system - with normal pages we assume there will always be a few spare pages which can be used temporarily until the race is resolved.

• Address this problem by using a table of mutexes, allowing a better chance of parallelization, where each hugepage is individually serialized.

− The hash key is selected depending on the mapping type.

− Because the size of the table is static, this can, in theory, still produce contention, if reserving enough hugepages. But reality indicates that this is purely theoretical.

41

25.7

37.5

Startup time (seconds) of a 10-Gb Oracle DB (Data Mining)

before changes

after changes


epoll_ctl() Scaling

• Java workloads on 8- and 16-socket systems showed significant lock contention on the global epmutex in the epoll_ctl() system call when adding or removing file descriptors to/from an epoll instance.

• Further investigation identified a set of patches already developed for the upstream Linux kernel:

• Don’t take the global epmutex lock in EPOLL_CTL_ADD for simple topologies (it’s not needed)

• Remove the global epmutex lock from the EPOLL_CTL_DEL path and instead use RCU to protect the list of event poll waiters against concurrent traversals

• RCU (Read-Copy Update) is a Linux synchronization mechanism allowing lockless reads to occur concurrently with updates

42

165,119

87,903


before changes

after changes


3.0.101-0.30-default: Perf Improvements


43

6,50267,447

946,552

0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000

1,000,000

Java Operations


3.0.101-0.8-default

3.0.101-0.15-default

3.0.101-0.30-default

Can’t g

et va

lid r

esu

lt at 1

6-sock

ets




3.0.101-0.30-bigsmp


Why the “bigsmp” kernel?

• Some of the performance enhancements HP provided for SLES 11sp3 caused a breakage in the Kernel ABI (kABI)

• The User Application ABI remains the same – all applications that run on the “default” SLES 11sp3 kernel have full binary and source compatibility with the “bigsmp” SLES 11sp3 kernel.

• There was a small possibility that this kABI breakage would impact kernel drivers and modules

• Rather than risk compatibility issues at customer sites SUSE created the “bigsmp” flavor of the SLES 11sp3 kernel which contains these additional performance enhancements.

• The bigsmp flavor of SLES 11sp3 has it’s own kABI

• Requires a recompile of kernel drivers and modules

• SUSE experience and process flexibility allowed for the creation of the bigsmp kernel so that these additional performance enhancements could be delivered to customers.

• All of these changes will be included in the SLES 11sp4 GA and SLES 12 GA “default” kernels. Bigsmp will be an on-going flavor for SLES 11sp3 for all platforms.

45


Mutex MCS Optimizations

• The MCS lock is a new locking primitive inside Linux

• Each locker spins on a local variable while waiting for the lock rather than spinning on the lock itself.• Maintains a list of spinning waiters.

• When a lock is released the unlocker changes the local variable of the next spinner.

• This change causes the spinner to stop spinning and acquire the lock.

• Eliminates most of the cache-line bouncing experienced by simpler locks, especially in the contended case when a simple CAS (Compare-and-Swap) calls fail.

• Fair, passing the lock to each locker in the order that the locker arrived.

• Specialized cancelable MCS locking was applied internally to kernel mutexes

• The cancelable MCS lock is a specially tailored lock for MCS: when needing to reschedule, we need to abort the spinning in order to block.

46

250,981

137,268


before changes

after changes


All Mutex Lock Performance Improvements

• Performance measurements were done with a popular Java based workload (higher number of operations is better)

• System used: 16-sockets, 240-cores, 480-threads

47

72,089

137,268

250,891

0

100,000

200,000

300,000

NU

MBER O

F O

PERA

TIO

NS

Baseline prior to any mutex changes

3 non-MCS mutex changes

90% over baseline

All mutex changes

248% over baseline

83% over non-MCS mutex


Per-thread VMA caching

• A process’s address space is divided among VMAs (virtual memory areas) – each storing a range of addresses that share similar properties, such as permissions.

− A common operation when dealing with memory is locating (find_vma()) a VMA that contains a range of addresses.

• Traditionally the Linux kernel will cache the last used VMA.

− Avoids expensive tree lookups (scales poorly in multi-thread programs).

− This works nicely for workloads with good locality (over 70% hit rates), yet very badly for those with poor locality (less than 1% hit rates).

• Replace the cache by a small, per-thread, hash table.

− O(1) lookups/updates, cheap to maintain and small overhead.

− Improves poor locality hit-rates to ~99.9%.

− Improves Oracle 11g Data Mining (4k pages) hit-rates from 70% to 91%.

48


3.0.101-0.30-bigsmp: Perf Improvements

• Performance measurements were done with a popular Java based workload (higher number of operations is better)

49

6,50267,447

946,552994,771

0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000

1,000,000

Java Operations


3.0.101-0.8-default

3.0.101-0.15-default

3.0.101-0.30-default

3.0.101-0.30-bigsmp

Can’t g

et va

lid r

esu

lt at 1

6-sock

ets


3.0.101-0.30-bigsmp

vs.

3.0.101-0.8-default


3.0.101-0.8-default Scaling


51

69,426

114,617

180,922

139,825

6,502

0

20,000

40,000

60,000

80,000

100,000

120,000

140,000

160,000

180,000

200,000

Java Operations - 1/2/4/8/16 sockets

1-socket / 15-cores

2-sockets / 30-cores




• Scaling issues at 8- and 16-sockets. Other Linux distributions are similar.

Can’t g

et va

lid resu

lt at 16

-sock

ets


3.0.101-0.30-bigsmp Scaling


52

71,517132,996

260,423

512,953

994,771

0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000

1,000,000

Java Operations - 1/2/4/8/16 sockets

1-socket / 15-cores





Scaling is now approximately 1.9x for each doubling of sockets!


3.0.101-0.30-bigsmp: AIM-7 alltests

53

~ 4x performance

improvement!

0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000

10

20

30

40

50

60

70

80

90

10

0

20

0

30

0

40

0

50

0

60

0

70

0

80

0

90

0

10

00

11

00

12

00

13

00

14

00

15

00

16

00

17

00

18

00

19

00

20

00

Job

s P

er

Min

ute

Simulated Users

3.0.101-0.30-bigsmp

3.0.101-0.30-default

3.0.101-0.15-default

3.0.101-0.8-default

AIM-7 alltests workload: 16-sockets / 240-cores / 480-threads / 12-TB


3.0.101-0.30-bigsmp: AIM-7 custom

54

~ 5x performance

improvement!

0

100,000

200,000

300,000

400,000

500,000

600,000

10

20

30

40

50

60

70

80

90

10

0

20

0

30

0

40

0

50

0

60

0

70

0

80

0

90

0

10

00

11

00

12

00

13

00

14

00

15

00

16

00

17

00

18

00

19

00

20

00

Job

s P

er

Min

ute

Simulated Users

3.0.101-0.30-bigsmp

3.0.101-0.30-default

3.0.101-0.15-default

3.0.101-0.8-default

AIM-7 custom workload: 16-sockets / 240-cores / 480-threads / 12-TB


3.0.101-0.30-bigsmp: AIM-7 high_systime

55

~ 4x performance

improvement!

0

50,000

100,000

150,000

200,000

250,000

10

20

30

40

50

60

70

80

90

10

0

20

0

30

0

40

0

50

0

60

0

70

0

80

0

90

0

10

00

11

00

12

00

13

00

14

00

15

00

16

00

17

00

18

00

19

00

20

00

Job

s P

er

Min

ute

Simulated Users

3.0.101-0.30-bigsmp

3.0.101-0.30-default

3.0.101-0.15-default

3.0.101-0.8-default

AIM-7 high_systime workload: 16-sockets / 240-cores / 480-threads / 12-TB


3.0.101-0.30-bigsmp: AIM-7 shared

56

~ 3.75x performance

improvement!

0

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

10

20

30

40

50

60

70

80

90

10

0

20

0

30

0

40

0

50

0

60

0

70

0

80

0

90

0

10

00

11

00

12

00

13

00

14

00

15

00

16

00

17

00

18

00

19

00

20

00

Job

s P

er

Min

ute

Simulated Users

3.0.101-0.30-bigsmp

3.0.101-0.30-default

3.0.101-0.15-default

3.0.101-0.8-default

AIM-7 shared workload: 16-sockets / 240-cores / 480-threads / 12-TB



SLES 12


IPC Lock with System V Semaphores

• The global IPC spinlock is held too long in various System V Semaphore paths

• When one semaphore array is doing lots of operations scaling is poor

• Single semaphore arrays now use a finer grained spinlock as an alternative to the coarse grained IPC lock.

• Introduced rcu_read_lock only methods to obtain IPC objects

• Similar changes were made to System V Shared Memory and Message Queues

58

21.86%

84.35%

5.23% 7.39%

0.00%

20.00%

40.00%

60.00%

80.00%

100.00%

400 Users 800 Users

% time spinning on IPC lock: Oracle 11gR2 Data Mining

before changes

after changes


SysV Semaphore Performance Improvements

• Performance measurements were done with an Oracle 11gR2 Data Mining workload

• System used: 8-sockets, 80-cores, 160-threads

59

1,2571,438

1,237

2,8052,665 2,751

0

500

1,000

1,500

2,000

2,500

3,000

100 Users 400 Users 800 Users

before changes

after changes


World Record Performance


World record performance: HP CS900 for SAP HANA

Key take aways

• #1 and #2 overall max-jOPS results

• #1 and #4 overall critical-jOPS results

• #1 16-socket (16s) results on both max-jOPS and critical-jOPS

• #1 8-socket (8s) results on max-jOPS

• 16s max-jOPS results 2.1X greater than Fujitsu 16s results

• 8s max-jOPS results are 2.2X greater than Sugon 8s results

• 8s max-jOPS results 1.1X greater than Fujitsu16s results

• HP CS900 demonstrates excellent scaling from 8s to 16s

23,058

168,127

198,418

126,617

247,581

214,961

308,936

425,348

474,575

888,164

0 200,000 400,000 600,000 800,000 1,000,000

Sugon I980G108-socket, Intel Xeon E7-8890 v2

Fujitsu SPARC M10-4S16-socket, SPARC64 X

Fujitsu SPARC M10-4S16-socket, SPARC64 X+

HP ConvergedSystem 900 for SAPHANA 8s/6TB 8-socketIntel Xeon E7-2890 v2

HP ConvergedSystem 900 for SAPHANA 16s/12TB 16-socket

Intel Xeon E7-2890 v2

HP ConvergedSystem 900 for SAP HANApowered by SLES 11sp3

owns top two SPECjbb2013 max-jOPS records

max-jOPS

SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 07/30/2014.

http://www.spec.org/


HP and SUSE Break 1-Million max-jOPS!!!

129,876

244,162

524,790

1,002,350

0 200,000 400,000 600,000 800,000 1,000,000 1,200,000

HP ConvergedSystem 900 for SAPHANA (8s/6TB) 8-socket


HP ConvergedSystem 900 for SAPHANA (16s/12TB) 16-socket


HP ConvergedSystem 900 for SAP HANA

powered by SLES 11sp3

#1 16-socket SPECjbb2013 max-jOPS record

max-jOPS

SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC). The stated results are published on spec.org as of 11/14/2014.

http://www.spec.org/


World record performance: HP CS900 for SAP HANA

• The HP BL920s Gen8 Server Blade powers the HP ConvergedSystem 900 for SAP HANA system.

• Publicly available SPECjbb2013-MultiJVM benchmark performance briefs:

TBD – provide link for new brief Nov 2014

http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-3288ENW&cc=us&lc=en June 2014

• Official benchmark results for HP ConvergedSystem900 for SAP HANA on spec.org:

TBD – provide link for new result (16s/240c/12TB) Nov 2014

TBD – provide link for new result (8s/120c/6TB) Nov 2014

http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00081.html (16s/240c/12TB) June 2014

http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00080.html (8s/120c/6TB) June 2014

SAP and SAP HANA are trademarks or registered trademarks of SAP AG in Germany and several other countries. SPEC and the benchmark name SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).

http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-3288ENW&cc=us&lc=en

http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00081.html

http://spec.org/jbb2013/results/res2014q2/jbb2013-20140610-00080.html


Wrap-up


Acknowledgements

• HP Linux Kernel Performance team contributing this work:

− Davidlohr Bueso

− Tuan Bui

− Waiman Long

− Jason Low

− Scott Norton

− Thavatchai Makphaibulchoke

− Tom Vaden

− Aswin Chandramouleeswaran

65

• SUSE Linux R&D team contributing to this work:

− Scott Bahling

− Matthias Eckermann

− Mike Galbraith

− Mel Gorman

− Jan Kara

− Jeff Mahoney

− Ahmad Sadeghpour

− Miklos Szeredi


Thank you


Back-up Slides


We developed a prototype MCS/Queued lock to see the effect on cache line traffic

(MCS/Queued locks are NUMA aware and each locker spins on local memory rather than the lock word)

MCS/Queued Lock Prototype (1-2)

typedef struct _local_qnode {

volatile bool waiting;

volatile struct _local_qnode *volatile next;

} local_qnode;

static inline void

mcsfile_lock_acquire(mcsglobal_qlock *global,

local_qnode_ptr me)

{

local_qnode_ptr pred;

me->next = NULL;

pred = xchg(global, me);

if (pred == NULL)

return;

me->waiting = true;

pred->next = me;

while (me->waiting); /*spin on local mem*/

}

static inline void

mcsfile_lock_release(mcsglobal_qlock *global,

local_qnode_ptr me)

{

local_qnode_ptr succ;

if (!(succ = me->next)) {

if ( cmpxchg(global, me, NULL) == me ) return;

do {

succ = me->next;

} while (!succ); /* wait for succ ready */

}

succ->waiting = false;

}

68


Replacing the files_lock spinlock with the prototype mcsfiles_lock MCS/Queued spinlock

extern mcsglobal_qlock mcsfiles_lock;

#define file_list_lock(x) mcsfile_lock_acquire(&mcsfiles_lock, &x);

#define file_list_unlock(x) mcsfile_lock_release(&mcsfiles_lock, &x);

void file_kill(struct file *file)

{

volatile local_qnode lq;

if (!list_empty(&file->f_u.fu_list)) {

file_list_lock(lq);

list_del_init(&file->f_u.fu_list);

file_list_unlock(lq);

}

}

void file_move(struct file *file,

struct list_head *list)

{

volatile local_qnode lq;

if (!list)

return;

file_list_lock(lq);

list_move(&file->f_u.fu_list, list);

file_list_unlock(lq);

}

MCS/Queued Lock Prototype (2-2)

69


AIM7 Benchmark Suite

• Traditional UNIX system-level benchmark (written in C).

• Multiple forks, each of which concurrently executes a common, randomly-orderedset of subtests called jobs.

• Each of the over fifty kind of jobs exercises a particular facet of system functionality• Disk IO operations, process creation, virtual memory operations, pipe I/O, and compute-bound arithmetic loops.

• AIM7 includes disk subtests for sequential reads, sequential writes, random reads, random writes, and randommixed reads and writes.

• An AIM7 run consists of a series of subruns with the number of tasks, N, beingincreased after the end of each subrun.

• Each subrun continues until each task completes the common set of jobs. Theperformance metric, "Jobs completed per minute", is reported for each subrun.

• The result of the entire AIM7 run is a table showing the performance metric versusthe number of tasks, N.

• Reference: “Filesystem Performance and Scalability in Linux 2.4.17”, 2002.70


perf-bench futex (1-2)

• To measure some of the changes done by the futex hastable patchset, a futex set of microbenchmarks are added to perf-bench:− perf bench futex [<operation> <all>]

• Measures latency of different operations:− Futex hash

− Futex wake

− Futex requeue/wait


perf-bench futex (2-2)$ perf bench futex wake

# Running 'futex/wake' benchmark:

Run summary [PID 4028]: blocking on 4 threads (at futex 0x7e20f4), waking up 1 at a time.

[Run 1]: Wokeup 4 of 4 threads in 0.0280 ms



…



Wokeup 4 of 4 threads in 0.0703 ms (+-14.22%)

$ perf bench futex hash

# Running 'futex/hash' benchmark:

Run summary [PID 4069]: 4 threads, each operating on 1024 futexes for 10 secs.

[thread 0] futexes: 0x1982700 ... 0x19836fc [ 3507916 ops/sec ]

[thread 1] futexes: 0x1983920 ... 0x198491c [ 3651174 ops/sec ]

[thread 2] futexes: 0x1984ab0 ... 0x1985aac [ 3557171 ops/sec ]

[thread 3] futexes: 0x1985c40 ... 0x1986c3c [ 3597926 ops/sec ]

Averaged 3578546 operations/sec (+- 0.85%), total secs = 10

Software

SLES Performance Enhancements for Large NUMA Systems