CPU QoS 1.1

CPU QoS (Quality of Service)

Bob Sneed - Sr. Staff Engineer
Sun Microsystems, Inc.
Systems Quality Office

Southern Area CMG Meetings
September 24, 2009 @ Richmond, VA
September 25, 2009 @ Raleigh, NC
Rev 1.1; October 27, 2009

Copyright © 2009 by Sun Microsystems, Inc. All Rights Reserved.


Abstract

This is a discussion of the qualitative aspects of a CPU-second. These low-level metrics are critical to understanding efficiency and capacity, yet mainstream capacity planning (CP) practices give them no consideration!

Examples here are from Sun Solaris, but the concerns here have a much broader scope.


Disclaimers

Opinions and views expressed herein are those of the author, Bob Sneed, and do not represent any official opinion of Sun Microsystems, Inc.

I'm not a doctor - and I don't even play one on TV, but I am a huge fan of Tom Baker and Chris Eccleston.

There is no warranty, expressed or implied, in the quality of the information herein, or its fitness for any given purpose.

If you goof up applying this stuff and have a bad outcome or destroy a bunch of data – it's not my fault or Sun's.

This is version 1.0 material.

Batteries not included.

Your mileage may vary (YMMV).


What About Bob?

• Bob works in the Systems Quality Office (SQO) at Sun Microsystems, Inc.; 13-year veteran at Sun
> Main focus is real-world performance and capacity issues
  > monitor root causes and their cures
  > promote Best Practices
  > work with ISVs on performance- and capacity-related matters
  > assist with performance-related service incidents
  > provide feedback to engineering and marketing
  > travel to teach/share/fix
> SQO colleagues are among Sun's top trouble-shooters
• See also: http://blogs.sun.com/bobs
> (Sorry; it has been on pause for many months!)



Agenda

• Context & Motivations
• Metrics & Measurements
• Tales of CPU QoS
• Conclusions


Context & Motivations


End-User QoS

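At the risk of stating the obvious ...

$$Q_{\mathrm{User}} = \sum N_1 Q_1 + N_2 Q_2 + N_3 Q_3 + N_4 Q_4 + \cdots$$

or

$$Q_{\mathrm{User}} = \sum N_{\mathrm{I/O}} Q_{\mathrm{I/O}} + N_{\mathrm{Net}} Q_{\mathrm{Net}} + N_{\mathrm{Mem}} Q_{\mathrm{Mem}} + N_{\mathrm{CPU}} Q_{\mathrm{CPU}} + \cdots$$

... where N = Quantity, and Q = Quality

NOTE: These equations are notional, not mathematical, so relax ... plus ... they are obvious!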

In Plain English: “Well, it depends!” or “A chain is only as strong as its weakest link.”


CPU QoS and Capacity

• Capacity ∝ Efficiency
> 100% busy is 50% capacity if efficiency could be doubled
> 100% busy is 50% capacity if SLA is exceeded by 2X
> 100% busy is 25% capacity if both of these are true
• Amdahl's Law
> Scaling is limited by the serial portion of the work
  > Shouldn't serial sections get special attention for efficiency?
> The benefit to the user of optimizing a part of the work is limited by the dominance of the part being optimized
  > Yeah, but a 2X reduction looks great on the bottom line!
• Parkinson's Law
> Work expands to fill the time (CPU) available
  > Not very 'green', eh?
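The capacity arithmetic and the two readings of Amdahl's Law on this slide can be sketched in a few lines of Python. This is only an illustration, not a capacity model: the function names are hypothetical, and treating the efficiency and SLA factors as simple independent multipliers is an assumption read off the 50% / 50% / 25% bullets.

```python
def capacity_in_use(busy_fraction, efficiency_gain=1.0, sla_margin=1.0):
    """Fraction of achievable capacity actually consumed.

    busy_fraction   -- measured utilization (1.0 means 100% busy)
    efficiency_gain -- factor by which efficiency could be improved (2.0 = doubled)
    sla_margin      -- factor by which the SLA is currently exceeded (2.0 = 2X)
    Assumption: both factors act as simple, independent multipliers.
    """
    return busy_fraction / (efficiency_gain * sla_margin)


def amdahl_speedup(serial_fraction, n_cpus):
    """Amdahl's Law: best-case speedup when only the parallel part spreads over n_cpus."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)


def overall_speedup(part_fraction, part_speedup):
    """User-visible speedup when one part of the work runs part_speedup times faster."""
    return 1.0 / ((1.0 - part_fraction) + part_fraction / part_speedup)


if __name__ == "__main__":
    print(capacity_in_use(1.0, efficiency_gain=2.0))                  # 0.5
    print(capacity_in_use(1.0, sla_margin=2.0))                       # 0.5
    print(capacity_in_use(1.0, efficiency_gain=2.0, sla_margin=2.0))  # 0.25
    # A 10% serial portion on 64 hardware threads (hypothetical numbers)
    print(round(amdahl_speedup(0.10, 64), 2))
    # A 2X win on a part that is half of the elapsed time
    print(round(overall_speedup(0.50, 2.0), 2))
```

The last call shows why a 2X reduction in one component is worth less to the end user than it looks on a CPU-seconds report: the untouched half of the work still dominates.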


Quality vs. Quality

• Qualities of Time
> Start time; early or late?
> End time; on-time?
> Interruptions versus focus and flow?
> Productive versus wasted or overhead?
> Variance; predictable versus erratic or skewed?
• “Quality Time”
> Doing stuff that matters ...
> ... using time of the right qualities


Quality vs. Quality Analogues

• Qualities of Time ↔ Qualities of Computing
> Start time; early or late? ↔ Dispatch latency
> Interruptions versus focus and flow? ↔ Timeslice expiration or preemption; interrupts
> Productive versus wasted or overhead? ↔ “Business logic” versus “overhead” and “waste”?
> End time; on-time? ↔ SLA; attained or missed?
> Variance; predictable versus erratic or skewed? ↔ Controlled on purpose, or not?
• “Quality Time” ↔ Priorities
> Doing stuff that matters ... ↔ Critical path to SLA
> ... using time of the right qualities ↔ ... controlled to minimize variance


Controlling CPU QoS

• Developer-level
> Algorithmic efficiency
> Data structures and locality factors
> Compile & link-time efficiency factors
> Platform-specific APIs and pre-optimized libraries
> Hints to the OS
• Operations-level
> OS scheduling factors
> ISV-provided tunables
> Competition factors; competing workloads, virtualization
• Architectural-level
> NUMA effects
> Cache size, organization, and usage
> Specific CPU considerations
> Specific system architectural factors

Out of my control!☹

Maybe; how?

Huh?


Metrics & Measurements


A Thread in Heaven: Ideal CPU QoS

• The scheduler is not interrupting me often
> I have a nice big quantum, and my priority is high
> My CPU pipeline is not being flushed by any sort of context switches
> The scheduler is not migrating me to a cold cache
• My compute is highly register-to-register
• No hardware interrupts are interrupting me
• I'm not doing things that cause global cache coherency events
• My memory references are hitting nicely in L1 or L2 caches
> My performance-tuned data structures are paying off!
> No other threads sharing my caches are spoiling them for me
> Data and instruction pre-fetching is working well for me
> My partner threads are leaving our shared data in my cache
• My L2 misses have tight locality in large pages, so I'm not waiting much on TLB remaps
• My L2 misses are in low-latency local memory in this NUMA architecture
• My branch predictions are working out really well, statistically speaking
• My programmer inlined a few frequently-used small functions
> That's saving me some function calls and keeping my I$ locality tight
• That new compiler I came through taught me a lot of new tricks!
> I'm keeping multiple instruction units busy at the same time
> I'm getting multiple instruction completions per cycle on this superscalar CPU
• It must be my birthday; I want a pony!


Scheduling: Some Useful Metrics

• With Solaris microstate accounting, prstat -mL shows per-thread, among other things ...
> Scheduling Latency (LAT) – wait time for compute-ready thread to execute
> Involuntary Context Switch (ICX) – rate of thread interruption by threads of higher priority or for exceeding their scheduling quantum

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 19798 oracle    49  46 3.5 0.0 0.0 0.0 0.0 0.9 496  1K .4M   0 oracle/1
 19800 oracle    31  63 6.1 0.0 0.0 0.0 0.0 0.4   0  1K .8M   0 oracle/1
 19788 oracle    35  30 4.1 0.0 0.0 0.0  19  12  4K  2K .3M   0 oracle/1
 19790 oracle    36  26 1.9 0.0 0.0 0.0  27 8.6  4K  2K .3M   0 oracle/1
 19796 oracle    35  28 5.3 0.0 0.0 0.0  27 4.9 818  1K .2M   0 oracle/1
  4172 oracle   3.8  41  20 0.0 0.0 0.0  27 8.4  5K  2K 33K   0 tnslsnr/1
  1779 root     0.1 1.2 0.1 0.0 0.0 0.0  98 0.4 169  27  1K   0 init.cssd/1
     1 root     0.3 0.8 0.0 0.0 0.0 0.0  99 0.1 549  13 19K 573 init/1
  2893 oracle   0.8 0.2 0.0 0.0 0.0 0.0  98 1.3  1K  37 14K  1K oracle/1
...
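Output like this is easier to act on once it is parsed. The following Python is a rough sketch of flagging threads whose LAT exceeds some threshold in captured prstat -mL text. The function names and the 5% threshold are hypothetical, and reading the K and M suffixes as decimal thousands and millions is an assumption about how prstat abbreviates counts.

```python
# Column order as shown in the prstat -mL listing above
FIELDS = ["PID", "USERNAME", "USR", "SYS", "TRP", "TFL", "DFL", "LCK",
          "SLP", "LAT", "VCX", "ICX", "SCL", "SIG", "PROCESS/LWPID"]


def parse_count(token):
    """Convert abbreviated counts such as '496', '1K' or '.4M' to a number.
    Assumption: K and M are decimal multipliers."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = token[-1].upper()
    if suffix in multipliers:
        return float(token[:-1]) * multipliers[suffix]
    return float(token)


def parse_row(line):
    """Return a dict for one data row, or None for headers and other lines."""
    tokens = line.split()
    if len(tokens) != len(FIELDS) or not tokens[0].isdigit():
        return None
    row = dict(zip(FIELDS, tokens))
    for name in ("USR", "SYS", "TRP", "TFL", "DFL", "LCK", "SLP", "LAT"):
        row[name] = float(row[name])        # microstate percentages
    for name in ("VCX", "ICX", "SCL", "SIG"):
        row[name] = parse_count(row[name])  # event counts
    return row


def high_latency_threads(lines, lat_threshold=5.0):
    """Rows whose scheduling latency (LAT, in percent) meets the threshold."""
    rows = (parse_row(line) for line in lines)
    return [row for row in rows if row and row["LAT"] >= lat_threshold]


if __name__ == "__main__":
    sample = [
        "PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID",
        "19798 oracle 49 46 3.5 0.0 0.0 0.0 0.0 0.9 496 1K .4M 0 oracle/1",
        "19788 oracle 35 30 4.1 0.0 0.0 0.0 19 12 4K 2K .3M 0 oracle/1",
    ]
    for row in high_latency_threads(sample):
        print(row["PID"], row["PROCESS/LWPID"], row["LAT"], row["ICX"])
```

Trending LAT and ICX per thread over time, rather than looking at a single sample, is what turns this into a CPU QoS measurement.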


Scheduling: Solaris Preemption Control

• Preemption Control API
> See: schedctl_init(3C) et al - allows a process to tell the OS it's in a critical section and should get extra time if its quantum expires
> If thread does not yield CPU after getting a reprieve, future requests are ignored
> “Baked in” to some products, including Oracle
> No user action required; it's a programmed-in thing
• Challenge questions:
> What if someone's capacity model is based on data from a system where preemption control has been disabled for some key process?
> How could one measure if that had happened?


Scheduling: What Matters Most?

$ ps -e -o class,pri | sort | uniq -c | sort -nr +2
   1   RT 157
   1   RT 140
   1   RT 100
   1  SYS  98
   1  SYS  96
   3   TS  60
   2   FX  60
   1  SYS  60
8238   TS  59
   1   TS  58
   3   TS  54
  11   TS  53
   2   TS  52
   1   TS  51
   6   TS  50
  14   TS  49
   1   TS  36
   1   TS  34
   1   TS  29
   1   TS  22
   1   TS  12
   3   TS   0
$ ps -e -o pid,ppid,class,pri,args | grep lgw
10494     1   TS  34 ora_lgwr_XYZP

SOLUTION: Force LGWR into FX 60 as a Best Practice!

NOTE: Snapshot 'ps -e -o pid,ppid,class,pri,args' to a file for analysis; these details change rapidly!

Oracle's log writer? Hey! Wait a minute! That's really important! Why didn't anyone tell the OS? Help!

Important!

CPU hogs; demoted by the TS scheduler

Primary modality; OLTP processes
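Once a 'ps -e -o pid,ppid,class,pri,args' snapshot is saved to a file, checking it for critical processes left in the wrong scheduling class is easy to script. The Python below is a sketch of that check only. The function names are hypothetical, and the FX 60 target is simply the best practice stated on this slide for LGWR, not a general rule.

```python
import re


def parse_ps_snapshot(text):
    """Parse 'pid ppid class pri args' lines into dicts; the header is skipped."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 4)   # args may contain spaces
        if len(parts) < 5 or not parts[0].isdigit():
            continue
        pid, ppid, sched_class, pri, args = parts
        procs.append({"pid": int(pid), "ppid": int(ppid),
                      "class": sched_class, "pri": int(pri),
                      "args": args.strip()})
    return procs


def misprioritized(procs, pattern, wanted_class="FX", wanted_pri=60):
    """Processes matching pattern that are not at least wanted_class/wanted_pri."""
    rx = re.compile(pattern)
    return [p for p in procs
            if rx.search(p["args"])
            and (p["class"] != wanted_class or p["pri"] < wanted_pri)]


if __name__ == "__main__":
    # The first data line is from the slide; the second is a made-up example.
    snapshot = """\
  PID  PPID  CLS PRI COMMAND
10494     1   TS  34 ora_lgwr_XYZP
20001     1   FX  60 ora_lgwr_HYPOTHETICAL
"""
    for proc in misprioritized(parse_ps_snapshot(snapshot), r"ora_lgwr_"):
        print(proc["pid"], proc["class"], proc["pri"], proc["args"])
```

Run against successive snapshots, a check like this also gives a record of how often the time-sharing scheduler has demoted a process that the business considers critical.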


Memory QoS

• CPU QoS is tightly coupled with memory QoS
> “If Mama ain't happy, ain't nobody happy” - Dr. Phil
• Memory QoS metrics are not widely-known
> Popular naiveté underlies the belief that memory is “flat”


Memory QoS: Metrics

• Hardware-level memory quality factors
> Latency, varies with technologies and backplane
  > Base technology (e.g.: DDRN vs. FB-DIMM)
> Locality, NUMA proximity
> Coherency events
> Memory Interleave
• OS-level factors
> TLB remap rate
> segmap remap rate (newer mechanisms now in use)
> Faults: major, minor
> ISM vs. DISM
> Hardware page size
• Swap page-in latency
> ... when this matters, you're in trouble!


Memory QoS: Observability

• Lots of instrumentation in the Solaris OS ...
> Some well-documented and end-user accessible
  > trapstat – TLB maintenance overhead
  > pmap -x – page sizes and other qualities
  > ppgsz – platform-specific page sizes
  > ipcs - IPC allocations & characteristics
  > ps – footprint; RSS ... wait ... that's quantity, not quality
> Some very propeller-headed
  > DTrace – able to probe kernel and userland
  > kstat – many memory-related counters
  > cpustat – performance of caches
  > busstat – memory controllers and cache coherency events


Memory QoS: High-Order Bits

• TLB-management generally shows up as %usr
> Mapping a page involves a table walk lookup, a TLB 'shootdown', and instantiation of new mapping in the MMU
> High %usr is not necessarily a good thing!
• Rate of TLB-misses varies with HW page sizes used
> Mapping larger pages less-often is less overhead than mapping small pages more-often
• TLB observability varies
> On most SPARC processors, use 'trapstat -T' in Solaris
> Many modern processors feature “hardware table walk”, making remaps far more efficient, but less observable
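The page-size point is mostly arithmetic: the memory a TLB can cover without a remap is its entry count times the page size. The Python below is a back-of-envelope sketch of that. The 512-entry TLB and the 1 GiB working set are hypothetical illustration values, and the helper names are made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach_bytes(tlb_entries, page_size_bytes):
    """Memory addressable without a TLB miss when every entry maps one page."""
    return tlb_entries * page_size_bytes


def pages_needed(working_set_bytes, page_size_bytes):
    """Number of pages (and so mappings) needed to cover a working set."""
    return -(-working_set_bytes // page_size_bytes)   # ceiling division


if __name__ == "__main__":
    entries = 512          # hypothetical TLB size
    working_set = 1 * GIB  # hypothetical working set
    for page_size in (8 * KIB, 4 * MIB):
        reach = tlb_reach_bytes(entries, page_size)
        pages = pages_needed(working_set, page_size)
        print(f"page={page_size // KIB} KiB reach={reach // MIB} MiB pages={pages}")
```

With these assumed numbers the small-page working set needs far more mappings than the TLB can hold, which is the situation where TLB maintenance starts to show up in %usr.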


OS-Level Counters: ISA Emulation

• Instruction Emulation: not all instructions in an Instruction Set Architecture (ISA or 'CPU family') are implemented on all CPU models
> Chip designers save transistors by not implementing certain complex, rarely-used, or deprecated opcodes
> Unimplemented opcodes trap into emulation code, costing many more cycles than native instructions
> Emulator traps are counted in Solaris using the kstat facility
> Common candidates for emulation include ...
  > Visual Instruction Set (VIS) opcodes
  > Certain Floating-Point (FP) opcodes
> Suggested reading ...
  > http://www.sun.com/blueprints/1205/819-5144.pdf
  > http://blogs.sun.com/travi/entry/corestat_for_ultrasparc_t2
  > http://docs.sun.com/source/817-6702/ncg_sparc.html (historical)



OS-Level Counters: ISA Augmentation

• Special-purpose extensions to an ISA may be designed into some CPUs for acceleration of certain tasks, like encryption
> Leveraging such extensions typically requires a vendor-supplied, platform-optimized library – and some application-specific configuration
> OS-level counters and chip-level counters may both be available to assess the utilization of such extensions
> As with multiple CPU functional units, such features blur the concept of CPU utilization as a simple percentage
> Example: Cryptographic acceleration extensions on Sun's CoolThreads™ series CMT CPUs
  > See: “USING THE CRYPTOGRAPHIC ACCELERATORS IN THE ULTRASPARC® T1 AND T2 PROCESSORS” http://www.sun.com/blueprints/0306/819-5782.pdf


Chip-Level Counters: Typical

• Specifics vary enormously by chip ISA and model
> Clock rate (may vary with power management)
> Cycles (1/clock_rate)
> Instructions
  > DERIVED: Cycles-per-Instruction (CPI)
  > DERIVED: Millions of Instructions per Second (MIPS)
> Branch mispredictions
> L1 cache misses
> L2 cache misses
  > NOTE: These often explain or correlate with CPI and MIPS
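The two derived metrics follow directly from the raw counts. As a small sketch in Python, with hypothetical function names and made-up sample counts rather than measurements from any real chip:

```python
def cpi(cycles, instructions):
    """Cycles-per-Instruction over a sampling interval."""
    if instructions <= 0:
        raise ValueError("no instructions retired in this interval")
    return cycles / instructions


def mips(instructions, elapsed_seconds):
    """Millions of Instructions per Second over a sampling interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return instructions / elapsed_seconds / 1e6


if __name__ == "__main__":
    # Hypothetical one-second sample: 3.0e9 cycles, 1.5e9 instructions
    cycles, instructions, seconds = 3_000_000_000, 1_500_000_000, 1.0
    print(cpi(cycles, instructions))    # 2.0
    print(mips(instructions, seconds))  # 1500.0
```

Both inputs must come from the same interval and the same hardware thread, or the ratios are meaningless.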


CPI: Bigger isn't Better

• High cycles-per-instruction (CPI) implies memory waits and/or long-running/complex instructions
• Low CPI isn't necessarily better ...

    while (!white_of_their_eyes) ; // Hard poll
    fire_our_guns();

• Context is everything!
• Profiling tells you where the time is going; low-level metrics can give important clues about why


Chip-Level Counters: Constraints

• In general, not accessible from virtual environments
> Access usually requires privilege in the primary host context or domain (dom0, control domain, global zone, etc.)
• In general, limited counters available per-sample
> On-chip counters tend to be plentiful, but need to be mapped for retrieval through limited windows
• They are counters, not rates; post-processing required
> perl and awk scripts and spreadsheets are popular
• Counter names can vary on the chip designer's whims
> They are neither stable nor standard, even between chips in the same family
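The "counters, not rates" post-processing amounts to differencing successive samples and dividing by the elapsed time. The Python below is a rough sketch of that step. The function name is hypothetical, and the wrap handling assumes a fixed-width counter that wraps at most once between samples, which has to be checked against the actual chip documentation.

```python
def counter_rates(samples, counter_bits=64):
    """Turn cumulative counter samples into per-second rates.

    samples      -- list of (timestamp_seconds, counter_value), in time order
    counter_bits -- assumed counter width; tolerates one wrap per interval
    """
    modulus = 1 << counter_bits
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        delta = (c1 - c0) % modulus   # a negative difference means the counter wrapped
        rates.append(delta / (t1 - t0))
    return rates


if __name__ == "__main__":
    # Hypothetical samples of a cumulative event counter
    samples = [(0.0, 100), (1.0, 1100), (3.0, 5100)]
    print(counter_rates(samples))   # [1000.0, 2000.0]
```

Uneven sampling intervals are handled by using the actual timestamps rather than a nominal interval.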


Chip-Level Counters: Access in Solaris

• cpustat – basic counter access
• cputrack – counters as they change for a process
• busstat – counters on non-CPU components
• kstat – scalable mechanism for OS-level counters
• DTrace with libcpc extensions – potential to correlate low-level measurements with anything else
> per-thread per-vcpu
> per schedule interval
> per transaction (with some work)
> per transaction class (with a lot of work)


Chip-Level Counters: Bedtime Reading

• 'cpustat -h' lists counters available on current CPU, but counter details vary from chip-to-chip
• SPARC CPUs
> [PDF] UltraSPARC IV+ Processor User's Manual Supplement: http://www.sun.com/processors/manuals/USIVplus_v1.0.pdf
> [PDF] SPARC64™ VI Extensions: http://www.fujitsu.com/downloads/SPARCE/others/sparc64vi-extensions.pdf
> [PDF] SPARC64™ VII Extensions: http://www.fujitsu.com/downloads/SPARCE/others/SPARC64VIIext-R10.pdf
> [HTML] Using busstat to Monitor Performance Counters for UltraSPARC T2 Plus External Coherency Hub Architecture: http://www.sun.com/bigadmin/features/techtips/busstat_perf.jsp
• x64 CPUs
> Intel and AMD implementations are different
> See chip-specific manufacturer documents


Chip-Level Counters: Rollup Tools

• har – Hardware Activity Reporter - reports MIPS and more on selected CPUs – SPARC and x64
> http://blogs.sun.com/openomics/entry/cpu_hardware_counter_stats
• corestat – for SPARC CoolThreads CPUs, shows actual core utilizations
> http://cooltools.sunsource.net/corestat
• EMON – for Intel x64, Intel proprietary toolkit for observing low-level counters
> http://software.intel.com/en-us/articles/code-downloads/



Tales of CPU QoS


Common Themes

• Capacity thinking is done in “MIPS”, but the real problem is often elsewhere
> latency
> scheduling
> algorithm scalability
• Based on simple capacity models, upgrades often disappoint
> “More iron” was not the best strategy
> “Work smarder, not harder” - always worth investigating


Case Study: A Famous Compute-Hog

• SPARC-specific Oracle Bug #6814520
> DSS workload was exhibiting disappointing bandwidth
> Profiling revealed huge elapsed time in checksum routine
> Low-level counters revealed severe memory waits
• Diagnosis: Old “hand-rolled” assembler code for checksum validation had no data prefetch hints, resulting in really-high memory-wait for data!
• Remedy & Payoff: Added prefetch hint for 4X speedup on some CPUs. Used standard optimized compile of generic C code to get 16X speedup on other CPUs.
• Challenge: Might this have first been seen by low-level metrics?
> Bonus Q: How long did this issue go unnoticed, and why?


Case Study: Partner Pairs

• Problem: Disappointing throughput on messaging app with many producer/consumer pairs on large SMP
• Diagnosis: Messages written by producers suffered high latency being read by consumers which were migrating freely around the system
> Cache-to-cache copies were inefficient on host architecture
> Co-locating producers and consumers caused 4X gain
• Remedy & payoff: Custom daemon written to dynamically co-locate producer/consumer pairs; result is 3X improved aggregate throughput.
• Challenge: What would capacity planners have done without this analysis?


Case Study: Foxes and Hens

• Problem: Disappointing throughput on messaging app with many producer/consumer pairs on small SMP
• Diagnosis: Consumers and producers were invalidating each other's cache contents, resulting in high rates of cache misses
• Remedy & Payoff: Segregated producers and consumers into distinct processor sets causing 4X gain by keeping respective caches warmed to each task
• Challenge: How would one diagnose this without low-level metrics?


Checkpoint ...

• Regarding the last two examples ...
> Same approximate problem description ... but completely opposite problem resolution (segregation versus integration)
> Shared concepts: proximity or juxtaposition of ...
  > processes to CPU resources
  > processes to memory resources
  > processes to other processes
> Real estate: “Location, location, location.”
> Comedy: “Timing.”


Juxtaposition Games You Can Play

• How well does it run when bound to a dedicated CPU?
> In Solaris, a processor set (psrset) can be used
  > A psrset can be set 'nointr' for immunity from hardware interrupts
  > A psrset can be made to contain all HW threads associated with a pipeline, core, socket, or sysboard
  > Memory will tend to be local with Memory Placement Optimization (MPO) policy of 'set lgrp_mem_pset_aware=1' in /etc/system
• If the answer is “a lot better”, hypotheses include ...
> Excessive migrations were spoiling its caches
> Remote memory latency was a problem
> Interrupt handlers were preempting it; possibly even “pinning” it (i.e.: preventing it from migrating to an idle thread)
• Challenge questions:
> How useful is “utilization” versus “CPU QoS” in such cases?
> How might future tools automate such sensitivity analyses?


Case Study: How Much Concurrency?

• Problem: Oracle Parallel Query produced disappointing results on 64-way CMT chip
• Diagnosis: Excessive Degree-of-Parallelism (DOP) was being used, causing CPU to go increasingly to overhead categories (context switches, migrations, excessive spins on mutexes, etc.)
> 'corestat' utility was used to observe low-level utilization while varying DOP
• Remedy & Payoff: Stop increasing DOP when CPUs' theoretical MIPS limits are reached and deploy other
• Challenge: ISV defaults did not anticipate the thread density of modern CMT systems


Case Study: How Much Concurrency?

• Problem: Oracle Parallel Query produced non-linear scaling results on large VMT system with DSS workload
• Diagnosis: Classic low-CPI DSS workload saturated a CPU core running only its 'primary' hardware thread
> 'cpustat' showed many thread-switching operations and L2 cache saturation
• Remedy & Payoff: Turning off secondary thread on each CPU eliminated negative scaling
• Challenge: The architectural differences between different multi-core, multi-thread CPUs can be highly significant!


Checkpoint ...

• Regarding the last two examples ...
> Same approximate problem description ... but resolution was highly architecture-dependent
> Shared concept: workload-specific architectural impact
> “If it hurts when you do that; don't do that!”
> “Work smarder, not harder!”
> “Size matters.”


Case Study: The Sneaky Leak

• Problem: Batch application takes far too long; only 1/4 of month's data can be processed per month!
• Diagnoses: Numerous
> Sub-batch hash terrible; longest-running sub-batch takes weeks for only a portion of the month's data!
> Memory leak! Longer-running jobs develop ever-worse memory locality, shown by a trend in cache misses.
• Remedies & Payoffs: Implement better hash; run many more smaller jobs. Shorter-running jobs suffer less from memory leak and keep all SMP threads busy. One month's data now processes in two days!


“... ay, there's the rub ...”

• Developers have many options ...
> Application architecture
> Algorithm selection
> Data structure design
> Hinting the execution environment
> Compile-time optimizations
  > Minimal target chip architecture
  > Selective function inlining
  > Conditional compilation (e.g.: #ifdef)
  > Compiler version: newer is better!
> Link-time optimizations
  > Feedback-optimized linking
  > Platform-optimized libraries
• ... but Capacity Planners are all-too-often far removed from the developers, ISVs, or application teams


Controlling CPU QoS

• Developer-level
> Algorithmic efficiency
> Data structures and locality factors
> Compile & link-time efficiency factors
> Platform-specific APIs and pre-optimized libraries
> Hints to the OS
• Operations-level
> OS scheduling factors
> ISV-provided tunables
> Competition factors; competing workloads, virtualization
• Architectural-level
> NUMA effects
> Cache size, organization, and usage
> Specific CPU considerations
> Specific system architectural factors

Out of my control!☹

Maybe; how?

Huh?


Closing Remarks & Call to Action


So, Here's the Thing ...

• There must be many undiagnosed cases in the wild of the issues illustrated by the cases cited here!
> Where so, by implication, might the prevailing capacity plan be 2X – 8X inflated due to unexploited latent capacity?
• Bob says ...
> “How can anyone confine themselves to the realm of 'utilization' and 'cpu-seconds' and truly believe they are properly – or optimally – managing the high-order aspects of capacity?”
> “Low-level metrics and cycle accounting are the frontier of performance analysis and QoS management.”


Call to Action!

• Find tools for observing low-level metrics on your critical production platforms
• Poke around on those systems to see what's 'normal' at the system and workload level
• See what correlations you can find between low-level metrics and end-user QoS
• See if you can diagnose some mechanisms linking your low-level and high-level QoS
• Let us know what you discover! Bring it to CMG!
• EXTRA CREDIT: What can tool vendors and modelers do with these metrics?


Q&A?
