VMworld 2015: Extreme Performance Series - vSphere Compute & Memory


Extreme Performance Series: vSphere Compute & Memory

Fei Guo, VMware, Inc.
Seong Beom Kim, VMware, Inc.

INF5701

#INF5701


Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.

vSphere CPU Management

Outline
• What to Expect on VM Performance
• Ready Time (%RDY)
• VM Sizing: How Many vCPUs?
• NUMA & vNUMA

Set the Right Expectation on VM Performance


What Happens When Idle / Active?

(Diagram: the VM runs on the VMM via VT / AMD-V, on top of the VMkernel (VMK).)
• Privileged instructions and TLB misses exit from the VM into the VMM.
• On HLT, the VMkernel de-schedules the VCPU and sets its state to IDLE.
• On I/O, the request is issued to I/O threads.
• When the VCPU becomes active again, its state moves to RDY, it waits in the RDY queue, and it is then scheduled.


When Your App Is Slow in a VM
• High virtualization overhead
  – A lot of privileged instructions / operations (CPUID, mov CR3, etc.)
  – A lot of TLB misses (addressing huge memory); large pages help a lot
• Resource contention
  – High ready (%RDY) time?
  – Host memory swap? (i.e. memory over-commit)


Reasonable Expectation on VM Performance
• Best cases
  – Computation heavy, small memory footprint
  – No CPU / memory over-commit
  – ~100% of the bare metal performance
• Common cases
  – Moderate mix of compute / memory / IO
  – Little to no CPU / memory over-commit
  – ~90% of the bare metal performance
• Worst cases
  – Huge number of TLB misses / privileged instructions
  – Heavy ESXi host memory swap

%RDY Can Happen Without CPU Contention

CPU Scheduler Accounting


(Timeline diagram, t1 … t8: one VCPU's time is broken into segments A through E.)
• A: CPU scheduling cost → counted in %RDY
• B: Time in the ready queue → counted in %RDY
• C: Actual execution → counted in %RUN
• D: Interrupted → counted in %OVRLP, and added to %SYS if the interruption is for this VM
• E: Efficiency loss from power mgmt, hyper-threading, etc.

%USED = %RUN + %SYS - %OVRLP - E
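To make the accounting concrete, here is a minimal sketch in plain Python that applies the %USED relationship to esxtop-style counters; the function name and the sample values are illustrative, not real esxtop output.

    # Minimal sketch: the CPU accounting relationship from the slide.
    # All inputs are percentages of one sampling interval (hypothetical values).
    def pct_used(pct_run, pct_sys, pct_ovrlp, pct_e):
        """%USED = %RUN + %SYS - %OVRLP - E, where E is the efficiency loss
        from power management, hyper-threading, etc."""
        return pct_run + pct_sys - pct_ovrlp - pct_e

    # A VCPU that executed 70% of the interval, had 5% of system work done on its
    # behalf, was interrupted for 3%, and lost 2% to frequency scaling:
    print(pct_used(pct_run=70.0, pct_sys=5.0, pct_ovrlp=3.0, pct_e=2.0))  # 70.0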

Meaning of High %RDY


Two timeline patterns can produce high %RDY (A: scheduling cost, B: time in ready queue, C: actual execution):
• Long B segments (A B C): high %RDY from time in the ready queue
  – CPU contention
  – Limit, low shares
  – Poor CPU affinity
  – Poor load balancing
• Many short A-C pairs with no B (A C A C A C …): high %RDY dominated by scheduling cost
  – e.g. frequent sleep/wakeup

Troubleshooting High %RDY
• High queue time
  – Check for DRS load-balancing issues
  – Check the CPU resource specification (limit, low shares)
    • %MLMTD: percent of time spent in the RDY state due to a CPU limit
  – Avoid using CPU affinity
• Dominant CPU scheduling cost
  – Change application behavior (avoid frequent sleep / wakeup)
  – Delay or do not yield the PCPU (see the sketch below)
    • monitor.idleLoopSpinUS > 0: burns more CPU power; OK for consolidation
    • LatencySensitivity = HIGH: power efficient; bad for consolidation
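If you do want to experiment with those last two knobs, a pyVmomi sketch along these lines could apply them to a VM. This is only a sketch: the helper names and the 50 µs spin value are my own, the vm object is assumed to be a vim.VirtualMachine you have already looked up, and some changes may only take effect after a power cycle.

    from pyVmomi import vim

    def delay_pcpu_yield(vm, spin_us=50):
        """Sketch: let the VCPU's idle loop spin before yielding the PCPU
        (burns more CPU power, OK for consolidation)."""
        spec = vim.vm.ConfigSpec(extraConfig=[
            vim.option.OptionValue(key='monitor.idleLoopSpinUS', value=str(spin_us))])
        return vm.ReconfigVM_Task(spec=spec)

    def set_high_latency_sensitivity(vm):
        """Sketch: LatencySensitivity = HIGH (power efficient, bad for consolidation)."""
        spec = vim.vm.ConfigSpec(
            latencySensitivity=vim.LatencySensitivity(level='high'))
        return vm.ReconfigVM_Task(spec=spec)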


Same %RDY, Different Performance Impact


%RDY Impact on Throughput

(Chart: normalized throughput (bops) vs. %RDY, 0-20%.)

• Throughput workload

• Java server

• CPU & memory

• %RDY ~ throughput drop


%RDY Impact on Latency
• Latency workload
• In-memory key-value store
• CPU & memory
• %RDY can have a significant impact on tail latency
• Same %RDY but different impact

(Chart: 99.99th-percentile latency (msec) vs. %RDY for a "spiky" vs. a "flat" %RDY pattern.)


When %RDY Is Acceptable
• VMs are consolidated into one NUMA node
  – When VMs share data (communication, same IO context, etc.)
  – %RDY may increase
  – Better than running slowly without %RDY on separate NUMA nodes
• vSphere 6.0 is less aggressive about this consolidation
  – Leaves 10% CPU headroom
  – Lower /Numa/CoreCapRatioPct to increase the headroom

Oversizing VM is Wasteful and Even Harmful

Unused VCPU Wastes CPU


(Chart: %USED of an idle VCPU by guest timer configuration: RHEL5 100Hz (*), RHEL5 1kHz, RHEL6 tickless (*), Win2k8 64Hz (*), Win2k8 1kHz.)

• Idle VCPU does consume CPU

• Can be significant with 1kHz timer (RHEL5 1kHz)

• Mostly trivial

Over-sizing VM Can Hurt Performance


• Single-threaded app

• Does not benefit from more VCPUs

• Hurt by in-guest migrations

(Chart: normalized throughput vs. VM size, 1-64 vCPUs.)

ESXi is Optimized for NUMA


What is NUMA?
• Non-Uniform Memory Access system architecture
  – Each node consists of CPU cores and memory
• A VM can access memory on remote NUMA nodes, but at a performance cost
  – Access time can be 30% ~ 200% longer

(Diagram: NUMA node 1 and NUMA node 2, each with cores C0-C5 and local memory.)

ESXi Schedules VM for Optimal NUMA Performance

(Diagram: a small VM is placed entirely within one NUMA node. Good memory locality.)

Wide VM Needs vNUMA

(Diagram: a wide VM spans both NUMA nodes; without vNUMA its memory accesses cross nodes. Poor memory locality without vNUMA.)

vNUMA Achieves Optimal NUMA Performance

(Diagram: the same wide VM with vNUMA exposed to the guest; memory is placed locally within each virtual node. Good memory locality with vNUMA.)

Stick to vNUMA Default

Do Not Change coresPerSocket Without Good Reason
• Changing the default means you set the vNUMA size
• If licensing requires fewer vSockets
  – Find the optimal vNUMA size
  – Match coresPerSocket to the vNUMA size
  – e.g. a 20-vCPU VM on a 10 cores/node system:
    • default vNUMA size = 10 vCPUs per vNode
    • set coresPerSocket = 10 (see the sketch below)
• Enabling "CPU Hot Add" disables vNUMA
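As an illustration of the licensing example above (a 20-vCPU VM on a host with 10 cores per NUMA node), the pyVmomi sketch below matches coresPerSocket to the default vNUMA size and leaves CPU Hot Add off; the helper name and values are assumptions for this scenario, not a general recommendation.

    from pyVmomi import vim

    def match_cores_per_socket(vm, num_vcpus=20, cores_per_socket=10):
        """Sketch: make coresPerSocket equal to the vNUMA size so fewer vSockets
        are exposed without changing the vNUMA layout."""
        spec = vim.vm.ConfigSpec(
            numCPUs=num_vcpus,
            numCoresPerSocket=cores_per_socket,  # 20-vCPU VM, 10 cores/node host
            cpuHotAddEnabled=False)              # CPU Hot Add would disable vNUMA
        return vm.ReconfigVM_Task(spec=spec)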


Key Takeaways


Summary
• Set the right expectation on VM performance
• %RDY can happen without CPU contention
  – Watch out for frequent sleep / wakeup
• Same %RDY, different performance impact
  – More significant impact on tail latency
• Oversizing a VM wastes CPU and may hurt performance
• ESXi is optimized for NUMA
• Stick to the vNUMA default
• Check out the CPU scheduler white paper
  – https://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf

vSphere Memory Management

Outline
• ESXi Memory Management Basics
• VM Sizing
• Reservation vs. Preallocation
• Page Sharing vs. Large Pages
• Memory Overhead
• Memory Overcommitment Guidance

Memory Terminology

• Memory Size: total amount of memory (allocated memory + free memory)
• Active Memory: allocated memory recently accessed or used
• Idle Memory: allocated memory not recently accessed


Task of the Memory Scheduler
• Compute a memory entitlement for each VM
  – Based on reservation, limit, shares, and memory demand
  – Memory demand is determined by active memory (sampling-based estimation)
• Reclaim guest memory if entitlement < consumed (a simplified sketch follows)
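The reclaim decision can be pictured with the toy sketch below. It only encodes what the slide states (entitlement driven by active-memory demand, bounded by reservation and limit; reclamation when consumption exceeds entitlement), it ignores shares, and it is not ESXi's actual entitlement algorithm.

    def reclaim_target_mb(reservation, limit, active, consumed):
        """Toy illustration only, not ESXi's real algorithm: entitlement follows
        active-memory demand, bounded below by the reservation and above by the
        limit; memory is reclaimed when consumption exceeds the entitlement."""
        entitlement = min(max(active, reservation), limit)
        return max(consumed - entitlement, 0)

    # A 4096 MB VM with no reservation, 1024 MB active but 3072 MB consumed:
    print(reclaim_target_mb(reservation=0, limit=4096, active=1024, consumed=3072))  # 2048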


Performance goal: handle burst memory pressure well


Memory Reclamation Basics (vSphere 5.5 and earlier)

(Diagram: host memory from 0 to max, split into consumed and free; the free-memory state is HIGH or LOW relative to the minFree threshold.)

State | Page Sharing | Ballooning | Compression | Swapping
High  | X            |            |             |
Low   | X            | X          | X           | X (expensive)

Refer to http://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf for details.

Reservation vs. Preallocation


Different in Many Aspects
• Reservation
  – Used in admission control and entitlement calculation
  – Setting it does NOT mean memory is fully allocated
  – General protection against memory reclamation
• Preallocation
  – Memory is fully reserved AND fully allocated
  – Advanced configuration option: sched.mem.prealloc = TRUE (see the sketch below)
  – Mostly used for latency-sensitive workloads
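The difference also shows up in how you would script the two settings. Below is a pyVmomi sketch, with helper names of my own, contrasting a plain reservation with full preallocation via sched.mem.prealloc on top of a full reservation.

    from pyVmomi import vim

    def reserve_memory(vm, reservation_mb):
        """Sketch: plain reservation -- protects against reclamation,
        but the memory is not allocated up front."""
        spec = vim.vm.ConfigSpec(
            memoryAllocation=vim.ResourceAllocationInfo(reservation=reservation_mb))
        return vm.ReconfigVM_Task(spec=spec)

    def preallocate_memory(vm, memory_mb):
        """Sketch: fully reserve AND fully allocate (latency-sensitive workloads)."""
        spec = vim.vm.ConfigSpec(
            memoryMB=memory_mb,
            memoryAllocation=vim.ResourceAllocationInfo(reservation=memory_mb),
            extraConfig=[vim.option.OptionValue(key='sched.mem.prealloc', value='TRUE')])
        return vm.ReconfigVM_Task(spec=spec)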

VM Sizing


Guard Against "Active Memory" Reclamation
• VM memory size > the peak demand
• If necessary, set the reservation above the guest demand

Page Sharing & Large Page

Memory Saving from Page Sharing
• Significant for homogeneous VMs

Workload   | Guest OS   | # of VMs | Total guest memory
Swingbench | RedHat 5.6 | 12       | 48 GB
VDI        | Windows 7  | 15       | 60 GB

(Pie charts: Swingbench — 43% shared / 57% non-shared, 34% saved by sharing; VDI — 75% shared / 25% non-shared, 73% saved by sharing.)


What "Prevents" Sharing
• Guest features
  – ASLR (Address Space Layout Randomization)
    • Less than 50 MB of sharing reduction
  – Superfetch (proactive caching)
    • Largely reduces sharing
    • The increase in I/Os hurts VM performance
• Host features
  – Host large pages
    • ESXi does not share large pages
    • The page-sharing scanning thread still works (generates page signatures)

Why Large Pages?
• Fewer TLB misses
• Faster page-table lookups
• Enabled by default

Guest Large Pages | Host Large Pages | SPECjbb  | Swingbench
√                 | √                | +30%     | +12%
×                 | √                | +12%     | +7%
√                 | ×                | +6%      | -
×                 | ×                | baseline | baseline


Large Page Impact on Memory Overcommitment
• Higher memory pressure because large pages are not shared
• A large page is broken when any of its small pages is ballooned or swapped
  – Sharing happens thereafter

(Chart: memory overcommitment with Swingbench VMs — ballooned / swapped / shared memory (GB) and the number of large pages (nrLarge) over time (minutes).)


New in vSphere 6.0
• A new memory state, "Clear", between High and Low
• Large pages are broken in the Clear state
  – Only if they contain shareable small pages
  – Avoids entering the Low state
  – Best use of large pages and small pages

(Diagram: free-memory states High → Clear → Low around the minFree threshold.)


Performance Improvement

(Chart: VDI average latency (seconds) vs. number of extra VMs, ESXi 5.5 vs. ESXi 6.0.)

• ESXi 6.0 (with the Clear state): sharing happens much earlier => no ballooning/swapping!
• Reference: http://dl.acm.org/citation.cfm?id=27311870

(Charts: total ballooned + swapped memory (MB) and total shared memory (GB) over time (minutes), ESXi 5.5 vs. ESXi 6.0.)

Overhead Memory


Per Host & Per VM
• Composed of MANY components
  – In an idle host, the kernel overhead memory breakdown looks like this …
• Impossible to derive an accurate formula


"Experimentally Safe" Estimation
• Per-VM overhead
  – Less than 10% of configured memory
• Host memory usage without noticeable impact
  – <= 64 GB: 90% of host memory
  – > 64 GB: 95% of host memory
• The above are conservative! (A quick calculator sketch follows.)
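These rules of thumb are easy to turn into a quick calculator; the sketch below simply restates the slide's conservative numbers, with function names of my own choosing.

    def usable_host_memory_gb(host_memory_gb):
        """Host memory usable without noticeable impact: 90% up to 64 GB, 95% above."""
        return host_memory_gb * (0.90 if host_memory_gb <= 64 else 0.95)

    def per_vm_footprint_gb(configured_gb, overhead_ratio=0.10):
        """Per-VM footprint: configured memory plus (at most) 10% overhead."""
        return configured_gb * (1 + overhead_ratio)

    # How many 8 GB VMs fit in a 128 GB host without over-committing memory?
    print(int(usable_host_memory_gb(128) // per_vm_footprint_gb(8)))  # 13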

Memory Overcommitment Guidance

Configured vs. Active Memory Overcommitment
• Two types of memory overcommitment (computed in the sketch below)
  – "Configured" memory overcommitment
    • SUM(memory size of all VMs) / host memory size
  – "Active" memory overcommitment
    • SUM(mem.active of all VMs) / host memory size
• Performance impact
  – "Active" memory overcommitment ≈ 1: high likelihood of performance degradation!
    • Some active memory is not in physical RAM
  – "Configured" memory overcommitment > 1: zero or negligible impact
    • Most reclaimed memory is free/idle guest memory
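Both ratios are trivial to compute once you know each VM's configured size and active memory (e.g. from vRealize Operations or the counters discussed later); the sketch below is a plain Python illustration with assumed numbers.

    def overcommit_ratios(vm_sizes_mb, vm_active_mb, host_memory_mb):
        """'Configured' and 'active' memory overcommitment as defined above."""
        configured = sum(vm_sizes_mb) / host_memory_mb
        active = sum(vm_active_mb) / host_memory_mb
        return configured, active

    # Six 4096 MB VMs on a 16 GB host, roughly 1 GB active each:
    configured, active = overcommit_ratios([4096] * 6, [1024] * 6, 16 * 1024)
    print(configured, active)  # 1.5 (configured > 1, fine), 0.375 (active well below 1)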


General Principles
• High consolidation
  – Keep "active memory" overcommitment < 1
• How to know the "active memory" of a VM?
  – Use vRealize Operations to track a VM's average and maximum memory demand
• What if I have no idea about active memory…
  – Monitor performance counters while adding VMs


Host Statistics (Not Recommended)
• mem.consumed
  – Memory allocation varies dynamically based on entitlement
  – It does not imply a performance problem
• Reclamation-related counters
  – mem.balloon
  – mem.swapUsed
  – mem.compressed
  – mem.shared
• Nonzero values do NOT necessarily mean a performance problem!

Example One (Transient Memory Pressure)
• Six 4GB Swingbench VMs (VM-4, 5, 6 are idle) in a 16GB host

(Charts: operations per minute over time for VM1-VM3, and ballooned / swapped / compressed / shared memory (GB) over time. ΔVM1 = 0%, ΔVM2 = 0%.)


Which Statistics to Watch?
• mem.swapInRate
  – A constant nonzero value indicates a performance problem
• mem.latency
  – % of time spent waiting for decompression and swap-in
  – Estimates the performance impact due to compression and swapping
• mem.active
  – If active is low, reclaimed memory is less likely to be a problem
(A query sketch for these counters follows.)

Example Two (Constant Memory Pressure)
• All six VMs run Swingbench workloads

(Charts: operations per minute over time for VM1-VM6, and swap-in rate (KB per second) over time. ΔVM1 = -16%, ΔVM2 = -21%.)


Key Takeaways


Summary
• Track mem.{swapInRate, active, latency} for performance issues.

• VM memory should be sized based on memory demand.

• “Single digit” memory overhead.

• New ESXi memory management feature improves performance.

• ESXi is expected to handle transient memory pressure well.

