Xunjia Lu - VMware, Inc
SER2343BU
#VMworld #SER2343BU
Extreme Performance Series: vSphere Compute & Memory Schedulers
VMworld 2017 Content: Not for publication or distribution
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
Agenda
1 CPU Scheduler
2 Memory Management
3 NUMA Scheduler
4 Host Configuration and VM Sizing
CPU Scheduler Overview
• Goals
– High CPU utilization, high application throughput
– Ensure fairness (shares, reservation, limit)
• When?
– An idle PCPU gets a new runnable world (wakeup: VM power-on, etc.)
– The running world voluntarily yields the CPU (wait: idle/non-idle)
– The running world involuntarily gives up the CPU (preemption: higher priority / fair share reached)
• What?
– The world in the ready queue with the least (consumed CPU time / fair share)
• Where?
– Balance load across PCPUs
– Preserve cache state, minimize migration cost
– Avoid HT/LLC contention among sibling vCPUs
– Stay close to worlds with frequent communication patterns
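The "what" rule above — pick the ready world with the least consumed CPU time relative to its fair share — can be sketched as follows. This is a simplified illustration, not the actual ESXi implementation; the field names and share model are assumptions:

```python
# Toy sketch of proportional-share "pick next" selection. The real ESXi
# scheduler also weighs entitlement, topology, and co-scheduling; this
# only illustrates the least consumed-time/fair-share rule.
from dataclasses import dataclass

@dataclass
class World:
    name: str
    consumed_ms: float  # CPU time consumed so far
    shares: int         # relative fair share

def pick_next(ready_queue):
    """Pick the ready world with the lowest consumed/share ratio."""
    return min(ready_queue, key=lambda w: w.consumed_ms / w.shares)

ready = [World("vm1-vcpu0", 900.0, 1000),
         World("vm2-vcpu0", 400.0, 2000),  # high shares, little runtime
         World("vm3-vcpu0", 300.0, 1000)]
print(pick_next(ready).name)  # vm2-vcpu0 (ratio 0.2 vs. 0.9 and 0.3)
```

A world with high shares that has consumed little CPU time is picked first, which is exactly what makes shares a fairness mechanism rather than a strict priority.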
Scheduling Through the Lens of esxtop
A Command-Line Tool for Performance Monitoring
• For real-time monitoring
– Just type esxtop in the ESXi shell / terminal
• For batch-mode collection
– esxtop -b -a -d $DELAY -n $SAMPLES > $FILE_NAME.csv
– Tools: perfmon, Excel
• A few changes in 6.5
– Processor turbo or frequency-scaling efficiency (%APERF/MPERF)
– More intuitive accounting
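Batch-mode output is a plain CSV that perfmon or Excel can open, and it can also be post-processed with a short script. A minimal sketch — the header strings below are illustrative stand-ins; real esxtop headers are long counter names like `\\host\Group Cpu(123:vmname)\% Used`, and there are hundreds of columns:

```python
# Sketch: post-processing esxtop batch-mode (-b) CSV output.
# Sample data is made up; real files come from:
#   esxtop -b -a -d $DELAY -n $SAMPLES > file.csv
import csv
import io

sample = io.StringIO(
    '"(PDH-CSV 4.0)","\\\\esx1\\Group Cpu(123:vm1)\\% Used",'
    '"\\\\esx1\\Group Cpu(124:vm2)\\% Used"\n'
    '"10:00:02","41.3","12.7"\n'
    '"10:00:04","45.0","11.9"\n'
)

def columns_for_vm(header, vm):
    """Indexes of the counter columns belonging to the given VM name."""
    return [i for i, name in enumerate(header) if f":{vm})" in name]

reader = csv.reader(sample)
header = next(reader)
used_col = columns_for_vm(header, "vm1")[0]
samples = [float(row[used_col]) for row in reader]
print(sum(samples) / len(samples))  # average %USED for vm1 over the run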
Peeking into a Virtual Machine Using esxtop
A virtual machine consists of more than just vCPU worlds.
CPU Scheduler Accounting: %USED vs. %RUN
%USED vs. %RUN (UTIL)
– Over the same 2 seconds, a world achieves different amounts of work
– RUN (UTIL) is based on wall-clock time (TSC)
– USED reflects frequency scaling (power, turbo) and hyper-thread contention
CPU Scheduler: Throughput Gain due to Hyperthreading
[Chart: normalized throughput gain (up to ~1.6x) for SPEC CPU2006 benchmarks — perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref — in 1-vCPU VMs (Haswell), baseline vs. HT]
CPU Scheduler: Slowdown due to Hyperthreading
[Chart: runtime in minutes for SPEC CPU2006 in 1-vCPU VMs (Haswell), baseline vs. HT; per-benchmark slowdowns range from 1.5x to 1.9x]
CPU Scheduler: Hyperthreading and %USED time
• Improves throughput
• Each vCPU might run slower under contention from its hyper-twin
– 2 cores vs. 2 HTs on a single core
• Hyperthreading-aware scheduling
– The same %RUN translates into different amounts of %USED
• 100% %RUN may yield only 70% %USED under contention
• HT is enabled by default for the extra throughput
• Be aware of HT contention.
CPU Scheduler Accounting: Breakdown
[Diagram: one world's timeline split into waiting, time in the ready queue, CPU scheduling cost, actual execution, interrupted time (D), and efficiency loss (E) from power management, hyper-threading, etc.]
• %WAIT – waiting
• %RDY – time in the ready queue
• %OVRLP – CPU scheduling cost
• %RUN – actual execution
• %SYS – includes interrupted time (D) if it is for this VM
• %USED = %RUN + %SYS − %OVRLP − E, where E is the efficiency loss from power management, hyper-threading, etc.
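The %USED relation can be checked with simple arithmetic; a worked illustration with made-up numbers:

```python
def used_pct(run, sys_t, ovrlp, eff_loss):
    """%USED = %RUN + %SYS - %OVRLP - E, where E is the efficiency
    loss from power management, hyper-threading, etc."""
    return run + sys_t - ovrlp - eff_loss

# A world that ran the whole interval (%RUN = 100) with 5% system time
# charged to it, 2% scheduling overlap, and 30% lost to hyper-thread
# contention and frequency scaling:
print(used_pct(100.0, 5.0, 2.0, 30.0))  # 73.0
```

This is why a fully busy vCPU can report well under 100% %USED: the gap is not idle time but efficiency loss.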
CPU Scheduler Accounting: Time from Kernel Contexts
[Screenshots: esxtop accounting of time spent in kernel contexts, vSphere 6.0 vs. vSphere 6.5 (NEW)]
CPU Scheduler Accounting: Group vs. World
Group (VM) stats aggregate world stats.
[Screenshot: esxtop group view of a VM with 128 vCPUs]
%RDY Impact on Throughput (Java Workload)
[Chart: normalized throughput (bops) of a Java workload vs. %RDY from 0 to 20; roughly 15% throughput loss as %RDY grows — %RDY affects throughput]
%RDY Impact on Latency (Redis Workload)
[Chart: 99.99th-percentile latency (msec) of Redis vs. %RDY up to 25, against spiky and flat competing workloads (−7 ms and −4 ms differences)]
Latency depends on the competing workloads.
CPU Scheduler Co-scheduling
• *NOT* gang-scheduling
– Allows a subset of vCPUs to run simultaneously
– Co-stops a leading vCPU if it advances too far ahead
– Efficient in consolidated setups
• High %CSTOP?
– Any %RDY time?
– Watch out for a vCPU's (WAIT − WAIT_IDLE), i.e. %VMWAIT in esxtop
• A vCPU blocks due to IO (e.g. to a snapshot) or host-level memory swap
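The %VMWAIT hint above — vCPU wait time minus idle wait — is a subtraction worth making explicit, since raw WAIT alone looks alarming for an idle guest:

```python
def vmwait_pct(wait, wait_idle):
    """%VMWAIT: blocked time excluding idle wait. High values point at
    IO (e.g. to a snapshot) or host-level memory swapping."""
    return wait - wait_idle

# A vCPU that waits 80% of the time but is idle for 75% of it is only
# truly blocked for 5% (illustrative numbers):
print(vmwait_pct(80.0, 75.0))  # 5.0
```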
Memory Management Overview
• Goals
– Allow memory over-commitment
– Handle transient memory pressure well
• Terminology
– Total memory size, allocated memory, free memory, active memory, idle memory
Memory Management Overview
• Reclaim memory if consumed > entitled
– Entitlement: shares, limit, reservation, active estimation
– Page sharing > ballooning > compression > host swapping
• Breaks host large pages
• Page sharing vs. large pages
– Using large pages for both the guest and ESXi improves performance by 10–30%
– Page sharing avoids ballooning and swapping
– vSphere 6.0 breaks large pages earlier and increases page sharing (clear state)
Transient Memory Pressure Example
• Six 4GB Swingbench VMs (VM-4,5,6 are idle) in a 16GB host
[Charts: operations per minute for VM1–VM3 over ~56 minutes, and reclaimed memory size (GB) over time, broken down into balloon, swap used, compressed, and shared; throughput impact ∆VM1 = 0%, ∆VM2 = 0%]
Constant Memory Pressure Example
• All six VMs run Swingbench workloads
[Charts: operations per minute for VM1–VM6 over ~56 minutes, and host swap-in rate (KB per second); throughput impact ∆VM1 = −16%, ∆VM2 = −21%]
General Principles
• Two types of memory overcommitment
– "Configured" memory overcommitment: SUM (memory size of all VMs) / host memory size
– "Active" memory overcommitment: SUM (mem.active of all VMs) / host memory size
• Performance impact
– "Active" memory overcommitment ≈ 1 → high likelihood of performance degradation!
• Some active memory is not in physical RAM
– "Configured" memory overcommitment > 1 → zero or negligible impact
• Most reclaimed memory is free/idle guest memory
• Aim for high consolidation while keeping "active" memory overcommitment low
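The two ratios above are straightforward to compute. A sketch using the six-4GB-VM / 16GB-host setup from the earlier examples; the active working-set sizes are made-up numbers for illustration:

```python
def overcommit_ratio(vm_mem_gb, host_mem_gb):
    """SUM(VM memory) / host memory; > 1 means overcommitted."""
    return sum(vm_mem_gb) / host_mem_gb

# Six 4 GB VMs on a 16 GB host, as in the pressure examples above.
configured = overcommit_ratio([4.0] * 6, 16.0)  # configured overcommit
active = overcommit_ratio([0.5] * 6, 16.0)      # assumed ~0.5 GB active each
print(configured, active)  # 1.5 0.1875
```

A configured ratio of 1.5 with an active ratio far below 1 is exactly the "high consolidation, low active overcommit" sweet spot the slide recommends.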
NUMA
• Non-Uniform Memory Access system architecture
– Each node consists of CPU cores, memory, and possibly devices
– Access time can be 30%–200% longer across nodes
• NUMA nodes vs. sockets
– Multiple NUMA nodes per socket (Cluster-on-Die)
– Multiple sockets per NUMA node (less common)
• Small VMs are scheduled on a single physical NUMA node
– 100% local memory accesses
– "Fixes" scale-out apps that don't scale up well
– Consider sizing databases to fit
[Diagram: NUMA node 0 and NUMA node 1]
NUMA Scheduler: Overview
• Load balancing
– To balance VMs across different NUMA nodes
• vNUMA
– To properly expose virtualized NUMA topology to guest VMs for best performance
NUMA Scheduler: Load Balancing
• Initial placement
– Based on CPU/memory load + round-robin
• Periodic rebalancing algorithm
– Every 2 seconds, try an incremental move (1 or 2 VMs)
– To improve load balance / memory locality / sharing relations / fairness
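The incremental-move idea can be sketched as a tiny greedy step. This is a toy model, not the ESXi algorithm (which also weighs memory locality, sharing relations, and fairness); all names and the move criterion are assumptions:

```python
def rebalance_step(nodes):
    """Propose moving one VM from the busiest to the idlest node.

    nodes: {node_id: [(vm_name, cpu_load), ...]}
    Returns (vm_name, src, dst) or None if no move helps.
    """
    load = {n: sum(l for _, l in vms) for n, vms in nodes.items()}
    src = max(load, key=load.get)
    dst = min(load, key=load.get)
    diff = load[src] - load[dst]
    # A move of a VM with load w shrinks the gap by 2w, so it only
    # helps when 0 < w < diff; pick the smallest such VM (cheapest move).
    candidates = [(name, l) for name, l in nodes[src] if 0 < l < diff]
    if src == dst or not candidates:
        return None
    vm = min(candidates, key=lambda x: x[1])[0]
    return (vm, src, dst)

nodes = {0: [("vm1", 60), ("vm2", 30)], 1: [("vm3", 20)]}
print(rebalance_step(nodes))  # ('vm2', 0, 1)
```

Moving vm2 narrows the load gap from 70 to 10; repeating such small steps every epoch is the "incremental move" behavior, which avoids large disruptive reshuffles.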
NUMA Rebalancing In Action (TPCx-V)
[Charts: PCPU numbers (0–120) over time for four groups of VMs (Group-1: vm2–vm4, Group-2: vm5–vm7, Group-3: vm8–vm10, Group-4: vm11–vm13), showing the NUMA rebalancer migrating VMs across nodes]
NUMA Scheduler: Impact of Cluster-on-Die (Haswell)
[Chart: SPECjbb2015 normalized throughput gain for 36Vx1, 18Vx1, and 9Vx2 VM configurations, default vs. CoD (up to ~1.08x)]
• Cluster-on-Die
– Breaks each socket into 2 NUMA domains
• Performance Considerations
– Lower LLC hit latency and local memory latency
– Higher local memory bandwidth
• vSphere 6.5 supports Haswell, Broadwell and future generations of processors
NUMA Scheduler: vNUMA
• vNUMA
– Useful for wide VMs (#vCPUs > #cores/NUMA node)
– Expose virtual NUMA topology to improve memory locality for better guest scheduling
• Example: 10-vCPU VM maps directly to 2 pNUMA nodes (out of 4)
[Diagram: four physical NUMA nodes, each with 6 cores (C0–C5); the 10-vCPU VM occupies two of them]
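The wide-VM example above follows a simple rule of thumb. A sketch — the real ESXi policy has more inputs (the vNUMA enablement threshold, cores-per-socket settings), so treat this only as an approximation:

```python
import math

def vnuma_node_count(n_vcpus, cores_per_pnode):
    """Rule-of-thumb sketch: how many physical NUMA nodes a VM spans.
    A VM that fits in one node stays local; a wide VM is split."""
    return max(1, math.ceil(n_vcpus / cores_per_pnode))

print(vnuma_node_count(10, 6))  # 2: the 10-vCPU VM spans 2 six-core nodes
print(vnuma_node_count(4, 6))   # 1: a small VM stays on one node
```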
NUMA Scheduler: vNUMA vs. vSocket
[Diagram: 8 vCPUs grouped into 4 vSockets, mapped onto 2 vNUMA nodes]
• Decoupling vSocket from vNUMA (vSphere 6.5) — NEW!
– ESXi will always try to pick the optimal vNUMA topology when possible
• As long as vNUMA = N × vSocket, or vice versa
– Blog post: "Virtual Machine vCPU and vNUMA Rightsizing – Rules of Thumb"
NUMA Scheduler: Virtual CPU Topology Example
• numactl --hardware
• lstopo -s
• coreinfo -n -s
Host Configuration: Power Management Policy
[Chart: normalized throughput of a Java workload in a 1-vCPU VM on IvyBridge and Haswell, High Performance vs. Balanced (P-states + C-states) power policies]
VM Sizing: #vCPUs
• Cost of over-sizing
– Small CPU overhead per vCPU from periodic timer, etc.
– May hurt performance due to process migrations (e.g. Redis, up to 40% regression)
• Cost of under-sizing
– Internal CPU contention
– Check CPU usage and processor queue length
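The "check processor queue length" advice can be scripted inside a Linux guest. A hedged sketch using `/proc/loadavg` (Linux-specific; the 1-minute load average is only a rough proxy for the run queue, and the vCPU count here is an assumed example value):

```python
def load_and_running():
    """Parse /proc/loadavg (Linux): returns the 1-minute load average
    and the number of currently runnable threads (the 'N/M' field)."""
    with open("/proc/loadavg") as f:
        one_min, _, _, run_total, _ = f.read().split()
    return float(one_min), int(run_total.split("/")[0])

n_vcpus = 4  # assumed size of this VM
load1, runnable = load_and_running()
# A 1-minute load sustained well above the vCPU count suggests internal
# CPU contention, i.e. an under-sized VM.
if load1 > n_vcpus:
    print("1-min load exceeds vCPU count: possible internal contention")
```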
VM Sizing: vRAM
• Cost of over-sizing
– Some apps/OSes treat unused memory as cache
• e.g. SuperFetch
• Increases active memory
• May suffer from memory reclamation
• Cost of under-sizing
– Guest level paging
Summary
• Be aware of the difference between per-VM %RDY and per-world %RDY
• Pay attention to ready time (%RDY) if tail latency matters
• Size VMs based on the number of physical cores instead of hyperthreads
• In vSphere 6.5 and beyond
– Insignificant %OVRLP and some %SYS moved to %RUN
– No more hassle between vSocket and vNUMA
– Even smaller scheduling overhead
• Avoid active memory overcommit.
• Watch out for under-sizing a VM.
• Power policy matters!
Extreme Performance Series – Las Vegas
• SER2724BU Performance Best Practices
• SER2723BU Benchmarking 101
• SER2343BU vSphere Compute & Memory Schedulers
• SER1504BU vCenter Performance Deep Dive
• SER2734BU Byte Addressable Non-Volatile Memory in vSphere
• SER2849BU Predictive DRS – Performance & Best Practices
• SER1494BU Encrypted vMotion Architecture, Performance, & Futures
• STO1515BU vSAN Performance Troubleshooting
• VIRT1445BU Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BU Optimize & Increase Performance Using VMware NSX
• VIRT2550BU Reducing Latency in Enterprise Applications with VMware NSX
• VIRT1052BU Monster VM Database Performance
• VIRT1983BU Cycle Stealing from the VDI Estate for Financial Modeling
• VIRT1997BU Machine Learning and Deep Learning on VMware vSphere
• FUT2020BU Wringing Max Perf from vSphere for Extremely Demanding Workloads
• FUT2761BU Sharing High Performance Interconnects across Multiple VMs
Extreme Performance Series – Barcelona
• SER2724BE Performance Best Practices
• SER2343BE vSphere Compute & Memory Schedulers
• SER1504BE vCenter Performance Deep Dive
• SER2849BE Predictive DRS – Performance & Best Practices
• VIRT1445BE Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BE Optimize & Increase Performance Using VMware NSX
• VIRT1052BE Monster VM Database Performance
• FUT2020BE Wringing Max Perf from vSphere for Extremely Demanding Workloads
Extreme Performance Series – Hands-on Labs
Don’t miss these popular Extreme Performance labs:
• HOL-1804-01-SDC: vSphere 6.5 Performance Diagnostics & Benchmarking
– Each module dives deep into vSphere performance best practices, diagnostics, and optimizations using various interfaces and benchmarking tools.
• HOL-1804-02-CHG: vSphere Challenge Lab
– Each module places you in a different fictional scenario to fix common vSphere operational and performance problems.
Performance Survey
The VMware Performance Engineering team is always looking for feedback about your experience with the performance of our products, our various tools and interfaces, and where we can improve.
Scan this QR code to access a short survey and provide us direct feedback.
Alternatively: www.vmware.com/go/perf
Thank you!