Xunjia Lu - VMware, Inc
SER2343BU
#VMworld #SER2343BU
Extreme Performance Series: vSphere Compute & Memory Schedulers
VMworld 2017 Content: Not for publication or distribution
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
Agenda
1 CPU Scheduler
2 Memory Management
3 NUMA Scheduler
4 Host Configuration and VM Sizing
CPU Scheduler Overview
• Goals
– High CPU utilization, high application throughput
– Ensure fairness (shares, reservation, limit)
• When?
– An idle PCPU gets a new runnable world (wakeup: VM power-on, etc.)
– The running world voluntarily yields the CPU (wait: idle/non-idle)
– The running world involuntarily gives up the CPU (preemption: higher priority / fair share reached)
• What?
– The world in the ready queue with the least (consumed CPU time / fair share)
• Where?
– Balance load across PCPUs
– Preserve cache state, minimize migration cost
– Avoid HT/LLC contention among sibling vCPUs
– Stay close to worlds with frequent communication patterns
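The "what" rule above — pick the ready world with the least consumed CPU time relative to its fair share — can be sketched as follows. This is a simplified illustration, not the actual ESXi implementation; the field names and share model are assumptions:

```python
# Toy sketch of proportional-share "pick next" selection. The real ESXi
# scheduler also weighs entitlement, topology, and co-scheduling; this
# only illustrates the least consumed-time/fair-share rule.
from dataclasses import dataclass

@dataclass
class World:
    name: str
    consumed_ms: float  # CPU time consumed so far
    shares: int         # relative fair share

def pick_next(ready_queue):
    """Pick the ready world with the lowest consumed/share ratio."""
    return min(ready_queue, key=lambda w: w.consumed_ms / w.shares)

ready = [World("vm1-vcpu0", 900.0, 1000),
         World("vm2-vcpu0", 400.0, 2000),  # high shares, little runtime
         World("vm3-vcpu0", 300.0, 1000)]
print(pick_next(ready).name)  # vm2-vcpu0 (ratio 0.2 vs. 0.9 and 0.3)
```

A world with high shares that has consumed little CPU time is picked first, which is exactly what makes shares a fairness mechanism rather than a strict priority.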
Scheduling Through the Lens of esxtop
A Command-Line Tool for Performance Monitoring
• For real-time monitoring
– Just type esxtop in the ESXi shell / terminal
• For batch-mode collection
– esxtop -b -a -d $DELAY -n $SAMPLES > $FILE_NAME.csv
– Tools: perfmon, Excel
• A few changes in 6.5
– Processor turbo or frequency-scaling efficiency (%APERF/MPERF)
– More intuitive accounting
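Batch-mode output is a plain CSV that perfmon or Excel can open, and it can also be post-processed with a short script. A minimal sketch — the header strings below are illustrative stand-ins; real esxtop headers are long counter names like `\\host\Group Cpu(123:vmname)\% Used`, and there are hundreds of columns:

```python
# Sketch: post-processing esxtop batch-mode (-b) CSV output.
# Sample data is made up; real files come from:
#   esxtop -b -a -d $DELAY -n $SAMPLES > file.csv
import csv
import io

sample = io.StringIO(
    '"(PDH-CSV 4.0)","\\\\esx1\\Group Cpu(123:vm1)\\% Used",'
    '"\\\\esx1\\Group Cpu(124:vm2)\\% Used"\n'
    '"10:00:02","41.3","12.7"\n'
    '"10:00:04","45.0","11.9"\n'
)

def columns_for_vm(header, vm):
    """Indexes of the counter columns belonging to the given VM name."""
    return [i for i, name in enumerate(header) if f":{vm})" in name]

reader = csv.reader(sample)
header = next(reader)
used_col = columns_for_vm(header, "vm1")[0]
samples = [float(row[used_col]) for row in reader]
print(sum(samples) / len(samples))  # average %USED for vm1 over the run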
Peeking into a Virtual Machine Using esxtop
A virtual machine consists of more than just vCPU worlds.
CPU Scheduler Accounting: %USED vs. %RUN
%USED vs. %RUN (UTIL)
– Over the same 2 seconds, a world achieves different amounts of work
– RUN (UTIL) is based on wall-clock time (TSC)
– USED reflects frequency scaling (power, turbo) and hyper-thread contention
CPU Scheduler: Throughput Gain due to Hyperthreading
[Chart: normalized throughput gain (up to ~1.6x) for SPEC CPU2006 benchmarks — perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref — in 1-vCPU VMs (Haswell), baseline vs. HT]
CPU Scheduler: Slowdown due to Hyperthreading
[Chart: runtime in minutes for SPEC CPU2006 in 1-vCPU VMs (Haswell), baseline vs. HT; per-benchmark slowdowns range from 1.5x to 1.9x]
CPU Scheduler: Hyperthreading and %USED time
• Improves throughput
• Each vCPU might run slower under contention from its hyper-twin
– 2 cores vs. 2 HTs on a single core
• Hyperthreading-aware scheduling
– The same %RUN translates into different amounts of %USED
• 100% %RUN may yield only 70% %USED under contention
• HT is enabled by default for the extra throughput
• Be aware of HT contention.
CPU Scheduler Accounting: Breakdown
[Diagram: one world's timeline split into waiting, time in the ready queue, CPU scheduling cost, actual execution, interrupted time (D), and efficiency loss (E) from power management, hyper-threading, etc.]
• %WAIT – waiting
• %RDY – time in the ready queue
• %OVRLP – CPU scheduling cost
• %RUN – actual execution
• %SYS – includes interrupted time (D) if it is for this VM
• %USED = %RUN + %SYS − %OVRLP − E, where E is the efficiency loss from power management, hyper-threading, etc.
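The %USED relation can be checked with simple arithmetic; a worked illustration with made-up numbers:

```python
def used_pct(run, sys_t, ovrlp, eff_loss):
    """%USED = %RUN + %SYS - %OVRLP - E, where E is the efficiency
    loss from power management, hyper-threading, etc."""
    return run + sys_t - ovrlp - eff_loss

# A world that ran the whole interval (%RUN = 100) with 5% system time
# charged to it, 2% scheduling overlap, and 30% lost to hyper-thread
# contention and frequency scaling:
print(used_pct(100.0, 5.0, 2.0, 30.0))  # 73.0
```

This is why a fully busy vCPU can report well under 100% %USED: the gap is not idle time but efficiency loss.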
CPU Scheduler Accounting: Time from Kernel Contexts
[Screenshots: esxtop accounting of time spent in kernel contexts, vSphere 6.0 vs. vSphere 6.5 (NEW)]
CPU Scheduler Accounting: Group vs. World
Group (VM) stats aggregate world stats.
[Screenshot: esxtop group view of a VM with 128 vCPUs]
%RDY Impact on Throughput (Java Workload)
[Chart: normalized throughput (bops) of a Java workload vs. %RDY from 0 to 20; roughly 15% throughput loss as %RDY grows — %RDY affects throughput]
%RDY Impact on Latency (Redis Workload)
[Chart: 99.99th-percentile latency (msec) of Redis vs. %RDY up to 25, against spiky and flat competing workloads (−7 ms and −4 ms differences)]
Latency depends on the competing workloads.
CPU Scheduler Co-scheduling
• *NOT* gang-scheduling
– Allows a subset of vCPUs to run simultaneously
– Co-stops a leading vCPU if it advances too far ahead
– Efficient in consolidated setups
• High %CSTOP?
– Any %RDY time?
– Watch out for a vCPU's (WAIT − WAIT_IDLE), i.e. %VMWAIT in esxtop
• A vCPU blocks due to IO (e.g. to a snapshot) or host-level memory swap
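The %VMWAIT hint above — vCPU wait time minus idle wait — is a subtraction worth making explicit, since raw WAIT alone looks alarming for an idle guest:

```python
def vmwait_pct(wait, wait_idle):
    """%VMWAIT: blocked time excluding idle wait. High values point at
    IO (e.g. to a snapshot) or host-level memory swapping."""
    return wait - wait_idle

# A vCPU that waits 80% of the time but is idle for 75% of it is only
# truly blocked for 5% (illustrative numbers):
print(vmwait_pct(80.0, 75.0))  # 5.0
```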
Memory Management Overview
• Goals
– Allow memory over-commitment
– Handle transient memory pressure well
• Terminology
– Total memory size, allocated memory, free memory, active memory, idle memory
Memory Management Overview
• Reclaim memory if consumed > entitled
– Entitlement: shares, limit, reservation, active estimation
– Page sharing > ballooning > compression > host swapping
• Breaks host large pages
• Page sharing vs. large pages
– Using large pages for both the guest and ESXi improves performance by 10–30%
– Page sharing avoids ballooning and swapping
– vSphere 6.0 breaks large pages earlier and increases page sharing (clear state)
Transient Memory Pressure Example
• Six 4GB Swingbench VMs (VM-4,5,6 are idle) in a 16GB host
[Charts: operations per minute for VM1–VM3 over ~56 minutes, and reclaimed memory size (GB) over time, broken down into balloon, swap used, compressed, and shared; throughput impact ∆VM1 = 0%, ∆VM2 = 0%]
Constant Memory Pressure Example
• All six VMs run Swingbench workloads
[Charts: operations per minute for VM1–VM6 over ~56 minutes, and host swap-in rate (KB per second); throughput impact ∆VM1 = −16%, ∆VM2 = −21%]
General Principles
• Two types of memory overcommitment
– "Configured" memory overcommitment: SUM (memory size of all VMs) / host memory size
– "Active" memory overcommitment: SUM (mem.active of all VMs) / host memory size
• Performance impact
– "Active" memory overcommitment ≈ 1 → high likelihood of performance degradation!
• Some active memory is not in physical RAM
– "Configured" memory overcommitment > 1 → zero or negligible impact
• Most reclaimed memory is free/idle guest memory
• Aim for high consolidation while keeping "active" memory overcommitment low
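The two ratios above are straightforward to compute. A sketch using the six-4GB-VM / 16GB-host setup from the earlier examples; the active working-set sizes are made-up numbers for illustration:

```python
def overcommit_ratio(vm_mem_gb, host_mem_gb):
    """SUM(VM memory) / host memory; > 1 means overcommitted."""
    return sum(vm_mem_gb) / host_mem_gb

# Six 4 GB VMs on a 16 GB host, as in the pressure examples above.
configured = overcommit_ratio([4.0] * 6, 16.0)  # configured overcommit
active = overcommit_ratio([0.5] * 6, 16.0)      # assumed ~0.5 GB active each
print(configured, active)  # 1.5 0.1875
```

A configured ratio of 1.5 with an active ratio far below 1 is exactly the "high consolidation, low active overcommit" sweet spot the slide recommends.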
NUMA
• Non-Uniform Memory Access system architecture
– Each node consists of CPU cores, memory, and possibly devices
– Access time can be 30%–200% longer across nodes
• NUMA nodes vs. sockets
– Multiple NUMA nodes per socket (Cluster-on-Die)
– Multiple sockets per NUMA node (less common)
• Small VMs are scheduled on a single physical NUMA node
– 100% local memory accesses
– "Fixes" scale-out apps that don't scale up well
– Consider sizing databases to fit
[Diagram: NUMA node 0 and NUMA node 1]
NUMA Scheduler: Overview
• Load balancing
– To balance VMs across different NUMA nodes
• vNUMA
– To properly expose virtualized NUMA topology to guest VMs for best performance
NUMA Scheduler: Load Balancing
• Initial placement
– Based on CPU/memory load + round-robin
• Periodic rebalancing algorithm
– Every 2 seconds, try an incremental move (1 or 2 VMs)
– To improve load balance / memory locality / sharing relations / fairness
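The incremental-move idea can be sketched as a tiny greedy step. This is a toy model, not the ESXi algorithm (which also weighs memory locality, sharing relations, and fairness); all names and the move criterion are assumptions:

```python
def rebalance_step(nodes):
    """Propose moving one VM from the busiest to the idlest node.

    nodes: {node_id: [(vm_name, cpu_load), ...]}
    Returns (vm_name, src, dst) or None if no move helps.
    """
    load = {n: sum(l for _, l in vms) for n, vms in nodes.items()}
    src = max(load, key=load.get)
    dst = min(load, key=load.get)
    diff = load[src] - load[dst]
    # A move of a VM with load w shrinks the gap by 2w, so it only
    # helps when 0 < w < diff; pick the smallest such VM (cheapest move).
    candidates = [(name, l) for name, l in nodes[src] if 0 < l < diff]
    if src == dst or not candidates:
        return None
    vm = min(candidates, key=lambda x: x[1])[0]
    return (vm, src, dst)

nodes = {0: [("vm1", 60), ("vm2", 30)], 1: [("vm3", 20)]}
print(rebalance_step(nodes))  # ('vm2', 0, 1)
```

Moving vm2 narrows the load gap from 70 to 10; repeating such small steps every epoch is the "incremental move" behavior, which avoids large disruptive reshuffles.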
NUMA Rebalancing In Action (TPCx-V)
[Charts: PCPU numbers (0–120) over time for four groups of VMs (Group-1: vm2–vm4, Group-2: vm5–vm7, Group-3: vm8–vm10, Group-4: vm11–vm13), showing the NUMA rebalancer migrating VMs across nodes]
NUMA Scheduler: Impact of Cluster-on-Die (Haswell)
[Chart: SPECjbb2015 normalized throughput gain for 36Vx1, 18Vx1, and 9Vx2 VM configurations, default vs. CoD (up to ~1.08x)]
• Cluster-on-Die
– Breaks each socket into 2 NUMA domains
• Performance Considerations
– Lower LLC hit latency and local memory latency
– Higher local memory bandwidth
• vSphere 6.5 supports Haswell, Broadwell and future generations of processors
NUMA Scheduler: vNUMA
• vNUMA
– Useful for wide VMs (#vCPUs > #cores/NUMA node)
– Expose virtual NUMA topology to improve memory locality for better guest scheduling
• Example: 10-vCPU VM maps directly to 2 pNUMA nodes (out of 4)
[Diagram: four physical NUMA nodes, each with 6 cores (C0–C5); the 10-vCPU VM occupies two of them]
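The wide-VM example above follows a simple rule of thumb. A sketch — the real ESXi policy has more inputs (the vNUMA enablement threshold, cores-per-socket settings), so treat this only as an approximation:

```python
import math

def vnuma_node_count(n_vcpus, cores_per_pnode):
    """Rule-of-thumb sketch: how many physical NUMA nodes a VM spans.
    A VM that fits in one node stays local; a wide VM is split."""
    return max(1, math.ceil(n_vcpus / cores_per_pnode))

print(vnuma_node_count(10, 6))  # 2: the 10-vCPU VM spans 2 six-core nodes
print(vnuma_node_count(4, 6))   # 1: a small VM stays on one node
```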
NUMA Scheduler: vNUMA vs. vSocket
[Diagram: 8 vCPUs grouped into 4 vSockets, mapped onto 2 vNUMA nodes]
• Decoupling vSocket from vNUMA (vSphere 6.5) — NEW!
– ESXi will always try to pick the optimal vNUMA topology when possible
• As long as vNUMA = N × vSocket, or vice versa
– Blog post: "Virtual Machine vCPU and vNUMA Rightsizing – Rules of Thumb"
NUMA Scheduler: Virtual CPU Topology Example
• numactl --hardware
• lstopo -s
• coreinfo -n -s
Host Configuration: Power Management Policy
[Chart: normalized throughput of a Java workload in a 1-vCPU VM on IvyBridge and Haswell, High Performance vs. Balanced (P-states + C-states) power policies]
VM Sizing: #vCPUs
• Cost of over-sizing
– Small CPU overhead per vCPU from periodic timer, etc.
– May hurt performance due to process migrations (e.g. Redis, up to 40% regression)
• Cost of under-sizing
– Internal CPU contention
– Check CPU usage and processor queue length
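The "check processor queue length" advice can be scripted inside a Linux guest. A hedged sketch using `/proc/loadavg` (Linux-specific; the 1-minute load average is only a rough proxy for the run queue, and the vCPU count here is an assumed example value):

```python
def load_and_running():
    """Parse /proc/loadavg (Linux): returns the 1-minute load average
    and the number of currently runnable threads (the 'N/M' field)."""
    with open("/proc/loadavg") as f:
        one_min, _, _, run_total, _ = f.read().split()
    return float(one_min), int(run_total.split("/")[0])

n_vcpus = 4  # assumed size of this VM
load1, runnable = load_and_running()
# A 1-minute load sustained well above the vCPU count suggests internal
# CPU contention, i.e. an under-sized VM.
if load1 > n_vcpus:
    print("1-min load exceeds vCPU count: possible internal contention")
```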
VM Sizing: vRAM
• Cost of over-sizing
– Some apps/OSes treat unused memory as cache
• e.g. SuperFetch
• Increases active memory
• May suffer from memory reclamation
• Cost of under-sizing
– Guest level paging
Summary
• Be aware of the difference between per-VM %RDY and per-world %RDY
• Pay attention to ready time (%RDY) if tail latency matters
• Size VMs based on the number of physical cores instead of hyperthreads
• In vSphere 6.5 and beyond
– Insignificant %OVRLP and some %SYS moved to %RUN
– No more hassle between vSocket and vNUMA
– Even smaller scheduling overhead
• Avoid active memory overcommit.
• Watch out for under-sizing a VM.
• Power policy matters!
Extreme Performance Series – Las Vegas
• SER2724BU Performance Best Practices
• SER2723BU Benchmarking 101
• SER2343BU vSphere Compute & Memory Schedulers
• SER1504BU vCenter Performance Deep Dive
• SER2734BU Byte Addressable Non-Volatile Memory in vSphere
• SER2849BU Predictive DRS – Performance & Best Practices
• SER1494BU Encrypted vMotion Architecture, Performance, & Futures
• STO1515BU vSAN Performance Troubleshooting
• VIRT1445BU Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BU Optimize & Increase Performance Using VMware NSX
• VIRT2550BU Reducing Latency in Enterprise Applications with VMware NSX
• VIRT1052BU Monster VM Database Performance
• VIRT1983BU Cycle Stealing from the VDI Estate for Financial Modeling
• VIRT1997BU Machine Learning and Deep Learning on VMware vSphere
• FUT2020BU Wringing Max Perf from vSphere for Extremely Demanding Workloads
• FUT2761BU Sharing High Performance Interconnects across Multiple VMs
Extreme Performance Series – Barcelona
• SER2724BE Performance Best Practices
• SER2343BE vSphere Compute & Memory Schedulers
• SER1504BE vCenter Performance Deep Dive
• SER2849BE Predictive DRS – Performance & Best Practices
• VIRT1445BE Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BE Optimize & Increase Performance Using VMware NSX
• VIRT1052BE Monster VM Database Performance
• FUT2020BE Wringing Max Perf from vSphere for Extremely Demanding Workloads
Extreme Performance Series – Hands-on Labs
Don’t miss these popular Extreme Performance labs:
• HOL-1804-01-SDC: vSphere 6.5 Performance Diagnostics & Benchmarking
– Each module dives deep into vSphere performance best practices, diagnostics, and optimizations using various interfaces and benchmarking tools.
• HOL-1804-02-CHG: vSphere Challenge Lab
– Each module places you in a different fictional scenario to fix common vSphere operational and performance problems.
Performance Survey
The VMware Performance Engineering team is always looking for feedback about your experience with the performance of our products, our various tools and interfaces, and where we can improve.
Scan this QR code to access a short survey and provide us direct feedback.
Alternatively: www.vmware.com/go/perf
Thank you!