Chicago, October 19 - 22, 2010
Virtualization Technical Deep Dive: Key Concepts for Developers
Richard McDougall - VMware
SpringOne 2GX 2010. All rights reserved. Do not distribute without permission.
Virtualization Technical Deep Dive
We’ll be covering
• Virtualization Capabilities
• Workstation Virtualization
• How virtual machines work and what the overhead is
• How Server Virtualization/Consolidation works
• Java and Consolidation on Server Virtualization
What is Virtualization?
Three Properties of Virtualization

Partitioning
• Run multiple operating systems on one physical machine
• Fully utilize server resources
• Support high availability by clustering virtual machines

Encapsulation
• Encapsulate the entire state of the virtual machine in hardware-independent files
• Save the virtual machine state as a snapshot in time
• Re-use or transfer whole virtual machines with a simple file copy

Isolation
• Isolate faults and security at the virtual-machine level
• Dynamically control CPU, memory, disk and network resources per virtual machine
• Guarantee service levels
Virtualization for Desktops/Laptops
• Desktop products – VMware Fusion and Workstation
• Features for Developers – Run multiple OS versions concurrently – Test Server applications on your desktop/laptop – Leverage the record/replay capability for debug
Virtualization for Servers: Problem: Underutilized Servers
Consolidation targets are often <30% utilized:
• Windows average utilization: 5-8%
• Linux/Unix average: 10-35%
Initial Virtualization Benefits: Consolidation
            BEFORE VMware                AFTER VMware
Servers     1,000                        80
Storage     Direct attach                Tiered SAN and NAS
Network     3,000 cables/ports           400 cables/ports
Facilities  200 racks, 400 power whips   10 racks, 20 power whips
Next Benefit: Simpler Management VMotion Technology
VMotion Technology moves running virtual machines from one host to another while maintaining continuous service availability
- Enables Resource Pools - Enables High Availability
Pooling of Resources
Pools replace hosts as the primary compute abstraction.
[Diagram: vCenter/DRS turns an imbalanced cluster (one host under heavy load, another under lighter load) into a balanced cluster: an automated pool of resources]
DRS Scalability – Transactions per minute (higher is better)
An already-balanced cluster sees fewer gains; gains are higher (>40%) with more imbalance.
VIRTUALIZATION TECHNOLOGY
“Hosted” vs vSphere Virtualization Architecture
[Diagram: VMware Fusion and Workstation run guests on top of a host operating system (Linux, Windows, Mac OS X) on the physical hardware; VMware vSphere (server virtualization) runs guests directly on the hypervisor on the physical hardware]
“Hosted” Virtualization Architecture
[Diagram: each guest runs on its own monitor inside an OS process on the host operating system; the guest's virtual NIC and virtual SCSI devices are serviced by the host's TCP/IP stack and local file system (e.g. mydisk.vmdk) through native NIC and I/O drivers on the physical hardware]
Virtual CPU abstraction is created by “monitor”
Each VM is an OS process
Monitor supports: BT (Binary Translation) HW (Hardware assist) PV (Paravirtualization)
Memory is allocated by the OS and virtualized by the monitor
Network and I/O devices are emulated and proxied through native device drivers
OS Process

rmc$ ps -fp 4295
UID  PID   PPID  C  STIME     TTY  TIME      CMD
0    4295  1     0  18:15.66  ??   21:05.14  /Library/Application Support/VMware Fusion/vmware-vmx /Users/rmc/Documents/Virtual Machines/Windows XP Pro.vmwarevm/Windows XP Pro.vmx

rmc$ more Windows XP Pro.vmx
virtualHW.version = "7"
memsize = "776"
ide0:0.fileName = "Windows XP Professional.vmdk"
ethernet0.connectionType = "nat"
Inside the Monitor: Classical Instruction Virtualization Trap-and-emulate
Nonvirtualized (“native”) system – OS runs in privileged mode – OS “owns” the hardware – Application code has less privilege
Virtualized – VMM most privileged (for isolation) – Classical “ring compression” or “de-privileging”
• Run guest OS kernel in Ring 1 • Privileged instructions trap; emulated by VMM
– But: does not work for x86 (lack of traps)
[Diagram: native system has apps in Ring 3 and the OS in Ring 0; in the virtualized system, apps stay in Ring 3, the guest OS is de-privileged into Ring 1, and the VMM runs in Ring 0]
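The trap-and-emulate idea can be sketched as a toy model. This is a hedged illustration only: the instruction names, the `VirtualCPU` fields, and the `vmm_emulate` helper are invented for this sketch, not real x86 semantics or VMM code.

```python
# Toy sketch of classical trap-and-emulate: the de-privileged guest's
# privileged instructions trap to the VMM, which emulates them against
# a virtual CPU state. (Illustrative model, not real x86.)

PRIVILEGED = {"cli", "sti", "mov_to_cr3"}

class VirtualCPU:
    def __init__(self):
        self.interrupts_enabled = True
        self.cr3 = 0  # guest page-table root (invented field)

def vmm_emulate(vcpu, instr, operand=None):
    """Emulate a trapped privileged instruction on the virtual CPU."""
    if instr == "cli":
        vcpu.interrupts_enabled = False
    elif instr == "sti":
        vcpu.interrupts_enabled = True
    elif instr == "mov_to_cr3":
        vcpu.cr3 = operand

def run_guest(vcpu, instructions):
    """Guest runs de-privileged; privileged instructions trap to the VMM."""
    for instr, operand in instructions:
        if instr in PRIVILEGED:
            vmm_emulate(vcpu, instr, operand)  # trap -> VMM emulates
        # non-privileged instructions would execute directly on the CPU

vcpu = VirtualCPU()
run_guest(vcpu, [("cli", None), ("mov_to_cr3", 0x1000), ("sti", None)])
```

The catch the slide notes: classical x86 had privileged-state instructions that silently misbehave instead of trapping, which is why VMware needed binary translation.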
Binary Translation of Guest Code
• Translate guest kernel code
• Replace privileged instructions with safe "equivalent" instruction sequences
• No need for traps
BT is an extremely powerful technology
– Permits any unmodified x86 OS to run in a VM
– Can virtualize any instruction set
Combining BT and Direct Execution
[Diagram: user-mode guest code runs via direct execution; kernel-mode guest code runs via binary translation; the VMM switches between them on faults, syscalls, and interrupts (entering BT) and IRET/sysret (returning to direct execution)]
BT Mechanics
Each translator invocation:
– Consumes one input basic block (guest code)
– Produces one output basic block
Output is stored in the translation cache:
– Future reuse
– Amortizes translation costs
– Guest-transparent: no patching "in place"
[Diagram: the translator consumes an input basic block from the guest and emits a translated basic block into the translation cache]
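The cache behavior can be sketched like this. Illustrative only: the `translate` function and its way of tagging privileged instructions are invented, not VMware's translator.

```python
# Toy binary-translation cache: translate each guest basic block once,
# keyed by guest address, and reuse the result on later invocations.

translation_cache = {}
translations_done = 0

def translate(guest_pc, guest_block):
    """Translate one guest basic block, amortizing cost via the cache."""
    global translations_done
    if guest_pc not in translation_cache:
        translations_done += 1
        # "Translation" here just rewrites privileged instructions into
        # safe emulation calls, as the slide describes.
        translation_cache[guest_pc] = [
            f"emulate({i})" if i.startswith("priv_") else i
            for i in guest_block
        ]
    return translation_cache[guest_pc]

block = ["mov", "priv_cli", "add"]
first = translate(0x400, block)
second = translate(0x400, block)   # cache hit: no re-translation
```

The second call is a cache hit, which is where the amortization comes from: hot guest code pays the translation cost once.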
Intel VT/ AMD-V: 1st Generation HW Support
• Key feature: root vs. guest CPU mode
– VMM executes in root mode
– Guest (OS, apps) execute in guest mode
• VMM and Guest run as “co-routines”
– VM enter
– Guest runs
– A while later: VM exit
– VMM runs
– ...
[Diagram: in root mode the VMM runs in Ring 0; in guest mode the guest OS runs in Ring 0 and apps in Ring 3; VM enter transitions root to guest, VM exit transitions guest back to root]
Qualitative Comparison of BT and VT-x/AMD-V
• VT-x/AMD-V loses on:
– exits (costlier than "callouts")
– no adaptation (cannot eliminate exits)
– page table updates
– memory-mapped I/O
– IN/OUT instructions
• VT-x/AMD-V wins on:
– system calls
– almost all code runs "directly"
• BT loses on:
– system calls
– translator overheads
– path lengthening
– indirect control flow
• BT wins on:
– page table updates (adaptation)
– memory-mapped I/O (adaptation)
– IN/OUT instructions
– no traps for privileged instructions
Can I Virtualize CPU-Intensive Applications?
Most CPU-intensive applications have very low overhead. VMware ESX 3.x compared to native:
– SPECcpu results covered in the paper by O. Agesen and K. Adams
– WebSphere results published jointly by IBM/VMware
– SPECjbb results from recent internal measurements
Virtualizing Virtual Memory
• To run multiple VMs on a single system, another level of memory virtualization must be done
– Guest OS still controls the virtual-to-physical mapping: VA -> PA
– Guest OS has no direct access to machine memory (to enforce isolation)
• VMM maps guest physical memory to actual machine memory: PA -> MA
[Diagram: in each VM, per-process virtual memory (VA) maps to guest physical memory (PA), which the VMM in turn maps to machine memory (MA)]
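The two-level mapping (VA -> PA maintained by the guest, PA -> MA maintained by the VMM) can be sketched by composing two page tables. The dictionaries, page size, and addresses are made up for the example.

```python
# Illustrative two-level address translation: a full translation
# composes the guest's VA->PA mapping with the VMM's PA->MA mapping.
PAGE = 4096

guest_pt = {0x0: 0x5000}      # guest virtual page -> guest physical page
vmm_pmap = {0x5000: 0x9000}   # guest physical page -> machine page

def translate(va):
    """Compose VA->PA (guest) and PA->MA (VMM) for one address."""
    page, off = va & ~(PAGE - 1), va & (PAGE - 1)
    pa = guest_pt[page] | off                  # guest's translation
    ma = vmm_pmap[pa & ~(PAGE - 1)] | off      # VMM's translation
    return pa, ma

pa, ma = translate(0x0123)
```

Doing this composition on every memory access would be far too slow, which motivates the shadow page tables on the next slide.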
Virtualizing Virtual Memory: Shadow Page Tables
• VMM builds "shadow page tables" to accelerate the mappings
– Shadow directly maps VA -> MA
– Avoids doing two levels of translation on every access
– TLB caches the VA -> MA mapping
– Leverages the hardware walker for TLB fills (walking the shadows)
– When the guest changes VA -> PA, the VMM updates the shadow page tables
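The shadow idea can be sketched the same way. The `ShadowMMU` class is hypothetical, not ESX code: the point is that the VMM mirrors every guest VA -> PA update into a direct VA -> MA table, so a lookup needs only one level.

```python
# Illustrative shadow page table: the VMM keeps a precomputed VA->MA
# table so the hardware TLB can be filled with one lookup instead of
# composing two translations on every access.

class ShadowMMU:
    def __init__(self, pa_to_ma):
        self.pa_to_ma = pa_to_ma   # VMM's guest-physical -> machine map
        self.guest_pt = {}         # guest's VA -> PA (guest-visible)
        self.shadow = {}           # VMM's VA -> MA (what hardware walks)

    def guest_map(self, va, pa):
        """Guest updates VA->PA; the VMM mirrors it into the shadow."""
        self.guest_pt[va] = pa
        self.shadow[va] = self.pa_to_ma[pa]

    def lookup(self, va):
        return self.shadow[va]     # single-level VA->MA translation

mmu = ShadowMMU({0x5000: 0x9000})
mmu.guest_map(0x0, 0x5000)
```

The cost moves from every access to every guest page-table update, which is exactly the trade-off the BT-vs-hardware comparison slide calls out.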
2nd Generation Hardware Assist: Nested/Extended Page Tables
[Diagram: the TLB fill hardware walks both the guest's page table (VA to PA mapping, guest PT pointer) and the VMM's nested page table (PA to MA mapping, nested PT pointer) to fill the TLB with a direct VA to MA translation]
Hardware-assisted Memory Virtualization
[Chart: efficiency improvement of up to ~60% for the Apache Compile, SQL Server, and Citrix XenApp workloads]
vSphere Virtualization Architecture
[Diagram: each guest runs on its own monitor directly on vSphere; the VMkernel provides the memory allocator, scheduler, virtual switch, file system, and native NIC/I/O drivers on the physical hardware; each guest's virtual NIC and virtual SCSI are serviced by the kernel's TCP/IP stack and file system]
• Virtual CPU abstraction is created by the "monitor"
• Each VM is an OS process
• Monitor supports: BT (Binary Translation), HW (Hardware assist), PV (Paravirtualization)
• Memory is allocated by the VMkernel and virtualized by the monitor
• Network and I/O devices are emulated and proxied through native device drivers
Performance: Ability to Satisfy Performance Demands
From the general population of apps (ESX 2.x) to 100% of mission-critical apps (vSphere 4.0):

ESX 2.x (2003): Overhead 30-60%; VCPUs 2; VM RAM 3.6 GB; Phys RAM 64 GB; PCPUs 16 cores; IOPS <10,000; N/W 380 Mb/s; Monitor type: Binary Translation

VI 3.0 (2005): Overhead 20-40%; VCPUs 2; VM RAM 16 GB; Phys RAM 64 GB; PCPUs 16 cores; IOPS 10,000; N/W 800 Mb/s; Gen-1 HW virtualization; Monitor type: VT / SVM

VI 3.5 (2007): Overhead 10-30%; VCPUs 4; VM RAM 64 GB; Phys RAM 256 GB; PCPUs 64 cores; IOPS 100,000; N/W 9 Gb/s; 64-bit OS support; Gen-2 HW virtualization; Monitor type: NPT

vSphere 4.0 (2009): Overhead 2-15%; VCPUs 8; VM RAM 255 GB; Phys RAM 1 TB; PCPUs 64 cores; IOPS 350,000; N/W 28 Gb/s; 64-bit OS support; 320 VMs per host; 512 vCPUs per host; Monitor type: EPT
High Throughput Web Workloads (SPECweb)
Overall response time is lower when CPU utilization is less than 100%, due to multi-core offload.
>95% of All Databases fit in a Virtual Machine
CPUs and Scheduling
[Diagram: guest vCPUs run atop per-VM monitors; the VMkernel scheduler places virtual CPUs on physical CPUs]
• Schedules virtual CPUs on physical CPUs
• Virtual-time-based proportional-share CPU scheduler
• Flexible and accurate rate-based controls over CPU time allocations
• NUMA/processor/cache topology aware
• Provides graceful degradation in over-commitment situations
• High scalability with low scheduling latencies
• Fine-grain built-in accounting for workload observability
• Support for VSMP virtual machines
VM Scheduling: How will multiple VMs operate?
• VM states: running (%used), waiting (%twait), ready to run (%ready)
• A VM goes to the "ready to run" state when:
– The guest wants to run or needs to be woken up (to deliver an interrupt)
– But all available CPU is running other VMs
[Diagram: state transitions between Run, Ready, and Wait]
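The run/ready/wait states can be modeled as a tiny state machine. The event names here are invented for illustration; only the states come from the slide.

```python
# Toy model of the three VM scheduler states. A VM that wants to run
# moves to "ready"; it becomes "running" only when the scheduler gives
# it a physical CPU; it blocks back into "wait" when idle.

TRANSITIONS = {
    ("wait", "wakeup"): "ready",       # guest wants to run / interrupt pending
    ("ready", "dispatch"): "running",  # scheduler places it on a pCPU
    ("running", "preempt"): "ready",   # all pCPUs busy with other VMs
    ("running", "block"): "wait",      # guest idles or waits for I/O
}

def step(state, event):
    """Apply one scheduler event; unknown transitions leave state alone."""
    return TRANSITIONS.get((state, event), state)

state = "wait"
for event in ["wakeup", "dispatch", "preempt", "dispatch", "block"]:
    state = step(state, event)
```

Time accumulated in "ready" is exactly the %ready counter the later VI Client slides use to diagnose CPU contention.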
Resource Controls: Performance SLA
• Reservation
– Minimum service level guarantee (in MHz)
– Applies even when the system is overcommitted
– Must pass admission control
• Shares
– CPU entitlement is directly proportional to the VM's shares and depends on the total number of shares issued
– Abstract number; only the ratio matters
• Limit
– Absolute upper bound on CPU entitlement (in MHz)
– Applies even when the system is not overcommitted
[Diagram: entitlement ranges from 0 MHz to total MHz; shares apply between the reservation floor and the limit ceiling]
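A naive sketch of how the three controls might combine. This is illustrative math only, not the actual ESX scheduler (which, among other things, redistributes capacity freed up by clamping); the MHz values are made up.

```python
# Illustrative entitlement calculation: shares set the proportional
# split of CPU capacity, then reservation (floor) and limit (ceiling)
# clamp each VM's entitlement. (Simplified; not the real scheduler.)

def entitlements(total_mhz, vms):
    """vms: list of dicts with 'shares', 'reservation', 'limit' (MHz)."""
    total_shares = sum(vm["shares"] for vm in vms)
    out = []
    for vm in vms:
        share = total_mhz * vm["shares"] / total_shares  # ratio matters
        out.append(min(max(share, vm["reservation"]), vm["limit"]))
    return out

vms = [
    {"shares": 2000, "reservation": 500,  "limit": 3000},  # capped by limit
    {"shares": 1000, "reservation": 1000, "limit": 6000},  # gets its share
]
result = entitlements(6000, vms)
```

With 6000 MHz total, the first VM's 2/3 share (4000 MHz) is clamped to its 3000 MHz limit, while the second keeps its proportional 2000 MHz.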
vSphere Memory Management
[Diagram: undercommitted (thin provisioned) case, where two 1 GB VMs whose guests actively use only a few hundred MB each fit comfortably, so 2 GB of VMs on a 1 GB host is OK; overcommitted case, where both guests use their full 1 GB and the hypervisor must resort to paging and swapping to disk]
Virtual Memory
[Diagram: the guest maps "virtual" memory to "physical" memory; the hypervisor maps "physical" memory to "machine" memory]
Application Memory Management
– Starts with no memory
– Allocates memory through syscalls to the operating system
– Often frees memory voluntarily through syscalls
– Explicit memory allocation interface with the operating system
Operating System Memory Management
– Assumes it owns all physical memory
– No memory allocation interface with hardware
• Does not explicitly allocate or free physical memory
– Defines semantics of “allocated” and “free” memory
• Maintains “free” list and “allocated” lists of physical memory
• Memory is “free” or “allocated” depending on which list it resides
Hypervisor Memory Management
– Very similar to operating system memory management
• Assumes it owns all machine memory
• No memory allocation interface with hardware
• Maintains lists of “free” and “allocated” memory
VM Memory Allocation
– VM starts with no physical memory allocated to it
– Physical memory allocated on demand
• Guest OS will not explicitly allocate
• Allocate on first VM access to memory (read or write)
VM Memory Reclamation
• Guest physical memory is not "freed" in the typical sense
– Guest OS moves memory to its "free" list
– Data in "freed" memory may not have been modified
• Hypervisor isn't aware when the guest frees memory
– Freed memory state is unchanged
– No access to the guest's "free" list
– Unsure when to reclaim "freed" guest memory
VM Memory Reclamation Cont’d
• Inside the VM, the guest OS allocates and frees… and allocates and frees… and allocates and frees…
• From the hypervisor's view, the VM just allocates… and allocates… and allocates…
The hypervisor needs some way of reclaiming memory!
Ballooning
[Diagram: a balloon driver inside the guest OS]
• Inflate balloon (+ pressure): the guest OS may free buffers or page out to its virtual disk
• Deflate balloon (– pressure): the guest OS may grow buffers or page in from its virtual disk
• The guest OS manages its own memory: implicit cooperation with the hypervisor
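Ballooning can be sketched as a toy model. The `GuestVM` class is hypothetical and the page counts are arbitrary; the point is that inflation makes the guest surrender pages it considers free, which the hypervisor can then hand to other VMs.

```python
# Toy sketch of ballooning: inflating pins pages inside the guest so
# the hypervisor can reclaim the backing machine memory; deflating
# returns them. (Illustrative model, not a real balloon driver.)

class GuestVM:
    def __init__(self, total_pages):
        self.total = total_pages
        self.balloon = 0            # pages pinned by the balloon driver

    def usable_pages(self):
        """Memory the guest OS can actually use for itself."""
        return self.total - self.balloon

    def inflate(self, pages):
        # The guest allocates pages to the balloon driver, freeing
        # buffers or paging to its virtual disk if under pressure.
        self.balloon += pages
        return pages                # hypervisor reclaims these

    def deflate(self, pages):
        pages = min(pages, self.balloon)
        self.balloon -= pages
        return pages                # hypervisor gives these back

vm = GuestVM(total_pages=1024)
reclaimed = vm.inflate(256)
```

The key property the slide emphasizes: the guest's own memory manager decides which pages to give up, so the hypervisor never has to guess what is free.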
Java Memory Management (Hotspot)
[Chart: Java heap usage over time, bounded by the JVM heap size (-Xmx=); VM memory usage; garbage collection]
VMware ESX and Java Memory Management Combined
[Charts: Java heap usage without reservations vs. with a VM reservation, shown relative to the VM config size, JVM heap size (-Xmx=), and VM usage; the reservation and limit bound the VM's memory between 0 MB and total MB]
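A back-of-envelope sizing helper for the idea above: set the VM's memory reservation high enough to cover the JVM heap plus JVM and guest-OS overhead, so the Java heap is never ballooned or swapped by the hypervisor. The overhead numbers are assumptions for illustration, not VMware or JVM guidance.

```python
# Illustrative VM memory reservation sizing for a Java workload:
# reservation >= JVM heap (-Xmx) + JVM overhead + guest OS overhead.
# The default overhead values below are invented for this sketch.

def vm_reservation_mb(heap_mb, jvm_overhead_mb=256, os_overhead_mb=512):
    """heap_mb is the -Xmx value in MB; overheads are assumed defaults."""
    return heap_mb + jvm_overhead_mb + os_overhead_mb

reservation = vm_reservation_mb(2048)   # for a -Xmx2048m JVM
```

Undersizing the reservation risks the hypervisor reclaiming memory that the JVM's garbage collector will touch again, which is far more painful for Java than for most workloads.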
Performance Measurement in a Virtual World: Traditionally, the OS was the Authority
The operating system performs various roles:
– Application runtime libraries
– Resource management (CPU, memory, etc.)
– Hardware + driver management
• Performance & scalability of the OS was paramount
• Performance observability tools are a feature of the OS
Performance Measurement in a Virtual World The OS becomes the “Application Library”, and the Hypervisor becomes the authority
Important Notes about Measuring Performance
• Resources measured from within the guest OS may not be accurate
– The OS is sharing physical resources with others
– CPU utilization is often under-reported (some CPU time is stolen by other guest OSes)
• Time measurements
– Coarse-grained time measurements are correct (if VMware tools are installed/enabled)
– Fine-grained measurements are subject to jitter (don't try to measure sub-millisecond response times without special tools)
– CPU steals add latency to non-CPU measured events (e.g. I/O response times)
Tools for Performance Analysis
• Guest Tools: vmstat, mpstat, management tools • VirtualCenter client (VI client):
– Per-host and per-cluster stats – Graphical Interface – Historical and Real-time data
• esxtop: per-host statistics – Command-line tool found in the console-OS
• Java SDK – Allows you to collect only the statistics you want
Potential Impacts to Performance
• Virtual machine contributors to latency: – CPU overhead can contribute to latency (but it's small!) – Scheduling latency (VM runnable, but waiting…) – Waiting for a global memory paging operation – Disk reads/writes taking longer
• Virtual machine impacts to Throughput: – Throughput ceiling if not enough resources allocated – Throughput ceiling if not enough virtual CPU/Mem allocated
vSphere Instrumentation Points
[Diagram: statistics are exposed at every layer and viewed through the VI Client: guest (vCPU, vNIC, virtual disk, TCP/IP, file system), monitor, VMkernel (scheduler, memory allocator, virtual switch, file system, NIC and I/O drivers), service console, and physical hardware (pCPU, pNIC, VMHBA/HBA, physical disk)]
VI Client
[Screenshot: chart options include real-time vs. historical data, rollup, stats type, object, counter type, and chart type]
CPU Capacity
[Screenshot from VI Client: two charts of used time vs. ready time, one where ready time < used time and one where ready time ~ used time]
Some caveats on ready time:
– Used time ~ ready time may signal contention; however, the host might not be overcommitted, due to workload variability
– In this example there are periods of activity and idle periods: the CPU isn't overcommitted all the time
esxtop: What is esxtop?
• Performance troubleshooting tool for an ESX host
• Displays performance statistics in row-and-column format, organized into fields

Performance Summary
• Use vSphere rather than Workstation/Fusion for any performance testing
– Better performance from scheduling, I/O, large pages, etc.
• vSphere will provide near-native performance
– Ensure resources are available (under-commit or use controls)
– If I/O-intensive, ensure shared storage is configured with enough capacity
– Ensure VMware tools are installed
• Use the correct performance instrumentation
– vSphere or esxtop
Q&A
Recommended