Chicago, October 19 - 22, 2010
Virtualization Technical Deep Dive: Key Concepts for Developers
Richard McDougall - VMware
SpringOne 2GX 2010. All rights reserved. Do not distribute without permission.
Virtualization Technical Deep Dive
We’ll be covering
• Virtualization Capabilities
• Workstation Virtualization
• How virtual machines work and what the overhead is
• How Server Virtualization/Consolidation works
• Java and Consolidation on Server Virtualization
What is Virtualization?
Three Properties of Virtualization

Partitioning
• Run multiple operating systems on one physical machine
• Fully utilize server resources
• Support high availability by clustering virtual machines

Encapsulation
• Encapsulate the entire state of the virtual machine in hardware-independent files
• Save the virtual machine state as a snapshot in time
• Re-use or transfer whole virtual machines with a simple file copy

Isolation
• Isolate faults and security at the virtual-machine level
• Dynamically control CPU, memory, disk and network resources per virtual machine
• Guarantee service levels
Virtualization for Desktops/Laptops
• Desktop products – VMware Fusion and Workstation
• Features for Developers – Run multiple OS versions concurrently – Test Server applications on your desktop/laptop – Leverage the record/replay capability for debug
Virtualization for Servers: Problem: Underutilized Servers
Consolidation targets are often <30% utilized:
• Windows average utilization: 5-8%
• Linux/Unix average: 10-35%
Initial Virtualization Benefits: Consolidation
            BEFORE VMware                AFTER VMware
Servers     1,000                        80
Storage     Direct attach                Tiered SAN and NAS
Network     3,000 cables/ports           400 cables/ports
Facilities  200 racks, 400 power whips   10 racks, 20 power whips
Next Benefit: Simpler Management VMotion Technology
VMotion Technology moves running virtual machines from one host to another while maintaining continuous service availability
- Enables Resource Pools - Enables High Availability
Pooling of Resources
Pools replace hosts as the primary compute abstraction.
[Diagram: vCenter/DRS turns an imbalanced cluster (one host under heavy load, another under lighter load) into a balanced cluster: an automated pool of resources]
DRS Scalability – Transactions per minute (higher is better)
An already-balanced cluster sees fewer gains; gains are higher (>40%) with more imbalance.
VIRTUALIZATION TECHNOLOGY
“Hosted” vs vSphere Virtualization Architecture
[Diagram: VMware Fusion and Workstation run guests on top of a host operating system (Linux, Windows, Mac OS X) on the physical hardware; VMware vSphere (server virtualization) runs guests directly on the hypervisor on the physical hardware]
“Hosted” Virtualization Architecture
[Diagram: each guest runs on its own monitor inside an OS process on the host operating system; the guest's virtual NIC and virtual SCSI devices are serviced by the host's TCP/IP stack and local file system (e.g. mydisk.vmdk) through native NIC and I/O drivers on the physical hardware]
Virtual CPU abstraction is created by “monitor”
Each VM is an OS process
Monitor supports: BT (Binary Translation) HW (Hardware assist) PV (Paravirtualization)
Memory is allocated by the OS and virtualized by the monitor
Network and I/O devices are emulated and proxied through native device drivers
OS Process

rmc$ ps -fp 4295
UID  PID   PPID  C  STIME     TTY  TIME      CMD
0    4295  1     0  18:15.66  ??   21:05.14  /Library/Application Support/VMware Fusion/vmware-vmx /Users/rmc/Documents/Virtual Machines/Windows XP Pro.vmwarevm/Windows XP Pro.vmx

rmc$ more Windows XP Pro.vmx
virtualHW.version = "7"
memsize = "776"
ide0:0.fileName = "Windows XP Professional.vmdk"
ethernet0.connectionType = "nat"
Inside the Monitor: Classical Instruction Virtualization Trap-and-emulate
Nonvirtualized (“native”) system – OS runs in privileged mode – OS “owns” the hardware – Application code has less privilege
Virtualized – VMM most privileged (for isolation) – Classical “ring compression” or “de-privileging”
• Run guest OS kernel in Ring 1 • Privileged instructions trap; emulated by VMM
– But: does not work for x86 (lack of traps)
[Diagram: native system has apps in Ring 3 and the OS in Ring 0; in the virtualized system, apps stay in Ring 3, the guest OS is de-privileged into Ring 1, and the VMM runs in Ring 0]
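The trap-and-emulate idea can be sketched as a toy model. This is a hedged illustration only: the instruction names, the `VirtualCPU` fields, and the `vmm_emulate` helper are invented for this sketch, not real x86 semantics or VMM code.

```python
# Toy sketch of classical trap-and-emulate: the de-privileged guest's
# privileged instructions trap to the VMM, which emulates them against
# a virtual CPU state. (Illustrative model, not real x86.)

PRIVILEGED = {"cli", "sti", "mov_to_cr3"}

class VirtualCPU:
    def __init__(self):
        self.interrupts_enabled = True
        self.cr3 = 0  # guest page-table root (invented field)

def vmm_emulate(vcpu, instr, operand=None):
    """Emulate a trapped privileged instruction on the virtual CPU."""
    if instr == "cli":
        vcpu.interrupts_enabled = False
    elif instr == "sti":
        vcpu.interrupts_enabled = True
    elif instr == "mov_to_cr3":
        vcpu.cr3 = operand

def run_guest(vcpu, instructions):
    """Guest runs de-privileged; privileged instructions trap to the VMM."""
    for instr, operand in instructions:
        if instr in PRIVILEGED:
            vmm_emulate(vcpu, instr, operand)  # trap -> VMM emulates
        # non-privileged instructions would execute directly on the CPU

vcpu = VirtualCPU()
run_guest(vcpu, [("cli", None), ("mov_to_cr3", 0x1000), ("sti", None)])
```

The catch the slide notes: classical x86 had privileged-state instructions that silently misbehave instead of trapping, which is why VMware needed binary translation.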
Binary Translation of Guest Code
• Translate guest kernel code
• Replace privileged instructions with safe "equivalent" instruction sequences
• No need for traps
BT is an extremely powerful technology
– Permits any unmodified x86 OS to run in a VM
– Can virtualize any instruction set
Combining BT and Direct Execution
[Diagram: user-mode guest code runs via direct execution; kernel-mode guest code runs via binary translation; the VMM switches between them on faults, syscalls, and interrupts (entering BT) and IRET/sysret (returning to direct execution)]
BT Mechanics
Each translator invocation:
– Consumes one input basic block (guest code)
– Produces one output basic block
Output is stored in the translation cache:
– Future reuse
– Amortizes translation costs
– Guest-transparent: no patching "in place"
[Diagram: the translator consumes an input basic block from the guest and emits a translated basic block into the translation cache]
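The cache behavior can be sketched like this. Illustrative only: the `translate` function and its way of tagging privileged instructions are invented, not VMware's translator.

```python
# Toy binary-translation cache: translate each guest basic block once,
# keyed by guest address, and reuse the result on later invocations.

translation_cache = {}
translations_done = 0

def translate(guest_pc, guest_block):
    """Translate one guest basic block, amortizing cost via the cache."""
    global translations_done
    if guest_pc not in translation_cache:
        translations_done += 1
        # "Translation" here just rewrites privileged instructions into
        # safe emulation calls, as the slide describes.
        translation_cache[guest_pc] = [
            f"emulate({i})" if i.startswith("priv_") else i
            for i in guest_block
        ]
    return translation_cache[guest_pc]

block = ["mov", "priv_cli", "add"]
first = translate(0x400, block)
second = translate(0x400, block)   # cache hit: no re-translation
```

The second call is a cache hit, which is where the amortization comes from: hot guest code pays the translation cost once.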
Intel VT/ AMD-V: 1st Generation HW Support
• Key feature: root vs. guest CPU mode
– VMM executes in root mode
– Guest (OS, apps) execute in guest mode
• VMM and Guest run as “co-routines”
– VM enter
– Guest runs
– A while later: VM exit
– VMM runs
– ...
[Diagram: in root mode the VMM runs in Ring 0; in guest mode the guest OS runs in Ring 0 and apps in Ring 3; VM enter transitions root to guest, VM exit transitions guest back to root]
Qualitative Comparison of BT and VT-x/AMD-V
• VT-x/AMD-V loses on:
– exits (costlier than "callouts")
– no adaptation (cannot eliminate exits)
– page table updates
– memory-mapped I/O
– IN/OUT instructions
• VT-x/AMD-V wins on:
– system calls
– almost all code runs "directly"
• BT loses on:
– system calls
– translator overheads
– path lengthening
– indirect control flow
• BT wins on:
– page table updates (adaptation)
– memory-mapped I/O (adaptation)
– IN/OUT instructions
– no traps for privileged instructions
Can I Virtualize CPU-Intensive Applications?
Most CPU-intensive applications have very low overhead. VMware ESX 3.x compared to native:
– SPECcpu results covered in the paper by O. Agesen and K. Adams
– WebSphere results published jointly by IBM/VMware
– SPECjbb results from recent internal measurements
Virtualizing Virtual Memory
• To run multiple VMs on a single system, another level of memory virtualization must be done
– Guest OS still controls the virtual-to-physical mapping: VA -> PA
– Guest OS has no direct access to machine memory (to enforce isolation)
• VMM maps guest physical memory to actual machine memory: PA -> MA
[Diagram: in each VM, per-process virtual memory (VA) maps to guest physical memory (PA), which the VMM in turn maps to machine memory (MA)]
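The two-level mapping (VA -> PA maintained by the guest, PA -> MA maintained by the VMM) can be sketched by composing two page tables. The dictionaries, page size, and addresses are made up for the example.

```python
# Illustrative two-level address translation: a full translation
# composes the guest's VA->PA mapping with the VMM's PA->MA mapping.
PAGE = 4096

guest_pt = {0x0: 0x5000}      # guest virtual page -> guest physical page
vmm_pmap = {0x5000: 0x9000}   # guest physical page -> machine page

def translate(va):
    """Compose VA->PA (guest) and PA->MA (VMM) for one address."""
    page, off = va & ~(PAGE - 1), va & (PAGE - 1)
    pa = guest_pt[page] | off                  # guest's translation
    ma = vmm_pmap[pa & ~(PAGE - 1)] | off      # VMM's translation
    return pa, ma

pa, ma = translate(0x0123)
```

Doing this composition on every memory access would be far too slow, which motivates the shadow page tables on the next slide.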
Virtualizing Virtual Memory: Shadow Page Tables
• VMM builds "shadow page tables" to accelerate the mappings
– Shadow directly maps VA -> MA
– Avoids doing two levels of translation on every access
– TLB caches the VA -> MA mapping
– Leverages the hardware walker for TLB fills (walking the shadows)
– When the guest changes VA -> PA, the VMM updates the shadow page tables
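The shadow idea can be sketched the same way. The `ShadowMMU` class is hypothetical, not ESX code: the point is that the VMM mirrors every guest VA -> PA update into a direct VA -> MA table, so a lookup needs only one level.

```python
# Illustrative shadow page table: the VMM keeps a precomputed VA->MA
# table so the hardware TLB can be filled with one lookup instead of
# composing two translations on every access.

class ShadowMMU:
    def __init__(self, pa_to_ma):
        self.pa_to_ma = pa_to_ma   # VMM's guest-physical -> machine map
        self.guest_pt = {}         # guest's VA -> PA (guest-visible)
        self.shadow = {}           # VMM's VA -> MA (what hardware walks)

    def guest_map(self, va, pa):
        """Guest updates VA->PA; the VMM mirrors it into the shadow."""
        self.guest_pt[va] = pa
        self.shadow[va] = self.pa_to_ma[pa]

    def lookup(self, va):
        return self.shadow[va]     # single-level VA->MA translation

mmu = ShadowMMU({0x5000: 0x9000})
mmu.guest_map(0x0, 0x5000)
```

The cost moves from every access to every guest page-table update, which is exactly the trade-off the BT-vs-hardware comparison slide calls out.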
2nd Generation Hardware Assist: Nested/Extended Page Tables
[Diagram: the TLB fill hardware walks both the guest's page table (VA to PA mapping, guest PT pointer) and the VMM's nested page table (PA to MA mapping, nested PT pointer) to fill the TLB with a direct VA to MA translation]
Hardware-assisted Memory Virtualization
[Chart: efficiency improvement of up to ~60% for the Apache Compile, SQL Server, and Citrix XenApp workloads]
vSphere Virtualization Architecture
[Diagram: each guest runs on its own monitor directly on vSphere; the VMkernel provides the memory allocator, scheduler, virtual switch, file system, and native NIC/I/O drivers on the physical hardware; each guest's virtual NIC and virtual SCSI are serviced by the kernel's TCP/IP stack and file system]
• Virtual CPU abstraction is created by the "monitor"
• Each VM is an OS process
• Monitor supports: BT (Binary Translation), HW (Hardware assist), PV (Paravirtualization)
• Memory is allocated by the VMkernel and virtualized by the monitor
• Network and I/O devices are emulated and proxied through native device drivers
Performance: Ability to Satisfy Performance Demands
From the general population of apps (ESX 2.x) to 100% of mission-critical apps (vSphere 4.0):

ESX 2.x (2003): Overhead 30-60%; VCPUs 2; VM RAM 3.6 GB; Phys RAM 64 GB; PCPUs 16 cores; IOPS <10,000; N/W 380 Mb/s; Monitor type: Binary Translation

VI 3.0 (2005): Overhead 20-40%; VCPUs 2; VM RAM 16 GB; Phys RAM 64 GB; PCPUs 16 cores; IOPS 10,000; N/W 800 Mb/s; Gen-1 HW virtualization; Monitor type: VT / SVM

VI 3.5 (2007): Overhead 10-30%; VCPUs 4; VM RAM 64 GB; Phys RAM 256 GB; PCPUs 64 cores; IOPS 100,000; N/W 9 Gb/s; 64-bit OS support; Gen-2 HW virtualization; Monitor type: NPT

vSphere 4.0 (2009): Overhead 2-15%; VCPUs 8; VM RAM 255 GB; Phys RAM 1 TB; PCPUs 64 cores; IOPS 350,000; N/W 28 Gb/s; 64-bit OS support; 320 VMs per host; 512 vCPUs per host; Monitor type: EPT
High Throughput Web Workloads (SPECweb)
Overall response time is lower when CPU utilization is less than 100%, due to multi-core offload.
>95% of All Databases fit in a Virtual Machine
CPUs and Scheduling
[Diagram: guest vCPUs run atop per-VM monitors; the VMkernel scheduler places virtual CPUs on physical CPUs]
• Schedules virtual CPUs on physical CPUs
• Virtual-time-based proportional-share CPU scheduler
• Flexible and accurate rate-based controls over CPU time allocations
• NUMA/processor/cache topology aware
• Provides graceful degradation in over-commitment situations
• High scalability with low scheduling latencies
• Fine-grain built-in accounting for workload observability
• Support for VSMP virtual machines
VM Scheduling: How will multiple VMs operate?
• VM states: running (%used), waiting (%twait), ready to run (%ready)
• A VM goes to the "ready to run" state when:
– The guest wants to run or needs to be woken up (to deliver an interrupt)
– But all available CPU is running other VMs
[Diagram: state transitions between Run, Ready, and Wait]
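The run/ready/wait states can be modeled as a tiny state machine. The event names here are invented for illustration; only the states come from the slide.

```python
# Toy model of the three VM scheduler states. A VM that wants to run
# moves to "ready"; it becomes "running" only when the scheduler gives
# it a physical CPU; it blocks back into "wait" when idle.

TRANSITIONS = {
    ("wait", "wakeup"): "ready",       # guest wants to run / interrupt pending
    ("ready", "dispatch"): "running",  # scheduler places it on a pCPU
    ("running", "preempt"): "ready",   # all pCPUs busy with other VMs
    ("running", "block"): "wait",      # guest idles or waits for I/O
}

def step(state, event):
    """Apply one scheduler event; unknown transitions leave state alone."""
    return TRANSITIONS.get((state, event), state)

state = "wait"
for event in ["wakeup", "dispatch", "preempt", "dispatch", "block"]:
    state = step(state, event)
```

Time accumulated in "ready" is exactly the %ready counter the later VI Client slides use to diagnose CPU contention.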
Resource Controls: Performance SLA
• Reservation
– Minimum service level guarantee (in MHz)
– Applies even when the system is overcommitted
– Must pass admission control
• Shares
– CPU entitlement is directly proportional to the VM's shares and depends on the total number of shares issued
– Abstract number; only the ratio matters
• Limit
– Absolute upper bound on CPU entitlement (in MHz)
– Applies even when the system is not overcommitted
[Diagram: entitlement ranges from 0 MHz to total MHz; shares apply between the reservation floor and the limit ceiling]
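A naive sketch of how the three controls might combine. This is illustrative math only, not the actual ESX scheduler (which, among other things, redistributes capacity freed up by clamping); the MHz values are made up.

```python
# Illustrative entitlement calculation: shares set the proportional
# split of CPU capacity, then reservation (floor) and limit (ceiling)
# clamp each VM's entitlement. (Simplified; not the real scheduler.)

def entitlements(total_mhz, vms):
    """vms: list of dicts with 'shares', 'reservation', 'limit' (MHz)."""
    total_shares = sum(vm["shares"] for vm in vms)
    out = []
    for vm in vms:
        share = total_mhz * vm["shares"] / total_shares  # ratio matters
        out.append(min(max(share, vm["reservation"]), vm["limit"]))
    return out

vms = [
    {"shares": 2000, "reservation": 500,  "limit": 3000},  # capped by limit
    {"shares": 1000, "reservation": 1000, "limit": 6000},  # gets its share
]
result = entitlements(6000, vms)
```

With 6000 MHz total, the first VM's 2/3 share (4000 MHz) is clamped to its 3000 MHz limit, while the second keeps its proportional 2000 MHz.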
vSphere Memory Management
[Diagram: undercommitted (thin provisioned) case, where two 1 GB VMs whose guests actively use only a few hundred MB each fit comfortably, so 2 GB of VMs on a 1 GB host is OK; overcommitted case, where both guests use their full 1 GB and the hypervisor must resort to paging and swapping to disk]
Virtual Memory
[Diagram: the guest maps "virtual" memory to "physical" memory; the hypervisor maps "physical" memory to "machine" memory]
Application Memory Management
– Starts with no memory
– Allocates memory through syscalls to the operating system
– Often frees memory voluntarily through syscalls
– Explicit memory allocation interface with the operating system
Operating System Memory Management
– Assumes it owns all physical memory
– No memory allocation interface with hardware
• Does not explicitly allocate or free physical memory
– Defines semantics of “allocated” and “free” memory
• Maintains “free” list and “allocated” lists of physical memory
• Memory is “free” or “allocated” depending on which list it resides
Hypervisor Memory Management
– Very similar to operating system memory management
• Assumes it owns all machine memory
• No memory allocation interface with hardware
• Maintains lists of “free” and “allocated” memory
VM Memory Allocation
– VM starts with no physical memory allocated to it
– Physical memory allocated on demand
• Guest OS will not explicitly allocate
• Allocate on first VM access to memory (read or write)
VM Memory Reclamation
• Guest physical memory is not "freed" in the typical sense
– Guest OS moves memory to its "free" list
– Data in "freed" memory may not have been modified
• Hypervisor isn't aware when the guest frees memory
– Freed memory state is unchanged
– No access to the guest's "free" list
– Unsure when to reclaim "freed" guest memory
VM Memory Reclamation Cont’d
• Inside the VM, the guest OS allocates and frees… and allocates and frees… and allocates and frees…
• From the hypervisor's view, the VM just allocates… and allocates… and allocates…
The hypervisor needs some way of reclaiming memory!
Ballooning
[Diagram: a balloon driver inside the guest OS]
• Inflate balloon (+ pressure): the guest OS may free buffers or page out to its virtual disk
• Deflate balloon (– pressure): the guest OS may grow buffers or page in from its virtual disk
• The guest OS manages its own memory: implicit cooperation with the hypervisor
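Ballooning can be sketched as a toy model. The `GuestVM` class is hypothetical and the page counts are arbitrary; the point is that inflation makes the guest surrender pages it considers free, which the hypervisor can then hand to other VMs.

```python
# Toy sketch of ballooning: inflating pins pages inside the guest so
# the hypervisor can reclaim the backing machine memory; deflating
# returns them. (Illustrative model, not a real balloon driver.)

class GuestVM:
    def __init__(self, total_pages):
        self.total = total_pages
        self.balloon = 0            # pages pinned by the balloon driver

    def usable_pages(self):
        """Memory the guest OS can actually use for itself."""
        return self.total - self.balloon

    def inflate(self, pages):
        # The guest allocates pages to the balloon driver, freeing
        # buffers or paging to its virtual disk if under pressure.
        self.balloon += pages
        return pages                # hypervisor reclaims these

    def deflate(self, pages):
        pages = min(pages, self.balloon)
        self.balloon -= pages
        return pages                # hypervisor gives these back

vm = GuestVM(total_pages=1024)
reclaimed = vm.inflate(256)
```

The key property the slide emphasizes: the guest's own memory manager decides which pages to give up, so the hypervisor never has to guess what is free.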
Java Memory Management (Hotspot)
[Chart: Java heap usage over time, bounded by the JVM heap size (-Xmx=); VM memory usage; garbage collection]
VMware ESX and Java Memory Management Combined
[Charts: Java heap usage without reservations vs. with a VM reservation, shown relative to the VM config size, JVM heap size (-Xmx=), and VM usage; the reservation and limit bound the VM's memory between 0 MB and total MB]
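A back-of-envelope sizing helper for the idea above: set the VM's memory reservation high enough to cover the JVM heap plus JVM and guest-OS overhead, so the Java heap is never ballooned or swapped by the hypervisor. The overhead numbers are assumptions for illustration, not VMware or JVM guidance.

```python
# Illustrative VM memory reservation sizing for a Java workload:
# reservation >= JVM heap (-Xmx) + JVM overhead + guest OS overhead.
# The default overhead values below are invented for this sketch.

def vm_reservation_mb(heap_mb, jvm_overhead_mb=256, os_overhead_mb=512):
    """heap_mb is the -Xmx value in MB; overheads are assumed defaults."""
    return heap_mb + jvm_overhead_mb + os_overhead_mb

reservation = vm_reservation_mb(2048)   # for a -Xmx2048m JVM
```

Undersizing the reservation risks the hypervisor reclaiming memory that the JVM's garbage collector will touch again, which is far more painful for Java than for most workloads.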
Performance Measurement in a Virtual World: Traditionally, the OS was the Authority
The operating system performs various roles:
– Application runtime libraries
– Resource management (CPU, memory, etc.)
– Hardware + driver management
• Performance & scalability of the OS was paramount
• Performance observability tools are a feature of the OS
Performance Measurement in a Virtual World The OS becomes the “Application Library”, and the Hypervisor becomes the authority
Important Notes about Measuring Performance
• Resources measured from within the guest OS may not be accurate
– The OS is sharing physical resources with others
– CPU utilization is often under-reported (some CPU time is stolen by other guest OSes)
• Time measurements
– Coarse-grained time measurements are correct (if VMware tools are installed/enabled)
– Fine-grained measurements are subject to jitter (don't try to measure sub-millisecond response times without special tools)
– CPU steals add latency to non-CPU measured events (e.g. I/O response times)
Tools for Performance Analysis
• Guest Tools: vmstat, mpstat, management tools • VirtualCenter client (VI client):
– Per-host and per-cluster stats – Graphical Interface – Historical and Real-time data
• esxtop: per-host statistics – Command-line tool found in the console-OS
• Java SDK – Allows you to collect only the statistics you want
Potential Impacts to Performance
• Virtual machine contributors to latency: – CPU overhead can contribute to latency (but it's small!) – Scheduling latency (VM runnable, but waiting…) – Waiting for a global memory paging operation – Disk reads/writes taking longer
• Virtual machine impacts to Throughput: – Throughput ceiling if not enough resources allocated – Throughput ceiling if not enough virtual CPU/Mem allocated
vSphere Instrumentation Points
[Diagram: statistics are exposed at every layer and viewed through the VI Client: guest (vCPU, vNIC, virtual disk, TCP/IP, file system), monitor, VMkernel (scheduler, memory allocator, virtual switch, file system, NIC and I/O drivers), service console, and physical hardware (pCPU, pNIC, VMHBA/HBA, physical disk)]
VI Client
[Screenshot: chart options include real-time vs. historical data, rollup, stats type, object, counter type, and chart type]
CPU Capacity
[Screenshot from VI Client: two charts of used time vs. ready time, one where ready time < used time and one where ready time ~ used time]
Some caveats on ready time:
– Used time ~ ready time may signal contention; however, the host might not be overcommitted, due to workload variability
– In this example there are periods of activity and idle periods: the CPU isn't overcommitted all the time
esxtop: What is esxtop?
• Performance troubleshooting tool for an ESX host
• Displays performance statistics in row-and-column format, organized into fields

Performance Summary
• Use vSphere rather than Workstation/Fusion for any performance testing
– Better performance from scheduling, I/O, large pages, etc.
• vSphere will provide near-native performance
– Ensure resources are available (under-commit or use controls)
– If I/O-intensive, ensure shared storage is configured with enough capacity
– Ensure VMware tools are installed
• Use the correct performance instrumentation
– vSphere or esxtop
Q&A
Recommended