OSE 2011– OSE – virtual machines 1
Operating Systems Engineering
Virtual Machines
By Dan Tsafrir, 25/5/2011
What’s a virtual machine?
A VM is a simulation of a full computer
  With its disk & NIC & OS & user-level apps, …
Running as an application
  On some “host” computer
  Simulation is called a “guest”
VMs – requirements
Simulation needs to be accurate
  Emulate HW faithfully, handle weird quirks of kernels & such
  Reproduce bugs exactly
Simulation needs to be isolated
  Guest must not break out of VM
  SW inside guest might be faulty and/or malicious
Simulation needs to be fast
  Well, as fast as possible…
Simulation needs to be believable
  Guest shouldn’t be able to distinguish VM from real computer
  The “blue pill” saga [ http://en.wikipedia.org/wiki/Blue_Pill_(malware) ]
  In reality, if guests can accurately time stuff, they can know
  (And indeed, viruses often refuse to work when virtualized)
VMs – origin
Late 1960s
  IBM used VMs to share mainframes
Late 1990s
  VMware re-popularized VMs (for x86 HW)
  Economic boom: nowadays a multi-billion-dollar business
  Everyone is playing
    SW: Microsoft, IBM, Red Hat, Oracle, …
    HW: Intel, AMD, ARM, IBM, Oracle, …
VMs – why?
For developers & power users
One computer w/ multiple OSes
  My Win 7 laptop also runs Ubuntu
  My MacBook Pro @ home also runs XP (for office)
Kernel development
  Like QEMU, but performs reasonably
VMs – why?
Business case: saves money!
Server consolidation
  Once we had underutilized machines per service…
  Reduces cost of HW, power consumption, cooling
Portability (why should Intel/AMD/IBM care about consolidation?)
  Decouples OS from HW and makes upgrades easy
Increased robustness
  Can back up entire machine + easily restore if HW breaks
  No need to reinstall all SW
  Can isolate important apps in their own VM (safety)
Makes cloud models possible
  Such as Amazon’s EC2 (“Elastic Compute Cloud”)
Certain costly sys-admin chores made much easier
  Provisioning a new machine (just clone ready image)
What’s in a name
SW that runs the show (3 names referring to same thing):
  VMM
    Virtual machine monitor
  Hypervisor
    (Of IBM origin)
    Sometimes denoted “HV”
  ~Host
VMMs
  Citrix Xen, KVM, VMware ESXi, MS Hyper-V, IBM pHyp, …
2 possible settings
  Next 2 slides…
Hosted VMM (“type 2 hypervisor”)
• Like VMware Workstation, Parallels, VirtualBox, QEMU, …
• Typically personal use
Bare metal / native VMM (“type 1 hypervisor”)
• XenServer, VMware ESXi, MS Hyper-V, IBM pHyp, …
• Typically for servers, data centers, clouds
VMM multiplexes HW
Just like an OS…
Divides memory among guests
  Related: de-duplication, ballooning
Time-shares CPU among guests
  Related: notion of VCPU vs. PCPU (can hot-plug)
Simulates per-guest virtual devices
  Disk
  Network, …
Virtualization refinement
Paravirtualization
  Guest OS is aware it is being virtualized
  For performance purposes
  Paravirtualized devices
HW support
  Intel VT-x
  AMD-V
ASSUMING NO HW SUPPORT
How to virtualize x86…
VMs – how?
SW interpretation, instruction by instruction
  Can do it, but much, much too slow
Idea 1: when possible, execute VM’s instructions on real CPU
  Works fine for most instructions (e.g., add %eax, %ebx)
  But what about isolation? (e.g., VM writes outside its memory)
Idea 2: run VMs at CPL=3
  Ordinary instructions work fine
  Writing to %cr3 traps to VMM
    VMM examines guest’s page table
    VMM can manipulate page table if it wants
    Only then set %cr3 and resume VM
This virtualization model is called “trap & emulate”
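The trap & emulate flow for a guest write to %cr3 can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyCPU`, `ToyVMM`, `PrivilegedInstructionTrap` and the placeholder `build_shadow` are hypothetical names, and a Python exception stands in for the hardware trap.

```python
# Toy model of trap & emulate for a guest write to %cr3.
# All names here are hypothetical; an exception stands in for the HW trap.

class PrivilegedInstructionTrap(Exception):
    """Raised when deprivileged guest code attempts a privileged operation."""

    def __init__(self, op, value):
        super().__init__(op)
        self.op = op
        self.value = value


class ToyCPU:
    def __init__(self):
        self.cpl = 3          # the guest always runs at CPL=3
        self.real_cr3 = None  # only the VMM (at CPL=0) may change this

    def write_cr3(self, value):
        if self.cpl != 0:
            raise PrivilegedInstructionTrap("mov-to-cr3", value)
        self.real_cr3 = value


class ToyVMM:
    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_cr3 = None  # what the guest believes cr3 holds

    def run_guest_instruction(self, instr):
        try:
            instr(self.cpu)            # ordinary instructions run directly
        except PrivilegedInstructionTrap as trap:
            self.emulate(trap)         # privileged ones end up in the VMM

    def emulate(self, trap):
        if trap.op == "mov-to-cr3":
            self.virtual_cr3 = trap.value           # remember the guest's view
            shadow_root = self.build_shadow(trap.value)
            self.cpu.cpl = 0                        # VMM code runs privileged
            self.cpu.write_cr3(shadow_root)         # only then set the real cr3
            self.cpu.cpl = 3                        # and resume the guest

    def build_shadow(self, guest_root):
        # Placeholder: a real VMM would examine and rewrite the guest's table.
        return ("shadow", guest_root)
```

After `vmm.run_guest_instruction(lambda c: c.write_cr3(0x5000))`, the guest-visible (virtual) cr3 is 0x5000 while the real cr3 holds whatever the VMM chose to install.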
VMM hides real machine
Virtual vs. real resources
  Virtual vs. real cr3
    Virtual cr3: the VM (thinks it) sets the real cr3
    Real cr3: exclusively managed (= virtualized) by VMM
  Virtual vs. real machine-defined data structures
    Virtual page table: VM thinks it’s real
    Real page table: the real page tables, managed (= virtualized) by VMM
VMM’s job
  Make guest see only virtual machine state
  Completely hide & protect real machine state
Problems
  Trap-&-emulate is tricky on x86
    Not all sensitive instructions trap at CPL=3
  All those traps can be slow…
x86 state we must virtualize
state | reason for hiding it
CPL (low bits of CS) | always 3; guest sometimes expects it to be 0
GDT descriptors | their DPL (descriptor priv level) is 3; guest may expect 0
gdtr | points to “shadow” (real) GDT
IDT descriptors | trap to VMM code, not guest kernel (VMM forwards or fakes interrupts to guest when necessary)
idtr | points to “shadow” (real) IDT
page tables | entries don’t map to expected physical address
cr3 | points to “shadow” page table
IF in EFLAGS | interrupts must always be on when in guest mode
cr0 | can’t allow guest to go into real mode
…
Terminology
Letters
  H = host
  G = guest
  P = physical
  V = virtual
  A = address
Combinations
  GVA = guest virtual address
  GP = guest physical
  HP = host physical
  …
Providing guest with illusion of physical memory (simplistic)
Guest view
  Wants to start at PA=0
  Wants to use all “installed” DRAM
Host’s opposing view
  Must support several guests, they can’t all start at 0
  Must protect one VM’s memory from the others
Idea
  Fake a smaller DRAM size than real DRAM
  Ensure paging is enabled
  Rewrite guest’s PTEs
Providing guest with illusion of physical memory (simplistic)
Example
  VMM allocates the guest phys mem 0x1000000 to 0x2000000
  VMM gets trap if guest changes cr3 (guest @ CPL=3)
  VMM copies guest's page table to "shadow" page table
  While copying, VMM adds 0x1000000 to each PA in shadow table
  VMM checks that each resulting HPA is < 0x2000000
  Must copy the guest's page table
    So guest doesn't see VMM's modifications to PAs
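The copy-and-offset step in this example can be sketched as follows. This is a simplified illustration: the page table is flattened into a hypothetical dict from virtual page address to physical page address, and `build_shadow_table`, `GUEST_BASE` and `GUEST_LIMIT` are names made up for the sketch.

```python
GUEST_BASE = 0x1000000   # start of the host memory given to this guest
GUEST_LIMIT = 0x2000000  # end (exclusive) of that region


def build_shadow_table(guest_table):
    """Copy a flat {virtual page: guest PA} map into a shadow map of HPAs.

    The guest's own table is left untouched, so the guest never sees
    the relocated addresses.
    """
    shadow = {}
    for va, guest_pa in guest_table.items():
        hpa = guest_pa + GUEST_BASE            # relocate into the guest's region
        if not (GUEST_BASE <= hpa < GUEST_LIMIT):
            raise ValueError(f"guest PA {guest_pa:#x} is outside its memory")
        shadow[va] = hpa
    return shadow
```

For instance, a guest mapping of virtual page 0x1000 to PA 0x3000 ends up in the shadow table as 0x1000 => 0x1003000, while a guest PA of 0x1000000 or more is rejected.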
Address translation (reminder)
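[Figure: 4-level page-table walk. A 48bit VA is split into p0, p1, p2, p3 (9 bits each) and a 12-bit offset. CR3 points to the root page-table page; p0, p1, p2 and p3 index successive page-table pages (entries 0…511, intermediate tables labeled Q, W, K) until the walk yields the PA. A 4KB page-table page => 512 PTEs (8B each).]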
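As a reminder of the arithmetic, the split of a 48bit VA into four 9-bit indices and a 12-bit offset can be written out like this (a small sketch; the function name `split_va` is made up for illustration):

```python
def split_va(va):
    """Split a 48-bit virtual address into (p0, p1, p2, p3, offset)."""
    offset = va & 0xFFF        # low 12 bits: byte offset within the 4KB page
    p3 = (va >> 12) & 0x1FF    # 9 bits: index into the last-level table
    p2 = (va >> 21) & 0x1FF
    p1 = (va >> 30) & 0x1FF
    p0 = (va >> 39) & 0x1FF    # 9 most significant bits: index into the root
    return p0, p1, p2, p3, offset
```

Each 9-bit index selects one of the 512 PTEs in a page-table page, so the PTE sits at byte offset `index * 8` within that page.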
Providing guest with illusion of physical memory (realistic)
Host allocates N pages to guest
  No need for them to be contiguous in phys mem
  Host maintains a GPA_to_HPA mapping (say, using a hash)
  GPAs are contiguous
What happens when guest changes cr3
  Assume guest assigns GPA1 to cr3
  A trap will occur and host will gain control
  Host’s goal:
    Generate, on the fly, the shadow page table hierarchy
    From GVA to HPA
    There’s only one such shadow hierarchy at any given time per core
Providing guest with illusion of physical memory (realistic)
The host’s actions
  Saves GPA1 internally
  Allocates brand new zeroed page = root of the shadow hierarchy
    Let base of new page be HPA1
  Assigns HPA1 to cr3
  Resumes guest, which immediately faults on GVA2
    GVA2 = virtual address of 1st fetched instruction of guest
  Takes 9 most significant bits from GVA2
    Assume 48bit VA = 4 levels hierarchy (9bits each) + 4KB page
    8 bytes per PTE
  Computes GPA_to_HPA(GPA1) + (those 9 bits) * 8 = HPA of the guest’s root-level PTE
    That PTE holds the GPA of the guest’s 2nd-level table
  …
Providing guest with illusion of physical memory (realistic)
The host’s actions (cont.)
  …
  Continues like so with next 9bits, repeatedly,
    Until reaching the HPA of the requested page = HPA2
  Now, there needs to be a GVA2=>HPA2 mapping in the shadow hierarchy
  Adds the translation GVA2=>HPA2 to shadow hierarchy
    Starting at HPA1 and allocating the rest of the levels in the hierarchy as needed
  Resumes guest
  Repeats same procedure when next fault occurs
    This continues until all address space is mapped
    Or until next context switch (=> need to start over)
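The fault-driven procedure above can be sketched roughly as follows. The sketch rests on several simplifying assumptions that are not in the slides: host memory is a dict from the HPA of an 8-byte slot to its value, guest PTEs hold a bare page-aligned GPA (no present or permission bits), the shadow hierarchy is a nested dict instead of real page-table pages, and all function names are hypothetical.

```python
PAGE = 4096


def indices(va):
    """The four 9-bit table indices of a 48-bit VA, root level first."""
    return [(va >> shift) & 0x1FF for shift in (39, 30, 21, 12)]


def gpa_to_hpa(gpa, page_map):
    """Translate via the host's per-guest map {GPA page base: HPA page base}."""
    return page_map[gpa & ~(PAGE - 1)] | (gpa & (PAGE - 1))


def walk_guest(gva, guest_cr3, page_map, host_mem):
    """Walk the guest's own 4-level table; return the HPA of the page backing gva."""
    table_gpa = guest_cr3
    for idx in indices(gva):
        pte_hpa = gpa_to_hpa(table_gpa, page_map) + idx * 8  # 8 bytes per PTE
        table_gpa = host_mem[pte_hpa]  # GPA of the next level (or of the final page)
    return gpa_to_hpa(table_gpa, page_map)


def handle_shadow_fault(gva, guest_cr3, page_map, host_mem, shadow_root):
    """On a shadow fault, add the GVA => HPA translation to the shadow hierarchy."""
    hpa_page = walk_guest(gva, guest_cr3, page_map, host_mem)
    idxs = indices(gva)
    node = shadow_root
    for idx in idxs[:-1]:
        node = node.setdefault(idx, {})  # allocate missing shadow levels as needed
    node[idxs[-1]] = hpa_page
    return hpa_page
```

A missing guest PTE shows up here as a `KeyError`; a real VMM would instead inject a page fault into the guest. On a guest context switch (a new cr3), the host would start over with an empty `shadow_root`.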
Providing guest with illusion of physical memory (realistic)
Building shadow page tables is costly
Can we cache?
  Yes, but need to write-protect all pages involved
    Will generate trap whenever pages are modified
    Host would be able to respond accordingly
The problem
  How do we know when to stop write-protecting?
Solution
  Must employ some heuristic
  Needn’t be perfect as long as it maintains correctness
Not all sensitive reads/writes trap at CPL=3
push CS
  Will show CPL=3 (not 0) if guest reads pushed value
sgdt (save gdtr)
  Reveals real gdtr if guest reads it
pushf
  Pushes real IF
  Always on in guest mode (why?)
  Host injects interrupts to guest as needed
popf
  Ignores IF at CPL=3 => no trap => host won’t know if guest wants interrupts
iret
  Invoked, e.g., after handling a system call
  No ring change => SS/ESP will not be restored
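The popf problem can be illustrated with a toy model of the IF rule. It is a deliberately narrow sketch: only the IF bit is modeled, every other flag is simply taken from the popped value, and the function name and its parameters are made up for illustration.

```python
IF_BIT = 1 << 9  # interrupt-enable flag is bit 9 of EFLAGS


def popf(eflags, popped, cpl, iopl=0):
    """Return the new EFLAGS after popf, modeling only the IF rule."""
    if cpl <= iopl:
        return popped  # privileged enough: IF comes from the popped value
    # Not privileged enough: IF silently keeps its old value and nothing traps.
    return (popped & ~IF_BIT) | (eflags & IF_BIT)
```

A guest kernel running at CPL=3 that executes popf to disable interrupts thus keeps IF set, and since no trap occurs the VMM never learns about the attempt.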
How can we cope?
Solution: binary translation
  Rewrite guest code
  Change every problematic instruction to INT 3
  Keep track of original instructions + emulate in VMM
  Note: INT 3 is 1 byte long => small enough to overwrite any instruction
Must be done dynamically at runtime
  Need to know whether bytes are code or data
  Need to know where instructions start (x86 is CISC)
  Consequently, scan code only as executed
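The patching bookkeeping can be sketched like so. The sketch assumes the caller already knows where the problematic instruction starts and how long it is (a real translator has to decode x86 to find that out), and `ToyTranslator` is a hypothetical name; the opcode bytes themselves (0xCC for INT 3, 0x9D for popf) are the real one-byte x86 encodings.

```python
INT3 = 0xCC  # one-byte x86 breakpoint opcode
POPF = 0x9D  # one-byte x86 popf opcode


class ToyTranslator:
    def __init__(self, code):
        self.code = bytearray(code)  # guest code, patched in place
        self.originals = {}          # offset -> original instruction bytes

    def patch(self, offset, length=1):
        """Overwrite the first byte of a problematic instruction with INT 3."""
        self.originals[offset] = bytes(self.code[offset:offset + length])
        self.code[offset] = INT3

    def on_int3(self, offset):
        """Called when the guest traps at a patched offset.

        Returns the original instruction bytes so the VMM can emulate them.
        """
        return self.originals[offset]
```

For the byte sequence `55 9D 90` (push %ebp, popf, nop), calling `patch(1)` yields `55 CC 90`, and `on_int3(1)` hands back `9D` for emulation.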
Binary translation – example
Assume guest kernel starts like so:
  pushl %ebp
  …
  popf
  …
  jnz x
  …
  j?? y
  x:
  …
  j?? z
Rewrite INT3 instead of
  Bad instructions (popf)
  First jump (jnz)
Then start guest kernel
  INT3 traps to host
  Emulates popf
  Look where jump could go
For each jump
  Translate upon the 1st encounter of block
  Keep track of translated code
  Next time, replace INT3 with original instructions if target is known (when jump is direct)
BT: indirect jumps & ret
Same, but
  Can’t replace INT3 with original jump
    Since we’re not sure address will be the same next time
  ret is an indirect jump via a pointer on the stack
    Must take trap every time (slow!)
Can we speed up?
  Yes, by writing our own code rather than hacking the original
    => more aggressive translation, addresses change
  See VMware’s “A Comparison of Software and Hardware Techniques for x86 Virtualization”, by Adams & Agesen, in ASPLOS 2006
    http://www.vmware.com/pdf/asplos235_adams.pdf
  Read it to make sure you know how!
Intel/AMD HW support for VMs
Much easier to implement VMM w/ reasonable performance
HW itself directly maintains per-guest virtual state
  CS (w/ CPL), EFLAGS, idtr, etc.
  In-memory HW struct can be loaded/unloaded like a context switch
HW knows it’s in guest mode
  Instructions directly modify virtual state
  Avoids lots of traps to VMM
HW basically adds a new privilege level
  VMM mode, CPL=0, ..., CPL=3
  Guest-mode/CPL=0 isn’t fully privileged
No traps to VMM on system calls
  HW handles CPL transition
No need for shadow page tables
  Next slide…
Nested paging
In guest mode, there are *2* page tables in effect
  Guest page table & host page table
Guest memory refs go through multiple lookups
  Guest tables hold GVA=>GPA translations
  HW knows this, so in every level of the hierarchy
    HW automatically translates GPA to HPA
    Continues the table walk process
  HW table walk can take ~20 memory refs
    => There’s a new “page table cache” (in addition to the TLB), which caches partial translations of the GVA in an attempt to skip levels (shown to be very effective)
Thus, guest can directly modify its page table w/o VMM having to shadow it
  No need for VMM to write-protect guest page tables
  No need for VMM to track cr3 changes
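To see where a figure like ~20 memory refs comes from, here is a back-of-the-envelope count for an uncached two-dimensional walk. The assumptions are mine, not the slides': every guest-level table pointer is a GPA that needs a full host walk before the guest PTE can be read, the final GPA needs one more host walk, and no TLB or page table cache hits occur. The function name is hypothetical.

```python
def nested_walk_refs(guest_levels, host_levels):
    """Worst-case memory references for one uncached nested page walk."""
    # Per guest level: a full host walk to locate the guest table page,
    # plus one reference to read the guest PTE itself.
    per_guest_level = host_levels + 1
    # Finally, one more host walk to translate the resulting GPA.
    return guest_levels * per_guest_level + host_levels
```

With 4 guest levels and 4 host levels this gives 24 references, compared with 4 for a native walk, which is why caching partial translations pays off.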
Nested paging
Is nested paging faster than shadow paging?
  Depends… (on what?)
Devices
Trap INB and OUTB
DMA addresses are physical
  VMM must trust devices or utilize HW support (IOMMU/IOTLB)
Device nowadays is typically shared (=> virtualized)
  If you want to share between multiple guests
    Each guest gets a part of the disk
    Each guest looks like a distinct Internet host
    Each guest gets an X window
VMM might mimic some standard (or legacy) devices
  Regardless of actual HW on host computer
Guest might run paravirtualized drivers
  Typically aggregate messages before switching to VMM
For high-performance I/O => device assignment
  Sharing through SR-IOV (new standard)