What is virtualization?
Creating a virtual version of something
o Hardware, operating system, application, network, memory, storage
“The construction of an isomorphism between a guest system and a host” [Popek, Goldberg, ’74]
Operating Systems 2019, I. Dinur , D. Hendler and M. Kogan-Sadetsky
Example: virtual disk
Partition a single hard disk to multiple virtual disks
o Virtual disk has virtual tracks & sectors
Implement virtual disk by file
Map between virtual disk and real disk contents
Virtual disk write/read mapped to file write/read in host system
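The file-backed virtual disk described above can be sketched as follows. This is a minimal illustration, not a real disk image format: the class name `VirtualDisk`, the 512-byte sector size, and the flat sector numbering (instead of tracks and sectors) are assumptions made for the example.

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumed sector size, for illustration only


class VirtualDisk:
    """Hypothetical virtual disk backed by an ordinary file in the host system."""

    def __init__(self, path, num_sectors):
        self.path = path
        self.num_sectors = num_sectors
        # Create the backing file, filled with zeros.
        with open(path, "wb") as f:
            f.write(b"\x00" * (SECTOR_SIZE * num_sectors))

    def _offset(self, sector):
        # Map a virtual sector number to a byte offset in the host file.
        if not 0 <= sector < self.num_sectors:
            raise ValueError("sector out of range")
        return sector * SECTOR_SIZE

    def write_sector(self, sector, data):
        # A virtual disk write becomes a file write in the host system.
        if len(data) != SECTOR_SIZE:
            raise ValueError("data must be exactly one sector")
        with open(self.path, "r+b") as f:
            f.seek(self._offset(sector))
            f.write(data)

    def read_sector(self, sector):
        # A virtual disk read becomes a file read in the host system.
        with open(self.path, "rb") as f:
            f.seek(self._offset(sector))
            return f.read(SECTOR_SIZE)


def demo():
    with tempfile.TemporaryDirectory() as d:
        disk = VirtualDisk(os.path.join(d, "disk.img"), num_sectors=8)
        disk.write_sector(3, b"A" * SECTOR_SIZE)
        return disk.read_sector(3)[:4], disk.read_sector(0)[:4]
```

The guest only ever sees sector numbers; where those sectors live on the real disk is decided by the host file system.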
What is virtualization? (continued)
A way to run multiple operating systems (and their applications) on the same hardware (virtual machines)
Only virtual machine manager (a.k.a. hypervisor) has full system control
Virtual machines completely isolated from each other (or so we hope)
Basic concepts
Virtual Machine (VM)
Host
Guest
Hypervisor (type II) / Virtual Machine Monitor
Types of virtualization
Full virtualization – guest OS runs unmodified
Para-virtualization – guest OS must be aware of virtualization, source-code modifications required
Hardware virtualization support may be used for both
Our focus is on full virtualization
Virtualization advantages
Cost-effectiveness – less hardware
o Multiple virtual machines / operating systems / services on single physical machine (server consolidation)
o Various forms of computation as a service
Isolation
o Good for security
o Great for reliability and recovery: If VM crashes it can be rebooted, does not affect other services (fault containment)
o VM migration
Development tool
o Work on multiple OS in parallel
o Develop and debug OS in user mode
o Origins of VMware as a tool for developers
Virtualization vs. Multi-Processing
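[Figure: two software stacks side by side. Multi-processing: Process1, Process2, ∙∙∙ run on the OS (user space/kernel separation), and the OS runs on the HW interface of the HW (disk, NIC,…). Virtualization: each VM contains an OS (OS1, OS2, ∙∙∙) with its processes (Pr1, Pr2, ∙∙∙) on a virtual HW interface provided by the VMM/Hypervisor, which runs on the real HW interface of the HW (disk, NIC,…).]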
Type 1 and type 2 hypervisors
Figure 7-1. Location of type 1 and type 2 hypervisors.
Type 1: VMware ESX, Microsoft Hyper-V, Xen
Type 2: VMware Workstation, Microsoft Virtual PC, Sun VirtualBox, QEMU, KVM
Type 1 and type 2 hypervisors (continued)
Figure 7-2. Examples of the various combinations of virtualization type and hypervisor. Type 1 hypervisors
always run on the bare metal whereas type 2 hypervisors use the services of an existing host
operating system.
What's required of a (classic) hypervisor
Hypervisor should provide the following:
Safety: have full control of virtualized resources
Fidelity: program behavior on VM should be identical to its behavior on bare hardware
Efficiency: as much as possible, run directly on hardware without hypervisor intervention
o Full interpretation isn't efficient
Classic virtualization: trap and emulate
[Figure: VM1 and VM2 run on the VMM, which runs on the HW. (1) A VM traps to the VMM; (2) the VMM's interrupt handler performs HW emulation; (3) control returns to the process.]
Emulation is the process of implementing the functionality/interface of one system on a system having different functionality/interface
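The trap-and-emulate cycle can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: the three-instruction "instruction set", the `Trap` exception standing in for a hardware trap, and the dictionaries standing in for real and virtual CPU state.

```python
class Trap(Exception):
    """Raised by the simulated CPU when user-mode code runs a privileged instruction."""


PRIVILEGED = {"cli", "sti"}  # toy instruction set, assumed for the example


def cpu_execute(instr, regs):
    """Simulated hardware running guest code in user mode."""
    if instr in PRIVILEGED:
        raise Trap(instr)      # (1) trap to the VMM
    if instr == "inc":
        regs["ax"] += 1        # harmless instruction: runs directly on the "hardware"


def vmm_emulate(instr, vcpu):
    """(2) VMM handler: apply the instruction's effect to the virtual CPU state."""
    if instr == "cli":
        vcpu["interrupts_enabled"] = False
    elif instr == "sti":
        vcpu["interrupts_enabled"] = True


def run_guest(program):
    regs = {"ax": 0}
    vcpu = {"interrupts_enabled": True}
    traps = 0
    for instr in program:
        try:
            cpu_execute(instr, regs)
        except Trap as trap:
            traps += 1
            vmm_emulate(trap.args[0], vcpu)  # emulate, then (3) return to the guest
    return regs, vcpu, traps
```

Only the privileged instruction costs a trip through the VMM; the `inc` instructions run without any intervention, which is what makes the approach efficient.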
Trap and emulate: difficulties on x86
Sensitive instructions:
o Provide control over HW resources
o Behave differently in kernel/supervisor and user modes
o Examples: I/O instructions, enabling/disabling interrupts, accessing the CR3 register, …
Privileged instructions: cause a trap if executed in user mode
Theorem [Popek and Goldberg, 1974]
A machine can be virtualized [using trap and emulate]
if every sensitive instruction is privileged.
Not satisfied by x86 processors prior to 2005
In 2005, Intel/AMD introduced virtualization HW support.
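The condition of the theorem is a subset test: the set of sensitive instructions must be contained in the set of privileged instructions. A small sketch, where the instruction names and the two sets are illustrative assumptions rather than a complete classification of any real architecture:

```python
def is_virtualizable(sensitive, privileged):
    """Popek-Goldberg condition: every sensitive instruction is privileged."""
    return set(sensitive) <= set(privileged)


def offenders(sensitive, privileged):
    """Sensitive instructions that would run in user mode without trapping."""
    return set(sensitive) - set(privileged)


# Illustrative sets only (assumed for the example).
SENSITIVE = {"cli", "mov_to_cr3", "popf"}
PRIVILEGED = {"cli", "mov_to_cr3"}
```

With these sets, `is_virtualizable(SENSITIVE, PRIVILEGED)` is false, and `offenders` names the instruction that breaks trap and emulate.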
What is sensitive?
CPU – some registers
MMU
o Page table
o Segments
Interrupts
Timers
IO devices
X86 virtualization problem I
The x86 architecture (w/o virtualization extensions) can't be virtualized by trap and emulate.
Some sensitive instructions are not privileged.
Example: the popf instruction
o Pops 16 bits from stack to flags register
o One of the flags masks (i.e. disables) interrupts
o The instruction is not privileged
o What happens if the OS of a VM runs popf?
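To make the question concrete: a guest OS runs deprivileged, so its popf executes in user mode. On x86 without virtualization extensions the instruction then does not trap; the interrupt flag is silently left unchanged, so the hypervisor never learns that the guest tried to disable interrupts. The toy model below illustrates this. It is a simplification that ignores IOPL and treats all other flag bits as freely writable; the function name and parameters are assumptions of the sketch.

```python
IF_MASK = 1 << 9  # the interrupt-enable flag is bit 9 of the x86 flags register


def popf(flags, popped, kernel_mode):
    """Toy model of popf: return the new flags value.

    In kernel mode the popped value replaces the flags.
    In user mode the IF bit is preserved and no trap is raised.
    """
    if kernel_mode:
        return popped
    return (popped & ~IF_MASK) | (flags & IF_MASK)
```

For example, with interrupts enabled (`flags = IF_MASK`), popping `0` in kernel mode clears IF, while the same popf in user mode leaves IF set and reports nothing.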
X86 virtualization problem II
Some instructions (push, pop, mov) can take segment selectors (cs, ds, ss) as arguments even in user mode, so the selectors can be read
The selectors contain two bits that hold the privilege level
o In x86 (beginning with 386), four privilege levels (ring 0 to ring 3)
o The two lower bits of the cs register are the Current Privilege Level (CPL) of the code.
o Guest OS thinks that it is in ring 0.
o Guest OS is actually in ring 1
Result - guest OS confusion.
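Reading the privilege level out of a selector is a two-bit mask, which is why a deprivileged guest can detect its real ring. A minimal sketch; the selector values used in the example are arbitrary illustrations.

```python
def privilege_level(selector):
    """Return the two low bits of a segment selector (the CPL when read from cs)."""
    return selector & 0b11


def guest_sees_expected_ring(cs, expected_ring=0):
    """What a guest kernel would conclude if it checked its own cs value."""
    return privilege_level(cs) == expected_ring
```

A guest kernel that reads a cs value such as `0x09` finds ring 1 in the low bits instead of the ring 0 it expects.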
Implementation options
Avoid executing sensitive instructions
o Interpretation (BOCHS, JSLinux).
o Binary translation – change executed code (VMware, QEMU).
Para-virtualization – re-compile guest OS (XEN, Denali).
Hardware assistance – Intel VT-x and AMD-V (used by KVM, XEN, VMware).
Outline
Concepts, classical CPU virtualization
o Binary translation
Memory virtualization
Operating Systems, Spring 2018, I. Dinur, D. Hendler and R. Iakobashvili
Binary translation
Binary translation is the process of translating one instruction set to another one.
Approach I: translate entire OS when loaded to VM
o Key problem – indirect control flow
Dynamic binary translation
Approach II: translate code on the fly
Simplest approach
o Keep table mapping old instructions to new instructions.
o Fetch old instruction.
o Use table to translate.
o Execute new instruction(s).
Problem: performance
o Overhead for every instruction, similar to interpretation.
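The fetch, translate, execute loop of the simplest approach can be sketched with a toy instruction set. The instruction names, the table contents, and the `lookups` counter are assumptions made for the illustration.

```python
# Table mapping old instructions to new instructions (toy rules, assumed).
TABLE = {
    "cli": ["vcpu_if_off"],  # sensitive instruction replaced by a safe one
    "sti": ["vcpu_if_on"],
    "inc": ["inc"],          # harmless instruction maps to itself
}


def execute(new_instr, state):
    if new_instr == "inc":
        state["ax"] += 1
    elif new_instr == "vcpu_if_off":
        state["vif"] = False
    elif new_instr == "vcpu_if_on":
        state["vif"] = True


def run(program):
    state = {"ax": 0, "vif": True, "lookups": 0}
    for old in program:              # fetch old instruction
        state["lookups"] += 1        # one table lookup per executed instruction
        for new in TABLE[old]:       # use table to translate
            execute(new, state)      # execute new instruction(s)
    return state
```

The `lookups` counter grows with every executed instruction, including repeated ones inside loops, which is the performance problem noted above.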
Dynamic BT with caching
Cache translated code region:
o After translation run from cache.
o Translation occurs only once.
Static translation cannot handle dynamic control transfer, when:
o Jump depending on content of memory address.
o Indirect function call (by function pointer).
Translation of dynamic control transfer must be done at execution time.
User code does not have to be translated
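A sketch of dynamic translation with a cache, under strong simplifications: guest code is a dictionary of basic blocks keyed by address, "translation" only rewrites sensitive instructions, and "execution" just records the translated instructions. All names here are hypothetical.

```python
SAFE = {"cli": "vcpu_if_off", "sti": "vcpu_if_on"}  # toy rewrite rules (assumed)


def translate_block(block):
    """Stand-in for real translation: rewrite sensitive instructions only."""
    return [SAFE.get(instr, instr) for instr in block]


def run(blocks, successor, start, steps):
    """Execute `steps` basic blocks, translating each block at most once."""
    cache = {}          # translation cache: guest address -> translated block
    translations = 0
    trace = []
    pc = start
    for _ in range(steps):
        if pc not in cache:                       # translate only on first use
            cache[pc] = translate_block(blocks[pc])
            translations += 1
        trace.extend(cache[pc])                   # run the cached translation
        pc = successor[pc]                        # target resolved at execution time
    return translations, trace
```

Running a two-block loop for six steps translates each block once and then serves the remaining four executions from the cache.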
Virtualization prior to HW support
Figure 7-4. The binary translator rewrites the guest operating system running in ring 1, while the hypervisor runs in ring 0.
VMware binary translation: example
[Figure: the example as C code, as a 64-bit binary, and in binary (hex) representation]
Translator reads guest memory at the address indicated by guest PC
Decodes instructions, creates Intermediate Representation - IR objects
Accumulates IR objects to translation units (TUs)
o Basic blocks (BB), stops upon control flow
[Figure: the first TU and its compiled code fragment (CCF), with the parts labelled “Identical code”, “Translation of jump BB” and “Translation of fall through BB” highlighted in successive steps]
[Figure: C code and 64-bit binary]
Which basic block will be translated next?
VMware binary translation example: output
These continuations remain because the respective basic blocks were not executed
VMware binary translation operation
Translation cache (TC) stores translations done so far
A hash table tracks the input-to-output correspondence
Chaining optimization allows one CCF to jump directly to another without calling out of the translation cache
As TC gradually captures guest's working set, proportion of translation decreases
User code does not have to be translated
Dealing with privileged instructions: example
The cli (clear interrupts) instruction is privileged
Translated to: “vcpu.flags.IF=0”
Much faster than source binary!
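The point of this translation is that the privileged instruction becomes a plain store into the hypervisor's in-memory virtual CPU state, with no trap involved. A toy sketch, in which the `VCpu` class, the flag name `IF` for the virtual interrupt flag, and the use of Python callables as "translated code" are all assumptions of the example:

```python
class VCpu:
    """Hypothetical in-memory virtual CPU state kept by the hypervisor."""

    def __init__(self):
        self.flags = {"IF": True}  # virtual interrupt flag, initially enabled


def translate(instr):
    """Return a callable standing in for the translated code of `instr`."""
    if instr == "cli":
        def clear_if(vcpu):
            vcpu.flags["IF"] = False   # a memory write instead of a trap
        return clear_if
    if instr == "sti":
        def set_if(vcpu):
            vcpu.flags["IF"] = True
        return set_if
    raise NotImplementedError(instr)
```

Calling `translate("cli")(vcpu)` changes only the virtual flag; the real interrupt flag of the machine is never touched by the guest.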
Memory allocation
Each VM usually receives a contiguous set of physical addresses.
o 1 GB to 4 GB are typical values.
As far as VM is concerned, this is the physical memory of the machine.
The guest OS allocates pages to guest processes.
Memory management
Assumptions of OS in VM:
o Physical memory is a contiguous block of addresses from 0 to some n.
o OS can map any virtual page to any page frame.
Hypervisor must:
o Partition memory among VMs.
o Ensure virtual pages are mapped only to the page frames assigned to the VM.
TLB miss: with a HW-managed TLB (e.g. x86), a miss causes the HW itself to walk the page table and load the translation.
Therefore the guest OS must not manage the real page table.
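The two duties of the hypervisor listed above, partitioning memory and confining each VM to its own page frames, can be sketched as follows. The `Partition` class and its contiguous, first-fit allocation of host frames are assumptions chosen to keep the example short; a real hypervisor need not allocate host memory contiguously.

```python
class Partition:
    """Hypothetical partition of host page frames among VMs."""

    def __init__(self, total_frames):
        self.total_frames = total_frames
        self.next_free = 0
        self.vms = {}  # VM name -> (base host frame, number of frames)

    def add_vm(self, name, frames):
        if self.next_free + frames > self.total_frames:
            raise MemoryError("not enough host memory")
        self.vms[name] = (self.next_free, frames)
        self.next_free += frames

    def to_host_frame(self, name, guest_frame):
        """Map a guest page frame (0..n-1 from the guest's view) to a host frame."""
        base, size = self.vms[name]
        if not 0 <= guest_frame < size:
            raise ValueError("guest frame outside the VM's assigned memory")
        return base + guest_frame
```

Each guest sees frames numbered from 0, matching its assumption of contiguous physical memory, while the bounds check keeps it inside its own partition.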
Option 1: brute force
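[Figure: HW: CPU with TLB and CR3. The guest OS holds its page directory and page table, which are marked “Define these pages as not R/W”; the hypervisor (VMM SW) holds the VM memory layout. An access causes an interrupt and the VMM corrects the address.]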
Brute force – description
Guest page tables are read and write protected in host system.
If guest OS reads page table (e.g. for page eviction), writes page table (e.g. after page fault), or changes CR3, the system traps.
The hypervisor then uses a VM memory layout to:
o Return answers to VM
o Update the layout
Hypervisor switches VM memory layout when new VM is scheduled.
Option 2: shadow page tables
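[Figure: HW: CPU with TLB and CR3. The guest OS holds its page directory and page table, pointed to by the guest CR3 (G-CR3); the hypervisor (VMM SW) holds the shadow page table. On an interrupt, the VMM corrects the page table.]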
Shadow page tables – description
Hypervisor maintains “shadow page tables”.
Guest page tables map: Guest VA (GVA) → Guest PA (GPA)
Shadow tables map: Guest VA → Host PA (HPA).
Hypervisor does not trap guest updates to its page table.
o Result – inconsistent guest page table and shadow page table.
When guest process accesses virtual address
o The physical address is not in the guest page table, but in the shadow page table.
o HW translates correctly, because it is aware only of shadow tables.
Shadow page tables – description (continued)
If address in TLB – TLB hit and no problem.
When guest process causes a page fault
o Hypervisor begins execution.
o If required, hypervisor updates shadow page table.
Performance is as good as native execution as long as there are no page faults.
Shadow page tables should be cached so that once a VM is re-scheduled the page table does not have to be rebuilt from scratch.
Shadow page tables – page faults (continued)
Two scenarios when handling a page fault. Hypervisor “walks” guest page table to determine which it is.
1. Guest page fault – no translation in guest page tables → “inject” page fault for guest to handle
2. Guest translation found → update shadow table accordingly
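The two scenarios can be sketched with flat, single-level tables represented as dictionaries of page numbers. This is a deliberate simplification: real page tables are multi-level, and the function name, the return strings, and the `gpa_to_hpa` map (the hypervisor's record of where each guest physical page lives) are assumptions of the example.

```python
def handle_page_fault(gva_page, guest_pt, gpa_to_hpa, shadow_pt):
    """Hypervisor-side handling of a fault on guest virtual page `gva_page`.

    guest_pt:   guest page table, GVA page -> GPA page
    gpa_to_hpa: hypervisor's map,  GPA page -> HPA page
    shadow_pt:  shadow page table, GVA page -> HPA page (used by the hardware)
    """
    gpa_page = guest_pt.get(gva_page)        # hypervisor walks the guest page table
    if gpa_page is None:
        return "inject"                      # scenario 1: guest handles its own fault
    shadow_pt[gva_page] = gpa_to_hpa[gpa_page]   # scenario 2: fill in GVA -> HPA
    return "shadow_updated"
```

After a `"shadow_updated"` result the faulting access can be retried and will hit the new shadow entry; after `"inject"` the guest OS runs its own page-fault handler.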
Shadow page tables – updating CR3
[Figure, shown in three steps: three guest page tables, each paired with a shadow page table; the virtual CR3 points to a guest page table and the real CR3 points to the corresponding shadow page table.]
Slide taken from a presentation by VMware.
Undiscovered guest page table
[Figure, shown in two steps: a fourth guest page table initially has no shadow page table; in the second step a shadow page table has been added for it. The virtual CR3 and real CR3 are also shown.]
Slide taken from a presentation by VMware.
Option 3: Extended/nested page tables
[Figure: HW: CPU with TLB, CR3 and EPTP. The guest OS holds its page directory and page table (reached via CR3); the hypervisor (VMM SW) holds the host page tables (reached via EPTP).]
Nested/extended page tables - description
The name implies having page tables within page tables.
The essence of the idea is a hardware assist.
o Hardware has an extra pointer and the ability to walk an extra set of page tables.
o Idea is called Extended Page Tables (EPT) by Intel
Guest page tables hold the Guest VA → Guest PA mapping, accessed via the standard CR3
Extended page tables hold the Host VA → Host PA mapping, accessed via the EPTP (EPT pointer).
Host VA = Guest PA
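Because Host VA = Guest PA, a full translation is the composition of the two mappings. The sketch below uses flat dictionaries of page numbers in place of the real multi-level tables; the function name and the error messages are assumptions of the example.

```python
def nested_translate(gva_page, guest_pt, ept):
    """Compose the two mappings: GVA -> GPA (guest table), then GPA -> HPA (EPT)."""
    gpa_page = guest_pt.get(gva_page)
    if gpa_page is None:
        raise LookupError("guest page fault")      # handled by the guest OS
    hpa_page = ept.get(gpa_page)
    if hpa_page is None:
        raise LookupError("EPT fault")             # handled by the hypervisor
    return hpa_page
```

The guest can edit `guest_pt` freely without hypervisor involvement, since only `ept` decides which host pages are actually reachable.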
Walking extended page tables
Extended page tables – description (cont'd)
TLB as usual holds Guest VA → Host PA
On memory access
o If found in TLB – no problem.
o If not in TLB, but no page fault, hardware walks both tables and updates TLB.
o If page fault, then hypervisor gets host virtual page (guest physical page) and maps it to host physical page.
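The three cases on a memory access can be traced in a toy model. It assumes flat dictionaries of page numbers for both tables, assumes the guest mapping already exists, and lets the "hypervisor" satisfy a fault by taking the next frame from a free list; the class and attribute names are hypothetical.

```python
class NestedMMU:
    """Toy model of the access path with extended page tables."""

    def __init__(self, guest_pt, ept, free_host_frames):
        self.guest_pt = guest_pt              # GVA page -> GPA page
        self.ept = ept                        # GPA page -> HPA page
        self.free = list(free_host_frames)    # host frames the hypervisor may hand out
        self.tlb = {}                         # GVA page -> HPA page
        self.ept_faults = 0

    def access(self, gva_page):
        if gva_page in self.tlb:              # found in TLB: no problem
            return self.tlb[gva_page]
        gpa_page = self.guest_pt[gva_page]    # guest mapping assumed present
        if gpa_page not in self.ept:          # page fault: hypervisor maps GPA -> HPA
            self.ept_faults += 1
            self.ept[gpa_page] = self.free.pop(0)
        hpa_page = self.ept[gpa_page]         # hardware walks both tables
        self.tlb[gva_page] = hpa_page         # and updates the TLB
        return hpa_page
```

The first access to a page pays for the fault and the double walk; repeated accesses are served from the TLB without touching either table.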
Sources
“Modern operating systems”, 4th edition, A. Tanenbaum and H. Bos
“Virtual machines”, J. E. Smith and R. Nair
A presentation by Niv Gilboa from CSE@BGU
“Formal requirements for virtualizable third generation architectures”, G. J. Popek and R. P. Goldberg, CACM, 1974
“A comparison of software and hardware techniques for x86 virtualization”, K. Adams and O. Agesen, ASPLOS 2006
A presentation by VMware