Virtualization

Operating Systems 2019, I. Dinur, D. Hendler and M. Kogan-Sadetsky


What is virtualization?

Creating a virtual version of something

o Hardware, operating system, application, network, memory, storage

“The construction of an isomorphism between a guest system and a host” [Popek, Goldberg, ’74]


Example: virtual disk

Partition a single hard disk to multiple virtual disks

o Virtual disk has virtual tracks & sectors

Implement virtual disk by file

Map between virtual disk and real disk contents

Virtual disk write/read mapped to file write/read in host system
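A minimal sketch of the virtual disk idea, assuming 512-byte sectors and a hypothetical `VirtualDisk` class (neither comes from the slides). A virtual sector number is mapped to a byte offset in a host file, so each virtual disk read or write becomes a file read or write in the host system.

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumed sector size, chosen for illustration


class VirtualDisk:
    """Hypothetical virtual disk stored in a single host file."""

    def __init__(self, path, num_sectors):
        self.path = path
        self.num_sectors = num_sectors
        # Create the backing file at full size, filled with zero bytes.
        with open(path, "wb") as f:
            f.write(bytes(num_sectors * SECTOR_SIZE))

    def _offset(self, sector):
        if not 0 <= sector < self.num_sectors:
            raise ValueError("sector out of range")
        return sector * SECTOR_SIZE  # virtual sector -> host file offset

    def write_sector(self, sector, data):
        if len(data) != SECTOR_SIZE:
            raise ValueError("data must be exactly one sector")
        with open(self.path, "r+b") as f:
            f.seek(self._offset(sector))
            f.write(data)

    def read_sector(self, sector):
        with open(self.path, "rb") as f:
            f.seek(self._offset(sector))
            return f.read(SECTOR_SIZE)


def demo():
    """Write one sector, then read it and an untouched neighbour back."""
    with tempfile.TemporaryDirectory() as d:
        disk = VirtualDisk(os.path.join(d, "disk.img"), num_sectors=8)
        disk.write_sector(3, b"A" * SECTOR_SIZE)
        return disk.read_sector(3), disk.read_sector(4)
```

Tracks are left out of the sketch; a track/sector pair could be flattened to a sector number before calling `_offset`.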

What is virtualization? (continued)


A way to run multiple operating systems (and their applications) on the same hardware (virtual machines)

Only virtual machine manager (a.k.a. hypervisor) has full system control

Virtual machines completely isolated from each other (or so we hope)


Basic concepts

Virtual Machine (VM)

Host

Guest

Hypervisor (type II) / Virtual Machine Monitor


Types of virtualization


Full virtualization – guest OS runs unmodified

Para-virtualization – guest OS must be aware of virtualization, source-code modifications required

Hardware virtualization support may be used for both

Our focus is on full virtualization


Virtualization advantages


Cost-effectiveness – less hardware

o Multiple virtual machines / operating systems / services on single physical machine (server consolidation)

o Various forms of computation as a service

Isolation

o Good for security

o Great for reliability and recovery: if a VM crashes it can be rebooted without affecting other services (fault containment)

o VM migration

Development tool

o Work on multiple OSes in parallel

o Develop and debug OS in user mode

o Origins of VMware as a tool for developers


Virtualization vs. Multi-Processing

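[Figure: two layered stacks, listed bottom to top]

Multi-processing: HW (disk, NIC, …) / OS / Process1, Process2, ∙∙∙

o The OS runs on the HW interface; user space/kernel separation divides the processes from the OS

Virtualization: HW (disk, NIC, …) / VMM/Hypervisor / OS1, OS2, ∙∙∙ / Pr1, Pr2, ∙∙∙ on each OS

o The hypervisor runs on the real HW interface and exposes a virtual HW interface to each VM (an OS together with its processes)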


Type 1 and type 2 hypervisors


Figure 7-1. Location of type 1 and type 2 hypervisors.

Type 1 examples: VMware ESX, Microsoft Hyper-V, Xen

Type 2 examples: VMware Workstation, Microsoft Virtual PC, Sun VirtualBox, QEMU, KVM

Type 1 and type 2 hypervisors (continued)

Figure 7-2. Examples of the various combinations of virtualization type and hypervisor. Type 1 hypervisors always run on the bare metal whereas type 2 hypervisors use the services of an existing host operating system.

What's required of a (classic) hypervisor


Hypervisor should provide the following:

Safety: have full control of virtualized resources

Fidelity: program behavior on VM should be identical to its behavior on bare hardware

Efficiency: as much as possible, run directly on hardware without hypervisor intervention

o Full interpretation isn't efficient

Classic virtualization: trap and emulate

[Figure: VM1 and VM2 run on top of the VMM, which runs on the HW]

(1) Trap

(2) Interrupt handler: HW emulation

(3) Return to process

Emulation is the process of implementing the functionality/interface of one system on a system having different functionality/interface
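The three steps can be sketched in a toy model. Everything here is an assumption made for illustration: the instruction names, the `VirtualCPU` fields and the `Trap` exception stand in for real hardware behavior. Unprivileged instructions run directly, while privileged ones trap and are emulated against the VM's virtual CPU state.

```python
class Trap(Exception):
    """Raised when deprivileged guest code executes a privileged instruction."""


class VirtualCPU:
    def __init__(self):
        self.interrupts_enabled = True
        self.cr3 = 0
        self.regs = {"ax": 0}


PRIVILEGED = {"cli", "sti", "load_cr3"}  # illustrative subset, toy names


def run_on_hardware(vcpu, op, arg):
    """Stand-in for direct execution: privileged ops trap, others just run."""
    if op in PRIVILEGED:
        raise Trap(op)
    if op == "add":
        vcpu.regs["ax"] += arg


def emulate(vcpu, op, arg):
    """VMM handler: apply the instruction's effect to the virtual CPU state."""
    if op == "cli":
        vcpu.interrupts_enabled = False
    elif op == "sti":
        vcpu.interrupts_enabled = True
    elif op == "load_cr3":
        vcpu.cr3 = arg


def vmm_run(vcpu, program):
    """Run a guest program and return how many instructions trapped."""
    traps = 0
    for op, arg in program:
        try:
            run_on_hardware(vcpu, op, arg)  # no VMM involvement
        except Trap:                        # (1) trap
            emulate(vcpu, op, arg)          # (2) handler emulates the HW
            traps += 1                      # (3) the loop resumes the guest
    return traps


vcpu = VirtualCPU()
traps = vmm_run(vcpu, [("add", 2), ("cli", None),
                       ("load_cr3", 0x1000), ("add", 3)])
```

Only the two privileged instructions involve the VMM; the `add` instructions never leave the direct-execution path.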

Trap and emulate: difficulties on x86

Sensitive instructions:

o Provide control over HW resources

o Behave differently in kernel/supervisor and user modes

o Examples: I/O instructions, enable/disable interrupts, access CR3 register…

Privileged instructions: cause a trap if executed in user mode

Theorem [Popek and Goldberg, 1974]

A machine can be virtualized [using trap and emulate] if every sensitive instruction is privileged.

x86 processors prior to 2005 did not satisfy this condition

In 2005, Intel/AMD introduced virtualization HW support.
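The condition of the theorem is a subset test: the set of sensitive instructions must be contained in the set of privileged ones. The sketch below checks it on two toy instruction sets; the mnemonics and set contents are illustrative assumptions, not complete ISA listings.

```python
def violating_instructions(sensitive, privileged):
    """Return the sensitive instructions that do not trap in user mode."""
    return set(sensitive) - set(privileged)


def trap_and_emulate_possible(sensitive, privileged):
    """True when every sensitive instruction is also privileged."""
    return not violating_instructions(sensitive, privileged)


# Toy instruction sets, for illustration only.
ideal_isa = {
    "sensitive": {"cli", "mov_cr3", "out"},
    "privileged": {"cli", "mov_cr3", "out", "hlt"},
}
pre_2005_x86 = {
    "sensitive": {"cli", "mov_cr3", "out", "popf"},
    "privileged": {"cli", "mov_cr3", "out", "hlt"},
}
```

With these sets, `violating_instructions(**pre_2005_x86)` contains only `popf`, the instruction that is sensitive but not privileged.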

What is sensitive?

CPU – some registers

MMU

o Page table

o Segments

Interrupts

Timers

IO devices


X86 virtualization problem I

The x86 architecture (w/o virtualization extensions) can't be virtualized by trap and emulate.

Some sensitive instructions are not privileged.

Example: the popf instruction

o Pops 16 bits from stack to flags register

o One of the flags masks (i.e. disables) interrupts

o The instruction is not privileged

o What happens if the OS of a VM runs popf?
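A simplified model of the answer, assuming only the interrupt flag matters and that the I/O privilege level is 0. When `popf` runs without sufficient privilege, the interrupt flag silently keeps its old value and no trap occurs, so the hypervisor never learns that the guest OS tried to disable interrupts. The function signature is a hypothetical modeling choice.

```python
IF_MASK = 1 << 9  # the interrupt-enable flag is bit 9 of the x86 flags register


def popf(flags, popped, cpl, iopl=0):
    """Simplified model of popf: return the new flags value.

    Only the IF bit is modelled. When the code is not privileged enough
    (cpl > iopl), IF keeps its old value and no trap is raised.
    """
    new_flags = popped
    if cpl > iopl:
        new_flags = (popped & ~IF_MASK) | (flags & IF_MASK)
    return new_flags


flags_with_if = IF_MASK | 0x1  # interrupts enabled, carry flag set
wanted = 0x1                   # the guest OS wants IF cleared

in_ring0 = popf(flags_with_if, wanted, cpl=0)  # what the guest OS expects
in_ring1 = popf(flags_with_if, wanted, cpl=1)  # what happens when deprivileged
```

In ring 0 the flag is cleared as requested; in ring 1 it stays set, yet the instruction completes normally.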


X86 virtualization problem II

Some instructions (push, pop, mov) can take segment selectors (cs, ds, ss) as arguments even in user mode, so the selectors can be read

The selectors have two bits that hold a privilege level

o In x86 (beginning with 386), four privilege levels (ring 0 to ring 3)

o The two lower bits of the cs register are the Current Privilege Level (CPL) of the code.

o Guest OS thinks that it is in ring 0.

o Guest OS is actually in ring 1

Result - guest OS confusion.
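A short sketch of how a guest kernel could notice that it is deprivileged. The selector values and the `guest_kernel_check` function are hypothetical; the only fact relied on is that the two low bits of `cs` give the current privilege level.

```python
def privilege_level(selector):
    """The two low bits of a segment selector hold a privilege level (0-3)."""
    return selector & 0b11


def guest_kernel_check(cs):
    """Hypothetical guest-kernel sanity check that assumes ring 0."""
    return "ok" if privilege_level(cs) == 0 else "confused: not in ring 0"


native_cs = 0x08        # example selector: index 1, ring 0
deprivileged_cs = 0x09  # same index, but the low bits say ring 1
```

Because reading `cs` does not trap, the hypervisor has no chance to hide the difference between the two values.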

Implementation options

Avoid executing sensitive instructions

o Interpretation (Bochs, JSLinux).

o Binary translation – change executed code (VMware, QEMU).

Para-virtualization – re-compile guest OS (Xen, Denali).

Hardware assistance – Intel VT-x and AMD-V (used by KVM, Xen, VMware).

Outline

Concepts, classical CPU virtualization

o Binary translation

Memory virtualization

Operating Systems, Spring 2018, I. Dinur, D. Hendler and R. Iakobashvili

Binary translation

Binary translation is the process of translating code from one instruction set to another.

Approach I: translate entire OS when loaded to VM

o Key problem – indirect control flow


Dynamic binary translation

Approach II: translate code on the fly

Simplest approach

o Keep table mapping old instructions to new instructions.

o Fetch old instruction.

o Use table to translate.

o Execute new instruction(s).

Problem: performance

o Overhead for every instruction, similar to interpretation.
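The simplest approach can be sketched as a fetch, look up, execute loop. The guest mnemonics, the host operations and the table contents are made up for illustration. The `lookups` counter makes the per-instruction overhead visible.

```python
def add1(state):
    state["acc"] += 1


def shl1(state):
    state["acc"] *= 2


def clear_virtual_if(state):
    state["if"] = 0


# Hypothetical table: old (guest) instruction -> new (host) instruction(s).
TABLE = {"inc": [add1], "dbl": [shl1], "cli": [clear_virtual_if]}


def run(guest_code, state):
    """Translate and execute one instruction at a time; count table lookups."""
    lookups = 0
    for old in guest_code:    # fetch old instruction
        new_ops = TABLE[old]  # use table to translate
        lookups += 1
        for op in new_ops:    # execute new instruction(s)
            op(state)
    return lookups


state = {"acc": 1, "if": 1}
lookups = run(["inc", "dbl", "cli", "inc"], state)
```

Every executed instruction costs one lookup, even when the same instruction runs repeatedly in a loop.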

Dynamic BT with caching

Cache translated code region:

o After translation run from cache.

o Translation occurs only once.

Static translation cannot handle dynamic control transfer, when:

o Jump depending on content of memory address.

o Indirect function call (by function pointer).

Translation of dynamic control transfer must be done at execution time.

User code does not have to be translated


Virtualization prior to HW support

Figure 7-4. The binary translator rewrites the guest operating system running in ring 1, while the hypervisor runs in ring 0

VMware binary translation: example

[Figure: C code, its 64-bit binary, and the binary (hex) representation]

Translator reads guest memory at the address indicated by guest PC

Decodes instructions, creates Intermediate Representation - IR objects

Accumulates IR objects to translation units (TUs)

o Basic blocks (BB), stops upon control flow

[Figure: the first TU next to its compiled code fragment (CCF). Labels on the CCF: identical code, translation of jump BB, translation of fall-through BB]
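A rough sketch of how a translator might gather one translation unit. It reads from the guest PC and stops at the first control-flow instruction. The guest program, the one-address-per-instruction layout and the `CONTROL_FLOW` set are simplifying assumptions; this is not the code shown on the slides.

```python
CONTROL_FLOW = {"jmp", "jge", "jnz", "call", "ret"}  # illustrative subset


def read_translation_unit(guest_memory, pc):
    """Collect instructions from pc up to the first control-flow instruction.

    guest_memory maps an address to (mnemonic, operand). Each instruction is
    assumed to occupy one address, which real x86 code does not do.
    """
    unit = []
    while pc in guest_memory:
        insn = guest_memory[pc]
        unit.append((pc, insn))
        pc += 1
        if insn[0] in CONTROL_FLOW:
            break
    return unit, pc  # pc is now the fall-through address


# Made-up guest code: a small loop followed by a return.
guest_memory = {
    0: ("mov", "ecx, edi"),
    1: ("cmp", "esi, ecx"),
    2: ("jge", 6),
    3: ("add", "eax, 1"),
    4: ("jmp", 1),
    6: ("ret", None),
}

first_unit, fall_through = read_translation_unit(guest_memory, 0)
```

The first unit ends at the `jge`. Its two possible successors, the jump target and the fall-through address, are translated only when execution reaches them.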

[Figure: C code and its 64-bit binary]

Which basic block will be translated next?

VMware binary translation example: output

[Figure: translator output for the example]

These continuations remain because the respective basic blocks were not executed

VMware binary translation operation

Translation cache (TC) stores translations done so far

A hash table tracks the input-to-output correspondence

Chaining optimization allows one CCF to jump directly to another without calling out of the translation cache

As TC gradually captures guest's working set, proportion of translation decreases

User code does not have to be translated
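The translation cache, its hash table and chaining can be sketched with a toy dispatcher. The `CCF` fields and the `Translator` class are hypothetical. Each block is assumed to have a single fixed successor, so conditional and indirect branches are left out.

```python
class CCF:
    """Compiled code fragment for one basic block (toy representation)."""

    def __init__(self, guest_pc, action, next_pc):
        self.guest_pc = guest_pc
        self.action = action    # callable run in place of the guest code
        self.next_pc = next_pc  # guest address of the successor, or None
        self.chained = None     # direct link to the successor CCF once known


class Translator:
    def __init__(self, guest_blocks):
        self.guest_blocks = guest_blocks  # guest_pc -> (action, next_pc)
        self.tc = {}                      # hash table: guest_pc -> CCF
        self.translations = 0
        self.lookups = 0

    def lookup(self, pc):
        """Find the CCF for pc, translating the block on first use."""
        self.lookups += 1
        if pc not in self.tc:
            action, next_pc = self.guest_blocks[pc]
            self.tc[pc] = CCF(pc, action, next_pc)
            self.translations += 1
        return self.tc[pc]

    def run(self, pc, state, max_blocks):
        ccf = self.lookup(pc)
        for _ in range(max_blocks):
            ccf.action(state)
            if ccf.next_pc is None:
                break
            if ccf.chained is None:  # first time: go through the hash table
                ccf.chained = self.lookup(ccf.next_pc)
            ccf = ccf.chained        # afterwards: jump straight to the CCF


def inc(state):
    state["acc"] += 1


def dbl(state):
    state["acc"] *= 2


translator = Translator({0: (inc, 1), 1: (dbl, 0)})
state = {"acc": 0}
translator.run(0, state, max_blocks=10)
```

Ten blocks execute, but only two translations and three table lookups happen. Once the loop is chained, execution stays inside the cache.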


Dealing with privileged instructions: example


The cli (clear interrupts) instruction is privileged

Translated to: “vcpu.flags.IF=0”

Much faster than trapping on the source instruction!
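A toy illustration of this translation, with a hypothetical `translate` function and `VCPU` class. The privileged instruction is replaced by a plain write to the virtual CPU's interrupt flag (written `IF` here), so executing the translated code never enters the hypervisor.

```python
class VCPU:
    def __init__(self):
        self.flags = {"IF": 1}  # virtual interrupt flag, initially enabled


def translate(mnemonic):
    """Return a callable that replaces the guest instruction (toy translator)."""
    if mnemonic == "cli":
        def clear_if(vcpu):
            vcpu.flags["IF"] = 0  # plain memory write, no trap into the VMM
        return clear_if
    if mnemonic == "sti":
        def set_if(vcpu):
            vcpu.flags["IF"] = 1
        return set_if
    raise NotImplementedError(mnemonic)


vcpu = VCPU()
translate("cli")(vcpu)
```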


Memory allocation

Each VM usually receives a contiguous set of physical addresses.

o 1 GB to 4 GB are typical values.

As far as VM is concerned, this is the physical memory of the machine.

The guest OS allocates pages to guest processes.

Memory management

Assumptions of OS in VM:

o Physical memory is a contiguous block of addresses from 0 to some n.

o OS can map any virtual page to any page frame.

Hypervisor must:

o Partition memory among VMs.

o Ensure virtual page mapping only to assigned page frames.

TLB miss: a miss in a HW-managed TLB (e.g. x86) causes the HW itself to walk the page table and load the translation.

The guest OS must therefore not manage the real page table.
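A minimal sketch of the two hypervisor duties, assuming 4 KB pages and a contiguous allocation per VM. The `Hypervisor` class and its method names are hypothetical. Each VM sees guest-physical addresses starting at 0, and the hypervisor rejects any frame outside the VM's allocation.

```python
PAGE_SIZE = 4096  # assumed page size


class Hypervisor:
    def __init__(self, total_frames):
        self.total_frames = total_frames
        self.next_free = 0
        self.regions = {}  # VM name -> (first host frame, frame count)

    def create_vm(self, name, frames):
        """Partition memory: give the VM a contiguous range of host frames."""
        if self.next_free + frames > self.total_frames:
            raise MemoryError("not enough host frames")
        self.regions[name] = (self.next_free, frames)
        self.next_free += frames

    def guest_to_host(self, name, guest_pa):
        """Map a guest-physical address to a host-physical address."""
        base, count = self.regions[name]
        frame, offset = divmod(guest_pa, PAGE_SIZE)
        if not 0 <= frame < count:
            raise PermissionError("guest frame outside the VM's allocation")
        return (base + frame) * PAGE_SIZE + offset


hv = Hypervisor(total_frames=8)
hv.create_vm("vm1", 4)
hv.create_vm("vm2", 4)
```

Both VMs use guest-physical address 0, but the two addresses resolve to different host frames.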

Option 1: brute force

[Figure: the HW (CPU, CR3, TLB) runs the guest OS, whose page directory and page tables are defined as not R/W. An access raises an interrupt and the VMM SW corrects the address, using the VM memory layout kept by the hypervisor.]

Brute force – description

Guest page tables are read and write protected in host system.

If guest OS reads page table (e.g. for page eviction), writes page table (e.g. after page fault), or changes CR3, the system traps.

The hypervisor then uses a VM memory layout to:

o Return answers to VM

o Update the layout

Hypervisor switches VM memory layout when new VM is scheduled.

Option 2: shadow page tables

[Figure: the guest OS has its own page directory and page tables, reached through G-CR3. The hypervisor (VMM SW) keeps the shadow page table used by the HW (CPU, CR3, TLB). On an interrupt, the VMM corrects the page table.]

Shadow page tables – description

Hypervisor maintains “shadow page tables”.

Guest page tables map: Guest VA (GVA) → Guest PA (GPA)

Shadow tables map: Guest VA → Host PA (HPA).

Hypervisor does not trap guest updates to its page table.

o Result – inconsistent guest page table and shadow page table.

When guest process accesses virtual address

o The physical address is not in the guest page table, but in the shadow page table.

o HW translates correctly, because it is aware only of shadow tables.
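A shadow table is the composition of the two mappings. The sketch below uses flat dictionaries of page numbers in place of multi-level tables, and all page numbers are made up. It also shows how an untrapped guest update leaves the two tables inconsistent.

```python
def build_shadow(guest_pt, gpa_to_hpa):
    """Compose GVA -> GPA with GPA -> HPA to get GVA -> HPA (page numbers)."""
    return {gva: gpa_to_hpa[gpa]
            for gva, gpa in guest_pt.items()
            if gpa in gpa_to_hpa}


guest_pt = {0: 2, 1: 0}             # guest virtual page -> guest physical page
gpa_to_hpa = {0: 40, 1: 41, 2: 42}  # guest physical page -> host physical page
shadow = build_shadow(guest_pt, gpa_to_hpa)

guest_pt[2] = 1  # untrapped guest update: the shadow table is now stale
```

After the last line the guest table has an entry for page 2 while the shadow table does not, so the next access to that page faults into the hypervisor.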

Shadow page tables – description (continued)

If address in TLB – TLB hit and no problem.

When guest process causes a page fault

o Hypervisor begins execution.

o If required, hypervisor updates shadow page table.

Performance is as good as native execution as long as there are no page faults.

Shadow page tables should be cached so that once a VM is re-scheduled the page table does not have to be rebuilt from scratch.


Shadow page tables – page faults (continued)

Two scenarios when handling a page fault. Hypervisor “walks” guest page table to determine which it is.

1. Guest page fault – no translation in guest page tables → “inject” page fault for guest to handle

2. Guest translation found → update shadow table accordingly
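The two scenarios can be sketched as one handler. The tables are flat dictionaries of page numbers, and every guest physical page is assumed to be already backed by a host page. The function name and return values are hypothetical.

```python
def handle_shadow_fault(gva, guest_pt, gpa_to_hpa, shadow):
    """Hypervisor-side handling of a fault on one guest virtual page."""
    gpa = guest_pt.get(gva)        # "walk" the guest page table
    if gpa is None:
        return "inject"            # scenario 1: the guest handles the fault
    shadow[gva] = gpa_to_hpa[gpa]  # scenario 2: fill in the shadow entry
    return "shadow-updated"


guest_pt = {0: 2, 2: 1}             # guest virtual page -> guest physical page
gpa_to_hpa = {0: 40, 1: 41, 2: 42}  # guest physical page -> host physical page
shadow = {0: 42}                    # entry for page 2 not yet discovered

result_known = handle_shadow_fault(2, guest_pt, gpa_to_hpa, shadow)
result_unknown = handle_shadow_fault(7, guest_pt, gpa_to_hpa, shadow)
```

Page 2 has a guest translation, so the shadow table is updated. Page 7 has none, so the fault is injected into the guest.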

Shadow page tables – updating CR3

[Figure: three guest page tables, each paired with a shadow page table. The virtual CR3 points to a guest page table and the real CR3 points to a shadow page table.]

Slide taken from a presentation by VMware.

Undiscovered guest page table

[Figure: as before, with a fourth guest page table that at first has no shadow page table; a shadow page table is then added for it. Virtual CR3 and real CR3 are shown as pointers.]

Slide taken from a presentation by VMware.

Option 3: Extended/nested page tables

[Figure: the guest OS page directory and page tables are reached through CR3; the host page tables kept by the hypervisor (VMM SW) are reached through EPTP. HW: CPU, TLB.]

Nested/extended page tables - description

The name implies having page tables within page tables.

The essence of the idea is a hardware assist.

o Hardware has an extra pointer and the ability to walk an extra set of page tables.

o Idea is called Extended Page Tables (EPT) by Intel

Guest page tables hold Guest VA → Guest PA mapping, accessed through the standard CR3

Extended page tables hold Host VA → Host PA mapping, accessed through EPTP (EPT pointer).

Host VA = Guest PA

Walking extended page tables

[Figure: hardware walk of the guest page tables and the extended page tables]

Extended page tables – description (cont'd)

TLB as usual holds Guest VA → Host PA

On memory access

o If found in TLB – no problem.

o If not in TLB, but no page fault, hardware walks both tables and updates TLB.

o If page fault, then hypervisor gets host virtual page (guest physical page) and maps it to host physical page.
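The three cases can be sketched with a toy MMU. Both tables are flat dictionaries of page numbers, which hides the fact that a real walk also translates every guest page-table pointer through the extended tables. The class and its fields are hypothetical.

```python
class NestedMMU:
    def __init__(self, guest_pt, ept):
        self.guest_pt = guest_pt  # guest virtual page -> guest physical page
        self.ept = ept            # guest physical page -> host physical page
        self.tlb = {}             # guest virtual page -> host physical page
        self.walks = 0

    def translate(self, gva):
        if gva in self.tlb:  # TLB hit: no walk needed
            return self.tlb[gva]
        self.walks += 1
        if gva not in self.guest_pt:
            raise LookupError("guest page fault")  # handled by the guest OS
        gpa = self.guest_pt[gva]                   # first set of tables
        if gpa not in self.ept:
            raise LookupError("EPT violation")     # handled by the hypervisor
        hpa = self.ept[gpa]                        # second set of tables
        self.tlb[gva] = hpa                        # the HW updates the TLB
        return hpa


mmu = NestedMMU(guest_pt={0: 2}, ept={2: 42})
first = mmu.translate(0)   # walks both tables
second = mmu.translate(0)  # served from the TLB
```

When the guest maps a page whose guest physical page is not yet in `ept`, `translate` raises. Once the hypervisor adds the missing entry, the access succeeds without any change to the guest tables.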

Sources

“Modern operating systems”, 4th edition, A. Tanenbaum and H. Bos

“Virtual machines”, J. E. Smith and R. Nair

A presentation by Niv Gilboa from CSE@BGU

“Formal requirements for virtualizable third generation architectures”, G. J. Popek and R. P. Goldberg, CACM, 1974

“A comparison of software and hardware techniques for x86 virtualization”, K. Adams and O. Agesen, ASPLOS 2006

A presentation by VMware