Introduction to Xen
- A Hypervisor (on x86)
Advisor: Chih-Wen Hsueh
Student: Tang-Hsun Tu
National Taiwan University, Graduate Institute of Networking and Multimedia
Wireless Networking and Embedded Systems Laboratory Real-Time System Software Group
April 20, 2023
/482National Taiwan University, Graduate Institute of Networking and Multimedia
Tang-Hsun Tu
Outline
Introduction
  What is Virtualization ?
  Why Virtualization is Difficult ?
  How to Virtualize ?
Xen Architecture
  Hypervisor
  CPU Virtualization
  Memory Virtualization
  I/O Device Virtualization
  Hardware-Assisted Virtualization
Conclusion
What is Virtualization ?
[Figure: virtualization enables running applications (cross-platform), security, sharing hardware resources, fully utilizing hardware, etc. -> Virtual Machine!]
Why Virtualization is Difficult ? (1/2)
The OS is moved out of ring 0 to ring 1 or ring 3
  0/1/3 ring model, e.g. x86_32
  0/3/3 ring model, e.g. x86_64, ARM
On x86, some sensitive instructions are not privileged, so they cannot be trapped when the OS runs outside ring 0
Critical instructions (sensitive but not privileged)
  Sensitive register instructions: SGDT, SIDT, SLDT, SMSW, PUSHF(D), POPF(D)
  Protection system instructions: LAR, LSL, VERR, VERW, PUSH, POP, CALL, JMP, INT, RET, STR, MOV
Why Virtualization is Difficult ? (2/2) - Examples
SGDT, SIDT and SLDT
  SGDT m      // save gdtr to memory
  SIDT m      // save idtr to memory
  SLDT r/m16  // save ldtr to memory
  Only one gdtr, idtr and ldtr on a CPU!
POP
  POP ss      // need to satisfy RPL=CPL=DPL
  CPL changes from 0 to 1 or 3!
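The RPL=CPL=DPL condition on POP ss can be sketched in C. This is a simplified model and not the complete x86 check: the RPL is taken from the low two bits of the selector, and the DPL and CPL are passed in as plain integers. The function names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Requested privilege level: the low two bits of a segment selector. */
unsigned demo_selector_rpl(uint16_t selector)
{
    return selector & 0x3u;
}

/* Simplified model of the check made when SS is loaded:
 * RPL, DPL and CPL must all be equal, otherwise the load faults. */
bool demo_ss_load_allowed(uint16_t selector, unsigned dpl, unsigned cpl)
{
    return demo_selector_rpl(selector) == cpl && dpl == cpl;
}
```

With a ring-0 stack selector (RPL 0, DPL 0), `demo_ss_load_allowed(0x10, 0, 0)` holds, while the same selector fails once the OS runs at CPL 1, which is the situation the slide describes.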
How to Virtualize ? (1/2)
Full Virtualization: binary translation
Para Virtualization: hypercall
Hardware Assisted Virtualization: Intel VT-x & AMD SVM
How to Virtualize ? (2/2)
Hypervisor (VMM) Type
  Type I + Microkernel: Xen (open source, Citrix), Microsoft Hyper-V
  Type I + Integrated kernel: VMware ESX, KVM (Kernel-based VM)
  Type II (Host OS + Guest OS): VMware GSX, VMware Workstation, Microsoft Virtual PC, Microsoft Virtual Server, Sun VirtualBox
Xen Architecture (1/2)
[Figure: Domain 0 and Domain U running on top of the Hypervisor]
Xen Architecture (2/2)
Compared to common Linux
  Linux                  Xen
  System Calls           Hyper Calls
  Signals                Events
  Interrupts             Physical + Virtual Interrupts
  CPU                    Physical + Virtual CPU
  Filesystem             XenStore
  Virtual Memory         3-level memory
  POSIX Shared Memory    Grant Tables/Shared Pages
Xen Architecture
  Boot
  Hypervisor
    Hyper Call & System Call
    Event Channel
    Grant Table
  CPU Virtualization
    Virtual CPU Architecture
    Scheduling
    Interrupt
  Memory Virtualization
    Shared Info Page
    Memory Architecture
    Translation
  I/O Device Virtualization
    Split Device Driver
    Device I/O Ring
  Build System
    Build Xen
    Build XCI
Boot
For paravirtualized guest OSes
  Start in “protected mode”
  Use the start info page
    The address of the start info page is put in the “esi” register
For HVM guest OSes
  Start in “real mode” (emulated BIOS)
  With QEMU
Hypervisor - Hyper Call & System Call (1/2)
System Call: int 0x80, call number in eax
  // linux/include/asm/unistd.h
  #define __NR_restart_syscall 0
  #define __NR_exit 1
  #define __NR_fork 2
  #define __NR_read 3
  …
Hyper Call: int 0x82, call number in eax
  // xen/include/public/xen.h
  #define __HYPERVISOR_set_trap_table 0
  #define __HYPERVISOR_mmu_update 1
  #define __HYPERVISOR_set_gdt 2
  #define __HYPERVISOR_stack_switch 3
  …
[Figure: the Guest OS issues a hypercall with int 82h (e.g. HYPERVISOR_sched_op); the Hypervisor looks it up in hypercall_table, runs do_sched_op, and returns with iret to resume the Guest OS]
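The lookup through hypercall_table can be sketched as an array of function pointers indexed by the hypercall number. This is only an illustration of table-driven dispatch and not Xen's code: the four numbers are the ones listed on the slide, while the handlers, the table size, the error value and the `demo_` names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypercall numbers as listed on the slide (xen/include/public/xen.h). */
#define __HYPERVISOR_set_trap_table 0
#define __HYPERVISOR_mmu_update     1
#define __HYPERVISOR_set_gdt        2
#define __HYPERVISOR_stack_switch   3

#define DEMO_NR_HYPERCALLS 4      /* hypothetical table size for this sketch */
#define DEMO_ENOSYS        (-38)  /* hypothetical "no such hypercall" error */

typedef long (*demo_hypercall_fn)(long arg);

/* Hypothetical handlers: each returns a value that identifies it. */
long demo_set_trap_table(long arg) { return 100 + arg; }
long demo_mmu_update(long arg)     { return 200 + arg; }
long demo_set_gdt(long arg)        { return 300 + arg; }
long demo_stack_switch(long arg)   { return 400 + arg; }

static const demo_hypercall_fn demo_hypercall_table[DEMO_NR_HYPERCALLS] = {
    [__HYPERVISOR_set_trap_table] = demo_set_trap_table,
    [__HYPERVISOR_mmu_update]     = demo_mmu_update,
    [__HYPERVISOR_set_gdt]        = demo_set_gdt,
    [__HYPERVISOR_stack_switch]   = demo_stack_switch,
};

/* Dispatch on the number the guest placed in eax; reject unknown numbers. */
long demo_dispatch(unsigned long nr, long arg)
{
    if (nr >= DEMO_NR_HYPERCALLS || demo_hypercall_table[nr] == NULL)
        return DEMO_ENOSYS;
    return demo_hypercall_table[nr](arg);
}
```

The bounds check matters because the number comes from the guest and cannot be trusted.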
Hypervisor - Hyper Call & System Call (2/2)
How do system calls work with hyper calls ?
  [Figure: native case (application in user space at ring 3, OS service at ring 0, connected by a system call) versus Xen (application in user space at ring 3, guest OS service at ring 1, hypervisor at ring 0, connected by system call, exception, hyper call and services)]
  HVM can use SYSENTER/SYSCALL
How to do hyper calls in applications ?
  [Figure: user-space tools (xm, xend) call ioctl() on privcmd (procfs) in the guest OS, which issues the hyper call to the hypervisor services]
Hypervisor - Grant Table
Grant reference (GR)
  Grant entry
  A request with an index
Use in communication
  Page mapping & page transferring
[Figure: page mapping between Domain A and Domain B: create GR, send GR, map page, access page, unmap page, inform, release GR]
[Figure: page transferring between Domain A and Domain B: create GR, send GR, transfer page, receive page, inform, release GR]
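The page-mapping sequence (create GR, map, unmap, release GR) can be modelled with a small table indexed by the grant reference. This is a toy model under stated assumptions and not Xen's grant table interface: the structure layout, table size and all `demo_` names are hypothetical, and a "mapping" is just a counter.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_GRANT_ENTRIES 8  /* hypothetical table size */

struct demo_grant_entry {
    bool          in_use;      /* entry was created by the granting domain */
    int           peer_domid;  /* the only domain allowed to map the page */
    unsigned long frame;       /* page frame being shared */
    int           map_count;   /* active mappings held by the peer */
};

static struct demo_grant_entry demo_grant_table[DEMO_GRANT_ENTRIES];

static bool demo_gr_valid(int gr)
{
    return gr >= 0 && gr < DEMO_GRANT_ENTRIES && demo_grant_table[gr].in_use;
}

/* Granting domain: create a GR for `frame`; returns the index or -1. */
int demo_grant_create(int peer_domid, unsigned long frame)
{
    for (int gr = 0; gr < DEMO_GRANT_ENTRIES; gr++) {
        if (!demo_grant_table[gr].in_use) {
            demo_grant_table[gr] =
                (struct demo_grant_entry){ true, peer_domid, frame, 0 };
            return gr;
        }
    }
    return -1;
}

/* Peer domain: map the granted page; only the named peer may do so. */
int demo_grant_map(int gr, int domid, unsigned long *frame_out)
{
    if (!demo_gr_valid(gr) || demo_grant_table[gr].peer_domid != domid)
        return -1;
    demo_grant_table[gr].map_count++;
    *frame_out = demo_grant_table[gr].frame;
    return 0;
}

/* Peer domain: drop one mapping. */
int demo_grant_unmap(int gr, int domid)
{
    if (!demo_gr_valid(gr) || demo_grant_table[gr].peer_domid != domid ||
        demo_grant_table[gr].map_count == 0)
        return -1;
    demo_grant_table[gr].map_count--;
    return 0;
}

/* Granting domain: release the GR; refused while the page is still mapped. */
int demo_grant_release(int gr)
{
    if (!demo_gr_valid(gr) || demo_grant_table[gr].map_count > 0)
        return -1;
    demo_grant_table[gr].in_use = false;
    return 0;
}
```

Refusing the release while a mapping exists mirrors the "inform" step in the figure: the granting domain has to learn that the peer has unmapped the page before it reclaims the entry.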
Hypervisor - Event Channel
A lightweight signal mechanism
Use “ports” as identifiers (pending + mask)
Four major purposes: IPI, IDC, vIRQ, pIRQ
[Figure: guest OSes with their VCPUs run on the hypervisor (virtual CPU, virtual memory, scheduling), which runs on the hardware (physical CPU, physical memory, Eth0, Eth1, …); events between them go through event channel ports (port 0, port 1, …)]
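The "pending + mask" idea can be sketched with two bitmaps, one bit per port. This is a minimal model assuming 32 ports in a single word; the structure and the `demo_` function names are made up for illustration and are not Xen's interface.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per port; 32 ports is an assumption of this sketch. */
struct demo_evtchn {
    uint32_t pending;  /* an event was sent on the port */
    uint32_t mask;     /* delivery on the port is currently blocked */
};

/* Sender side: mark the port pending (port must be below 32). */
void demo_evtchn_send(struct demo_evtchn *ec, unsigned port)
{
    ec->pending |= UINT32_C(1) << port;
}

void demo_evtchn_mask(struct demo_evtchn *ec, unsigned port)
{
    ec->mask |= UINT32_C(1) << port;
}

void demo_evtchn_unmask(struct demo_evtchn *ec, unsigned port)
{
    ec->mask &= ~(UINT32_C(1) << port);
}

/* Receiver side: return the lowest pending, unmasked port and clear its
 * pending bit, or -1 when nothing is deliverable. */
int demo_evtchn_next(struct demo_evtchn *ec)
{
    uint32_t ready = ec->pending & ~ec->mask;
    for (unsigned port = 0; port < 32; port++) {
        if (ready & (UINT32_C(1) << port)) {
            ec->pending &= ~(UINT32_C(1) << port);
            return (int)port;
        }
    }
    return -1;
}
```

A masked port keeps its pending bit, so the event is delivered later instead of being lost.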
CPU Virtualization - Architecture
[Figure: apps run in guest OSes; each guest OS has one or more VCPUs; hypervisor scheduling places the VCPUs on the PCPUs]
2 scheduling algorithms (Non/Work Conserving)
  Simple Earliest Deadline First (SEDF)
  Credit
CPU Virtualization - Earliest Deadline First
Assign process priorities according to the deadlines of their current requests
An example with two processes
  T1 = (slice, deadline) = (1, 2)
  T2 = (2, 8)
[Figure: EDF timeline for t = 0 to 10; X means the process has no pending request]
  t    running   d1    d2
  0    T1        2     8
  1    T2        X     8
  2    T1        4     8
  3    T2        X     8
  4    T1        6     X
  6    T1        8     X
  8    T1        10    16
  9    T2        X     16
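The example can be replayed with a small unit-time EDF simulation. It is a sketch that assumes each process releases a new request at every multiple of its deadline value (period equal to deadline), which the slide does not state explicitly; the structure and function names are hypothetical.

```c
#include <assert.h>

#define DEMO_IDLE (-1)

struct demo_task {
    int slice;      /* time units needed per request */
    int period;     /* assumed equal to the relative deadline */
    int remaining;  /* work left in the current request */
    int deadline;   /* absolute deadline of the current request */
};

/* Simulate `horizon` unit time slots of EDF.
 * schedule[t] receives the index of the task run in slot t, or DEMO_IDLE. */
void demo_edf_simulate(struct demo_task *tasks, int ntasks,
                       int *schedule, int horizon)
{
    for (int i = 0; i < ntasks; i++) {
        tasks[i].remaining = 0;
        tasks[i].deadline = 0;
    }
    for (int t = 0; t < horizon; t++) {
        /* Release a new request at the start of each period. */
        for (int i = 0; i < ntasks; i++) {
            if (t % tasks[i].period == 0) {
                tasks[i].remaining = tasks[i].slice;
                tasks[i].deadline = t + tasks[i].period;
            }
        }
        /* Pick the pending task with the earliest absolute deadline. */
        int pick = DEMO_IDLE;
        for (int i = 0; i < ntasks; i++) {
            if (tasks[i].remaining > 0 &&
                (pick == DEMO_IDLE ||
                 tasks[i].deadline < tasks[pick].deadline))
                pick = i;
        }
        if (pick != DEMO_IDLE)
            tasks[pick].remaining--;
        schedule[t] = pick;
    }
}
```

Running it for ten slots with T1 = (1, 2) as task 0 and T2 = (2, 8) as task 1 runs T1 in slots 0, 2, 4, 6 and 8, T2 in slots 1, 3 and 9, and leaves slots 5 and 7 idle.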
CPU Virtualization - SEDF
Each VCPU has (slice, period, deadline)
Two queues
  Run queue: VCPU1, VCPU2, VCPU3, VCPU4 with d1 < d2 < d3 < d4 …
  Wait queue: VCPU1, VCPU2, VCPU3 with s1 < s2 < s3 …
Cannot do load balancing on SMP
  e.g. 3 domains (A: 80%, B: 80%, C: 30%), 2 PCPUs
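One way to read the 80%/80%/30% example: the total demand (190%) is below the capacity of two PCPUs (200%), yet no fixed assignment of whole domains to PCPUs keeps both at or below 100%. The brute-force check below illustrates that arithmetic; it is an interpretation of the slide, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Return true if the domains (CPU demand in percent each) can be pinned to
 * two PCPUs, without migration, so that neither PCPU exceeds 100%.
 * Tries every assignment; ndomains is assumed to be small (below 32). */
bool demo_fits_without_migration(const int *util_percent, int ndomains)
{
    for (unsigned assign = 0; assign < (1u << ndomains); assign++) {
        int load[2] = { 0, 0 };
        for (int i = 0; i < ndomains; i++)
            load[(assign >> i) & 1u] += util_percent[i];
        if (load[0] <= 100 && load[1] <= 100)
            return true;
    }
    return false;
}
```

For {80, 80, 30} every split puts at least 110% on one PCPU, so the check fails; for {80, 70, 30} the split 80 | 70 + 30 succeeds.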
CPU Virtualization - Credit
Each VCPU has (weight, cap), which determine its credit and whether it is under or over
Each PCPU has a VCPU list
  Priority queue
Two priority states: over, under
  Over: consume > allocate
  Under: consume < allocate
If a PCPU has no “under” VCPU in its own queue, the hypervisor selects an “under” VCPU from another PCPU
[Figure: priority queue VCPU1 (under), VCPU2 (under), VCPU3 (under), VCPU4 (over)]
CPU Virtualization - Interrupt (1/2)
[Figure: interrupt sources (PIT, keyboard, RTC) with the 8259A versus IOAPIC + LAPIC]
CPU Virtualization - Interrupt (2/2)
Physical interrupt
  For the hypervisor or for guest OSes
Virtual interrupt
  Ask guest OSes to do
  8 for now (max is 24)
[Figure: native case (Device, IRQn, PIC, OS on hardware) versus Xen (Device, IRQn, PIC, hypervisor ISR on hardware, delivering an event to the guest OSes)]
Memory Virtualization - Memory Architecture (1/2)
Two-level memory (OS)
  Application - Virtual Memory
  OS - Physical Memory
Three-level memory (hypervisor): Virtual, Pseudo-physical, Machine
  Application - Virtual Memory
  Guest OS - Pseudo-Physical Memory
  Hypervisor - Machine Memory
P2M and M2P tables translate between pseudo-physical and machine memory
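The P2M and M2P tables can be pictured as two arrays that are kept in step: one indexed by the guest's pseudo-physical frame number (PFN), one by the machine frame number (MFN). The sketch below models a single guest with tiny, made-up table sizes; the names are hypothetical and are not Xen's.

```c
#include <assert.h>

#define DEMO_GUEST_PAGES   4      /* pseudo-physical frames of one guest */
#define DEMO_MACHINE_PAGES 16     /* machine frames in the toy system */
#define DEMO_INVALID       (~0ul) /* marks "no translation" */

static unsigned long demo_p2m[DEMO_GUEST_PAGES];    /* PFN -> MFN */
static unsigned long demo_m2p[DEMO_MACHINE_PAGES];  /* MFN -> PFN */

/* Start with no translations at all. */
void demo_p2m_init(void)
{
    for (unsigned long i = 0; i < DEMO_GUEST_PAGES; i++)
        demo_p2m[i] = DEMO_INVALID;
    for (unsigned long i = 0; i < DEMO_MACHINE_PAGES; i++)
        demo_m2p[i] = DEMO_INVALID;
}

/* Record that guest frame `pfn` is backed by machine frame `mfn`,
 * updating both tables together. Returns 0 on success, -1 if out of range. */
int demo_p2m_map(unsigned long pfn, unsigned long mfn)
{
    if (pfn >= DEMO_GUEST_PAGES || mfn >= DEMO_MACHINE_PAGES)
        return -1;
    demo_p2m[pfn] = mfn;
    demo_m2p[mfn] = pfn;
    return 0;
}

unsigned long demo_pfn_to_mfn(unsigned long pfn)
{
    return pfn < DEMO_GUEST_PAGES ? demo_p2m[pfn] : DEMO_INVALID;
}

unsigned long demo_mfn_to_pfn(unsigned long mfn)
{
    return mfn < DEMO_MACHINE_PAGES ? demo_m2p[mfn] : DEMO_INVALID;
}
```

The guest sees a contiguous range of PFNs (0, 1, 2, …) even though the backing MFNs (9 and 3 in the test) are scattered across machine memory.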
Memory Virtualization - Memory Architecture (2/2)
168M memory for hypervisor
  Area                                                 Size
  MPT, Machine-to-Physical Translation Table (RO)      16M
  Page-Frame Information                               96M
  MPT, Machine-to-Physical Translation Table (R/W)     16M
  Linear Page Table                                    8M
  Shadow Linear Page Table                             8M
  Per Domain Mappings                                  8M
  Direct Map                                           12M
  I/O Remap                                            4M
[Figure: address labels 0xFC000000, 0xFC400000 and 0xFFFFFFFF; Heap]
Memory Virtualization - Translation (1/2)
4 mechanisms to manipulate page tables
  Paravirtualized page tables
  Writable page tables (only level 1 is writable)
  Shadow page tables
  Hardware-assisted paging (Intel: Extended Page Tables, AMD: Nested Page Tables)
[Figure: translation from Virtual Memory through Pseudo-Physical Memory to Machine Memory by the MMU: Page Table (VM->PFN, Page Fault!), Shadow Page Table (VM->MFN or VM->P2M), P2M, Second Level Paging (HAP)]
Memory Virtualization - Translation (2/2)
Comparison
  Type                          Space Overhead   Computation Overhead   Guest OS Modification   Requiring HW Support
  Paravirtualized page table    Low (N)          Low                    A lot                   No
  Writable page table           Low (N)          High                   Some                    No
  Shadow page table             High (2N)        High                   None                    No
  Hardware-assisted paging      Medium (N+M)     Medium                 None                    Yes
N is the number of page tables in all guests. M is the number of all guests.
Memory Virtualization - Shared Info Page
Structure
  [Figure: shared info page fields: event channel, wall clock, memory, TSC; MAX is 32 VCPUs]
Compare with start_info_page
                 Start Info Page    Shared Info Page
  Mapped by      Domain Builder     Guest OS
  Information    Static             Dynamically Updated
I/O Device Virtualization - Device Model
Hypervisor also provides three mechanisms to use devices.
Emulated Devices
Paravirtualized Driver
Pass-through
I/O Device Virtualization - Emulated Devices
Implemented by QEMU (QEMU-DM)
  e.g. sound card, ac97, sb16, etc.
I/O Device Virtualization - Paravirtualized Driver
Split Device Driver Model
An example of sending packets
  [Figure: Front-End Driver -> Back-End Driver -> Native Driver]
I/O Device Virtualization - I/O Ring
The ring carries no data; it only transfers requests/replies
An example with GR
  [Figure: Dom U and Dom 0 exchange GRs over the I/O channel; the hypervisor holds the Grant Table and the Active Grant Table; Dom 0 accesses the device]
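A ring that carries only requests and replies can be sketched with free-running producer and consumer indices over a small array of slots. In this toy model each request holds an id and a grant reference instead of the data itself, and a reply reuses a slot whose request was already consumed. The layout, sizes and `demo_` names are assumptions for illustration, not Xen's ring macros.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_RING_SIZE 4  /* must be a power of two */

struct demo_request  { int id; int gref; };    /* gref: grant reference */
struct demo_response { int id; int status; };

union demo_slot {
    struct demo_request  req;
    struct demo_response rsp;
};

struct demo_ring {
    unsigned req_prod, req_cons;  /* requests produced / consumed */
    unsigned rsp_prod, rsp_cons;  /* responses produced / consumed */
    union demo_slot slots[DEMO_RING_SIZE];
};

/* Front end: queue a request; fails when every slot is still occupied by an
 * outstanding request or an unread response. */
bool demo_front_put_request(struct demo_ring *r, struct demo_request req)
{
    if (r->req_prod - r->rsp_cons >= DEMO_RING_SIZE)
        return false;
    r->slots[r->req_prod % DEMO_RING_SIZE].req = req;
    r->req_prod++;
    return true;
}

/* Back end: consume one request and write its response into a slot that
 * has already been consumed. */
bool demo_back_handle_one(struct demo_ring *r, int status)
{
    if (r->req_cons == r->req_prod)
        return false;
    struct demo_request req = r->slots[r->req_cons % DEMO_RING_SIZE].req;
    r->req_cons++;
    struct demo_response rsp = { req.id, status };
    r->slots[r->rsp_prod % DEMO_RING_SIZE].rsp = rsp;
    r->rsp_prod++;
    return true;
}

/* Front end: read the next response, if any. */
bool demo_front_get_response(struct demo_ring *r, struct demo_response *out)
{
    if (r->rsp_cons == r->rsp_prod)
        return false;
    *out = r->slots[r->rsp_cons % DEMO_RING_SIZE].rsp;
    r->rsp_cons++;
    return true;
}
```

Because the indices only ever increase and the size is a power of two, `index % DEMO_RING_SIZE` stays correct even when the unsigned counters wrap around.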
I/O Device Virtualization - Pass-Through
Pass the device through to a domain, which uses it directly
[Figure: Dom 0 and Dom U each run a native driver on top of the hypervisor (virtual CPU, virtual memory, scheduling); the hardware provides the physical CPU, physical memory, Eth0, Eth1, …]
Hardware Virtual Machine (1/3)
Intel Virtualization Technology
  Technology   Description                             Virtualization     Implementation
  VT-x         Root/Non-Root, Extended Page Tables     CPU, Memory        Instruction Set
  VT-i         As VT-x, for Itanium
  VT-d         DMA, Interrupt                          Devices            IOMMU (Chipset)
  VT-c         Classify Packets                        Network Devices    VMDq, VMDc
Hardware Virtual Machine (2/3)
Architecture
  [Figure: without VT-x, Guest App (ring 3), Guest OS (ring 1) and Hypervisor (ring 0); with VT-x, Guest App and Guest OS run in non-root mode and the Hypervisor in root mode, entering the guest with VMLAUNCH/VMRESUME]
Intel VT-x
  Supported if CPUID.1:ECX.VMX[bit 5] = 1
  Descriptions                                           Instructions
  En/Disabling VMX                                       VMXON, VMXOFF
  Launch/Resume VM                                       VMLAUNCH, VMRESUME
  Calling to VMM                                         VMCALL
  Controlling Virtual Machine Control Structure (VMCS)   VMPTRLD, VMPTRST, VMREAD, VMWRITE, VMCLEAR
  Invalidate Translations                                INVEPT, INVVPID
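The support test CPUID.1:ECX.VMX[bit 5] = 1 amounts to a single bit check. The sketch below takes the ECX value as a parameter so that it stays portable; the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CPUID1_ECX_VMX_BIT 5  /* bit position given on the slide */

/* cpuid1_ecx: the ECX value returned by CPUID with leaf 1. */
bool demo_cpu_has_vmx(uint32_t cpuid1_ecx)
{
    return ((cpuid1_ecx >> DEMO_CPUID1_ECX_VMX_BIT) & 1u) != 0;
}
```

On x86 with GCC or Clang, the ECX value can be obtained with `__get_cpuid(1, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`. The bit only reports that the CPU implements VMX; firmware may still have the feature disabled.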
Hardware Virtual Machine (3/3)
Use BIOS code from Bochs
Replace several functions, e.g. SYSENTER
HVM Device: QEMU-DM
Build Xen - Xen Source Tree
http://rswiki.csie.org/lxr/http/source/?v=xen-3.4.1
[Figure: source tree annotations: hypervisor; QEMU-DM, Bootloader, xm, xend, …; a mini paravirtualized OS]
Build Xen - Screenshot
Build Xen - A Simplest Xen Kernel
Headers to tell the Xen loader about the OS
  #include <arch-x86_32.h>

  .section __xen_guest
      .ascii "GUEST_OS=Hacking_Xen_Example"
      .ascii ",XEN_VER=xen-3.0"
      .ascii ",VIRT_BASE=0x0"
      .ascii ",ELF_PADDR_OFFSET=0x0"
      .ascii ",HYPERCALL_PAGE=0x2"    /* page number of hypercall_page */
      .ascii ",PAE=yes"
      .ascii ",LOADER=generic"
      .byte 0
Memory layout
  0x0      _start, stack_start
  0x1000   shared_info
  0x2000   hypercall_page
  0x3000   …
Entry point
  _start:
      cld
      lss stack_start, %esp    /* load the initial stack */
      push %esi                /* esi holds the address of the start info page */
      call start_kernel
Kernel
  void start_kernel(start_info_t *start_info)
  {
      /* hypercall: write the 12-byte string to the Xen console */
      HYPERVISOR_console_io(CONSOLEIO_write, 12, "Hello World\n");
      while (1);
  }
Build XCI - Xen Client Initiative (1/2)
Goals
  Creating a minimal environment of Xen, i.e. Xen hypervisor + Linux domain 0, suitable for clients
  Supporting more devices through ioemu
XCI consists of three subprojects
  Hypervisor (original code + patches + new management tools)
  ioemu (separated from the original Xen source tree)
  Domain-0 Linux
Build XCI - Xen Client Initiative (2/2)
                           Xen                               XCI
  Hypervisor               482 KB                            533 KB
  Kernel Version           2.6.18.8                          2.6.27.23
  Kernel Source Diff       692,054 lines                     5,790,133 lines
  Kernel Size              2.22 MB (Dom0), 1.24 MB (DomU)    4.32 MB (Dom0)
  Filesystem and Library   Up to you                         uClibc + Busybox, Total: 100M/33.9M
Only x86, ia64 and arm in “arch” directory
Experimental Environment
CPU: Intel Core2 U9400 1.4GHz (use one core)
Memory: 512MB
Network Interface Card: Atheros AR8131 (at 100MBps)
Hypervisor: Xen 3.4.2
Dom-0: Linux 2.6.18.8
Guest OS: Windows XP
CPU Benchmark Tools
  Chrome V8 Benchmark Suite
  SuperPI 1.1e
Hard Disk Drive Benchmark Tools
  HD Tune Pro v3.50
Network Benchmark Tools
  Iperf (Server: 2.0.4, Client: 1.7.0)
CPU Benchmark (1/2)
8.3%
CPU Benchmark (2/2)
5%
Network Benchmark (1/2)
Testing Time: 180 seconds
Benchmark Deviation: 0.12%~0.26
59%
Network Benchmark (2/2)
Sample Period: 2 seconds
Average: 9.82%
Conclusion
We introduced the techniques for virtualization, i.e. full, para and hardware-assisted virtualization.
We presented the architecture of Xen. Several parts of Xen were also introduced.
  Part                        Introductions
  Hypervisor                  Boot, Hyper Call, Grant Table, Event Channel
  CPU Virtualization          VMLAUNCH, VMRESUME
  Memory Virtualization       Architecture, Translation, Shared Info Page
  I/O Device Virtualization   Device Model (Emulated, PV and Pass-Through), I/O Ring
  Hardware Virtual Machine    Virtualization Technology
Q & A