

MinixVM: An Implementation of Virtual Memory in Minix 3

Carter Weatherly

710 Masters Project

The College of William & Mary

Submitted to the Department of Computer Science

of the College of William & Mary

in partial fulfillment of the requirements for the degree of

Master of Science

April 24, 2009

Abstract

Minix 3 is the evolutionary development of Andrew Tanenbaum’s original Minix microkernel. While the purpose of Minix 1 was to serve as an educational tool for teaching operating systems, the express purpose of Minix 3 is to provide a reliable, general-purpose operating system based on the microkernel architecture [3]. Because of its heritage, Minix 3 lacks many capabilities considered necessary for a modern, general-purpose operating system. One such major deficiency is the lack of a virtual memory implementation. Virtual memory is one of the most fundamental abstractions provided by a modern operating system. Without virtual memory, Minix 3 cannot transparently allocate more memory than is physically present, share discrete units of memory between address spaces, or map files into memory. This report presents MinixVM, a modification of the Minix 3.1.3a source to include virtual memory, and discusses the issues involved in integrating virtual memory into the microkernel architecture.


Contents

1 Introduction 7

2 Minix 3 Design 9

2.1 Microkernel Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 Minix 3 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.1 Minix 3 Messaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2.2 Minix 3 Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.3 Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.3.1 Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.3.2 Executable Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.3.3 Swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 MinixVM Design 20

3.1 Actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2 Pager Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2.1 Pager Server Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2.2 Pager Server Build . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2.3 Pager Server Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.3 MinixVM Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.3.1 Pager Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.3.2 Exec Notification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.3.3 Page Fault Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4 MinixVM Implementation 30

4.1 Architectural Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1.1 Memory Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1.2 Hardware Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1.3 Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


4.1.4 Page Fault Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.2 Modification Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2.1 Kernel Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2.2 PM Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2.3 VFS Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.3 New Server Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.3.1 New Kernel Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.3.2 New PM Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.3.3 New VFS Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.3.4 Pager Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5 Performance Analysis 43

5.1 General Program Runtime Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.2 Sequential Scan Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6 Future Work 47

6.1 Removing Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6.2 System Server Reloading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.3 VM Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.3.1 Page Swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.3.2 Copy-On-Write Page Sharing . . . . . . . . . . . . . . . . . . . . . . . . 49

6.3.3 mmap() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.3.4 Dynamic Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

7 Conclusion 51

8 References 52

9 Appendix: MinixVM Patch Summary 53


List of Figures

1 Minix 3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Minix 3 Boot Image Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Minix 3 Boot Image Layout, with Pager . . . . . . . . . . . . . . . . . . . . . . 22

4 Pager Initialization Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5 Exec Notification Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

6 Page Fault Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

7 Segmented Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . 31

8 Paged Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

9 General Program Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

10 Scan Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


List of Tables

1 New Kernel Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2 New PM Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3 New VFS Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4 Pager Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


1 Introduction

The world of operating systems is dominated by a single design. The monolithic kernel — where all of the operating system services are contained in a single binary running at the highest hardware-enforced privilege level — underpins most modern operating systems. This approach was a natural extension of the pre-operating system days, when application programs interacted directly with the hardware (treating the operating system effectively as a “library” through which hardware resources are accessed). However, the monolithic design is not the only one available to operating system developers.

The microkernel architecture was originally conceived as a way to simplify operating systems, which had grown more and more complex as the capabilities of the hardware they controlled increased. The fundamental idea was to move as much code out of the kernel as possible, placing it in clearly delineated “servers” that run at a less privileged level than the kernel.

Minix 3 employs the microkernel architecture, and is the focus of this report. While Minix

3 purports to be a general-purpose operating system, it lacks many of the operating system

primitives that developers have come to expect from modern operating systems. In particular,

Minix 3 lacks a virtual memory (VM) implementation. This report presents MinixVM, a design

and prototype implementation of VM in Minix 3.

The majority of this report is dedicated to the design and implementation of MinixVM; however, the ultimate goal of this report is to explore the difficulties of programming a microkernel within the context of MinixVM. As such, much of the presentation of MinixVM is interspersed with discussions on the microkernel architecture and how it sometimes forces awkward design and implementation decisions.

The remainder of this report follows this organization:

• Section 2: Minix 3 Design

Presents a generic overview of the microkernel architecture; the design of Minix 3 as it relates to the design and implementation of MinixVM; and discusses some of the advantages and disadvantages of the microkernel architecture.

• Section 3: MinixVM Design

Describes the design of MinixVM, and how this design is shaped by the underlying

design of Minix 3; describes the new system server, pager; and outlines the major

operations that MinixVM must support.

• Section 4: MinixVM Implementation

Describes the hardware architectural support for virtual memory; provides a sum-

mary of the material modifications made to existing Minix 3 components to im-

plement MinixVM; and lists the new messages added to support MinixVM.

• Section 5: Performance Analysis

Contains a brief performance analysis of the MinixVM implementation, focused on

the overhead required to handle page faults.

• Section 6: Future Work

Lists the items from the MinixVM design that are not complete in the MinixVM

implementation and suggests extensions to the MinixVM design.

• Section 7: Conclusion

Concludes the report, including a commentary on the viability of MinixVM and

microkernels in general.

Also included is an Appendix, which summarizes the changes made to the Minix 3.1.3a

source.


2 Minix 3 Design

This section describes the design of microkernels, the relevant portions of Minix 3 design,

and how the design of the operating system imposes limitations on the implementation.

2.1 Microkernel Architecture

A monolithic kernel is frequently referred to as being “vertically stacked,” in that it places layer upon layer of abstraction on the hardware in its creation of operating system services. A microkernel, on the other hand, is called “horizontally stacked,” in that it provides the same services as a monolithic kernel, except that the majority of these services exist in userspace as multiple interdependent binaries.

Like a monolithic kernel, a microkernel includes a binary that runs at the highest privilege level, but this binary is designed to be as small as possible. Typically the “kernel” of a microkernel only contains the architecture-specific code necessary to boot the system and interact with the hardware, essential hardware drivers (e.g. the programmable interrupt controller or clock driver), minimal scheduling primitives, and an efficient mechanism that the remaining microkernel components use to communicate with each other (typically inter-process communication, or IPC). The remainder of the services typically provided by an operating system are supplied by various “servers” which run in userspace. The level of access that these servers are given to hardware is determined by their function (this necessarily creates a set of trusted “system servers” without which the system would be effectively non-functional).

2.1.1 Advantages

The following is a list of the main advantages of the microkernel architecture over other

operating system architectures. Where relevant, examples from Minix 3 are provided.

• Functional Separation

From a pure design perspective, microkernels provide a perfect component-based architecture, where each component has independent and clearly-defined functional responsibilities. This should simplify systems development and maintenance, much like the usage of shared libraries does for application development.

• Service Isolation

In a monolithic kernel, all operating system services operate in the same address space. If there is an error in any of this code, it could potentially corrupt the address space of the kernel and halt the system unexpectedly. In a microkernel, such a fault would be isolated to the offending userspace process.

• Fault Tolerance

Since operating system services are both isolated and component-based in the microkernel architecture, they lend themselves naturally to the mechanisms of fault tolerance from distributed systems research. In fact, Minix 3 includes in its set of system services a “Reincarnation Server,” which will restart a service from which it ceases to receive status notifications, making the protected components fault tolerant [2]. If this could be extended to the entire operating system (it cannot; see Section 2.1.2), then Minix 3 would be fault tolerant.

• Trusted Computing

The microkernel architecture provides an enticing target for creating a trusted operating system, since current methods still rely heavily upon source code auditing and the core components of microkernels are small. Also, the isolation of system services to userspace limits the amount of damage an exploited service could cause.

2.1.2 Disadvantages

The following is a list of the main disadvantages of the microkernel architecture over other

operating system architectures. Where relevant, examples from Minix 3 are provided.


• Operating System Development

The primary disadvantage of the microkernel architecture is that, in its attempt to make fault propagation more difficult and functional boundaries clear, it also complicates programming operating system functionality. Even the most straightforward concepts require drastically different (and significantly more complex) implementations than in a monolithic kernel. This is especially true with VM and will become more apparent in Section 3. This complexity in design also directly leads to difficulty in testing and debugging, effectively compounding the typical problems of developing an operating system with the difficulties involved in developing a distributed system. These issues, as well as the ready availability of operating systems based upon monolithic kernels, have been the largest barriers to microkernel adoption.

• Functional Coupling

The microkernel architecture is ideally composed of loosely-coupled communicating components, but in practice this ideal is one of the first sacrificed when attempting to provide service implementations. This is clearly evident in Minix 3, where there is a hierarchy of “trusted” servers, or system servers (e.g. the process manager and virtual file system manager). Without the set of system servers, the operating system would not function. It is true that the components could be replaced (as is expected when the functionality has been abstracted), but only if the replacement components make the same assumptions about the processing of data — in a real system like Minix 3, components share data outside of their clearly defined APIs (e.g. the data copying necessary for true message handling described in Section 2.2.1). This extra-API processing prevents true functional separation in Minix 3 (likely, in any full-featured microkernel).

• Data Fracturing and Coupling

Each service may need to manipulate the same abstraction in different ways. Data copying between servers is relatively expensive, so in order to avoid that, microkernels tend to structure such “shared” data in a fashion similar to a relational database. All servers agree upon an identifier for the abstraction and link that with their own set of relevant data. This makes it difficult to manage these data structures, introduces data coupling between the servers, and adds further communications overhead to synchronize these data structures. An example of this in Minix 3 is the process: the kernel, PM, and VFS all maintain their own set of data relevant to their processing for each process in the system.

• IPC Overhead

There have been claims, and indeed early evidence suggested, that microkernels suffer from inherent performance deficiencies compared to monolithic kernels, given the IPC overhead and the frequent kernel/user mode switches. These claims have largely been refuted by [6]; however, it is true that microkernels will always incur more overhead in offering services than monolithic kernels (due to the IPC). In the case of Minix 3 in particular, the simplifications to the server IPC mechanism (described in Section 2.2.1) ease the creation of system servers (all internal operations are guaranteed to be atomic); however, they come with a decrease in performance. System servers are the brokers of shared resources, and their one-message-at-a-time processing makes multiplexing access to these resources impossible. There are, however, ways around these limitations, as will be discussed in Section 3.

• Fault Tolerance Myth

It seems as if the microkernel architecture would provide fault tolerance in the way that distributed systems do, but in practice this is very difficult to achieve in the general case. In fact, in Minix 3, this goal is only partially achievable. If a system server were to crash, it could theoretically be restarted; however, the new instance would lack all of the state information of the previous instance. Without this data, the system would cease to function (e.g. imagine having the process table deleted from a running monolithic kernel). The alternative would be to treat each operation on a critical system server as a transaction, mirroring the data in a separate server (either as a data store or as a warm failover component). This has the disadvantage of (likely significantly) decreasing performance without actually solving the issue: rather than a single system server failure bringing down the whole system, a double failure (of both the system server and its transactional backing store) would do so. Because of these issues, Minix 3 only attempts to “reincarnate” device driver servers [2]. While device drivers account for the majority of system crashes in some modern operating systems [1], this partial implementation does not make Minix 3 a true fault tolerant operating system.

• Reverse Dependencies

A reverse dependency, in the context of the microkernel architecture, occurs when the kernel depends upon a userspace component to complete the processing of a kernel-level operation. One generic example would arise if the process scheduler were implemented in a userspace server. While the mechanisms to perform the scheduling exist in the kernel, if the kernel needed to run the scheduler, it would have to reach out to the userspace scheduler. This is a particularly insidious form of functional coupling, which may not be entirely avoidable in the microkernel architecture (see Section 6.1 for a concrete example applicable to MinixVM).

2.2 Minix 3 Components

Minix 3 roughly follows the typical microkernel design described in Section 2.1. Figure 1

illustrates the layout of Minix 3 (image taken from [2]).


Figure 1: Minix 3 Architecture

Of the components listed, the following are directly relevant to the design and implementation of MinixVM (much more detail will be given in Section 3 and Section 4).

• Microkernel

The microkernel is the only component of Minix 3 that runs in kernel mode (clear

from Figure 1). It manages the interrupt vectors and exception handlers and so

will be the first component in MinixVM to begin handling page faults. This portion

of Minix 3 will be referred to as the “kernel” since it is the only code run in kernel

mode. It is important to note that the kernel cannot dynamically allocate memory

(memory allocation is controlled by the process manager).

• Process Manager

The process manager, or PM, controls all aspects of a process in Minix 3, except for the actual scheduling or low-level management (all of which is in the kernel). All memory management in Minix 3 is controlled by PM (see Section 2.3 for why). Since almost all process-related messaging routes through PM, it is central to MinixVM.


• File Server

The file server is typically a collection of at least two services: the virtual file

system service (VFS) and the file system driver service. Almost all I/O operations

route through VFS, and it is this service that MinixVM must interact with in order

to perform its I/O requests (e.g. reading a page from disk into memory).

2.2.1 Minix 3 Messaging

The Minix 3 processes communicate through the Minix 3 IPC, or messaging system. The messaging system resides in the kernel, since it needs to be able to access arbitrary memory (to copy the message contents from one address space to another). This system supports sending fixed-length messages to any process equipped to receive messages (processes must block on a receive() in almost all cases for this condition to be true). Using fixed-length messages simplifies the messaging mechanism, since the kernel need not allocate arbitrary memory to deliver messages. As a further simplification, a receiver may process only one message at a time (queue size one). If a process attempts to send a message to another process which is already handling a message, the sender will block. With the fixed-length messages and queues, the messaging mechanism implements the simplest form of reliable message delivery; however, it does so at the cost of decreased messaging throughput and server utilization. Minix 3 works around the data-size limitation by sending pointers to larger data structures in the message and having the receiver use the kernel to copy the memory to its address space.

Processes that may receive messages (almost all) are identified by message endpoint numbers that the kernel assigns to each process. These are guaranteed to be unique — in Minix 3, the kernel simply adds a constant to the process ID to generate this number, except with the system servers, where these endpoints are fixed. The endpoint number is required in all message sending API functions.

There are three types of message sending semantics in Minix 3, listed here by the API function call used to send a message with the specified semantics. There is only one way for a process to receive a message: by calling the receive() API function and blocking until message receipt (the so-called stop-and-wait semantics). Receiving processes may specify which processes are allowed to send messages to them by a parameter in the receive() call.

• send()

Most messages in Minix 3 are sent using the send() API call, as it minimizes messaging overhead while still allowing for the delivery of data in the message. The send() API call is synchronous. This means that the sender will block until the kernel copies the message data from the sender’s address space to the receiver’s address space. If the receiver is not currently blocked on a receive() call, the sender will block until the receiver completes its processing and calls receive() again.

• sendrec()

When processing requires that a remote resource respond to a request before it can complete, sendrec() is used. The semantics of sendrec() are the same as traditional RPC — the process blocks while a remote resource processes a request (which looks like a regular function call in the original process) until the results of the operation are known to the original process. Using sendrec() incurs the most messaging overhead of all the sending methods; however, there are some circumstances where these semantics are required. Again, most Minix 3 communications are designed to use send() to avoid this overhead.

• notify()

The only truly asynchronous messaging mechanism in Minix 3, notify() is used rarely, since it carries no payload. In terms of implementation, a notify() to a process merely sets a bit (depending upon the type of notification) in a bitmap contained in the process metadata. Including notify() in the messaging API is necessary for passing asynchronous events from the kernel to userspace (hardware interrupts and signals).


2.2.2 Minix 3 Image

As mentioned in Section 2.1, the microkernel design necessarily creates a hierarchy of servers

offering system services. There are certain servers without which Minix 3 would not function.

These services form the “system servers,” a set of trusted (in that they have access to certain

hardware resources) servers. Since the system servers are necessary for proper functioning,

they are compiled into the image that is loaded by the boot monitor, which contains the

kernel. (Note that while this deviates from the ideal of having a “bootstrap server” load

required services, the microkernel architecture is maintained since the system servers may be

replaced at compile-time.)

Figure 2 shows the typical output of a Minix 3 image, describing the layout of the microkernel and system servers in the boot image (‘kernel’ is the microkernel and the remaining components are the system servers).

  text    data     bss    size
 29968    5516   54036   89520  ../kernel/kernel
 27376    7612   83832  118820  ../servers/pm/pm
 48672   11792  105148  165612  ../servers/vfs/vfs
 21488    7448   47504   76440  ../servers/rs/rs
 32560    8012   16180   56752  ../servers/ds/ds
 30368    7748  111516  149632  ../drivers/tty/tty
 11120  535368   16908  563396  ../drivers/memory/memory
  9792    2268   68936   80996  ../drivers/log/log
 31792    5832 4990632 5028256  ../servers/mfs/mfs
  7008    2440    1356   10804  ../servers/init/init
------  ------ ------- -------
250144  594036 5496048 6340228  total

Figure 2: Minix 3 Boot Image Layout

The Minix 3 quirk of adding the system servers to the boot image becomes important

when deciding what will be paged and when paging will be activated. This will be described

more fully in Section 3.2.


2.3 Memory Management

2.3.1 Allocation

Minix 3 abstracts both the processor and memory using the same construct: the process. Most modern operating systems use the process to abstract the CPU, for scheduling purposes, but use some other, finer-grained construct for abstracting memory. The reason for this is that while whole-process memory allocation (where a process is given all of the memory needed to run during execution) simplifies the allocation process (one need only find a large enough chunk of contiguous physical memory), it complicates the efficient use of memory. Whole-process memory allocation necessarily leads to internal fragmentation of memory, since processes rarely use their entire memory allocation, and will also eventually lead to external fragmentation of free memory into “holes” that are difficult to allocate (an instance of the NP-hard bin packing problem, in its online form).

Minix 3 does attempt to prevent the memory waste of duplicated read-only code sections by loading the text section of a binary into memory only once per program, sharing that memory region amongst the multiple program copies. However, it does suffer from the aforementioned problems of internal and external fragmentation.

2.3.2 Executable Layout

To simplify loading, and due to the lack of VM support, Minix 3 executables must statically declare their required memory resources at compile time (in the Minix-native a.out header). This constraint imposes a significant limitation upon dynamic memory allocation and runtime stack size, as they must compete for statically allocated memory. Miscalculating resources in this design therefore results in either a process being killed due to memory exhaustion or significant internal fragmentation (memory allocated that is rarely used). This design also makes porting existing programs to Minix 3 difficult, since most modern applications assume a nearly limitless supply of memory (they presuppose the presence of VM).


2.3.3 Swapping

Minix 3 has limited support for swapping out processes to disk in order to alleviate memory

pressure. This support is limited by the coarse allocation of memory in process-sized chunks.

In order to swap any part of a process, the entire process must be swapped. This approach

presents a very narrow window in which a system under significant memory pressure may

remain operational before thrashing, since the disk I/O required to read/write an entire

process would consume the majority of context-switch time.


3 MinixVM Design

The fundamental difference between Minix 3 and MinixVM is in the definition of the memory object. In Minix 3, memory objects are allocated to processes, in the size required by the process. In MinixVM, memory is allocated in context-independent, but process-related, discrete pages. This further abstraction of memory demands further complexity in memory management, but also provides a disproportionate amount of flexibility in this management.

This section will discuss the design of MinixVM, presenting the pager server and the

operations that must be implemented to support MinixVM.

3.1 Actors

In a VM implementation, there are typically five actors: the hardware, page fault handler,

page allocator, process manager, and file system interface. The page fault handler is the

routine called by the hardware when it detects a page fault; the page allocator is the logic

that determines which page to allocate to the faulting process; the process manager coalesces

the memory management information with the process; and the file system interface is used

to read data in from disk, as in demand paging executables.

Most of these roles can be supported by modifying pre-existing Minix 3 components. The

page fault handler can replace the default action in the kernel; the process manager (PM)

can be modified to support routing page fault-related messages, and VFS can be modified to

read executables on page fault. However, to fully support this design, a new system server is

required: pager.

3.2 Pager Server

The pager server has simple and clearly-defined responsibilities: it manages the address

spaces for all paged processes and determines whether or not a page fault represents a valid

memory access for a given process. Note that pager must assume the role of memory manager

since it must manipulate the physical address space directly.


3.2.1 Pager Server Data

In order to manage memory and handle page faults, pager requires a number of data structures.

• Process Metadata

There is certain information, such as the process ID and endpoint number, that

pager must know in order to be able to associate messages about processes with

its own data structures. This means that pager joins the set of kernel, PM, and

VFS, by containing its own data on running processes. This also means that pager

must be notified whenever a process is created (see Section 3.3.2).

• Hardware Page Tables

The hardware page tables are the data structures that are read directly by the MMU hardware on each memory access when paged mode is activated (see Section 4.1 for the architectural details). The inclusion of the hardware page tables in a userspace program is necessary since the kernel cannot do dynamic memory allocation, and is a good example of the awkwardness imposed upon the design by the microkernel architecture. Fortunately, the kernel can read from all physical memory – it need only be told where the base address of the hardware page table is located (see Section 3.3.2 for how this is done).

• Supplemental Page Tables

The hardware page tables describe only the pages allocated to the respective

process that are currently loaded into memory. There are cases where pages may

be allocated but not in memory, such as when a page is swapped out to disk.

Pager must maintain this supplemental information in order to accurately manage

memory and determine whether or not a memory access is valid.


3.2.2 Pager Server Build

As mentioned in Section 2.2.2, the Minix 3 microkernel is combined with the critical system

servers into a single binary that is loaded by the boot monitor on boot. Figure 3 shows the

layout of the boot image with the pager server included.

text data bss size

30656 5732 54456 90844 ../kernel/kernel

27744 8076 81904 117724 ../servers/pm/pm

49520 12212 107548 169280 ../servers/vfs/vfs

21488 7448 47552 76488 ../servers/rs/rs

32560 8012 16180 56752 ../servers/ds/ds

30368 7748 111516 149632 ../drivers/tty/tty

11120 535368 16908 563396 ../drivers/memory/memory

9792 2268 68936 80996 ../drivers/log/log

31792 5832 4990632 5028256 ../servers/mfs/mfs

6176 2176 118624 126976 ../servers/pager/pager

7008 2440 1356 10804 ../servers/init/init

------ ------ ------ -------

258224 597312 5615612 6471148 total

Figure 3: Minix 3 Boot Image Layout, with Pager

It does not matter where pager is included in this binary, since system servers are not

paged (they may run before the pager runs – see Section 3.2.3). In an ideal implementation,

as in most modern operating systems, a small amount of bootstrap code would run before

any of the system servers are run, enabling VM and loading them into memory in a way that

pager expects. Minix 3 does not currently use this approach, as it does not need to (since

memory slots are merely “occupied” by running programs).

It was the lack of support for laying out processes in the system image in a VM-friendly

way that forced the creation of a separate pager server, rather than an extension to PM. As a

separate server, it may be loaded into memory by PM (more on this in Section 6.3) in such

a way that it could be paged. This would be necessary to support the potentially large data

structures required for managing the separate virtual address spaces for each process.


3.2.3 Pager Server Limitations

The pager server has several limitations imposed upon it, mostly due to the underlying design

of Minix 3.

• Certain processes cannot be paged

Pager supports paged operation of all processes except the system servers.

This is a consequence of how system servers are loaded (as a bulk image, by the

boot monitor) and the fact that pager does not run before the other system servers

do (see Section 6.2 for a general solution to this problem).

• Limited heap space

As mentioned in Section 3.2.2, since pager is a system server, it has the same limitation: its heap space and stack space come from a single statically allocated, compile-time-sized region.

3.3 MinixVM Operations

This section describes the control flow of the major operations that MinixVM must support:

pager initialization, exec notifications, and page fault handling. Since a microkernel is a

collection of loosely coupled functional components, these interactions will be presented as

a sequence of messages. In some cases these are function calls, but in most cases they are Minix 3 messages; the exceptions will be noted.

3.3.1 Pager Initialization

Pager initialization is both the most infrequent operation required by MinixVM (once at boot)

and the operation requiring the most change to Minix 3 (see Section 4.2.2 for implementation

details).

More detail on the messages can be found in Section 4.3.


Figure 4: Pager Initialization Processing

1. At power on, the boot monitor takes control of the hardware, loads the boot image, and

vectors control into the image (kernel).

2. The kernel performs some hardware initialization, and then transfers control to PM,

which resides in the boot image.

3. PM begins initializing itself. As a part of this initialization, PM gets the map of physical

memory from the boot monitor (since the boot monitor loaded the boot image, it knows

which parts of memory are currently in use).

Note that this step cannot be moved to Step 7 — PM’s initialization must complete

before pager’s initialization begins. This is because as a part of PM’s initializa-

tion, PM will redefine the segment descriptors for the system servers, effectively


changing the memory layout in such a way that it will not agree with the map

contained by the boot monitor. Pager cannot simply query the boot monitor when

it initializes.

4. The boot monitor responds with the physical memory map.

5. PM returns control to the kernel, since its initialization is complete.

6. Some time later (after allowing other system servers to initialize themselves), the kernel

transfers control to pager’s initialization code.

7. As a part of pager’s initialization, it will read the updated memory map from PM (so

that pager can initialize its memory management). This is done by sending PM a

PM PAGER INIT message.

8. PM responds to pager with the memory map.

9. Pager returns control back to the kernel, now that it is done initializing. If no more

system servers remain to be initialized, the kernel starts init.

3.3.2 Exec Notification

The following figure shows the message sequence required to implement the exec notification

scheme in MinixVM. The exec notification is required so that pager knows to set up its process

image once a new process is created. Note that the two other actors who must update their

process images, the kernel and VFS, are notified with the SYS EXEC and PM EXEC messages,

respectively.

Also note that a corresponding exit notification is not necessary. PM will send the mes-

sages to free the allocated memory once it receives the exit notification, and the old page

data will simply be overwritten once pager receives another exec notification for the same

process ID (this assumes PM is properly managing processes).

More detail on the messages can be found in Section 4.3.


Figure 5: Exec Notification Processing

1. The user initiates an exec() by sending a message to PM (through a library-wrapped

exec() call). This is a blocking call since the requesting process cannot continue until

the exec is complete.

2. PM sends a PAGER EXEC message to pager, to notify pager of the new process creation.

This is a blocking call because PM needs the page directory base address from pager

before continuing.

3. Pager responds to PM with the page directory base register of the new process.

The response from the exec notification is also a convenient place to pass the page directory base address (see Section 4.1) to the kernel, since PM must already communicate with the kernel to finish the exec.

4. PM sends a SYS EXEC message to the kernel (through the sys exec() system call) to

complete the exec. This call must be blocking since Step 1 was a blocking call (PM must

synchronously respond to the requesting process).

5. The kernel responds with the exec status.


6. PM passes the exec status to the requesting process.

Note that only Steps 2 and 3 are new in MinixVM (the remainder were included either

for completeness or because they were modified).

3.3.3 Page Fault Handling

The following figure shows the message sequence required to handle a page fault in MinixVM

(only the go-path is shown). Page fault handling is at the core of the MinixVM implementation

and is the most complicated new operation that it supports.

More detail on the messages can be found in Section 4.3.

Figure 6: Page Fault Processing

1. The hardware senses a page fault exception and vectors control into the location speci-

fied by the kernel at boot (the page fault handler).


2. Kernel sends non-blocking notification (HARD INT message) to PM.

This is implemented by sending PM a HARD INT (hardware interrupt) message,

since it is the only available asynchronous messaging mechanism the kernel may

use to contact PM.

3. PM processes the page fault notification from Step 2 and sends a SYS PAGEFAULT mes-

sage to the kernel to get the page fault details. PM blocks until it receives a response.

4. Kernel replies to PM with the page fault details.

5. PM sends a PAGER PGFCHECK message to pager, to check the validity of the page fault.

PM will block until pager receives the message (it will not wait for pager to fully process

the message – see Section 2.2.1).

6. Pager sends PM a PM PGFCHECK REPLY message with the results of the validity check. If

the result is negative, PM will report a bad page fault to the kernel.

7. PM sends a PM PGFAULT message to VFS, requesting that it read the page from disk into

memory. PM will block until VFS receives the message.

8. VFS sends a PM PGFAULT REPLY message to PM, including the response status.

9. PM sends a PAGER PGFDONE message to pager, to let it know the status of the page fault

processing.

This is required since pager would have no way of knowing whether or not Step 7 failed. If there is a failure after its last communication with PM, page tables and the like could become corrupted (e.g. listing an entry that does not exist).

10. PM sends a SYS PAGEFAULT message to the kernel, with the status of the page fault

processing.

Note that in this design, the send() semantics were used as often as possible (Steps 5,

6, 7, 8, 9), since this maximizes the throughput without overly complicating the message

passing sequence (see Section 2.2.1 for the details on send() semantics).


There is one case where the asynchronous notification mechanism must be used: Step 2, where the kernel first communicates the page fault to PM. Since this notification is

sent from the page fault handler, it is executed in interrupt context. This means that no

progress will be made on the CPU until the handler terminates. Using notify() is the fastest

mechanism Minix 3 offers to the kernel to communicate with PM.


4 MinixVM Implementation

This section describes the implementation of MinixVM, following the design listed in Section 3.

4.1 Architectural Support

Virtual memory necessarily requires hardware support to be feasible from a performance perspective, given that the address translation must occur on every

memory access. Minix 3 only supports the Intel IA-32 architecture — this section provides a

brief overview of what the operating system must do in order to use the hardware memory

management support in the IA-32 architecture.

The information from this section was taken largely from Intel’s comprehensive system

programming guides ([4, 5]).

4.1.1 Memory Modes

When an IA-32 machine first boots, it is in so-called “real-mode,” where memory addresses

correspond directly to physical addresses, in an address space limited to 20 bits (formed from 16-bit segment and offset values). To take

advantage of the full 32-bit address space, and to provide protection of one chunk of code

from another, IA-32 provides a “protected-mode.” Protected-mode is entered soon after boot

in Minix 3. Once in protected-mode all code segments must be given segment descriptors,

which are later used for address translation and privilege checks, in hardware.

4.1.2 Hardware Data Structures

The following special purpose registers are important to the MinixVM implementation:

• CR0 – Paging is enabled when bit 31 of this register is set.

• CR2 – Upon receiving a page fault exception, the hardware will fill this register with the

faulting linear address.


• CR3 – This register contains the address for the base of the hardware page tables and

must be changed on each process context switch.

4.1.3 Address Translation

The first step in address translation in IA-32 protected mode is to take a segment-relative

address given to the hardware for memory access and convert it into a linear address, which

may be used to access memory. To perform this translation, the segmentation hardware uses

the segment descriptors to translate addresses as summarized in Figure 7.

Figure 7: Segmented Address Translation

One important thing to note from this architectural addressing requirement is that the

segment descriptor gives the base for the linear address. Because of this, binaries targeted

for segmented architectures typically use absolute addresses starting at 0 (since the

base for the segment will be added during any memory access).

IA-32 also provides paging hardware, which may be used to implement paging. Where

segmentation is an absolute requirement in protected-mode, paging is not. However, paging

hardware is necessary to implement VM efficiently (since the translation occurs on every

memory access). The IA-32 architecture translates linear addresses to physical addresses as

shown in Figure 8.


Figure 8: Paged Address Translation

The important thing to note from the paging support is that it is always used in conjunc-

tion with segmentation (since segmentation is activated when protected-mode is entered).

This can be seen from Figure 8, where the input to the page translation hardware is the linear

address produced by the segmentation hardware.

4.1.4 Page Fault Handling

A page fault exception is raised by the paging hardware when it is given a linear address

for which there is no translation in the currently loaded page tables. In processing the

exception, the hardware will load the CR2 register with the faulting address and then vector

into the page fault handler provided by the operating system. The page fault exception

could be caused by either a legitimate access that does not have a translated address or an

illegitimate access (no valid translation, write to a read-only page, or an unprivileged read).

The latter case can only be positively determined by the software that manages processes.

The paging hardware aids this determination by pushing a bitfield onto the interrupt frame

containing flags for: present/access rights violation, read/write, kernel/user access.


4.2 Modification Summary

This section summarizes the modification necessary to existing Minix 3 components in order

to implement MinixVM. The new server messages required are detailed in Section 4.3.

4.2.1 Kernel Modifications

• Register Access

As mentioned in Section 4.1, on a page fault, the Intel architecture loads the CR2

register with the linear address that raised the fault. An assembly routine read_cr2

was added so that this value could be read in the page fault handler.

• Page Fault Exception Handler

The default action upon receiving a page fault exception in Minix 3 is to send a

SIGSEGV to the offending processes. This is consistent with the fact that a page

fault exception should never be thrown in Minix 3 (since memory is segmented

a wild memory access would result in a general protection fault). In MinixVM,

the page fault handler records the page fault information (i.e. reads the faulting

address out of the CR2 register) and notifies PM of the page fault.

• Virtual Memory Mapping

A few new functions were added to easily support the mapping of virtual addresses

to physical addresses.

• Exec System Call

The exec system call SYS EXEC was modified to record the page directory base ad-

dress into the kernel’s process data structure (a new entry added for this purpose).

The page directory base address must be in the kernel, so that the CR3 register can

be set to its value during a context switch.


• Process Switch

As mentioned in the last item, the kernel must load the CR3 register with the base

address of the hardware page tables. The process switch code was modified to

make this change.

4.2.2 PM Modifications

Of all the Minix 3 components, PM was modified most heavily. This is because PM is designed

to be the center of most MinixVM processing, in order to minimize the impact of the MinixVM

implementation on Minix 3 processing (most of the memory management operations are

routed through PM in Minix 3).

• Memory Management

The largest change to PM was in removing memory management from the PM

core. Memory management is only represented by allocation in Minix 3, so only

allocation needed to be moved to pager. The reasoning behind this is simple:

pager requires complete control of the physical address space. The three main

memory allocation functions (allocate, free, and copy free memory holes) were

converted from function calls in PM to messages to pager: PAGER MEM ALLOC,

PAGER MEM FREE, and PAGER MEM HOLES COPY, respectively. This way, no other ap-

plications need be modified — their memory requests are simply routed through

pager. In addition to message pass-through, the initialization of PM was changed

significantly. Rather than reading the memory layout at boot and setting up the

memory allocation data structures, it merely records this information for pager to

process.

• Pager Initialization

As mentioned in the previous point, memory management was moved to pager

and memory initialization was deferred to pager as well. However, pager cannot


initialize the memory map until it has the bootstrap information, namely, where

the kernel and system servers are loaded into memory (to avoid allocating that

memory to other processes). PM is the first process to run at boot, and as a part

of its initialization, it records the memory map (by querying the boot monitor)

for pager. When pager first runs, it sends a PM PAGER INIT message to PM. PM

responds with a pointer to this recorded information, which pager copies into its

own address space, initializing memory much in the same way that PM does in

Minix 3.

• Page Fault Processing

Despite the new pager server, in MinixVM the center of control remains in PM

(this is clear from Figure 6). PM was modified to support this processing:

∗ Page Fault Exception – As noted in Section 4.1.4, a page fault exception is

routed through the kernel to PM. The handler for the HARD INT notification

was added to PM to begin the page fault processing. HARD INT, however awkward in this scenario, was chosen since PM in Minix 3 does not receive any hardware interrupts, is unlikely ever to need to, and since it was the one remaining notification mechanism available to the kernel.

∗ Page Fault Validation – In the page fault handling, PM is required to validate a

memory access with pager. Since this is implemented using send() semantics,

support for receiving pager’s response was added to PM.

∗ Page Reading – Once the access is determined valid, PM must request that VFS

read the specified page from disk. Since this call is also implemented using

send() semantics, support was added to process VFS’s response.

• Exec Processing

The processing within PM for executing a new process was slightly modified so

that PM would not load the executable from disk (leaving the executable to be


demand-paged later), to add the exec notification message to pager (Section 3.3.2),

and to support the new SYS EXEC signature.

4.2.3 VFS Modifications

This section describes the VFS modifications, all of which were made to support VFS’s role in

page fault handling (Section 3.3.3).

• Executable Data Addition

In order to support the MinixVM processing (namely reading pages from disk – see

the next point), the VFS image of a process was modified to include a few more

fields relating to executables.

∗ fp exec ino – The inode of the executable. Without this, the VFS-portion of the

page fault handling would not know where the executable resides on the file

system (or would need to perform an expensive lookup on the file name every

page fault).

∗ fp exec fs e – The endpoint of the file system driver which controls the file

system where the executable resides. To uniquely identify a file in Minix 3,

the inode and the file system driver endpoint are needed, since all operations

require knowing which driver to communicate with and since the inode is

only unique on a single file system.

∗ fp exec hdrlen – The length of the executable header. This is needed to

properly calculate the offset into the executable file when reading portions of

it into memory (see the following points).

∗ fp exec textlen – The length of the text section for the executable.

∗ fp exec datalen – The length of the initialized data section for the exe-

cutable.

∗ fp exec bsslen – The length of the uninitialized data section for the exe-

cutable.


Adding these fields required modification of the exec-handling code, simply to

make sure that the values were recorded the first time VFS reads the executable

header.

• Page Reading

Since VFS must read a page into memory from disk to support page fault handling

(Section 3.3.3), code was added to VFS in order to do this. This code will request

that the relevant file system driver read either a full page or up to the end of the binary, whichever is smaller, into the page-aligned address it is given. The

only twist to this scheme is in how the data section is handled. In the case where

the page encompasses all initialized data, the entire page is read from disk. If the

page encompasses all BSS data, the page is zeroed. If, however, the page contains

a combination of initialized and uninitialized data, the initialized data is read in

and the remainder of the page is zeroed. Making this initialized/uninitialized determination requires the new fields added to the VFS

process metadata. VFS assumes that the memory write is legal, which it should be

after the verification from pager.

4.3 New Server Messages

This section describes the new server messages and message parameters required by the MinixVM implementation.

4.3.1 New Kernel Messages

Message Parameters Description

SYS PAGEFAULT PGF TYPE The type of the message (one of PGF GET, PGF DONE)

PGF PROCNR The number of the faulting process

PGF ENDPT The endpoint number of the faulting process


PGF PADDR The physical address associated with the virtual address

PGF VADDR The virtual address (the faulting address)

The SYS PAGEFAULT system call message is used by PM both to get the page fault information once the kernel has notified it of a page fault, and to report back to the kernel about the validity of the page fault.

Table 1: New Kernel Messages

Note: new messages to the kernel are represented by system calls.

4.3.2 New PM Messages

Message Parameters Description

PM PAGER INIT NONE N/A

The PM PAGER INIT message is sent by pager during its ini-

tialization, so that it may get PM’s image of memory usage

(since pager is the memory allocator). PM replies to pager with

PAGER INIT REPLY.

PM PGFAULT REPLY PM PGFAULT ENDPT The endpoint number for the

faulting process

PM PGFAULT VADDR The faulting (virtual) address

PM PGFAULT BASE The base address of the segment

where the page fault occurred

PM PGFAULT SEG The segment within which the

page fault occurred (text, data,

or stack)


The PM PGFAULT REPLY message is sent by VFS once it is done

processing the page fault (started when PM sends PM PGFAULT

to VFS).

PM PGFCHECK REPLY PM PGFCHECK PROCNR The number of the process that

raised the page fault

PM PGFCHECK ENDPT The endpoint number of the pro-

cess that raised the page fault

PM PGFCHECK VADDR The (virtual) faulting address

PM PGFCHECK ANS The result of the validity check

The PM PGFCHECK REPLY message is sent by pager once it has determined the validity of the page fault address access. This message is sent in response to PM sending PAGER PGFCHECK to pager.

PM PAGER EXEC REPLY PM PAGER EXEC PDBR The address for the base of the

hardware page tables

The PM PAGER EXEC REPLY message is sent by pager once it has

completed updating its process tables, in response to PM send-

ing pager PAGER EXEC.

PM PAGER ALLOC REPLY PM PAGER ALLOC BASE The base address for the page or

set of pages allocated

The PM PAGER ALLOC REPLY message is sent by pager once

it has completed allocating memory in response to a

PAGER MEM ALLOC message from PM. This functionality is in PM

in Minix 3 and in pager in MinixVM, since pager must do the

memory management.


PM PAGER HOLES COPY REPLY PM PAGER HOLES COPY STATUS The status code returned by the

copy operation

PM PAGER HOLES COPY BYTES The number of bytes the hole list

occupies

PM PAGER HOLES COPY HI The high watermark for alloca-

tion

The PM PAGER HOLES COPY REPLY message is sent by pager

once it has completed copying the hole list to the specified lo-

cation (this is a throwback to the way Minix 3 manages mem-

ory and would not be present if VM were completely imple-

mented). This functionality is in PM in Minix 3 and in pager in

MinixVM, since pager must do the memory management.

Table 2: New PM Messages

4.3.3 New VFS Messages

Message Parameters Description

PM PGFAULT PM PGFAULT ENDPT The endpoint number for the faulting process

PM PGFAULT VADDR The faulting (virtual) address


PM PGFAULT BASE The base address of the segment where the page fault

occurred

PM PGFAULT SEG The segment within which the page fault occurred (text,

data, or stack)

The PM PGFAULT message is sent by PM when it needs VFS to process a page fault (read in a page from disk at the faulting address).

Table 3: New VFS Messages

4.3.4 Pager Messages

Message Parameters Description

PAGER INIT REPLY PAGER INIT CHUNKS A pointer to the memory chunk data struc-

ture in PM’s address space

The PAGER INIT REPLY message is sent by PM in response to pager

sending a PM PAGER INIT message. Pager copies the data structure

from PM’s address space.

PAGER EXEC PAGER EXEC PID The process ID of the newly exec’ed pro-

cess

PAGER EXEC ENDPT The endpoint number of the newly

exec’ed process

PAGER EXEC MEM A pointer to the process memory map (in

PM’s address space)

The PAGER EXEC message is sent by PM to pager on an exec(), so

that pager may populate its process image.

PAGER PGFCHECK PAGER PGFCHECK PROCNR The number of the process to check


PAGER PGFCHECK ENDPT The endpoint of the process

PAGER PGFCHECK VADDR The (virtual) faulting address to check

PAGER PGFCHECK ANS The result of the validity check

The PAGER PGFCHECK message is sent by PM to ask pager whether a faulting address is valid.

PAGER PGFDONE PAGER PGFDONE STAT The page fault processing status

The PAGER PGFDONE message is sent by PM to notify pager of the status of the entire page fault process (so that pager may keep its data structures consistent).

PAGER MEM ALLOC PAGER MEM ALLOC CLICKS The number of clicks (pages) to allocate

The PAGER MEM ALLOC message is sent by PM to ask pager to allocate

the specified amount of memory.

PAGER MEM FREE PAGER MEM BASE The base address of the pages to free

PAGER MEM CLICKS The number of clicks (pages) to free

The PAGER MEM FREE message is sent by PM to ask pager to free the

specified amount of memory at the specified location.

PAGER MEM HOLES COPY PAGER MEM HOLE A pointer to a block of memory (in PM’s

address space) into which the hole list can

be copied

The PAGER MEM HOLES COPY message is sent by PM to ask pager to

copy the memory hole list to the specified location. (This is a throw-

back to the way Minix 3 manages memory and would not be present

if VM were completely implemented.)

Table 4: Pager Messages


5 Performance Analysis

This section presents a brief, two-part performance analysis of the MinixVM prototype. The

first part analyzes the performance impact of paging in several common scenarios. The sec-

ond part examines the performance impact of paging as a function of the amount of memory

paged (i.e. number of page faults handled) for a fabricated sequential-scan test.

For all of the tests, the POSIX-specified time utility (provided by Minix 3) was used to

measure the amount of wall-clock time that elapsed from the beginning to the end of each

test. Each result is the average of three different wall-clock measurements, and the measure-

ment variance is small enough to be elided from the analysis. Also, for each test, only the

program tested is paged. The MinixVM prototype is capable of co-existing with the segmented

memory management scheme is such a way that one can specify a list of programs to be

paged — all others use the segmented memory management scheme. For example, during

the bzip2 performance test, only the bzip2 binary will be paged after being exec’ed.

5.1 General Program Runtime Tests

The general program runtime tests were designed to reflect a small set of common workloads,

including compression, decompression, archiving, and compiling.

For all of the general program runtime tests excluding the kbuild test, the subject file was

a 180 megabyte tarball containing 4242 files (the complete MinixVM source and supporting

project files). For these tests, the timed operation is the default action of the program listed

on the subject file (e.g. the gunzip test decompresses the subject tarball that was previously

compressed with gzip). The kbuild test measures the amount of time required to build

the system servers and bootable microkernel image from a clean MinixVM source tree. The

program actually paged in this test was cc, the C compiler from the Amsterdam Compiler Kit

(the compiler suite used to compile Minix 3).

Figure 9 shows the results from these tests:

[Bar chart: run time in seconds for the gzip, gunzip, bzip2, bunzip2, tar, untar, and kbuild tests, comparing not-paged and paged runs.]

Figure 9: General Program Runtime

As would be expected, there is a measurable overhead with the paged versions of the tests.

On average, the paged versions are 1.08 times slower than their non-paged counterparts.

Interesting to note is the significant difference in the performance for the tar test (1.33 times slower in the paged version). This is likely due to the memory access pattern of tar: one-use sequential. This effect will become more apparent in the binary scan test described in the next section.

5.2 Sequential Scan Tests

While the tests of common workloads given in the previous section are informative, they fail

to describe the performance overhead as a function of program memory requirements. To

test this relationship, a set of sequential scan tests were created.

There are two types of the sequential scan test. Both tests follow the same algorithm:


statically declare an array of sizes ranging from 2 to 64 in power-of-two increments, initialize

the array, reduce it to a scalar, and then print the result. The difference between the two

tests is that the first (the binary test) is written in C, compiled to an executable, and then the

entire executable is paged. The second test (the perl test), however, is written in perl and

interpreted (the perl binary is paged in this test).

Figure 10 summarizes the results from these two tests:

[Two line plots of scan size (in pages) versus run time (in seconds), each comparing not-paged and paged runs: (a) Binary Scan Size vs. Run Time, and (b) Perl Scan Size vs. Run Time.]

Figure 10: Scan Performance

From Figure 10, it is clear that there is a far more significant performance impact for the binary test than for the perl test. More specifically, on average, the paged version of the binary test is 4.7 times slower than the non-paged version (ranging from 3 to 6 times slower), whereas the paged version of the perl test is only 1.07 times slower. This is most likely a consequence of the second test being interpreted: the binary must eventually load all of its statically allocated memory into unique pages, whereas the perl interpreter can reuse its memory pool to manipulate the array, resulting in fewer page faults. The slight difference in performance between the paged and non-paged versions of the perl test is likely due to the overhead of the page faults that the paged version could not avoid.

Also of note is the observation that the runtime for the paged version of the binary test

increases much faster than its non-paged counterpart. This is to be expected, since the paged


version must pay the cost of several expensive mode switches (in the form of the page fault

handling) with each new page, whereas the non-paged version need only pay this cost once

(during initial load).

5.3 Summary

It is not surprising that all paged versions of the tests in this section suffered a performance penalty due to the paging overhead. This overhead stems less from the extra processing required than from the frequent mode switches (from user to kernel mode and vice versa) required by the page fault handling algorithm (described in Section 3.3.3). This is most apparent in the binary test from Figure 10, where the runtime of the paged version grows in proportion to

the number of page faults handled.

These data demonstrate that, because the mode switches are required by the microker-

nel architecture, a VM implementation in a microkernel must necessarily be slower than a

comparable VM implementation in a monolithic kernel.


6 Future Work

Due to time constraints, a full MinixVM implementation was not completed. This section

enumerates the largest outstanding issues as well as extensions that could be added on to

MinixVM.

6.1 Removing Segmentation

The MinixVM implementation as presented in this report is not a complete VM implementa-

tion. In order to fully support VM, the address space of a process must be presented as the

maximum possible (e.g. 4GB on a machine with 32-bit addressing), not limited by segment

boundaries.

There are a number of issues that prevented the easy removal of segmentation:

• Binary Layout

Binaries in Minix 3 are generated starting with address offset zero in each section.

The Minix 3 loader depends upon the architectural segment descriptor to apply

segment offsets when creating the linear address. Ordinarily, an operating system

has an established loading convention, so that binaries have their sections tagged

with addresses that the loader expects in VM. In the MinixVM implementation presented in this report, this is impossible, since segments encompass the entire

program and these physical addresses are not known until runtime. (In a VM

environment, it does not matter that two programs “load themselves at the same

location” since they are loading at the same virtual addresses.)

In order to support VM, either the binaries would need to be regenerated to have

the agreed upon virtual addresses, or the binaries would need to be dynamically

translated (addresses rewritten) at runtime. The former would probably be the best approach, since it incurs only a one-time compilation penalty.


• Memory Copy Operations

Minix 3 makes heavy use of memory copy operations, since its message system is

only capable of sending fixed-length messages. When a message in Minix 3 has

a payload that exceeds the fixed length, it instead sends a pointer to the data in

the sender’s address space. The receiver uses a system call to have the kernel copy

the data into the receiver’s address space. This is straightforward to implement

in segmented memory management, since a byte-by-byte copy is available. This

process is not straightforward in a VM environment. First, the address would need

to be translated to the physical pages (page by page). In MinixVM this must either

occur in pager or the kernel must read pager memory (since that is where the

page tables are located). If the memory copy refers to pages that are swapped out,

the translation must occur in pager (since the hardware page tables are just the

subset of pages allocated to a process that are currently in memory). Therefore,

if swapping is enabled, MinixVM will have a reverse dependency between the

kernel and pager — the kernel would depend upon a userspace program in order to

perform a kernel operation. Without being able to do dynamic memory allocation

in the kernel, this problem is not solvable in the general case. This is yet another

example of where the “ideal” microkernel architecture must be violated to

provide even the most fundamental OS abstraction.
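The page-by-page translation that such a copy requires can be sketched with a self-contained toy: tiny pages, an array standing in for physical memory, and a fixed page table. The names vm_translate and vm_copy are illustrative only, not MinixVM's actual kernel or pager interfaces.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 64u   /* tiny page size, for the demo only */
#define NPAGES    4u

static uint8_t phys_mem[PAGE_SIZE * NPAGES];    /* stand-in physical memory */

/* toy page table: virtual page number -> physical frame number */
static unsigned page_table[NPAGES] = { 2u, 0u, 3u, 1u };

/* translate one virtual address to its physical location */
static uint8_t *vm_translate(uintptr_t vaddr)
{
    return &phys_mem[page_table[vaddr / PAGE_SIZE] * PAGE_SIZE
                     + vaddr % PAGE_SIZE];
}

/* copy n bytes between virtual ranges; no chunk may cross a page
   boundary, because contiguous virtual pages need not be contiguous
   in physical memory */
static void vm_copy(uintptr_t dst, uintptr_t src, size_t n)
{
    while (n > 0) {
        size_t chunk = PAGE_SIZE - (size_t)(src % PAGE_SIZE);
        size_t dleft = PAGE_SIZE - (size_t)(dst % PAGE_SIZE);
        if (dleft < chunk) chunk = dleft;
        if (n < chunk)     chunk = n;
        memcpy(vm_translate(dst), vm_translate(src), chunk);
        src += chunk;
        dst += chunk;
        n   -= chunk;
    }
}
```

The translation step is exactly where the reverse dependency described above arises: in MinixVM the page tables live in pager memory, so a kernel performing this copy must either consult pager or read pager's memory itself.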

6.2 System Server Reloading

Ideally, the system servers would be reloaded into virtual memory address spaces at boot, by some bootstrap code, so that they could be paged like any other process. At the very least, pager should be reloaded in order to avoid issues with allocating space for the page table data structures. In lieu of reloading it in a paged manner, it could simply have its stack section moved, to the same effect.

6.3 VM Extensions

There are a number of extensions that naturally fall out of a VM implementation with minimal effort. All of these extensions are only feasible in a MinixVM environment; the memory model of Minix 3 makes them impossible.

6.3.1 Page Swapping

Swapping refers to writing pages out from memory to disk in order to increase the apparent amount of available memory. The increase is artificial, but it is critical to maintaining the abstraction that processes may use more memory than is physically present.
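The swap-out/swap-in mechanics can be sketched minimally as follows, using an in-memory array as a stand-in for the swap device. The structures and names are illustrative assumptions; MinixVM itself does not implement swapping.

```c
#include <string.h>

#define PAGE_SIZE 64
#define NSLOTS    8

struct pte {
    unsigned char *frame;   /* NULL while the page is on "disk" */
    int present;            /* access faults when this is 0 */
    int slot;               /* backing-store slot when not present */
};

/* stand-in for a swap partition */
static unsigned char backing_store[NSLOTS][PAGE_SIZE];

/* evict: save the page contents and mark the entry not-present */
static void swap_out(struct pte *p, int slot)
{
    memcpy(backing_store[slot], p->frame, PAGE_SIZE);   /* page -> "disk" */
    p->present = 0;
    p->slot = slot;
    p->frame = NULL;        /* the frame may now back another page */
}

/* fault-time reload: restore contents into a (possibly different) frame */
static void swap_in(struct pte *p, unsigned char *frame)
{
    memcpy(frame, backing_store[p->slot], PAGE_SIZE);   /* "disk" -> page */
    p->frame = frame;
    p->present = 1;
}
```

The key property is that the page's contents survive the round trip even though the physical frame backing it changes, which is what preserves the abstraction for the process.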

6.3.2 Copy-On-Write Page Sharing

Copy-On-Write (COW) page sharing is a technique of memory sharing by which a fork()

does not necessarily result in the copying of all memory from the parent process to the child

(as it does by default in Minix 3). Instead, the parent's pages are mapped read-only into the page tables of both processes and shared until one of them writes to a shared page; the resulting write fault gives the writer its own private copy. This technique results in efficient memory utilization in the common case

where the child replaces its memory map with an exec() soon after the fork().
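The fault-time logic behind COW can be sketched as follows. The frame and pte structures and the function names are illustrative only, not MinixVM code; the reference count is an assumed bookkeeping mechanism for deciding when a copy is still needed.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 64

struct frame {
    unsigned char data[PAGE_SIZE];
    int refs;                   /* page tables referencing this frame */
};

struct pte {
    struct frame *f;
    int writable;
};

/* at fork(): share the frame and revoke write permission on both sides */
static void cow_share(struct pte *parent, struct pte *child)
{
    child->f = parent->f;
    child->f->refs++;
    parent->writable = 0;       /* any write now faults */
    child->writable  = 0;
}

/* on a write fault: copy the page only if it is still shared */
static void cow_write_fault(struct pte *p)
{
    if (p->f->refs > 1) {
        struct frame *copy = malloc(sizeof *copy);
        if (copy == NULL)
            abort();
        memcpy(copy->data, p->f->data, PAGE_SIZE);
        copy->refs = 1;
        p->f->refs--;
        p->f = copy;            /* faulting process gets a private copy */
    }
    p->writable = 1;            /* sole owner may now write */
}
```

Note that if the child exec()s before writing, the shared pages are simply released and the copy never happens, which is the common case the text describes.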

6.3.3 mmap()

Memory mapping through mmap() is one of the core functionalities of the POSIX specification (also known as the Single UNIX Specification) [8], yet it is missing from Minix 3, because memory mapping is extremely difficult to implement in a purely segmented memory management scheme. Memory mapping is central to many modern applications, so applications cannot be ported from other environments (e.g. Linux) without rewriting non-trivial portions of their code; in some cases the functionality of memory mapping cannot be replicated at all, making the port impossible.
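For reference, this is the kind of standard POSIX mmap() usage that ported applications typically depend on, shown here mapping a file for reading. This is generic POSIX code, not MinixVM functionality.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the first n bytes of a file and copy them out; returns 0 on success. */
int read_first_bytes(const char *path, char *out, size_t n)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || (size_t)st.st_size < n) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, n, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;
    memcpy(out, p, n);          /* file contents read through memory */
    munmap(p, n);
    return 0;
}
```

An application written this way has no read() path to fall back on, which is why porting such code to a system without mmap() means restructuring its I/O entirely.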


6.3.4 Dynamic Linking

Dynamic linking, the technique of loading common library code into memory as needed during execution, is used in modern operating systems to reduce both the memory footprints of running programs and the amount of on-disk library storage. To reduce memory usage, dynamic linking loads only the code that is necessary; as a further optimization, since most library code writes only to the stack (or to the calling program's heap), library code pages can be mapped into many running processes simultaneously. To reduce the on-disk profile of programs, dynamic linking removes the need for every binary to be effectively self-contained, duplicating shared functionality. Dynamic linking also cuts down on load time, since the required pages are often already loaded into memory.
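The run-time half of dynamic linking can be illustrated with the standard POSIX dlopen()/dlsym() interface. Again, this is generic POSIX usage rather than anything Minix 3 provides; resolving strlen through the main program's handle assumes a typical Unix environment where libc is part of the program's global symbol scope.

```c
#include <dlfcn.h>
#include <stddef.h>

typedef size_t (*strlen_fn)(const char *);

/* Resolve strlen from the already-loaded image at run time. */
size_t dynamic_strlen(const char *s)
{
    void *self = dlopen(NULL, RTLD_NOW);   /* handle for the main program */
    strlen_fn fn;
    size_t r;

    if (self == NULL)
        return 0;
    fn = (strlen_fn)dlsym(self, "strlen"); /* symbol lookup by name */
    r = (fn != NULL) ? fn(s) : 0;
    dlclose(self);
    return r;
}
```

The point of the example is that the call target is not known until run time; the loader finds the (possibly already resident) library pages and binds the symbol on demand.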


7 Conclusion

The design and implementation of MinixVM presented in this report demonstrates that it

is possible to implement VM in a microkernel that cannot do dynamic memory allocation

in the kernel. It also clearly demonstrates that the design is overly complicated in relation

to the conceptual representation of VM. This should not be surprising, since most of the

operating system primitives that developers have come to expect were originally designed

for monolithic operating systems.

As for the fate of the microkernel architecture, Tanenbaum famously stated in 1992 that

Linux, and by extension the monolithic approach to implementing operating systems, was

obsolete [9]. The current proliferation of Linux and other monolithic kernels, combined with the relative obscurity of Minix 3, perhaps provides a more realistic appraisal of the

monolithic kernel’s obsolescence.

While in terms of high-level design microkernels seem to provide significant advantages

over their monolithic counterparts, it is an inescapable fact that programming microkernels

is hard. Even noted kernel hacker and open source evangelist Richard Stallman admitted

this in his failure to create the GNU Hurd microkernel operating system [7]. All of the com-

plicated and in some cases intractable problems associated with building distributed systems

are wrapped up in a microkernel. This is in addition to the fact that developing an operating

system is already a sufficiently hard engineering problem.

These difficulties notwithstanding, it would seem as if the failure of the microkernel

architecture is built into its design. Developing a real-world complex system is an exercise in

constant compromise. The problem with the microkernel design is that most of its advantages

are based upon the rigid adherence to the boundaries imposed by the design. As soon as these

boundaries are violated, many of the advantages evaporate, and all that remains is a difficult-to-program hybrid operating system.

It is easy to argue that progress has been made in using the microkernel design to increase

fault tolerance and provide more trusted computing environments, but as demonstrated by

MinixVM, this comes at the cost of providing even the most basic operating system services.


8 References

[1] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler, “An Empirical Study of Operating System Errors”, In Proc. 18th ACM Symp. on Oper. Syst. Prin., pages 73–88, 2001.

[2] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum, “Construction of a Highly Dependable Operating System”, Proc. 6th European Dependable Comp. Conf., Oct. 2006.

[3] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum, “MINIX 3: A Highly Reliable, Self-Repairing Operating System”, ACM SIGOPS Operating Systems Review, vol. 40, nr. 3, pp. 80–89, July 2006.

[4] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1, November 2008.

[5] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming Guide, Part 2, November 2008.

[6] J. Liedtke, “On µ-Kernel Construction”, In Proc. 15th ACM Symp. on Oper. Syst. Prin., pages 237–250, Dec. 1995.

[7] J.T.S. Moore (writer, director, producer), “Revolution OS”, documentary film on the free software movement, available at: <http://www.revolution-os.com/>, 2001.

[8] The Single UNIX Specification, Version 3, ISO/IEC 9945:2003, IEEE Std 1003.1-2001.

[9] Andrew Tanenbaum, “LINUX is obsolete”, posted on comp.os.minix, archived at: <http://groups.google.com/group/comp.os.minix/browse_thread/thread/c25870d7a41696d2/f447530d082cd95d?tvc=2>, 1992.


9 Appendix: MinixVM Patch Summary

The following is a summary of the MinixVM patch (for Minix version 3.1.3a), created using

the diffstat utility.

etc/usr/rc | 5

include/minix/com.h | 72 +++++++

include/minix/const.h | 6

include/minix/syslib.h | 2

kernel/Makefile | 2

kernel/arch/i386/.depend | 2

kernel/arch/i386/Makefile | 1

kernel/arch/i386/exception.c | 47 ++++

kernel/arch/i386/klib386.s | 12 +

kernel/arch/i386/memory.c | 90 +++++++++

kernel/arch/i386/mpx386.s | 2

kernel/arch/i386/proto.h | 1

kernel/arch/i386/sconst.h | 1

kernel/config.h | 1

kernel/pagefault.h | 9

kernel/proc.h | 1

kernel/proto.h | 3

kernel/system.c | 12 +

kernel/system.h | 5

kernel/system/.depend | 37 +++

kernel/system/Makefile | 7

kernel/system/do_exec.c | 8

kernel/system/do_pagefault.c | 75 +++++++

kernel/system/do_vm.c | 11 +

kernel/table.c | 2

lib/syslib/sys_exec.c | 4

servers/Makefile | 3

servers/pager/.depend | 103 ++++++++++

servers/pager/Makefile | 34 +++

servers/pager/alloc.c | 256 ++++++++++++++++++++++++++

servers/pager/alloc.h | 12 +

servers/pager/entries.h | 56 +++++

servers/pager/exec.c | 76 +++++++

servers/pager/exec.h | 11 +

servers/pager/page.h | 12 +

servers/pager/pager.c | 126 +++++++++++++

servers/pager/pager.h | 25 ++

servers/pager/pagetable.c | 120 ++++++++++++

servers/pager/pproc.c | 3


servers/pager/pproc.h | 26 ++

servers/pager/version.c | 1

servers/pm/.depend | 81 +++++++-

servers/pm/Makefile | 7

servers/pm/alloc.c | 410 ++-----------------------------------------

servers/pm/exec.c | 49 ++++-

servers/pm/globalpage.c | 131 +++++++++++++

servers/pm/main.c | 59 +++++-

servers/pm/pagefault.c | 151 +++++++++++++++

servers/pm/proto.h | 16 -

servers/pm/queue.h | 215 ++++++++++++++++++++++

servers/pm/type.h | 1

servers/vfs/.depend | 80 ++++++++

servers/vfs/Makefile | 3

servers/vfs/exec.c | 18 +

servers/vfs/fproc.h | 8

servers/vfs/main.c | 13 +

servers/vfs/misc.c | 2

servers/vfs/page.c | 145 +++++++++++++++

servers/vfs/pagefault.c | 33 +++

servers/vfs/proto.h | 8

tools/Makefile | 1

tools/revision | 1

62 files changed, 2274 insertions(+), 440 deletions(-)

