OSE 2011– OS Singularity 1 Operating Systems Engineering Language/OS Co-Design: “Singularity” By Dan Tsafrir, 4/5/2011

OSE 2011– OS Singularity1

Operating Systems Engineering

Language/OS Co-Design:

“Singularity”

By Dan Tsafrir, 4/5/2011


Singularity

An experimental research OS by Microsoft Since 2003 Lots of high profile publications (last one in 2009) Open source since 2008, can be used by academia Lots of people working on it

Motivation: Official:

Rethink SW stack, as many decisions are from late 1960s & early 1970s (in all OSes they’ve stayed ~ the same)

Speculated: Windows isn’t the best OS you can imagine…

(UNIX-like isn’t that different…)


Goals

Answer the question “What would a SW platform look like if it was designed from

scratch with the primary goal of improved dependability and trustworthiness”?

Attempt to eliminate the following obvious shortcomings of existing systems & SW stacks Widespread security vulnerabilities Unexpected interactions among applications Failure caused by errant

Extensions Plug-ins, add-ons Drivers

Perceived lack of robustness


Strategies to achieve goals

1. Pervasive employing safe programming language Kernel (where possible) & user space Eliminates such defects as buffer overflow Allows reasoning about the code

2. Using sound program verification tools Further guaranties entire classes of programming errors are

eliminated early in the development cycle Impose constraints to make verification feasible

3. Improved system architecture & design Stops propagation of runtime errors at well-defined boundaries => easier to achieve robust & correct behavior


Kernel

Mostly written in type-safe, garbage collected language 90% of the kernel written in Sing# (extension of C#) 6% C++; rest: C, & little assembly snippets

Still, significant unsafe parts Garbage collector (in unsafe Sing#; accounts for 1/2 of

unsafe code); memory management; I/O access subsystems Microkernel structure

Factored services into user space: NIC, TCP/IP (Israeli connection…), file system, disk driver

Still, 192 sys call! (“my appear shockingly large”) But actually, much simpler than number suggests

No sys calls with complex semantics (such as ioctl) No UNIX compatibility => allows to avoid pitfalls (e.g. tocttou)


Kernel – cont.

Provides 3 base abstractions (the architectural foundation) Software-isolated processes (SIPs) Manifest-based programs (MBPs) Contract-based channels


Software-isolated processes (SIPs)

Like ordinary processes Holder of processing resources Provides context for program execution Threads container

One fundamental, somewhat surprising difference The most radical part of the design… They all run in one address space!

(paging turned off, no use of segments) Both kernel and all processes! User code runs with full hardware privileges (CPL=0)!


Why might that be useful?

Performance Fast process switching

No page table to switch No need to invalidate TLBs In fact, can run in real mode (if memory <= 4GB) In which case no need for TLB at all

Fast system calls CALL rather than INT

Fast IPC No need to copy, it’s already in your address space

Direct user-program access to HW E.g., device drivers of the microkernel


Some performance numbers

Cost (in CPU cycles)

APIcall

Thread yield

message ping/pong

Process creation

Singularity 80 365 1,040 388,00FreeBSD 878 911 13,300 1,030,000Linux 437 906 5,800 719,000Windows 627 753 6,340 5,380,00

IPCswitch


But we’re interested in robustness!

Recall that performance is not the main goal Rather, robustness & security Is not using page-table protection consistent with our goal?

For starters, note that Our OSes do use page tables and they’re not very robust… Unreliability very often comes from extensions

Browser plug-ins & add-ons, loadable kernel modules,… HW VM protection is in any case irrelevant in these cases

Can we just do without HW protection? If we can solve the problem of extensions, then yes! It’d be the same solution for processes…


The idea

Extensions (device drivers, new network protocol , plugin) Would employ a different processes Communicate via IPC

The challenge? Prevent evil/buggy processes from writing

To each other To kernel

Be able to cleanly kill/exit Avoid entangling memory spaces Would allow GC to work

The solution: SIPs Software-isolated processes


SIP philosophy: “sealed”

No modifications from outside No shared memory No signals Only IPC

No modification from within No JIT No class loader No dynamically loaded libraries


SIP rules

Only pointer to your own data No pointers into other SIPs No pointers into the Kernel => No sharing despite shared address space!

SIP can allocate pages of memory from kernel Different allocations are not contiguous


Why un-modifiable?

Why is it crucial that SIPs can’t be modified? Can’t even modify themselves…

What are the benefits? No code insertion attacks

E.g., <script>..</script> injected from the outside into a browser SIP will not work

Easier to reason about (prove things, typically statically) Easier to optimize

E.g., inline (“whole program optimization” can nearly eliminate overheads of virtualization in OOP)

E.g., delete unused functions


Why un-modifiable? – cont.

Singularity vs. JVM (isn’t it “safe” too?…) JVM can load classes in runtime

=> Precludes previous slide JVM allows to share memory

=> Entanglement for when killing(With singularity cleanup is much simpler)

JVM allows other ways to sync (other than explicit messages) Typically impossible to reason about locks without knowing

the program’s semantics JVM = one runtime + one GC

But it has been shown that different programs require different GCs

Singularity supplies a few GCs (trusted code)


Why un-modifiable? – cont.

Bottom line Improved performance makes the pervasive use isolation

primitives feasible


Enforcing memory isolation

How to keep SIPs from reading/writing other SIPs? We must make sure that SIP exclusively reads/writes

from/to memory the kernel has given to it

Shall we have compiler generate code to check every access? (“Does this point to memory the kernel gave us?“) Would slow code down significantly

(especially since memory isn't contagious) Actually, we don't even trust compiler!


Enforcing memory isolation – cont.

Overall structure: during installation Compile to bytecode Verify bytecode statically (confirms to SIP constraints + does

not conflict with any other already existing SIP) Compile bytecode to machine code Later, run the verified machine code w/ trusted runtime

Verifier verifies SIP only uses reachable pointers SIP cannot create new pointers

Only trusted runtime can create pointers Thus, if kernel/runtime never supply out-of-SIP pointer

=> Verified SIP can use its memory only



Specifically, the verifier verifies that SIP doesn’t cook up pointers

Only uses pointers given to it (by the runtime) SIP doesn’t change its mind about type

E.g., don’t interpret int as pointer SIP doesn’t use pointer after it was freed

Enforced with (trusted) GC No other tricks exist

…



Bytecode verification seems to do more than Singularity needs E.g. cooking up pointers might be OK, as long as within

SIP's memory Thus, verifier might forbid some programs that might have

been OK on Singularity, no? No.

Benefits of full verification Fast execution, often don't need runtime checks at all

Though some still needed: array bounds, OO casts Prove IPC contracts (a few slides down the road) …


Trusted vs. Untrusted

Trusted SW If it has bugs => can crash Singularity or wreck other SIPs

Untrusted SW If it has bugs => can only wreck itself

Let’s consider some examples Compiler? Compiler output? Verifier? Verifier output? GC?


Price of SW & HW isolation

9

resources that implement processes, such as the MMU or interrupt controllers. These mechanisms have non -trivial performance costs that are largely hidden because there has been no widely used alternative approach for comparison .

To explore the design trade -offs of hardware protection versus software isolation, recent work in Singularity augments SIPs, which provide isolation through language safety and static verification, with protection domains [1] , which can provide an additional level of hardware -based protection around SIPs . The lower run -time cost of SIPs makes their use feasible at a finer granularity than conventional proces ses, but hardware isolation remains valuable as a defense -in -depth against potential failures in software isolation mechanisms or to make available needed address space on 32 -bit machines .

Protection domains are hardware -enforced protection boundaries , wh ich can host one or more SIPs. Each protection domain consists of a distinct virtual address space. The processor’s MMU enforces memory isolation in a conventional manner. Each domain has its own exchange heap, which is used for communications between SIPs within the domain. A protection domain that does not isolate its SIPs from the kernel is called a kernel domain . All SIPs in a kernel domain run at the processor’s supervisor privilege level (ring 0 on the x86 architecture), and share the kernel’s exchang e heap, thereby simplifying transitions and communication betwee n the processes and the kernel . Non -kernel domains run at user privilege level (ring 3 on the x86).

Communication within a protection domain continues to use Singularity’s efficient reference -passing scheme. However, because each protection domain resides in a separate address space, communication across domains requires data copying or copy -on-write page mapping . The message -passing semantics of Singularity channels makes the implementations indistinguishable to application code (except for performance).

A protection domain could, in principle, host a single process containing unverifiable code written in an unsafe language such as C++. Although very useful for running legacy code, w e have not yet explored this possibility. Currently , all code within a protection domain is also contained within a SIP, which continues to provide an isolation and failure containment boundary.

Because multiple SIPs can be hosted within a protection domain, domains can be employed selectively to provide hardware isolation between specific processes, or between the kernel and processes. The mapping of SIPs to protection domains is a run -time decision . A Singularity system with a distinct protection domain for each SI P is analogous to a fully hardware -isolated

microkernel system , such as MINIX 3 [12] (see Figure 4a). A Singularity system with a kernel domain hosting the kernel, device drivers, and services is analogous to a conventional, monolithic operating syste m, but with more resilience to driver or service failures (see Figure 4b). Singularity also supports unique hybrid combinations of hardware and softwa re isolation, such as selection of kernel domains based on signed code (see F igure 4c).

4.2.1 Quantifying the Un safe Code Tax Singularity offers a unique opportunity to quantify the costs of hardware and software isolation in an apples -to-apples comparison . Once the costs are understood, individual systems can choose to use hardware isolation when its benefits outwe igh the costs.

Hardware protection does not come for free, though its costs are dif fuse and difficult to quantify . Costs of hardware protection include maintenance of page tables, soft TLB misses, cross -processor TLB maintenance, hard paging exceptions, a nd the additional cache pressure caused by OS code and data supporting hardware protection . In addition, TLB access is on the critical path of many processor designs [2, 15] and so might affect both processor clock speed and pipeline depth . Hardware protection increases the cost of calls into the kernel and process context switches [3] . O n processors with an untagged TLB, such as most current implementations of the x86 architecture, a process context switch requires flushing the TLB, whi ch incurs refill costs.

Figure 5 graphs the normalized execution time for the WebFiles benchmark in six different configurations of hardware and software isolation . The WebFiles benchmark is an I/O intensive benchmarks based on SPECweb99 . It consists of three SIPs: a

Figure 4a. Micro -kernel configuration (like MINIX 3). Dotted lines mark protection domains; dark domains are user-level, light are kernel -level.

Figure 4b. Monolithic kernel and monolithic applica tion configuration .

Figure 4c. Configuration with distinct policies for signed and unsigned code.

Kernel

WebApp

HTTPServer

WebBrowser

WebPlug -In

FileSystem

TCP/IPStack

DiskDriver

NICDriver

UnsignedDriver

AppSignedPlug -in

SignedDriver Kernel

UnsignedPlug -in

S ignedApp

UnsignedApp

UnsignedDriver

SignedApp

SignedPlug -in

SignedDriver Kernel

UnsignedPlug -in

UnsignedApp

Figure 5. Normalized WebFiles execution time .

-4.7 %+6.3%

+18.9%+33.0% +37.7%

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1.40

No runtimechecks

PhysicalMemory

Add4KB

Pages

AddSeparateDomain

AddRing 3

FullMicrokernel

Safe Code Tax

Unsafe Code Tax

WebFiles benchmark normalized exe time; 3 SIPs: (1) client issues random reads acorsfiles, (2) FS, (3) disk driver; no checks => runtime checks => TLB turned on (same address space) => client separate address space (ring 0) => client ring 3 => all SIPshave separate address space


IPC: contract-based channels

Contract-based Contract defines a state machine

Verifier statically proves it statically at install time Example: Listing 1 of

G. Hunt & J Larus, “Singularity: rethinking the SW stack”OSR 2007 (a survey on singularity)

Homework: read & understand it Very Fast

Zero copy (see slide 9) SIP can only have one pointer to an object (verified)

Though the “exchange stack” Channels are capabilities

E.g., an open file is a channel received from the file server


Manifest-based programs (MBPs)

No code is allowed to run without a “manifest” By default, a SIP can do nothing Manifest describes SIP’s

Capabilities, required resources, dependencies upon other programs

When program/MBP installed, verifying that It meets all required safety properties All its dependencies are met Doesn’t interfere with already-installed MBPs

Example Manifest of driver provides enough information to prove it

will not access HW of previously installed drivers

Documents

OSE 2011– OS Singularity 1 Operating Systems Engineering Language/OS Co-Design: “Singularity” By Dan Tsafrir, 4/5/2011