48
PLOS 2017 October 28 Kevin Boos Lin Zhong Rice Efficient Computing Group Rice University Theseus: A State Spill-free Operating System

Theseus: A State Spill-free Operating Systemkevinaboos/docs/Theseus PLOS 2017 Slides.pdfPLOS 2017 October 28 Kevin Boos Lin Zhong Rice Efficient Computing Group Rice University Theseus:

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

PLOS 2017 October 28

Kevin Boos Lin Zhong

Rice Efficient Computing Group

Rice University

Theseus: A State Spill-free Operating System

Problems with Today’s OSes• Modern single-node OSes

are vast and very complex • Results in entangled web

of components • Nigh impossible to decouple

• Difficult to maintain, evolve, update safely, and run reliably

2

Easy Evolution is Crucial• Computer hardware must

endure longer upgrade cycles[1] • Exacerbated by the (economic)

decline of Moore’s Law • Evolutionary advancements

occur mostly in software • Extreme example: DARPA’s

challenge for systems to “remain robust and functional in excess of 100 years”[2]

3[2] DARPA seeks to create software systems that could last 100 years. https://www.darpa.mil/news-events/2015-04-08.[1] The PC upgrade cycle slows to every ve to six years, Intel’s CEO says. PCWorld article.

What do we need?Easier evolution by reducing complexity without reducing size (and features).

We need a disentangled OS that: • allows every component to

evolve independently at runtime • prevents failures in one component

from jeopardizing others

4

– the astute audience member

“But surely existing systems have solved this already?”

5

Existing attempts to decouple systems1. Traditional modularization

2. Encapsulation-based

3. Privilege-level separation

4. Hardware-driven

6

Existing attempts to decouple systems1. Traditional modularization

2. Encapsulation-based

3. Privilege-level separation

4. Hardware-driven

7

• Decompose large monolithic system into smaller entities of related functionality (separation of concerns)

Achieves some goals: code reuse

• Often causes tight coupling, which inhibits other goals

• Evidence: Linux is highly modular, but requires substantial effort to realize live update [3, 4, 5, 6] and fault isolation [7, 8, 9]

[3]  J. Arnold and M. F. Kaashoek. Ksplice: Automatic rebootless kernel updates. EuroSys, 2009. [4] M. Siniavine and A. Goel. Seamless kernel updates. DSN) 2013. [5]  G. Altekar, I. Bagrak, P. Burstein, and A. Schultz. OPUS: Online patches and updates for security. USENIX Security, 2005. [6]  K. Makris and K. D. Ryu. Dynamic and adaptive updates of non-quiescent subsystems in commodity OS. EuroSys, 2007.

[7] M. M. Swift, M. Annamalai, B. N. Bershad, and H. M. Levy. Recovering device drivers. OSDI, 2004. [8] M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the reliability of commodity operating systems. SOSP, 2003. [9]  C. Jacobsen, et al. Lightweight capability domains: Towards decomposing the Linux kernel. SIGOPS Oper. Syst. Rev., 2016.

Existing attempts to decouple systems1. Traditional modularization

2. Encapsulation-based

3. Privilege-level separation

4. Hardware-driven

8

• Group related code and data together into a single entity

• Strict boundaries between entities, e.g., classes in OOP

Achieves better maintainability and adaptability

• Similar problems as traditional modularization, i.e., inextricably coupled entities that are difficult to interchange [10, 11]

[10]  C. A. Soules, et al.. System support for online reconfiguration. Usenix ATC, 2003.[11]  F. M. David, E. M. Chan, J. C. Carlyle, and R. H. Campbell. CuriOS: Improving reliability through operating system structure. OSDI, 2008.

Existing attempts to decouple systems1. Traditional modularization

2. Encapsulation-based

3. Privilege-level separation

4. Hardware-driven

9

• Aims to decouple entities by forcing them into separate domains with boundaries based on privilege levels

• Microkernels, virtual machines

Achieves fault isolation

• Coarse spatial granularity [12]

• Evolution remains difficult because microkernel userspace servers must still closely collaborate [13]

[12]  J. N. Herder, H. Bos, B. Gras, P. Homburg, and A. S. Tanenbaum. MINIX 3: A highly reliable, self-repairing operating system. ACM OS Review, 2006. [13]  C. Giuffrida, A. Kuijsten, and A. S. Tanenbaum. Safe and automatic live update for operating systems. ASPLOS, 2013.

Existing attempts to decouple systems1. Traditional modularization

2. Encapsulation-based

3. Privilege-level separation

4. Hardware-driven

10

• Choose entity bounds based on the underlying hardware architecture (cores, coherence domains)

• Barrelfish [14], Helios [15], fos [16], K2 [17]

Achieves scalable and energy-efficient performance

• Does not facilitate evolution, runtime flexibility, or fault isolation

[14]  A. Baumann, et al. The multikernel: A new os architecture for scalable multicore systems. SOSP, 2009. [15]  E. B. Nightingale, et al. Helios: Heterogeneous multiprocessing with satellite kernels. SOSP, 2009.

[16] D. Wentzlaff, et al. An operating system for multicore and clouds. SoCC, 2010. [17]  Felix Lin, et al. K2: a mobile OS for heterogeneous coherence domains. ASPLOS, 2014

Key Insight:

state spill is the root cause of entanglement within OSes

overlooked by existingdecoupling strategies

What is state spill? • When one software entity’s state undergoes a lasting change

as a result of handling an interaction with another entity.

12

• Prevalent and deeply ingrained in modern system software

• Causes entanglement • Individual entities cannot be

easily interchanged • Multiple entities share fate • Hinders other goals as well

[EuroSys’17]

Kevin Boos, et al. A Characterization of State Spill in Modern Operating Systems. EuroSys, 2017.

– the skeptical audience member

“Why is state spill a useful concept?”

13

Modern OSes under the light of state spill

14

task mgmt

nano-core

kernel

console

input event mux

keyboar

d indirection

layer

scheduler

CFQpolicy

FCFSpolicy

RRpolicy

mouseindirection layer

VGA indirection layer

graphics mux

filesystem

PIC IRQ

PIT clock IRQ

event dispatcher

syscall dispatcher

syscall

indirection

layer

heap allocator

frame allocatorstack

allocator

userspaceprocesses

PIC IRQ

scheduler

filesystem

PIC IRQ

PIC IRQ

filesystem

filesystem

(a) Monolithic Kernel (b) Microkernel OS (c) Theseus Kernel

module

submodule

entanglement via state spill

scheduler

filesystem

scheduler

• Web of interacting modules spill state into each other • Larger modules contain submodules that cannot be managed independently

task mgmt

nano-core

kernel

console

input event mux

keyboar

d indirection

layer

scheduler

CFQpolicy

FCFSpolicy

RRpolicy

mouseindirection layer

VGA indirection layer

graphics mux

filesystem

PIC IRQ

PIT clock IRQ

event dispatcher

syscall dispatcher

syscall

indirection

layer

heap allocator

frame allocatorstack

allocator

userspaceprocesses

PIC IRQ

scheduler

filesystem

PIC IRQ

PIC IRQ

filesystem

filesystem

(a) Monolithic Kernel (b) Microkernel OS (c) Theseus Kernel

module

submodule

entanglement via state spill

scheduler

filesystem

scheduler

Theseus: a Disentangled OS

15

task mgmt

nano-core

kernel

console

input event mux

keyboar

d indirection

layer

scheduler

CFQpolicy

FCFSpolicy

RRpolicy

mouseindirection layer

VGA indirection layer

graphics mux

filesystem

PIC IRQ

PIT clock IRQ

event dispatcher

syscall dispatcher

syscall

indirection

layer

heap allocator

frame allocatorstack

allocator

userspaceprocesses

PIC IRQ

scheduler

filesystem

PIC IRQ

PIC IRQ

filesystem

filesystem

(a) Monolithic Kernel (b) Microkernel OS (c) Theseus Kernel

module

submodule

entanglement via state spill

scheduler

filesystem

scheduler

task mgmt

nano-core

kernel

console

input event mux

keyboar

d indirection

layer

scheduler

CFQpolicy

FCFSpolicy

RRpolicy

mouseindirection layer

VGA indirection layer

graphics mux

filesystem

PIC IRQ

PIT clock IRQ

event dispatcher

syscall dispatcher

syscall

indirection

layer

heap allocator

frame allocatorstack

allocator

userspaceprocesses

PIC IRQ

scheduler

filesystem

PIC IRQ

PIC IRQ

filesystem

filesystem

(a) Monolithic Kernel (b) Microkernel OS (c) Theseus Kernel

module

submodule

entanglement via state spill

scheduler

filesystem

scheduler

Implementedfrom scratch using Rust

Runtime composable

Inspired by distributed computing

Our Namesake: Ship of TheseusThe ship wherein Theseus and the youth of Athens returned from Crete had thirty oars, and was preserved by the Athenians down even to the time of Demetrius Phalereus, for they took away the old planks as they decayed, putting in new and stronger timber in their places, in so much that this ship became a standing example among the philosophers, for the logical question of things that grow;

one side holding that the ship remained the same, and the other contending that it was not the same.

   — Plutarch (Theseus)16

TODO: picture of ship

Theseus Directives

17

Primary Directive no state spill

Secondary Directive elementary modules

eliminate state spill above all else, e.g., performance, ease of programming

no submodules; modules as small as possible

Design Principles

Design PrinciplesDecoupling entities based on state spill (pairwise)1. No Encapsulation 2. Stateless Communication Composing a Disentangled OS (multi-entity) 3. Universal, connectionless interfaces 4. Generic pattern reuse

19

P1: No Encapsulation• Encapsulation: code & data

are bundled together • Direct cause of state spill

20Standard Encapsulation Opaque Exportation + Stateless Communication

Client state Server state

config(c)

func1()

func2(r1)

c

c, s1

c, s1, s2

r1

r1, r2

void

r1

r2

a’

c

r2

Client state Server state

r1

r1, r2

c

c, s1

c, s1, s2

c

c, s1

config(c)

func1()

func2(r1)

void

r1

r2

c, s1

c

c

c, s1

c, s1, s2

c, s1, s2

start

end

• Instead, eschew encapsulation • Client should maintain its own

progress with the server • Must preserve information hiding

P2: Stateless Communication• An interaction from client to server must be self-sufficient,

containing everything that the server requires to handle it • No assumption of prior interactions or state

• Implications: • Server entity is effectively stateless • No hidden global dependencies

21

fn0( )

fn1( )

fn2()

void

a’

c

r2

( )

stateful

Select Design and Implementation Decisions

State Management via Opaque Exportation• P1 + P2 –> opaque exportation

• Server returns sealed representation of progress to client

• Preserves information hiding • Client cannot inspect/modify state

• Allows stateless communication • Handled transparently by compiler

• Leverage Rust’s affine type system to avoid high overhead

23

Standard Encapsulation Opaque Exportation + Stateless Communication

Client state Server state

config(c)

func1()

func2(r1)

c

c, s1

c, s1, s2

r1

r1, r2

void

r1

r2

a’

c

r2

Client state Server state

r1

r1, r2

c

c, s1

c, s1, s2

c

c, s1

config(c)

func1()

func2(r1)

void

r1

r2

c, s1

c

c

c, s1

c, s1, s2

c, s1, s2

start

end

– the slightly persnickety audience member

“But can you always eliminate state spill?”

24

… some states must exist somewhere• State spill is not always avoidable, especially in low-level

OS code that interfaces directly with hardware • e.g., frame allocator, interrupt handler

25

Server

Directive 1: no state spill (above all else)

Directive 2: elementary modules

Flat Module Architecture● Submodules contribute to complex entanglement

○ Extract submodules into first-order modules

● Simplifies module logic → nano_core manages all● Permits communication and compositional hierarchy

Theseus: a State Spill-free Operating SystemKevin Boos and Lin Zhong

OSes are complex and entangled

● Existing OSes are a web of entangled entities ○ Cannot treat entities independently

● OS components should be easily interchangeable at runtime, for fluid system evolution○ Goal: Runtime Composability

● Prior decoupling strategies are insufficient; entanglement remains between system entities○ Modularization○ Encapsulation (OOP)○ Privilege-level separation (μkernel)○ Hardware-driven (Barrelfish)

Theseus design principles

1. No traditional encapsulation○ Client A should maintain the state representing

its progress with server B, instead of B○ Preserve information hiding: A cannot inspect

or modify state from B

2. Stateless interactions○ An interaction from A → B must include

everything B needs to handle it○ Implication: B can be practically stateless

3. Universal, connectionless communication○ All entities are accessible in a uniform way○ Do not assume ongoing existence of interfaces

4. Re-use of generic, spill-free patterns○ Implement common OS design patterns once in

a spill-free way, then re-use across system

Design & implementation decisions

State Spill is the root cause

● Scenario: source entity A (“client” role) communicates with destination B (“server role).

● State spill occurs when B’s state undergoes a lasting change after handling an interaction from A.

Main goals of Theseus

Caller entity A(“client”)

Callee entity B (“server”) Initial state

pre-interaction

Changed state mid-interaction

Lasting changes post-interaction

statespill

Encapsulation causes state spill

Standard Encapsulation Opaque Exportation + Stateless Communication

Client state Server state

config(c)

fn1()

fn2(r1)

c

c, s1

c, s1, s2

r1

r1, r2

void

r1

r2

Client state Server state

r1

r1, r2

c

c, s1

c, s1, s2

c

c, s1

config(c)

fn1()

fn2(r1)

void

r1

r2

c, s1

c

c

c, s1

c, s1, s2

c, s1, s2

task mgmt

kernel console

input event mux

keyboard

indirection layer

scheduler

CFQpolicy

FCFSpolicy

RRpolicy

mouseindirection layer

VGA indirection layergraphics mux

filesystem

PIC IRQ

PIT clock IRQ

event dispatcher

syscall dispatcher

syscall indirection layer

heap allocator

frame allocatorstack allocator

PIC IRQ

filesystem

PIC IRQ

PIC IRQfilesystem

filesystem

Monolithic / Microkernel OS Theseus

For disentanglement, we focus only on states and how they propagate

throughout entities in the system.

scheduler

Current status and future work

● Done: baseline OS from scratch, all in Rust● Now: analyze & rethink modules and interfaces

to remove state spill● Far: no user/kernel distinction: “bag of modules”

Software-only Isolation and Safety● Modules are separate binaries: namespace isolation ● Augment Rust compiler to permit minimal subset of

unsafe code necessary for basic OS functionality● Error handling is mandatory, using Option & Result

○ Panics are disallowed and transformed into errors

State Management ● At some point, some entities must hold some state

System-wide entity (e.g., hardware resource)

Client Client Client

Multi-client state

● Export multi-client state as data blob jointly owned by all clients

● Clientless states are owned by state_db metamodule○ Entity caches a

weak reference to it

nano_core

statedb

W

W

• Multi-client states are exported as blobs jointly owned by every client

• Clientless states are owned by state_db metamodule,module caches weak reference

Software-only Safety & Isolation• Problem: we need isolation between modules

• For fault isolation but also for interchangeability • Easy solution: rely on Rust’s memory safety guarantees

• Why not just use existing techniques? (Singularity, SPIN, Vino) • Challenge: many modules, modules are in kernel core

• No support for processes, SIPs, extensions, etc

26

Building Modules in Isolation• Each module is a separate Rust crate

• Compiled into individual binaries, isolated into private “namespaces”

27

CompilerCompiler/Linker

Theseus

Easy to extricate a

single crate due to clear boundaries

Standard OS

No true distinction between

modules, or blurry lines

Many modules, all at the core✓ Naming isolation done • Problem: Hardware protection or

spatial multiplexing is infeasible for 100+ low-level modules

• Solution: shared resources • Challenge: modules implement core

kernel functionality • They need to execute unsafe code • Unsafe code can do … well, anything

28

Unsafe code is a necessary evil• Some unsafe code is okay

• Port I/O & MMIO • Register access • Interrupts & descriptor tables

• Most unsafe code is bad! • Dereferencing arbitrary pointers • Random type reinterpretations

• Solution: augment Rust compiler to permit minimal subset of necessary unsafe code • Principal of least privilege, per-module

29

fn keyboard_irq_handler() { let scan_code: u8 = unsafe { in_byte(0x60) }; log!("scan_code: {}", scan_code); // notify end of interrupt unsafe { out_byte(0x20, 0x20); } }

fn bad_boy_bad_boy() { unsafe { *(0xFFFF1234 as *mut u32) = 0xDEADF00D; } }

task mgmt

nano_core

kernel

console

input event mux

keyboar

d indirection

layer

scheduler

CFQpolicy

FCFSpolicy

RRpolicy

mouseindirection layer

VGA indirection layer

graphics mux

filesystem

PIC IRQ

PIT clock IRQ

event dispatcher

syscall dispatcher

syscall

indirection

layer

heap allocator

frame allocatorstack

allocator

userspaceprocesses

PIC IRQ

filesystem

PIC IRQ

PIC IRQ

filesystem

filesystem

(a) Monolithic Kernel (b) Microkernel OS (c) Theseus

module

submodule

entanglement via state spill

scheduler

Module Management, flattened• Submodules contribute to entanglement

• Must be separated out into standalone, first-order modules with spill-free public interfaces

• nano_core can then manage all modules indiscriminately

Two important clarifications: 1. State spill freedom does not preclude arbitrary module interaction 2. A module’s flat code structure does not preclude

compositional hierarchy amongst modules30

Why Rust?

Beneficial Features of Rust• High-level constructs with low-level flexibility

32

fn clear_vga_screen() { range(0, 80*25, |i| { *((0xb8000 + i * 2) as *mut u16) = VgaChar::new(Black) << 12; }); }

Beneficial Features of Rust• High-level constructs with low-level flexibility

33

let &mut MemoryManagementInfo { ref mut page_table, ref mut vmas, ref mut stack_allocator } = current_mmi; match page_table { &mut PageTable::Active(ref mut active_table) => { let mut frame_allocator = FRAME_ALLOCATOR.lock(); if let Some((stack, stack_vma)) = stack_allocator.alloc_stack( ... ) { vmas.push(stack_vma); Ok(stack) } } _ => { Err("MemoryManagementInfo::alloc_stack: failed to allocate stack!") } }

Beneficial Features of Rust• High-level constructs with low-level flexibility • Functional & imperative

34

p3.and_then(|p3| p3.next_table(page.p3_index())) .and_then(|p2| p2.next_table(page.p2_index())) .and_then(|p1| p1[page.p1_index()].pointed_frame()) .or_else(huge_page) .map(|frame| { frame.number * PAGE_SIZE + offset })

sections.filter(|s| !s.is_allocated()) .map(|s| s.addr) .min()

Beneficial Features of Rust• High-level constructs with low-level flexibility • Functional & imperative • Strongly typed + type inference

35

Beneficial Features of Rust• High-level constructs with low-level flexibility • Functional & imperative • Strongly typed + type inference • Clear ownership/borrowing semantics • Explicit lifetimes

36

Beneficial Features of Rust• High-level constructs with low-level flexibility • Functional & imperative • Strongly typed + type inference • Clear ownership/borrowing semantics • Explicit lifetimes

37

let y: &u32 = { let x = 5; &x }; println!("{}", y);

error: `x` does not live long enough | 3 | &x | - borrow occurs here 4 | }; | ^ `x` dropped here while still borrowed ... 10 | println!(“{}”, y); | - borrowed value needs to live until here

Beneficial Features of Rust• High-level constructs with low-level flexibility • Functional & imperative • Strongly typed + type inference • Clear ownership/borrowing semantics • Explicit lifetimes • Compile-time checking & guarantees

38

Concluding Remarks

Current & Future Work• Applying the herein described design to transform our

existing baseline OS into state spill-free design • Evaluate and demonstrate runtime composability

Related projects using Theseus• Exploring multi-entity resource accounting with state spill • Massive MU-MIMO LTE/5G Basestation system software

• Goal: bring reliability, safety, flexibility, scalability to the edge • Refactoring network stack for network provenance

and tolerance of DoS attacks40

Theseus in conclusion• Eliminate state spill above all else • In pursuit of runtime composability

for easy long-term OS evolution • Implemented from scratch in Rust

• Will be open-sourced soon!

41

Rice Efficient Computing Group recg.org

42

Backup Slides

Rust disadvantages• Lifetime checking isn’t great

• Copious overusage of panic • Must translate into normal error responses

• Allocations are not as obvious as in C • Many more in paper

44

let tasklist = task::get_tasklist().read(); let mut curr_task = tasklist.get_current().unwrap().write(); let curr_mmi = curr_task.mmi.as_ref().unwrap(); let mut curr_mmi_locked = curr_mmi.lock(); curr_mmi_locked.map_dma_memory(paddr, 512, PRESENT | WRITABLE);

P3: Universal, Connectionless Interfaces• All entities should be easily accessible through a

uniform invocation and management interface • Promotes easy interchangeability • Avoids module-specific logic

• No expectation of an interface’s ongoing availability • Interactions cannot be stateful, must be “connectionless”

45

Pattern Reuse• Common recurring OS design patterns should be

implemented only once and reused throughout the OS • Examples: multiplexers, dispatchers, indirection layers

• Pattern must enforce the absence of state spill regardless of specialization

• Should be instantiable at compile time • Lowers development risk of adding new features

46

Modules must not jeopardize evolutionRegular Modules• In general, easy to update because modules are stateless • The nano-core relieves modules that do contain states from the burden of

implementing their own state-saving logic • (Tedious approach, commonly used in live updates for reliable systems, e.g., MINIX 3 [40]) • In Theseus, compiler is the sole manager of these exported states;

it produces control routines to move them in and out of a module’s bounds, a guaranteed-safe operation because they are not accessible at the source level

• Updates can occur on demand without the modules’ cooperation or knowledge

Metamodules • Must also be easily updatable • state_db “recalls” lent-out states and externalizes them temporarily

47

Related work (abridged)• Rust OSes with different goals: Tock, Redox • Static Composability: Flux OSKit, Think, Taligent

• Componentized interfaces: OpenCOM, Knit • Microservices architecture and serverless

48