capOS Documentation

capOS is a research operating system where every kernel service and every cross-process service is a typed Cap’n Proto capability invoked through a shared-memory ring. There is no ambient authority, no global path namespace, and the only remaining syscalls are cap_enter and exit. The current implementation boots on x86_64 QEMU, loads a Cap’n Proto boot manifest, starts manifest-declared services, and exercises ring-native IPC, capability transfer, and init-driven spawning through QEMU smoke binaries.

Use this book as the current system manual. It separates implemented behavior from proposals, research notes, and operational planning files. What capOS Is has the short version of what makes the design unusual.

Start Here

  • What capOS Is describes the implemented system model and the main authority boundaries.
  • Current Status lists what works today, what is partial, and what remains future work.
  • Build, Boot, and Test gives the commands used to build the ISO, boot QEMU, and run host-side validation.
  • Repository Map maps the main subsystems to source files.

Deeper References

Operational planning still lives outside the book in ROADMAP.md, WORKPLAN.md, and REVIEW_FINDINGS.md. Treat those as live planning and review records, not stable architecture pages.

What capOS Is

A research kernel that boots on x86_64 QEMU. The rest of this page is about why it looks the way it does — the specific design bets behind the code — not a feature inventory. For the feature-by-feature matrix, see Current Status.

Status: Partially implemented.

What Makes capOS Different

capOS is a research vehicle for a few specific design bets. Each is unusual on its own; the combination is the point.

  • Everything is a typed capability. System resources are accessed through Cap’n Proto interfaces defined in schema/capos.capnp. There is no ambient authority — no global path namespace, no open-by-name, no implicit inherit. A process can only invoke objects present in its local capability table.
  • The interface IS the permission. Instead of a parallel READ/WRITE/EXEC rights bitmask (Zircon, seL4), attenuation is a narrower capability: a wrapper CapObject exposing fewer methods, or an Endpoint client facet that cannot RECV/RETURN. The kernel just dispatches; policy lives in interfaces. See Capability Model.
  • io_uring-style shared-memory ring for every call. Every process owns a submission/completion queue page. Userspace writes SQEs with a normal memory store; the kernel processes them through cap_enter. New operations are SQE opcodes (CALL, RECV, RETURN, RELEASE, NOP), not new syscalls. The remaining syscall surface is cap_enter and exit.
  • Release is transport, not an application method. Dropping the last owned handle in capos-rt submits a CAP_OP_RELEASE SQE; the kernel removes the slot. No close() method on every interface, no mutable table self-reference during dispatch.
  • Capability transfer is first-class. Copy and move descriptors ride sideband on CALL/RETURN SQEs. Move reserves the sender slot until the receiver accepts and preflight checks pass, then commits or rolls back atomically — no lost, duplicated, or half-inserted authority.
  • Cap’n Proto wire format end-to-end. The same encoding describes the boot manifest, runtime method calls, and future persistence/remote transparency. The CQE log is itself a serialized capnp message stream, which opens the door to record/replay, audit, and migration as OS primitives rather than external tooling.
  • Host-testable pure logic. Cap-table, frame-bitmap, ELF parser, frame ledger, lazy buffers, and the ring model all live in capos-lib / capos-config and run under cargo test-lib, Miri, Loom, Kani, and proptest without any kernel scaffolding. Kernel glue stays thin.
  • Schema-first boot. system.cue is compiled to a Cap’n Proto SystemManifest embedded as the single Limine boot module. The manifest carries binaries, capability grants, exports, badges, and restart metadata as typed structured data — not shell scripts or baked environment variables.
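The "interface IS the permission" bet above can be sketched on the host: attenuation is a narrower type rather than a masked rights word. The trait and struct names below are illustrative assumptions, not the real capos-rt API.

```rust
// Hypothetical sketch of attenuation-by-interface. The kernel only
// dispatches method calls; the narrower type is the policy.

/// Full endpoint authority: the owner can both call and receive.
trait EndpointOwner {
    fn call(&self, msg: &str) -> String;
    fn recv(&self) -> String;
}

/// A client facet wraps the owner but exposes only `call`. There is no
/// path to `recv` through this type, so no rights bitmask is needed.
struct ClientFacet<'a> {
    owner: &'a dyn EndpointOwner,
}

impl<'a> ClientFacet<'a> {
    fn call(&self, msg: &str) -> String {
        self.owner.call(msg)
    }
}

struct Echo;
impl EndpointOwner for Echo {
    fn call(&self, msg: &str) -> String {
        format!("echo: {msg}")
    }
    fn recv(&self) -> String {
        String::from("pending")
    }
}

fn main() {
    let owner = Echo;
    let client = ClientFacet { owner: &owner };
    // The facet can invoke the narrow interface; `client.recv()` simply
    // does not compile, because the authority is absent from the type.
    println!("{}", client.call("hi"));
}
```

The design choice this illustrates: revoking or narrowing authority never requires the kernel to consult a permission table, because an unreachable method is an unholdable right.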

Execution Model

Each process owns an address space, a local capability table, a mapped capability-ring page, and a read-only CapSet page that enumerates its bootstrap handles. The kernel enters Ring 3 with iretq and returns through cap_enter or the timer. Ordinary capability calls progress only via cap_enter; timer-side polling handles non-CALL ring work and call targets that are explicitly safe for interrupt dispatch. Details in Process Model, Capability Ring, and Scheduling.

Boot Flow

The kernel receives exactly one Limine module — a Cap’n Proto SystemManifest compiled from system.cue — validates it, loads the referenced ELFs, builds per-service capability tables and CapSet pages, and starts the scheduler. The default boot still wires the service graph in the kernel; the selected milestone is to move generic manifest execution into init through ProcessSpawner. Full walkthrough in Boot Flow and Manifest and Service Startup.

Authority Boundaries

Authority is carried by cap-table hold edges with generation-tagged CapIds. Ring 0 ↔ Ring 3, capability table ↔ kernel object, endpoint IPC, copy/move transfer, manifest/boot-package, and process spawn are the boundaries reviewers care about; each one fails closed at hostile input. See Trust Boundaries for the boundary table and Authority Accounting for the transfer and quota invariants.

What capOS Is Not

A POSIX clone, a microkernel-shaped Linux replacement, or a production OS. It is a place to try the above choices and see which ones survive contact with real workloads. See Build, Boot, and Test to run it.

Current Status

This page describes current repository behavior, not the full long-term design. For operational priority and open review items, read WORKPLAN.md and REVIEW_FINDINGS.md.

Implemented

Boot and Kernel Baseline

  • Limine boots the x86_64 kernel in QEMU.
  • The kernel initializes serial output, GDT, IDT, PIC, PIT, syscall MSRs, memory management, page tables, heap allocation, and the global capability registry.
  • The kernel creates its own page tables with per-section permissions and keeps the higher-half direct map for physical memory access.
  • SMEP/SMAP are enabled when the QEMU CPU advertises support.

Code: kernel/src/main.rs, kernel/src/arch/x86_64/, kernel/src/mem/.

Validation: cargo build --features qemu, make run.

Process and Userspace Runtime

  • Processes have isolated address spaces, per-process kernel stacks, CapSet bootstrap pages, capability rings, and local capability tables.
  • ELF loading supports static no_std userspace binaries and TLS setup.
  • capos-rt owns the userspace entry path, allocator initialization, ring-client access, typed clients, result-cap parsing, and owned-handle release.

Code: kernel/src/spawn.rs, kernel/src/process.rs, capos-rt/src/, init/src/main.rs, demos/.

Validation: make capos-rt-check, make run, make run-spawn.

Capability Ring and IPC

  • The shared ring ABI supports CALL, RECV, RETURN, RELEASE, and NOP transport operations.
  • cap_enter processes submissions and can block until completions arrive or a timeout expires.
  • Endpoints route ring-native IPC between processes.
  • Direct IPC handoff runs a blocked receiver ahead of unrelated round-robin work as soon as a matching CALL arrives.
  • Transport errors and application exceptions are surfaced through CQEs and typed runtime client errors.

Code: capos-config/src/ring.rs, kernel/src/cap/ring.rs, kernel/src/cap/endpoint.rs, capos-rt/src/ring.rs, capos-rt/src/client.rs.

Validation: cargo test-ring-loom, make run.
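The submission side of the ring can be modeled on the host with a few lines. The struct fields and constants below are invented for illustration; the real ABI lives in capos-config/src/ring.rs and uses a shared memory page drained through cap_enter, not a function call.

```rust
// Simplified single-producer submission-ring model (illustrative only).

const RING_SIZE: usize = 8; // power of two so indices wrap cheaply

#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct Sqe {
    opcode: u8, // e.g. CALL, RECV, RETURN, RELEASE, NOP
    cap: u32,   // capability-table slot being invoked
}

struct SubmissionRing {
    entries: [Sqe; RING_SIZE],
    head: usize, // next entry the "kernel" will consume
    tail: usize, // next entry the "userspace" will fill
}

impl SubmissionRing {
    fn new() -> Self {
        Self { entries: [Sqe::default(); RING_SIZE], head: 0, tail: 0 }
    }

    /// Userspace side: a plain memory store plus a tail bump, no syscall.
    fn submit(&mut self, sqe: Sqe) -> Result<(), ()> {
        if self.tail - self.head == RING_SIZE {
            return Err(()); // ring full
        }
        self.entries[self.tail % RING_SIZE] = sqe;
        self.tail += 1;
        Ok(())
    }

    /// Kernel side: in the real system this drain happens inside cap_enter.
    fn drain(&mut self) -> Vec<Sqe> {
        let mut out = Vec::new();
        while self.head < self.tail {
            out.push(self.entries[self.head % RING_SIZE]);
            self.head += 1;
        }
        out
    }
}

fn main() {
    let mut ring = SubmissionRing::new();
    ring.submit(Sqe { opcode: 1, cap: 3 }).unwrap(); // CALL on slot 3
    ring.submit(Sqe { opcode: 4, cap: 3 }).unwrap(); // RELEASE slot 3
    println!("drained {} entries", ring.drain().len()); // drains 2
}
```

The point of the shape: adding a new operation means defining a new opcode that the drain loop dispatches on, not a new syscall entry.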

Capabilities

Implemented kernel capabilities include:

  • Console for serial output.
  • FrameAllocator for physical frame grants.
  • Endpoint for IPC rendezvous.
  • VirtualMemory for anonymous user page map, unmap, and protect operations.
  • ProcessSpawner and ProcessHandle for init-driven child process creation and wait semantics.

Code: kernel/src/cap/console.rs, kernel/src/cap/frame_alloc.rs, kernel/src/cap/endpoint.rs, kernel/src/cap/virtual_memory.rs, kernel/src/cap/process_spawner.rs.

Validation: make run, make run-spawn, cargo test-lib.

Capability Transfer and Release

  • IPC CALL and RETURN support sideband transfer descriptors.
  • Copy and move transfer are implemented.
  • Move transfer reserves the sender slot until destination insertion and commit.
  • Transfer result caps carry interface ids to userspace.
  • CAP_OP_RELEASE removes local capability-table slots and is integrated with runtime owned-handle drop.

Code: kernel/src/cap/transfer.rs, kernel/src/cap/ring.rs, capos-lib/src/cap_table.rs, capos-rt/src/ring.rs.

Validation: cargo test-lib, make run.
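The reserve/commit/rollback shape of move transfer can be sketched with hypothetical slot states; the real transaction helpers live in capos-lib/src/cap_table.rs and involve generations and preflight checks this toy omits.

```rust
// Two-phase move-transfer sketch (invented types). The sender slot is
// reserved, not freed, until the receiver insert succeeds; on failure the
// reservation rolls back, so authority is never lost or duplicated.

#[derive(Clone, Copy, PartialEq, Debug)]
enum Slot {
    Empty,
    Held(u32),     // holds an object id
    Reserved(u32), // mid-transfer: unusable but not yet released
}

fn move_cap(sender: &mut Slot, receiver: &mut Slot) -> Result<(), ()> {
    // Phase 1: reserve the sender slot so it cannot be reused mid-transfer.
    let obj = match *sender {
        Slot::Held(o) => {
            *sender = Slot::Reserved(o);
            o
        }
        _ => return Err(()),
    };
    // Phase 2: preflight the destination, then commit or roll back.
    if *receiver == Slot::Empty {
        *receiver = Slot::Held(obj);
        *sender = Slot::Empty; // commit: the sender loses the cap
        Ok(())
    } else {
        *sender = Slot::Held(obj); // rollback: the sender keeps the cap
        Err(())
    }
}

fn main() {
    let mut sender = Slot::Held(7);
    let mut receiver = Slot::Empty;
    move_cap(&mut sender, &mut receiver).unwrap();
    println!("{sender:?} -> {receiver:?}"); // Empty -> Held(7)
}
```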

Manifest Tooling and Smokes

  • tools/mkmanifest turns system.cue into a Cap’n Proto boot manifest.
  • The build uses repo-pinned Cap’n Proto and CUE tool paths through the Makefile; direct mkmanifest invocation also rejects missing, unpinned, or version-mismatched CUE compilers.
  • Default QEMU smoke services cover CapSet bootstrap, Console paths, ring corruption handling, reserved opcodes, NOP, ring fairness, TLS, VirtualMemory, FrameAllocator cleanup, Endpoint cleanup, and cross-process IPC.
  • system-spawn.cue drives the ProcessSpawner smoke where init spawns endpoint, IPC, VirtualMemory, and FrameAllocator cleanup children and checks hostile spawn inputs.

Code: tools/mkmanifest/, system.cue, system-spawn.cue, demos/.

Validation: cargo test-mkmanifest, make generated-code-check, make run, make run-spawn.

Partially Implemented

Init-Owned Service Startup

init can use ProcessSpawner in the spawn smoke, but default make run still uses the kernel boot path to create all manifest services. The selected milestone is to make default boot use init to validate and execute the service graph.

Current blockers are tracked in WORKPLAN.md and REVIEW_FINDINGS.md. Manifest schema-version guardrails, BootPackage authority exposure to init, init-side manifest graph validation, ProcessSpawner badge attenuation, direct manifest-tool CUE pin enforcement, generic manifest spawning, and child-local FrameAllocator/VirtualMemory grants are in place for the spawn-manifest path. Remaining milestone work is legacy kernel service-graph retirement for the default boot path.

Hardware and Networking

The QEMU virtio-net path has legacy PCI config-space enumeration and a make run-net boot target. A virtio-net driver, smoltcp integration, ICMP, and TCP smoke are not implemented.

Code: kernel/src/pci.rs, kernel/src/arch/x86_64/pci_config.rs, tools/qemu-net-harness.sh.

Validation: make run-net, make qemu-net-harness for the existing PCI smoke path.

Security and Verification Track

The repo includes Miri, proptest, fuzz, Loom, and Kani test tiers, plus generated-code, dependency-policy, trusted-build-input, panic-surface, and DMA-isolation work. Coverage is not yet complete for every trust boundary.

References: Trusted Build Inputs, Panic Surface Inventory, DMA Isolation, and Security and Verification Proposal.

Future Work

Future architecture includes:

  • manifest and lifecycle: generic manifest execution in init, service restart policy;
  • capability features: notification objects, promise pipelining, epoch revocation, shared-buffer capabilities, capability-scoped system monitoring;
  • scheduling and resources: scheduling-context donation, session quotas, SMP;
  • platform services: storage and naming, userspace networking, cloud boot support;
  • user-facing work: user identity, policy enforcement, boot-to-shell authentication, text shell launch, and broader language/runtime support.

Build, Boot, and Test

The commands below are the current local workflow for the x86_64 QEMU target. The root Cargo configuration defaults to x86_64-unknown-none, so host tests must use the repo aliases instead of bare cargo test.

Prerequisites

Expected host tools:

  • Rust nightly from rust-toolchain.toml
  • make
  • qemu-system-x86_64
  • xorriso
  • curl, sha256sum, and standard build tools for pinned tool downloads
  • Go, used by the Makefile to install the pinned CUE compiler when needed
  • Optional policy and proof tools for extended checks: cargo-deny, cargo-audit, cargo-fuzz, cargo-miri, and cargo-kani

The Makefile pins and verifies:

  • Limine at the commit recorded in Makefile
  • Cap’n Proto compiler version 1.2.0
  • CUE version 0.16.0

Pinned tools are installed under the clone-shared .capos-tools directory next to the git common directory.

Build the ISO

make

This builds:

  • the kernel with the default bare-metal target;
  • init as a standalone release userspace binary;
  • release-built demo service binaries under demos/;
  • the capos-rt-smoke binary;
  • manifest.bin from system.cue;
  • capos.iso with Limine boot files.

Relevant files: Makefile, limine.conf, system.cue, tools/mkmanifest/.

Boot QEMU

make run

This builds the ISO with the qemu feature, boots QEMU with serial on stdio, and uses the isa-debug-exit device so a clean kernel halt exits QEMU with the expected status.

The default smoke path should print kernel startup diagnostics, manifest service creation, demo output, and final halt. Current default smokes include CapSet bootstrap, capos-rt-smoke, Console paths, ring corruption recovery, reserved opcode handling, NOP, ring fairness, TLS, VirtualMemory, FrameAllocator cleanup, Endpoint cleanup, and cross-process IPC.

Spawn Smoke

make run-spawn

This boots with system-spawn.cue. Only init is boot-launched by the manifest; init uses ProcessSpawner to launch endpoint, IPC, VirtualMemory, and FrameAllocator cleanup demo children, wait for ProcessHandles, and exercise hostile spawn inputs.

This is the current validation path for init-driven process creation. It is not yet the default manifest executor.

Networking and Measurement Targets

make run-net
make qemu-net-harness
make run-measure
  • make run-net attaches a QEMU virtio-net PCI device and exercises current PCI enumeration diagnostics.
  • make qemu-net-harness runs the scripted net smoke path.
  • make run-measure enables the separate measure feature for benchmark-only counters and cycle measurements. Do not treat it as the normal dispatch build.

Formatting and Generated Code

make fmt
make fmt-check
make generated-code-check
  • make fmt formats the kernel workspace plus standalone init, demos, and capos-rt crates.
  • make fmt-check verifies formatting without modifying files.
  • make generated-code-check verifies checked-in Cap’n Proto generated code against the repo-pinned compiler path.

Host Tests

cargo test-config
cargo test-ring-loom
cargo test-lib
cargo test-mkmanifest
make capos-rt-check
  • cargo test-config runs shared config, manifest, ring, and CapSet tests on the host target.
  • cargo test-ring-loom runs the bounded Loom model for SQ/CQ protocol invariants.
  • cargo test-lib runs host tests for pure shared logic such as ELF parsing, capability tables, frame allocation, and related property tests.
  • cargo test-mkmanifest runs host tests for manifest generation.
  • make capos-rt-check builds the standalone runtime smoke binary with the userspace relocation flags used by the boot image.

Extended Verification

make dependency-policy-check
make fuzz-build
make fuzz-smoke
make kani-lib
cargo miri-lib

These require optional tools. Use them when changing dependency policy, manifest parsing, ELF parsing, capability-table/frame logic, or proof-covered shared code. See the Security and Verification Proposal for the rationale behind the extended verification tiers.

Validation Rule

For behavior changes, a clean build is not enough. The relevant QEMU process must exercise the behavior and print observable output that proves the path works. make run is the default end-to-end gate; make run-spawn, make run-net, or make run-measure are additional gates for their specific features.

Repository Map

This map names the main source locations for the current system. It is not an ownership file; use it to find the code behind architecture and validation claims.

Root Files

  • README.md gives the compact project overview.
  • ROADMAP.md records long-range stages and broad feature direction.
  • WORKPLAN.md records the current selected milestone and implementation ordering.
  • REVIEW_FINDINGS.md records open review findings and verification history.
  • REVIEW.md defines review expectations.
  • Makefile builds pinned tools, userspace binaries, manifests, ISO images, QEMU targets, formatting checks, generated-code checks, and policy checks.
  • rust-toolchain.toml pins the Rust toolchain.
  • .cargo/config.toml sets the default bare-metal target and useful cargo aliases.

Schema and Shared ABIs

  • schema/capos.capnp defines capability interfaces, manifest structures, exceptions, ProcessSpawner, ProcessHandle, and transfer-related schema.
  • capos-config/src/manifest.rs defines the host and no_std manifest model.
  • capos-config/src/ring.rs defines CapRingHeader, SQE/CQE structures, opcodes, flags, and transport error constants shared by kernel and userspace.
  • capos-config/src/capset.rs defines the read-only bootstrap CapSet ABI.
  • capos-config/src/cue.rs supports evaluated CUE-style manifest data.
  • capos-config/tests/ring_loom.rs models bounded ring protocol behavior with Loom.

Validation: cargo test-config, cargo test-ring-loom, make generated-code-check.

Shared Pure Logic

  • capos-lib/src/elf.rs parses ELF64 images for kernel loading and host tests.
  • capos-lib/src/cap_table.rs implements CapId, capability-table storage, stale-generation checks, grant preparation, transfer transaction helpers, commit, and rollback.
  • capos-lib/src/frame_bitmap.rs implements the host-testable physical frame bitmap core.
  • capos-lib/src/frame_ledger.rs tracks outstanding FrameAllocator grants.
  • capos-lib/src/lazy_buffer.rs provides bounded lazy buffers used by ring scratch paths.

Validation: cargo test-lib, cargo miri-lib, make kani-lib, fuzz targets under fuzz/fuzz_targets/.
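As a flavor of the pure logic that runs under cargo test-lib, here is a minimal frame-bitmap sketch. The names and layout are invented for illustration, not the real capos-lib module.

```rust
// Toy physical-frame bitmap: one bit per frame, 1 = allocated. The real
// capos-lib core is host-testable in the same way: no kernel scaffolding.

struct FrameBitmap {
    bits: Vec<u64>,
}

impl FrameBitmap {
    fn new(frames: usize) -> Self {
        Self { bits: vec![0; (frames + 63) / 64] }
    }

    /// Find and claim the first free frame, returning its index.
    fn alloc(&mut self) -> Option<usize> {
        for (word_idx, word) in self.bits.iter_mut().enumerate() {
            if *word != u64::MAX {
                let bit = (!*word).trailing_zeros() as usize;
                *word |= 1 << bit;
                return Some(word_idx * 64 + bit);
            }
        }
        None // out of physical frames
    }

    fn free(&mut self, frame: usize) {
        self.bits[frame / 64] &= !(1 << (frame % 64));
    }
}

fn main() {
    let mut bm = FrameBitmap::new(128);
    let a = bm.alloc().unwrap(); // frame 0
    let b = bm.alloc().unwrap(); // frame 1
    bm.free(a);
    let c = bm.alloc().unwrap(); // the freed frame is handed out again
    println!("{a} {b} {c}"); // prints "0 1 0"
}
```

Because the structure is plain data with no unsafe or hardware access, the same code path can run under Miri, proptest, and Kani unchanged.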

Kernel

  • kernel/src/main.rs is the boot entry point, hardware setup sequence, manifest parsing path, and boot-launched service creation path.
  • kernel/src/spawn.rs loads user ELF images, creates process state, maps bootstrap pages, and enqueues spawned processes.
  • kernel/src/process.rs defines Process, process states, kernel stacks, and initial userspace CPU context.
  • kernel/src/sched.rs implements the single-CPU scheduler, timer-driven preemption, blocking cap_enter, direct IPC handoff, and deferred cancellation wakeups.
  • kernel/src/serial.rs implements COM1 output and kernel print macros.
  • kernel/src/pci.rs implements the current QEMU virtio-net PCI enumeration smoke path.

Validation: cargo build --features qemu, make run, make run-spawn, make run-net.

Kernel Architecture

  • kernel/src/arch/x86_64/gdt.rs sets up kernel/user segments and TSS state.
  • kernel/src/arch/x86_64/idt.rs handles exceptions and timer interrupts.
  • kernel/src/arch/x86_64/syscall.rs implements syscall MSR setup and entry.
  • kernel/src/arch/x86_64/context.rs defines timer context-switch state.
  • kernel/src/arch/x86_64/pic.rs and pit.rs configure legacy interrupt hardware.
  • kernel/src/arch/x86_64/smap.rs enables SMEP/SMAP and brackets user memory access.
  • kernel/src/arch/x86_64/tls.rs handles FS-base/TLS support.
  • kernel/src/arch/x86_64/pci_config.rs provides legacy PCI config I/O.

Kernel Memory

  • kernel/src/mem/frame.rs wraps the shared frame bitmap with Limine memory map initialization and global kernel access.
  • kernel/src/mem/paging.rs manages page tables, address spaces, permissions, user mappings, W^X enforcement, and address-space teardown.
  • kernel/src/mem/heap.rs initializes the kernel heap.
  • kernel/src/mem/validate.rs validates user buffers before kernel access.

Related docs: DMA Isolation, Trusted Build Inputs.

Kernel Capabilities

  • kernel/src/cap/mod.rs initializes kernel capabilities and resolves manifest service capability tables.
  • kernel/src/cap/table.rs re-exports shared capability-table logic and owns the kernel-global table.
  • kernel/src/cap/ring.rs validates and dispatches ring SQEs.
  • kernel/src/cap/transfer.rs validates transfer descriptors and prepares transfer transactions.
  • kernel/src/cap/endpoint.rs implements Endpoint CALL, RECV, RETURN, queued state, cleanup, and cancellation behavior.
  • kernel/src/cap/console.rs implements serial Console.
  • kernel/src/cap/frame_alloc.rs implements FrameAllocator.
  • kernel/src/cap/virtual_memory.rs implements per-process anonymous memory operations.
  • kernel/src/cap/process_spawner.rs implements ProcessSpawner and ProcessHandle.
  • kernel/src/cap/null.rs implements the measurement-only NullCap.

Related docs: Capability Model, Authority Accounting.

Userspace

  • init/ is the standalone init process. In the spawn smoke, it uses ProcessSpawner, grants initial child capabilities, waits on ProcessHandles, and checks hostile spawn inputs.
  • capos-rt/src/entry.rs owns the runtime entry path and bootstrap validation.
  • capos-rt/src/alloc.rs initializes the userspace heap.
  • capos-rt/src/syscall.rs provides raw syscall wrappers.
  • capos-rt/src/capset.rs provides typed CapSet lookup helpers.
  • capos-rt/src/ring.rs implements the safe single-owner ring client, out-of-order completion handling, transfer descriptor packing, and result-cap parsing.
  • capos-rt/src/client.rs implements typed clients for Console, ProcessSpawner, and ProcessHandle.
  • capos-rt/src/bin/smoke.rs is the runtime smoke binary packaged by the default manifest.

Validation: make capos-rt-check, make run, make run-spawn.

Demo Services

demos/ is a nested userspace smoke-test workspace. Each demo is a release-built service binary packaged into the boot manifest:

  • capset-bootstrap
  • console-paths
  • ring-corruption
  • ring-reserved-opcodes
  • ring-nop
  • ring-fairness
  • unprivileged-stranger
  • tls-smoke
  • virtual-memory
  • frame-allocator-cleanup
  • endpoint-roundtrip
  • ipc-server
  • ipc-client

Shared demo support lives in demos/capos-demo-support/src/lib.rs.

Validation: make run, make run-spawn.

Manifest and Tooling

  • system.cue is the default manifest source.
  • system-spawn.cue is the ProcessSpawner smoke manifest source.
  • tools/mkmanifest/ evaluates manifest input, embeds binaries, validates manifest shape, and writes Cap’n Proto bytes.
  • tools/check-generated-capnp.sh verifies checked-in generated schema output.
  • tools/qemu-net-harness.sh runs the current QEMU net harness.
  • fuzz/ contains fuzz targets for manifest Cap’n Proto decoding, mkmanifest JSON conversion/validation, and ELF parsing.

Validation: cargo test-mkmanifest, make generated-code-check, make fuzz-build, make fuzz-smoke.

Documentation

  • docs/capability-model.md is the current capability architecture reference.
  • docs/*-design.md files record targeted implemented or accepted designs.
  • docs/proposals/ contains accepted, future, exploratory, and rejected designs.
  • docs/research.md and docs/research/ summarize prior art.
  • docs/proposals/mdbook-docs-site-proposal.md defines the documentation site structure and status vocabulary used by these Start Here pages.

Boot Flow

Boot flow defines the trusted path from firmware-owned machine state to the first user processes. It establishes memory management, interrupt/syscall entry, capability tables, process rings, and the boot manifest authority graph.

Status: Partially implemented. Limine boot, kernel initialization, manifest parsing, ELF loading, process creation, and QEMU halt-on-success are implemented. The current default boot still lets the kernel interpret the whole service graph. The selected milestone in WORKPLAN.md is to move manifest graph execution into init.

Current Behavior

Firmware loads Limine, Limine loads the kernel and exactly one module, and the kernel treats that module as a Cap’n Proto SystemManifest. The kernel rejects boots with any module count other than one.

kmain initializes serial output, x86_64 descriptor tables, memory, paging, SMEP/SMAP, the kernel capability table, the idle process, PIC, and PIT. It then parses and validates the manifest, loads each service ELF into a fresh AddressSpace, builds per-service capability tables and read-only CapSet pages, enqueues the processes, and starts the scheduler.

flowchart TD
    Firmware[UEFI or QEMU firmware] --> Limine[Limine bootloader]
    Limine --> Kernel[kmain]
    Limine --> Module[manifest.bin boot module]
    Kernel --> Arch[serial, GDT, IDT, syscall MSRs]
    Kernel --> Memory[frame allocator, heap, paging, SMEP/SMAP]
    Kernel --> Manifest[parse and validate SystemManifest]
    Manifest --> Images[parse and map service ELFs]
    Manifest --> Caps[build CapTables and CapSet pages]
    Images --> Processes[create Process structs and rings]
    Caps --> Processes
    Processes --> Scheduler[start round-robin scheduler]
    Scheduler --> User[enter first user process]

The invariant is that no user service starts until manifest binary references, authority graph structure, and bootstrap capability source/interface checks have passed.
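That invariant amounts to a single validation gate in front of all process creation. A hypothetical host-side sketch (the real path is kmain in kernel/src/main.rs and the validators in capos-config):

```rust
// Sketch of validate-before-spawn. Types and checks are invented stand-ins
// for the manifest graph/source/binary validation the kernel performs.

struct Service {
    name: String,
    binary: Option<Vec<u8>>, // embedded ELF bytes, if present
}

struct Manifest {
    services: Vec<Service>,
}

fn validate(m: &Manifest) -> Result<(), String> {
    for svc in &m.services {
        // Every declared service must reference an embedded binary.
        if svc.binary.is_none() {
            return Err(format!("{}: missing binary", svc.name));
        }
    }
    Ok(())
}

fn boot(m: &Manifest) -> Result<usize, String> {
    validate(m)?; // no service starts unless the whole graph passes
    Ok(m.services.len()) // stand-in for "spawn every service"
}

fn main() {
    let m = Manifest {
        services: vec![Service {
            name: String::from("init"),
            binary: Some(vec![0x7f, b'E', b'L', b'F']),
        }],
    };
    println!("spawned {} services", boot(&m).unwrap());
}
```

The ordering is the point: a manifest that fails any check spawns zero services rather than a partial graph.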

Design

The boot path is deliberately single-shot. The kernel receives a single packed manifest and validates the graph before creating any process. ELF parsing is cached per binary name, but each service gets its own address space, user stack, TLS mapping if present, ring page, and CapSet mapping.

The default manifest (system.cue) packages init, capos-rt-smoke, and the demo services directly. The spawn manifest (system-spawn.cue) packages only init as an initial service and grants it ProcessSpawner plus endpoint caps; init then spawns selected child services.

Future behavior is narrower: the kernel should start only init with fixed bootstrap authority and a manifest or boot-package capability. init should validate and execute the service graph through ProcessSpawner.

Invariants

  • Limine must provide exactly one boot module, and that module is the manifest.
  • Manifest validation must complete before any declared service is enqueued.
  • Service ELF load failures roll back frame allocations before boot continues or fails.
  • Kernel page tables are active and HHDM user access is stripped before SMEP/SMAP are enabled.
  • The kernel passes _start(ring_addr, pid, capset_addr) in RDI, RSI, and RDX.
  • CapSet metadata is read-only user memory; the ring page is writable user memory.
  • QEMU-feature boots halt through isa-debug-exit when no runnable processes remain.

Code Map

  • kernel/src/main.rs - kmain, manifest module handling, validation, service image loading, process enqueue, halt path.
  • kernel/src/spawn.rs - ELF-to-address-space loading, fixed user stack, TLS mapping, Process construction helpers.
  • kernel/src/process.rs - process bootstrap context, ring page mapping, CapSet page mapping.
  • kernel/src/cap/mod.rs - manifest capability resolution and CapSet entry construction.
  • capos-config/src/manifest.rs - manifest decode, schema-version guardrails, and graph/source/binary validation.
  • tools/mkmanifest/src/lib.rs - host-side manifest validation and binary embedding.
  • system.cue and system-spawn.cue - default and spawn-focused boot graphs.
  • limine.conf and Makefile - bootloader config, ISO construction, QEMU targets.

Validation

  • make run validates the default manifest, kernel-side service startup, process creation, scheduler entry, and clean QEMU halt.
  • make run-spawn validates init-owned spawning through ProcessSpawner for the current transition path.
  • cargo test-config covers manifest decode, roundtrip, and validation logic.
  • cargo test-mkmanifest covers host-side manifest conversion and embedding checks.
  • make generated-code-check verifies checked-in Cap’n Proto generated output.

Open Work

  • Move default service graph interpretation from the kernel into init.
  • Retire kernel-side service graph wiring after default make run proves the init-owned path.

Process Model

The process model defines how capOS represents isolated user programs, how they receive authority, how they enter and leave the scheduler, and how a parent can observe a child.

Status: Partially implemented. Processes, isolated address spaces, ELF loading, fixed bootstrap ABI, exit cleanup, process handles, and init-driven child spawning are implemented. Restart policy, kill, generic post-spawn grants, and init-side manifest graph execution remain open.

Current Behavior

A Process owns a user address space, a per-process capability table, a ring scratch area, a kernel stack, a saved CPU context, a mapped capability ring, and an optional read-only CapSet page. Process IDs are assigned by an atomic counter.

ELF images are loaded into fresh user address spaces. PT_LOAD segments are mapped with page permissions derived from ELF flags, the user stack is fixed at 0x40_0000, and PT_TLS data is mapped into a per-process TLS area below the ring page. The process starts from a synthetic CpuContext that returns to Ring 3 with iretq.

ProcessSpawner lets a holder spawn packaged boot binaries, grant selected caps to the child, and receive a non-transferable ProcessHandle result cap. ProcessHandle.wait either completes immediately for an already-exited child or registers one waiter.

Design

Process construction separates image loading from capability-table assembly. The kernel first maps all boot-launched service images, then builds capability tables for all services so service-sourced caps can resolve against declared exports. Spawned children use the same image loading and Process creation helpers, but their grants are supplied by the calling process through ProcessSpawner.

Each process starts with three machine arguments:

  • RDI - fixed ring virtual address (RING_VADDR).
  • RSI - process ID.
  • RDX - fixed CapSet virtual address, or zero if no CapSet is mapped.
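Because the System V x86-64 calling convention passes the first three integer arguments in exactly RDI, RSI, and RDX, a C-ABI entry point receives the kernel's three values as ordinary parameters. The function below illustrates that convention only; the real entry in capos-rt/src/entry.rs also performs bootstrap validation.

```rust
// Illustrative entry signature: the kernel loads RDI, RSI, RDX before
// iretq, and a C-ABI function sees them as its first three arguments.
// The body is a hypothetical stand-in, not the real runtime logic.

extern "C" fn start_args(ring_addr: u64, pid: u64, capset_addr: u64) -> u64 {
    // A capset_addr of zero means no CapSet page was mapped.
    if capset_addr == 0 {
        return 0;
    }
    // Stand-in for "record bootstrap state and continue to main".
    ring_addr ^ pid ^ capset_addr
}

fn main() {
    // No CapSet mapped: the runtime sees zero in the third argument.
    println!("{}", start_args(0x50_0000, 7, 0)); // prints "0"
}
```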

Exit releases authority before the Process storage is dropped. The scheduler switches to the kernel page table before address-space teardown, cancels endpoint state for the exiting pid, completes any pending process waiter, and defers the final process drop until execution is on another kernel stack.

Future process lifecycle work should keep authority transfer explicit: parents should not gain ambient access to child internals, and child grants should come from named caps plus interface checks.

Invariants

  • A process cannot access a resource unless its local CapTable holds a cap.
  • Bootstrap CapSet metadata is immutable from userspace.
  • A stale CapId generation must not name a reused cap-table slot.
  • ProcessSpawner raw grants require a copy-transferable cap or an endpoint owner cap; client-endpoint grants attenuate endpoint authority.
  • ProcessSpawner kernel-source grants are limited to fresh child-local address-space-bound caps; they cannot be badged or exported from init.
  • ProcessHandle caps are non-transferable.
  • At most one waiter may be registered on a ProcessHandle.
  • Process exit releases cap-table authority before the kernel stack frame is freed.
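The single-waiter and immediate-completion rules for ProcessHandle.wait can be modeled with hypothetical types (the real state lives in kernel/src/sched.rs and kernel/src/cap/process_spawner.rs):

```rust
// Toy model of ProcessHandle.wait semantics: an exited child completes
// immediately, and only one waiter may ever be registered.

struct Handle {
    exit_code: Option<i32>,       // Some(..) once the child has exited
    waiter: Option<&'static str>, // at most one registered waiter
}

enum WaitResult {
    Done(i32),     // child already exited: complete immediately
    Pending,       // waiter registered; completion arrives on child exit
    AlreadyWaited, // invariant: a second registration is rejected
}

fn wait(h: &mut Handle, who: &'static str) -> WaitResult {
    if let Some(code) = h.exit_code {
        return WaitResult::Done(code);
    }
    if h.waiter.is_some() {
        return WaitResult::AlreadyWaited;
    }
    h.waiter = Some(who);
    WaitResult::Pending
}

fn main() {
    let mut h = Handle { exit_code: None, waiter: None };
    let _ = wait(&mut h, "parent"); // registers the one allowed waiter
    match wait(&mut h, "second") {
        WaitResult::AlreadyWaited => println!("second waiter rejected"),
        _ => println!("unexpected"),
    }
}
```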

Code Map

  • kernel/src/process.rs - Process, bootstrap CPU context, ring/CapSet mapping, exit capability cleanup.
  • kernel/src/spawn.rs - ELF mapping, stack mapping, TLS mapping, process construction helpers.
  • kernel/src/sched.rs - process table, process handles, wait completion, exit path.
  • kernel/src/cap/process_spawner.rs - ProcessSpawnerCap, ProcessHandleCap, spawn grant validation, child-local kernel grants, child CapSet construction.
  • capos-lib/src/cap_table.rs - CapId generation and cap-table operations.
  • capos-config/src/capset.rs - fixed CapSet page ABI.
  • schema/capos.capnp - ProcessSpawner, ProcessHandle, and CapGrant.
  • init/src/main.rs - current init-side spawn smoke and hostile spawn checks.

Validation

  • make run validates kernel-launched service processes, CapSet bootstrap, exit cleanup, and clean halt.
  • make run-spawn validates ProcessSpawner, ProcessHandle.wait, child grants, init-spawned IPC demos, and hostile spawn failures.
  • cargo test-lib covers CapTable generation, stale-slot, and transfer primitives.
  • cargo test-config covers CapSet and manifest metadata used to build process grants.
  • cargo build --features qemu verifies the kernel and QEMU-only paths compile.

Open Work

  • Make default boot launch only init and execute the validated service graph through init.
  • Add lifecycle operations such as kill and post-spawn grants only after their authority semantics are explicit.
  • Implement restart policy outside the kernel-side static boot graph.

Capability Model

How capabilities work in capOS.

Status: Partially implemented. Generation-tagged cap tables, typed schema interface IDs, manifest/CapSet grants, badges, transport-level release, and Endpoint copy/move transfer are implemented. Revocation propagation, persistence, and bulk-data capabilities remain future work.

What is a Capability

A capability in capOS is a reference to a kernel object that carries:

  • An interface (what methods can be called), defined by a Cap’n Proto schema
  • A permission (the object it references, enforced by the kernel)
  • A wire format (Cap’n Proto serialized messages for all invocations)

A process can only access a resource if it holds a capability to it. There is no ambient authority – no global namespace, no “open by path” syscall, no implicit resource access.

Schema as Contract

Capability interfaces are defined in .capnp schema files under schema/. The schema is the canonical interface definition. Currently defined:

interface Console {
    write @0 (data :Data) -> ();
    writeLine @1 (text :Text) -> ();
}

interface FrameAllocator {
    allocFrame @0 () -> (physAddr :UInt64);
    freeFrame @1 (physAddr :UInt64) -> ();
    allocContiguous @2 (count :UInt32) -> (physAddr :UInt64);
}

interface VirtualMemory {
    map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, size :UInt64) -> ();
    protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}

interface Endpoint {}

interface ProcessSpawner {
    spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}

interface ProcessHandle {
    wait @0 () -> (exitCode :Int64);
}

interface BootPackage {
    manifestSize @0 () -> (size :UInt64);
    readManifest @1 (offset :UInt64, maxBytes :UInt32) -> (data :Data);
}

# Management-only introspection. Ordinary handle release uses the system
# transport opcode CAP_OP_RELEASE, not a method here.
interface CapabilityManager {
    list @0 () -> (capabilities :List(CapabilityInfo));
    # grant is planned for Stage 6 (IPC and Capability Transfer)
}

Each interface has a unique 64-bit TYPE_ID generated by the Cap’n Proto compiler. TYPE_ID is the schema constant; interface_id is the runtime metadata used by CapSet/bootstrap descriptions and endpoint delivery headers. Method dispatch uses the interface assigned to the capability entry plus a method_id, which selects a method within that schema.

A TYPE_ID is not capability identity. A CapId is the authority-bearing handle in a process's capability table, analogous to a file descriptor. Multiple capabilities can expose the same interface:

  • cap_id=3 -> serial-backed Console
  • cap_id=4 -> log-buffer-backed Console
  • cap_id=5 -> Console proxy served by another process

All three use the same Console TYPE_ID, but they are different objects with different authority. The manifest/CapSet should record the expected schema TYPE_ID as interface metadata for typed handle construction. Normal CALL SQEs do not need to repeat it because the kernel or serving transport can derive it from the target capability entry. CapSqe keeps reserved tail padding for ABI stability.

The kernel exposes the initial CapSet to each process as a read-only 4 KiB page mapped at capos_config::capset::CAPSET_VADDR and passes its address in RDX to _start. The page starts with a CapSetHeader { magic, version, count } and is followed by CapSetEntry { cap_id, name_len, interface_id, name: [u8; 32] } records in manifest declaration order. Userspace looks up caps by the manifest name rather than by numeric index (capos_config::capset::find), so grants can be reordered in system.cue without breaking clients. The mapping is installed without WRITABLE so userspace cannot mutate its own bootstrap authority map.
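
As a concrete illustration, here is a host-side sketch of the name-based lookup the runtime performs over this page. Field widths, the magic value, and the Vec-backed page are illustrative stand-ins for the real packed ABI in capos_config::capset, not the actual layout:

```rust
// Sketch of name-based CapSet lookup, assuming the layout described above.
// Field widths and the magic value are illustrative, not the real ABI.

#[derive(Clone, Copy)]
struct CapSetEntry {
    cap_id: u32,
    name_len: u8,
    interface_id: u64,
    name: [u8; 32],
}

struct CapSetPage {
    magic: u32,
    version: u16,
    entries: Vec<CapSetEntry>, // `count` follows from entries.len()
}

const CAPSET_MAGIC: u32 = 0xCA95_E700; // hypothetical value

/// Validate the header, then scan entries in declaration order for a
/// matching manifest name; return (cap_id, interface_id) on a hit.
fn find(page: &CapSetPage, name: &str) -> Option<(u32, u64)> {
    if page.magic != CAPSET_MAGIC || page.version != 1 {
        return None; // fail closed on an unrecognized bootstrap page
    }
    page.entries.iter().find_map(|e| {
        let len = (e.name_len as usize).min(e.name.len());
        (&e.name[..len] == name.as_bytes()).then(|| (e.cap_id, e.interface_id))
    })
}

fn sample_page() -> CapSetPage {
    let mut name = [0u8; 32];
    name[..7].copy_from_slice(b"console");
    CapSetPage {
        magic: CAPSET_MAGIC,
        version: 1,
        entries: vec![CapSetEntry { cap_id: 3, name_len: 7, interface_id: 0xAB, name }],
    }
}

fn main() {
    assert_eq!(find(&sample_page(), "console"), Some((3, 0xAB)));
    assert_eq!(find(&sample_page(), "missing"), None);
}
```

Because lookup is by name, reordering grants in system.cue changes entry order but not what `find` returns for a given client.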

Security invariant: a CapTable entry exposes one public interface. If the same backing state must be available through multiple interfaces, mint multiple capability entries, each wrapping the same state with a narrower interface. Do not grant one handle that accepts unrelated interface_id values; that makes hidden authority easy to miss during review.

Invocation Path

Capabilities are invoked via a shared-memory capability ring (io_uring-inspired). Each process has a submission queue (SQ) and completion queue (CQ) mapped into its address space. Two invocation paths exist:

Caller builds capnp params message
    → serialize to bytes (write_message_to_words)
    → write CALL SQE to SQ ring (pure userspace memory write)
    → advance SQ tail
    → caller invokes cap_enter for ordinary capability methods
      (timer polling only runs explicitly interrupt-safe CALL targets)
    → kernel reads SQE, validates user buffers
    → CapTable.call(cap_id, method_id, bytes)
    → kernel writes CQE to CQ ring
    ... caller reads CQE after cap_enter, or spin-polls only for
        interrupt-safe/non-CALL ring work ...
    → caller reads CQE result

CapObject::call does not receive a caller-supplied interface ID. The cap table derives the invoked interface from the target entry before invoking the object. The SQE carries only the capability handle and method ID because each capability entry owns one public interface:

pub trait CapObject: Send + Sync {
    fn interface_id(&self) -> u64;
    fn label(&self) -> &str;
    fn call(
        &self,
        method_id: u16,
        params: &[u8],
        result: &mut [u8],
        reply_scratch: &mut dyn ReplyScratch,
    ) -> capnp::Result<CapInvokeResult>;
}

All communication goes through serialized capnp messages, even when caller and callee are in the same address space. This ensures the wire format is always exercised and makes the transition to cross-address-space IPC seamless.

The result buffer is supplied by the caller (the user-validated SQE result region). Implementations serialize directly into it and return the number of bytes written, so the kernel’s dispatch path does not allocate an intermediate Vec<u8> per invocation.

Capability Table

Each process has its own capability table (CapTable), created at process startup. The kernel also maintains a global table (KERNEL_CAPS) for kernel-internal use. Each table maps a CapId (u32) to a boxed CapObject.

CapId encoding: [generation:8 | index:24]. The generation counter increments when a slot is freed, so stale CapIds (from a previous occupant of the slot) are rejected with CapError::StaleGeneration rather than accidentally referring to a different capability.
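
The encoding and the stale-generation check can be sketched directly. Only the bit layout comes from the text above; the helper names are illustrative:

```rust
// Sketch of the [generation:8 | index:24] CapId encoding described above.

const INDEX_BITS: u32 = 24;
const INDEX_MASK: u32 = (1 << INDEX_BITS) - 1;

fn encode(generation: u8, index: u32) -> u32 {
    debug_assert!(index <= INDEX_MASK);
    ((generation as u32) << INDEX_BITS) | (index & INDEX_MASK)
}

fn generation(id: u32) -> u8 {
    (id >> INDEX_BITS) as u8
}

fn index(id: u32) -> usize {
    (id & INDEX_MASK) as usize
}

/// A stale CapId names the right slot but carries the wrong generation.
fn is_stale(id: u32, slot_generation: u8) -> bool {
    generation(id) != slot_generation
}

fn main() {
    let id = encode(2, 7);
    assert_eq!(generation(id), 2);
    assert_eq!(index(id), 7);
    // After the slot is freed and reused, its generation has been bumped,
    // so the old handle is rejected instead of aliasing the new occupant.
    assert!(is_stale(id, 3));
    assert!(!is_stale(id, 2));
}
```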

Operations:

  • insert(obj) – register a new capability, returns its CapId
  • get(id) – look up a capability by ID (validates generation)
  • remove(id) – revoke a capability, bumps slot generation
  • call(id, method_id, params) – dispatch a method call against the interface assigned to the capability entry

Each service receives capabilities from cap::create_all_service_caps(), which runs a two-pass resolution over the whole manifest. Pass 1 materializes each service’s kernel-sourced caps as Arc<dyn CapObject> and records its declared exports; pass 2 assembles each service’s CapTable in declaration order, cloning the exported Arc when another service’s CapRef resolves via CapSource::Service. Declaration order is preserved because numeric CapIds are assigned by insertion order and smoke tests depend on specific indices. CapRef.source is a structured capnp union, not an authority string:

struct CapRef {
    name @0 :Text;
    expectedInterfaceId @1 :UInt64;
    union {
        unset @2 :Void; # invalid; keeps omitted sources fail-closed
        kernel @3 :KernelCapSource;
        service @4 :ServiceCapSource;
    }
}

enum KernelCapSource {
    console @0;
    endpoint @1;
    frameAllocator @2;
    virtualMemory @3;
}

struct ServiceCapSource {
    service @0 :Text;
    export @1 :Text;
}

The source selector chooses the object or authority to grant. The expectedInterfaceId value is a schema compatibility check against the constructed object, not the authority selector itself. This distinction matters because different objects can implement the same interface.
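
A minimal host-side model of the two-pass resolution can make the Arc-sharing concrete. The manifest and capability types here are simplified stand-ins (a single Console object, string names, no interface-ID check), not the kernel's real structures:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Minimal model of two-pass grant resolution: pass 1 materializes
// kernel-sourced caps as exports, pass 2 assembles tables in declaration
// order, cloning exported Arcs for service-sourced refs.

trait CapObject: Send + Sync {}

struct Console;
impl CapObject for Console {}

enum CapSource {
    Kernel,                                      // kernel-constructed object
    Service { service: String, export: String }, // another service's export
}

struct CapRef { name: String, source: CapSource }
struct ServiceDecl { name: String, caps: Vec<CapRef> }

fn resolve(manifest: &[ServiceDecl]) -> HashMap<String, Vec<Arc<dyn CapObject>>> {
    // Pass 1: materialize kernel-sourced caps and record them as exports.
    let mut exports: HashMap<(String, String), Arc<dyn CapObject>> = HashMap::new();
    for svc in manifest {
        for cap in &svc.caps {
            if let CapSource::Kernel = cap.source {
                exports.insert((svc.name.clone(), cap.name.clone()), Arc::new(Console));
            }
        }
    }
    // Pass 2: assemble each table in declaration order; CapIds would
    // follow insertion order.
    let mut tables = HashMap::new();
    for svc in manifest {
        let table = svc.caps.iter().map(|cap| match &cap.source {
            CapSource::Kernel => exports[&(svc.name.clone(), cap.name.clone())].clone(),
            CapSource::Service { service, export } =>
                exports[&(service.clone(), export.clone())].clone(),
        }).collect();
        tables.insert(svc.name.clone(), table);
    }
    tables
}

/// True when importer and exporter end up holding the same backing object.
fn shares_backing_object() -> bool {
    let manifest = vec![
        ServiceDecl {
            name: "logger".into(),
            caps: vec![CapRef { name: "console".into(), source: CapSource::Kernel }],
        },
        ServiceDecl {
            name: "app".into(),
            caps: vec![CapRef {
                name: "log".into(),
                source: CapSource::Service { service: "logger".into(), export: "console".into() },
            }],
        },
    ];
    let tables = resolve(&manifest);
    Arc::ptr_eq(&tables["logger"][0], &tables["app"][0])
}

fn main() {
    assert!(shares_backing_object());
}
```

The Arc::ptr_eq check is the point: a service-sourced grant aliases the exporter's object rather than constructing a second one.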

Transport-Level Capability Lifetime

Cap’n Proto applications do not usually model capability lifetime as an application method on every interface. The RPC transport owns capability reference bookkeeping.

The standard Cap’n Proto RPC protocol is stateful per connection. Each side keeps four tables: questions, answers, imports, and exports. Import/export IDs are connection-local, not global object names. When an exported capability is sent over the connection, the export reference count is incremented. When the importing side drops its last local reference, the transport sends Release to decrement the remote export count. Implementations may batch these releases. If the connection is lost, in-flight questions fail, imports become broken, and exports/answers are implicitly released. Persistent capabilities, when implemented, are a separate SturdyRef mechanism and should not be treated as owned pointers.

This distinction matters for capOS:

  • close() is application protocol. A File.close() method can flush dirty state, commit metadata, or tell a server that a session should end.
  • Release / cap drop is transport protocol. It removes one reference from the caller’s local capability namespace and eventually lets the serving side reclaim the object if no references remain.
  • Process exit is bulk transport cleanup. Dropping the process must release all caps in its table, cancel pending calls, and wake peers waiting on those calls.

capOS therefore needs a system transport layer in the userspace runtime (capos-rt / later language runtimes), not just raw SQE helpers. That transport should own typed client handles, local reference counts, promise-pipelined answers, and broken-cap state. When the last local handle is dropped, it should submit a transport-level release operation to the kernel ring.

Ordinary handle release is a transport concern, not an application method. The target design: the generated client drops the last local handle (RAII / GC / finalizer), the runtime transport submits the CAP_OP_RELEASE ring opcode, and the kernel removes the caller’s CapTable slot with mutable access to that table. Encoding release as a regular method call on CapabilityManager was rejected because it would mutate the same table used to dispatch the call; CapabilityManager is therefore management-only (list(), later grant()), not the default release path. CAP_OP_FINISH remains reserved in the same transport opcode namespace for application-level “end of work” signals that the transport must deliver reliably, so the kernel can tell them apart from a truly malformed opcode.

Current status: the kernel dispatches CAP_OP_RELEASE as a local cap-table slot removal and fails closed for stale or non-owned cap IDs. capos-rt bootstrap handles remain explicitly non-owning, while adopted owned handles queue CAP_OP_RELEASE on final drop. Result-cap adoption validates the kernel-supplied interface ID before producing an owned typed handle. CAP_OP_FINISH remains reserved and returns CAP_ERR_UNSUPPORTED_OPCODE. Process exit remains the fallback cleanup path for unreleased local slots.

Access Control: Interfaces, Not Rights Bitmasks

capOS deliberately does not use a rights bitmask (READ/WRITE/EXECUTE) on capability entries, despite this being standard in Zircon and seL4. The reason is that Cap’n Proto typed interfaces already serve as the access control mechanism, and a parallel rights system creates an impedance mismatch.

Why rights bitmasks exist in other systems: Zircon and seL4 use rights because their syscall interfaces are untyped – a handle is an opaque reference to a kernel object, and the kernel needs something to decide which fixed syscalls are allowed. capOS has typed interfaces where the .capnp schema defines exactly what methods exist.

capOS’s approach: the interface IS the permission. To restrict what a caller can do, grant a narrower capability:

  • Fetch (full HTTP) → HttpEndpoint (scoped to one origin)
  • Store (read-write) → Store wrapper that rejects write methods
  • Namespace (full) → Namespace scoped to a prefix

The “restricted” capability is a different CapObject implementation that wraps the original. The kernel doesn’t know or care – it dispatches to whatever CapObject is in the slot. Attenuation is userspace/schema logic, not a kernel mechanism.
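
A sketch of attenuation by wrapping, under assumed method IDs and a toy string-keyed Store standing in for Cap’n Proto serialization:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// A read-only facet is a different CapObject over the same backing state
// that rejects mutating method IDs. Method IDs, the Store shape, and the
// "key=value" params encoding are illustrative.

const METHOD_GET: u16 = 0;
const METHOD_PUT: u16 = 1;

trait CapObject: Send + Sync {
    fn call(&self, method_id: u16, params: &str) -> Result<String, &'static str>;
}

struct StoreState { map: Mutex<HashMap<String, String>> }

struct FullStore(Arc<StoreState>);
impl CapObject for FullStore {
    fn call(&self, method_id: u16, params: &str) -> Result<String, &'static str> {
        let mut map = self.0.map.lock().unwrap();
        match method_id {
            METHOD_GET => Ok(map.get(params).cloned().unwrap_or_default()),
            METHOD_PUT => {
                let (k, v) = params.split_once('=').ok_or("bad params")?;
                map.insert(k.to_string(), v.to_string());
                Ok(String::new())
            }
            _ => Err("unknown method"),
        }
    }
}

// Same state, narrower interface: a dispatcher would invoke this CapObject
// without knowing it is an attenuated facet.
struct ReadOnlyStore(Arc<StoreState>);
impl CapObject for ReadOnlyStore {
    fn call(&self, method_id: u16, params: &str) -> Result<String, &'static str> {
        match method_id {
            METHOD_GET => FullStore(self.0.clone()).call(METHOD_GET, params),
            _ => Err("method not permitted on this facet"),
        }
    }
}

fn demo() -> (String, bool) {
    let state = Arc::new(StoreState { map: Mutex::new(HashMap::new()) });
    let full = FullStore(state.clone());
    let ro = ReadOnlyStore(state);
    full.call(METHOD_PUT, "k=v").unwrap();
    // Reads pass through; writes are rejected by the facet, not the kernel.
    (ro.call(METHOD_GET, "k").unwrap(), ro.call(METHOD_PUT, "k=evil").is_err())
}

fn main() {
    assert_eq!(demo(), ("v".to_string(), true));
}
```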

When transfer control is needed (Stage 6): meta-rights for the capability itself (can it be transferred? duplicated?) may be added as a small bitmask. These are about the reference, not the referenced object, and don’t overlap with interface-level method access control.

See research.md for the cross-system analysis that led to this decision (§1 Capability Table Design).

Planned Enhancements (from research)

Tracked in ROADMAP.md Stages 5-6:

  • Badge (from seL4) – u64 value per capability entry, delivered to the server on invocation. Implemented for manifest cap refs, IPC transfer, and ProcessSpawner endpoint-client minting so servers can distinguish callers without separate capability objects per client.
  • Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.

Current Limitations

  • Blocking wait exists, but waits are still process-level. cap_enter(min_complete, timeout_ns) processes pending SQEs and can block the current process until enough CQEs exist or a finite timeout expires. It is not yet a general futex/thread wait primitive; in-process threading and futex-shaped measurements are tracked separately.
  • No persistence. Capabilities exist only at runtime.
  • Capability transfer is implemented for Endpoint CALL/RECV/RETURN. Transfer descriptors on the capability ring let callers and receivers copy or move transferable local caps through IPC messages. See storage-and-naming-proposal.md “IPC and Capability Transfer” for the full design.
  • Transfer ABI (3.6.0 draft). Sideband transfer descriptors are defined in capos-config/src/ring.rs as CapTransferDescriptor:
    • cap_id is the sender-side local capability-table handle.
    • transfer_mode is either CAP_TRANSFER_MODE_COPY or CAP_TRANSFER_MODE_MOVE.
    • xfer_cap_count in CapSqe is the descriptor count.
    • For CALL/RETURN, descriptors are packed at addr + len after the payload bytes and must be aligned to CAP_TRANSFER_DESCRIPTOR_ALIGNMENT.
    • Result-cap insertion semantics are defined by CapCqe: result reports normal payload bytes, while cap_count reports how many CapTransferResult { cap_id, interface_id } records were appended immediately after those payload bytes in result_addr when CAP_CQE_TRANSFER_RESULT_CAPS is set. User space must bound-check result + cap_count * CAP_TRANSFER_RESULT_SIZE against its requested result_len.
    • Transfer-bearing SQEs are fail-closed:
      • unsupported-by-kernel-transfer path: CAP_ERR_TRANSFER_NOT_SUPPORTED (until sideband transfer is enabled),
      • malformed descriptor metadata (invalid mode, reserved bits, non-zero _reserved0, misalignment, overflow): CAP_ERR_INVALID_TRANSFER_DESCRIPTOR,
      • all other reserved-field misuse remains CAP_ERR_INVALID_REQUEST.
  • No revocation propagation. Removing a table entry doesn’t invalidate copies or derived capabilities. Epoch-based revocation is planned.
  • No bulk data path. All data goes through capnp message copy. SharedBuffer / MemoryObject capability needed for file I/O, networking, GPU data plane. See storage-and-naming-proposal.md “Shared Memory for Bulk Data” for the interface design.
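
The receiver-side bound check described in the transfer ABI above can be sketched as pure arithmetic. The record size here is an assumed value, not the real CAP_TRANSFER_RESULT_SIZE:

```rust
// Sketch of the bound check for appended CapTransferResult records.
// 16 bytes per record is illustrative; the real ABI defines the size.

const CAP_TRANSFER_RESULT_SIZE: usize = 16;

/// Returns the byte range of the appended result-cap records inside the
/// result buffer, or None if they would overflow the caller's requested
/// result_len (fail closed, including on arithmetic overflow).
fn result_cap_region(
    payload_bytes: usize, // CQE `result` (normal payload length)
    cap_count: usize,     // CQE `cap_count`
    result_len: usize,    // caller's requested result buffer length
) -> Option<(usize, usize)> {
    let records = cap_count.checked_mul(CAP_TRANSFER_RESULT_SIZE)?;
    let end = payload_bytes.checked_add(records)?;
    (end <= result_len).then(|| (payload_bytes, end))
}

fn main() {
    // 100 payload bytes + 2 records of 16 bytes fit in a 256-byte buffer.
    assert_eq!(result_cap_region(100, 2, 256), Some((100, 132)));
    // The same records do not fit in a 120-byte buffer: fail closed.
    assert_eq!(result_cap_region(100, 2, 120), None);
}
```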

Future Directions

  • Capability transfer. Cross-process capability calls already go through the kernel via Endpoint objects with RECV/RETURN SQE opcodes on the existing per-process capability ring (no new syscalls). The remaining transfer work will carry capability references with sideband descriptors and install result caps in the receiver’s local table. See storage-and-naming-proposal.md for how this enables Directory.open() returning File caps, Namespace.sub() returning scoped Namespace caps, etc.
  • Persistence. Serialize capability state to storage using capnp format. Restore capabilities across reboots.
  • Network transparency. Forward capability calls to remote machines using the same capnp wire format. A remote Console capability looks identical to a local one.

Capability Ring

The capability ring is the userspace-to-kernel transport for capability invocation. It avoids one syscall per operation while preserving a typed Cap’n Proto method boundary and explicit completion reporting.

Status: Implemented. The shared-memory ring, cap_enter, CALL/RECV/RETURN/RELEASE/NOP dispatch, structured transport errors, bounded scratch buffers, and Loom ring model are implemented. FINISH, promise pipelining, multishot, link, drain, and SQPOLL remain future work.

Current Behavior

Each non-idle process gets one 4 KiB ring page mapped at RING_VADDR. The page contains a volatile header, a 16-entry submission queue, and a 32-entry completion queue. Userspace writes CapSqe records, advances sq_tail, and uses cap_enter(min_complete, timeout_ns) to make ordinary calls progress.
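
The fixed power-of-two queue sizes permit the usual free-running-counter index math. A host-side sketch, with atomics, the volatile header, and the real struct layout omitted:

```rust
// Sketch of ring index math for the fixed 16-entry SQ / 32-entry CQ.
// Head/tail are free-running counters; masking maps them to slots.

const SQ_ENTRIES: u32 = 16;
const CQ_ENTRIES: u32 = 32;

fn slot(counter: u32, entries: u32) -> usize {
    debug_assert!(entries.is_power_of_two());
    (counter & (entries - 1)) as usize
}

/// Entries pending for the consumer: tail - head in wrapping arithmetic.
fn available(head: u32, tail: u32) -> u32 {
    tail.wrapping_sub(head)
}

fn main() {
    // Free-running counters wrap around the slot array.
    assert_eq!(slot(0, SQ_ENTRIES), 0);
    assert_eq!(slot(17, SQ_ENTRIES), 1);
    assert_eq!(slot(33, CQ_ENTRIES), 1);
    // A producer at tail=18 with consumer at head=16 has 2 pending entries.
    assert_eq!(available(16, 18), 2);
    // Wrapping subtraction stays correct across u32 overflow.
    assert_eq!(available(u32::MAX, 1), 2);
}
```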

sequenceDiagram
    participant U as Userspace runtime
    participant R as Ring page
    participant K as Kernel ring dispatcher
    participant C as Capability object
    U->>R: write CapSqe and advance sq_tail
    U->>K: cap_enter(min_complete, timeout_ns)
    K->>R: read sq_head..sq_tail
    K->>K: validate SQE fields and user buffers
    K->>C: call method or endpoint operation
    C-->>K: completion, pending, or error
    K->>R: write CapCqe and advance cq_tail
    K-->>U: return available CQE count
    U->>R: read matching CapCqe

Timer polling also processes each current process’s ring before preemption, but only non-CALL operations and CALL targets that explicitly allow interrupt dispatch may run there. Ordinary CALLs wait for cap_enter.

Why ordinary CALL waits for cap_enter: Submitting a CALL SQE is only a shared-memory write. The kernel still needs a safe execution point to drain the ring and run capability code. Timer polling runs in interrupt context, so it must not execute arbitrary capability methods that may allocate, block on locks, mutate page tables, spawn processes, parse Cap’n Proto messages, or perform IPC side effects. cap_enter is the normal process-context drain point: it processes pending SQEs, posts CQEs, and then either returns the available completion count or blocks until enough completions arrive. The design keeps SQE publication syscall-free and batchable, keeps the syscall ABI limited to exit and cap_enter, and avoids turning the timer interrupt into a general capability executor. A future SQPOLL-style path can remove the explicit syscall from the hot path only by running dispatch in a worker context, not from arbitrary timer interrupt execution.

Design

CapSqe is a fixed 64-byte ABI record. CAP_OP_CALL names a local cap-table slot and method ID plus parameter/result buffers. CAP_OP_RECV and CAP_OP_RETURN implement endpoint IPC. CAP_OP_RELEASE removes a local cap-table slot through the transport. CAP_OP_NOP measures the fixed ring path. CAP_OP_FINISH is ABI-reserved and currently returns CAP_ERR_UNSUPPORTED_OPCODE.
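
The fail-closed opcode split can be sketched as a dispatch match. Opcode and error values here are illustrative, not the real ABI constants:

```rust
// Sketch of fail-closed opcode dispatch: known opcodes route to handlers,
// the reserved FINISH opcode gets a distinct error, and anything unknown
// is rejected outright.

const CAP_OP_NOP: u8 = 0;
const CAP_OP_CALL: u8 = 1;
const CAP_OP_RECV: u8 = 2;
const CAP_OP_RETURN: u8 = 3;
const CAP_OP_RELEASE: u8 = 4;
const CAP_OP_FINISH: u8 = 5;

const CAP_ERR_UNSUPPORTED_OPCODE: i64 = -2;
const CAP_ERR_INVALID_REQUEST: i64 = -3;

fn dispatch(opcode: u8) -> i64 {
    match opcode {
        CAP_OP_NOP => 0,
        CAP_OP_CALL | CAP_OP_RECV | CAP_OP_RETURN | CAP_OP_RELEASE => 0,
        // ABI-reserved: distinguishable from a truly malformed opcode.
        CAP_OP_FINISH => CAP_ERR_UNSUPPORTED_OPCODE,
        // Fail closed on everything else.
        _ => CAP_ERR_INVALID_REQUEST,
    }
}

fn main() {
    assert_eq!(dispatch(CAP_OP_NOP), 0);
    assert_eq!(dispatch(CAP_OP_FINISH), CAP_ERR_UNSUPPORTED_OPCODE);
    assert_eq!(dispatch(0xFF), CAP_ERR_INVALID_REQUEST);
}
```

Keeping FINISH distinct from the unknown-opcode error preserves the reserved slot in the transport namespace without silently accepting it.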

The kernel copies user params into preallocated per-process scratch, dispatches capability methods, serializes results directly into caller-provided result buffers, and posts CapCqe. A successful method returns non-negative bytes written. Transport failures are negative CAP_ERR_* codes. Application exceptions are serialized CapException payloads with CAP_ERR_APPLICATION_EXCEPTION.

Transfer-bearing CALL and RETURN SQEs pack CapTransferDescriptor records after the params/result payload. Successful result-cap transfers append CapTransferResult records after normal result bytes.

Future behavior should use the reserved SQE fields for system transport features, not ad hoc per-interface extensions.

Invariants

  • SQ and CQ sizes are powers of two and fixed by the ABI.
  • Unknown opcodes fail closed; FINISH is reserved, not silently accepted.
  • Reserved fields must be zero for currently implemented opcodes.
  • cap_enter rejects min_complete > CQ_ENTRIES.
  • User buffers must be in user address space with page permissions matching read/write intent before copy.
  • Timer dispatch must not run capabilities that allocate, block on locks, or mutate page tables unless the cap explicitly opts in.
  • Per-dispatch SQE processing is bounded by SQ_ENTRIES.
  • Transfer descriptors must be aligned, valid, and bounded by MAX_TRANSFER_DESCRIPTORS.

Code Map

  • capos-config/src/ring.rs - shared ring ABI, opcodes, errors, SQE/CQE structs, endpoint message headers, transfer records.
  • kernel/src/cap/ring.rs - kernel dispatcher, SQE validation, CQE posting, cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.
  • kernel/src/arch/x86_64/syscall.rs - cap_enter syscall.
  • kernel/src/sched.rs - timer polling, cap-enter blocking, direct IPC wake.
  • kernel/src/process.rs - ring page allocation and mapping.
  • capos-rt/src/ring.rs - runtime ring client, pending calls, transfer packing, result-cap parsing.
  • capos-rt/src/entry.rs - single-owner runtime ring client token and release queue flushing.
  • capos-config/tests/ring_loom.rs - bounded producer/consumer model.

Validation

  • cargo test-ring-loom validates SQ/CQ producer-consumer behavior, capacity, FIFO, CQ overflow/drop behavior, and corrupted SQ recovery.
  • make run exercises Console CALLs, reserved opcode rejection, ring corruption recovery, NOP, fairness, transfers, and endpoint IPC.
  • make run-measure exercises measurement-only NOP and NullCap baselines.
  • cargo test-config covers shared ring layout and helper invariants.
  • make capos-rt-check checks userspace runtime ring code under the bare-metal target.

Open Work

  • Implement CAP_OP_FINISH as part of the system Cap’n Proto transport.
  • Add promise pipelining using pipeline_dep and pipeline_field.
  • Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
  • Add SQPOLL after SMP gives the kernel a spare execution context.

IPC and Endpoints

Endpoints let one process serve capability calls to another process without adding a separate IPC syscall surface. The same ring transport carries ordinary kernel capability calls and cross-process endpoint calls.

Status: Partially implemented. Ring-native endpoint CALL/RECV/RETURN, client endpoint attenuation, badges, copy and move capability transfer, direct IPC handoff, and cleanup for many exit paths are implemented. Notification objects, promise pipelining, shared buffers, revocation, and transfer-path cleanup refactoring remain open.

Current Behavior

An Endpoint is a kernel capability object with queues for pending client calls, pending server receives, and in-flight calls awaiting RETURN. A service that owns the raw endpoint can receive and return. Importers receive a ClientEndpoint facet that can CALL but cannot RECV or RETURN.

sequenceDiagram
    participant Client
    participant ClientRing as Client ring
    participant Endpoint
    participant ServerRing as Server ring
    participant Server
    Server->>ServerRing: submit RECV on raw endpoint
    Client->>ClientRing: submit CALL on client facet
    ClientRing->>Endpoint: deliver params and caller result target
    Endpoint->>ServerRing: complete RECV with EndpointMessageHeader and params
    ServerRing-->>Server: cap_enter returns completion
    Server->>ServerRing: submit RETURN with call_id and result
    ServerRing->>Endpoint: take in-flight target
    Endpoint->>ClientRing: post caller CQE with result and badge
    ClientRing-->>Client: wait returns matching completion

If a CALL arrives before a RECV, the endpoint queues bounded params. If a RECV arrives before a CALL, the endpoint queues the receive request. Delivered calls move into the in-flight queue until the server returns or cleanup cancels them.
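
A host-side model of this rendezvous, with illustrative bounds and without the params copying, badges, and wakeups the kernel performs:

```rust
use std::collections::VecDeque;

// Whichever side arrives first is queued (bounded); a matched pair moves
// into the in-flight set until RETURN.

const MAX_QUEUED_CALLS: usize = 8;

struct Endpoint {
    pending_calls: VecDeque<(u64, Vec<u8>)>, // (call_id, queued params)
    pending_recvs: VecDeque<u32>,            // waiting server RECVs (pid)
    in_flight: Vec<(u64, u32)>,              // delivered, awaiting RETURN
    next_call_id: u64,
}

impl Endpoint {
    fn new() -> Self {
        Endpoint {
            pending_calls: VecDeque::new(),
            pending_recvs: VecDeque::new(),
            in_flight: Vec::new(),
            next_call_id: 1, // call_ids are kernel-assigned and non-zero
        }
    }

    /// Client CALL: match a waiting RECV, else queue bounded params.
    fn call(&mut self, params: Vec<u8>) -> Result<u64, &'static str> {
        let id = self.next_call_id;
        self.next_call_id += 1;
        if let Some(pid) = self.pending_recvs.pop_front() {
            let _ = params; // delivered immediately; delivery copy elided
            self.in_flight.push((id, pid));
        } else if self.pending_calls.len() < MAX_QUEUED_CALLS {
            self.pending_calls.push_back((id, params));
        } else {
            return Err("call queue full"); // bounded, fail closed
        }
        Ok(id)
    }

    /// Server RECV: take a queued call, else queue the receive request.
    fn recv(&mut self, pid: u32) -> Option<u64> {
        match self.pending_calls.pop_front() {
            Some((id, _params)) => {
                self.in_flight.push((id, pid));
                Some(id)
            }
            None => {
                self.pending_recvs.push_back(pid);
                None
            }
        }
    }
}

/// CALL-before-RECV and RECV-before-CALL both end with the call in flight.
fn rendezvous_ok() -> bool {
    let mut ep = Endpoint::new();
    let a = ep.call(b"ping".to_vec()).unwrap(); // queued
    let matched = ep.recv(7) == Some(a);        // delivered to RECV
    assert!(ep.recv(7).is_none());              // RECV now waits
    let b = ep.call(b"pong".to_vec()).unwrap(); // matches waiting RECV
    matched && ep.in_flight == vec![(a, 7), (b, 7)]
}

fn main() {
    assert!(rendezvous_ok());
}
```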

Design

Endpoint IPC is capability-oriented. The manifest can export a raw endpoint from one service; importers get a narrowed client facet. This keeps server-only authority out of clients without introducing rights bitmasks.

CALL and RETURN may carry sideband transfer descriptors. Copy transfers insert a new cap into the receiver while preserving the sender. Move transfers reserve the sender slot, insert the destination, then remove the source on commit. RETURN-side transfers append result-cap records after the normal result payload.
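
The copy and move orderings can be modeled over two toy slot tables. The reserve/insert/commit-or-rollback sequence follows the text above; everything else (slot shape, object ids) is simplified:

```rust
// Sketch of copy vs. move transfer over two cap tables: a failed
// destination insert must never strand or duplicate authority.

#[derive(Clone, Copy, PartialEq, Debug)]
enum Slot { Empty, Live(u32), Reserved(u32) } // payload = object id

fn copy_transfer(src: &mut [Slot], si: usize, dst: &mut Vec<Slot>) -> Result<usize, &'static str> {
    let Slot::Live(obj) = src[si] else { return Err("source not live") };
    let di = dst.iter().position(|s| *s == Slot::Empty).ok_or("dest full")?;
    dst[di] = Slot::Live(obj); // sender keeps its cap
    Ok(di)
}

fn move_transfer(src: &mut [Slot], si: usize, dst: &mut Vec<Slot>) -> Result<usize, &'static str> {
    let Slot::Live(obj) = src[si] else { return Err("source not live") };
    src[si] = Slot::Reserved(obj); // reserve: unusable but recoverable
    match dst.iter().position(|s| *s == Slot::Empty) {
        Some(di) => {
            dst[di] = Slot::Live(obj);
            src[si] = Slot::Empty; // commit: remove the source
            Ok(di)
        }
        None => {
            src[si] = Slot::Live(obj); // rollback preserves source authority
            Err("dest full")
        }
    }
}

/// Copy leaves both tables live.
fn copy_keeps_source() -> bool {
    let mut a = vec![Slot::Live(7)];
    let mut b = vec![Slot::Empty];
    copy_transfer(&mut a, 0, &mut b).is_ok() && a[0] == Slot::Live(7) && b[0] == Slot::Live(7)
}

/// A failed move rolls back and leaves the source live.
fn move_rollback_preserves_source() -> bool {
    let mut a = vec![Slot::Live(7)];
    let mut full = vec![Slot::Live(1)];
    move_transfer(&mut a, 0, &mut full).is_err() && a[0] == Slot::Live(7)
}

fn main() {
    assert!(copy_keeps_source());
    assert!(move_rollback_preserves_source());
    // A successful move removes the source and installs the destination.
    let mut a = vec![Slot::Live(7)];
    let mut c = vec![Slot::Empty];
    move_transfer(&mut a, 0, &mut c).unwrap();
    assert_eq!((a[0], c[0]), (Slot::Empty, Slot::Live(7)));
}
```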

Badges are stored on cap-table hold edges and delivered to servers with endpoint invocation metadata, so one endpoint can distinguish callers without one object per caller.

Future IPC should add notification objects for lightweight signaling and promise pipelining for Cap’n Proto-style dependent calls.

Invariants

  • Only raw endpoint holders may RECV or RETURN.
  • Imported endpoint caps are ClientEndpoint facets and must reject RECV and RETURN from userspace.
  • Endpoint queues are bounded by call count, receive count, in-flight count, per-call params, and total queued params.
  • Each in-flight call has a kernel-assigned non-zero call_id.
  • CALL delivery copies params into kernel-owned queued storage before the caller can resume.
  • Move transfer commit must not leave both source and destination live.
  • Transfer rollback must preserve source authority if destination insertion or result delivery fails.
  • Process exit must cancel queued state involving that pid and wake affected peers when possible.

Code Map

  • kernel/src/cap/endpoint.rs - endpoint queues, client facet, call IDs, cancellation by pid.
  • kernel/src/cap/ring.rs - endpoint CALL/RECV/RETURN dispatch, result copying, deferred cancellation CQEs.
  • kernel/src/cap/transfer.rs - transfer descriptor loading and transaction preparation.
  • capos-lib/src/cap_table.rs - cap-table transfer primitives and rollback.
  • kernel/src/cap/mod.rs - manifest export resolution and client-facet construction.
  • capos-config/src/ring.rs - EndpointMessageHeader, transfer descriptors, transfer result records, endpoint opcodes.
  • demos/capos-demo-support/src/lib.rs - endpoint, IPC, transfer, and hostile IPC smoke routines.
  • demos/endpoint-roundtrip, demos/ipc-server, demos/ipc-client - QEMU smoke binaries.

Validation

  • make run validates same-process endpoint RECV/RETURN, cross-process IPC, endpoint exit cleanup, badged calls, transfer success/failure paths, and clean halt.
  • make run-spawn validates init-spawned endpoint-roundtrip, server, and client processes.
  • cargo test-lib covers cap-table transfer preflight, provisional insertion, commit, rollback, stale generation, and slot exhaustion cases.
  • cargo test-ring-loom covers ring queue behavior that endpoint IPC depends on for completion delivery.

Open Work

  • Add notification objects for signal-style events.
  • Add Cap’n Proto promise pipelining after endpoint routing can resolve dependent answers.
  • Add shared-buffer or memory-object capabilities for bulk data transfer.
  • Add epoch-based revocation if broad authority invalidation becomes necessary.

Userspace Runtime

The userspace runtime owns the repeated mechanics that every service needs: bootstrap validation, heap initialization, typed capability lookup, ring submission, completion matching, application exception decoding, and handle lifetime.

Status: Partially implemented. capos-rt provides a no_std entry ABI, fixed heap, typed CapSet lookup, a single-owner ring client, typed Console and ProcessSpawner clients, result-cap adoption, and release-on-drop for owned handles. Full generated bindings, promise pipelining, language runtimes, and broad transport semantics remain future work.

Current Behavior

Runtime-owned _start receives (ring_addr, pid, capset_addr), initializes a fixed heap, validates the ring address, reads the read-only CapSet page, installs an emergency Console panic path when available, calls capos_rt_main(runtime), and exits with the returned code.

The Runtime lends out at most one RuntimeRingClient at a time. The client wraps the raw ring page, keeps request buffers alive until completions are matched, handles out-of-order completions, packs copy-transfer descriptors, and parses result-cap records. Owned runtime handles queue CAP_OP_RELEASE when the last local reference is dropped; the release queue flushes when a ring client is borrowed or dropped.

Design

The runtime separates non-owning bootstrap references from owned local handles. CapSet entries produce typed Capability<T> values only when the interface ID matches the requested type. Result-cap adoption performs the same interface check before producing OwnedCapability<T>.
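
The interface-checked adoption step can be sketched with marker types. The interface names, ID values, and `adopt` helper are illustrative, not the capos-rt API:

```rust
use std::marker::PhantomData;

// A raw (cap_id, interface_id) pair becomes a typed Capability<T> only
// when the reported interface ID matches the marker type's expected ID.

trait Interface { const INTERFACE_ID: u64; }

struct Console;
impl Interface for Console { const INTERFACE_ID: u64 = 0x1111; }

struct ProcessSpawner;
impl Interface for ProcessSpawner { const INTERFACE_ID: u64 = 0x2222; }

struct Capability<T: Interface> {
    cap_id: u32,
    _marker: PhantomData<T>,
}

fn adopt<T: Interface>(cap_id: u32, interface_id: u64) -> Option<Capability<T>> {
    // Reject mismatched interfaces before handing out a typed handle.
    (interface_id == T::INTERFACE_ID)
        .then(|| Capability { cap_id, _marker: PhantomData })
}

fn main() {
    // A cap whose reported interface is Console adopts as Console...
    assert_eq!(adopt::<Console>(3, 0x1111).map(|c| c.cap_id), Some(3));
    // ...but not as ProcessSpawner.
    assert!(adopt::<ProcessSpawner>(3, 0x1111).is_none());
}
```

The PhantomData marker costs nothing at runtime; the check happens once at adoption, after which the type system prevents calling the wrong client methods.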

Typed clients are thin wrappers over the ring client. They encode Cap’n Proto params, submit CALL SQEs, wait for a matching CQE, decode transport errors, and decode kernel-produced CapException payloads into client errors.

Future generated clients should preserve this split: transport lifetime and completion matching belong in the runtime, while interface-specific encoding belongs in generated or handwritten client wrappers.

Invariants

  • ring_addr must equal RING_VADDR; runtime bootstrap rejects any other address.
  • The CapSet header magic/version must validate before lookup.
  • CapSet handles are non-owning unless explicitly adopted.
  • Only one runtime ring client may be live at a time for a process.
  • Request params and result buffers must outlive their matching CQE.
  • A result cap can be consumed only once and only with the expected interface ID.
  • Dropping the final owned handle queues exactly one local CAP_OP_RELEASE.
  • Release flushing treats stale or already-removed caps as non-fatal cleanup.

Code Map

  • capos-rt/src/entry.rs - _start, Runtime, bootstrap validation, single-owner ring token, release queue flushing.
  • capos-rt/src/alloc.rs - fixed userspace heap initialization.
  • capos-rt/src/capset.rs - typed CapSet lookup wrappers.
  • capos-rt/src/ring.rs - ring client, pending calls, completion matching, copy-transfer packing, result-cap parsing.
  • capos-rt/src/client.rs - Console, ProcessSpawner, ProcessHandle clients and exception decoding.
  • capos-rt/src/lib.rs - typed capability marker types and owned handle reference counting.
  • capos-rt/src/panic.rs - emergency Console output path.
  • init/src/main.rs and capos-rt/src/bin/smoke.rs - current runtime users.

Validation

  • make capos-rt-check builds the runtime smoke binary with userspace relocation constraints.
  • make run validates runtime entry, typed Console calls, exception decoding, owned handle release, result-cap parsing through IPC, and clean process exit.
  • make run-spawn validates ProcessSpawnerClient, ProcessHandleClient, result-cap adoption, and release behavior under init spawning.
  • cd capos-rt && cargo test --lib --target x86_64-unknown-linux-gnu covers host-testable runtime invariants when run explicitly.

Open Work

  • Replace duplicated demo support ring helpers with capos-rt where practical.
  • Add generated client bindings after the schema surface stabilizes.
  • Implement promise/answer transport semantics beyond current placeholders.
  • Define release behavior for queued handles when a process exits before the release queue flushes.

Manifest and Service Startup

The manifest is the boot-time authority graph. It names binaries, services, initial capabilities, exported service caps, restart policy metadata, badges, and system config.

Status: Partially implemented. Manifest parsing, Cap’n Proto encoding, CUE conversion, binary embedding, kernel-side service startup, service exports, endpoint client facets, badges, BootPackage manifest exposure to init, init-side graph validation, and generic init-side spawning through ProcessSpawner are implemented for system-spawn.cue. Retiring the default kernel-side service graph is the remaining selected milestone work.

Current Behavior

tools/mkmanifest requires the repo-pinned CUE compiler, evaluates system.cue, embeds declared binaries, validates binary references and authority graph structure, serializes SystemManifest, and places manifest.bin into the ISO. The kernel receives that file as the single Limine module.

flowchart TD
    Cue[system.cue or system-spawn.cue] --> Mkmanifest[tools/mkmanifest]
    Binaries[release userspace binaries] --> Mkmanifest
    Mkmanifest --> Manifest[manifest.bin SystemManifest]
    Manifest --> Limine[Limine boot module]
    Limine --> Kernel[kernel parse and validate]
    Kernel --> Tables[CapTables and CapSet pages]
    Tables --> BootServices[default: kernel-enqueued services]
    Tables --> Init[spawn manifest: init gets ProcessSpawner and BootPackage]
    Init --> BootPackage[BootPackage.readManifest chunks]
    BootPackage --> Plan[capos-config ManifestBootstrapPlan validation]
    Init --> Spawner[ProcessSpawner.spawn]
    Spawner --> Children[init-spawned child processes]

The default manifest currently starts all services from the kernel. The spawn manifest starts only init, grants it ProcessSpawner plus a read-only BootPackage cap, and lets init read bounded manifest chunks into a metadata-only capos-config::ManifestBootstrapPlan. Init validates binary references, authority graph structure, exports, cap sources, and interface IDs before spawning the endpoint, IPC, VirtualMemory, and FrameAllocator cleanup demos.

Spawn grants carry explicit requested badges: raw parent-capability grants must preserve the source hold badge, endpoint-client grants may mint the requested badge only from an endpoint-owner source, and kernel-source FrameAllocator/VirtualMemory grants mint fresh child-local caps without badges.

Design

Manifest validation has three layers:

  • Binary references: binary names are unique, service binary references resolve, and referenced binary payloads are non-empty.
  • Authority graph: service names, cap names, export names, and service-sourced references are unique and resolvable; re-exporting service-sourced caps is rejected.
  • Bootstrap cap sources: expected interface IDs match kernel sources or declared service exports.
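The first layer can be sketched as a host-side check. The struct shapes are illustrative (the real types live in capos-config/src/manifest.rs), but the three rules match the list above: unique binary names, resolvable service references, non-empty payloads.

```rust
use std::collections::HashSet;

// Illustrative shapes, not the capos-config structs.
pub struct Binary { pub name: String, pub payload: Vec<u8> }
pub struct Service { pub name: String, pub binary: String }

/// First validation layer: binary names are unique, service binary
/// references resolve, and referenced payloads are non-empty.
pub fn validate_binary_refs(binaries: &[Binary], services: &[Service]) -> Result<(), String> {
    let mut names = HashSet::new();
    for b in binaries {
        if !names.insert(b.name.as_str()) {
            return Err(format!("duplicate binary name: {}", b.name));
        }
        if b.payload.is_empty() {
            return Err(format!("empty binary payload: {}", b.name));
        }
    }
    for s in services {
        if !names.contains(s.binary.as_str()) {
            return Err(format!("service {} references unknown binary {}", s.name, s.binary));
        }
    }
    Ok(())
}
```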

Kernel startup resolves caps in two passes. Pass 1 creates kernel-sourced caps and records declared exports. Pass 2 resolves service-sourced imports against the export registry, attenuating endpoint exports to client-only facets for importers. Declaration order is preserved because CapIds are assigned by insertion order and CapSet entries mirror that order.
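A hedged model of that two-pass resolution, with invented types (`Decl`, `CapSource`) standing in for the real logic in kernel/src/cap/mod.rs: CapIds follow declaration order, pass 1 records exports of kernel-sourced caps, and pass 2 rejects any service-sourced import that names no recorded export.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug)]
pub enum CapSource {
    Kernel(&'static str),                                     // kernel object kind
    Service { service: &'static str, export: &'static str }, // import of an export
}

pub struct Decl {
    pub service: &'static str,
    pub cap: &'static str,
    pub source: CapSource,
    pub exported: bool,
}

/// Two-pass resolution sketch. CapIds are assigned by declaration order,
/// so CapSet entries can mirror that order.
pub fn resolve(decls: &[Decl]) -> Result<Vec<(&'static str, &'static str, u32)>, String> {
    let mut exports: HashMap<(&str, &str), u32> = HashMap::new();

    // Pass 1: kernel-sourced caps exist immediately; declared exports
    // are recorded (service-sourced caps cannot be re-exported).
    for (i, d) in decls.iter().enumerate() {
        if matches!(d.source, CapSource::Kernel(_)) && d.exported {
            exports.insert((d.service, d.cap), i as u32);
        }
    }

    // Pass 2: every service-sourced import must resolve against the
    // export registry built in pass 1.
    let mut out = Vec::new();
    for (i, d) in decls.iter().enumerate() {
        if let CapSource::Service { service, export } = d.source {
            if !exports.contains_key(&(service, export)) {
                return Err(format!(
                    "{}:{} imports missing export {}:{}",
                    d.service, d.cap, service, export
                ));
            }
        }
        out.push((d.service, d.cap, i as u32));
    }
    Ok(out)
}
```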

Future behavior should make the kernel parse only enough boot information to launch init with manifest/boot-package authority. init should use BootPackage.readManifest to validate the service graph, call ProcessSpawner, grant caps, and wait for services where policy requires it.

Invariants

  • The manifest is schema data, not shell script or ambient namespace.
  • Omitted cap sources fail closed.
  • Cap names within one service are unique and are the names userspace sees in CapSet.
  • Service exports must name caps declared by the same service.
  • Service-sourced imports must reference a declared service export.
  • Endpoint exports to importers must be attenuated to client-only facets.
  • expectedInterfaceId checks compatibility; it is not the authority selector.
  • Badges travel with cap-table hold edges and endpoint invocation metadata. Spawn-time client endpoint minting carries the requested child badge instead of copying the parent’s owner-hold badge.

Code Map

  • schema/capos.capnp - SystemManifest, ServiceEntry, CapRef, KernelCapSource, ServiceCapSource, RestartPolicy.
  • capos-config/src/manifest.rs - manifest structs, CUE conversion, capnp encode/decode, metadata-only ManifestBootstrapPlan, schema-version guardrails, validation.
  • tools/mkmanifest/src/lib.rs and tools/mkmanifest/src/main.rs - host-side manifest build pipeline and binary embedding.
  • kernel/src/main.rs - kernel manifest module parse and validation.
  • kernel/src/cap/mod.rs - service cap creation, exports, endpoint attenuation, CapSet entry construction.
  • kernel/src/cap/boot_package.rs - read-only manifest-size and chunked manifest-read capability.
  • kernel/src/cap/process_spawner.rs - init-callable spawn path for packaged boot binaries.
  • capos-rt/src/client.rs - typed BootPackage and ProcessSpawner clients.
  • init/src/main.rs - current spawn manifest executor smoke.
  • system.cue and system-spawn.cue - current manifests.

Validation

  • cargo test-config validates manifest decode, CUE conversion, graph checks, source checks, and binary reference checks.
  • cargo test-mkmanifest validates host-side manifest conversion, embedded binary handling, and pinned CUE path/version checks.
  • make run validates default kernel-side manifest execution.
  • make run-spawn validates system-spawn.cue, init-side BootPackage manifest reads, init-side manifest graph validation, init-side spawning, hostile spawn failures, child grants, process waits, and cap-table exhaustion checks.
  • make generated-code-check validates schema-generated Rust stays in sync.

Open Work

  • Retire the default kernel-side service graph and make the normal boot path launch only init before service startup.

Memory Management

Memory management gives the kernel controlled ownership of physical frames, separates user processes, enforces page permissions, and exposes memory authority only through explicit capabilities.

Status: Partially implemented. Physical frame allocation, heap initialization, kernel page-table remapping, per-process address spaces, ELF/user stack/TLS mappings, user-buffer validation, FrameAllocator, and VirtualMemory caps are implemented. Broader quota unification, SMP-safe validation, shared buffers, huge pages, and hardware I/O memory isolation remain open.

Current Behavior

The frame allocator builds a bitmap from the Limine memory map, marks all non-usable frames as used, reserves frame zero, and reserves its own bitmap frames. The heap is initialized separately for kernel allocation.
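A minimal host-side sketch of that bitmap discipline, in the spirit of capos-lib/src/frame_bitmap.rs but not its API: all frames start marked used, only explicitly usable ranges are freed, frame zero stays reserved, and allocation scans for the lowest clear bit.

```rust
/// One bit per 4 KiB frame; a set bit means "used".
pub struct FrameBitmap {
    bits: Vec<u64>,
    total_frames: usize,
}

impl FrameBitmap {
    pub fn new(total_frames: usize) -> Self {
        // All frames start used; non-usable regions never become free.
        Self { bits: vec![u64::MAX; (total_frames + 63) / 64], total_frames }
    }

    /// Free an explicitly usable range, keeping frame zero reserved.
    pub fn mark_usable(&mut self, first: usize, count: usize) {
        for f in first..(first + count).min(self.total_frames) {
            if f == 0 { continue; } // frame zero stays reserved
            self.bits[f / 64] &= !(1u64 << (f % 64));
        }
    }

    /// Allocate the lowest free frame, marking it used.
    pub fn alloc(&mut self) -> Option<usize> {
        for f in 0..self.total_frames {
            let (w, b) = (f / 64, f % 64);
            if self.bits[w] & (1 << b) == 0 {
                self.bits[w] |= 1 << b;
                return Some(f);
            }
        }
        None
    }
}
```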

Paging initialization builds a new kernel PML4, remaps kernel sections with section-specific permissions, copies upper-half mappings with NX applied and user access stripped, switches CR3, then enables page-global support. SMEP/SMAP are enabled after those mappings are active.

Each user AddressSpace owns its lower-half page tables and clones the kernel’s upper-half mappings. Dropping an address space walks the user half and frees mapped frames and page-table frames. VirtualMemory lets a process map, unmap, and protect anonymous user pages, with a 256-page per-address-space tracking limit.

Design

The kernel keeps physical allocation host-testable by placing bitmap logic in capos-lib and wrapping it with kernel HHDM access in kernel/src/mem/frame.rs. Page-table manipulation stays in the kernel because it is architecture-specific.

ELF loading and VirtualMemory both use page-table flags to preserve W^X: non-executable data gets NX, writable mappings are explicit, and userspace pages must be USER_ACCESSIBLE. The CapSet and ring bootstrap pages occupy reserved virtual pages; VirtualMemory rejects ranges that overlap either one.
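The W^X rule can be sketched as a flag-derivation helper. The flag names are illustrative, not the kernel's page-table types, but the policy matches the paragraph above: writable and executable are mutually exclusive, non-executable pages get NX, and user mappings are always user-accessible.

```rust
#[derive(Debug, PartialEq)]
pub struct PageFlags {
    pub writable: bool,
    pub user_accessible: bool,
    pub no_execute: bool,
}

/// Derive page flags for a user mapping while preserving W^X.
pub fn user_flags(writable: bool, executable: bool) -> Result<PageFlags, &'static str> {
    if writable && executable {
        return Err("W^X violation: writable+executable mapping rejected");
    }
    Ok(PageFlags {
        writable,
        user_accessible: true,   // userspace pages must be USER_ACCESSIBLE
        no_execute: !executable, // non-executable data gets NX
    })
}
```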

User-buffer validation checks that user pointers stay below the user address limit and that page-table permissions match the requested access. SMAP UserAccessGuard brackets kernel copy operations into or out of user pages.
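The address-limit half of that check is small enough to sketch. The limit constant is illustrative (the canonical lower-half boundary), and the real validator additionally walks page-table permissions; the key points are overflow-checked length arithmetic and failing closed at the boundary.

```rust
/// Illustrative user address limit: top of the canonical lower half.
pub const USER_LIMIT: u64 = 0x0000_8000_0000_0000;

/// Reject user buffers that wrap or cross the user address limit.
/// The real check also verifies page-table permissions for the access.
pub fn validate_user_range(ptr: u64, len: u64) -> Result<(), &'static str> {
    let end = ptr.checked_add(len).ok_or("user buffer length overflows")?;
    if end > USER_LIMIT {
        return Err("user buffer crosses the user address limit");
    }
    Ok(())
}
```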

Future memory work should unify quotas across frame grants, VM mappings, shared buffers, and DMA resources rather than adding one-off counters per cap.

Invariants

  • Frame addresses are 4 KiB aligned.
  • The frame bitmap’s own frames are never returned as free frames.
  • Upper-half kernel mappings are not user-accessible.
  • Kernel text is RX, rodata is read-only NX, and data/bss are RW NX.
  • User address spaces own only lower-half page-table frames.
  • CapSet is read-only/no-execute; ring is writable/no-execute.
  • VirtualMemory cannot map, unmap, or protect the ring or CapSet pages.
  • VirtualMemory protect/unmap only succeeds for pages tracked as owned by the cap’s address space.
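The last three invariants can be modeled together: a per-address-space tracking set with a 256-page limit, reserved ring/CapSet pages rejected outright, and unmap (protect follows the same check) succeeding only for tracked pages. `VmTracking` is an invented illustration, not the kernel structure.

```rust
use std::collections::BTreeSet;

/// Sketch of per-address-space VM page tracking. Pages are 4 KiB-aligned
/// virtual addresses; the two reserved pages stand in for the ring and
/// CapSet bootstrap pages.
pub struct VmTracking {
    owned: BTreeSet<u64>,
    reserved: [u64; 2],
}

impl VmTracking {
    pub const LIMIT: usize = 256; // per-address-space tracking limit

    pub fn new(ring: u64, capset: u64) -> Self {
        Self { owned: BTreeSet::new(), reserved: [ring, capset] }
    }

    pub fn map(&mut self, page: u64) -> Result<(), &'static str> {
        if self.reserved.contains(&page) {
            return Err("cannot map over ring or CapSet page");
        }
        if self.owned.len() >= Self::LIMIT {
            return Err("per-address-space tracking limit reached");
        }
        self.owned.insert(page);
        Ok(())
    }

    /// Unmap (and, in the same spirit, protect) succeeds only for pages
    /// tracked as owned by this address space.
    pub fn unmap(&mut self, page: u64) -> Result<(), &'static str> {
        if !self.owned.remove(&page) {
            return Err("page not tracked as owned by this address space");
        }
        Ok(())
    }
}
```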

Code Map

  • capos-lib/src/frame_bitmap.rs - host-testable physical frame bitmap core.
  • capos-lib/src/frame_ledger.rs - outstanding grant ledger for FrameAllocator cleanup.
  • kernel/src/mem/frame.rs - Limine memory-map integration and global frame allocator wrapper.
  • kernel/src/mem/heap.rs - kernel heap setup.
  • kernel/src/mem/paging.rs - kernel remap, AddressSpace, page mapping, VM-cap page tracking, user copy helpers.
  • kernel/src/mem/validate.rs - user-buffer validation.
  • kernel/src/cap/frame_alloc.rs - FrameAllocator capability and cleanup.
  • kernel/src/cap/virtual_memory.rs - VirtualMemory capability.
  • kernel/src/spawn.rs - ELF, stack, and TLS user mappings.
  • kernel/src/arch/x86_64/smap.rs - SMEP/SMAP setup and user access guard.

Validation

  • cargo test-lib covers frame bitmap, frame ledger, ELF parser, and cap-table pure logic.
  • cargo miri-lib runs host-testable capos-lib tests under Miri when installed.
  • make kani-lib proves bounded frame bitmap, cap ID, and ELF parser invariants when Kani is installed.
  • make run validates ELF mapping, process teardown, FrameAllocator cleanup, TLS, VirtualMemory map/protect/unmap/quota/release smoke, and clean halt.
  • make run-spawn validates ELF load failure rollback and frame exhaustion handling through ProcessSpawner.

Open Work

  • Resolve quota fragmentation across FrameAllocator, VirtualMemory, and future shared memory.
  • Harden user-buffer validation for SMP-era page-table races.
  • Add shared-buffer or memory-object capabilities for zero-copy data paths.
  • Add DMA isolation and device memory capability boundaries before userspace drivers.
  • Add huge-page handling only with explicit ownership and teardown rules.

Scheduling

Scheduling decides which process runs, preserves CPU state across preemption and blocking, and integrates capability-ring progress with process execution.

Status: Partially implemented. Single-CPU preemptive round-robin scheduling, PIT timer interrupts, full context switches, cap_enter blocking waits, user-mode idle, process exit, and direct IPC handoff are implemented. SMP, per-CPU data, kernel-mode idle, priority, and restart policy are future work.

Current Behavior

The scheduler stores processes in a BTreeMap<Pid, Process> and ready pids in a VecDeque. PIT fires at roughly 100 Hz through IRQ0. On each timer tick, the kernel wakes timed-out or satisfied cap_enter waiters, processes the current process’s ring in timer mode, saves the current context, rotates ready processes, switches CR3, updates TSS.RSP0 and the syscall kernel stack, restores FS base, and returns to the next user context.

cap_enter(min_complete, timeout_ns) processes pending SQEs immediately. If the requested completion count is not available and the timeout permits blocking, the current process enters Blocked(CapEnter { ... }) and the syscall entry path switches to another process.
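The blocking decision can be sketched as a pure function. One labeled assumption: this sketch treats a zero timeout as "do not block"; the exact encoding of "the timeout permits blocking" is the kernel's, and the names here are illustrative.

```rust
#[derive(Debug, PartialEq)]
pub enum CapEnterOutcome {
    Return { completed: u64 },
    Block { min_complete: u64, timeout_ns: u64 },
}

/// Decide whether cap_enter returns immediately or blocks. SQEs are
/// assumed to have been processed before this decision, as described
/// above. Assumption: timeout_ns == 0 means non-blocking.
pub fn cap_enter_decision(available_cqes: u64, min_complete: u64, timeout_ns: u64) -> CapEnterOutcome {
    if available_cqes >= min_complete || timeout_ns == 0 {
        // Either the request is already satisfied or blocking is not
        // permitted; return with whatever completions are available.
        CapEnterOutcome::Return { completed: available_cqes }
    } else {
        CapEnterOutcome::Block { min_complete, timeout_ns }
    }
}
```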

When endpoint delivery satisfies a blocked server RECV, the scheduler can set a direct IPC target. The next scheduling decision runs that server before ordinary round-robin work when it is ready.
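The "preference, not bypass" shape of direct handoff can be sketched as a next-process chooser: the direct target runs only if it is actually in the ready queue, otherwise ordinary round-robin order applies. Types and the function name are invented for illustration.

```rust
use std::collections::VecDeque;

/// Pick the next process. A pending direct IPC target is preferred,
/// but only when it is present in the ready queue; otherwise it is
/// dropped and round-robin order applies.
pub fn pick_next(ready: &mut VecDeque<u32>, direct_target: &mut Option<u32>) -> Option<u32> {
    if let Some(t) = direct_target.take() {
        if let Some(pos) = ready.iter().position(|&p| p == t) {
            ready.remove(pos);
            return Some(t); // scheduling preference, not a readiness bypass
        }
    }
    ready.pop_front()
}
```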

Design

The implementation keeps ring dispatch outside the global scheduler lock. Timer dispatch extracts ring/cap/scratch handles, releases the scheduler lock, processes bounded SQEs, then reacquires the scheduler lock to choose the next process. This prevents Cap’n Proto decode, serial output, and capability method bodies from running under the global scheduler lock.

The idle task is currently a user-mode process with one code page and one stack page. It exists because the timer return path assumes interrupts entered from CPL3. A future kernel-mode idle loop requires distinct IRQ entry/restore handling for CPL0 and CPL3 frames.

Exit switches to the kernel PML4 before tearing down the exiting address space, releases capability authority, completes process waiters, and defers final drop until the scheduler is running on another kernel stack.

Invariants

  • The idle process must never block in cap_enter or exit.
  • Ring dispatch must not hold the scheduler lock.
  • Timer dispatch runs with the current process CR3, so user buffers are accessible only for that process.
  • Blocked cap_enter waiters wake when enough CQEs are available or their finite timeout expires.
  • Direct IPC handoff is a scheduling preference, not a bypass of process state checks.
  • The scheduler must update TSS.RSP0 and syscall kernel RSP on each switch.
  • FS base is saved and restored across context switches for TLS.
  • The final drop of an exiting process must not occur on its own kernel stack.

Code Map

  • kernel/src/sched.rs - process table, run queue, blocking, wakeups, timer scheduling, exit, direct IPC target.
  • kernel/src/arch/x86_64/context.rs - CPU context layout, timer entry/restore, tick counter.
  • kernel/src/arch/x86_64/idt.rs - timer interrupt handler wiring.
  • kernel/src/arch/x86_64/pic.rs and kernel/src/arch/x86_64/pit.rs - PIC remap and PIT setup.
  • kernel/src/arch/x86_64/gdt.rs - TSS and kernel stack updates.
  • kernel/src/arch/x86_64/syscall.rs - blocking syscall transition for cap_enter.
  • kernel/src/arch/x86_64/tls.rs - FS base save/restore.
  • kernel/src/process.rs - process state, kernel stacks, idle process.

Validation

  • make run validates timer preemption, ring fairness, direct IPC handoff, blocked cap_enter wakeups, process exit, and clean halt.
  • make run-spawn validates process wait blocking and child exit completion through ProcessHandle.wait.
  • cargo build --features qemu verifies QEMU-only scheduler and halt paths.
  • QEMU smoke output for IPC includes direct handoff diagnostics when the server is woken from a blocked RECV.

Open Work

  • Replace the user-mode idle process with a kernel/per-CPU idle context after interrupt restore paths support CPL0 timer entries.
  • Implement SMP with per-CPU scheduler state, per-CPU syscall stacks, and TLB shootdown.
  • Add priority or policy scheduling only once the current authority and IPC semantics are stable.
  • Add service restart policy outside the static boot graph.

Trust Boundaries

This page gives reviewers one place to find the hostile-input boundaries, trusted inputs, and current isolation assumptions that matter for capOS security review.

Current Boundaries

  • Ring 0 to Ring 3. Trust rule: the kernel trusts no userspace register, pointer, SQE, CapSet, or result buffer field. Enforcement: kernel/src/arch/x86_64/syscall.rs, kernel/src/mem/validate.rs, and kernel/src/cap/ring.rs validate syscall arguments, user buffers, opcodes, and capability table lookups before privileged use. Sources: ../panic-surface-inventory.md, REVIEW.md.
  • Capability table to kernel object. Trust rule: a process acts only through a live table-local CapId with matching generation and interface. Enforcement: capos-lib/src/cap_table.rs owns generation-tagged slots; kernel capability dispatch goes through CapObject::call. Sources: cargo test-lib, QEMU ring and IPC smokes recorded in REVIEW_FINDINGS.md.
  • Capability ring shared memory. Trust rule: userspace owns SQ writes, but the kernel owns validation, dispatch, completion, and failure semantics. Enforcement: SQ/CQ headers and entries live in capos-config/src/ring.rs; kernel dispatch bounds indexes, buffer ranges, opcodes, transfer descriptors, and CQ posting. Sources: cargo test-ring-loom; QEMU ring corruption, reserved opcode, fairness, IPC, and transfer smokes.
  • Endpoint IPC and transfer. Trust rule: IPC cannot create or destroy authority except through explicit copy, move, release, or spawn transactions. Enforcement: kernel/src/cap/endpoint.rs, kernel/src/cap/transfer.rs, and capos-lib/src/cap_table.rs implement queued calls, RECV/RETURN, copy/move transfer, badge propagation, and rollback. Sources: ../authority-accounting-transfer-design.md, open transfer findings in REVIEW_FINDINGS.md.
  • Manifest and boot package. Trust rule: boot manifest bytes and embedded binaries are untrusted inputs until parsed and validated; only holders of the read-only BootPackage cap can request chunked manifest bytes, and ordinary services receive no default boot-package authority. Enforcement: tools/mkmanifest, capos-config/src/manifest.rs, kernel/src/cap/boot_package.rs, ELF parsing in capos-lib/src/elf.rs, and kernel load paths validate graph references, paths, CapSet layout, interface IDs, manifest-read bounds, ELF bounds, and load ranges. Sources: cargo test-config, cargo test-mkmanifest, cargo test-lib, manifest and ELF fuzz targets, make run-spawn.
  • Process spawn inputs. Trust rule: parent-supplied spawn params, ELF bytes, grants, badges, and result-cap insertion must fail closed. Enforcement: ProcessSpawner currently validates ELF load, grants, explicit badge attenuation, frame exhaustion, and parent cap-slot exhaustion; manifest schema-version guardrails reject unknown manifest vintages before graph validation. Sources: spawn QEMU smoke evidence and open findings in REVIEW_FINDINGS.md.
  • Host tools and filesystem. Trust rule: manifest/config input must not escape intended source directories or invoke unconstrained host commands. Enforcement: tools/mkmanifest validates references and path containment, rejects unpinned CUE compilers, and Makefile targets route CUE and Cap’n Proto through pinned tool paths. Sources: ../trusted-build-inputs.md, make generated-code-check, make dependency-policy-check.
  • Generated code and schema. Trust rule: schema, generated bindings, and no_std patches are trusted build inputs. Enforcement: schema/capos.capnp, build scripts, tools/generated/capos_capnp.rs, and tools/check-generated-capnp.sh make generated-code drift review-visible. Sources: ../trusted-build-inputs.md, make generated-code-check.
  • Device DMA and MMIO. Trust rule: current userspace receives no raw DMA buffer, device physical address, virtqueue pointer, or BAR mapping. Enforcement: the QEMU virtio-net path is allowed only through kernel-owned bounce buffers until typed DMAPool, DeviceMmio, and Interrupt capabilities exist. Sources: ../dma-isolation-design.md, make run-net.
  • Panic and emergency paths. Trust rule: hostile input should produce controlled errors, not panic, allocate unexpectedly, or expose stale state. Enforcement: ring dispatch is mostly controlled-error; remaining panic surfaces are classified by reachability and tracked as hardening work. Sources: ../panic-surface-inventory.md, REVIEW.md.

Security Invariants

  • All authority is represented by capability-table hold edges; no syscall or host tool path should bypass the capability graph.
  • The interface is the permission: method authority is expressed by the typed Cap’n Proto interface or by a narrower wrapper capability, not by ambient process identity.
  • Kernel operations at hostile boundaries validate structure, bounds, ownership, generation, interface ID, and resource availability before mutating privileged state.
  • Failed transfer, spawn, manifest, and DMA setup paths must leave ledgers, cap tables, frame ownership, and in-flight call state unchanged or explicitly rolled back.
  • Trusted build inputs must be pinned or drift-review-visible before their output becomes part of the boot image or generated source baseline.

Open Work

  • Unify fragmented resource ledgers into the authority-accounting model so reviewers can audit quotas without following parallel counters.
  • Harden open panic-surface entries that become more exposed as spawn, lifecycle, SMP, or userspace drivers expand hostile input reachability.
  • Keep DMA in kernel-owned bounce-buffer mode until the DMAPool, DeviceMmio, and Interrupt transition gates have code and QEMU proof.

Verification Workflow

This page maps capOS claims to the commands, QEMU smokes, fuzz targets, proof tools, and review documents that currently support them.

Local Command Set

Use the repo aliases and Makefile targets instead of bare host commands. The workspace default Cargo target is x86_64-unknown-none, so host tests rely on aliases that set the host target explicitly.

  • Formatting (make fmt-check): Rust formatting across kernel, shared crates, standalone userspace crates, and demos.
  • Config and manifest logic (cargo test-config): Cap’n Proto manifest encode/decode, CUE value handling, CapSet layout, and config validation.
  • Ring concurrency model (cargo test-ring-loom): bounded SQ/CQ producer-consumer invariants and corrupted-SQ recovery behavior.
  • Shared library logic (cargo test-lib): ELF parser, frame bitmap, frame ledger, capability table, and property-test coverage.
  • Manifest tool (cargo test-mkmanifest): host-side manifest conversion and validation behavior.
  • Userspace runtime (make capos-rt-check): capos-rt build path, entry ABI, typed clients, ring helpers, and no_std constraints.
  • Kernel build (cargo build --features qemu): kernel build with the QEMU exit feature enabled.
  • Generated bindings (make generated-code-check): Cap’n Proto compiler path/version, generated output equality, no_std patch anchors, and checked-in baseline drift.
  • Dependency policy (make dependency-policy-check): cargo-deny and cargo-audit policy across root and standalone lockfiles.
  • Full image build (make): kernel, userspace demos, runtime smoke binaries, manifest, Limine artifacts, and ISO packaging.
  • Default QEMU smoke (make run): end-to-end boot, userspace process output, capability ring, IPC, transfer, VirtualMemory, TLS, cleanup, and final halt paths included in the default manifest.
  • Spawn QEMU smoke (make run-spawn): init-owned spawn flow, ProcessSpawner hostile cases, child grants, waits, and cleanup.
  • Networking smoke (make run-net): QEMU virtio-net attachment and kernel PCI/device-discovery path.
  • Kani proofs (make kani-lib): bounded proofs for selected capos-lib invariants when cargo-kani is installed.

Do not claim full verification unless the relevant command actually ran in the current change. For doc-only changes, use an appropriately narrower check such as mdbook build.

Review Workflow

  1. Identify the changed trust boundary, or state explicitly that the change is docs-only.
  2. Read REVIEW.md for the applicable security, unsafe, memory, performance, capability, and emergency-path checklist.
  3. Read REVIEW_FINDINGS.md before judging correctness so known open findings are not treated as solved behavior.
  4. For system-design work, list the concrete design and research files read; reviewers should reject vague grounding such as “docs” or “research”.
  5. Run the smallest command set that exercises the changed behavior, then add QEMU proof for user-visible kernel or runtime behavior.
  6. Record unresolved non-critical findings in REVIEW_FINDINGS.md with concrete remediation context before treating the task as reviewed.

Evidence by Claim

  • Parser or manifest validation: host tests for valid and malformed input; a fuzz target when arbitrary bytes can reach the parser.
  • Kernel/user pointer safety: QEMU hostile-pointer smoke plus code review of address, length, permissions, and validation-to-use windows.
  • Ring or IPC transport behavior: host model/property tests where possible, plus QEMU process output proving success and failure paths.
  • Capability transfer or release: rollback tests for copy/move/release failure, cap-slot exhaustion, stale caps, and process-exit cleanup.
  • Resource accounting: tests that prove quota rejection, matched release on success and failure, and process-exit cleanup.
  • Generated code or schema changes: make generated-code-check and a checked-in baseline diff generated by the pinned compiler.
  • Dependency or toolchain changes: dependency-class review plus make dependency-policy-check; update ../trusted-build-inputs.md when trust assumptions change.
  • Device or DMA work: make run-net or a targeted QEMU smoke; no userspace-driver transition without the gates in ../dma-isolation-design.md.
  • Panic-surface hardening: an updated ../panic-surface-inventory.md when reachability or classification changes.

Fuzzing and Proof Tracks

The current fuzz corpus lives under fuzz/ and covers manifest Cap’n Proto input, exported JSON conversion for mkmanifest, and arbitrary ELF parser input. Run fuzzers when a change alters those parsers, schema shape, or validation rules.

Kani coverage is intentionally narrow and lives in capos-lib, where pure logic can be bounded without hardware state. Add or refresh Kani harnesses for ledger, cap-table, bitmap, and parser invariants when those invariants become part of a security claim.

Loom coverage belongs in shared ring logic. Extend cargo test-ring-loom when SQ/CQ ownership, ordering, corruption recovery, or wake semantics change.

Documentation Sources

Panic-Surface Inventory

Scope: panic!, assert!, debug_assert!, .unwrap(), .expect(), todo!, and unreachable! surfaces relevant to boot manifest loading, ELF loading, SQE handling, params/result buffers, IPC, and future spawn inputs.

Classification terms:

  • trusted-internal: depends on kernel/shared-code invariants, static ABI layout, or host build/test code; not directly controlled by a service.
  • boot-fatal: reached during boot/package setup before mutually untrusted services run. Bad platform/package state can halt the system.
  • untrusted-input reachable: reachable from userspace-controlled SQEs, Cap’n Proto params/result buffers, IPC state, manifest/package data, or future spawn-controlled service/binary data.

Summary

No currently known panic!/assert!/unwrap()/expect() site in the kernel ring dispatch path directly consumes raw SQE fields or user params/result-buffer pointers. Those paths mostly return CQE errors through kernel/src/cap/ring.rs.

The remaining relevant surfaces are boot-fatal setup assumptions, scheduler internal invariants that would become more exposed once untrusted spawn/lifecycle inputs can create or destroy processes dynamically, one IPC queue invariant, and a manifest validation .expect() guarded by a prior graph-validation call.

Manifest And Future Spawn Inputs

  • kernel/src/main.rs:308 run_init. Surface: MODULES.response().expect("no modules from bootloader"). Reachability: boot package/module table. Classification: boot-fatal. Notes: missing Limine modules abort before manifest validation.
  • capos-config/src/manifest.rs:328 validate_bootstrap_cap_sources. Surface: .expect("graph validation checked service exists"). Reachability: manifest service-source caps after validate_manifest_graph(). Classification: untrusted-input reachable, guarded. Notes: the call is safe only when callers preserve the current validation order in kernel/src/main.rs:346-351; future spawn/package validation must not call this independently on unchecked manifests.
  • kernel/src/main.rs:381 run_init. Surface: elf_cache.get(...).ok_or_else(...). Reachability: manifest service binary reference. Classification: untrusted-input reachable, controlled error. Notes: not a panic surface; included because it is the future spawn shape to preserve: unknown or unparsed binaries return an error.
  • kernel/src/main.rs:405 run_init. Surface: Process::new(...).map_err(...). Reachability: manifest-spawned process creation. Classification: untrusted-input reachable, controlled error. Notes: the current boot path converts allocation/mapping failures into boot errors; a future ProcessSpawner should keep this shape instead of adding unwraps.
  • kernel/src/cap/mod.rs:278 create_all_service_caps. Surface: unreachable!("kernel source resolved in pass 1"). Reachability: manifest cap source resolution. Classification: trusted-internal. Notes: depends on the two-pass enum construction in the same function; not directly controlled after pattern matching on CapSource::Kernel, but future dynamic grants should avoid relying on this internal sentinel.

ELF Inputs

  • kernel/src/main.rs:202 load_elf. Surface: debug_assert!(stack_top % 16 == 0). Reachability: ELF load path. Classification: trusted-internal. Notes: constant stack layout invariant, not ELF-controlled.
  • kernel/src/main.rs:303 align_up. Surface: debug_assert!(align.is_power_of_two()). Reachability: TLS mapping from parsed ELF. Classification: trusted-internal. Notes: elf::parse rejects non-power-of-two TLS alignment; load_tls also caps the size before calling align_up.
  • capos-lib/src/elf.rs parser. Surface: no runtime panic surfaces outside tests/Kani. Reachability: boot manifest ELF bytes; future spawn ELF bytes. Classification: untrusted-input reachable, controlled error. Notes: the parser uses checked offsets/ranges and returns Err(&'static str); test-only assertions/unwraps are excluded from runtime classification.
  • kernel/src/main.rs:167 load_elf. Surface: slice init_data[src_offset..]. Reachability: parsed ELF PT_LOAD file range. Classification: untrusted-input reachable, guarded. Notes: not matched by the panic-token grep, but an index panic candidate if parser invariants are bypassed; elf::parse checks segment file ranges before load_elf.
  • kernel/src/main.rs:290-293 load_tls. Surface: slice &init_data[init_start..init_end]. Reachability: parsed ELF TLS file range. Classification: untrusted-input reachable, guarded. Notes: not matched by the panic-token grep, but an index panic candidate if parser invariants are bypassed; elf::parse checks TLS file bounds before load_tls.

SQE And Params/Result Buffers

  • kernel/src/cap/ring.rs process_ring / dispatch_call / dispatch_recv / dispatch_return. Surface: no matched panic-like surfaces. Reachability: userspace SQEs, params, result buffers. Classification: untrusted-input reachable, controlled error. Notes: SQ corruption, unsupported fields/opcodes, oversized buffers, invalid user buffers, and CQ pressure return transport errors or defer consumption.
  • capos-config/src/ring.rs:147-149. Surface: const assert! layout checks. Reachability: shared ring ABI. Classification: trusted-internal. Notes: compile-time ABI guard; not runtime input reachable.
  • capos-config/src/capset.rs:53-55. Surface: const assert! layout checks. Reachability: shared CapSet ABI. Classification: trusted-internal. Notes: compile-time ABI/page-fit guard; not runtime input reachable.
  • capos-lib/src/frame_bitmap.rs:87 and capos-lib/src/frame_bitmap.rs:149. Surface: .try_into().unwrap() on 8-byte bitmap windows. Reachability: frame allocation, including work triggered by manifest/process creation and capability methods. Classification: trusted-internal. Notes: guarded by frame + 64 <= total or i + 64 <= to, assuming the caller-provided bitmap covers total_frames; the kernel constructs that bitmap at boot.

IPC

  • kernel/src/cap/endpoint.rs:202 Endpoint::endpoint_call. Surface: pending_recvs.pop_front().unwrap(). Reachability: cross-process CALL delivered to a pending RECV. Classification: untrusted-input reachable, guarded. Notes: guarded by !inner.pending_recvs.is_empty() under the same lock; it is still on an IPC path driven by service SQEs, so S.8.4 should convert this to an explicit error/rollback path if panic-free IPC is required.
  • kernel/src/cap/endpoint.rs:343-345 endpoint_restore_recv_front. Surface: unchecked push_front growth. Reachability: IPC rollback path. Classification: untrusted-input reachable, non-panic today. Notes: VecDeque::push_front can allocate/panic if spare-capacity assumptions are broken; the current pending-recv queue is pre-reserved and bounded on normal insert, and rollback paths should keep the bound explicit when hardened.

Scheduler And Process Lifecycle

  • kernel/src/sched.rs:55 init_idle. Surface: Process::new_idle().expect(...). Reachability: boot scheduler init. Classification: boot-fatal. Notes: idle creation OOM/mapping failure panics before services run.
  • kernel/src/sched.rs:206-211 block_current_on_cap_enter. Surface: current.expect, assert!, process-table expect. Reachability: cap_enter(min_complete > 0) path. Classification: untrusted-input reachable, internal invariant. Notes: userspace can request blocking, but these unwraps assert scheduler state, not user values; future process lifecycle/spawn changes increase this exposure.
  • kernel/src/sched.rs:252-264 capos_block_current_syscall. Surface: current.expect, idle assert!, table expect, panic! if not blocked. Reachability: blocking syscall continuation. Classification: untrusted-input reachable, internal invariant. Notes: triggered after cap_enter chooses to block; the user controls the request, but a panic requires kernel state inconsistency.
  • kernel/src/sched.rs:279 and kernel/src/sched.rs:376. Surface: run_queue references missing-process expect. Reachability: scheduling after queue selection. Classification: trusted-internal now; future spawn/lifecycle sensitive. Notes: a stale run-queue PID panics; dynamic spawn/exit must preserve run-queue/process-table invariants.
  • kernel/src/sched.rs:407-422 exit_current. Surface: current.expect, idle assert!, remove(...).unwrap(), next-process unwrap(). Reachability: ambient exit syscall and future process exit. Classification: untrusted-input reachable, internal invariant. Notes: any service can exit itself; a panic requires scheduler corruption or idle misuse, but future spawn/process APIs should harden this boundary.
  • kernel/src/sched.rs:468-475 current_ring_and_caps. Surface: current.expect, process-table expect. Reachability: cap_enter flush path. Classification: untrusted-input reachable, internal invariant. Notes: the user can call cap_enter; a panic requires no current process or a missing table entry.
  • kernel/src/sched.rs:493-517 start. Surface: initial run-queue expect, process-table unwrap, CR3 expect. Reachability: boot service start. Classification: boot-fatal. Notes: a manifest with zero services is rejected earlier, and process creation errors out; panics indicate scheduler/CR3 invariant breakage.
  • kernel/src/arch/x86_64/context.rs:59-60. Surface: CR3 expect("invalid CR3 from scheduler"). Reachability: timer interrupt scheduling. Classification: trusted-internal; future lifecycle sensitive. Notes: the scheduler should only return page-aligned CR3s from AddressSpace.

Boot Platform And Memory Setup

| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
| kernel/src/main.rs:36 | `assert!(BASE_REVISION.is_supported())` | Limine boot protocol | boot-fatal | Platform/bootloader contract check. |
| kernel/src/main.rs:41-44 | memory-map and HHDM `expect` | Limine boot protocol | boot-fatal | Missing bootloader responses halt before untrusted services. |
| kernel/src/main.rs:74 | `cap::init().expect(...)` | Kernel cap table bootstrap | boot-fatal | Fails on kernel-internal cap-table exhaustion. |
| kernel/src/mem/frame.rs:39 | frame-bitmap region `expect` | Boot memory map | boot-fatal | Bad or too-small memory map halts. |
| kernel/src/mem/frame.rs:115 | `free_frame` uses `try_free_frame(...).expect(...)` | Kernel-owned frame teardown | trusted-internal | Capability handlers use `try_free_frame`; this panic surface is for kernel-owned frames and rollback/Drop paths. |
| kernel/src/mem/frame.rs:139 | `assert!(offset != 0)` | HHDM cache use before frame init | trusted-internal | Initialization-order invariant. |
| kernel/src/mem/heap.rs:11 | heap allocation `expect` | Boot heap init | boot-fatal | Fails if the frame allocator cannot provide the fixed kernel heap. |
| kernel/src/mem/paging.rs:32, kernel/src/mem/paging.rs:58, kernel/src/mem/paging.rs:70 | page-alignment `.unwrap()` / paging initialized `assert!` | Kernel frame/page-table internals | trusted-internal | `frame::alloc_frame` returns page-aligned addresses. |
| kernel/src/mem/paging.rs:106, kernel/src/mem/paging.rs:189, kernel/src/mem/paging.rs:194 | kernel PML4/map remap `expect`s | Kernel page-table setup | boot-fatal | Assumes kernel image is mapped in bootloader tables and enough frames exist. |
| kernel/src/arch/x86_64/syscall.rs:49 | STAR selector `expect` | Syscall init | boot-fatal | GDT selector layout invariant. |
| kernel/src/sched.rs:299, kernel/src/sched.rs:450, kernel/src/sched.rs:517 | CR3 `expect("invalid CR3")` | Context switch/exit/start | trusted-internal; future lifecycle sensitive | Scheduler should only carry page-aligned address-space roots. |

Verification Notes

Inventory commands run:

```shell
rg -n "\b(panic!|assert!|assert_eq!|assert_ne!|debug_assert!|debug_assert_eq!|debug_assert_ne!|unwrap\(|expect\(|todo!|unreachable!)" kernel capos-lib capos-config init demos tools schema system.cue Makefile docs -g '*.rs' -g '*.cue' -g '*.md' -g 'Makefile'
rg -n "\b(panic!|assert!|assert_eq!|assert_ne!|debug_assert!|debug_assert_eq!|debug_assert_ne!|unwrap\(|expect\(|todo!|unreachable!)" kernel/src capos-lib/src capos-config/src init/src demos/capos-demo-support/src demos/*/src tools/mkmanifest/src -g '*.rs'
```

Code tests were not run for this doc-only inventory because the write scope is limited to docs/panic-surface-inventory.md and no Rust code, schema, manifest, or build configuration changed.

Trusted Build Inputs

This inventory covers the build inputs currently trusted by the capOS boot image, generated bindings, host tooling, and verification paths. It started as the S.10.0 inventory and now also records the S.10.2 generated-code drift check and the S.10.3 dependency policy.

Summary

| Input | Current source | Pinning status | Drift-review status |
|---|---|---|---|
| Limine bootloader binaries | Makefile:5-10, Makefile:34-49 | Git commit and selected binary SHA-256 values are pinned. | make limine-verify fails if the checked-out commit or copied bootloader artifacts drift. |
| Rust toolchain | rust-toolchain.toml:1-3 | Floating nightly channel with target triples only. | No repo-visible date, hash, or installed component audit. The current local resolver reported rustc 1.96.0-nightly (2972b5e59 2026-04-03). |
| Workspace cargo dependencies | Cargo.toml:1-9, crate Cargo.toml files, Cargo.lock | Lockfile pins exact crate versions and checksums for the root workspace. Manifest requirements remain semver ranges. | make dependency-policy-check runs cargo deny check plus cargo audit against the root workspace and lockfile in CI. |
| Standalone cargo dependencies | init/Cargo.lock, demos/Cargo.lock, tools/mkmanifest/Cargo.lock, capos-rt/Cargo.lock, fuzz/Cargo.lock | Each standalone workspace has its own lockfile. | make dependency-policy-check runs the shared deny/audit baseline against every standalone manifest and lockfile. Cross-workspace version drift remains review-visible and intentional where lockfiles differ. |
| Cap’n Proto compiler | Makefile:12-80, kernel/build.rs, capos-config/build.rs, tools/check-generated-capnp.sh | Official capnproto-c++-1.2.0.tar.gz source tarball URL, version, and SHA-256 are pinned in Makefile; make capnp-ensure builds a shared .capos-tools/capnp/1.2.0/bin/capnp under the git common-dir parent so linked worktrees reuse it. The build rule patches the distributed CLI version placeholder to the pinned version before compiling. | Build scripts default to the clone-shared pinned path and reject CAPOS_CAPNP when it points elsewhere. Make targets export the pinned path and CI persists it through $GITHUB_ENV. make generated-code-check verifies both the exact compiler path and Cap'n Proto version 1.2.0 before regenerating bindings through Cargo. |
| Cap’n Proto Rust runtime/codegen crates | capos-config/Cargo.toml:9, capos-config/Cargo.toml:15, kernel/Cargo.toml:12, kernel/Cargo.toml:21, Cargo.lock | Cargo manifests use exact capnp = "=0.25.4" and capnpc = "=0.25.3" requirements where declared; lockfiles pin exact crate versions and checksums. | S.10.3 now requires dependency-class and no_std review before these changes are accepted. |
| Generated capnp bindings | capos-config/src/lib.rs:10-12, kernel/src/main.rs:15-18, tools/generated/capos_capnp.rs, tools/check-generated-capnp.sh | Generated into Cargo OUT_DIR; the expected patched output is checked in under tools/generated/. | make generated-code-check regenerates both crate outputs through Cargo and fails if either output differs from the checked-in baseline. |
| no_std patching of generated bindings | kernel/build.rs:13-30, capos-config/build.rs:10-25, tools/check-generated-capnp.sh | Patch anchors are asserted in both build scripts. | make generated-code-check verifies the patched output contains the expected no_std imports for both crates. |
| Linker script build scripts | kernel/build.rs:2, init/build.rs:2-5, demos/*/build.rs | Source-controlled scripts and linker scripts. | Build rerun boundaries are explicit; generated link args are not independently audited. |
| CUE manifest compiler | Makefile:13-91, tools/mkmanifest/src/main.rs:65-130, tools/mkmanifest/src/lib.rs:30-80, .github/workflows/ci.yml | make cue-ensure installs cuelang.org/go/cmd/cue pinned to v0.16.0 into the clone-shared .capos-tools/cue/0.16.0/bin/cue path. | Make exports CAPOS_CUE to tools/mkmanifest, and CI records that exact path through $GITHUB_ENV before QEMU smoke. mkmanifest also derives the same clone-shared path, rejects missing or non-canonical CAPOS_CUE, and checks cue version v0.16.0 before export. |
| mdBook documentation tools | Makefile, book.toml | GitHub release assets for mdBook v0.5.0 and mdbook-mermaid v0.17.0 are pinned by version and SHA-256 under the clone-shared .capos-tools path. | make docs and make cloudflare-pages-build verify the tarball checksums and executable versions, refresh the Mermaid assets, and build target/docs-site. |
| QEMU and firmware | Makefile:67-83 | Host-installed qemu-system-x86_64; OVMF path is hard-coded for UEFI. | No repo-visible version or firmware checksum. Current local host reported QEMU 10.2.2. |
| ISO and host filesystem tools | Makefile:51-65 | Host-installed xorriso, sha256sum, git, make, shell utilities. | No version capture except ad hoc local inspection. |
| Boot manifest and embedded binaries | system.cue:1-144, tools/mkmanifest/src/lib.rs:82-115, Makefile:28-29, Makefile:51-65 | Source manifest is checked in; embedded ELF payloads are build artifacts. | Manifest validation checks references and path containment, but final manifest.bin is generated and not checksum-recorded. |
| Build downloads | Makefile, Cargo lockfiles, rust-toolchain.toml | Limine and documentation tool tarballs are explicitly fetched; Cargo, Go, and rustup downloads are implicit when caches/toolchains are absent. | Limine artifacts and documentation tool tarballs are verified. Cargo, Go, and rustup downloads rely on upstream tooling and lockfiles, with no repo policy. |

S.10.3 Dependency Policy

Dependency changes are accepted only if they satisfy this policy and are recorded in the owning task checklist.

Dependency classes

Use these classes when reviewing a dependency change:

  • Kernel-critical no_std: crates used directly by kernel, capos-lib, and capos-config.
  • Userspace-runtime no_std: crates used by init, demos, and capos-rt.
  • Host/build: crates used by tools/*, build.rs helpers, and generated output pipelines.
  • Test/fuzz/dev: crates gated by dev-dependencies or target-specific for fuzz/proptests/smoke support.

Required pre-merge criteria

For any added dependency (or bump in any class):

  1. Manifest and features are explicit. Dependency entries must include explicit feature choices; avoid default-features = true unless justified.
  2. No_std compatibility is proven for no_std classes. Kernel-critical and userspace-runtime dependencies must compile in a #![no_std] mode with alloc where expected. cargo build -p <crate> --target x86_64-unknown-none must succeed for every kernel/no_std crate affected.
  3. Security policy checks run and pass. CI-equivalent checks for the touched workspace are required through make dependency-policy-check, which runs cargo deny check on every Cargo manifest and cargo audit on every lockfile.
  4. Dependency class change is justified in review. PR text must include target class, ownership rationale, transitive graph impact, and why the crate is not a transitive replacement for an already-allowed dependency.
  5. Lockfile behavior is explicit. Update only intended lockfiles and record intentional cross-workspace drift in this document if workspace purpose differs.

No_std add/edit checklist

  • Reject crates that require std, OS I/O, or unsupported platform APIs in the dependency path intended for kernel classes.
  • Reject dependencies that re-export broad platform facades or large unsafe surface unless there is a replacement with smaller scope and better audit visibility.
  • Record a license and supply-chain review result (via policy checks) before merge.
  • Confirm no unsafe contract escapes are added without a review surface note in the relevant module.

Standing requirements

  • Add S.10.3 checks to the target branch plan item for any kernel/no_std crate dependency change and document the exact pass command set.
  • Keep lockfile deltas review-visible in normal PR flow; lockfile pinning is the minimum bar, not the gate.
  • Keep transitive drift in sync with the trust class: class-wide divergence across lockfiles requires explicit justification.

Remaining gaps after S.10.3 policy

  • Continue Rust toolchain pinning work (date/hash pin, reproducible host compiler inputs) as a separate build-reproducibility task.
  • Decide whether final ISO/payload hashes become policy-grade inputs in production-hardening stages.

Bootloader and ISO Inputs

The Makefile now pins Limine at commit aad3edd370955449717a334f0289dee10e2c5f01 and verifies these copied artifacts:

| Artifact | Checksum reference |
|---|---|
| limine/limine-bios.sys | Makefile:7 |
| limine/limine-bios-cd.bin | Makefile:8 |
| limine/limine-uefi-cd.bin | Makefile:9 |
| limine/BOOTX64.EFI | Makefile:10 |

make limine-ensure clones https://github.com/limine-bootloader/limine.git only when limine/.git is absent, fetches the pinned commit if needed, checks it out detached, and runs make inside the Limine tree (Makefile:34-40). make limine-verify then checks the repository HEAD and artifact checksums (Makefile:42-49). The ISO copies the kernel, generated manifest.bin, Limine config, and verified Limine artifacts into iso_root/, runs xorriso, then runs limine bios-install (Makefile:51-65).

Remaining reproducibility gap: Limine source is pinned, but the Limine build host compiler and environment are not pinned or recorded.

Rust Toolchain

rust-toolchain.toml specifies:

  • channel = "nightly"
  • targets = ["x86_64-unknown-none", "aarch64-unknown-none"]

This is a floating channel pin, not a reproducible toolchain pin. A future rustup resolution can move the compiler even when the repository is unchanged. The current local host resolved to:

  • rustc 1.96.0-nightly (2972b5e59 2026-04-03)
  • cargo 1.96.0-nightly (888f67534 2026-03-30)
  • host target x86_64-unknown-linux-gnu

The Makefile derives HOST_TARGET from rustc -vV (Makefile:12) and uses that for tools/mkmanifest (Makefile:28-29). Cargo aliases in .cargo/config.toml:4-22 hard-code x86_64-unknown-linux-gnu for host tests.

Remaining reproducibility gap: pin the nightly by date or exact toolchain hash, and record required components. Until then, compiler drift can change codegen, linking, lints, and generated bindings without a repository diff.

Cargo Dependencies

The root workspace members are capos-config, capos-lib, and kernel (Cargo.toml:1-4). init/, demos/, tools/mkmanifest/, and fuzz/ are standalone workspaces with their own lockfiles.

Important direct dependencies and current root-lock resolutions:

| Dependency | Manifest references | Root lock resolution |
|---|---|---|
| capnp | capos-config/Cargo.toml:8, capos-lib/Cargo.toml:7, kernel/Cargo.toml:11 | 0.25.4 in Cargo.lock |
| capnpc | capos-config/Cargo.toml:14, kernel/Cargo.toml:19 | 0.25.3 in Cargo.lock |
| limine crate | kernel/Cargo.toml:7 | 0.6.3 in Cargo.lock |
| spin | kernel/Cargo.toml:8 | 0.9.8 in Cargo.lock |
| x86_64 | kernel/Cargo.toml:9 | 0.15.4 in Cargo.lock |
| linked_list_allocator | kernel/Cargo.toml:10 | 0.10.6 in Cargo.lock |
| loom | capos-config/Cargo.toml:17 | 0.7.2 in Cargo.lock |
| proptest | capos-lib/Cargo.toml:9-10 | 1.11.0 in Cargo.lock |

Standalone lockfile drift observed during this inventory:

| Lockfile | Notable direct/runtime resolution |
|---|---|
| init/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.5 |
| demos/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 |
| tools/mkmanifest/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, serde_json 1.0.149 |
| capos-rt/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 |
| fuzz/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, libfuzzer-sys 0.4.12 |

Cargo lockfiles pin exact crate versions and crates.io checksums, so ordinary crate upgrades are review-visible through lockfile diffs. They do not, by themselves, define whether a dependency is acceptable for kernel/no_std use, whether multiple lockfiles must converge, or whether advisories/licenses block the build.

S.10.3 policy gate:

  • deny.toml defines the shared license, advisory, ban, and source baseline.
  • make dependency-policy-check runs cargo deny check on the root workspace, init, demos, tools/mkmanifest, capos-rt, and fuzz.
  • The same target runs cargo audit --deny warnings on every checked-in lockfile.
  • Local packages are marked publish = false so cargo-deny treats them as private, and local path dependencies include version = "0.1.0" so registry wildcard requirements can remain denied.
  • CI installs pinned cargo-deny 0.19.4 and cargo-audit 0.22.1 and runs the target.

Remaining dependency-policy gap: decide whether standalone lockfiles may intentionally drift from the root lockfile, especially for capnp and allocator crates used by userspace.

Cap’n Proto Compiler, Runtime, and Generated Bindings

The trusted Cap’n Proto inputs are:

  • schema/capos.capnp, the source schema.
  • Repo-local pinned capnp, invoked through the capnpc Rust build dependency via CAPOS_CAPNP.
  • capnp runtime crate with default-features = false and alloc.
  • capnpc codegen crate.
  • Generated capos_capnp.rs written to Cargo OUT_DIR.
  • Local no_std patching applied after generation.

kernel/build.rs and capos-config/build.rs both run capnpc::CompilerCommand over ../schema/capos.capnp, then read the generated capos_capnp.rs, assert that the expected #![allow(unused_variables)] anchor is present, and inject:

```rust
use ::alloc::boxed::Box;
use ::alloc::string::ToString;
```

The generated code used by builds is included from OUT_DIR in capos-config/src/lib.rs:10-12 and kernel/src/main.rs:15-18. The expected patched output is checked in as tools/generated/capos_capnp.rs, so schema, compiler, capnpc crate, and patch-output changes must update that baseline and become review-visible as a source diff.
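The anchor-assert-and-inject step can be sketched as a pure string transform. This is a hypothetical simplification: `patch_generated`, `ANCHOR`, and `PATCH` are illustrative names, and the real build scripts read and rewrite files under OUT_DIR rather than operating on in-memory strings:

```rust
// Hypothetical sketch of the no_std patch step: require the capnpc anchor,
// then inject the alloc imports immediately after it. Failing loudly when
// the anchor is missing is what keeps a capnpc upgrade from silently
// producing unpatched (non-no_std) bindings.
const ANCHOR: &str = "#![allow(unused_variables)]";
const PATCH: &str = "use ::alloc::boxed::Box;\nuse ::alloc::string::ToString;\n";

fn patch_generated(source: &str) -> Result<String, String> {
    let pos = source
        .find(ANCHOR)
        .ok_or_else(|| format!("expected anchor {:?} in generated code", ANCHOR))?;
    let insert_at = pos + ANCHOR.len();
    let mut out = String::with_capacity(source.len() + PATCH.len() + 1);
    out.push_str(&source[..insert_at]);
    out.push('\n');
    out.push_str(PATCH);
    out.push_str(&source[insert_at..]);
    Ok(out)
}
```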

S.10.2 generated-code drift check:

  • make generated-code-check runs tools/check-generated-capnp.sh.
  • The script invokes the actual Cargo build-script path for capos-config and capos-kernel in an isolated target directory, so it checks the generated artifacts that those crates would include from OUT_DIR.
  • The script verifies that each patched file still contains the capnpc anchor plus the local no_std patch imports, compares the two crate outputs byte-for-byte, and then compares both outputs against tools/generated/capos_capnp.rs.
  • Any intentional schema/codegen/patch change must update the checked-in baseline in the same review, making generated output drift review-visible.
  • make check runs fmt-check plus generated-code-check for a single local or CI entry point.
  • Current pinned compiler source is capnproto-c++-1.2.0.tar.gz from https://capnproto.org/ with SHA-256 ed00e44ecbbda5186bc78a41ba64a8dc4a861b5f8d4e822959b0144ae6fd42ef. The checked-in tools/generated/capos_capnp.rs baseline must be regenerated with that compiler when schema or codegen behavior intentionally changes. The current pinned baseline SHA-256 is 224b4ec2296f800bd577b75cfbd679ebb78e7aa2f813ad9893061f4867c9dd3d.
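The comparison core of the drift check — both crate outputs must match each other and the checked-in baseline, byte for byte — can be sketched in a few lines. This is a hypothetical Rust rendering; the actual check is the shell script tools/check-generated-capnp.sh:

```rust
// Hypothetical sketch of the S.10.2 comparison step. The two crates'
// generated outputs must be identical, and both must equal the checked-in
// baseline; any mismatch means schema, compiler, capnpc crate, or patch
// behavior drifted without a baseline update.
fn check_drift(config_out: &[u8], kernel_out: &[u8], baseline: &[u8]) -> Result<(), String> {
    if config_out != kernel_out {
        return Err("capos-config and kernel generated outputs diverged".into());
    }
    if config_out != baseline {
        return Err("generated output no longer matches tools/generated/capos_capnp.rs".into());
    }
    Ok(())
}
```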

Remaining gaps for S.10.3:

  • The no_std patching logic still lives in both build scripts. The baseline and pairwise output comparison catch divergent results, but a future cleanup could move the patch helper into shared source to reduce duplication.

Cargo Build Scripts

Build scripts currently do these trusted operations:

| Script | Behavior |
|---|---|
| kernel/build.rs | Watches kernel/linker-x86_64.ld, schema/capos.capnp, and itself; generates and patches capnp bindings. Checked by make generated-code-check. |
| capos-config/build.rs | Watches schema/capos.capnp; generates and patches capnp bindings. Checked by make generated-code-check. |
| init/build.rs | Emits a linker script argument for init/linker.ld. |
| demos/*/build.rs | Emits a linker script argument for demos/linker.ld. |

The linker build scripts derive CARGO_MANIFEST_DIR from Cargo and only emit link arguments plus rerun directives. The capnp build scripts read and rewrite generated code under OUT_DIR. None of these scripts fetch network resources.

S.10.2 coverage: make generated-code-check exercises both capnp build scripts through Cargo, validates the patched generated files, and fails if the two crate outputs drift apart or no longer match the checked-in generated baseline.

Manifest, Embedded Binaries, and Downloaded Artifacts

system.cue declares named binaries and services. Makefile:54-55 builds manifest.bin by running tools/mkmanifest on the host. mkmanifest runs:

  1. Resolve the clone-shared pinned CUE compiler from git state, reject missing or mismatched CAPOS_CUE, check cue version v0.16.0, then run cue export system.cue --out json (tools/mkmanifest/src/main.rs:65-128).
  2. JSON-to-CueValue conversion and manifest validation (tools/mkmanifest/src/main.rs:13-23).
  3. Binary embedding from relative paths (tools/mkmanifest/src/lib.rs:135-180).
  4. Binary-reference validation and Cap’n Proto serialization (tools/mkmanifest/src/main.rs:37-49).

Path handling rejects absolute paths, parent traversal, non-normal components, and canonicalized paths that escape the manifest directory (tools/mkmanifest/src/lib.rs:182-217). The generated manifest.bin is copied into the ISO as /boot/manifest.bin (Makefile:117) and loaded by Limine via limine.conf:5.
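The rejection rules for embedded-binary paths can be sketched with `std::path::Component`. This is a hypothetical helper mirroring the policy described, not the exact mkmanifest code; in particular, the real tool additionally canonicalizes the joined path and verifies it stays inside the manifest directory:

```rust
use std::path::{Component, Path};

// Hypothetical sketch of the embed-path policy: accept only non-empty,
// relative paths built entirely from normal components. RootDir/Prefix
// (absolute paths), ParentDir (`..` traversal), and CurDir (`.`) all fail
// closed. Canonicalization-based containment is a separate, later check.
fn embed_path_allowed(p: &Path) -> bool {
    if p.as_os_str().is_empty() {
        return false;
    }
    p.components().all(|c| matches!(c, Component::Normal(_)))
}
```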

Downloaded or generated artifacts in the current build:

ArtifactProducerPinning/drift status
limine/ checkoutgit clone/git fetch in Makefile:34-40Commit-pinned and artifact-verified.
Cargo registry cratescargo build, cargo run, tests, fuzzLockfile-pinned checksums plus CI-enforced deny/audit checks through make dependency-policy-check.
Rust toolchain and targetsrustup from rust-toolchain.toml when absentFloating nightly channel.
target/ kernel and host artifactsCargoGenerated, not checked in.
init/target/ and demos/target/ ELFsCargo standalone buildsGenerated, embedded into manifest.bin; no final payload checksums recorded in source.
manifest.bintools/mkmanifestGenerated from system.cue plus ELF payloads; not checked in.
iso_root/ and capos.isoMakefile, xorriso, Limine installerGenerated and gitignored; Limine inputs verified, full ISO checksum not source-recorded.

Remaining gaps for S.10.2/S.10.3:

  • Decide whether CI should record or compare hashes for manifest.bin, embedded ELF payloads, or the final ISO for reproducible-build tracking.
  • Pin or record xorriso, qemu-system-x86_64, OVMF firmware, and other host tools used by build and boot verification with the same strictness as capnp and cue.
  • Decide whether CI should record the pinned cue export JSON or final manifest.bin bytes if manifest reproducibility becomes release-critical.

Host Tools

Current local host versions observed during this inventory:

| Tool | Observed version | Build role |
|---|---|---|
| capnp | 1.2.0 | Repo-local schema compiler built by make capnp-ensure from a SHA-256-pinned official source tarball into the shared .capos-tools cache for this clone. |
| cue | v0.16.0 | Repo-local manifest compiler installed by make cue-ensure into the shared .capos-tools cache for this clone. |
| qemu-system-x86_64 | 10.2.2 | Boot verification via make run and make run-uefi. |
| xorriso | 1.5.8 | ISO generation. |
| make | 4.4.1 | Build orchestration. |
| git | 2.53.0 | Limine checkout/fetch and review workflow. |

These are environment observations, not repository pins. make run-uefi also trusts /usr/share/edk2/x64/OVMF.4m.fd (Makefile:82-83) without a checksum.

Remaining gap for S.10.3: decide the minimum supported host tool versions and whether they are enforced by CI, a container/devshell, or explicit preflight checks.

Verification Used for This Inventory

This was a documentation-only inventory. Code tests and QEMU boot were not run because no source, build, runtime, or generated-code behavior was changed.

Scoped read-only verification commands used:

  • git status --short --branch
  • rg -n "S\\.10|trusted|supply|Limine|limine|capnp|capnpc|QEMU|qemu|download|curl|git clone|wget|build\\.rs|rust-toolchain|Cargo\\.lock" ...
  • rg --files
  • cargo metadata --locked --format-version 1 --no-deps
  • rg -n '^name = |^version = |^checksum = ' Cargo.lock init/Cargo.lock demos/Cargo.lock tools/mkmanifest/Cargo.lock fuzz/Cargo.lock
  • command -v rustc cargo capnp cue qemu-system-x86_64 xorriso sha256sum git make
  • rustc -Vv, cargo -V, capnp --version, cue version, qemu-system-x86_64 --version, xorriso -version, make --version, git --version

DMA Isolation Design

S.11 gates PCI, virtio, and later userspace device-driver work on an explicit DMA authority model. The immediate goal is narrow: let the kernel bring up a QEMU virtio-net smoke without creating a user-visible raw physical-memory escape hatch.

Short-Term Decision

Use kernel-owned bounce buffers for the first in-kernel QEMU virtio-net smoke.

The kernel allocates DMA-capable pages from its own frame allocator, owns the virtqueue descriptor tables and packet buffers, programs the device with the corresponding physical addresses, and copies packet payloads between those buffers and the networking stack. No userspace process receives a DMA buffer capability, a physical address, a virtqueue pointer, or a BAR mapping for this smoke.
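The bounce-buffer discipline — the kernel owns the pages, only kernel-chosen physical addresses are programmed into the device, and payloads are copied rather than shared — can be sketched as follows. All types and names here are hypothetical stand-ins; the real kernel allocates DMA-capable frames from its frame allocator and this sketch uses host-side Vec for illustration:

```rust
// Hypothetical sketch of a kernel-owned bounce buffer. `phys` is the only
// address the device ever sees; `data` is the kernel's mapped view of the
// same pages. Userspace never holds either.
struct BounceBuffer {
    phys: u64,     // physical address programmed into the virtqueue descriptor
    data: Vec<u8>, // kernel-mapped view of the DMA pages
}

impl BounceBuffer {
    // Stage an outbound packet: copy the payload into the kernel-owned
    // buffer and return the device-visible address. The caller's memory is
    // never exposed to the device, and oversized payloads fail closed.
    fn stage_tx(&mut self, payload: &[u8]) -> Result<u64, &'static str> {
        if payload.len() > self.data.len() {
            return Err("payload exceeds bounce buffer");
        }
        self.data[..payload.len()].copy_from_slice(payload);
        Ok(self.phys) // only this kernel-chosen address reaches the device
    }
}
```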

This is deliberately conservative:

  • It works before ACPI/DMAR or AMD-Vi parsing, IOMMU page-table management, MSI/MSI-X routing, and userspace driver lifecycle supervision exist.
  • It keeps all physical-address programming inside the kernel, where the same code that allocates the frames also bounds the descriptors that reference them.
  • It does not make the current FrameAllocator capability part of the DMA path. FrameAllocator can expose raw frames today and is already tracked in REVIEW_FINDINGS.md; DMA must not build new untrusted-driver semantics on that interface.
  • It gives the smoke a disposable implementation path. When NIC or block drivers move to userspace, bounce-buffer authority becomes a typed DMAPool object instead of an ad hoc physical-address grant.

An IOMMU-backed DMA-domain model remains the target for direct device access from mutually untrusted userspace drivers, but it is not a prerequisite for the first QEMU smoke. Without an IOMMU, a malicious bus-mastering device can still DMA to arbitrary RAM at the hardware level; the short-term smoke assumes QEMU-provided virtio hardware and protects against confused or untrusted userspace, not hostile hardware.

Authority Model

Device authority is split into three independent capabilities:

  • DMAPool: authority to allocate, expose, and revoke device-visible memory within a kernel-owned physical range or IOMMU domain.
  • DeviceMmio: authority to map and access one device’s register windows.
  • Interrupt: authority to wait for and acknowledge one interrupt source.

Holding one of these capabilities never implies the others. A driver needs all three for a normal device, but the kernel and init can grant, revoke, and audit them separately.

DMAPool Invariants

DMAPool is the only future userspace-facing authority that may cause a device-visible DMA address to exist.

  • Authority: A holder may allocate buffers only from the pool object it was granted. It may not request arbitrary physical frames, import caller virtual memory by address, or derive another pool.
  • Physical range: Every exported device address must resolve to pages owned by the pool. The kernel records the allowed host-physical page set and validates every descriptor mapping against that set before a device can use it. If an IOMMU domain backs the pool, the exported address is an IOVA, not raw host physical memory.
  • Ownership: Each DMA buffer has one pool owner, one device-domain owner, and explicit CPU mappings. Sharing a buffer with another process requires a later typed memory-object transfer; copying packet data is the default until that object exists.
  • No raw grants: Userspace never receives an unrestricted host-physical address. A driver may receive an opaque DMA handle or an IOVA meaningful only to its DMAPool/device domain. It cannot turn that value into access to unrelated RAM.
  • Bounds: Buffer length, alignment, segment count, and queue depth are bounded by the pool. Descriptor chains that point outside an allocated buffer, wrap arithmetic, exceed device limits, or reference freed buffers fail closed before doorbell writes.
  • Revocation: Revoking the pool first quiesces the device path using it, prevents new descriptors, waits for or cancels in-flight descriptors, then removes IOMMU mappings or invalidates bounce-buffer handles before freeing pages.
  • Reset: If in-flight DMA cannot be proven stopped, revocation escalates to device reset through the owning device object before pages are reused.
  • Residual state: Pages returned from a pool are zeroed or otherwise scrubbed before reuse by a different owner. Receive buffers are treated as device-written untrusted input until validated by the driver or stack.
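The bounds invariant above hinges on overflow-checked arithmetic so that wrapped descriptor lengths fail closed. A minimal sketch, with hypothetical types standing in for the pool's real buffer records:

```rust
// Hypothetical sketch of the DMAPool bounds rule: a descriptor segment is
// accepted only if [offset, offset + len) lies entirely inside its allocated
// buffer. `checked_add` rejects wrap-around instead of letting it alias
// unrelated pool memory; zero-length segments are rejected outright.
struct DmaBuffer {
    len: u64, // allocated length of this buffer within the pool
}

fn descriptor_in_bounds(buf: &DmaBuffer, offset: u64, len: u64) -> bool {
    if len == 0 {
        return false; // degenerate descriptor: fail closed
    }
    match offset.checked_add(len) {
        Some(end) => end <= buf.len, // no wrap, fully inside the buffer
        None => false,               // arithmetic wrapped: fail closed
    }
}
```

A real pool would layer segment-count, alignment, and queue-depth limits on top of this per-segment check before any doorbell write.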

For the in-kernel QEMU smoke, the kernel is the only DMAPool holder. The same invariants apply internally even though no userspace capability object is exposed yet.

DeviceMmio Invariants

DeviceMmio is register authority, not memory authority.

  • Authority: A holder may map only BARs or subranges recorded in the claimed device object. It may not map PCI config space globally, another function’s BAR, RAM, ROM, or synthetic kernel pages.
  • Physical range: Each mapping is bounded to the BAR’s decoded physical range, page-rounded by the kernel, and tagged as device memory with cache attributes appropriate for MMIO. Partial BAR grants must preserve page-level isolation; otherwise the grant must cover the whole page-aligned register window and be treated as that much authority.
  • Ownership: At most one mutable driver owner controls a device function’s MMIO at a time. Management capabilities may inspect topology, but register writes require the claimed DeviceMmio object.
  • No DMA implication: Mapping registers does not grant any DMA buffer, frame allocation, interrupt, or config-space authority. Doorbell writes are accepted only as effects of register access; descriptor validity is enforced by DMAPool before queues are made visible to the device.
  • Revocation: Revocation unmaps the driver’s register pages, marks the device object unavailable for new calls, and invalidates outstanding MMIO handles. Stale mappings or calls fail closed.
  • Reset: Revoking the final mutable DeviceMmio owner resets or disables the device unless a higher-level device manager explicitly transfers ownership without exposing it to an untrusted holder.

Interrupt Invariants

Interrupt is event authority for one routed source.

  • Authority: A holder may wait for, mask/unmask where supported, and acknowledge only its assigned vector, line, or MSI/MSI-X table entry. It may not reprogram arbitrary interrupt controllers or claim another source.
  • Ownership: Each interrupt source has one delivery owner at a time. Shared legacy lines must be represented as a kernel-demultiplexed object with explicit device membership, not as ambient access to the whole line.
  • Range: The capability records the hardware source, vector, trigger mode, polarity, and target CPU/routing state. User-visible operations are checked against that record.
  • Revocation: Revocation masks or detaches the source, drains pending notifications for the old holder, invalidates waiters, and prevents stale acknowledgements from affecting a new owner.
  • Reset: If the source cannot be detached cleanly, the owning device is reset or disabled before the interrupt is reassigned.
  • No MMIO or DMA implication: Interrupt delivery does not grant register access, DMA buffers, or packet memory.

Revocation Ordering

Device revocation must follow a fixed order:

  1. Stop new submissions by invalidating the driver’s user-visible handles.
  2. Revoke MMIO write authority by write-blocking or unmapping BAR pages, or by disabling the device before any DMA teardown starts.
  3. Mask or detach interrupts.
  4. Quiesce virtqueues or device command queues.
  5. Reset or disable the device if in-flight DMA cannot be accounted for.
  6. Remove IOMMU mappings or invalidate bounce-buffer handles.
  7. Scrub and free DMA pages.

This order prevents a stale driver from racing revocation with doorbell writes, interrupt acknowledgement, or descriptor reuse. Logical handle invalidation is not sufficient while a BAR remains mapped; register-write authority must be removed or the device must be disabled before descriptor or DMA-buffer ownership is reclaimed.
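One way to make the ordering machine-checkable is to encode the seven steps as ordered stages and permit only forward transitions. This is a hypothetical sketch (stage names are paraphrases of the list above; the real kernel has no such type yet), and it allows skipping stages, which a real implementation might forbid:

```rust
// Hypothetical sketch: teardown stages in the fixed revocation order.
// Deriving PartialOrd over a fieldless enum orders variants by declaration,
// so "MMIO before DMA teardown" becomes a comparable invariant rather than
// a comment.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
enum RevokeStage {
    HandlesInvalidated, // 1. stop new submissions
    MmioRevoked,        // 2. remove register-write authority
    InterruptsMasked,   // 3. mask or detach interrupts
    QueuesQuiesced,     // 4. quiesce virtqueues/command queues
    DeviceReset,        // 5. reset/disable if DMA unaccounted for
    MappingsRemoved,    // 6. remove IOMMU mappings / invalidate handles
    PagesScrubbed,      // 7. scrub and free DMA pages
}

// Only monotonic forward transitions are accepted; a path that tries to
// free pages before MMIO authority is gone gets an error, not a race.
fn advance(current: RevokeStage, next: RevokeStage) -> Result<RevokeStage, &'static str> {
    if next > current {
        Ok(next)
    } else {
        Err("revocation stages must advance in order")
    }
}
```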

Future Userspace-Driver Transition Criteria

Moving NIC or block drivers out of the kernel is gated by S.11.2. The gate is only open when all rows below are implemented and demonstrated.

| Gate item | Required state | Must-have proof |
|---|---|---|
| S.11.2.0 DMA-objected buffers | DMAPool owns every driver-visible DMA mapping. | A driver receives opaque buffer handles or IOVA-only values; no path hands out raw host physical addresses. |
| S.11.2.1 Bounds checks | Allocation, descriptor chain length, alignment, segment length, and ring depth are bounded and constant-time validated before ring submission. | Ring submissions fail closed on overflow, wrap, stale-handle, and freed-handle reuse attempts. |
| S.11.2.2 Explicit remap/ownership | DeviceMmio can only grant claimed BAR pages; cache attributes and write policy are enforced. | Driver cannot access unclaimed BARs, ROM, RAM pages, config-space globals, or stale mappings after revoke. |
| S.11.2.3 Interrupt correctness | Interrupt owns exactly one logical source at a time and drains/waits only for that source. | Reassigning an owner invalidates old waiters and masks or detaches the source first. |
| S.11.2.4 Quiesce + reset contract | Device manager can force reset/disable on failed revoke or teardown. | No in-flight descriptor may continue touching freed buffers after driver removal. |
| S.11.2.5 Process lifecycle | Capability release, process exit, and process-spawn cleanup paths cannot leak DMA pages, MMIO, or interrupt ownership. | Crash-path teardown removes holds and invalidates user-visible handles before page free. |
| S.11.2.6 Isolation and accounting | S.9 quota and authority ledgers include DMA, MMIO, and interrupt hold edges. | A malicious or buggy driver cannot consume more than its allocated authority budget. |
| S.11.2.7 Hostile-smoke coverage | QEMU/CI smokes cover stale handles, descriptor abuse, revoke races, and exit-under-dma. | Smoke output has explicit closed-case proof lines for each failure mode above. |

For each row, the transition requires an owner, implementation notes, and a CI-backed verification path. Until all rows pass, Phase 4.2 NIC/block drivers remain in-kernel for functionality, and only kernel-mapped bounce-buffer mode is allowed for prototype DMA.

S.11.2 Decision Record

S.11.2 is not complete until the kernel has a dedicated device manager object model that can produce, transfer, and revoke DMAPool, DeviceMmio, and Interrupt in a single ownership transaction for a driver process.

Current status: transition remains blocked pending implementation of the conditions above.

S.9 Design: Authority Graph and Resource Accounting for Transfer

This document defines the concrete S.9 design gate for:

  • WORKPLAN 3.6 capability transfer (xfer_cap_count, copy/move, rollback)
  • WORKPLAN 5.2 ProcessSpawner prerequisites (spawn quotas and result-cap insertion)

S.9 is complete when this design contract is concrete enough to guide implementation. The invariants and acceptance criteria below are implementation gates for later work in 3.6/5.2/S.8/S.12, not requirements for declaring the S.9 design artifact complete.

1. Authority Graph Model

Authority is modeled as a directed multigraph:

  • Nodes:
    • Process(Pid)
    • Object(ObjectId) (kernel object identity, independent of per-process CapId)
  • Edges:
    • Hold(Pid -> ObjectId) with metadata:
      • cap_id (table-local handle)
      • interface_id
      • badge
      • transfer_mode (copy, move, non_transferable)
      • origin (kernel, spawn_grant, ipc_transfer, result_cap)

Security invariant A1: all authority is represented by Hold edges; no operation can create object authority outside this graph.

Security invariant A2: each process mutates only its own CapTable edges except through explicit transfer/spawn transactions validated by the kernel.

Security invariant A3: for every live Hold edge there is exactly one cap_id slot in one process table referencing the object generation.
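The node/edge model and invariant A3 can be sketched directly in Rust. The types below mirror the prose (cap_id, interface_id, badge, transfer_mode, origin) but are illustrative stand-ins, not the kernel's actual definitions:

```rust
// Sketch of a Hold edge in the authority multigraph. A Hold connects a
// Process(Pid) node to an Object(ObjectId) node and carries the metadata
// listed above.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TransferMode { Copy, Move, NonTransferable }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Origin { Kernel, SpawnGrant, IpcTransfer, ResultCap }

#[derive(Clone, Debug)]
struct Hold {
    pid: u32,          // edge source: Process(Pid)
    object_id: u64,    // edge target: Object(ObjectId)
    cap_id: u32,       // table-local handle
    interface_id: u64,
    badge: u64,
    transfer_mode: TransferMode,
    origin: Origin,
}

/// Invariant A3 over a set of live edges: each (pid, cap_id) slot appears
/// at most once, i.e. one table slot per live Hold edge.
fn a3_holds(edges: &[Hold]) -> bool {
    let mut seen = std::collections::HashSet::new();
    edges.iter().all(|h| seen.insert((h.pid, h.cap_id)))
}
```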

2. Per-Process Resource Ledger and Quotas

Each process owns a kernel-maintained ResourceLedger with hard limits. Enforcement is fail-closed at reservation time (before side effects).

ResourceLedger {
  cap_slots_used / cap_slots_max
  endpoint_queue_used / endpoint_queue_max
  outstanding_calls_used / outstanding_calls_max
  scratch_bytes_used / scratch_bytes_max
  frame_grant_pages_used / frame_grant_pages_max
  log_bytes_window_used / log_bytes_per_window (token bucket)
  cpu_time_us_window_used / cpu_budget_us_per_window (token bucket)
}

Initial quota profile for Stage 6/5.2 bring-up (tunable by kernel config):

  • cap_slots_max: 256
  • endpoint_queue_max: 128 messages
  • outstanding_calls_max: 64
  • scratch_bytes_max: 256 KiB
  • frame_grant_pages_max: 4096 pages (16 MiB at 4 KiB pages)
  • log_bytes_per_window: 64 KiB/sec with 256 KiB burst
  • cpu_budget_us_per_window: 10,000 us per 100,000 us window

Security invariant Q1: no counter may exceed its max.

Security invariant Q2: every resource reservation has a matched release on all success, error, timeout, process-exit, and rollback paths.

Security invariant Q3: quota checks for transfer/spawn happen before mutating sender or receiver capability state.
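The fail-closed reservation discipline in Q1-Q3 reduces to one rule per counter: check before mutate, and never mutate on failure. A minimal sketch for a single ledger counter (the struct is illustrative, not the kernel's ResourceLedger):

```rust
// One used/max counter pair with fail-closed reservation semantics.

struct Counter {
    used: u64,
    max: u64,
}

impl Counter {
    /// Reserve `n` units, or fail with no side effects. The checked_add
    /// guards overflow; the max comparison enforces Q1. On Err the counter
    /// is untouched (Q3: check before mutating).
    fn reserve(&mut self, n: u64) -> Result<(), ()> {
        match self.used.checked_add(n) {
            Some(v) if v <= self.max => {
                self.used = v;
                Ok(())
            }
            _ => Err(()), // fail closed
        }
    }

    /// Matched release (Q2): every successful reserve gets exactly one
    /// release on success, error, timeout, exit, and rollback paths.
    fn release(&mut self, n: u64) {
        self.used = self.used.saturating_sub(n);
    }
}
```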

3. Diagnostic Rate Limiting and Aggregation

Repeated invalid ring/cap submissions are aggregated per process and error key.

  • Key: (pid, error_code, opcode, cap_id_bucket)
  • Buckets:
    • cap_id_bucket = exact cap id for stale/invalid cap failures
    • cap_id_bucket = 0 for structural ring errors
  • Per-key token bucket: allow first N=4 emissions/sec, then suppress.
  • Suppressed counts are flushed once per second as one summary line:
    • pid=X invalid submissions suppressed=Y last_err=...

Security invariant D1: invalid submission floods cannot consume unbounded serial bandwidth or scheduler time in log formatting.

Security invariant D2: suppression never hides first-observation diagnostics for a new (pid,error,opcode,cap bucket) key.
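The per-key allow-then-suppress scheme can be sketched as a small map from key to count, flushed once per window. Names and the window mechanics here are assumptions for illustration; the real key is the (pid, error_code, opcode, cap_id_bucket) tuple described above:

```rust
// Per-key diagnostic limiter: the first `allow_per_window` emissions for a
// key pass through; the rest are counted and reported once per window as a
// single summary (satisfying D1 without hiding first observations, D2).

use std::collections::HashMap;

type Key = (u32, u16, u8, u32); // (pid, error_code, opcode, cap_id_bucket)

struct Limiter {
    allow_per_window: u32,
    counts: HashMap<Key, u32>,
}

impl Limiter {
    fn new(allow_per_window: u32) -> Self {
        Limiter { allow_per_window, counts: HashMap::new() }
    }

    /// True if this diagnostic should be emitted now. A new key always
    /// gets its first N emissions through (D2).
    fn should_emit(&mut self, key: Key) -> bool {
        let c = self.counts.entry(key).or_insert(0);
        *c += 1;
        *c <= self.allow_per_window
    }

    /// Called once per window: returns (key, suppressed_count) summaries
    /// and resets all counters for the next window.
    fn flush(&mut self) -> Vec<(Key, u32)> {
        let out = self
            .counts
            .iter()
            .filter(|&(_, &c)| c > self.allow_per_window)
            .map(|(&k, &c)| (k, c - self.allow_per_window))
            .collect();
        self.counts.clear();
        out
    }
}
```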

4. Transfer and Rollback Semantics

Transfers (xfer_cap_count > 0) use a kernel transfer transaction (TransferTxn) scoped to a single SQE dispatch. The current ring ABI does not provide kernel-owned SQE sequence numbers or a durable transaction table, so userspace replay of a copy-transfer SQE is repeatable: each replay is treated as a new copy grant. Move-transfer replay fails closed after the source slot is removed or reserved by the first successful dispatch.

Future exactly-once replay suppression requires transaction identity scoped to (sender_pid, call_id, sqe_seq) and a monotonic transfer epoch. Until that exists, exactly-once claims apply only within one dispatch attempt, not across malicious rewrites of shared SQ ring indexes.

Phases:

  1. Prepare:
    • validate SQE transport fields and xfer_cap_count
    • validate sender ownership/generation/transferability for each exported cap
    • reserve receiver quota (cap_slots, outstanding_calls, scratch if needed)
    • pin sender entries in txn state (no sender table mutation yet)
  2. Commit:
    • insert destination edges exactly once
    • for copy: increment object refcount/export ref
    • for move: remove sender slot only after destination insertion succeeds
    • publish completion/result
  3. Finalize:
    • release transient reservations
    • mark txn terminal (committed or aborted)

On any error before Commit, rollback is full:

  • receiver inserts are not visible
  • sender slots/refcounts unchanged
  • reservations released
  • CQE returns transfer failure (CAP_ERR_TRANSFER_ABORTED / subtype)

On error during Commit, kernel executes compensating rollback to preserve exactly-once visibility: either all inserts are visible with matching sender state transition, or none are visible.

Security invariant T1: each transfer descriptor is applied at most once within a single SQE dispatch attempt.

Security invariant T2: move transfer is atomic from observer perspective; no state exists where both sender and receiver lose authority due to partial apply.

Security invariant T3: copy-transfer SQE replay is explicitly repeatable until kernel-owned transaction identity exists. Move-transfer replay fails closed after source removal or source reservation.

Security invariant T4: CAP_OP_RELEASE removes one local hold edge only from the caller table and decrements remote export refs exactly once.
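The T2/T3 ordering rule — destination insert before source removal, replay fails closed once the source slot is gone — can be shown in a toy model. Tables here are plain maps and everything is illustrative, not the kernel's TransferTxn:

```rust
// Toy move-transfer honoring the Prepare/Commit ordering above: validate
// without mutating, insert the destination edge first, then remove the
// sender slot. No interleaving leaves both tables without the authority.

use std::collections::HashMap;

type CapTable = HashMap<u32, u64>; // cap_id -> object_id

fn move_transfer(
    src: &mut CapTable, src_cap: u32,
    dst: &mut CapTable, dst_cap: u32,
) -> Result<(), ()> {
    // Prepare: validate sender ownership and receiver slot, no mutation yet.
    let obj = *src.get(&src_cap).ok_or(())?;
    if dst.contains_key(&dst_cap) {
        return Err(()); // fail closed: both tables unchanged
    }
    // Commit: destination insert first, then source removal (T2).
    dst.insert(dst_cap, obj);
    src.remove(&src_cap);
    Ok(())
}
```

Because the source slot is consumed by the first successful dispatch, replaying the same move descriptor fails in Prepare — the fail-closed behavior invariant T3 requires.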

5. Integration with 3.6 Capability Transfer

3.6 implementation must consume this design directly:

  • CALL and RETURN validate all currently-reserved transfer fields, failing closed when a field is unsupported.
  • xfer_cap_count path is wired through TransferTxn (no ad hoc direct inserts).
  • Badge propagation is explicit in transfer descriptors and copied into destination edge metadata.
  • CAP_OP_RELEASE uses the same authority ledger and refcount bookkeeping.

3.6 acceptance criteria:

  1. Copy transfer produces one new receiver edge and retains sender edge.
  2. Move transfer produces one new receiver edge and deletes sender edge atomically.
  3. Any transfer failure leaves sender and receiver CapTables unchanged.
  4. Copy replay is an explicit repeatable-grant policy until a kernel-owned transaction identity is added; move replay fails closed after source removal or reservation.
  5. CAP_OP_RELEASE on stale/non-owned cap fails closed without mutating other process tables.

6. Integration with 5.2 ProcessSpawner Prerequisites

5.2 must use the same accounting and transfer machinery:

  • spawn() preflights child quotas (cap_slots, outstanding_calls, scratch, frame_grant_pages, endpoint queue baseline) before mapping child memory or scheduling.
  • Parent-provided CapGrant entries are inserted via the same transfer transaction semantics (copy for initial grants in 5.2.2).
  • Returned ProcessHandle is inserted through the standard result-cap insertion path and accounted as a normal cap slot.
  • Child setup rollback must unwind:
    • address space mappings
    • ring page
    • CapSet page
    • kernel stack
    • allocated frames
    • provisional capability edges/reservations

5.2 acceptance criteria:

  1. Spawn failure at any step leaves no child-visible process and no leaked ledger usage.
  2. Successful spawn accounts all child bootstrap resources within quotas.
  3. Parent and child cap-table accounting remains balanced under repeated spawn/exit cycles.
  4. ProcessHandle.wait and exit cleanup release outstanding-call/scratch/frame usage deterministically.
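The child-setup rollback list above is naturally expressed as a reverse-order unwind: each completed setup step registers an undo action, and a failure runs them last-in-first-out. A sketch with illustrative types (the real paths unwind mappings, the ring page, the CapSet page, the kernel stack, frames, and provisional edges):

```rust
// RAII-style rollback guard: push one undo closure per completed setup
// step; on failure the guard's Drop runs them in reverse setup order; on
// success the guard is disarmed and nothing unwinds.

struct Rollback<'a> {
    undo: Vec<Box<dyn FnMut() + 'a>>,
    armed: bool,
}

impl<'a> Rollback<'a> {
    fn new() -> Self {
        Rollback { undo: Vec::new(), armed: true }
    }
    fn push(&mut self, f: impl FnMut() + 'a) {
        self.undo.push(Box::new(f));
    }
    /// Spawn succeeded: keep everything that was set up.
    fn disarm(&mut self) {
        self.armed = false;
    }
}

impl<'a> Drop for Rollback<'a> {
    fn drop(&mut self) {
        if self.armed {
            // Unwind in reverse setup order (LIFO).
            for f in self.undo.iter_mut().rev() {
                f();
            }
        }
    }
}
```

This shape makes acceptance criterion 1 mechanical: any early return drops the armed guard, so no child-visible state or ledger usage survives a failed spawn.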

7. Implementation Notes for Verification Tracks

This design unblocks:

  • S.8 hostile-input tests for quota and invalid-transfer failures.
  • S.12 Kani bounds refresh for ledger and transfer invariants.
  • Target 12 in docs/proposals/security-and-verification-proposal.md with explicit allocator hooks and fail-closed exhaustion behavior.

Proposal Index

This page classifies proposal documents by current role so readers do not confuse implemented behavior, active design direction, future architecture, and rejected alternatives.

Active or Near-Term

| Proposal | Status | Purpose |
|---|---|---|
| Service Architecture | Partially implemented | Defines authority-at-spawn, service composition, exported capabilities, and the init-owned service graph direction. |
| Storage and Naming | Accepted design | Defines capability-native storage, namespaces, boot-package structure, and future persistence instead of a global filesystem. |
| Error Handling | Partially implemented | Defines the two-level transport/application error model and the current CQE transport error namespace. |
| Security and Verification | Partially implemented | Defines the security review vocabulary, trust-boundary checklist, and practical verification tracks used by capOS. |
| mdBook Documentation Site | Partially implemented | Defines the documentation site structure, status vocabulary, and curation rules for architecture, proposal, security, and research pages. |

Future Architecture

| Proposal | Status | Purpose |
|---|---|---|
| Networking | Partially implemented | Plans the in-kernel QEMU virtio-net smoke and the later userspace NIC, network stack, and socket capability architecture. |
| SMP | Future design | Defines the future multi-core scheduler, per-CPU state, AP startup, and TLB shootdown direction. |
| Userspace Binaries | Partially implemented | Describes native userspace binaries, capos-rt, language support, POSIX compatibility, and runtime authority handling. |
| Go Runtime | Future design | Plans a custom GOOS=capos path, runtime services, memory growth, TLS, scheduling, and network integration for Go. |
| Shell | Future design | Describes native, agent-oriented, and POSIX shell models over explicit capabilities instead of ambient paths. |
| Boot to Shell | Queued future milestone | Defines text-only console and web-terminal login/setup, password verifier and passkey authentication, and the authenticated native shell launch path after manifest execution, terminal input, native shell, session, broker, audit, and credential-storage prerequisites are credible. |
| System Monitoring | Future design | Defines capability-scoped logs, metrics, health, traces, crash records, and audit/status views. |
| User Identity and Policy | Future design | Defines users, sessions, guest profiles, and policy layers for RBAC, ABAC, and MAC over capability grants. |
| Cloud Metadata | Future design | Describes cloud instance bootstrap through metadata/config-drive capabilities and manifest deltas. |
| Cloud Deployment | Future design | Plans hardware abstraction, cloud VM support, storage/network boot dependencies, and later aarch64 deployment work. |
| Live Upgrade | Future design | Defines service replacement without dropping capabilities or in-flight calls through retargeting and quiesce/resume protocols. |
| GPU Capability | Future design | Sketches capability-oriented GPU, CUDA, memory, and driver isolation models. |
| Formal MAC/MIC | Future design | Defines a formal mandatory-access and mandatory-integrity model plus future proof obligations. |
| Browser/WASM | Future design | Explores running capOS concepts in a browser using WebAssembly and worker-per-process isolation. |

Rejected or Superseded

| Proposal | Status | Purpose |
|---|---|---|
| Cap’n Proto SQE Envelope | Rejected | Records why ring SQEs stay fixed-layout transport records instead of becoming Cap’n Proto messages themselves. |
| Sleep(INF) Process Termination | Rejected | Records why infinite sleep should not replace explicit process termination, while preserving typed status and future sys_exit removal as separate lifecycle work. |

Maintenance

When a proposal becomes implemented, rejected, or stale, update this index in the same change that updates the proposal or the corresponding implementation. Long proposal files may describe target behavior; this index is the first status checkpoint before a reader opens those documents.

Proposal: Capability-Based Service Architecture

How capOS processes receive authority, compose into services, and expose layered capabilities — without a service manager daemon.

Problem

Traditional OSes grant processes ambient authority (file system, network, IPC namespaces) and then restrict it via sandboxing (seccomp, namespaces, AppArmor). Service managers like systemd handle dependencies, lifecycle, and resource limits through a central daemon with a massive configuration surface.

capOS inverts this: processes start with zero authority and receive only the capabilities they need. The capability graph implicitly encodes service dependencies, resource limits, and access control. No central daemon required.

Process Startup Model

A process receives its entire authority as a set of named capabilities at spawn time. There is no ambient authority to fall back on — if a capability wasn’t granted, the operation is impossible.

The child process sees its granted capabilities by name. It cannot discover or request capabilities it wasn’t given.

Capability Layering

Each process consumes lower-level capabilities and exports higher-level ones. Authority narrows at every layer:

Kernel
  │
  ├─ Nic cap (raw frame send/receive for one device)
  ├─ Timer cap (monotonic clock)
  ├─ DeviceMmio cap (one device's BAR regions)
  └─ Interrupt cap (one IRQ line)
       │
       v
NIC Driver Process
  │
  └─ Nic cap ──> Network Stack Process
                   │
                   ├─ TcpSocket cap (one connection)
                   ├─ UdpSocket cap (one socket)
                   └─ NetworkManager cap (create sockets)
                        │
                        v
                   HTTP Service Process
                     │
                     ├─ Fetch cap (any URL)
                     │    │
                     │    v
                     │  Trusted Process (holds Fetch, mints scoped caps)
                     │
                     └─ HttpEndpoint cap (one origin)
                          │
                          v
                     Application Process

The application at the bottom holds an HttpEndpoint cap scoped to a single origin. It cannot make raw TCP connections, send arbitrary packets, or touch any device. The capability is the security policy.

HTTP Capabilities

Two levels of HTTP capability: Fetch (general) and HttpEndpoint (scoped). HttpEndpoint is implemented by a process that holds a Fetch cap and restricts it.

Fetch

Unrestricted HTTP access — equivalent to the browser Fetch API. The holder can make requests to any URL. This is the base capability that HTTP service processes use internally.

interface Fetch {
    # General-purpose HTTP request to any URL.
    request @0 (url :Text, method :Text, headers :List(Header), body :Data)
        -> (status :UInt16, headers :List(Header), body :Data);
}

struct Header {
    name @0 :Text;
    value @1 :Text;
}

Fetch is powerful — granting it is roughly equivalent to granting arbitrary outbound network access. It should only be held by service processes that need to make requests on behalf of others, not by application code directly.

HttpEndpoint

A restricted view of Fetch, scoped to a single origin. The holder can only make requests within the bounds encoded in the capability.

interface HttpEndpoint {
    # Request scoped to this endpoint's origin.
    # Path is relative (e.g., "/v1/users").
    request @0 (method :Text, path :Text, headers :List(Header), body :Data)
        -> (status :UInt16, headers :List(Header), body :Data);
}

Note: same request() signature as Fetch, but path instead of url. The origin is implicit — bound into the capability at mint time.

Attenuation

A process holding Fetch mints HttpEndpoint caps by narrowing authority. The core restriction is always origin — Fetch can reach any URL, HttpEndpoint is locked to one host. Additional constraints (path prefixes, method restrictions, rate limits) are possible but are userspace policy details, not OS-level concerns.

This is the standard object-capability attenuation pattern: same interface, less authority. The application code is identical whether it holds a broad or narrow HttpEndpoint.
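The pattern can be sketched as a wrapper that holds the broad capability and fixes the origin at mint time. The trait and types below are illustrative Rust, not the Cap'n Proto-generated bindings:

```rust
// Attenuation by wrapping: HttpEndpoint holds a Fetch internally and
// prepends its fixed origin, so the holder can never name another host.

trait Fetch {
    // Simplified signature: real Fetch also carries headers and a body.
    fn request(&self, url: &str, method: &str) -> u16; // returns status
}

struct HttpEndpoint<F: Fetch> {
    origin: String, // bound at mint time; implicit in every request
    inner: F,       // the broader authority, never exposed to the holder
}

impl<F: Fetch> HttpEndpoint<F> {
    /// Same request shape as Fetch, but the caller supplies only a
    /// relative path; the origin is ours.
    fn request(&self, method: &str, path: &str) -> u16 {
        self.inner.request(&format!("{}{}", self.origin, path), method)
    }
}
```

The application sees only `request(method, path)`; whether the endpoint it holds was minted broad or narrow, its code is identical — which is exactly the property the proposal claims.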

Boot and Initialization Sequence

The kernel doesn’t know about services. It boots, creates a handful of kernel-provided caps, and spawns exactly one process: init. Everything else is init’s responsibility.

Current State vs Target State

The implementation is in transition. The default system.cue path still lets the kernel spawn every service listed in the manifest and wire cross-service caps through kernel/src/cap/mod.rs::create_all_service_caps. The system-spawn.cue path now sets config.initExecutesManifest; in that mode the kernel validates the full manifest, boots only the first init service, grants init BootPackage and ProcessSpawner, and lets init resolve the remaining ServiceEntry graph through ProcessSpawner.

The target model removes the kernel-side service graph entirely. The manifest stops being a kernel authority graph and becomes a boot package delivered to init:

  • List of embedded binaries (init needs them before any storage service exists; they can’t be fetched from a filesystem that hasn’t started).
  • Init’s config blob (CUE-encoded tree; what to spawn, with what attenuations, with what restart policy).
  • Kernel boot parameters (memory limits, feature flags) consumed by the kernel itself, not forwarded to init.

The kernel spawns exactly one userspace process (init) with a fixed cap bundle:

  • Console — kernel serial wrapper (may be replaced later by a userspace log service, with init retaining a direct console cap for emergency use).
  • ProcessSpawner — only init and its delegated supervisors hold this.
  • FrameAllocator — physical frame authority for init’s own allocations.
  • VirtualMemory — per-process address-space authority for init.
  • DeviceManager — enumerate/claim devices; init delegates device-specific slices to drivers.
  • Timer — monotonic clock.
  • BootPackage — read-only cap exposing the embedded binaries and the config blob.

Everything else — drivers, net-stack, filesystems, supervisors, apps — init spawns at runtime via ProcessSpawner with appropriate attenuation. No manifest ServiceEntry, no cross-service CapRef, no manifest exports.

Pre-Init Boundary After Stage 6

Rule of thumb: no userspace service runs before init. The kernel’s job is primitive cap synthesis and a single-process handoff; init’s job is the whole service graph. Concretely, after Stage 6:

  • Stays in kernel pre-init: memory map ingest, frame allocator, heap, paging, GDT/IDT/TSS, serial for kernel diagnostics, scheduler, ring dispatch, kernel-cap CapObject impls, ELF loading for init, boot package measurement (if attested boot is added).
  • Stays in manifest: binaries list + init config blob + kernel boot params. Schema-wise, ServiceEntry and CapSource::Service disappear; SystemManifest shrinks to binaries + initConfig + kernelParams.
  • Moves to init: service topology, cross-service cap wiring, attenuation, restart policies, dynamic spawn, cap export/import, supervision trees. Anything a service manager would do.
  • Moves to init or later services: logging policy, config store, secrets, filesystem mounts, network configuration, device binding.

Edge cases that might look like they want a pre-init service but don’t:

  • Early crash / panic handling. Kernel-side panic handler, no service needed.
  • Recovery shell. Kernel fallback: if init fails to reach a healthy state within a timeout (e.g. exits immediately, or never issues a liveness SQE), kernel optionally spawns a “recovery” binary from the boot package with the same cap bundle. Still just one userspace process at a time pre-supervisor-loop.
  • Attested/measured boot. Kernel hashes binaries in the boot package before handing BootPackage to init. The measurement agent, if any, runs as a normal service spawned by init with a cap to the sealed measurements.
  • Early-boot console. Kernel owns serial and exposes Console to init. A userspace log service can layer on top later; it is not pre-init.

Legacy Manifest Fields After Stage 6

ServiceEntry.caps, CapSource::Service, and ServiceEntry.exports are transitional. ProcessSpawner and the generic init-side spawn loop are now in place for system-spawn.cue; the remaining cleanup is to remove these fields from the kernel bootstrap contract:

  1. Delete ServiceEntry and CapSource::Service from schema/capos.capnp.
  2. Collapse SystemManifest.services into initConfig: CueValue.
  3. Remove create_all_service_caps, the two-pass resolver, and the manifest authority-graph validator (validate_manifest_graph).
  4. Kernel spawns one process from initConfig.initBinary with the fixed cap bundle described above plus BootPackage.

The re-export restriction added in capos-config::validate_manifest_graph (service A exports cap sourced from B.ep) becomes moot at that point because there are no manifest exports at all. It stays as defensive validation while the transitional schema exists.

Init Binary Embedding

Init is part of the kernel’s bootstrap contract, not a configuration choice: the cap bundle handed to init is a kernel ABI, the _start(ring, pid, …) entry shape is a kernel ABI, and a version-mismatched init is a footgun with no payoff in a single-init research OS. So the init ELF ships inside the kernel binary via include_bytes!, not as a separate manifest entry or Limine module.

Shape:

  • init/ stays a standalone crate with its own linker script and code model (user-space base 0x200000, static relocation model, 4 KiB alignment). Not a workspace member; different build flags than the kernel.
  • kernel/build.rs drives init/’s build (or depends on the prebuilt artifact at a known path) and emits an include_bytes!("…") into a kernel::boot::INIT_ELF: &[u8] static.
  • Kernel bootstrap parses INIT_ELF through the same capos_lib::elf path used for service binaries, creates the init address space via AddressSpace::new_user(), loads segments, populates the cap bundle (including BootPackage), and jumps. No Limine module lookup for init.
  • SystemManifest.binaries stops containing an “init” entry. Its binaries list is services-only. BootPackage exposes only what init hands out to children.
  • Measured-boot attestation (if added) covers the kernel ELF, which transitively covers init’s bytes. Service binaries are hashed separately by the kernel before handing BootPackage to init.

What this does not change:

  • Init still runs in Ring 3 with its own page tables; embedding is byte packaging, not privilege merging.
  • Init is still ELF-parsed at boot — the same loader and W^X enforcement apply. The only thing different is where the bytes came from.
  • Service binaries (everything spawned after init) stay in the boot package as distinct blobs, exposed to init via BootPackage. They are not linked into the kernel; their lifecycle is independent of the kernel’s.

What option was rejected: fully linking init into the kernel crate (shared compilation unit, shared text). That collapses the kernel/user build boundary, couples linker scripts and code models, and puts init’s panics/UB inside the kernel’s compilation context. The process-isolation boundary survives that arrangement — but the build-time separation that makes the boundary trustworthy does not. include_bytes! preserves the separation; static linking destroys it.

Kernel boot
  │
  ├─ Create kernel caps: Console, Timer, DeviceManager, ProcessSpawner
  │
  └─ Spawn init with all kernel caps
       │
       init process (PID 1)
         │
         ├─ Phase 1: Core services (sequential — each depends on previous)
         │    ├─ DeviceManager.enumerate() → list of devices
         │    ├─ Spawn NIC driver with device-specific caps
         │    ├─ Wait for NIC driver to export Nic cap
         │    ├─ Spawn net-stack with Nic + Timer caps
         │    └─ Wait for net-stack to export NetworkManager cap
         │
         ├─ Phase 2: Higher-level services (can be parallel)
         │    ├─ Spawn http-service with TcpSocket cap from net-stack
         │    ├─ Spawn dns-resolver with UdpSocket cap
         │    └─ ...
         │
         └─ Phase 3: Applications
              ├─ Spawn app-a with HttpEndpoint("api.example.com")
              ├─ Spawn app-b with Fetch cap (trusted)
              └─ ...

The Init Process in Detail

Init is a regular userspace process with privileged caps. It is the only process that holds ProcessSpawner (the right to create new processes) and DeviceManager (the right to enumerate and claim devices). It can delegate subsets of these to child supervisors.

// init/src/main.rs — this IS the system configuration

fn main(caps: CapSet) {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let devices = caps.get::<DeviceManager>("devices");
    let timer = caps.get::<Timer>("timer");
    let console = caps.get::<Console>("console");

    // === Phase 1: Hardware drivers ===

    // Find the NIC
    let nic_device = devices.find("virtio-net")
        .expect("no network device found");

    // Spawn NIC driver — gets ONLY its device's MMIO + IRQ
    let nic_driver = spawner.spawn(SpawnRequest {
        binary: "/sbin/virtio-net",
        caps: caps![
            "device_mmio" => nic_device.mmio(),
            "interrupt"   => nic_device.interrupt(),
            "log"         => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    // The driver exports a Nic cap once initialized
    let nic: Cap<Nic> = nic_driver.exported("nic").wait();

    // === Phase 2: Network stack ===

    let net_stack = spawner.spawn(SpawnRequest {
        binary: "/sbin/net-stack",
        caps: caps![
            "nic"   => nic,
            "timer" => timer.clone(),
            "log"   => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    let net_mgr: Cap<NetworkManager> = net_stack.exported("net").wait();

    // === Phase 3: HTTP service ===

    let tcp = net_mgr.create_tcp_pool();

    let http_service = spawner.spawn(SpawnRequest {
        binary: "/sbin/http-service",
        caps: caps![
            "tcp" => tcp,
            "log" => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    let fetch: Cap<Fetch> = http_service.exported("fetch").wait();

    // === Phase 4: Applications ===

    // Trusted telemetry agent — gets full Fetch
    spawner.spawn(SpawnRequest {
        binary: "/sbin/telemetry",
        caps: caps![
            "fetch" => fetch.clone(),
            "log"   => console.clone(),
        ],
        restart: RestartPolicy::OnFailure,
    });

    // Sandboxed app — gets scoped HttpEndpoint
    let api_cap = fetch.attenuate(EndpointPolicy {
        origin: "https://api.example.com",
        paths: Some("/v1/users/*"),
        methods: Some(&["GET", "POST"]),
    });

    spawner.spawn(SpawnRequest {
        binary: "/app/my-service",
        caps: caps![
            "api" => api_cap,
            "log" => console.clone(),
        ],
        restart: RestartPolicy::OnFailure,
    });

    // Init stays alive as the root supervisor
    supervisor_loop(&spawner);
}

Key Mechanisms

Cap export. A spawned process can export capabilities back to its parent via the ProcessHandle (see Spawn Mechanism section). This is how the NIC driver makes its Nic cap available to the network stack — init spawns the driver, waits for it to export "nic", then passes that cap to the next process.

Restart policy. Encoded in SpawnRequest, enforced by the supervisor loop in the spawning process. When a child exits unexpectedly:

  1. Old caps held by the child are automatically revoked (kernel invalidates the process’s cap table on exit)
  2. Supervisor re-spawns with the same SpawnRequest
  3. New instance gets fresh caps — same authority, new identity

Dependency ordering. Sequential in code: wait() on exported caps blocks until the dependency is ready. No declarative dependency graph needed — Rust’s control flow is the dependency graph.
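The restart decision itself is small enough to show. A sketch of the policy check a supervisor loop would apply on child exit, using illustrative stand-in types (the proposal's `supervisor_loop` is not yet specified in this document):

```rust
// Restart-policy decision from the SpawnRequest, applied by the spawning
// process when a child exits. The kernel has already revoked the child's
// cap table by this point; a restart re-grants fresh caps with the same
// authority.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum RestartPolicy {
    Always,    // re-spawn on any exit
    OnFailure, // re-spawn only on nonzero exit status
    Never,     // one-shot process
}

fn should_restart(policy: RestartPolicy, exit_code: i32) -> bool {
    match policy {
        RestartPolicy::Always => true,
        RestartPolicy::OnFailure => exit_code != 0,
        RestartPolicy::Never => false,
    }
}
```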

Service Taxonomy

Concrete categories of userspace services capOS expects to run. All spawned by init (or a supervisor init delegates to) after Stage 6. None are pre-init.

Hardware Drivers

One process per managed device. Each holds exactly the caps for its own hardware: a DeviceMmio slice, the corresponding Interrupt cap, and optionally a DmaRegion cap carved out of the frame allocator. Exports a typed device cap (Nic, BlockDevice, Framebuffer, Gpu, …). Examples: virtio-net, virtio-blk, NVMe, AHCI, framebuffer/GPU.

Platform Services

  • Logger / journal — accepts Log cap writes, forwards to console and/or durable storage. Init and kernel bootstrap use a direct Console cap until the logger is up; afterwards new services get Log caps only.
  • Filesystem — one per mounted volume. Consumes a BlockDevice cap, exports Directory / File caps. FAT, ext4, overlay, tmpfs.
  • Store — capability-native content-addressed storage backing persistent capability state (storage-and-naming-proposal.md).
  • Network stack — userspace TCP/IP (networking-proposal.md). Consumes Nic + Timer, exports NetworkManager, TcpSocket, UdpSocket, TcpListener.
  • DNS resolver — consumes a UdpSocket, exports Resolver.
  • Config / secrets store — reads the initial config from BootPackage, exposes runtime Config and Secret caps with per-key attenuation.
  • Cloud metadata agent — detects IMDS / ConfigDrive / SMBIOS on cloud boot and delivers a ManifestDelta (cloud-metadata-proposal.md).
  • Upgrade manager — orchestrates CapRetarget for live service replacement (live-upgrade-proposal.md).
  • Capability proxy — makes local caps reachable over the network with capnp-rpc transport (Plan 9’s exportfs equivalent).
  • Measurement / attestation agent — consumes sealed kernel hashes from BootPackage, exposes Quote caps for remote attestation.

Supervisors

Per-subsystem restart managers that hold a narrowed ProcessSpawner plus the caps of the subtree they own. If any child crashes, the supervisor tears down and re-spawns the set. Example: net-supervisor owns NIC driver + net-stack + DHCP client.

Application Services

User-facing or user-spawned processes: HTTP servers, API gateways, worker pools, shells, interactive tools. Hold only the narrow caps the supervisor grants (HttpEndpoint for one origin, Directory for one mount, etc.). Human users, service accounts, guests, and anonymous callers are represented by session/profile services that grant scoped cap bundles; they are not kernel subjects or ambient process credentials. See user-identity-and-policy-proposal.md.

What Does Not Become a Service

  • Console / serial — stays in the kernel as a CapObject wrapper. Small enough, needed for kernel diagnostics, no benefit from userspace isolation. A userspace log service can layer on top.
  • Frame allocator, virtual memory, scheduler, ring dispatch — kernel primitives, exposed as caps but not as services.
  • Interrupt delivery, DMA mapping — kernel mechanisms, exposed to drivers as caps.
  • Boot measurement — if added, it happens in the kernel before BootPackage exists; the userspace measurement agent only reports the resulting measurements.

Supervision

Supervision Tree

Init doesn’t have to supervise everything directly. It can delegate:

init (root supervisor)
  ├─ net-supervisor (holds: spawner subset, device caps)
  │    ├─ virtio-net driver
  │    ├─ net-stack
  │    └─ http-service
  └─ app-supervisor (holds: spawner subset, service caps)
       ├─ my-service
       └─ another-app

Each supervisor is a process that holds a ProcessSpawner cap (possibly restricted to specific binaries) and the caps it needs to grant to children. If net-supervisor crashes, init restarts it, and it re-spawns the entire networking subtree.

Supervisor Loop

fn supervisor_loop(children: &[SpawnRequest], spawner: &ProcessSpawner) {
    let mut handles: Vec<ProcessHandle> = children.iter()
        .map(|req| spawner.spawn(req.clone()))
        .collect();

    loop {
        // Wait for any child to exit
        let (index, exit_code) = wait_any(&handles);
        let req = &children[index];

        match req.restart {
            RestartPolicy::Always => {
                handles[index] = spawner.spawn(req.clone());
            }
            RestartPolicy::OnFailure if exit_code != 0 => {
                handles[index] = spawner.spawn(req.clone());
            }
            _ => {
                // Process exited normally, don't restart
            }
        }
    }
}

Socket Activation

systemd pre-creates a socket and passes the fd to the service on first connection. In capOS, the supervisor does the same with caps:

Eager (default): supervisor spawns the child immediately with a TcpListener cap. Child calls accept() and blocks.

Lazy: supervisor holds the TcpListener cap itself. On first incoming connection (or on first accept() from a proxy cap), it spawns the child and transfers the cap. The child code is identical in both cases.

// Lazy activation — supervisor holds the listener until needed
let listener = net_mgr.create_tcp_listener();
listener.bind([0,0,0,0], 8080);

// This blocks until a connection arrives
let _conn = listener.accept();

// Now spawn the actual service, giving it the listener
spawner.spawn(SpawnRequest {
    binary: "/app/web-server",
    caps: caps!["listener" => listener, "log" => console.clone()],
    restart: RestartPolicy::Always,
});

Configuration

See docs/proposals/storage-and-naming-proposal.md for the full storage, naming, and configuration model.

Summary: the system topology is currently defined in a capnp-encoded system manifest baked into the boot image. tools/mkmanifest compiles the human-authored system.cue or system-spawn.cue source into the binary manifest. The transition path already lets init validate and execute that manifest through ProcessSpawner; the default boot path still needs to retire the legacy kernel-side service graph. Runtime configuration lives in a capability-based store service once that service exists.

Comparison with Traditional Approaches

| Concern | systemd/Linux | capOS |
| --- | --- | --- |
| Service dependencies | Wants=, After=, Requires= | Implicit in cap graph |
| Sandboxing | seccomp, namespaces, AppArmor | Default: zero ambient authority |
| Socket activation | ListenStream=, fd passing protocol | Pass TcpListener cap |
| Restart policy | Restart=on-failure | Supervisor process loop |
| Logging | journald, StandardOutput=journal | Log cap in granted set |
| Resource limits | cgroups, MemoryMax=, CPUQuota= | Bounded allocator caps |
| Network access control | firewall rules (iptables/nftables) | Scoped HttpEndpoint / TcpSocket caps |
| Config format | INI-like unit files (~1500 directives) | Rust code or minimal manifest |
| Trusted computing base | systemd PID 1 (~1.4M lines) | Init process (hundreds of lines) |

Spawn Mechanism

Spawning is a capability-gated operation. The kernel provides a ProcessSpawner capability — only the holder can create new processes.

Implemented Kernel Slice

The kernel now provides:

  1. ProcessSpawner capability — a CapObject impl in kernel/src/cap/process_spawner.rs. Methods:

    • spawn(name, binaryName, grants) -> handleIndex — resolve a boot-package binary, load ELF, create address space (builds on existing elf.rs loader and AddressSpace::new_user() in mem/paging.rs), populate the initial cap table, schedule the process, and return the ProcessHandle through the ring result-cap list
    • the returned ProcessHandle cap lets the parent wait for child exit in the first slice; exported caps and kill semantics are later lifecycle work
  2. Initial cap passing — at spawn time, the kernel copies permitted parent cap references into the child’s cap table or mints authorized child-local kernel caps. Raw grants preserve the source badge, endpoint-client grants can mint a requested badge from an owner endpoint, and child-local FrameAllocator/VirtualMemory grants are created for the child’s address space. The parent’s references are unaffected.

  3. Cap export — future lifecycle work will let a child register a cap by name in its ProcessHandle, making it available to the parent (or anyone holding the handle). This is the mechanism behind nic_driver.exported("nic").wait() once exported-cap lookup is added.

Schema

interface ProcessSpawner {
    spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}

struct CapGrant {
    name @0 :Text;
    capId @1 :UInt32;
    interfaceId @2 :UInt64;
    mode @3 :CapGrantMode;
    badge @4 :UInt64;
    source @5 :CapGrantSource;
}

struct CapGrantSource {
    union {
        capability @0 :Void;
        kernel @1 :KernelCapSource;
    }
}

enum CapGrantMode {
    raw @0;
    clientEndpoint @1;
}

interface ProcessHandle {
    wait @0 () -> (exitCode :Int64);
}

Note on capability passing: Capabilities are referenced by cap table slot IDs (UInt32), not by Cap’n Proto’s native capability table mechanism. spawn() returns the ProcessHandle through the ring result-cap list; handleIndex identifies that transferred cap in the completion. The first slice passes a boot-package binaryName instead of raw ELF bytes so the request stays within the bounded ring parameter buffer. kill, post-spawn grants, and exported-cap lookup remain future lifecycle work until their teardown and authority semantics are implemented. capOS uses manual capnp dispatch (CapObject trait with raw message bytes, not capnp-rpc), so cap references are plain integers and typed result caps use the ring transfer-result metadata. See userspace-binaries-proposal.md Part 7 for the surrounding userspace bootstrap schema context.

Relationship to Existing Code

The current kernel has these pieces in place:

  • ELF loading (kernel/src/elf.rs) — parses PT_LOAD segments, validates alignment, and feeds the reusable spawn primitive behind ProcessSpawner.
  • Address space creation (kernel/src/mem/paging.rs) — AddressSpace::new_user() creates isolated page tables with the kernel mapped in the upper half.
  • Cap table (kernel/src/cap/table.rs) — CapTable with insert(), get(), remove(), transfer preflight, provisional insert, commit, and rollback helpers. Each Process owns one local table.
  • Process struct and scheduler (kernel/src/process.rs, kernel/src/sched.rs) — a process table plus round-robin run queue are in place for both legacy manifest-spawned services and init-spawned children.

Generic capability transfer/release and the reusable ProcessSpawner lifecycle path are complete enough for init-owned service startup. Remaining lifecycle gaps are kill, post-spawn grants, runtime exported-cap lookup, restart supervision, and retiring the default kernel-side service graph.

Prerequisites

| Prerequisite | Status | Why |
| --- | --- | --- |
| ELF loading + address spaces | Done (Stage 2-3) | elf.rs, AddressSpace::new_user() |
| Capability ring + cap_enter | Done (Stage 4/6 foundation) | Ring-based cap invocation with blocking waits |
| Scheduling + preemption (core) | Done (Stage 5) | Round-robin, PIT 100 Hz, context switch |
| Cross-process Endpoint IPC | Done (Stage 6 foundation) | CALL/RECV/RETURN routing through Endpoint objects |
| Generic cap transfer/release | Done (Stage 6, 2026-04-22) | Copy/move transfer, result-cap insertion, CAP_OP_RELEASE; epoch revocation still future |
| ProcessSpawner + ProcessHandle | Done (Stage 6, 2026-04-22) | Init-driven spawn with grants, wait completion, hostile-input coverage; kill/post-spawn grants still future |
| Authority graph + quota design (S.9) | Done (2026-04-21) | Defines transfer/spawn invariants, per-process quotas, and rollback rules; see docs/authority-accounting-transfer-design.md |

This proposal describes the target architecture. Individual pieces (like Fetch/HttpEndpoint) are additive — they’re userspace processes that compose existing caps into higher-level ones. No kernel changes needed beyond Stages 4-6.

First Step After Transfer and ProcessSpawner — done 2026-04-22

The minimal demonstration of this architecture landed together with capability transfer and ProcessSpawner:

  1. ProcessSpawner cap in kernel/src/cap/process_spawner.rs wraps ELF loading and address-space creation behind a typed capability.
  2. Init spawns children — make run-spawn boots a manifest with config.initExecutesManifest set; the kernel boots only init, then init spawns endpoint-roundtrip, ipc-server, and ipc-client from manifest entries through ProcessSpawner, grants endpoint owner/client facets, and waits on each ProcessHandle.
  3. Cross-process cap invocation — spawned client invokes the server’s Endpoint cap, server replies, both print to console.

This exercises: spawn cap, initial cap passing, manifest-declared export recording, cross-process cap invocation, hostile-input rejection, and per-process resource exhaustion paths. Generic manifest execution exists for the system-spawn.cue transition path. Making it the default make run path and deleting the legacy kernel resolver is the selected follow-on milestone in WORKPLAN.md.

Open Questions

  1. Cap revocation. If a service is restarted, its old caps should be invalidated. Planned approach (from research): epoch-based revocation (EROS-inspired, O(1) invalidate) plus generation counters on CapId (Zircon-inspired, stale reference detection). See ROADMAP.md Stages 5-6.

  2. Cap discovery. How does a process learn what caps it was given? Resolved: name→(cap_id, interface_id) mapping passed at spawn via a well-known page (CapSet). See userspace-binaries-proposal.md Part 2. cap_id is the authority-bearing table handle. interface_id is the transported capnp TYPE_ID used by typed clients to check that the handle speaks the expected interface.
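
The resolved name→(cap_id, interface_id) lookup can be sketched as follows; CapSet here is a hypothetical host-side model built on a HashMap, not the real well-known-page layout, and the TYPE_ID value is made up:

```rust
use std::collections::HashMap;

// Hypothetical model of the spawn-time CapSet: name -> (cap_id, interface_id).
struct CapSet {
    entries: HashMap<String, (u32, u64)>,
}

impl CapSet {
    // Typed lookup: the caller states which capnp TYPE_ID it expects;
    // a mismatched interface_id is rejected before any invocation.
    fn get(&self, name: &str, expected_interface_id: u64) -> Option<u32> {
        match self.entries.get(name) {
            Some(&(cap_id, iface)) if iface == expected_interface_id => Some(cap_id),
            _ => None,
        }
    }
}

fn main() {
    const CONSOLE_TYPE_ID: u64 = 0xc0ff_ee00_0000_0001; // illustrative, not a real TYPE_ID
    let mut entries = HashMap::new();
    entries.insert("console".to_string(), (3u32, CONSOLE_TYPE_ID));
    let caps = CapSet { entries };

    assert_eq!(caps.get("console", CONSOLE_TYPE_ID), Some(3)); // right interface
    assert_eq!(caps.get("console", 0xdead_beef), None);        // wrong interface
    assert_eq!(caps.get("spawner", CONSOLE_TYPE_ID), None);    // never granted
}
```

Note that cap_id carries the authority and interface_id is only a schema check, matching the split described above.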

  3. Lazy spawning. Should the init process start everything eagerly, or should caps be backed by lazy proxies that spawn the backing service on first invocation?

  4. Cap persistence. If the system reboots, should the cap graph be reconstructable from saved state? Or is it always rebuilt from init code?

  5. Delegation depth. Can an application further delegate its HttpEndpoint cap to a subprocess? If so, the HTTP gateway needs to support fan-out. If not, how is this restriction enforced?

Proposal: Storage, Naming, and Persistence

What replaces the filesystem in a capability OS where Cap’n Proto is the universal wire format.

The Problem with Filesystems

In Unix, the filesystem is the universal namespace. Everything is a path: /dev/sda, /etc/config, /proc/self/fd/3, /run/dbus/system_bus_socket. Paths are ambient authority — any process can open /etc/passwd if the permission bits allow. The filesystem conflates naming, access control, persistence, and device abstraction into one mechanism.

capOS has capabilities instead of paths. Access control is structural (you can only use what you were granted), not advisory (permission bits checked at open time). This means:

  • No global namespace needed — each process sees only its granted caps
  • No path-based access control — the cap IS the access
  • No distinction between “file”, “device”, “socket” — everything is a typed capability interface

A traditional VFS would reintroduce ambient authority through the back door. Instead, capOS needs a storage and naming model native to capabilities and Cap’n Proto.

Core Insight: Cap’n Proto Everywhere

Cap’n Proto is already used in capOS for:

  • Interface definitions — .capnp schemas define capability contracts
  • IPC messages — capability invocations are capnp messages
  • Serialization — capnp wire format crosses process boundaries

If we extend this to storage, then:

  • Stored objects are capnp messages
  • Configuration is capnp structs
  • Binary images are capnp-wrapped blobs
  • The boot manifest is a capnp message describing the initial capability graph

No format conversion anywhere. The same tools (schema compiler, serializer, validator) work for IPC, storage, config, and network transfer.

Architecture

Three Layers

Target architecture after the manifest executor and process-spawner work:

Boot Image (read-only, baked into ISO)
  │
  │  capnp-encoded manifest + binaries
  │
  v
Kernel (creates initial caps from manifest)
  │
  │  grants caps to init
  │
  v
Init (builds live capability graph)
  │
  ├──> Filesystem services (FAT, ext4 — wrap BlockDevice as Directory/File)
  │
  ├──> Store service (capability-native content-addressed storage)
  │      backed by: virtio-blk, RAM, or network
  │
  └──> All other services (receive Directory, Store, or Namespace caps)

Layer 1: Boot Image

The boot image (ISO/disk) contains a capnp-encoded system manifest loaded as a Limine module alongside the kernel. The manifest describes:

struct SystemManifest {
    # Binaries available at boot, keyed by name
    binaries @0 :List(NamedBlob);
    # Initial service graph — what to spawn and with what caps
    services @1 :List(ServiceEntry);
    # Static configuration values as an evaluated CUE-style tree
    config @2 :CueValue;
}

struct NamedBlob {
    name @0 :Text;
    data @1 :Data;
}

struct ServiceEntry {
    name @0 :Text;
    binary @1 :Text;          # references a NamedBlob by name
    caps @2 :List(CapRef);    # what caps this service receives
    restart @3 :RestartPolicy;
    exports @4 :List(Text);   # cap names this service is expected to export
}

struct CapRef {
    name @0 :Text;                 # local name in the child's cap table
    expectedInterfaceId @1 :UInt64; # generated .capnp TYPE_ID for validation
    union {
        unset @2 :Void;             # invalid; keeps omitted sources fail-closed
        kernel @3 :KernelCapSource;
        service @4 :ServiceCapSource;
    }
}

enum KernelCapSource {
    console @0;
    endpoint @1;
    frameAllocator @2;
    virtualMemory @3;
}

struct ServiceCapSource {
    service @0 :Text;
    export @1 :Text;
}

enum RestartPolicy {
    never @0;
    onFailure @1;
    always @2;
}

struct CueValue {
    union {
        null @0 :Void;
        boolean @1 :Bool;
        intValue @2 :Int64;
        uintValue @3 :UInt64;
        text @4 :Text;
        bytes @5 :Data;
        list @6 :List(CueValue);
        fields @7 :List(CueField);
    }
}

struct CueField {
    name @0 :Text;
    value @1 :CueValue;
}

Capability source identity is already structured in the bootstrap manifest, so source selection does not depend on parsing authority strings:

struct CapRef {
    name @0 :Text;                 # local name in the child's CapSet
    expectedInterfaceId @1 :UInt64; # generated .capnp TYPE_ID for validation
    union {
        unset @2 :Void;             # invalid; keeps omitted sources fail-closed
        kernel @3 :KernelCapSource;
        service @4 :ServiceCapSource;
    }
}

enum KernelCapSource {
    console @0;
    endpoint @1;
    frameAllocator @2;
    virtualMemory @3;
}

struct ServiceCapSource {
    service @0 :Text;
    export @1 :Text;
}

KernelCapSource / ServiceCapSource select the authority to grant. The expectedInterfaceId field carries the generated Cap’n Proto interface TYPE_ID and only checks that the granted object speaks the expected schema. It cannot replace source identity: many different objects may expose the same interface while representing different authority.

The build system (Makefile) generates this manifest from a human-authored description and packs it into the ISO as manifest.bin. Current code embeds every SystemManifest.binaries entry into that manifest as NamedBlob data, including the release-built init and smoke-demo ELFs. Exposing the manifest to init as a read-only BootPackage capability (rather than letting the kernel parse and act on the service graph) is the selected follow-on milestone.

Using a CueValue tree instead of AnyPointer keeps the manifest directly decodable in no_std userspace without depending on Cap’n Proto reflection.

Transitional Schema Note

ServiceEntry, CapSource::Service, and ServiceEntry.exports are transitional. ProcessSpawner and copy/move cap transfer are implemented (2026-04-22), but the default make run boot path still has the kernel spawn every declared service and wire cross-service caps. Once init owns generic manifest execution, the manifest loses the service graph entirely:

struct SystemManifest {
    # Binaries available at boot, keyed by name
    binaries @0 :List(NamedBlob);
    # Init's config blob (replaces the service graph)
    initConfig @1 :CueValue;
    # Kernel boot parameters (memory limits, feature flags)
    kernelParams @2 :CueValue;
}

ServiceEntry / CapRef disappear from the schema and become plain CUE fields inside initConfig. Init reads them at runtime and calls ProcessSpawner directly. validate_manifest_graph, validate_bootstrap_cap_sources, and create_all_service_caps all retire once that happens. See docs/proposals/service-architecture-proposal.md — “Legacy Manifest Fields After Stage 6” for the deprecation plan.

Layer 2: Kernel Bootstrap

Target design for the kernel’s boot role:

  1. Parse the system manifest (read-only capnp message from Limine module).
  2. Hash the embedded binaries for optional measured-boot attestation.
  3. Create kernel-provided capabilities: Console, Timer, DeviceManager, ProcessSpawner, FrameAllocator, VirtualMemory (per-process), and a read-only BootPackage cap exposing SystemManifest.binaries and initConfig.
  4. Spawn init — exactly one userspace process — with that cap bundle.

Current code has not reached this split for the default boot: the kernel still parses the manifest and creates one process per ServiceEntry. The transition path exists in system-spawn.cue: it sets config.initExecutesManifest, the kernel validates the full manifest but boots only init, and init spawns endpoint, IPC, VirtualMemory, and FrameAllocator cleanup demo children through ProcessSpawner. Retiring the legacy kernel resolver for default make run is the selected follow-on milestone tracked in WORKPLAN.md.

Layer 3: Init and the Live Capability Graph

Target init reads initConfig from the BootPackage cap and executes it:

fn main(caps: CapSet) {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let boot = caps.get::<BootPackage>("boot");
    let config = boot.init_config()?;  // CueValue

    // Walk service entries from the config and spawn in dependency order
    for entry in config.field("services")?.iter()? {
        let binary = boot.binary(entry.field("binary")?.as_str()?)?;
        let granted = resolve_caps(entry.field("caps")?, &running_services, &caps);
        let handle = spawner.spawn(binary, granted, entry.field("restart")?.into())?;
        running_services.insert(entry.field("name")?.as_str()?.into(), handle);
    }

    supervisor_loop(&running_services);
}

In this target model, init is a generic manifest executor rather than a hardcoded service graph. The system topology is defined in the boot package’s initConfig, not in init’s source code. Changing what services run means rebuilding the boot image with a different config blob, not recompiling init. Manifest graph resolution stops being a kernel concern.

The current transition still uses SystemManifest.services as the service graph instead of initConfig; init reads the BootPackage manifest, validates a metadata-only ManifestBootstrapPlan, resolves kernel and service cap sources, records exported caps, spawns children in manifest order, and waits for their ProcessHandles.

Two Storage Models

capOS supports two complementary storage models, both exposed as typed capabilities:

Filesystem Capabilities (Directory, File)

For accessing traditional block-based filesystems (FAT, ext4, ISO9660) and for POSIX compatibility. A filesystem service wraps a BlockDevice and exports Directory/File capabilities.

BlockDevice (raw sectors)
    │
    └──> Filesystem service (FAT, ext4, ...)
              │
              ├──> Directory caps (namespace over files)
              └──> File caps (read/write byte streams)

This model maps naturally to USB flash drives, NVMe partitions, and network-mounted filesystems. The open() and sub() operations return new capabilities via IPC cap transfer (see “IPC and Capability Transfer” below).

Capability-Native Store (Store, Namespace)

For capOS-native data: configuration, service state, content-addressed object storage. A store service wraps a BlockDevice and exports Store/Namespace capabilities.

BlockDevice (raw sectors)
    │
    └──> Store service
              │
              ├──> Store cap (content-addressed put/get)
              └──> Namespace caps (mutable name→hash mappings)

Content-addressing provides automatic deduplication, verifiable integrity, and immutable references. Namespaces add mutable bindings on top.

Bridging the Two Models

The models are composable. An adapter service can bridge between them:

  • FsStore adapter: exposes a Directory tree as a content-addressed Store (hash each file’s contents, directory listings become capnp-encoded objects)
  • StoreFS adapter: exposes Store/Namespace as a Directory tree (each name maps to a File whose contents are the stored object)
  • Import/export: a utility service reads files from a Directory and stores them in a Store, or materializes Store objects as files in a Directory

In both cases the adapter is a userspace service holding caps to both subsystems. No kernel mechanism needed — just capability composition.

File I/O Interfaces

Directory, File, Store, and Namespace caps may be scoped to a user session, guest profile, anonymous request, or service identity, but the cap remains the authority. POSIX ownership metadata is compatibility data inside these services, not a system-wide authorization channel. See user-identity-and-policy-proposal.md.

BlockDevice

Raw sector access, served by device drivers (virtio-blk, NVMe, USB mass storage). The driver receives hardware capabilities (MMIO, IRQ, FrameAllocator for DMA) and exports a BlockDevice cap.

interface BlockDevice {
    readBlocks  @0 (startLba :UInt64, count :UInt32) -> (data :Data);
    writeBlocks @1 (startLba :UInt64, count :UInt32, data :Data) -> ();
    info        @2 () -> (blockSize :UInt32, blockCount :UInt64, readOnly :Bool);
    flush       @3 () -> ();
}

For bulk transfers, readBlocks/writeBlocks accept a SharedBuffer capability instead of inline Data (see “Shared Memory for Bulk Data” below). The inline-Data variants work for metadata reads and small operations; the SharedBuffer variants avoid copies for large I/O.

File

Byte-stream access to a single file. Served by filesystem services. Created dynamically when a client calls Directory.open() — the filesystem service creates a File CapObject for the opened file and transfers it to the caller via IPC cap transfer.

interface File {
    read     @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write    @1 (offset :UInt64, data :Data) -> (written :UInt32);
    stat     @2 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate @3 (length :UInt64) -> ();
    sync     @4 () -> ();
    close    @5 () -> ();
}

close releases the server-side state for this file (open cluster chain cache, dirty buffers). The kernel-side CapTable entry is removed by the system transport via CAP_OP_RELEASE when the local holder releases it; generated capos-rt handle drop still needs RAII integration before ordinary userspace handles submit that opcode automatically. CapabilityManager is management-only (list(), later grant()); it does not expose a drop() method because ordinary handle lifetime belongs to the transport, not to an application call on the same table that dispatches it.

Attenuation: a read-only File wraps the original and rejects write, truncate, sync calls. An append-only File rejects write at offsets other than the current size.
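
Attenuation like this is plain object composition, no kernel support needed. A sketch with illustrative types (FileOps, MemFile, and ReadOnly are stand-ins, not the capOS interfaces):

```rust
// Minimal File-like trait for the sketch.
trait FileOps {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, &'static str>;
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, &'static str>;
}

// Toy in-memory backing object.
struct MemFile { bytes: Vec<u8> }

impl FileOps for MemFile {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, &'static str> {
        Ok(self.bytes.iter().skip(offset).take(len).copied().collect())
    }
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<usize, &'static str> {
        if self.bytes.len() < offset + data.len() {
            self.bytes.resize(offset + data.len(), 0);
        }
        self.bytes[offset..offset + data.len()].copy_from_slice(data);
        Ok(data.len())
    }
}

// The attenuated cap wraps the original and structurally rejects mutation.
struct ReadOnly<F: FileOps>(F);

impl<F: FileOps> FileOps for ReadOnly<F> {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, &'static str> {
        self.0.read(offset, len)
    }
    fn write(&mut self, _: usize, _: &[u8]) -> Result<usize, &'static str> {
        Err("permission denied: read-only File cap")
    }
}

fn main() {
    let mut f = ReadOnly(MemFile { bytes: b"hello".to_vec() });
    assert_eq!(f.read(0, 5).unwrap(), b"hello".to_vec());
    assert!(f.write(0, b"x").is_err()); // writes rejected by the facade
}
```

An append-only facade would be the same pattern with a write that checks offset == current size instead of rejecting outright.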

Directory

Namespace over files on a filesystem. Served by filesystem services. open() and sub() return new capabilities via IPC cap transfer.

interface Directory {
    open    @0 (name :Text, flags :UInt32) -> (file :File);
    list    @1 () -> (entries :List(DirEntry));
    mkdir   @2 (name :Text) -> (dir :Directory);
    remove  @3 (name :Text) -> ();
    sub     @4 (name :Text) -> (dir :Directory);
}

struct DirEntry {
    name  @0 :Text;
    size  @1 :UInt64;
    isDir @2 :Bool;
}

sub() returns a Directory scoped to a subdirectory — the analog of chroot. The caller cannot traverse upward or see the parent directory. open() with create flags creates a new file if it doesn’t exist.

The flags field in open() is a bitmask: CREATE = 1, TRUNCATE = 2, APPEND = 4. No READ/WRITE flags — those are determined by the Directory cap’s attenuation (a read-only Directory returns read-only Files).
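
A quick illustration of the flag encoding, using the constant values stated above:

```rust
// Flag constants as stated in the text (CREATE = 1, TRUNCATE = 2, APPEND = 4).
const CREATE: u32 = 1;
const TRUNCATE: u32 = 2;
const APPEND: u32 = 4;

fn main() {
    // "Create the file if missing, then append": combine by OR-ing bits.
    let flags = CREATE | APPEND;
    assert_eq!(flags, 5);
    assert!(flags & CREATE != 0);
    assert!(flags & TRUNCATE == 0); // no truncation requested
    // Read/write authority is deliberately NOT in this bitmask:
    // it comes from the Directory cap's attenuation, not from open().
}
```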

Syscall Trace: Reading a File from a FAT USB Drive

Four userspace processes: App, FAT service, USB mass storage, xHCI driver.

With promise pipelining (one submission):

Cap’n Proto promise pipelining lets the App chain dependent calls without waiting for intermediate results. The App submits a single pipelined request: “open this file, then read from the result”:

# Single pipelined submission (SQEs with PIPELINE flag):
#   call 0: dir.open("report.pdf")         → promise P0
#   call 1: P0.file.read(offset=0, len=4096)  → depends on P0

cap_submit([
    {cap=2, method=OPEN, params={"report.pdf", flags=0}},
    {cap=PIPELINE(0, field=file), method=READ, params={offset:0, length:4096}},
])
  → kernel routes call 0 to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject, replies with File cap
  → kernel sees pipelined call 1 targeting the File cap from call 0
  → kernel dispatches call 1 to the same FAT service (or direct-invokes
    the new File CapObject if it's a local endpoint)
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → USB mass storage → xHCI → hardware → back up
  ← completion: {data: [4096 bytes]}, File cap installed as cap_id=5

One app-to-kernel transition. The kernel resolves the pipeline dependency internally — the App never sees the intermediate File cap until the whole chain completes (though the cap is installed and usable afterward).

This is a core Cap’n Proto feature: by expressing “call method on the not-yet-resolved result of another call,” the client avoids a round-trip for each link in the chain. For deeper chains (e.g., dir.sub("a").sub("b").open("file").read(0, 4096)), the savings compound — one submission instead of four sequential syscalls.

Without pipelining (two sequential ring submissions):

Without promise pipelining, the App submits two separate CALL SQEs via the ring, blocking on each completion before submitting the next:

# 1. Open file (App holds Directory cap, cap_id=2)
# App writes CALL SQE: {cap=2, method=OPEN, params={"report.pdf", flags=0}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject for this file
  → FAT service posts RETURN SQE with [FileCapObject] in xfer_caps
  → kernel installs File cap in App's table → cap_id=5
  ← App reads CQE: result={file: cap_index=0}, new_caps=[5]

# 2. Read 4096 bytes from offset 0
# App writes CALL SQE: {cap=5, method=READ, params={offset:0, length:4096}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → kernel routes to USB mass storage
      → mass storage submits CALL SQE: {cap=usb_cap, method=BULK_TRANSFER, params={scsi_cmd}}
          → kernel routes to xHCI driver
          → xHCI programs TRBs, waits for interrupt
          ← returns raw sector data
      ← returns sector data
  ← FAT service extracts file bytes, posts RETURN SQE with {data: [4096 bytes]}

This works but costs two round-trips where pipelining needs one. The synchronous path is useful for simple cases and bootstrapping; pipelining is the intended steady-state model.

In both cases, the intermediate IPC hops (FAT → USB mass storage → xHCI) are invisible to the App.

Capability-Native Store

The Store Capability

Once the system is running, persistent storage is provided by a userspace service — the store. It’s backed by a block device (virtio-blk), and exposes a content-addressed object store where objects are capnp messages.

interface Store {
    # Store a capnp message, returns its content hash
    put @0 (data :Data) -> (hash :Data);
    # Retrieve by hash
    get @1 (hash :Data) -> (data :Data);
    # Check existence
    has @2 (hash :Data) -> (exists :Bool);
    # Delete (if caller has authority — see note below)
    delete @3 (hash :Data) -> ();
}

Note on delete: In a content-addressed store, deleting a hash can break references from other namespaces pointing to the same object. delete on the base Store interface is dangerously broad — a StoreAdmin interface (separate from Store) may be more appropriate, with delete restricted to a GC service that can verify no live references exist. Open Question #3 (GC) should be resolved before implementing delete. The attenuation table below lists Store (full) as “Read, write, delete any object” — in practice, most callers should receive a Store attenuated to put/get/has only.

Content-addressed means:

  • Deduplication is automatic (same content = same hash)
  • Integrity is verifiable (hash the data, compare)
  • References between objects are just hashes embedded in capnp messages
  • No mutable paths — “updating a file” means storing a new version and updating the reference
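These properties are easy to see in a toy model. The sketch below is a hypothetical in-memory store exposing only the safe put/get/has subset, with a 64-bit FNV-1a hash standing in for a real cryptographic content hash; the names (ToyStore, content_hash) are illustrative, not capOS APIs:

```rust
use std::collections::HashMap;

/// Toy 64-bit FNV-1a hash standing in for a real content hash (e.g. BLAKE3).
fn content_hash(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Minimal in-memory content-addressed store: put/get/has only
/// (the attenuated subset most callers should receive; no delete).
#[derive(Default)]
struct ToyStore {
    objects: HashMap<u64, Vec<u8>>,
}

impl ToyStore {
    fn put(&mut self, data: &[u8]) -> u64 {
        let hash = content_hash(data);
        // Same content, same hash: storing twice is a no-op (deduplication).
        self.objects.entry(hash).or_insert_with(|| data.to_vec());
        hash
    }
    fn get(&self, hash: u64) -> Option<&[u8]> {
        self.objects.get(&hash).map(|v| v.as_slice())
    }
    fn has(&self, hash: u64) -> bool {
        self.objects.contains_key(&hash)
    }
}
```

Integrity checking falls out of the same structure: re-hash the bytes returned by get and compare against the requested hash.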

Mutable References: Namespaces

A Namespace capability provides mutable name-to-hash mappings on top of the immutable store:

interface Namespace {
    # Resolve a name to a store hash
    resolve @0 (name :Text) -> (hash :Data);
    # Bind a name to a hash (if caller has write authority)
    bind @1 (name :Text, hash :Data) -> ();
    # List names (if caller has list authority)
    list @2 () -> (names :List(Text));
    # Get a sub-namespace (attenuated — restricted to a prefix)
    sub @3 (prefix :Text) -> (ns :Namespace);
}

A Namespace cap scoped to "config/" can only see and modify names under that prefix. This is the analog of a chroot — but structural, not a kernel hack. The sub() method returns a new Namespace cap via IPC cap transfer.
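A minimal sketch of this structural attenuation, assuming a hypothetical NamespaceCap type (a shared backing table plus a prefix string); this is not the actual service implementation:

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;
use std::rc::Rc;

type Table = Rc<RefCell<BTreeMap<String, u64>>>; // name -> content hash

/// Hypothetical model: a Namespace cap is (shared table, prefix).
/// Attenuation is purely structural: sub() only narrows the prefix.
struct NamespaceCap {
    table: Table,
    prefix: String,
}

impl NamespaceCap {
    fn resolve(&self, name: &str) -> Option<u64> {
        self.table.borrow().get(&(self.prefix.clone() + name)).copied()
    }
    fn bind(&self, name: &str, hash: u64) {
        self.table.borrow_mut().insert(self.prefix.clone() + name, hash);
    }
    /// A cap scoped to "config/" can never name anything outside it:
    /// every lookup key is prefixed before it touches the table.
    fn sub(&self, prefix: &str) -> NamespaceCap {
        NamespaceCap {
            table: self.table.clone(),
            prefix: self.prefix.clone() + prefix,
        }
    }
}
```

Note that there is no traversal to escape: "../secret" is just a literal name under the prefix, not a path component with special meaning.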

Future: union composition. The research survey recommends extending Namespace with Plan 9-inspired union semantics — a union(other, mode) method that merges two namespaces with before/after/replace ordering. This adds composability without a global mount table. See research.md §6.

IPC and Capability Transfer

Several storage operations return new capabilities: Directory.open() returns a File, Directory.sub() returns a Directory, Namespace.sub() returns a Namespace. This requires dynamic capability management — the kernel must install new capabilities in a process’s CapTable at runtime as part of IPC.

The Capability Ring

All kernel-userspace interaction goes through a shared-memory ring pair (submission queue + completion queue), inspired by io_uring. SQE opcodes map to capnp-rpc Level 1 message types. The ring is allocated per-process at spawn time and mapped into the process’s address space.

Syscall surface: 2 syscalls. New capabilities, operations, and transfer mechanisms are expressed as new SQE opcodes instead of expanding the syscall ABI.

#  Syscall                               Purpose
1  exit(code)                            Terminate process
2  cap_enter(min_complete, timeout_ns)   Process pending SQEs, then wait until enough CQEs exist or the timeout expires

Writing SQEs is syscall-free, but ordinary capability CALLs make progress through cap_enter. Timer polling handles non-CALL ring work and only CALL targets that explicitly opt into interrupt-context dispatch. cap_enter flushes pending SQEs and can block the process until min_complete completions are available or a finite timeout expires. An indefinite wait uses timeout_ns = u64::MAX; timeout_ns = 0 keeps the call non-blocking. A future SQPOLL-style worker can reintroduce a zero-syscall CALL-completion hot path without running arbitrary capability methods from timer interrupt context.

The ring structs and synchronous CALL dispatch are implemented and working. See capos-config/src/ring.rs for the shared ring structs and kernel/src/cap/ring.rs for kernel-side processing.

Ring Layout

One 4 KiB page per process, mapped into both kernel (HHDM) and user space:

┌──────────────────────────┐  offset 0
│ Ring Header              │  SQ/CQ head, tail, mask, flags
├──────────────────────────┤  offset 128
│ SQE Array (16 × 64 B)    │  submission queue entries
├──────────────────────────┤  offset 1152
│ CQE Array (32 × 32 B)    │  completion queue entries
└──────────────────────────┘  offset 2176

SQ: userspace owns tail (producer), kernel owns head (consumer)
CQ: kernel owns tail (producer), userspace owns head (consumer)
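The ownership split above is the usual single-producer/single-consumer discipline: head and tail are free-running counters, and a power-of-two mask folds them into the array. A hypothetical sketch of the index arithmetic (the real header lives in capos-config/src/ring.rs):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// SPSC index discipline for one ring. head/tail are free-running
/// counters; `mask = entries - 1` folds them into the entry array.
struct RingIndices {
    head: AtomicU32, // consumer-owned
    tail: AtomicU32, // producer-owned
    mask: u32,       // entries must be a power of two
}

impl RingIndices {
    fn new(entries: u32) -> Self {
        assert!(entries.is_power_of_two());
        Self { head: AtomicU32::new(0), tail: AtomicU32::new(0), mask: entries - 1 }
    }

    /// Producer side: claim the next slot index, or None if the ring is full.
    fn try_produce(&self) -> Option<u32> {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) > self.mask {
            return None; // all entries outstanding
        }
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Some(tail & self.mask)
    }

    /// Consumer side: take the next filled slot index, or None if empty.
    fn try_consume(&self) -> Option<u32> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // nothing pending
        }
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(head & self.mask)
    }
}
```

Because the counters only wrap modulo 2^32 and the capacity is a power of two, the full/empty tests stay correct across wraparound without a separate "count" field.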

SQE Opcodes

Six opcodes handle everything — client calls, server dispatch, capability transfer, pipelining, lifecycle, and timeouts:

Opcode   capnp-rpc analog  Purpose
CALL     Call              Invoke method on a capability
RETURN   Return            Respond to incoming call (server side)
RECV     (implicit)        Wait for incoming calls on Endpoint
RELEASE  Release           Drop a capability reference
FINISH   Finish            Release pipeline answer state
TIMEOUT  —                 Post a CQE after N nanoseconds (io_uring-inspired)

TIMEOUT is an alternative to the timeout_ns argument on cap_enter: it works with zero-syscall polling (kernel fires the CQE on a timer tick) and composes with LINK/DRAIN for deadline-based chains.

SQE flags: PIPELINE (cap_id is a promise reference), LINK (chain to next SQE), MULTISHOT (keep generating CQEs), DRAIN (barrier).

Promise Pipelining

A CALL SQE can target either a concrete CapId or a PromisedAnswer reference (via the PIPELINE flag + pipeline_dep/pipeline_field fields). The kernel resolves the dependency chain internally:

SQE[0]: CALL dir.open("report.pdf")        → user_data=100
SQE[1]: CALL [PIPELINE: dep=100, field=0].read(0, 4096)  → user_data=101

One cap_enter call. The kernel dispatches SQE[0], extracts the File cap from the result, and dispatches SQE[1] against it — all without returning to userspace between steps.
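The dependency resolution can be illustrated with a tiny model: a PIPELINE call names an earlier SQE's user_data instead of a concrete cap, and the dispatcher substitutes that SQE's answer before invoking. Everything here (Call, Target, dispatch) is hypothetical scaffolding, not kernel code:

```rust
use std::collections::HashMap;

/// Miniature model of in-kernel pipeline resolution. `user_data` keys
/// the answer table; a Pipeline target names a dependency's answer
/// instead of a concrete capability.
struct Call {
    user_data: u64,
    target: Target,
}

enum Target {
    Cap(u64),              // concrete CapId
    Pipeline { dep: u64 }, // user_data of the SQE whose answer we use
}

/// Dispatch a batch of calls; `invoke` stands in for method dispatch
/// and returns the capability produced by the call. Returns CQEs as
/// (user_data, result_cap) pairs.
fn dispatch(calls: &[Call], invoke: impl Fn(u64) -> u64) -> Vec<(u64, u64)> {
    let mut answers: HashMap<u64, u64> = HashMap::new();
    let mut cqes = Vec::new();
    for c in calls {
        let cap = match &c.target {
            Target::Cap(id) => *id,
            // Resolve the promise against an earlier answer, with no
            // return to userspace in between.
            Target::Pipeline { dep } => answers[dep],
        };
        let result_cap = invoke(cap);
        answers.insert(c.user_data, result_cap);
        cqes.push((c.user_data, result_cap));
    }
    cqes
}
```

The real kernel additionally has to handle failed dependencies (the promise's call errored) and out-of-order completion; this sketch only shows the happy path.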

The Endpoint Kernel Object

For cross-process IPC, an Endpoint connects client-side proxy caps to a server’s receive loop:

Client's CapTable                                  Server's CapTable
┌──────────────────┐                               ┌──────────────────┐
│ cap 2: Proxy     │                               │ cap 0: Endpoint  │
│   → endpoint ────┼───── Endpoint ◄── RECV SQE ───│                  │
│   badge: 42      │      (kernel obj)             │                  │
└──────────────────┘                               └──────────────────┘

The server posts a RECV SQE (with MULTISHOT flag). Incoming calls appear as CQEs with badge, interface_id, method_id, and a kernel-assigned call_id. The server responds by posting a RETURN SQE referencing the call_id.

interface_id is the transported schema ID for the interface being invoked. It should equal the generated TYPE_ID for that capnp interface. cap_id is the authority-bearing table handle; interface_id is only the protocol tag. The target capability entry owns one public interface; method_id selects a method inside that interface, while cap_id identifies the object being invoked. If the same backing state needs another interface, the transport should mint a separate capability entry for that interface rather than letting one handle accept multiple unrelated interface_id values.

Direct-Switch IPC

When a client’s CALL targets a cap served by a blocked server (waiting on RECV), the kernel marks that server as the direct IPC handoff target so the next context-switch path runs the callee before unrelated round-robin work. The current implementation still uses the ordinary saved-context restore path; small-message register transfer remains a future fastpath after measurement. See research.md §2.

Capability Transfer via Ring

Capabilities travel as sideband arrays (CapTransferDescriptor) alongside capnp message bytes:

  • CALL params: params buffer contains the capnp message bytes followed by xfer_cap_count transfer descriptors packed at addr + len, which must be aligned to CAP_TRANSFER_DESCRIPTOR_ALIGNMENT.
  • RETURN results: server result buffers carry the capnp reply bytes and may carry return transfer descriptors at addr + len; the kernel installs the transferred capabilities and appends destination capability records to the caller’s result buffer after the normal result bytes. The count is reported in the CQE’s cap_count field, and the records are written as CapTransferResult { cap_id, interface_id } values starting at result_addr + result (immediately after the reply bytes). The caller’s result buffer (result_len) must be large enough for both the normal reply bytes and all appended cap_count records.

xfer_cap_count > 0 with malformed descriptor metadata (bad mode bits, reserved bits, _reserved0, or misalignment) fails closed as CAP_ERR_INVALID_TRANSFER_DESCRIPTOR. Kernels that have not yet enabled transfer handling should return CAP_ERR_TRANSFER_NOT_SUPPORTED for transfer-bearing SQEs.

The capnp wire format’s WirePointerKind::Other encodes capability indices in messages. The sideband arrays map these indices to actual CapIds. The kernel does not parse capnp messages — it transfers a list of caps alongside the opaque message bytes.

Dynamic Capability Management

Every open(), sub(), or resolve() creates and transfers a new capability at runtime. The kernel’s CapTable insert() and remove() are the primitives. Capabilities flow through RETURN SQE sideband arrays (and through the manifest at boot). No separate cap_grant mechanism needed — authority flow follows the ring’s IPC graph.

The CapTable generation counter handles stale references: when a File cap is closed (slot freed, generation bumps), any cached CapId returns StaleGeneration instead of accidentally hitting a new occupant.
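A sketch of that generation check, with hypothetical types standing in for the kernel's CapTable:

```rust
/// A CapId names a slot plus the generation it was minted under.
#[derive(Clone, Copy, PartialEq, Debug)]
struct CapId {
    slot: u32,
    generation: u32,
}

#[derive(Debug, PartialEq)]
enum CapError {
    StaleGeneration,
    EmptySlot,
}

struct Slot {
    generation: u32,
    object: Option<&'static str>, // stand-in for a kernel object ref
}

struct CapTable {
    slots: Vec<Slot>,
}

impl CapTable {
    fn new(n: usize) -> Self {
        Self { slots: (0..n).map(|_| Slot { generation: 0, object: None }).collect() }
    }
    fn insert(&mut self, object: &'static str) -> CapId {
        // Sketch: panics when full; the kernel would report an error.
        let slot = self.slots.iter().position(|s| s.object.is_none()).unwrap();
        self.slots[slot].object = Some(object);
        CapId { slot: slot as u32, generation: self.slots[slot].generation }
    }
    fn remove(&mut self, id: CapId) {
        let s = &mut self.slots[id.slot as usize];
        s.object = None;
        s.generation += 1; // every outstanding CapId for this slot goes stale
    }
    fn lookup(&self, id: CapId) -> Result<&'static str, CapError> {
        let s = &self.slots[id.slot as usize];
        if s.generation != id.generation {
            return Err(CapError::StaleGeneration);
        }
        s.object.ok_or(CapError::EmptySlot)
    }
}
```

The key property: a freed slot can be safely reused, because the bumped generation makes every cached CapId for the old occupant fail closed instead of aliasing the new one.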

Shared Memory for Bulk Data

Copying file data through capnp Data fields works for metadata and small reads, but is impractical for anything above a few KB. A 1 MB read through a capability CALL copies data four times: device → driver heap → capnp message → kernel buffer → client buffer.

SharedBuffer Capability

A SharedBuffer (also called MemoryObject, listed in ROADMAP.md Stage 6) is a kernel object backed by physical pages that can be mapped into multiple address spaces simultaneously. Zero copies between processes.

interface SharedBuffer {
    # Map into caller's address space (returns virtual address and size)
    map   @0 () -> (addr :UInt64, size :UInt64);
    # Unmap from caller's address space
    unmap @1 () -> ();
    # Size of the buffer
    size  @2 () -> (bytes :UInt64);
}

The kernel creates SharedBuffer objects on request (via a kernel-provided BufferAllocator capability). The pages are reference-counted — the buffer persists as long as any process holds a cap to it.
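The lifetime rule ("persists as long as any process holds a cap") is ordinary reference counting. A toy model using Arc in place of the kernel's physical-page refcount; the function name is illustrative only:

```rust
use std::sync::Arc;

/// Hypothetical model: backing pages persist while any holder remains.
/// Arc stands in for the kernel's per-buffer physical-page refcount.
fn shared_buffer_refcount_demo() -> (usize, usize) {
    let pages: Arc<Vec<u8>> = Arc::new(vec![0u8; 4096]); // kernel allocates pages
    let client_map = Arc::clone(&pages); // client maps the buffer
    let server_map = Arc::clone(&pages); // server maps the same pages
    let while_both_mapped = Arc::strong_count(&pages);
    drop(server_map); // server unmaps: count drops, pages stay live
    let after_server_unmap = Arc::strong_count(&pages);
    let _ = client_map;
    (while_both_mapped, after_server_unmap)
}
```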

File I/O with SharedBuffer

File and BlockDevice interfaces support both inline-Data and SharedBuffer modes:

# Small read (< ~4 KB): inline in capnp message
file.read(offset=0, length=256) → {data: [256 bytes]}

# Large read: caller provides SharedBuffer, server fills it
let buf = buf_alloc.create(1048576);   # 1 MB SharedBuffer
file.readBuf(offset=0, buf, length=1048576) → {bytesRead: 1048576}
# Data is now in buf's mapped pages — no copy through kernel

Extended File interface with SharedBuffer support:

interface File {
    read      @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write     @1 (offset :UInt64, data :Data) -> (written :UInt32);
    readBuf   @2 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (bytesRead :UInt32);
    writeBuf  @3 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (written :UInt32);
    stat      @4 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate  @5 (length :UInt64) -> ();
    sync      @6 () -> ();
    close     @7 () -> ();
}

The readBuf/writeBuf methods accept a SharedBuffer cap (transferred via IPC). The server maps the buffer, performs DMA or memory copies into it, then returns. The caller reads directly from the mapped pages.

For BlockDevice, the same pattern applies — the driver maps the SharedBuffer, programs DMA descriptors pointing to its physical pages, and the device writes directly into the shared memory.

When to Use Each Mode

Scenario                        Mechanism                                            Why
Reading a 64-byte config value  File.read() inline Data                              Copy overhead negligible
Reading a 10 MB binary          File.readBuf() SharedBuffer                          Avoids 4× copy overhead
FAT directory entry (32 bytes)  BlockDevice.readBlocks() inline                      Small metadata read
Streaming video frames          File.readBuf() + ring of SharedBuffers               Continuous zero-copy
Network packet buffers          SharedBuffer ring between NIC driver and net stack   DMA-capable pages

Attenuation

Storage services mint restricted capabilities using wrapper CapObjects:

Capability                Authority
Directory (full)          Open, list, mkdir, remove, sub
Directory (read-only)     Open (returns read-only Files), list, sub only
File (full)               Read, write, truncate, sync
File (read-only)          Read and stat only
File (append-only)        Read, stat, write at end only
Store (full)              Read, write, delete any object
Store (read-only)         Get and has only
Namespace (full)          Resolve, bind, list under prefix
Namespace (read-only)     Resolve and list only
Blob (single object)      Read one specific hash
SharedBuffer (read-only)  Map as read-only (page table: R, no W)

An application that only needs to read its config gets a read-only Directory scoped to its config path. It can’t write, can’t see other apps’ directories, can’t access the raw BlockDevice.

Naming Without Paths

Traditional OS: process opens /var/lib/myapp/data.db — a global path.

capOS: process receives a Directory or Namespace cap at spawn time, opens "data.db" within it. The process has no idea where on disk this lives. It can’t traverse upward. There is no global root.

# Traditional: global path namespace
/
├── etc/
│   └── myapp/
│       └── config.toml
├── var/
│   └── lib/
│       └── myapp/
│           └── data.db
└── sbin/
    └── myapp

# capOS: per-process capability set (no global namespace)
Process "myapp" sees:
  "config" → Directory(read-only, scoped to myapp's config files)
  "data"   → Directory(read-write, scoped to myapp's data files)
  "state"  → Namespace(read-write, scoped to myapp's store objects)
  "log"    → Console cap
  "api"    → HttpEndpoint cap

The process doesn’t know or care about the backing storage layout. It just uses the capabilities it was granted.

Configuration

Build-Time Config (Boot Manifest)

The system manifest is authored at build time. The human-writable source could be any format — TOML, CUE, or even a Makefile target that generates the capnp binary. What matters is that it compiles to a SystemManifest capnp message baked into the ISO.

Example source (TOML, compiled to capnp by a build tool):

[services.virtio-net]
binary = "virtio-net"
restart = "always"
caps = [
    { name = "device_mmio", source = { kernel = "device_mmio" } },
    { name = "interrupt", source = { kernel = "interrupt" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["nic"]

[services.net-stack]
binary = "net-stack"
restart = "always"
caps = [
    { name = "nic", source = { service = { service = "virtio-net", export = "nic" } } },
    { name = "timer", source = { kernel = "timer" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["net"]

[services.fat-fs]
binary = "fat-fs"
restart = "always"
caps = [
    { name = "blk", source = { service = { service = "usb-storage", export = "block-device" } } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["root-dir"]

[services.my-app]
binary = "my-app"
restart = "on-failure"
caps = [
    { name = "api", source = { service = { service = "http-service", export = "api" } } },
    { name = "docs", source = { service = { service = "fat-fs", export = "root-dir" } } },
    { name = "data", source = { service = { service = "store", export = "namespace" } } },
    { name = "log", source = { kernel = "console" } },
]

A build tool validates this against the capnp schemas (does virtio-net actually export "nic"? does http-service support endpoint() minting?) and produces the binary manifest.

Runtime Config (via Store)

Once the store service is running, configuration can be stored there and updated without rebuilding the ISO. The store is just another capability — a config-management service could watch for changes and signal services to reload.

Connection to Network Transparency

If capabilities are the only abstraction, and capnp is the only wire format, then the transport is irrelevant:

  • Local IPC: capnp message copied between address spaces by kernel
  • Local store: capnp message written to block device
  • Remote IPC: capnp message sent over TCP to another machine
  • Remote store: capnp message fetched from a remote store service

A capability reference doesn’t encode where the backing service lives. The kernel (or a proxy) handles routing. This means:

  • A Directory cap could be backed by local FAT or a remote 9P server
  • A Namespace cap could be backed by local storage or a remote store
  • A Fetch cap could route through a local HTTP service or a remote proxy
  • A ProcessSpawner cap could spawn locally or on a remote machine

The system manifest could describe services that run on different machines, and the capability graph spans the network. This is the “network transparency” item in the roadmap — it falls out naturally from the model.

Persistence of the Capability Graph

The live capability graph (which process holds which caps) is ephemeral — it exists in kernel memory and is lost on reboot. The system manifest describes the intended graph, and init rebuilds it on each boot.

For true persistence (resume after reboot without re-initializing):

  1. Each service serializes its state to the store before shutdown
  2. On next boot, the manifest includes “restore from store hash X” hints
  3. Services read their saved state from the store and resume

This is application-level persistence, not kernel-level. The kernel doesn’t snapshot the capability graph — services are responsible for their own state. This avoids the complexity of EROS-style transparent persistence while still allowing stateful services.

Phases

Phase 1: Boot Manifest (parallel with Stage 4)

  • Define SystemManifest schema in schema/
  • Build tool (tools/mkmanifest) that compiles system.cue into a capnp-encoded manifest and packs it into the ISO as a Limine module
  • Kernel parses the manifest and currently creates one process per ServiceEntry
  • Kernel passes the manifest to init as bytes or a Manifest capability, without interpreting the child service graph
  • In the system-spawn.cue transition path, init becomes a generic manifest executor instead of a demo parser
  • No persistent storage yet — boot image is the only data source

Phase 2: File I/O Interfaces in Schema (parallel with Stage 6)

Depends on: IPC (Stage 6) for cross-process cap transfer.

  • Add BlockDevice, File, Directory, DirEntry, SharedBuffer to schema/capos.capnp
  • Implement kernel Endpoint and RECV/RETURN SQE opcodes
  • Capability transfer in IPC replies (RETURN SQE xfer_caps installs caps in caller’s table)
  • Demo: two-process file server (in-memory File/Directory service + client)

Phase 3: RAM-backed Store (after Phase 2)

Depends on: IPC (Stage 6) for cross-process store access.

  • Implement Store and Namespace as a userspace service
  • Backed by RAM (no disk driver yet, data lost on reboot)
  • Services can store and retrieve capnp objects at runtime
  • Demonstrates the naming model without requiring a block device driver
  • Namespace.sub() returns new caps via IPC cap transfer

Phase 4: BlockDevice Drivers and Filesystem (after virtio infrastructure)

  • virtio-blk driver (userspace, reuses virtqueue infrastructure from networking smoke test)
  • BlockDevice trait implementation
  • FAT filesystem service: wraps BlockDevice, exports Directory/File caps
  • SharedBuffer integration for bulk reads (depends on Stage 6 MemoryObject)
  • Store service uses BlockDevice for persistence
  • System state survives reboot via store + manifest restore hints

Phase 5: Network Store (after networking)

  • Store service can replicate to or fetch from a remote store
  • Capability references transparently span machines
  • Directory cap backed by a remote filesystem (9P-style)

Relationship to Other Proposals

  • Networking proposal — the NIC driver and net stack are services described in the manifest, not hardcoded. The store could be backed by network storage once networking works. A remote Directory cap (9P over capnp) reuses the same File/Directory interfaces.
  • Service architecture proposal — the manifest replaces code-as-config for init. ProcessSpawner, supervision, and cap export work as described there, but driven by manifest data instead of compiled Rust code. IPC Endpoints are the mechanism for service export.
  • Capability model — IPC cap transfer (Endpoint + RETURN SQE) is the mechanism that makes open() and resolve() work. SharedBuffer is the bulk data path that makes file I/O practical. Both are tracked in ROADMAP.md Stage 6.

Open Questions

  1. Manifest validation. How much can the build tool verify statically? Cap export names depend on runtime behavior of services. Should services declare their exports in their own metadata (like a package manifest)?

  2. Schema evolution. When a service’s capnp interface changes, stored objects referencing the old schema need migration. Cap’n Proto has backwards-compatible schema evolution, but breaking changes need a story.

  3. Garbage collection. Content-addressed store accumulates unreferenced objects. Who GCs? A separate service with Store read + delete authority? Reference counting in the namespace layer?

  4. Large objects. Storing multi-megabyte binaries as single capnp Data fields is wasteful (capnp allocates contiguously). SharedBuffer partially addresses this for I/O, but the Store’s put/get interface still takes Data. Options: chunked storage (Merkle tree of hashes), a streaming Blob interface, or SharedBuffer-aware Store methods.

  5. Trust model for the manifest. The boot manifest has full authority to define the system. Who signs it? How do you prevent a tampered ISO from granting excessive caps? Secure boot integration?

  6. File locking and concurrent access. Multiple processes opening the same file through the same filesystem service need coordination. Options: mandatory locking in the filesystem service (rejects conflicting opens), advisory locking via a separate Lock capability, or single-writer enforcement at the Directory level (open with exclusive flag).

  7. RETURN+RECV atomicity. When a server posts a RETURN SQE followed by a RECV SQE, there must be no window where a client call can arrive but the server isn’t listening. SQE LINK chaining (RETURN → RECV) should provide this atomicity — the kernel processes both SQEs as a unit.

Proposal: Userspace TCP/IP Networking

How capOS gets from “kernel boots” to “userspace process opens a TCP connection.”

This document has two parts: a kernel-internal smoke test (actionable now) and a userspace networking architecture (blocked on Stages 4-6).


Part 1: Kernel-Internal Networking (Phase A)

Prove that capOS can send and receive TCP/IP traffic. Everything runs in-kernel — no IPC, no capability syscalls, no multiple processes needed.

What’s Needed

  1. PCI enumeration — scan config space, find virtio-net device. Uses the standalone PCI/PCIe subsystem described in cloud-deployment-proposal.md Phase 4 (~200 lines of glue code on top of the shared PCI infrastructure)
  2. virtio-net driver — init virtqueues, send/receive raw Ethernet frames. Use virtio-drivers crate or implement manually (~600-800 lines)
  3. Timer — PIT or LAPIC timer for smoltcp’s poll loop (retransmit timeouts, Instant::now() support). Not a full scheduler — just a monotonic clock (~50-100 lines)
  4. smoltcp integration — implement phy::Device trait over the in-kernel driver, create an Interface with static IP, ICMP ping, then TCP
  5. QEMU flags — add -netdev user,id=n0 -device virtio-net-pci,netdev=n0 to the Makefile

Milestones

  • Ping: ICMP echo to QEMU gateway (10.0.2.2 with default user-mode net)
  • HTTP: TCP connection to a host-side server, send GET, receive response

Estimated Scope

~1000-1500 lines of new kernel code. ~200 more for TCP on top of ping.

Crate Dependencies

Crate           Purpose                    no_std
smoltcp         TCP/IP stack               yes (features: medium-ethernet, proto-ipv4, socket-tcp)
virtio-drivers  virtio device abstraction  yes (optional — can implement manually)

Timer Source Decision

Resolved: PIT is already configured at 100 Hz from Stage 5. A monotonic TICK_COUNT (AtomicU64 in kernel/src/arch/x86_64/context.rs) increments on each timer interrupt, providing ~10ms resolution — sufficient for TCP timeouts. Switch to LAPIC timer when SMP lands (see smp-proposal.md Phase A).
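A sketch of that tick clock, assuming the name TICK_HZ for the 100 Hz rate (TICK_COUNT is the real counter described above; the kernel version lives in kernel/src/arch/x86_64/context.rs):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Monotonic tick counter bumped by the 100 Hz PIT interrupt handler.
static TICK_COUNT: AtomicU64 = AtomicU64::new(0);
const TICK_HZ: u64 = 100; // 10 ms per tick

/// Called from the timer interrupt path.
fn timer_interrupt() {
    TICK_COUNT.fetch_add(1, Ordering::Relaxed);
}

/// Milliseconds since boot at ~10 ms resolution; coarse, but enough
/// for smoltcp's retransmit timers and Instant::now().
fn now_millis() -> u64 {
    TICK_COUNT.load(Ordering::Relaxed) * (1000 / TICK_HZ)
}
```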

QEMU Network Config

Config                                               Use case
-netdev user,id=n0 -device virtio-net-pci,netdev=n0  Default: NAT, guest reaches host
Add hostfwd=tcp::5555-:80 to netdev                  Forward host port to guest

Part 2: Userspace Networking Architecture (Phases B+C)

Blocked on: Stage 4 (Capability Syscalls), Stage 5 (Scheduling), Stage 6 (IPC + Capability Transfer).

Architecture

+--------------------------------------------------+
|  Application Process                             |
|    holds: TcpSocket cap, UdpSocket cap, ...      |
|    calls: connect(), send(), recv() via capnp    |
+---------------------------+----------------------+
                            | IPC (capnp messages)
+---------------------------v----------------------+
|  Network Stack Process (userspace)               |
|    smoltcp TCP/IP stack                          |
|    holds: NIC cap (from driver), Timer cap       |
|    implements: TcpSocket, UdpSocket, Dns caps    |
+---------------------------+----------------------+
                            | IPC (capnp messages)
+---------------------------v----------------------+
|  NIC Driver Process (userspace)                  |
|    virtio-net driver                             |
|    holds: DeviceMmio cap, Interrupt cap          |
|    implements: Nic cap                           |
+---------------------------+----------------------+
                            | capability syscalls
+---------------------------v----------------------+
|  Kernel                                          |
|    DeviceMmio cap: maps BAR into driver process  |
|    Interrupt cap: routes virtio IRQ to driver     |
|    Timer cap: provides monotonic clock            |
+--------------------------------------------------+

Three separate processes, each with minimal authority:

  1. NIC driver – only has access to the specific virtio-net device registers and its interrupt line. Implements the Nic interface.
  2. Network stack – holds the Nic capability from the driver. Runs smoltcp. Implements higher-level socket interfaces.
  3. Application – holds socket capabilities from the network stack. Cannot touch the NIC or raw packets directly.

Prerequisites

Prerequisite                    Roadmap stage             Why
Capability syscalls             Stage 4 (sync path done)  All resource access via cap invocations
Scheduling + preemption         Stage 5 (core done)       Network I/O requires blocking/waking
IPC + capability transfer       Stage 6                   Cross-process cap calls
Interrupt routing to userspace  New kernel primitive      NIC driver receives IRQs
MMIO mapping capability         New kernel primitive      NIC driver accesses device registers

Phase B: Capability Interfaces

  • Define networking schema (Nic, TcpSocket, etc.) in schema/net.capnp
  • Implement Nic and NetworkManager as kernel-internal CapObjects wrapping the Phase A code
  • Verify capability-based invocation works end-to-end in kernel

Phase C: Userspace Decomposition

  • Move NIC driver into a userspace process
  • Move network stack into a separate userspace process
  • Application process uses socket capabilities via IPC
  • Full capability isolation achieved

Cap’n Proto Schema (draft — will evolve with IPC implementation)

interface Nic {
    transmit @0 (frame :Data) -> ();
    receive @1 () -> (frame :Data);
    macAddress @2 () -> (addr :Data);
    linkStatus @3 () -> (up :Bool);
}

interface DeviceMmio {
    map @0 (bar :UInt8) -> (virtualAddr :UInt64, size :UInt64);
    unmap @1 (virtualAddr :UInt64) -> ();
}

interface Interrupt {
    wait @0 () -> ();
    ack @1 () -> ();
}

interface Timer {
    now @0 () -> (ns :UInt64);
    sleep @1 (ns :UInt64) -> ();
}

interface TcpSocket {
    connect @0 (addr :Data, port :UInt16) -> ();
    send @1 (data :Data) -> (bytesSent :UInt32);
    recv @2 (maxLen :UInt32) -> (data :Data);
    close @3 () -> ();
}

interface NetworkManager {
    createTcpListener @0 () -> (listener :TcpListener);
    createUdpSocket @1 () -> (socket :UdpSocket);
    getConfig @2 () -> (addr :Data, netmask :Data, gateway :Data);
}

Open Questions

  1. DMA memory management. Dedicated DmaAllocator capability vs extending FrameAllocator with allocDma?
  2. Blocking model. Kernel blocks caller on IPC channel vs return “would block” vs both?
  3. Buffer ownership. Copy into IPC message vs shared memory vs capability lending?

References

Crates

Specs

Prior Art

QEMU

Proposal: Error Handling for Capability Invocations

How capOS communicates errors from capability calls back to userspace processes.

This proposal defines a two-level error model: transport errors (the invocation mechanism itself failed) and application errors (the capability processed the request and returned a structured error). The design aligns with Cap’n Proto’s own exception model and the patterns used by seL4, Zircon, and other capability systems.

Note (2025-06): This proposal was written when cap_call was the synchronous invocation syscall. Since then, cap_call has been replaced by the shared-memory capability ring + cap_enter syscall. The two-level error model and CapException schema remain valid, but the delivery mechanism changes: transport errors and application errors are communicated through CQE result fields (status code + result buffer), not syscall return values. The “Syscall Return Convention” section below describes the original cap_call convention; a future revision should map these error codes to CQE status fields instead.

Current CQE Error Namespace

The capability ring uses signed 32-bit CapCqe.result values. Non-negative values are opcode-specific success results; negative values are kernel transport errors defined in capos-config/src/ring.rs:

Code  Name                                 Meaning
-1    CAP_ERR_INVALID_REQUEST              Malformed request metadata or an opcode value not reserved in the ABI.
-2    CAP_ERR_INVALID_PARAMS_BUFFER        SQE parameter buffer is unmapped, out of range, or not readable.
-3    CAP_ERR_INVALID_RESULT_BUFFER        SQE result buffer is unmapped, out of range, or not writable.
-4    CAP_ERR_INVOKE_FAILED                Capability lookup or invocation failed before a successful result was produced.
-5    CAP_ERR_UNSUPPORTED_OPCODE           Opcode is reserved in the ABI but not yet dispatched. Currently returned for CAP_OP_FINISH; CAP_OP_RELEASE has kernel dispatch and reports stale/non-owned caps as request/invoke failures.
-6    CAP_ERR_TRANSFER_NOT_SUPPORTED       Transfer mode or sideband descriptor layout is recognized as unsupported by this kernel.
-7    CAP_ERR_INVALID_TRANSFER_DESCRIPTOR  xfer_cap_count descriptor layout is malformed or contains reserved bits.
-8    CAP_ERR_TRANSFER_ABORTED             A transaction-in-progress transfer failed and must not produce partial capability state.

This is deliberately a small transport namespace. Interface-specific failures should be encoded in the result payload once the target capability successfully handles the request.

  • CAP_ERR_TRANSFER_NOT_SUPPORTED is used for transfer-bearing SQEs that the kernel currently dispatches but does not yet process (xfer_cap_count != 0 on kernels where sideband transfer is off).
  • CAP_ERR_INVALID_TRANSFER_DESCRIPTOR is used for transfer SQEs that dispatch structurally but whose transfer metadata is malformed:
    • descriptor transfer_mode is not exactly CAP_TRANSFER_MODE_COPY or CAP_TRANSFER_MODE_MOVE;
    • any descriptor reserved bits are set;
    • any descriptor _reserved0 field is non-zero;
    • descriptor region placement (addr + len) is misaligned;
    • descriptor range overflows or cannot be safely bounded.
  • CAP_ERR_TRANSFER_ABORTED is reserved for transaction failure after partial transfer side effects are prepared and must not be observed (all-or-nothing rollback boundary).
  • CAP_ERR_INVALID_REQUEST remains for non-transfer transport malformation (unsupported opcodes for today, unsupported SQE fields not part of the transfer path, and malformed result/payload buffer pairs).
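On the userspace side, this namespace maps naturally onto a Rust enum. The decoder below is a hypothetical sketch mirroring the table above; the canonical constants live in capos-config/src/ring.rs:

```rust
/// Transport error namespace as seen by a userspace ring client.
#[derive(Debug, PartialEq)]
enum CapTransportError {
    InvalidRequest,            // -1
    InvalidParamsBuffer,       // -2
    InvalidResultBuffer,       // -3
    InvokeFailed,              // -4
    UnsupportedOpcode,         // -5
    TransferNotSupported,      // -6
    InvalidTransferDescriptor, // -7
    TransferAborted,           // -8
    Unknown(i32),              // forward-compat: codes this build doesn't know
}

/// Non-negative CQE results are opcode-specific success values;
/// negative values map to the transport error namespace.
fn decode_cqe_result(result: i32) -> Result<u32, CapTransportError> {
    use CapTransportError::*;
    match result {
        n if n >= 0 => Ok(n as u32),
        -1 => Err(InvalidRequest),
        -2 => Err(InvalidParamsBuffer),
        -3 => Err(InvalidResultBuffer),
        -4 => Err(InvokeFailed),
        -5 => Err(UnsupportedOpcode),
        -6 => Err(TransferNotSupported),
        -7 => Err(InvalidTransferDescriptor),
        -8 => Err(TransferAborted),
        n => Err(Unknown(n)),
    }
}
```

Keeping an Unknown variant lets old userspace binaries degrade gracefully when a newer kernel introduces additional transport codes.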

Problem Statement

Currently, cap_call returns u64::MAX on any error and prints the details to the kernel serial console. The userspace process receives no information about what went wrong – it cannot distinguish “invalid capability ID” from “method not implemented” from “out of memory inside the service.”

Every other capability system separates transport-level errors (bad handle, message validation failure) from application-level errors (the service processed the request and returned a meaningful error). capOS needs both.


Background: How Other Systems Do This

Cap’n Proto RPC Protocol

The Cap’n Proto RPC specification defines an Exception type in rpc.capnp:

struct Exception {
  reason @0 :Text;
  type @3 :Type;
  enum Type {
    failed @0;        # deterministic failure, retrying won't help
    overloaded @1;    # temporary resource exhaustion, retry with backoff
    disconnected @2;  # connection to a required capability was lost
    unimplemented @3; # method not supported by this server
  }
  trace @4 :Text;
}

These four types describe client response strategy, not error semantics. The capnp Rust crate maps them to capnp::ErrorKind::{Failed, Overloaded, Disconnected, Unimplemented}.

Cap’n Proto’s official philosophy (from KJ library and Kenton Varda’s writings): exceptions are for infrastructure failures, not application semantics. Application-level errors should be modeled as unions in method return types.

Capability OS Error Models

| System | Transport errors | Application errors |
|---|---|---|
| seL4 | seL4_Error enum (11 values) from syscall return | In-band via IPC message payload (user-defined) |
| Zircon | zx_status_t (signed i32, ~30 values) from syscall | FIDL per-method error type (union in return) |
| EROS/Coyotos | Kernel-generated invocation exceptions | OPR0.ex flag + exception code in reply payload |
| Plan 9 (9P) | Connection loss (no in-band transport error) | Rerror message with UTF-8 error string |
| Genode | Ipc_error exception | Declared C++ exceptions via GENODE_RPC_THROW |

Common pattern: a small kernel error code set for transport failures, combined with service-specific typed errors for application failures.

POSIX errno: Why Not

POSIX errno is a global flat namespace of ~100 integers that conflates transport errors (EBADF) with application errors (ENOENT). In a capability system:

  • EACCES/EPERM don’t apply – if you have the capability, you have permission; if you don’t, you can’t even name the resource.
  • A global error namespace conflicts with typed interfaces where errors should be scoped to the interface.
  • No room for structured information (which argument was invalid, how much memory was needed).
  • Not composable across trust boundaries – a callee’s errno has no meaning in the caller’s address space without explicit serialization.

Design

Principle: Two Levels, One Wire Format

Level 1 – Transport errors are returned in the syscall return value. These indicate that the capability invocation mechanism itself failed before the target CapObject was reached. No result buffer is written.

Level 2 – Application errors are returned as capnp-serialized messages in the result buffer. The capability was found and dispatched; the implementation returned a structured error. The syscall return value distinguishes this from a successful result.

Both levels use Cap’n Proto serialization for the error payload (level 2 always, level 1 when there’s a result buffer available). This keeps one parsing path in userspace.

Syscall Return Convention

The cap_call syscall (number=2) currently returns:

  • 0..N – success, N bytes written to result buffer
  • u64::MAX – error (undifferentiated)

New convention:

| Return value | Meaning |
|---|---|
| 0..=(u64::MAX - 256) | Success. Value = number of bytes written to result buffer. |
| u64::MAX | Transport error: invalid capability ID or stale generation. |
| u64::MAX - 1 | Transport error: invalid user buffer (bad pointer, unmapped, not writable). |
| u64::MAX - 2 | Transport error: params too large (exceeds MAX_CAP_CALL_PARAMS). |
| u64::MAX - 3 | Application error: the capability returned an error. A CapException message has been written to the result buffer. The message length is encoded in the low 32 bits of the value at result_ptr (the capnp message itself). |
| u64::MAX - 4 | Application error, but the result buffer was too small or NULL. The error detail is lost; the caller should retry with a larger buffer or treat it as an opaque failure. |

The transport error codes are a small closed set (like seL4’s 11 values). New transport errors can be added, but the set should remain small and stable.
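Written down as constants, the convention reduces to one numeric band. The ECAP_* names below are assumptions (the table above only fixes the values), but the band check follows directly from the success range:

```rust
// Assumed constant names for the return convention above.
pub const ECAP_NOT_FOUND: u64 = u64::MAX; // invalid cap ID / stale generation
pub const ECAP_BAD_BUFFER: u64 = u64::MAX - 1; // invalid user buffer
pub const ECAP_PARAMS_TOO_LARGE: u64 = u64::MAX - 2; // params exceed MAX_CAP_CALL_PARAMS
pub const ECAP_APPLICATION_ERROR: u64 = u64::MAX - 3; // CapException in result buffer
pub const ECAP_APPLICATION_ERROR_NO_BUFFER: u64 = u64::MAX - 4; // exception detail lost

/// Everything in the reserved top-256 band is an error;
/// everything else is a byte count.
pub fn is_error(ret: u64) -> bool {
    ret >= u64::MAX - 255
}
```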

CapException Schema

Add to schema/capos.capnp:

enum ExceptionType {
    failed @0;
    overloaded @1;
    disconnected @2;
    unimplemented @3;
}

struct CapException {
    type @0 :ExceptionType;
    message @1 :Text;
}

This mirrors Cap’n Proto RPC’s Exception struct. The four types match capnp::ErrorKind and describe client response strategy:

  • failed – deterministic failure, retrying won’t help. Covers invalid arguments, invariant violations, deserialization errors, and any capnp::ErrorKind variant not in the other three categories.
  • overloaded – temporary resource exhaustion (out of frames, table full). Client may retry with backoff.
  • disconnected – the capability’s backing resource is gone (device removed, process exited). Client should re-acquire the capability.
  • unimplemented – unknown method ID for this interface. Client should not retry.

The message field is a human-readable string for diagnostics/logging. It must not contain security-sensitive information (internal pointers, kernel addresses) since it crosses the kernel-user boundary.

Application-Level Errors in Interface Schemas

Following Cap’n Proto’s philosophy, expected error conditions that a caller should handle programmatically belong in the method return type, not in the exception mechanism.

Example – FrameAllocator can legitimately run out of memory:

struct AllocResult {
    union {
        ok @0 :UInt64;       # physical address
        outOfMemory @1 :Void;
    }
}

interface FrameAllocator {
    allocFrame @0 () -> (result :AllocResult);
    freeFrame @1 (physAddr :UInt64) -> ();
    allocContiguous @2 (count :UInt32) -> (result :AllocResult);
}

The caller can pattern-match on the result union without parsing an exception. This is the Zircon/FIDL model: transport errors at the syscall layer, application errors as typed return values.
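Modeled with a native Rust enum for illustration (real callers would pattern-match the capnp-generated reader for the union; the names here are stand-ins, not generated code):

```rust
// Native-Rust stand-in for the AllocResult union above.
enum AllocResult {
    Ok(u64), // physical address
    OutOfMemory,
}

// The caller handles the expected out-of-memory case as ordinary
// control flow; no exception path is involved.
fn frame_or_none(result: AllocResult) -> Option<u64> {
    match result {
        AllocResult::Ok(pa) => Some(pa),
        AllocResult::OutOfMemory => None,
    }
}
```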

When to use each:

| Situation | Mechanism |
|---|---|
| Bad cap ID, stale generation, bad buffer | Transport error (syscall return code) |
| Deserialization failure, unknown method | CapException with failed/unimplemented |
| Temporary resource exhaustion in dispatch | CapException with overloaded |
| Expected domain-specific error | Union in method return type |
| Bug in capability implementation | CapException with failed |

Kernel Implementation

CapObject trait change

The ring SQE does not carry a caller-supplied interface ID. The trait shape below keeps interface selection out of capability implementations because each capability entry owns one public interface:

pub trait CapObject: Send + Sync {
    fn interface_id(&self) -> u64;
    fn label(&self) -> &str;
    fn call(
        &self,
        method_id: u16,
        params: &[u8],
        result: &mut [u8],
        reply_scratch: &mut dyn ReplyScratch,
    ) -> capnp::Result<CapInvokeResult>;
}

Implementations serialize directly into the caller’s result buffer and return a completion containing the number of bytes written, or Pending for async endpoint calls. Dispatch uses the interface assigned to the target capability entry; normal CALL SQEs do not need to repeat that interface ID. capnp::Error carries ErrorKind with the four RPC exception types. The kernel’s dispatch handler converts Err(capnp::Error) into a serialized CapException message and writes it to the result buffer.

Syscall handler changes

In cap_call(), the error path changes from:

Err(e) => {
    kprintln!("cap_call: ... error: {}", e);
    u64::MAX
}

to:

Err(CapError::NotFound) => ECAP_NOT_FOUND,
Err(CapError::StaleGeneration) => ECAP_NOT_FOUND,
Err(CapError::InvokeError(e)) => {
    // Serialize CapException to result buffer
    let exception_bytes = serialize_cap_exception(&e);
    if result_ptr != 0 && result_capacity >= exception_bytes.len() {
        copy_to_user(result_ptr, &exception_bytes);
        ECAP_APPLICATION_ERROR
    } else {
        ECAP_APPLICATION_ERROR_NO_BUFFER
    }
}

The serialize_cap_exception function maps capnp::ErrorKind to ExceptionType:

| capnp::ErrorKind | ExceptionType |
|---|---|
| Failed | failed |
| Overloaded | overloaded |
| Disconnected | disconnected |
| Unimplemented | unimplemented |
| All other variants (deserialization, validation) | failed |

This matches how capnp-rpc maps exceptions to the wire format.
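With both sides modeled as local enums (the real code would match on the capnp crate's ErrorKind; these stand-ins avoid that dependency for illustration), the mapping is a total match with a collapsing default arm:

```rust
// Local stand-ins for capnp::ErrorKind and the schema's ExceptionType.
// `Other` represents every non-RPC variant (deserialization, validation).
enum ErrorKind {
    Failed,
    Overloaded,
    Disconnected,
    Unimplemented,
    Other,
}

#[derive(Debug, PartialEq)]
enum ExceptionType {
    Failed,
    Overloaded,
    Disconnected,
    Unimplemented,
}

// Total mapping: anything outside the three specific kinds becomes Failed.
fn exception_type_for(kind: ErrorKind) -> ExceptionType {
    match kind {
        ErrorKind::Overloaded => ExceptionType::Overloaded,
        ErrorKind::Disconnected => ExceptionType::Disconnected,
        ErrorKind::Unimplemented => ExceptionType::Unimplemented,
        ErrorKind::Failed | ErrorKind::Other => ExceptionType::Failed,
    }
}
```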

Userspace API

The init crate (and future userspace libraries) wraps cap_call in a helper that interprets the return value:

pub enum CapCallResult {
    Ok(Vec<u8>),
    Exception(ExceptionType, String),
    TransportError(TransportError),
}

pub enum TransportError {
    InvalidCapability,
    InvalidBuffer,
    ParamsTooLarge,
}

pub fn cap_call(
    cap_id: u32,
    method_id: u16,
    params: &[u8],
    result_buf: &mut [u8],
) -> CapCallResult {
    let ret = sys_cap_call(cap_id, method_id, params, result_buf);
    match ret {
        ECAP_NOT_FOUND => CapCallResult::TransportError(TransportError::InvalidCapability),
        ECAP_BAD_BUFFER => CapCallResult::TransportError(TransportError::InvalidBuffer),
        ECAP_PARAMS_TOO_LARGE => CapCallResult::TransportError(TransportError::ParamsTooLarge),
        ECAP_APPLICATION_ERROR => {
            let (typ, msg) = deserialize_cap_exception(result_buf);
            CapCallResult::Exception(typ, msg)
        }
        ECAP_APPLICATION_ERROR_NO_BUFFER => {
            CapCallResult::Exception(ExceptionType::Failed, String::new())
        }
        n => CapCallResult::Ok(result_buf[..n as usize].to_vec()),
    }
}

Future: Batched Calls

When capOS adds batched capability invocations (async rings, pipelining), each request in the batch gets its own result status. The same two-level model applies per-request:

  • Transport error for the batch envelope (invalid ring descriptor, bad capability table) fails the whole batch.
  • Per-request transport errors (individual bad cap_id) fail that request.
  • Application errors are per-request, written to each request’s result slot.

This matches how NFS compound operations and JSON-RPC batch requests work: a transport error on the batch vs per-operation results.


What This Does NOT Cover

  • Error logging/tracing infrastructure. How errors get collected, aggregated, or displayed is a separate concern. The kernel currently prints to serial; a future ErrorLog capability could capture structured error streams.
  • Retry policy. The ExceptionType hints at retry strategy (overloaded -> retry, failed -> don’t), but the retry logic itself belongs in userspace libraries, not the kernel.
  • Error propagation across capability chains. When capability A calls capability B which calls capability C, and C fails – how does the error propagate back through A? This is a concern for the IPC and capability transfer stages (Stage 6+). The current proposal handles the single-hop case.
  • Transactional semantics. Whether a failed operation has side effects (partial writes, allocated-but-not-returned frames) is per-capability semantics, not a kernel-level concern.

Migration Path

Phase 1: Transport error codes (minimal, no schema changes)

Change cap_call to return distinct error codes instead of u64::MAX for all failures. Update the init crate to interpret them. No new schema types needed – application errors still use u64::MAX - 3 but without a structured payload (treated as opaque failure).

This migration is low-risk but not fully backward-compatible: existing userspace code that checks == u64::MAX will no longer catch the new error codes, so callers should move to a >= u64::MAX - 255 band check, which catches all errors old and new.
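The compatibility point can be demonstrated directly. Under the code assignments proposed earlier (u64::MAX - 3 for application errors is from the table; the function names here are illustrative), the legacy check misses the new codes while the band check catches them:

```rust
// Legacy check: only catches the old undifferentiated error value.
fn legacy_is_error(ret: u64) -> bool {
    ret == u64::MAX
}

// Phase-1 check: catches the whole reserved top-256 error band.
fn banded_is_error(ret: u64) -> bool {
    ret >= u64::MAX - 255
}
```

A legacy caller that receives u64::MAX - 3 would misread it as an enormous success byte count, which is exactly why the band check must land alongside the new codes.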

Phase 2: CapException serialization

Add ExceptionType and CapException to the schema. Implement serialize_cap_exception in the kernel. Update init to deserialize and display errors. Now userspace gets the exception type and message string.

Phase 3: Per-interface application errors

As interfaces mature, add typed error unions to method return types for expected error conditions. FrameAllocator::allocFrame returns AllocResult instead of bare UInt64. The exception mechanism remains for unexpected failures.


Design Rationale

Why mirror capnp RPC’s Exception type instead of inventing our own? Cap’n Proto already defines a well-thought-out exception taxonomy. The four types (failed, overloaded, disconnected, unimplemented) map directly to capnp::ErrorKind in Rust. Using the same vocabulary means capOS capabilities can eventually participate in capnp RPC networks without translation. It also means the Rust compiler enforces exhaustive matching on ErrorKind variants that matter.

Why not put error codes in the syscall return value only (like seL4)? seL4’s 11 error codes work because seL4 kernel objects are simple and fixed-function. capOS capabilities are arbitrary typed interfaces – a file system, a network stack, a GPU driver. The error vocabulary is open-ended. Encoding all possible errors as syscall return values would either require an ever-growing enum (fragile) or lose information (back to errno’s problems). The capnp-serialized CapException in the result buffer gives unbounded expressiveness without changing the syscall ABI.

Why not use capnp exceptions for everything (skip the transport error codes)? Because transport errors happen before the capability is reached. There’s no CapObject to serialize an exception. The kernel would have to synthesize a capnp message on behalf of a non-existent capability, which is wasteful and semantically wrong. A small integer return code is cheaper and more honest about what happened.

Why not define a generic Result(Ok) wrapper in the schema? Cap’n Proto generics only bind to pointer types (Text, Data, structs, lists, interfaces), not to primitives (UInt32, Bool). A Result(UInt64) for allocFrame wouldn’t work. Per-method result structs with unions are more flexible and don’t hit this limitation. The cost is a bit more schema boilerplate, which is acceptable given that capOS has a small number of interfaces.

Why string-based messages (like Plan 9) instead of structured error fields? String messages are adequate for diagnostics and logging. Structured error data belongs in the typed return unions (Phase 3), where the schema enforces what fields exist. Putting structured data in CapException would duplicate the schema’s job and encourage using exceptions for flow control, which Cap’n Proto explicitly warns against.

Security Review and Formal Verification Proposal

How to reason about the correctness and security of the capOS kernel and its trust boundaries in a way that fits a research OS – pragmatic tooling now, targeted verification where it pays off, no aspirational seL4-style full-kernel proofs. The docs/research/sel4.md survey already concluded that Isabelle/HOL-over-C verification does not transfer to Rust and that the design constraints matter more than the proof artefact. This proposal codifies that conclusion into a concrete tooling and process plan.

This proposal uses CWE for concrete vulnerability classes, CAPEC for attacker patterns, Rust language rules / unsafe-code guidance for low-level coding rules, Common Criteria protection-profile concepts for OS security functions, and capability-kernel practice (seL4/EROS-style invariants) for authority, IPC, object lifetime, and scheduler properties. Web-application checklists are not the baseline for OS design review.

Grounding sources:

  • MITRE CWE for root-cause weakness labels: CWE-20 explicitly covers raw data, metadata, sizes, indexes, offsets, syntax, type, consistency, and domain rules; CWE also marks broad classes such as CWE-20 and CWE-400 as discouraged for final vulnerability mapping when a more precise child fits.
  • MITRE CAPEC for attacker behavior, especially input manipulation (CAPEC-153), command injection (CAPEC-248), race exploitation (CAPEC-26 / CAPEC-29), and flooding/resource pressure (CAPEC-125).
  • Rust Reference and Rust 2024 Edition Guide for unsafe-block and unsafe_op_in_unsafe_fn obligations.
  • seL4 MCS and the existing capOS research notes for capability-authorized access to kernel objects and CPU time.
  • Common Criteria General Purpose Operating System Protection Profile for OS access-control, security-function, trusted-channel/path, and user-data protection concepts. capOS is not trying to certify against it; the PP is a vocabulary check for what an OS security review should not omit.

1. Philosophy and Scope

capOS is explicitly a research OS whose design principle is “schema-first typed capabilities, minimal kernel, reuse the Rust ecosystem.” Three consequences shape this proposal:

  1. The schema is part of the TCB. A bug in the .capnp schema, or in the way generated code is patched for no_std, is exactly as dangerous as a bug in the kernel. The schema, the capnpc build pipeline, and the generated code all need review attention – not only hand-written kernel code.
  2. The kernel should stay small. “Everything else is a capability” means the TCB is naturally bounded. Verification effort scales with TCB size, so resisting kernel bloat is itself a security property.
  3. The interface is the permission. Access control lives in capnp method definitions and in userspace cap wrappers (a narrow cap is a different CapObject), not in kernel rights bitmasks. Review must confirm that the kernel never short-circuits this: no ambient authority, no method that bypasses CapObject::call, no syscall that exposes an object without a capability handle.

Non-goals:

  • Full functional-correctness proof of the kernel à la seL4. Infeasible in Rust today, and the payoff is low for a research system whose surface area is still changing.
  • Proving information-flow / confidentiality properties end-to-end.
  • Certifying a specific configuration for external deployment.

2. Trust Boundaries and Threat Model

Enumerating the boundaries forces every future review to ask “which boundary does this change touch?” and picks out the code paths that matter.

Current boundaries

| Boundary | Who trusts whom | Code that enforces it |
|---|---|---|
| Ring 0 ↔ Ring 3 | kernel trusts nothing from user | kernel/src/mem/validate.rs, arch/x86_64/syscall.rs; exercised by init/ and demos/* |
| Kernel ↔ user pointer | kernel validates address + PTE perms | validate_user_buffer (single-threaded; TOCTOU-prone on SMP) |
| Manifest ↔ kernel | kernel parses capnp manifest at boot | capos-config::manifest, called from kmain |
| Build inputs ↔ TCB | kernel trusts schema/codegen/build artifacts | schema/capos.capnp, build.rs, Cargo.lock, Makefile |
| Host tools ↔ filesystem/process | tools must not let manifest/config input escape intended host boundaries | tools/mkmanifest, generators, CI scripts |
| ELF bytes ↔ kernel | kernel parses user ELF to map segments | capos-lib::elf |
| User ring ↔ kernel dispatch | kernel trusts no SQ state | kernel/src/cap/ring.rs |
| CapObject::call wire format | kernel trusts no params bytes | generated capnp decoders + impls |
| Process ↔ process IPC | kernel routes calls between mutually isolated address spaces and trusts neither side’s buffers | kernel/src/cap/endpoint.rs, kernel/src/cap/ring.rs, kernel/src/sched.rs |
| Device DMA ↔ physical memory (future) | kernel must constrain device memory access | PCI enumeration exists for the QEMU smoke path; virtio DMA submission is not implemented yet. See networking/cloud proposals. |

Attacker model

  • Untrusted service binaries. Today’s services are checked into the repo, but the manifest pipeline is meant to load arbitrary binaries eventually. Assume every byte of a service’s SQEs, params buffers, result buffer pointers, and return addresses is attacker-controlled.
  • Untrusted manifest. Once manifests are produced outside the repo (e.g. generated from CUE fragments, passed in as a Limine module), the manifest parser must reject every malformed input without panicking.
  • Resource exhaustion. Once multiple mutually-untrusting services run, a service can attack by filling rings, endpoint queues, capability tables, frame pools, scratch arenas, logs, or CPU time. Boundedness and accounting are security properties, not performance polish.
  • Build input drift. The schema/codegen path is already part of the TCB. External build inputs such as the bootloader checkout, Rust dependencies, capnp code generation, and generated-code patching must be reproducible enough that review can tell what changed.
  • Host tooling input. Build tools and generators run with developer/CI filesystem access. Treat manifest/config-derived paths and command arguments as untrusted until bounded to the intended directory and execution context.
  • Residual state and disclosure. Kernel logs, returned buffers, recycled frames, endpoint scratch space, and generated artifacts must not expose kernel pointers, stale bytes from another process, secrets, or build-system paths that increase attacker leverage.
  • Hostile interrupts / preemption. The scheduler preempts at arbitrary points. Any kernel invariant that is only transiently true must be held under the right lock or with interrupts disabled.
  • Out of scope (for now): physical attacks, speculative-execution side channels, malicious hardware, IOMMU bypass from DMA devices. These become in-scope once the driver stack lands; revisit the threat model then.

3. Tiered Approach

Four tiers, cheapest first. Each tier is independently useful, and later tiers assume earlier ones are in place.

Tier 1 – Hygiene and CI (cheap, high value)

These are the controls that make every other tier work. The initial GitHub Actions baseline exists in .github/workflows/ci.yml; it runs formatting, host tests, cargo build --features qemu, make capos-rt-check, and make generated-code-check. The QEMU smoke job installs its own boot tools and runs make plus make run, but remains non-blocking, so it is not yet a required boot assertion.

  • Continuous integration via GitHub Actions (or equivalent). Current baseline: make fmt-check, cargo test-config, cargo test-ring-loom, cargo test-lib, cargo test-mkmanifest, cargo build --features qemu, make capos-rt-check, and make generated-code-check. Remaining CI work: treat QEMU boot as a required CI gate once runtime flakiness is acceptable, then add the security policy jobs below.
  • cargo clippy --all-targets -- -D warnings across workspace members, with a curated set of clippy::pedantic / clippy::nursery lints that pay off for kernel code (clippy::undocumented_unsafe_blocks, clippy::missing_safety_doc, clippy::cast_possible_truncation, etc.). Do NOT enable all of pedantic blindly – review each lint and either enable it or add a rationale comment.
  • cargo-deny for license and advisory gating; cargo-audit for the RustSec advisory DB against Cargo.lock. Dependencies include capnp, spin, x86_64, limine, linked_list_allocator – all externally maintained.
  • cargo-geiger report of unsafe surface area per crate, checked in as a snapshot and diffed in CI so growth is visible in PRs.
  • Deny unsafe_op_in_unsafe_fn (already required by edition 2024; make sure it stays on) and missing_docs on public kernel items where it is not already the case.
  • Dependency review discipline: every new dep needs a one-line rationale in the commit message and a check that it is no_std-capable, maintained, and does not pull in a surprise async runtime or heavy transitive graph.
  • No-std dependency rubric: kernel/no_std additions require an explicit compatibility check that core/alloc paths do not regress to std through default feature drift, and class ownership is recorded against docs/trusted-build-inputs.md.
  • Boot/build input pinning: pin external bootloader/tool downloads to an auditable revision or checksum. Branch names are not enough for TCB inputs. CI should fail when generated capnp bindings or no-std patching change outside an intentional schema/codegen update.
  • Untrusted-path panic audit: panic!, assert!, .unwrap(), and .expect() are acceptable during bring-up, but every path reachable from manifest bytes, ELF bytes, SQEs, params buffers, result buffers, and future IPC messages needs either a fail-closed error or a documented halt policy.
  • Hardware protection smoke tests: boot under QEMU with SMEP/SMAP-capable CPU flags and assert CR4.SMEP/CR4.SMAP once paging is initialized. Every explicit user-memory dereference must be wrapped in a short STAC/CLAC window once SMAP is enabled.

Tier 2 – Targeted dynamic analysis

Aimed at the host-testable pure-logic crates (capos-lib, capos-config) where the Rust toolchain just works. No kernel changes required.

  • Miri on the cargo test-lib and cargo test-config suites. Catches UB in pure-logic code: invalid pointer arithmetic, uninitialized reads, bad provenance, unsound unsafe. The FrameBitmap and CapTable tests in particular push against slot indexing, generation counters, and raw &mut [u8] handling – exactly what miri is good at.
  • proptest (or quickcheck) on:
    • capos-lib::elf::parse – random bytes / random perturbations of a valid header must never panic and must refuse anything that isn’t a correctly formed user-half ELF64.
    • capos-lib::frame_bitmap – interleaved sequences of alloc, alloc_contiguous, free, mark_used preserve the invariant free_count == popcount(bitmap == 0) and never double-free.
    • capos-lib::cap_table – insert/remove/lookup sequences preserve “every returned id resolves to its insertion-time object, and stale ids are rejected.”
    • capos-config::manifest encode/decode round trip on arbitrary manifests.
  • cargo fuzz harnesses (libFuzzer) on the same three parsers: elf::parse, manifest::decode, and the ring SQE decoder that will land as part of IPC. These run outside CI (they never terminate) but should have a seed corpus checked into fuzz/corpus/ and be run for a fixed budget nightly or on-demand.
  • Sanitizers on host tests: RUSTFLAGS=-Zsanitizer=address (and thread) on cargo test-lib under nightly. Requires nothing more than a CI job; cheap to add.
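The frame-bitmap property above can be sketched as a deterministic check against a toy model (this is an illustrative stand-in, not the capos-lib implementation; proptest would drive the same invariant with random interleavings):

```rust
// Toy allocator model: the invariant under test is
// free_count == number of clear bits in the bitmap,
// and a double-free must be rejected without changing state.
struct Bitmap {
    bits: Vec<bool>, // true = allocated
    free_count: usize,
}

impl Bitmap {
    fn new(n: usize) -> Self {
        Bitmap { bits: vec![false; n], free_count: n }
    }

    fn alloc(&mut self) -> Option<usize> {
        let i = self.bits.iter().position(|b| !b)?;
        self.bits[i] = true;
        self.free_count -= 1;
        Some(i)
    }

    fn free(&mut self, i: usize) -> bool {
        if self.bits.get(i) != Some(&true) {
            return false; // out of range or double-free: rejected, state unchanged
        }
        self.bits[i] = false;
        self.free_count += 1;
        true
    }

    fn invariant_holds(&self) -> bool {
        self.free_count == self.bits.iter().filter(|b| !**b).count()
    }
}
```

A property test would replace the fixed call sequence with arbitrary interleavings of alloc/free and assert invariant_holds() after every step.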

Tier 3 – Concurrency model checking

The capability ring is a lock-free single-producer / single-consumer protocol using volatile reads, release/acquire fences, and a shared head/tail pair. It is the most likely source of subtle memory-ordering bugs and is also the most isolated – a perfect fit for model checking.

  • Loom on a host-buildable wrapper of the ring protocol. Extract the producer/consumer state machine from capos-config::ring into a form where atomics can be swapped for loom::sync::atomic, and write Loom tests that enumerate all interleavings of producer/consumer for small ring sizes (2–4 slots). Properties to check:
    • No CQE is lost.
    • No CQE is double-delivered.
    • The sq_head/sq_tail and cq_head/cq_tail pointers never observe a state that implies tail - head > SQ_ENTRIES.
    • The “corrupted producer state” fail-closed policy (REVIEW_FINDINGS.md “Userspace Ring Client”) holds under adversarial interleavings.
  • Shuttle as a lighter alternative for regression-style tests once the specific bugs are known; cheaper per run, randomised rather than exhaustive. Good for long-running overnight jobs.

Loom coverage here is disproportionately valuable: it substitutes for the SMP-hardness work the project has explicitly deferred, and it exercises exactly the ordering that TOCTOU-style bugs hide in.
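The boundedness property Loom would enumerate across interleavings can be stated in a single-threaded toy model first (no atomics; SQ_ENTRIES and the index discipline mirror the description above, the rest is an assumption):

```rust
// Toy model of the SQ index discipline: head/tail are monotonically
// increasing u32 counters with wrapping arithmetic, and the number of
// in-flight entries (tail - head) must never exceed SQ_ENTRIES.
const SQ_ENTRIES: u32 = 4;

struct Sq {
    head: u32,
    tail: u32,
}

impl Sq {
    fn new() -> Self {
        Sq { head: 0, tail: 0 }
    }

    fn submit(&mut self) -> bool {
        // Refuse when full: this is the guard that makes
        // `tail - head > SQ_ENTRIES` unobservable.
        if self.tail.wrapping_sub(self.head) >= SQ_ENTRIES {
            return false;
        }
        self.tail = self.tail.wrapping_add(1);
        true
    }

    fn consume(&mut self) -> bool {
        if self.tail == self.head {
            return false; // empty
        }
        self.head = self.head.wrapping_add(1);
        true
    }

    fn in_flight(&self) -> u32 {
        self.tail.wrapping_sub(self.head)
    }
}
```

A Loom test would run submit on one model thread and consume on another (with the counters as loom atomics) and assert in_flight() <= SQ_ENTRIES in every interleaving; the single-threaded model only pins down what the property means.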

Tier 4 – Bounded verification of specific invariants

Not a full-kernel proof. Targeted, property-specific, one-module-at-a-time.

  • Kani (bounded model checking for Rust, via CBMC). Good fit for small, heap-free, arithmetic-heavy functions. Candidate modules:
    • capos-lib::cap_table – prove that for all insert; remove; insert' sequences under a u8 generation counter, a stale CapId never resolves. Bound: table size ≤ 4, generation window ≤ 256.
    • capos-lib::frame_bitmap – prove that for all bitmap sizes up to N bytes, alloc_frame followed by free_frame of the same frame restores the original bitmap and free_count.
    • capos-lib::elf::parse bounds checks: prove that every index into the program header table is < len, given the validated phentsize and phnum.
  • Verus (SMT-based Rust verifier, active development at MSR) for invariants that Kani can’t handle ergonomically, particularly those involving loops and ghost state. Worth tracking but don’t commit to it yet – the proof-engineering cost is real, and the tool is still young. Revisit once IPC lands and the kernel has stable public APIs.
  • Creusot / Prusti are alternatives in the same space. Do not invest in more than one SMT-based verifier; pick whichever has the best story for no_std + alloc code when Tier 4 starts.

Deliberately out of scope: Isabelle/HOL, Coq proofs, Frama-C. They would require re-encoding Rust in a foreign semantic framework with no established Rust front-end mature enough for kernel code.

4. Security Review Process

REVIEW.md is the rules document and REVIEW_FINDINGS.md is the open-issue log. REVIEW.md contains the common security checklist that applies across kernel, userspace, host tooling, generators, and CI. The per-boundary prompts below are an expansion of that common checklist for OS-specific code paths.

CWE/CAPEC tagging policy

Security findings should carry CWE metadata when the mapping is specific enough to help a reviewer or future audit. Do not force a CWE into every title.

  • Prefer Base/Variant CWE IDs when the root cause is known: CWE-770 for unbounded allocation, CWE-88 for argument injection, CWE-367 for a concrete validation-to-use race, CWE-416 for a real use-after-free.
  • Use Class IDs as temporary or umbrella labels: CWE-20 for “input was not validated enough” before the missing property is known; CWE-400 for general resource exhaustion only when the enabling mistake is not more precise.
  • Use capability-kernel invariants instead of weak CWE mappings for design properties such as “no ambient authority”, “cap transfer happens exactly once”, “revocation cannot leave stale authority”, and “scheduling context donation cannot fabricate CPU authority”. Cite CWE-862/CWE-863 only when the issue is actually a missing or incorrect authorization check.
  • Use CAPEC for the attacker pattern when useful: input manipulation, command injection, race exploitation, flooding, or path/file manipulation. CAPEC is not a substitute for the CWE root-cause tag.

Current checklist coverage:

| Area | Primary tags | Review intent |
|---|---|---|
| Structured input validation | CWE-20, CWE-1284–CWE-1289 when precise | Validate syntax, type, range, length, indexes, offsets, and cross-field consistency before privileged use |
| Filesystem paths | CWE-22, CWE-23, CWE-59 | Keep host-tool paths inside intended roots across absolute paths, traversal, symlinks, and file-type confusion |
| Commands/processes | CWE-78, CWE-88 | Avoid shell interpolation; constrain binaries and arguments |
| Numeric/buffer bounds | CWE-190, CWE-125, CWE-787 | Check arithmetic before pointer, slice, copy, ELF segment, and page-table use |
| Resource exhaustion | CWE-770 preferred; CWE-400 broad | Bound queues, allocations, retries, spin loops, frames, scratch arenas, cap slots, and CPU budget |
| Exceptional paths | CWE-703, CWE-754, CWE-755; CWE-248 only for uncaught exceptions | Fail closed on malformed or adversarial input; avoid trust-boundary panic/abort |
| Authorization/cap authority | CWE-862, CWE-863 plus capOS invariants | Verify capability ownership, generation, object identity, address-space ownership, and transfer policy |
| Concurrency/TOCTOU | CWE-362, CWE-367, CWE-667 | Preserve lock ordering, interrupt masking, page-table stability, and validation-to-use assumptions |
| Lifetime/reuse | CWE-416, CWE-664, CWE-672 | Prevent stale caps, stale kernel stacks, stale frames, and expired IPC state from being used |
| Disclosure/residual data | CWE-200, CWE-226 | Prevent logs, result buffers, frames, scratch arenas, and generated artifacts from leaking stale or sensitive data |
| Supply chain / generated TCB | capOS TCB invariant; use CWE only for concrete bug | Pin or review-visible drift for bootloader, dependencies, schema/codegen, generated code, and patching |

Per-boundary review checklist

  • Syscall surface change (arch/x86_64/syscall.rs):
    • Every register-passed argument is treated as attacker-controlled.
    • No user pointer is dereferenced without validate_user_buffer.
    • Numeric conversions, copy lengths, and pointer arithmetic are checked before constructing slices or entering a UserAccessGuard scope.
    • Kernel stack pointer and TSS.RSP0 invariants are preserved.
    • The syscall count stays bounded; a new syscall has an SQE-opcode alternative considered and explicitly rejected with rationale.
  • Ring dispatch change (kernel/src/cap/ring.rs):
    • SQ bounds check and per-dispatch SQE limit still enforced.
    • Corrupted SQ state fails closed (never re-processes the same bad state on the next tick).
    • No allocation in the interrupt-driven path beyond what is already documented in REVIEW_FINDINGS.md.
    • Result buffers and endpoint scratch buffers cannot leak stale bytes beyond the returned completion length.
  • User buffer validation change (kernel/src/mem/validate.rs):
    • Address range check precedes PTE walk.
    • PTE flags checked: present, user, and write (if the buffer is written).
    • Single-CPU assumption explicit; TOCTOU note retained until SMP lands.
  • ELF loader change (capos-lib::elf):
    • Every field bounded before use (phentsize, phnum, p_offset, p_filesz, p_memsz, p_vaddr).
    • Segments confined to the user half.
    • Overlap check preserved.
    • Integer arithmetic uses checked add/subtract before deriving mapped addresses, file slices, or zero-fill ranges.
  • Manifest change (capos-config::manifest):
    • Every optional field is either present or the service is rejected.
    • Name / binary / cap source strings are length-bounded.
    • Unknown or unsupported numbers in CUE input fail closed with a path-specific error.
    • Capability grants are checked as an authority graph before any rejected graph can start a service.
  • Schema change (schema/capos.capnp):
    • Backward-compatible with existing wire format, or migration documented.
    • Every new method has an explicit capability-granting story (who mints the cap that lets this method be called?).
    • Generated code no_std patching still applies.
  • Host tool or generator change (tools/*, build.rs, CI scripts):
    • Manifest/config-derived paths cannot escape intended directories through absolute paths, traversal, symlinks, or file-type confusion.
    • External command execution uses explicit binaries and argument APIs, not shell interpolation of untrusted strings.
    • Generated outputs are review-visible and fail closed on malformed inputs.
    • Generated files and diagnostics do not disclose secrets, absolute paths, or stale build outputs beyond what the developer intentionally requested.
  • Unsafe block added or expanded: Tier 1 clippy lints plus REVIEW.md §“Unsafe Usage” checklist already cover this; the review should cite the specific invariant being maintained in the commit message.
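The checked-arithmetic requirement in the ELF loader checklist can be sketched as follows. This is a minimal illustration, not the real capos-lib::elf code; `segment_file_range` and its parameters are hypothetical names chosen to mirror the checklist items.

```rust
/// Hypothetical helper showing the pattern the ELF checklist asks for:
/// derive a segment's file range only through checked arithmetic,
/// rejecting both integer overflow and ranges that escape the file.
fn segment_file_range(p_offset: u64, p_filesz: u64, file_len: u64) -> Option<(u64, u64)> {
    // checked_add: overflow yields None instead of silently wrapping
    let end = p_offset.checked_add(p_filesz)?;
    if end > file_len {
        return None; // range escapes the file: reject before constructing a slice
    }
    Some((p_offset, end))
}
```

The same shape applies to `p_vaddr + p_memsz` before mapping and to zero-fill ranges: validate, then derive, never the reverse.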

Threat-model refresh

On every stage completion (Stage 6 IPC, Stage 7 SMP, first driver landing, first time a manifest comes from outside the repo), re-run §2 of this document and update it. The list of trust boundaries grows over time; the proposal decays if it doesn’t grow with the code.

Periodic full audit

Once per stage, schedule a focused audit pass:

  1. Re-verify every boundary’s code is still enforced at its documented entry point (no new bypass path).
  2. Re-run all Tier 2/3 jobs with the latest toolchain (catches tool-upgrade regressions).
  3. Walk through open items in REVIEW_FINDINGS.md and confirm each is still correctly classified (still open, fixed, or explicitly accepted).
  4. Record the audit date + outcome at the top of REVIEW_FINDINGS.md, matching the existing “Last verification” convention.

5. Concrete Verification Targets

Ordered by value and feasibility. Each one is a specific, bounded piece of work a contributor can pick up without needing to redesign the kernel.

| # | Target | Tier | Property | Blocker |
|---|---|---|---|---|
| 1 | capos-lib::cap_table | 4 (Kani) | Stale CapId never resolves after slot reuse within the generation window | None |
| 2 | capos-lib::frame_bitmap | 4 (Kani) | alloc/free preserve free_count invariant; no double-alloc | None |
| 3 | capos-lib::elf::parse | 2 (proptest + fuzz) | No panic on arbitrary input; only well-formed user-half ELF64 accepted | None |
| 4 | capos-config::manifest | 2 (proptest + fuzz) | Decode/encode round-trip; malformed input rejected without panic | None |
| 5 | Ring SPSC protocol | 3 (Loom) | No lost/doubled CQEs; fail-closed on corruption under all interleavings | Extract protocol into Loom-testable wrapper |
| 6 | validate_user_buffer | 4 (Kani) | Every accepted buffer lies entirely in user half with correct PTE flags | Formalise PTE model |
| 7 | Ring dispatch path | 3 (Loom + proptest) | SQE poll is bounded per tick; no allocation on the dispatch path | Initial alloc-free synchronous path landed; async transfer/release paths still need coverage |
| 8 | IPC routing | 3 | Capabilities transferred exactly once; no duplication under direct-switch | Capability transfer |
| 9 | Direct-switch IPC handoff | 2 + 3 | Scheduler invariants preserved when a blocked receiver bypasses normal run-queue order | Loom-testable scheduler/ring model |
| 10 | SMEP/SMAP + user access windows | 1 + QEMU integration | Kernel cannot execute user pages; kernel user-memory touches happen only inside audited access windows | Wire existing x86_64 helper into init path |
| 11 | Manifest authority graph | 2 (property tests) | Every granted cap source resolves, every export is unique, and no service starts after a rejected graph | Manifest executor path |
| 12 | Resource accounting | 2 + 3 | Rings, endpoints, cap tables, scratch arenas, frames, and CPU budget fail closed under exhaustion | S.9 design complete; implementation hooks pending |
| 13 | Build/codegen TCB | 1 | Bootloader/deps/codegen inputs are pinned and generated output changes are review-visible | CI bootstrap |
| 14 | Device DMA boundary (future) | 1 + design review | No driver or device can DMA outside explicitly granted buffers | PCI/device work; IOMMU or bounce-buffer decision |

Targets 1–4 are feasible today and should be the first batch of work. Target 10 is the security gate before treating Stage 6 services as untrusted. Targets 11–12 should be designed before capability transfer lands, otherwise the first IPC implementation will bake in ambient resource authority. Target 14 gates user-mode or semi-trusted drivers.
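Target 1's property can be stated with a toy model. This is a deliberately simplified stand-in for capos-lib::cap_table, with illustrative names only: after a slot is freed and reused, the old CapId carries a stale generation and must fail to resolve.

```rust
/// Toy cap table: each slot carries a generation counter that is bumped
/// on every reuse, so handles minted before the reuse go stale.
#[derive(Clone, Copy, PartialEq, Debug)]
struct CapId {
    slot: usize,
    generation: u32,
}

struct CapTable {
    generations: Vec<u32>,
    live: Vec<bool>,
}

impl CapTable {
    fn new(slots: usize) -> Self {
        CapTable { generations: vec![0; slots], live: vec![false; slots] }
    }

    /// Occupy a slot, bumping its generation so prior CapIds stop matching.
    fn insert(&mut self, slot: usize) -> CapId {
        self.generations[slot] += 1;
        self.live[slot] = true;
        CapId { slot, generation: self.generations[slot] }
    }

    fn remove(&mut self, id: CapId) {
        if self.resolve(id) {
            self.live[id.slot] = false;
        }
    }

    /// A CapId resolves only if the slot is live AND the generation matches.
    fn resolve(&self, id: CapId) -> bool {
        self.live[id.slot] && self.generations[id.slot] == id.generation
    }
}
```

The Kani harness proves the analogous invariant over symbolic slot indexes and bounded reuse sequences; the toy above only demonstrates the shape of the property.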

Current status as of 2026-04-21:

  • Targets 1–2 have initial Kani coverage in capos-lib.
  • Target 3 has arbitrary-input proptest coverage and a cargo-fuzz target for ELF bytes. The current Kani harness still only proves the short-input early-reject path because fully symbolic ELF parsing reaches allocator and sort internals before there is a sharper proof obligation.
  • Target 4 has cargo-fuzz coverage for manifest decoding/roundtrip and mkmanifest exported-JSON conversion.
  • Target 5 has a feature-gated Loom model for the shared ring protocol.
  • Target 13 has an initial CI baseline, generated-code drift checking, and dependency audit/deny gates; a required QEMU boot gate is still open.

6. Phased Plan

This slots into WORKPLAN.md as a cross-cutting track rather than a phase – items are independent of Stage 6 IPC and can proceed in parallel.

  • Track S.1 – CI bootstrap – landed 2026-04-21
    • .github/workflows/ci.yml: fmt-check, test-config, test-ring-loom, test-lib, test-mkmanifest, cargo build --features qemu, make capos-rt-check, generated-code drift checking, and dependency policy checking.
    • QEMU smoke installs build-essential, capnproto, qemu-system-x86, xorriso, and cue v0.16.0 before running make and make run; it remains optional/non-blocking until boot runtime is stable enough to make it a required gate.
    • Clippy-with-deny and cargo-geiger remain future hardening jobs.
  • Track S.2 – Miri + proptest on capos-lib – landed 2026-04-21
    • Add proptest dev-dependency to capos-lib.
    • Host properties for capos-lib::cap_table and capos-lib::frame_bitmap; ELF arbitrary-input coverage remains open under S.13.
    • cargo test-lib runs the native host suite; cargo miri-lib runs the same crate under Miri.
  • Track S.3 – Manifest + mkmanifest fuzzing – landed 2026-04-21
    • fuzz/ crate with harnesses for manifest::decode and tools/mkmanifest CUE → capnp pipeline. Seed corpus checked in.
  • Track S.4 – Ring Loom harness – landed 2026-04-21
    • Extract the SPSC protocol from capos-config::ring into a test-only wrapper where atomics are swappable.
    • Loom tests covering corruption, overflow, and ordering.
    • Doubles as regression coverage for Phase 1.5 in WORKPLAN.md.
  • Track S.5 – Kani on capos-lib – initial harnesses landed 2026-04-21
    • CapTable generation/index/stale-reference invariants.
    • FrameBitmap alloc/free symmetry over a small symbolic bitmap model plus a concrete bounded contiguous-allocation proof.
    • ELF parser short-input early-reject panic-freedom.
    • The current bounds are intentionally conservative so make kani-lib remains a practical gate; broader symbolic ELF and contiguous-allocation proofs should wait for more specific invariants.
  • Track S.6 – Security review docs stay aligned
    • Keep REVIEW.md’s common security checklist aligned with §4’s boundary prompts as new boundaries land.
    • Add a “threat model refresh” step to the stage-completion workflow in CLAUDE.md.
  • Track S.7 – Stage-6-aware refresh
    • Re-run §2 trust-boundary inventory after capability transfer/release semantics land.
    • Plan Loom coverage for cross-process routing and direct-switch IPC.
  • Track S.8 – Untrusted-service hardening gate
    • Wire SMEP/SMAP enablement into x86_64 init after paging is live.
    • Replace raw user-slice construction in syscall/ring paths with checked copy/access helpers that bracket the actual access with STAC/CLAC.
    • Add QEMU hostile-userspace tests for bad pointers, kernel-half pointers, invalid caps, corrupted rings, and services without Console authority.
    • Audit untrusted-input paths for panics before Stage 6 endpoints run mutually-untrusting processes.
  • Track S.9 – Authority graph and resource accounting – landed 2026-04-21
    • Concrete design is captured in docs/authority-accounting-transfer-design.md.
    • Defines authority graph invariants, per-process quota ledger (cap slots, endpoint queue, outstanding calls, scratch, frame grants, log volume, CPU budget), diagnostic aggregation, and exactly-once transfer/rollback semantics.
    • Establishes acceptance criteria that gate WORKPLAN 3.6 capability transfer and 5.2 ProcessSpawner implementation.
  • Track S.10 – Supply-chain and generated-code TCB
    • Pin Limine and other external build inputs by revision/checksum rather than branch name.
    • Make capnp generated-code changes review-visible in CI, including the no-std patching step.
    • Consider cargo-vet only after cargo-deny/cargo-audit are in place; vetting too early is process theater.
    • S.10.3 adds a concrete dependency policy: no_std additions are accepted only with class attribution, cargo deny + cargo audit, and explicit lockfile intent.
    • S.10.3 enforcement is make dependency-policy-check, backed by deny.toml and pinned CI installs of cargo-deny 0.19.4 and cargo-audit 0.22.1.
  • Track S.11 – Device/DMA isolation gate
    • Before PCI/virtio/NVMe/user drivers land, choose the DMA isolation story: IOMMU-backed DMA domains or kernel-owned bounce buffers.
    • Define DMAPool, DeviceMmio, and Interrupt capability invariants: bounded physical ranges, explicit interrupt ownership, device reset on revoke, and no raw physical-address grants to untrusted drivers.
    • S.11.2 requires a concrete ownership-transition gate before userspace NIC/block drivers are allowed.
  • Track S.12 – Kani harness bounds refresh
    • Revisit Kani bounds and harness shape once capability transfer, resource-accounting, or validate_user_buffer exposes concrete proof obligations.
    • Prefer actionably narrow properties over arbitrary symbolic parser exploration that spends verifier time in allocator or sort internals.
  • Track S.13 – ELF parser arbitrary-input coverage
    • Add proptest coverage for capos-lib::elf::parse on arbitrary bytes and valid-header perturbations.
    • Add a cargo fuzz target for ELF bytes once the corpus and runtime budget are defined.

Tracks S.1 through S.5 have initial coverage. S.6 is ongoing doc hygiene and should move with review-process changes. S.8 must land before Stage 6 runs mutually-untrusting services. S.9 design is complete and now gates concrete implementation work in 3.6/5.2. S.11 gates device-driver work. S.12 should not expand bounds for their own sake; it is a refresh point when new kernel invariants make better proof targets available. S.13 closes the remaining target-3 gap from the table above.
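The fail-closed rule behind Track S.4 and Target 5 can be illustrated with a toy consumer-side index protocol. This is not the actual kernel/src/cap/ring.rs code; it is a sketch, using plain std atomics instead of Loom's, of the one property the Loom model checks: once head/tail are inconsistent, the ring is poisoned and the bad state is never re-processed on a later tick.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

const CAP: u32 = 8; // power-of-two capacity, as in typical SQ/CQ rings

struct Ring {
    head: AtomicU32,
    tail: AtomicU32,
    poisoned: AtomicBool,
}

impl Ring {
    fn pop_index(&self) -> Result<Option<u32>, ()> {
        if self.poisoned.load(Ordering::Relaxed) {
            return Err(()); // previously detected corruption: stay closed
        }
        let head = self.head.load(Ordering::Acquire);
        let tail = self.tail.load(Ordering::Acquire);
        // A correct producer can never run more than CAP entries ahead, so
        // a larger distance is corruption, not pending work. Poison and bail.
        if tail.wrapping_sub(head) > CAP {
            self.poisoned.store(true, Ordering::Relaxed);
            return Err(());
        }
        if head == tail {
            return Ok(None); // ring empty
        }
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Ok(Some(head % CAP))
    }
}
```

Swapping the std atomics for Loom's `loom::sync::atomic` types (the extraction step Track S.4 describes) lets Loom explore every interleaving of producer and consumer against this check.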

7. What This Proposal Does Not Promise

  • No claim that capOS will be “secure” at the end. It will be harder to write a silently wrong change to the code paths the tooling covers, and it will be easier to find the ones that are still wrong.
  • No proof obligation on every PR. Kani and Loom are expensive to run on every push; CI runs them on a reduced schedule (e.g. nightly, or on PRs that touch the covered crates).
  • Userspace and host-tool bugs are in scope, but their impact is classified by boundary. A userspace bug should not compromise kernel isolation; a host-tool bug can still compromise the build TCB or developer/CI filesystem.
  • No claim that confidentiality is handled beyond architectural isolation. Timing channels, cache side channels, device side channels, and covert channels through shared services remain explicit research topics, not current implementation goals.

8. Relation to Other Docs

  • docs/research/sel4.md §1 and §6.1 already make the case that full verification is not the right goal. This proposal is the operational answer.
  • REVIEW.md is the reviewer’s rulebook. This proposal explains the security and verification rationale behind its common checklist and per-boundary prompts.
  • REVIEW_FINDINGS.md is the open-issue log. This proposal feeds it – every bug found by Tier 2/3/4 tooling lands there unless fixed in the same change.
  • ROADMAP.md owns the stages; this proposal does not add stages, only a cross-cutting track that runs alongside them.
  • WORKPLAN.md owns concrete ordering; Track S.1–S.11 above is the actionable slice mirrored there.

Proposal: Capability-Based Binaries, Language Support, and POSIX Compatibility

How userspace binaries receive, use, and compose capabilities — from the native Rust runtime through POSIX compatibility to running unmodified software.

Current State

The init binary (init/src/main.rs) is no_std Rust with a static heap allocator and still reaches the raw bootstrap syscalls through shared demo support. It validates its CapSet, invokes the Console capability over the capability ring, and exits. The former ring and IPC smokes now live as separate release-built binaries in the nested demos/ workspace. The kernel creates multiple processes from the boot manifest and schedules them with round-robin preemption. Init is not yet based on a reusable runtime crate — demos/capos-demo-support is only a test-support shim, not capos-rt.

The kernel-side roadmap (Stages 4-6) provides the capability ring (SQ/CQ shared memory + cap_enter syscall, implemented), scheduling, and IPC. This proposal covers the userspace half: what binaries look like, how they’re built, and how existing software runs on a system with no ambient authority.

Part 1: Native Userspace Runtime (capos-rt)

The Problem

Every userspace binary currently needs to:

  • Define _start and a panic handler
  • Set up an allocator
  • Construct raw syscall wrappers
  • Manually serialize/deserialize capnp messages
  • Know the syscall ABI (register layout, method IDs)

This is fine for one proof-of-concept binary. It won’t scale to dozens of services.

Solution: A Userspace Runtime Crate

capos-rt is a no_std + alloc Rust crate that every native capOS binary depends on. It provides:

1. Entry point and allocator setup.

// capos-rt provides the real _start that:
// - initializes the heap allocator (bump allocator over a fixed region,
//   or grows via FrameAllocator cap if granted)
// - parses the initial capability set from a kernel-provided page
// - calls the user's main(CapSet)
// - calls sys_exit with the return value

#[capos_rt::main]
fn main(caps: CapSet) -> Result<(), Error> {
    let console = caps.get::<Console>("console")?;
    console.write_line("Hello from capOS")?;
    Ok(())
}

2. Syscall layer. Raw syscall asm wrapped in safe Rust functions. The entire syscall surface is 2 calls – new operations are SQE opcodes, not new syscalls:

  • sys_exit(code) – terminate process (syscall 1)
  • sys_cap_enter(min_complete, timeout_ns) – flush pending SQEs, then wait until N completions are available or the timeout expires (syscall 2)

Capability invocations go through the per-process SQ/CQ ring. capos-rt provides helpers for writing SQEs and reading CQEs:

/// Submit a CALL SQE to the capability ring and wait for the CQE.
pub fn cap_call(
    ring: &mut CapRing,
    cap_id: u32,
    method_id: u16,
    params: &[u8],
    result_buf: &mut [u8],
) -> Result<usize, CapError> {
    ring.push_call_sqe(cap_id, method_id, params);
    sys_cap_enter(1, u64::MAX);
    ring.pop_cqe(result_buf)
}

3. Cap’n Proto integration. Re-exports generated types from schema/capos.capnp and provides typed wrappers:

// Generated from schema + thin wrapper in capos-rt
impl Console {
    pub fn write(&self, data: &[u8]) -> Result<(), CapError> {
        let mut msg = capnp::message::Builder::new_default();
        let mut req = msg.init_root::<console::write_params::Builder>();
        req.set_data(data);
        self.invoke(0, &msg)  // method @0
    }

    pub fn write_line(&self, text: &str) -> Result<(), CapError> {
        let mut msg = capnp::message::Builder::new_default();
        let mut req = msg.init_root::<console::write_line_params::Builder>();
        req.set_text(text);
        self.invoke(1, &msg)  // method @1
    }
}

4. CapSet – the initial capability environment.

At spawn time, the kernel writes the process’s initial capabilities into a well-known page (or passes them via registers/stack – ABI TBD). capos-rt parses this into a CapSet: a name-to-CapId map.

pub struct CapSet {
    caps: BTreeMap<String, CapEntry>,
}

struct CapEntry {
    id: u32,            // authority-bearing slot in the process CapTable
    interface_id: u64,  // generated capnp TYPE_ID, carried for type checking
}

impl CapSet {
    /// Get a typed capability by name. Fails if not present or wrong type.
    pub fn get<T: Capability>(&self, name: &str) -> Result<T, CapError> { ... }

    /// List available capability names (for debugging/discovery).
    pub fn list(&self) -> impl Iterator<Item = (&str, u64)> { ... }
}

interface_id is not a handle. It is metadata carrying the generated Cap’n Proto TYPE_ID for the interface expected by the typed client. The handle is id (CapId). A typed client constructor must check that entry.interface_id == T::TYPE_ID, then store the CapId. Normal CALL SQEs do not need to repeat the interface ID because each capability table entry exposes one public interface. The ring SQE keeps fixed-size reserved padding for ABI stability, not a required interface field for the system transport.

This matters for the system transport because several capabilities can expose the same interface while representing different authority: a serial console, a log-buffer console, and a console proxy all have the Console TYPE_ID, but different CapId values.
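The TYPE_ID check that a typed lookup must perform can be sketched like this. All names are placeholders standing in for the generated capnp code and the real capos-rt API; the constant is invented for illustration.

```rust
/// Placeholder trait for the generated-code integration: each typed client
/// carries the capnp TYPE_ID of the interface it expects.
trait Capability {
    const TYPE_ID: u64;
    fn from_cap_id(cap_id: u32) -> Self;
}

struct Console {
    cap_id: u32, // the authority-bearing handle; the TYPE_ID is only metadata
}

impl Capability for Console {
    const TYPE_ID: u64 = 0xc0de_0000_0000_0001; // stand-in for the generated ID
    fn from_cap_id(cap_id: u32) -> Self {
        Console { cap_id }
    }
}

/// Core of a typed lookup: reject on interface mismatch, then wrap the CapId.
/// Several entries may legitimately share a TYPE_ID with different CapIds.
fn typed_get<T: Capability>(id: u32, interface_id: u64) -> Result<T, &'static str> {
    if interface_id != T::TYPE_ID {
        return Err("wrong interface for requested capability type");
    }
    Ok(T::from_cap_id(id))
}
```

Note that the check is purely client-side type hygiene: invoking a method on the wrong interface would fail at dispatch anyway, but failing early at `get` gives a clearer error.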

Crate Structure

capos-rt/
  Cargo.toml          # no_std + alloc, depends on capnp
  build.rs            # capnpc codegen from schema/capos.capnp
  src/
    lib.rs            # re-exports, #[capos_rt::main] macro
    syscall.rs        # raw asm syscall wrappers
    caps.rs           # CapSet, CapEntry, Capability trait
    alloc.rs          # userspace heap allocator setup
    generated.rs      # include!(capnp generated code)

capos-rt is NOT a workspace member (same as init/ – needs different code model and linker script). It’s a path dependency for userspace crates.

Init After capos-rt

// init/src/main.rs -- after capos-rt exists
use capos_rt::prelude::*;

#[capos_rt::main]
fn main(caps: CapSet) -> Result<(), Error> {
    let console = caps.get::<Console>("console")?;
    let spawner = caps.get::<ProcessSpawner>("spawner")?;
    let manifest = caps.get::<Manifest>("manifest")?;

    console.write_line("capOS init starting")?;

    let mut running_services = BTreeMap::new();
    for entry in manifest.services()? {
        let binary_name = entry.binary();
        let granted = resolve_caps(&entry, &running_services, &caps)?;
        let handle = spawner.spawn(entry.name(), binary_name, &granted)?;
        running_services.insert(entry.name(), handle);
    }

    supervisor_loop(&running_services, &spawner)
}

Part 2: Capability-Based Binary Model

Binary Format

ELF64, same as now. The kernel’s ELF loader (kernel/src/elf.rs) already handles PT_LOAD segments. No changes to the binary format itself.

What changes is the ABI contract between kernel and binary:

| Aspect | Current (Stage 3) | After capos-rt |
|---|---|---|
| Entry point | _start(), no args | _start(cap_page: *const u8) or via well-known address |
| Syscall ABI | ad-hoc (rax=0 write, rax=1 exit) | SQ/CQ ring + sys_cap_enter + sys_exit |
| Capability access | None | CapSet parsed from kernel-provided page |
| Serialization | None | capnp messages |
| Allocator | None (no heap) | Bump allocator, optionally backed by FrameAllocator cap |

Initial Capability Passing

The kernel needs to communicate the initial cap set to the new process. Options:

Option A: Well-known page. Kernel maps a read-only page at a fixed virtual address (e.g., 0x1000) containing a capnp-serialized InitialCaps message:

struct InitialCaps {
    entries @0 :List(InitialCapEntry);
}

struct InitialCapEntry {
    name @0 :Text;
    id @1 :UInt32;
    interfaceId @2 :UInt64;
}

Option B: Register convention. Pass pointer and length in rdi/rsi at entry. Simpler, but the data still needs to live somewhere in user memory.

Option C: Stack. Push the cap descriptor onto the user stack before iretq. Similar to how Linux passes auxv to _start.

Option A is cleanest – the page is always there, no calling-convention dependency, and it naturally extends to passing additional boot info later.

Service Binary Lifecycle

1. Kernel loads ELF, creates address space, populates cap table
2. Kernel maps InitialCaps page at well-known address
3. Kernel enters userspace at _start

4. capos-rt _start:
   a. Initialize heap allocator
   b. Parse InitialCaps page into CapSet
   c. Call user's main(CapSet)

5. User main:
   a. Extract needed caps from CapSet
   b. Do work (invoke caps, serve requests)
   c. Optionally export caps to parent once ProcessHandle export lookup exists

6. On return from main (or sys_exit):
   a. Kernel destroys process
   b. All caps in process's cap table are dropped
   c. Parent's ProcessHandle receives exit notification

Part 3: Language Support Roadmap

Tier 1: Rust (native, now)

Rust is the only language that matters until the runtime is stable. Reasons:

  • no_std + alloc works today with the existing kernel
  • capnp crate (v0.25) has no_std support with codegen
  • Zero runtime overhead – no GC, no dynamic linker, no libc
  • Same language as the kernel, shared understanding of the memory model
  • Ownership model maps naturally to capability lifecycle

All system services (drivers, network stack, store) will be Rust.

Tier 2: C (via libcapos.h, after Stage 6)

C is the second target because most existing driver code and system software is C, and the FFI boundary with Rust is trivial.

libcapos is a static library providing:

#include <capos.h>

// Ring-based capability invocation (synchronous wrapper around SQ/CQ ring)
int cap_call(cap_ring_t *ring, uint32_t cap_id, uint16_t method_id,
             const void *params, size_t params_len,
             void *result, size_t result_len);

// Typed wrappers (generated from .capnp schema)
int console_write(cap_t console, const void *data, size_t len);
int console_write_line(cap_t console, const char *text);

// CapSet access
cap_t capset_get(const char *name);
uint64_t capset_interface_id(const char *name);

// Syscalls (the entire syscall surface -- 2 calls total)
_Noreturn void sys_exit(int code);                   // terminate
uint32_t sys_cap_enter(uint32_t min_complete,        // flush SQEs + wait
                       uint64_t timeout_ns);

Implementation: libcapos is Rust compiled to a static .a with a C ABI (#[no_mangle] extern "C"). The capnp message construction happens in Rust behind the C API. This avoids requiring a C capnp implementation.

C binaries link against libcapos.a and use the same linker script as Rust userspace binaries. The entry point and allocator setup are in libcapos.
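The "Rust behind a C ABI" pattern looks roughly like this. It is a sketch only: the body stands in for the real capnp message construction and ring transport, and the null/UTF-8 rejection is the illustrative point.

```rust
use core::ffi::{c_char, CStr};

/// Sketch of a libcapos export: Rust compiled into a static .a exposing a
/// C ABI. The signature mirrors the header above; the body is a stand-in
/// for building the capnp message and pushing a CALL SQE for `console`.
#[no_mangle]
pub extern "C" fn console_write_line(console: u32, text: *const c_char) -> i32 {
    if text.is_null() {
        return -1; // fail closed on null input from C callers
    }
    // SAFETY: the C caller promises a NUL-terminated string, as usual in C.
    let s = unsafe { CStr::from_ptr(text) };
    if s.to_str().is_err() {
        return -1; // reject non-UTF-8 rather than feed it to a capnp Text field
    }
    let _ = console; // real code: push CALL SQE, sys_cap_enter(1, ...), pop CQE
    0
}
```

Because the unsafe pointer handling lives on the Rust side of the boundary, every C-facing entry point becomes a natural audit point for the per-boundary checklist.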

Tier 3: Regular Rust Runtime Support

After the native capos-rt service model is stable, the next language priority is making capOS build and run ordinary Rust programs as far as the capability model permits. The target is not an ambient POSIX clone; it is a Rust runtime path where common crates can use allocation, time, threading where available, and capability-backed I/O through capOS-native shims.

This has higher priority than C++ and should be evaluated before broad POSIX compatibility work, because Rust is already the system language and can reuse the existing capos-rt ownership and ring abstractions directly.

Tier 4: Go (GOOS=capos)

Go is the next high-priority runtime after regular Rust. It needs in-process threading, futex-like wait/wake, TLS/runtime metadata support, GC integration, and a network poller mapped to capOS capabilities. See docs/proposals/go-runtime-proposal.md for the dedicated plan.

Go has higher priority than C++ because it unlocks CUE and a large practical tooling/runtime ecosystem; C++ support should not displace the Go runtime track.

Tier 5: Any Language Targeting WASI (longer term)

See Part 5 below. Languages that compile to WASI (Rust, C, Go, etc.) can run on capOS through a WASI-to-capability translation layer.

Important distinction: WASI works differently for compiled vs. interpreted languages:

  • Compiled languages (Rust, C) compile directly to .wasm — no interpreter in the loop. WASI is a clean, efficient execution path.
  • Interpreted languages (Python, JS, Lua) still need their interpreter (CPython, QuickJS, etc.) — it’s just compiled to .wasm instead of native code. The stack becomes: script → interpreter.wasm → WASI runtime → kernel. You pay for a wasm sandbox layer on top of the interpreter you’d need anyway.

For interpreted languages, WASI sandboxing is valuable when running untrusted code (plugins, user-submitted scripts) where you don’t trust the interpreter itself. For trusted system scripts, native CPython/QuickJS via the POSIX layer (Part 4) is simpler and faster — the capability model already constrains what the process can do.

Tier 6: Managed Runtimes (much later)

Languages with their own runtimes (Java, .NET) would need their runtime ported to capOS. This is large effort and low priority. WASI is the pragmatic answer for these languages.

Go is a special case — see docs/proposals/go-runtime-proposal.md for the custom GOOS=capos path (motivated by CUE support). Go via WASI (GOOS=wasip1) is an alternative for CPU-bound use cases but lacks goroutine parallelism and networking.

C++ Note: pg83/std

pg83/std (https://github.com/pg83/std) was reviewed as a possible easier path to C++ on capOS. It is MIT licensed and centered on ObjPool, an arena-owned object graph model with small containers and lightweight public interfaces.

The useful subset for capOS is the low-level core: std/mem, std/lib, std/str, std/map, std/alg, std/typ, and std/sys/atomic. The main shim boundary is std/sys/crt.cpp, which currently provides allocation, memory/string intrinsics, and monotonic time through hosted libc calls.

The full library is not a shortcut to C++ support. It assumes hosted/POSIX facilities in large areas: malloc/free, clock_gettime, pthreads, poll, epoll/kqueue, sockets, fd I/O, DNS, and optional TLS libraries. Its build also expects a C++26-capable compiler. On the current development host, g++ 13.3.0 rejected -std=c++26 and clang++ was unavailable.

Treat it as a later C++ experiment after libcapos and C/C++ startup exist: port only the freestanding arena/container subset first, with exceptions and RTTI disabled unless a concrete C++ ABI decision enables them. Regular Rust and Go remain higher-priority runtime tracks.

Language-Specific Notes

Python

CPython is a C program. It can reach capOS via two paths:

  1. WASI: CPython compiled to python.wasm, runs inside Wasmtime/WAMR on capOS. Note: this is still CPython — WASI doesn’t eliminate the interpreter, it just compiles it to wasm. The stack is: script.py → python.wasm → WASI runtime (native) → kernel.
  2. POSIX layer: CPython compiled to native ELF via musl + libcapos-posix. Direct: script.py → cpython (native) → kernel.

WASI path — upstream status (as of March 2026):

  • CPython on WASI is Tier 2 since Python 3.13 (PEP 816)
  • Works for compute-only workloads (no I/O beyond stdout)
  • No sockets/networking — blocked on WASI 0.3 (no release date)
  • No threading — WASI SDK 26/27 have bugs, skipped by CPython
  • WASI 0.2 skipped entirely — going straight to 0.3
  • Python 3.14 targets WASI 0.1, SDK 24

POSIX path:

  • Full CPython built against musl + libcapos-posix
  • Networking works immediately (via TcpSocket/UdpSocket caps behind the POSIX socket shim), no dependency on WASI 0.3
  • More integration work than WASI, but unblocked

MicroPython: Small C program (~100K source) designed for embedded use. Builds against musl + libcapos-posix with minimal effort. No threading, no mmap, minimal syscall surface. Good for early scripting needs before full CPython is ported.

When to use which:

| Use case | Path | Why |
|---|---|---|
| Untrusted Python plugins | WASI | Wasm sandbox isolates interpreter bugs |
| System scripts, config tooling | POSIX (native CPython) | Simpler, faster, networking works |
| Early scripting before POSIX layer | WASI (compute-only) | Works today, no porting needed |
| Lightweight embedded scripting | MicroPython via POSIX | Tiny footprint, minimal deps |

Recommendation: Use POSIX path (native CPython) as the primary Python target once the POSIX layer exists. WASI path for sandboxed/untrusted execution. MicroPython for early experimentation. No custom Python runtime port needed — both paths reuse upstream CPython.

JavaScript / TypeScript

Same situation as Python — JS engines (V8, SpiderMonkey, QuickJS) are C/C++ programs that can be compiled to native via POSIX layer or to wasm via WASI. In both cases, the engine interprets JS; WASI just sandboxes the engine itself.

QuickJS is the MicroPython equivalent — tiny (~50K lines C), embeddable, trivially builds against libcapos. Good candidate for embedded scripting in capOS services without pulling in a full V8.

Lua

Tiny C implementation (~30K lines). Trivially builds against libcapos. Good candidate for an embedded scripting language in capOS services. Alternatively, runs via WASI with near-zero overhead.

Part 4: POSIX Compatibility Layer

Why POSIX at All?

capOS is not POSIX and doesn’t want to be. But:

  1. Existing software. Most useful software assumes POSIX. A DNS resolver, an HTTP server, a database – all speak open()/read()/write()/socket(). Without some compatibility layer, every piece of software must be rewritten.

  2. Developer familiarity. Programmers know POSIX. A compatibility layer lowers the barrier to writing capOS software, even if native caps are better.

  3. Gradual migration. Port software first with POSIX compat, then incrementally convert to native capabilities for tighter sandboxing.

The goal is NOT full POSIX compliance. It’s a pragmatic translation layer that maps POSIX concepts to capabilities, enabling existing software to run with minimal modification while preserving capability-based security.

Architecture: libcapos-posix

Application (C/Rust, uses POSIX APIs)
  │
  │  open(), read(), write(), socket(), ...
  │
  v
libcapos-posix (POSIX-to-capability translation)
  │
  │  Maps fds to caps, paths to namespace lookups
  │
  v
libcapos (native capability invocation)
  │
  │  SQ/CQ ring + cap_enter syscall
  │
  v
Kernel (capability dispatch)

libcapos-posix is a static library that provides POSIX-like function signatures. It is NOT libc – it doesn’t provide malloc (that’s the allocator in capos-rt/libcapos), locale support, or the thousand other things in glibc. It’s the ~50 syscall wrappers that matter for I/O.

File Descriptor Table

POSIX programs think in file descriptors. capOS has capabilities. The translation is a per-process fd-to-cap mapping table inside libcapos-posix:

use alloc::collections::BTreeMap;
use alloc::vec::Vec;

struct FdTable {
    entries: BTreeMap<i32, FdEntry>,
    next_fd: i32, // next fd number to hand out
}

enum FdEntry {
    /// Backed by a Console cap (stdout/stderr)
    Console { cap_id: u32 },
    /// Backed by a Namespace + hash (opened "file")
    StoreObject { namespace_cap: u32, hash: Vec<u8>, cursor: usize },
    /// Backed by a TcpSocket cap
    TcpSocket { cap_id: u32 },
    /// Backed by a UdpSocket cap
    UdpSocket { cap_id: u32 },
    /// Backed by a TcpListener cap
    Listener { cap_id: u32 },
    /// Pipe (IPC channel between two caps)
    Pipe { read_cap: u32, write_cap: u32 },
}

On process startup, libcapos-posix pre-populates:

  • fd 0 (stdin): if a Console or StdinReader cap is in the CapSet
  • fd 1 (stdout): mapped to Console cap
  • fd 2 (stderr): mapped to Console cap (or a separate Log cap)

Path Resolution

POSIX open("/etc/config.toml", O_RDONLY) becomes:

  1. libcapos-posix looks up the process’s Namespace cap (from CapSet, name "fs" or "root")
  2. Strips leading / (there is no global root – the namespace IS the root)
  3. Calls namespace.resolve("etc/config.toml") to get a store hash
  4. Calls store.get(hash) to retrieve the object data
  5. Creates an FdEntry::StoreObject with cursor at 0
  6. Returns the fd number

Relative paths work the same way – there’s no cwd concept by default, but libcapos-posix can maintain a synthetic cwd string and prepend it.

Path scoping is automatic. If the process was granted a Namespace scoped to "myapp/", then open("/data.db") resolves to "myapp/data.db" in the store. The process can’t escape its namespace – there’s no .. traversal because namespaces are flat prefix scopes, not hierarchical directories.

Supported POSIX Functions

Grouped by what capability backs them:

Console cap -> stdio:

| POSIX | capOS translation |
|---|---|
| write(1, buf, len) | console.write(buf[..len]) |
| write(2, buf, len) | console.write(buf[..len]) (or log cap) |
| read(0, buf, len) | stdin.read(buf, len) if stdin cap exists |

Namespace + Store caps -> file I/O:

| POSIX | capOS translation |
|---|---|
| open(path, flags) | namespace.resolve(path) -> store.get(hash) -> fd |
| read(fd, buf, len) | memcpy from cached store object at cursor |
| write(fd, buf, len) | buffer writes, flush to store.put() on close |
| close(fd) | if modified: store.put(data) -> namespace.bind(path, hash) |
| lseek(fd, off, whence) | update cursor in FdEntry |
| stat(path, buf) | namespace.resolve(path) -> synthesize stat from object metadata |
| unlink(path) | namespace.unbind(path) (object remains in store if referenced elsewhere) |
| opendir/readdir | namespace.list() filtered by prefix |
| mkdir(path) | no-op or create empty namespace prefix (namespaces are implicit) |

TcpSocket/UdpSocket caps -> networking:

| POSIX | capOS translation |
|---|---|
| socket(AF_INET, SOCK_STREAM, 0) | net_mgr.create_tcp_socket() -> fd |
| connect(fd, addr, len) | tcp_socket.connect(addr) |
| bind(fd, addr, len) | tcp_listener.bind(addr) |
| listen(fd, backlog) | no-op (listener cap is already listening) |
| accept(fd) | tcp_listener.accept() -> new fd |
| send(fd, buf, len, 0) | tcp_socket.send(buf[..len]) |
| recv(fd, buf, len, 0) | tcp_socket.recv(buf, len) |

Not supported (returns ENOSYS or EACCES):

| POSIX | Why not |
|---|---|
| fork() | No address space cloning. Use posix_spawn() (maps to ProcessSpawner) |
| exec() | No in-place replacement. Use posix_spawn() |
| kill(pid, sig) | No signals. Future lifecycle work may add ProcessHandle kill semantics |
| chmod/chown | No permission bits. Authority is structural |
| mmap(MAP_SHARED) | No shared memory yet (future: SharedMemory cap) |
| ioctl | No device files. Use typed capability methods |
| ptrace | No debugging interface yet |
| pipe() | Possible via IPC caps, but not in initial version |
| select/poll/epoll | Requires async cap invocation (Stage 5+). Initial version is blocking only |

Process Creation Compatibility

capOS process creation is spawn-style, not fork/exec-style. A new process is a fresh ELF instance selected by ProcessSpawner, with an explicit initial CapSet assembled from granted capabilities. The parent address space is not cloned, and an existing process image is not replaced in place.

posix_spawn() is the compatibility primitive for subprocess creation. A libcapos-posix implementation maps it to ProcessSpawner.spawn(), translates file actions into fd-table setup and capability grants, and passes argv and environment data through the process bootstrap channel once that ABI exists. Programs that use the common fork() followed immediately by exec() pattern should be patched to call posix_spawn() directly.

Full fork() is intentionally not a native kernel primitive. Supporting it would require copy-on-write address-space cloning, parent/child register return semantics, fd-table duplication, a per-capability inheritance policy, safe handling for outstanding SQEs/CQEs, and defined behavior for endpoint calls, timers, waits, and process handles that are in flight at the fork point. Threaded POSIX processes add another constraint: only the calling thread is cloned, while locks and async-signal-safe state must remain coherent in the child.

If a concrete port needs more than posix_spawn(), the next step should be a narrow compatibility shim with vfork()/fork-for-exec semantics backed by ProcessSpawner, not a general kernel clone operation. That shim would suspend the parent, restrict child actions to exec-or-exit, and avoid pretending that arbitrary address-space cloning exists.

Security Model

The POSIX layer does NOT weaken capability security. Every POSIX call translates to a capability invocation on caps the process was actually granted:

  • open("/etc/passwd") fails if the process’s namespace doesn’t contain "etc/passwd" – not because of permission bits, but because the name doesn’t resolve
  • socket(AF_INET, SOCK_STREAM, 0) fails if the process wasn’t granted a NetworkManager cap
  • fork() fails unconditionally – there’s no way to synthesize it from caps

A POSIX binary on capOS is more constrained than on Linux, not less. The compatibility layer provides familiar function signatures, not familiar authority.

Building POSIX-Compatible Binaries

my-app/
  Cargo.toml        # depends on capos-posix (which depends on capos-rt)
  src/main.rs       # uses libc-style APIs

Or for C:

#include <capos/posix.h>   // open, read, write, close, socket, ...
#include <capos/capos.h>   // cap_call, capset_get, ...

int main() {
    // Works -- stdout is mapped to Console cap
    write(1, "hello\n", 6);

    // Works -- if "data" namespace cap was granted
    int fd = open("/config.toml", O_RDONLY);
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);

    // Works -- if NetworkManager cap was granted
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    // ...
}

The linker pulls in libcapos-posix.a -> libcapos.a -> startup code. Same ELF output, same kernel loader.

musl as a Base (Optional, Later)

For broader C compatibility (printf, string functions, math), libcapos-posix can be layered under musl libc. musl has a clean syscall interface – all system calls go through a single __syscall() function. Replacing that function with capability-based dispatch gives you full libc on top of capOS capabilities:

// musl's syscall entry point -- we replace this with capability dispatch.
// musl passes up to six register-sized arguments per syscall.
long __syscall(long n, long a, long b, long c, long d, long e, long f) {
    switch (n) {
        case SYS_write:  return capos_write((int)a, (const void *)b, (size_t)c);
        case SYS_open:   return capos_open((const char *)a, (int)b, (int)c);
        case SYS_socket: return capos_socket((int)a, (int)b, (int)c);
        // ...
        default: return -ENOSYS;
    }
}

This is the same approach Fuchsia uses with fdio + musl, and Redox OS uses with relibc. It works and it gives you printf, fopen, getaddrinfo, and most of the C standard library.

Priority: after native capos-rt and libcapos are stable. musl integration is a significant engineering effort and should only be done when there’s actual software to port.

Part 5: WASI as an Alternative to POSIX

Why WASI Fits capOS Better Than POSIX

WASI (WebAssembly System Interface) was designed from the start as a capability-based system interface. Its concepts map almost directly to capOS:

| WASI concept | capOS equivalent |
|---|---|
| fd (pre-opened directory) | Namespace cap |
| fd (socket) | TcpSocket/UdpSocket cap |
| fd_write on stdout | Console.write() |
| Pre-opened dirs at startup | CapSet at spawn |
| No ambient filesystem access | No ambient authority |
| path_open scoped to pre-opened dir | namespace.resolve() scoped to granted prefix |

WASI programs already assume they get no ambient authority. A WASI binary compiled for capOS would need essentially zero translation for the security model – just a thin ABI adapter.

Architecture: Wasm Runtime as a capOS Service

WASI binary (.wasm)
  │
  │  WASI syscalls (fd_read, fd_write, path_open, ...)
  │
  v
wasm-runtime process (Wasmtime/wasm-micro-runtime, native capOS binary)
  │
  │  Translates WASI calls to capability invocations
  │  Each wasm instance gets its own CapSet
  │
  v
libcapos (native capability invocation)
  │
  v
Kernel

The wasm runtime is itself a native capOS process. It receives caps from its parent and partitions them among the wasm modules it hosts. This gives you:

  • Language independence. Any language that compiles to WASI (Rust, C, C++, Go, Python, JS, …) runs on capOS
  • Sandboxing for free. Wasm’s memory isolation + capOS capability scoping
  • No porting effort for software that already targets WASI
  • Density. Multiple wasm modules in one process, each with different caps

WASI vs Native Performance

Wasm adds overhead: bounds-checked memory, indirect calls, no SIMD (WASI preview 2 adds some). For system services (drivers, network stack), native Rust is the right choice. For application-level code (business logic, CLI tools, web services), wasm overhead is acceptable and the portability is worth it.

WASI Implementation Phases

Phase 1: wasm-micro-runtime as a capOS service. WAMR is a lightweight C wasm runtime designed for embedded/OS use. Build it as a native capOS C binary (via libcapos). Support fd_write (Console), proc_exit, and args_get – enough to run “hello world” wasm modules.

Phase 2: WASI filesystem via Namespace. Map WASI path_open/fd_read/fd_write to Namespace + Store caps. Pre-opened directories become Namespace caps.

Phase 3: WASI sockets. Map WASI socket APIs to TcpSocket/UdpSocket caps.

Phase 4: WASI component model. WASI preview 2 components can expose and consume typed interfaces. These map naturally to capOS capability interfaces – a wasm component that exports an HTTP handler becomes a capability that other processes can invoke.

Part 6: Putting It All Together – Porting Strategy

Spectrum of Integration

Most native                                              Most compatible
     |                                                          |
     v                                                          v
Native Rust    C with libcapos    C with POSIX layer    WASI binary
(capos-rt)     (typed caps)       (libcapos-posix)      (wasm runtime)

- Best perf     - Good perf        - Familiar API        - Any language
- Full cap      - Full cap         - Auto sandboxing     - Auto sandboxing
  control         control            via cap scoping       via wasm + caps
- Most work     - Moderate work    - Least rewrite       - Zero rewrite
  to write        to write           for existing C        for WASI targets

Example: Porting a DNS Resolver

Native Rust: Rewrite using capos-rt. Receives UdpSocket cap, serves DNS lookups as a DnsResolver capability. Other processes get a DnsResolver cap instead of calling getaddrinfo(). Clean, typed, minimal authority.

C with POSIX layer: Take an existing DNS resolver (e.g., musl’s getaddrinfo implementation or a standalone resolver). Compile against libcapos-posix. Give it a UdpSocket cap and a Namespace cap for /etc/resolv.conf. It calls socket(), sendto(), recvfrom() – all translated to cap invocations. Works with minimal changes, but can’t export a typed DnsResolver cap (it speaks POSIX, not caps).

WASI: Compile a Rust DNS resolver to WASI. Run it in the wasm runtime. Same capability scoping, but through the wasm sandbox.

Recommended priority order:

  1. System services: native Rust only. Drivers, network stack, store, init – these are the foundation and must use capabilities natively. No POSIX layer here.

  2. First applications: native Rust. While the ecosystem is young, applications should use capos-rt directly. This validates the cap model.

  3. C compatibility: when porting specific software. Don’t build the POSIX layer speculatively. Build it when there’s a specific C program to port (e.g., a DNS resolver, an HTTP server, a database). Let real porting needs drive which POSIX functions to implement.

  4. WASI: as the general-purpose application runtime. Once the native runtime is stable, the wasm runtime becomes the “run anything” answer. Lower priority than native Rust, but higher priority than full POSIX/musl compat, because WASI’s capability model is a natural fit.

Part 7: Schema Extensions

New schema types needed for the userspace runtime:

# Extend schema/capos.capnp

struct InitialCaps {
    entries @0 :List(InitialCapEntry);
}

struct InitialCapEntry {
    name @0 :Text;
    id @1 :UInt32;
    interfaceId @2 :UInt64;
}

interface ProcessSpawner {
    spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}

struct CapGrant {
    name @0 :Text;
    capId @1 :UInt32;
    interfaceId @2 :UInt64;
}

interface ProcessHandle {
    wait @0 () -> (exitCode :Int64);
}

These definitions now live in schema/capos.capnp as the single source of truth. spawn() returns the ProcessHandle through the ring result-cap list; handleIndex identifies that transferred cap in the completion. The first slice passes a boot-package binaryName instead of raw ELF bytes so spawn requests stay inside the bounded ring parameter buffer; manifest-byte exposure and bulk-buffer spawning remain later work. kill, post-spawn grants, and exported-cap lookup are deferred until their lifecycle semantics are implemented.

Implementation Phases

Phase 1: capos-rt (parallel with Stage 4)

  • Create capos-rt/ crate (no_std + alloc, path dependency)
  • Implement syscall wrappers (sys_exit, sys_cap_enter) and ring helpers
  • Implement CapSet parsing from well-known page
  • Implement typed Console wrapper (first cap used from userspace)
  • Rewrite init/ to use capos-rt
  • Entry point macro, panic handler, allocator setup

Deliverable: init prints “Hello” via Console cap invocation through capos-rt, not raw asm.

Phase 2: Service binaries (after Stage 6)

  • Add capnp codegen to capos-rt build.rs (shared with kernel)
  • Implement typed wrappers for all schema-defined caps
  • Build the first multi-process demo: init spawns server + client, client invokes server cap
  • Establish the pattern for service binaries (Cargo.toml template, linker script, build integration)

Deliverable: two userspace processes communicate via typed capabilities.

Phase 3: libcapos for C (after Phase 2)

  • Expose capos-rt functionality via extern "C" API
  • Write capos.h header
  • Build system support for C userspace binaries (linker script, startup)
  • Port one small C program as validation

Deliverable: a C “hello world” using console_write_line().

Phase 4: POSIX compatibility (driven by need)

  • Implement FdTable and path resolution
  • Start with file I/O (open/read/write/close over Namespace + Store)
  • Add socket wrappers when networking is userspace
  • Optionally integrate musl for full libc

Deliverable: an existing C program (e.g., a simple HTTP server) runs on capOS with minimal source changes.

Phase 5: WASI runtime (after Phase 3)

  • Build wasm-micro-runtime as a native capOS C binary
  • Map WASI fd_write/proc_exit to caps
  • Extend to filesystem and socket WASI APIs
  • Run a “hello world” wasm module

Deliverable: hello.wasm runs on capOS.

Open Questions

  1. Allocator strategy. Should the userspace heap be a fixed-size region (simple, but limits memory), or should it grow by invoking a FrameAllocator cap (flexible, but every allocation might syscall)? Likely answer: fixed initial region + grow-on-demand via cap.

  2. Async I/O. The SQ/CQ ring is inherently asynchronous (submit SQEs, poll CQEs), but the initial capos-rt wrappers provide blocking convenience (submit one CALL SQE + cap_enter(1, MAX)). Real services need batched async patterns. Options:

    • Submit multiple SQEs, poll CQEs in an event loop (io_uring style)
    • Green threads in capos-rt, each blocking on its own cap_enter
    • Userspace executor (like tokio) driving the ring
  3. Cap passing in POSIX layer. POSIX has SCM_RIGHTS for passing fds over Unix sockets. Should the POSIX layer support something similar for passing caps? Or is this native-only?

  4. Dynamic linking. Currently all binaries are statically linked. Should capOS support shared libraries? Probably not initially – static linking is simpler and the binaries are small. Revisit if binary size becomes a concern.

  5. WASI component model integration. WASI preview 2 components have typed imports/exports that could map to capnp interfaces. Should the wasm runtime auto-generate capnp-to-WIT adapters from schemas? This would let wasm components participate natively in the capability graph.

  6. Build system. How are userspace binaries packed into the boot image? Currently the Makefile builds init/ separately. With multiple service binaries, need a more scalable approach (build manifest that lists all binaries, Makefile target that builds and packs them all).

Relationship to Other Proposals

  • Service architecture proposal – defines what services exist and how they compose. This proposal defines how those service binaries are built, what runtime they use, and how non-Rust software fits in.
  • Storage and naming proposal – the POSIX open()/read()/write() translation targets the Store and Namespace caps defined there.
  • Networking proposal – the POSIX socket translation targets the TcpSocket/UdpSocket caps from the network stack.

Proposal: Native Shell, Agent Shell, and POSIX Shell

How interactive operation should work on capOS without reintroducing ambient authority through a Unix-like command line.

Problem

capOS deliberately avoids global paths, inherited file descriptors, ambient network access, and process-wide privilege bits. A conventional shell assumes all of those. If capOS copied a Unix shell model directly, the shell would either be mostly useless or become an ambiently privileged escape hatch around the capability model.

The system needs three related, but distinct, shell layers:

  • Native shell: schema-aware capability REPL and scripting language.
  • Agent shell: natural-language planning layer over the native shell.
  • POSIX shell: compatibility personality for existing programs and scripts.

All three must be ordinary userspace processes. None of them should receive special kernel privilege. The kernel and trusted capability-serving processes remain the enforcement boundary.

The first boot-to-shell milestone is text-only: local console login/setup and, later in the same family, a browser-hosted terminal gateway. Graphical shells, desktop UI, compositors, and GUI app launchers are a later tier. See boot-to-shell-proposal.md.

Design Principles

  • A shell starts with only the capabilities it was granted.
  • Natural language is not authority.
  • A shell command compiles to typed capability calls, not stringly-typed syscalls.
  • Child processes receive explicit grants. There is no implicit inheritance of the shell’s full authority.
  • Elevation is a capability request mediated by a trusted broker, not a flag inside the shell.
  • Shell startup is a workload launch from a UserSession, service principal, or recovery profile. Session metadata informs policy and audit; it is not authority.
  • Default interactive cap sets are broker-issued session bundles, not hard-coded shell privileges.
  • POSIX behavior is an adapter over scoped Directory, File, socket factory, and process capabilities. It is not the native authority model.

User identity and policy sit above this shell model. A shell session may be associated with a human, service, guest, anonymous, or pseudonymous principal, but the session’s capabilities remain the authority. RBAC, ABAC, and mandatory policy decide which scoped caps a broker may grant; they do not create a kernel-side uid, role bit, or label check on ordinary capability calls. See user-identity-and-policy-proposal.md.

Layering

flowchart TD
    Input[Login, guest, anonymous, or service request] --> SessionMgr[SessionManager]
    SessionMgr --> Session[UserSession metadata cap]
    Session --> Broker[AuthorityBroker / PolicyEngine]
    Broker --> Bundle[Scoped session cap bundle]

    Bundle --> Agent[Agent shell]
    Bundle --> Native[Native shell]
    Bundle --> Posix[POSIX shell]

    Agent --> Plan[Typed action plan]
    Plan --> Native
    Posix --> Compat[POSIX compatibility runtime]

    Native --> Ring[capos-rt capability transport]
    Compat --> Ring
    Ring --> Kernel[Kernel cap ring]
    Ring --> Services[Userspace services]

    Agent --> Approval[Approval client cap]
    Approval --> Broker
    Broker --> Services
    Broker --> Audit[AuditLog]

The native shell is the primitive interactive surface. The agent shell emits native-shell plans after inspecting available schemas, current caps, and the session-bound policy context exposed to it. The POSIX shell is a compatibility consumer of capOS capabilities, not the model other shells are built on.

A shell may display a principal name, profile, role set, label, or POSIX UID, but those values are descriptive unless a trusted broker uses them to return a specific capability. Losing a home, logs, launcher, or approval cap cannot be repaired by presenting the same session ID back to the kernel.

Native Shell

The native shell is a typed capability graph operator. Its job is to inspect, invoke, pass, attenuate, release, and trace capabilities.

Example init or development session with explicit spawn authority:

capos:init> caps
log        Console
spawn      ProcessSpawner
boot       BootPackage
vm         VirtualMemory

capos:init> call @log.writeLine({ text: "hello" })
ok

capos:init> spawn "tls-smoke" with {
  log: @log
} -> $child
started pid 12

capos:init> wait $child
exit 0

Values

Native shell values should include:

  • @name: a named capability in the current shell context.
  • $name: a local value, result, promise, or process handle.
  • structured values: text, bytes, integers, booleans, lists, and structs.
  • result-cap values returned through the capOS transfer-result path.
  • trace values representing CQE and call-history slices.

The shell should preserve interface metadata with every capability value. A method call is valid only if the target cap exposes the method’s schema.

Commands

Initial commands should be small and explicit:

caps
inspect @log
methods @spawn
call @log.writeLine({ text: "boot complete" })
spawn "ipc-server" with { log: @log, ep: @serverEp } -> $server
wait $server
release @temporary
trace $server
bind scratch = @store.sub("scratch")
derive readonly = @home.sub("config").readOnly()

inspect should show the interface ID, label, transferability, revocation state when available, and callable methods. It should not imply that two caps with the same interface ID are the same authority.

Syntax

The syntax should be structured rather than shell-token based. A CUE-like or Cap’n-Proto-literal-like shape fits capOS better than POSIX word splitting:

spawn "net-stack" with {
  log: @log
  nic: @virtioNic
  timer: @timer
}

The shell can still provide abbreviations, but the executable representation should be an ActionPlan object with typed fields.

Composition

Native composition should pass typed caps or structured values, not inherited byte streams by default:

pipe @camera.frames()
  |> spawn "resize" with { input: $, width: 640, height: 480 }
  |> spawn "jpeg-encode" with { input: $, quality: 85 }
  |> call @photos.write({ name: "frame.jpg", data: $ })

If a byte stream is desired, it should be explicit through a ByteStream, File, or POSIX adapter capability. This keeps the “pipe” operator from silently turning every interface into untyped bytes.

Namespaces

There is no global root. A native shell may have a current Directory or Namespace capability, but that is just a default argument:

capos:user> ls @config
services
network

capos:user> cd @config.sub("services")
capos:@config/services> ls
logger
net-stack

The shell cannot traverse above a scoped directory or namespace unless it holds another capability that names that authority.

Session Context

A session-aware shell may hold a self or session cap for UserSession.info() and audit context. That cap is metadata. It can identify the principal, auth strength, expiry, quota profile, and audit identity, but it cannot widen the shell’s CapSet or authorize kernel operations by itself.

The launcher or supervisor starts the shell with a CapSet returned by AuthorityBroker(session, profile). For interactive work, that bundle should usually include scoped terminal, home, logs, launcher, status, and approval caps. For service accounts, guest sessions, anonymous workloads, and recovery mode, the broker returns different bundles under explicit policy profiles.

Shell-launched children inherit only the caps named in the spawn plan. A child may receive a UserSession or session badge for audit, per-client quotas, or service-side selection, but object access still comes from the scoped object caps passed to that child.

Agent Shell

The agent shell is a natural-language planner that emits typed native-shell plans. It should not directly own broad administrative authority.

Example:

capos:init> start the IPC demo, give the client only the server endpoint and console, then wait for both

The agent produces a plan:

Plan:
1. Spawn "ipc-server" with:
   - log: Console
   - ep: Endpoint(owner)

2. Spawn "ipc-client" with:
   - log: Console
   - server: Endpoint(client facet from server)

3. Wait for both ProcessHandle caps.

Required authority:
- ProcessSpawner
- Console
- Endpoint owner cap
- BootPackage binary access

Only after validation does the plan execute. Validation checks the current cap set, schema method IDs, transferability, grant names, quota limits, and policy.

What the Agent Adds

The useful AI-specific behavior is not raw command execution. It is:

  • intent decomposition into spawn, grant, wait, trace, and release steps.
  • schema-aware parameter construction.
  • least-authority grant selection.
  • explanation of missing capabilities.
  • diagnosis from structured errors, CQEs, logs, and process handles.
  • conversion of vague requests into an explicit plan that can be audited.
  • retry after typed failures without bypassing policy.

The agent should reason over capOS objects and schemas, not over an unbounded shell prompt.

Minimal Daily Cap Set

The daily-use agent shell should start with the user-identity proposal’s session bundle, minted by AuthorityBroker for one UserSession and profile:

terminal        TerminalSession or Console
self            self/session introspection
status          read-only SystemStatus
logs            read-only LogReader scoped to this user/session
home            Directory or Namespace scoped to user data
launcher        restricted launcher for approved user applications
approval        ApprovalClient

It should not receive these by default:

ProcessSpawner(all)
BootPackage(all)
DeviceManager
StoreAdmin
FrameAllocator
VirtualMemory for other processes
raw networking caps
global service supervisor caps

The shell can ask for more authority, but it cannot mint that authority for itself.

Guest and anonymous profiles should receive narrower variants. A guest shell may get terminal, tmp, and a restricted launcher, while an anonymous workload normally receives short-lived purpose caps, strict quotas, and no durable home namespace. An approval path exists only when the profile policy explicitly grants one.

Approval and Authentication

Elevation belongs in a trusted broker service that is outside the model-controlled agent process.

Conceptual interfaces:

interface ApprovalClient {
  request @0 (
    reason :Text,
    plan :ActionPlan,
    requestedCaps :List(CapRequest),
    durationMs :UInt64
  ) -> (grant :ApprovalGrant);
}

enum ApprovalState {
  pending @0;
  approved @1;
  denied @2;
  expired @3;
}

interface ApprovalGrant {
  state @0 () -> (state :ApprovalState, reason :Text);
  claim @1 () -> (caps :List(GrantedCap));
  cancel @2 () -> ();
}

interface AuthorityBroker {
  request @0 (
    session :UserSession,
    plan :ActionPlan,
    requestedCaps :List(CapRequest),
    durationMs :UInt64
  ) -> (grant :ApprovalGrant);
}

The agent shell holds only a session-bound ApprovalClient. It does not submit arbitrary PrincipalInfo, role, UID, label values, or authentication proofs as authority. The ApprovalClient forwards the bound UserSession and typed request to AuthorityBroker. The broker or a consent service wrapping it holds powerful caps, drives any trusted consent or step-up authentication path, and mints attenuated temporary caps after policy and authentication checks.

The conceptual API intentionally has no authProof argument on the agent-visible path. If a proof is needed, it is collected by SessionManager, the broker, or a trusted approval UI and reflected back to the agent only as pending, approved, denied, or expired.

Elevation Flow

User request:

restart the network stack

Agent plan:

Requested action:
- stop service "net-stack"
- spawn "net-stack"
- grant: nic, timer, log
- wait for health check

Missing authority:
- ServiceSupervisor(net-stack)

Requested duration:
- 60 seconds

Broker decision:

  • Which UserSession and profile is this request bound to?
  • Is that principal/profile allowed to restart net-stack?
  • Is the requested binary allowed?
  • Are the requested grants narrower than policy permits?
  • Do mandatory confidentiality and integrity constraints allow the grant?
  • Is there fresh user presence?
  • Does this require step-up authentication?

If approved, the broker returns a narrow leased capability:

supervisor: ServiceSupervisor(service="net-stack", expires=60s)

It should not return broad ProcessSpawner, BootPackage, or DeviceManager authority when a scoped supervisor cap can do the job.

Authentication

Authentication proof should be consumed by the SessionManager or broker boundary, not exposed as a secret to the agent. Suitable mechanisms include:

  • password or PIN for medium-risk local actions.
  • hardware key or WebAuthn-style challenge for administrative actions.
  • TPM-backed local presence for device or boot-policy operations.
  • multi-party approval for destructive policy, storage, or recovery actions.

The model should never receive raw tokens, private keys, recovery codes, or full environment dumps.

Agent Hardening

The agent shell must treat files, logs, web pages, service output, and CQE payloads as untrusted data. They are not instructions.

Required behavior:

  • show an executable typed plan before authority-changing actions.
  • keep elevated caps leased, narrow, and short-lived.
  • release temporary caps after the plan finishes or fails.
  • audit every approval request, grant, cap transfer, and release.
  • require exact targets for destructive actions.
  • refuse broad phrases such as “give it everything” unless a trusted policy explicitly allows a named emergency mode.
  • keep model memory separate from secrets and authentication proofs.

The enforcement rule is simple: the model may plan, explain, and request. Capabilities decide what can happen.

POSIX Shell

The POSIX shell is a compatibility layer for existing software and scripts. It should be useful, but it should not define native capOS administration.

Mapping

POSIX concepts map onto granted capabilities:

POSIX concept              capOS backing
/                          synthetic root built from granted Directory or FileServer caps
cwd                        current scoped Directory cap
fd                         local handle to File, ByteStream, pipe, terminal, or socket cap
pipe                       ByteStream pair or userspace pipe service
PATH                       search inside the synthetic root or a command registry cap
exec                       ProcessSpawner or restricted launcher cap
sockets                    socket factory caps such as TcpProvider or HttpEndpoint
uid, gid, user, group      synthetic POSIX profile derived from session metadata
$HOME                      path alias backed by a granted home directory or namespace cap
/etc/passwd, /etc/group    profile service view, scoped to the compatibility environment
env vars                   data only; never authority by themselves

If a POSIX process has no network cap, connect() fails. If it has no directory mounted at /etc, opening /etc/resolv.conf fails. If it has no device cap, /dev is empty or synthetic.

A POSIX shell is launched with both a CapSet and compatibility profile metadata. The profile controls what legacy APIs report. The CapSet controls what the process can actually do.
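A minimal sketch of this rule, assuming a hypothetical CapSet lookup inside the compatibility layer (names such as `CapSet` and the `"tcp"` entry are illustrative):

```rust
use std::collections::HashMap;

/// Illustrative capability variants held by a POSIX-compatible process.
#[derive(Clone)]
enum Cap {
    TcpProvider,
    Directory(&'static str),
}

/// Illustrative local capability table keyed by name.
struct CapSet(HashMap<&'static str, Cap>);

/// connect() succeeds only if a socket-factory cap was granted;
/// there is no ambient network to fall back on.
fn posix_connect(caps: &CapSet, _addr: &str) -> Result<(), &'static str> {
    match caps.0.get("tcp") {
        Some(Cap::TcpProvider) => Ok(()), // would invoke the TcpProvider cap
        _ => Err("no network capability granted"),
    }
}

fn main() {
    let no_net = CapSet(HashMap::from([("root", Cap::Directory("/"))]));
    assert!(posix_connect(&no_net, "10.0.0.1:80").is_err());

    let with_net = CapSet(HashMap::from([("tcp", Cap::TcpProvider)]));
    assert!(posix_connect(&with_net, "10.0.0.1:80").is_ok());
    println!("connect gated on CapSet, not profile metadata");
}
```

The same pattern applies to `/etc` lookups and `/dev` entries: the legacy call is a view over the granted table, never a bypass of it.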

Compatibility Limits

Exact Unix semantics should not be promised early.

  • Prefer posix_spawn over full fork for the first implementation.
  • fork with arbitrary shared process state can be emulated later if needed.
  • setuid cannot grant caps. At most it asks a compatibility broker to replace the POSIX profile or launch a new process with a different broker-issued cap bundle.
  • Mode bits and ownership metadata do not create authority.
  • chmod can modify filesystem metadata exposed by a filesystem service, but it cannot grant caps outside that service’s policy.
  • /proc is a debugging service view, not kernel ambient introspection.
  • Device files exist only when a capability-backed adapter deliberately exposes them.

This is enough for many build tools and CLI programs without making POSIX the security model.

POSIX Session Caps

A normal POSIX shell session might receive:

terminal      TerminalSession
session       UserSession metadata
profile       POSIX profile view
root          Directory or FileServer synthetic root
launcher      restricted ProcessSpawner/command launcher
pipeFactory   ByteStream factory
clock         Timer

Optional caps:

tcp           scoped socket provider
home          writable user Directory
tmp           temporary Directory
proc          read-only process inspection tree

Administrative caps still require broker-mediated approval.

Recovery Shell

A recovery shell is a separate policy profile, not the normal agent shell with hidden extra privileges. It may receive a larger cap set, but only after strong local authentication and with full audit logging. Guest and anonymous profiles must not fall into recovery authority by omission.

Possible recovery bundle:

console
boot package read
system status read
service supervisor for critical services
read-only storage inspection
scoped repair caps
approval client

Destructive recovery operations should still go through exact-target approval. The recovery shell should be local-only unless a separate remote recovery policy explicitly grants network access.

Required Interfaces

This proposal implies several service interfaces beyond the current smoke-test surface:

  • UserSession / SessionManager: principal/session metadata, audit context, and guest or anonymous profile creation (user identity proposal).
  • TerminalSession: structured terminal I/O, window size, paste boundaries.
  • SchemaRegistry: maps interface IDs to method names and parameter schemas.
  • CommandRegistry: optional registry of native command capabilities.
  • SystemStatus: read-only process and service status.
  • LogReader: scoped log access.
  • ServiceSupervisor: restart/status authority for one service or subtree.
  • AuthorityBroker / ApprovalClient: session-bound base bundles, plan-specific leased grants, and policy/authentication mediation.
  • CredentialStore, ConsoleLogin, and WebShellGateway: boot-to-shell authentication services for password-verifier setup, passkey registration, and text terminal launch (boot-to-shell proposal).
  • AuditLog: append-only record of plans, approvals, grants, and releases.
  • POSIXProfile / compatibility broker: synthetic UID/GID, names, $HOME, cwd, and profile replacement without treating POSIX metadata as authority.
  • ByteStream / pipe factory: explicit byte-stream composition for POSIX and selected native pipelines.

These should be ordinary capabilities. A shell only sees the subset it has been granted.

Implementation Plan

  1. Native serial shell

    • Built on capos-rt.
    • Lists initial CapSet entries.
    • Invokes typed Console methods.
    • Spawns and waits on boot-package binaries through ProcessSpawner.
    • Provides caps, inspect, call, spawn, wait, release, and trace.
  2. Session-aware shell profile

    • Use the SessionManager -> UserSession metadata and AuthorityBroker(session, profile) -> cap bundle split.
    • Add self/session introspection without making identity metadata authoritative.
    • Start with guest, local-presence, and service-account profiles before durable account storage exists.
  3. Structured native scripting

    • Add typed variables, result-cap binding, and plan serialization.
    • Add schema registry support for method names and argument validation.
    • Add explicit byte-stream adapters for commands that need text streams.
  4. Approval broker

    • Define ActionPlan, CapRequest, ApprovalClient, and leased grant records.
    • Add local authentication and audit logging.
    • Make administrative native-shell operations request scoped caps through the broker instead of running from a permanently privileged shell.
  5. Boot-to-shell integration

    • Add local console login/setup in front of the native shell.
    • Require a configured password verifier when one exists.
    • Enter setup mode when no console password verifier exists.
    • Treat guest as an explicit local profile and anonymous as a separate remote/programmatic profile, not as missing-password fallbacks.
    • Support passkey-only web terminal setup through local/bootstrap authority, not unauthenticated remote first use.
  6. Agent shell

    • Natural-language frontend that emits native ActionPlan objects.
    • Starts with the broker-issued minimal daily session bundle.
    • Uses the approval broker for elevation.
    • Treats all external content as untrusted data.
  7. POSIX shell

    • Implement after Directory/File, ByteStream, and restricted process launch exist.
    • Start with posix_spawn, fd table emulation, cwd, scoped root, pipes, and terminal I/O, plus synthetic POSIX profile metadata.
    • Add broader compatibility only as real workloads demand it.

Non-Goals

  • No global root namespace.
  • No shell-owned root/admin bit.
  • No model-visible secrets.
  • No default inheritance of all shell caps into children.
  • No authorization from PrincipalInfo, UID/GID, role, or label values alone.
  • No promise that POSIX scripts observe exact Unix behavior without a compatibility profile that grants the needed caps.

Open Questions

  • Should the native shell syntax be CUE-derived, Cap’n-Proto-literal-derived, or a smaller custom grammar?
  • How should schema reflection be packaged before a full runtime SchemaRegistry exists?
  • What is the first minimal TerminalSession interface beyond Console?
  • Should approval be synchronous only, or can long-running agent plans request staged approvals?
  • How should audit logs be stored before persistent storage exists?

Proposal: Boot to Shell

How capOS should move from “boot runs smokes and halts” to an authenticated, text-only interactive shell without weakening the capability model.

Problem

The current boot path is still a systems bring-up path. It starts fixed services, proves kernel and userspace invariants, and exits cleanly. That is useful for validation, but it is not an operating environment.

The first interactive milestone should be deliberately modest:

  • Boot QEMU or a local machine to a text console login/setup prompt.
  • Start a native capability shell after local authentication or first-boot setup.
  • Offer a browser-hosted text terminal later in the same milestone family, with WebAuthn/passkey authentication.
  • Keep graphical shells, desktop UI, window systems, and app launchers as a later tier.

The risk is that “make it interactive” tends to smuggle ambient authority back into the system. A login prompt must not become a kernel uid, a web terminal must not become an unaudited remote root shell, and first-boot setup must not be a first-remote-client-wins race.

Scope

In scope:

  • Serial/local text console login and first-boot credential setup.
  • Native text shell as the post-login workload.
  • Minimal SessionManager, CredentialStore, AuthorityBroker, and AuditLog pieces needed to launch that shell with an explicit CapSet.
  • Password verifier records stored with a memory-hard password hash.
  • Passkey registration and authentication for a web text shell.
  • A passkey-only account path that does not require creating a password first.
  • Local recovery/setup policy for machines with no credential records.

Out of scope:

  • Graphical shell, desktop session, compositor, GUI app launcher, clipboard, or remote desktop.
  • POSIX /bin/login, PAM, sudo, su, or Unix uid/gid semantics.
  • Password reset by policy fiat. Recovery is a separate authenticated setup or operator action.
  • Making authentication proofs visible to the shell, agent, logs, or ordinary application processes.

Design Principles

  • Authentication creates a UserSession; capabilities remain the authority.
  • The shell is an ordinary process launched with a broker-issued CapSet.
  • Console authentication and web authentication feed the same session model.
  • Passwords are verified against versioned password-verifier records; raw passwords are never stored, logged, or passed to the shell.
  • Passkeys store public credential material only; private keys stay in the authenticator.
  • First-boot setup requires local setup authority or an explicitly configured bootstrap credential. Remote first-come setup is not acceptable.
  • A missing credential store does not imply an unlocked system.
  • Guest and anonymous sessions are explicit policy profiles, not fallbacks for missing credentials.
  • Development images may have an explicit insecure profile, but that must be visible in the manifest and serial output.

Architecture

The boot-to-shell path is a userspace service graph started by init after the manifest executor milestone is complete:

flowchart TD
    Kernel[kernel starts init only]
    Init[init manifest executor]
    Boot[BootPackage]
    Cred[CredentialStore]
    Session[SessionManager]
    Broker[AuthorityBroker]
    Audit[AuditLog]
    Console[ConsoleLogin]
    Web[WebShellGateway]
    Launcher[RestrictedShellLauncher]
    Shell[Native text shell]

    Kernel --> Init
    Init --> Boot
    Init --> Cred
    Init --> Session
    Init --> Broker
    Init --> Audit
    Init --> Console
    Init --> Web
    Console --> Session
    Web --> Session
    Session --> Broker
    Broker --> Launcher
    Launcher --> Shell
    Cred --> Session
    Audit --> Session
    Audit --> Broker

init owns broad boot authority long enough to start the authentication and session services. It should not spawn the interactive shell directly with broad boot caps. The broker returns a narrow shell bundle such as:

terminal        TerminalSession or Console
self            UserSession metadata
status          read-only SystemStatus
logs            scoped LogReader
home            scoped Namespace or temporary Namespace
launcher        RestrictedLauncher
approval        ApprovalClient

Early builds can omit storage-backed home and use a temporary namespace. They still should not hand the shell broad BootPackage, ProcessSpawner, FrameAllocator, raw device, or global service-supervisor authority by default.

Console Login

The local console path has three states.

Password Configured

If CredentialStore has an enabled console password verifier for the selected principal or profile, ConsoleLogin prompts for the password before launching the shell.

The verifier record should be versioned:

PasswordVerifier {
  algorithm: "argon2id"
  params: { memoryKiB, iterations, parallelism, outputLen }
  salt: random bytes
  hash: verifier bytes
  createdAtMs
  credentialId
  principalId
}

Argon2id is the default target because it is memory-hard and widely reviewed. The record must include parameters so stronger settings can be introduced without invalidating older records. A deployment may add a TPM- or secret-store-backed pepper later, but the design must not depend on a pepper being present.
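A self-contained sketch of verification against the versioned record. The KDF below is a deliberately fake stand-in so the example runs anywhere; a real build would call Argon2id from a vetted implementation with the stored parameters. The algorithm dispatch (fail closed on unknown algorithms) and the constant-time comparison are the point.

```rust
/// Parameters carried in the record so they can be strengthened later.
struct Params {
    memory_kib: u32,
    iterations: u32,
    parallelism: u32,
    output_len: usize,
}

/// Sketch of the versioned verifier record from the text.
struct PasswordVerifier {
    algorithm: &'static str,
    params: Params,
    salt: Vec<u8>,
    hash: Vec<u8>,
}

/// Placeholder KDF so the sketch is self-contained. NOT memory-hard,
/// NOT secure; stands in for Argon2id only.
fn fake_kdf(password: &[u8], salt: &[u8], p: &Params) -> Vec<u8> {
    let mut out = vec![0u8; p.output_len];
    for round in 0..p.iterations {
        for (i, b) in password.iter().chain(salt.iter()).enumerate() {
            out[(i + round as usize) % p.output_len] ^= b.wrapping_add(round as u8);
        }
    }
    out
}

/// Verify an attempt: dispatch on the stored algorithm, derive with the
/// stored parameters, compare in constant time.
fn verify(rec: &PasswordVerifier, attempt: &str) -> bool {
    if rec.algorithm != "argon2id" {
        return false; // unknown algorithm: fail closed
    }
    let derived = fake_kdf(attempt.as_bytes(), &rec.salt, &rec.params);
    if derived.len() != rec.hash.len() {
        return false;
    }
    derived.iter().zip(&rec.hash).fold(0u8, |acc, (a, b)| acc | (a ^ b)) == 0
}

fn main() {
    let params = Params { memory_kib: 65536, iterations: 3, parallelism: 1, output_len: 32 };
    let salt = vec![7u8; 16];
    let hash = fake_kdf(b"hunter2", &salt, &params);
    let rec = PasswordVerifier { algorithm: "argon2id", params, salt, hash };
    assert!(verify(&rec, "hunter2"));
    assert!(!verify(&rec, "letmein"));
    println!("verifier dispatch and comparison ok");
}
```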

On failed attempts, ConsoleLogin records an audit event and applies bounded backoff. The backoff state is not a security boundary by itself, because local attackers may reboot; the password hash strength still matters.
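The bounded backoff can be as simple as the following; the constants are illustrative policy, not part of the design.

```rust
/// Sketch of bounded exponential backoff for failed console logins.
/// Doubles per failure, capped so the local console never locks out
/// forever (local attackers can reboot past the state anyway).
fn backoff_ms(failed_attempts: u32) -> u64 {
    const BASE_MS: u64 = 250;
    const CAP_MS: u64 = 30_000;
    BASE_MS
        .saturating_mul(1u64 << failed_attempts.min(16))
        .min(CAP_MS)
}

fn main() {
    assert_eq!(backoff_ms(0), 250);
    assert_eq!(backoff_ms(1), 500);
    assert_eq!(backoff_ms(20), 30_000); // capped
    println!("backoff schedule bounded");
}
```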

No Console Password

If no console password verifier exists, the console does not launch an ordinary shell. It enters setup mode.

Setup mode can:

  • create the first console password verifier,
  • enroll a first passkey for the web text shell,
  • create both credentials,
  • choose an explicit local guest or development profile if the manifest permits it.

For normal images, the setup flow must create at least one usable credential or leave the machine without an ordinary interactive shell. This matches the operator expectation: no configured password means “setup required”, not “open console”.

Passkey-Only Deployment

Passkey-only should be possible without creating a password. It still needs a bootstrap authority path.

Acceptable first-passkey bootstrap paths:

  • local console setup enrolls the first passkey and then never creates a password verifier,
  • the manifest or cloud metadata includes a predeclared passkey public credential for an operator principal,
  • the console prints a short-lived setup challenge that a web enrollment flow must redeem before registering the first passkey.

Unacceptable path:

  • the first remote browser to reach the web endpoint becomes administrator because no password exists.

If a machine is passkey-only, the local console can still expose setup, recovery, guest, or diagnostic profiles according to policy. It should not silently become an unauthenticated administrator shell.

Guest and Anonymous Profiles

The user-identity proposal distinguishes authenticated, guest, anonymous, and pseudonymous sessions. Boot-to-shell should consume that model directly.

Authenticated password login creates a human or operator UserSession with auth strength password. Authenticated passkey login normally creates a human, operator, or pseudonymous UserSession with auth strength hardwareKey. Neither proof is authority by itself; both feed the broker.

Guest is the only unauthenticated profile that belongs on the local interactive console by default. It is a deliberate SessionManager.guest() path with a local interactive affordance, weak or no authentication, short expiry, tight quotas, no durable home unless policy grants one, and a bundle such as:

terminal        TerminalSession
self            guest UserSession metadata
tmp             temporary Namespace
launcher        RestrictedLauncher(allowed = ["help", "settings"])
logs            scoped LogReader for this guest session

Guest should not receive ApprovalClient for administrative actions unless a named policy grants it. If no console password exists, setup may offer a guest session only when the manifest explicitly enables a guest profile. Otherwise the operator must create a credential or leave the ordinary shell unavailable.

Anonymous is different. It is usually remote or programmatic, has a random ephemeral principal ID, receives a smaller cap bundle than guest, and has no elevation path except “authenticate” or “create account”. It is not the console fallback for missing credentials, and it should not be counted as “booted to shell” unless the product goal is an explicitly anonymous demo.

If the web gateway later supports anonymous access, it should be a purpose-scoped workload or very restricted text terminal with no durable home, strict quotas, short expiry, and audit keyed by network context plus ephemeral session ID. It must not share the passkey setup path, because passkey-only bootstrap is a credential-enrollment flow, not anonymous access.

An empty CapSet remains the “Unprivileged Stranger” case. It is useful for attack-surface demonstration, but it is not a session profile and not a shell login mode.

Web Text Shell and Passkeys

The web shell in this milestone is a browser-hosted terminal transport, not a graphical shell. It should display the same native text shell protocol through a terminal UI and should launch the same kind of session bundle as the local console path.

Required pieces:

  • network stack and HTTP/WebSocket or equivalent streaming transport,
  • TLS or a deployment mode acceptable to browsers for WebAuthn,
  • stable relying-party ID and origin policy,
  • random challenge generation,
  • passkey credential storage,
  • user-verification policy,
  • audit and rate limiting.

Passkey credential records should store public material:

PasskeyCredential {
  credentialId
  principalId
  publicKey
  relyingPartyId
  userHandle
  signCount
  transports
  userVerificationRequired
  createdAtMs
}

The authentication flow is:

  1. Browser requests a login challenge.
  2. WebShellGateway asks SessionManager or CredentialStore for a bounded, random challenge tied to the relying-party ID and intended principal.
  3. Browser calls the platform authenticator.
  4. Gateway verifies the WebAuthn assertion, origin, challenge, credential ID, public-key signature, user-presence/user-verification flags, and sign-count behavior.
  5. SessionManager mints a UserSession with auth strength hardwareKey.
  6. AuthorityBroker returns the shell bundle for that session/profile.
  7. RestrictedShellLauncher starts the native text shell connected to the web terminal stream.

Registration requires an existing authenticated session, local setup authority, or an explicit bootstrap path. Passwordless registration is allowed; unauthenticated remote registration is not.
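The single-use, expiring challenge state implied by the flow above can be sketched with an in-memory store. Names are illustrative; the challenge bytes in a real build must come from a CSPRNG, and the record would also bind the relying-party ID and intended principal.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Sketch: WebAuthn challenge state as single-use records with expiry.
struct ChallengeStore {
    ttl: Duration,
    live: HashMap<u64, (Vec<u8>, Instant)>,
    next_id: u64,
}

impl ChallengeStore {
    fn new(ttl: Duration) -> Self {
        Self { ttl, live: HashMap::new(), next_id: 0 }
    }

    /// Issue a challenge. `random_bytes` must come from a CSPRNG.
    fn issue(&mut self, random_bytes: Vec<u8>) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.live.insert(id, (random_bytes, Instant::now()));
        id
    }

    /// Consuming removes the record, so a replayed assertion against
    /// the same challenge fails; expired challenges also fail.
    fn consume(&mut self, id: u64) -> Option<Vec<u8>> {
        let (bytes, issued) = self.live.remove(&id)?;
        if issued.elapsed() > self.ttl {
            return None;
        }
        Some(bytes)
    }
}

fn main() {
    let mut store = ChallengeStore::new(Duration::from_secs(60));
    let id = store.issue(vec![1, 2, 3]);
    assert_eq!(store.consume(id), Some(vec![1, 2, 3]));
    assert_eq!(store.consume(id), None); // single-use
    println!("challenge is single-use and bounded");
}
```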

Required Interfaces

These are ordinary capabilities, not kernel modes.

CredentialStore

Owns credential verifier records and challenge state.

Responsibilities:

  • list whether setup is required without exposing hashes,
  • create password verifier records from setup authority,
  • verify password attempts without returning the password or verifier bytes,
  • register passkey public credentials,
  • issue and consume bounded WebAuthn challenges,
  • rotate or disable credentials through an authenticated admin path.

SessionManager

Creates UserSession metadata after successful authentication, explicit local guest policy, purpose-scoped anonymous policy, or setup policy. It should record auth method, auth strength, freshness, expiry, profile, and audit context. It should not hand out broad system caps directly. Boot-to-shell uses authenticated sessions and optional local guest sessions for ordinary interactive shells; anonymous sessions are narrower remote/programmatic contexts unless a manifest explicitly defines an anonymous demo terminal.

AuthorityBroker

Maps a session/profile to a narrow CapSet. Early policy can be static and manifest-backed. The important constraint is that the broker returns capabilities, not roles or strings that downstream services treat as authority.
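An early static policy can be as small as a profile-to-bundle table. The bundle contents follow the examples elsewhere in this proposal; the profile names are illustrative, and in the real system each entry would be a capability handle, not a string.

```rust
/// Sketch of a manifest-backed static broker policy: a profile maps to
/// capability names, never to roles that downstream services interpret
/// as authority. Unknown profiles fail closed with an empty CapSet.
fn shell_bundle(profile: &str) -> Vec<&'static str> {
    match profile {
        "operator" => vec![
            "terminal", "self", "status", "logs", "home", "launcher", "approval",
        ],
        "guest" => vec!["terminal", "self", "tmp", "launcher", "logs"],
        _ => vec![],
    }
}

fn main() {
    // Guest gets a terminal and scratch space but no approval path.
    assert!(shell_bundle("guest").contains(&"tmp"));
    assert!(!shell_bundle("guest").contains(&"approval"));
    // An unrecognized profile receives nothing.
    assert!(shell_bundle("nobody").is_empty());
    println!("broker returns capabilities, not roles");
}
```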

ConsoleLogin

Consumes TerminalSession, CredentialStore, SessionManager, broker access, and a restricted shell launcher. It never receives broad boot-package or device authority unless a recovery profile explicitly grants it.

WebShellGateway

Terminates the browser terminal session, handles passkey challenge/response, and connects the authenticated session to the shell process. It should not own general administrative caps. It should ask the broker for the same narrow shell bundle as any other session.

AuditLog

Records setup entry, credential creation, failed attempts, successful session creation, broker decisions, shell launch, credential disablement, and logout. Audit entries must not include passwords, password hashes, passkey private material, bearer tokens, or complete environment dumps.

Prerequisites

Boot-to-shell should not be selected before these pieces are credible:

  • Default boot uses init-owned manifest execution; the kernel starts only init with fixed bootstrap authority.
  • init can start long-lived services and not just short smoke binaries.
  • ProcessSpawner can launch the shell and login services with exact grants.
  • A terminal input path exists. The current Console is output-oriented; login needs line input and cancellation behavior now, with paste boundaries added later.
  • The native text shell exists as a capos-rt binary with caps, inspect, call, spawn, wait, release, and basic error display.
  • Secure randomness exists for salts, session IDs, WebAuthn challenges, and setup tokens.
  • There is at least boot-config-backed credential storage. Durable credential storage can come later, but the first implementation must be honest about whether credentials survive reboot.
  • Minimal SessionManager, AuthorityBroker, and AuditLog services exist.
  • A restricted launcher or broker wrapper prevents the shell from receiving broad init authority.
  • Web text shell requires networking, HTTP/WebSocket or equivalent, TLS/origin handling, and WebAuthn verification. It can lag local console boot-to-shell.

Milestone Definition

The “Boot to Shell” milestone is complete when:

  • make run-shell or the default boot path reaches a text login/setup prompt.
  • With a configured password verifier, the console refuses the shell on a bad password and launches it on the correct password.
  • With no console password verifier, the console enters setup mode and requires creating a credential or selecting an explicitly configured local guest or development policy before launching a normal shell.
  • Guest console sessions, when enabled, are created through SessionManager.guest() and receive only terminal/tmp/restricted-launcher style caps with no administrative approval path by default.
  • Anonymous sessions are not used as the missing-password console fallback and are not accepted as proof that the ordinary boot-to-shell milestone works.
  • The shell starts with a broker-issued CapSet and can prove at least one typed capability call plus one exact-grant child spawn.
  • Audit output records setup/auth/session/broker/shell-launch events without leaking secrets.
  • The web text shell can authenticate with a registered passkey and launch the same native text shell profile.
  • A passkey-only account can be enrolled through local setup authority or an explicit bootstrap credential, with no password verifier present.
  • Graphical shell work is not part of the acceptance criteria.

Implementation Plan

  1. Text console substrate. Add TerminalSession or extend the console service enough for authenticated line input, echo control, paste/framing markers later, and cancellation.

  2. Native shell binary. Land the shell proposal’s minimal REPL over capos-rt: list CapSet entries, inspect metadata, call Console, spawn a boot-package binary, wait, release, and print typed errors.

  3. Credential store prototype. Add boot-config-backed credential records and Argon2id verification. If Argon2id is too heavy for the first kernel/userspace environment, use a host-generated verifier in the manifest only as a temporary gate and keep the milestone open until in-system verification is real.

  4. Console setup/login. Implement the configured-password path and no-password setup path. The setup code should create credential records through CredentialStore, not write ad hoc config in the shell process.

  5. Minimal session and broker. Create UserSession metadata and a static policy broker that returns a narrow shell bundle. Add a manifest-gated local guest bundle and keep anonymous bundles separate from ordinary shell login. Prove the shell cannot obtain broad boot authority by default.

  6. Audit and failure policy. Add audit records and bounded attempt backoff. Verify logs do not contain raw passwords, verifier bytes, passkey private data, or challenge secrets.

  7. Web text shell gateway. After networking and a terminal transport exist, add WebAuthn registration and authentication for the browser-hosted terminal. Support passkey-only enrollment through local setup or explicit bootstrap authority.

  8. Durability and recovery. Move credential records from boot config or RAM into a storage-backed service once storage exists. Define recovery as a credential-admin operation, not an implicit bypass.

Security Notes

  • Password hashing belongs in userspace auth services, not the kernel fast path.
  • WebAuthn challenge state must be single-use and bounded by expiry.
  • The web gateway must validate origin and relying-party ID; otherwise passkey authentication is meaningless.
  • Setup tokens are credentials. They must be short-lived, single-use, audited, and hidden from ordinary process output.
  • Credential records are sensitive even though they are not raw secrets; avoid printing them in debug logs.
  • The shell and any agent running inside it must treat logs, terminal input, files, web pages, and service output as untrusted data.

Non-Goals

  • No graphical shell in this milestone.
  • No passwordless remote first-use takeover.
  • No kernel uid, gid, root, or login mode.
  • No default shell access to broad BootPackage, raw ProcessSpawner, DeviceManager, raw storage, or global supervisor caps.
  • No authentication proof passed through command-line arguments, environment variables, shell variables, audit records, or agent prompts.

Open Questions

  • Which Argon2id parameters fit the early userspace memory budget while still resisting offline guessing?
  • Should the first credential store be manifest-backed, RAM-backed, or wait for the first storage service?
  • How should local console setup prove physical presence on cloud VMs where serial console access may itself be remote?
  • What is the first acceptable TLS/origin story for QEMU and local development WebAuthn testing?
  • Should passkey-only machines keep a disabled console password slot for later recovery, or should recovery be entirely credential-admin/passkey based?

Proposal: Symmetric Multi-Processing (SMP)

How capOS goes from single-CPU execution to utilizing all available processors.

This proposal has three phases: a per-CPU foundation (prerequisite plumbing), AP startup (bringing secondary CPUs online), and SMP correctness (making shared state safe under concurrency).

Depends on: Stage 5 (Scheduling) – needs a working timer, context switch, and run queue on the BSP before adding more CPUs.

Can proceed in parallel with: Stage 6 (IPC and Capability Transfer).


Current State

Everything is single-CPU. Specific assumptions that SMP breaks:

Component                 File                                 Assumption
Syscall stack switching   kernel/src/arch/x86_64/syscall.rs    Global SYSCALL_KERNEL_RSP / SYSCALL_USER_RSP statics
GDT, TSS, kernel stacks   kernel/src/arch/x86_64/gdt.rs        One static GDT, one TSS, one kernel stack, one double-fault stack
IDT                       kernel/src/arch/x86_64/idt.rs        Single static IDT (shareable – the same IDT can serve all CPUs)
SYSCALL MSRs              kernel/src/arch/x86_64/syscall.rs    STAR/LSTAR/SFMASK/EFER set once on BSP only
Current process           kernel/src/sched.rs                  SCHEDULER with BTreeMap<Pid, Process> + current: Option<Pid> — single global behind Mutex
Frame allocator           kernel/src/mem/frame.rs              Single global ALLOCATOR behind one spinlock
Heap allocator            kernel/src/mem/heap.rs               linked_list_allocator behind one spinlock
The comment in syscall.rs:12 already anticipates the fix: “Will be replaced by per-CPU data (swapgs) for SMP.”


Phase A: Per-CPU Foundation

Establish per-CPU data structures on the BSP. No APs are started yet – this phase makes the BSP’s own code SMP-ready so Phase B is a clean addition.

Per-CPU Data Region

Each CPU needs a private data area accessible via the GS segment base. On x86_64, swapgs switches between user-mode GS (usually zero) and kernel-mode GS (pointing to per-CPU data). The kernel sets KernelGSBase MSR on each CPU during init.

/// Per-CPU data, one instance per processor.
/// Accessed via GS-relative addressing after swapgs.
#[repr(C)]
struct PerCpu {
    /// Self-pointer for accessing the struct from GS:0.
    self_ptr: *const PerCpu,
    /// Kernel stack pointer for syscall entry (replaces SYSCALL_KERNEL_RSP).
    kernel_rsp: u64,
    /// Saved user RSP during syscall (replaces SYSCALL_USER_RSP).
    user_rsp: u64,
    /// CPU index (0 = BSP).
    cpu_id: u32,
    /// LAPIC ID (from Limine SMP info or CPUID).
    lapic_id: u32,
    /// Pointer to the currently running process on this CPU.
    current_process: *mut Process,
}

The syscall entry stub changes from:

movq %rsp, SYSCALL_USER_RSP(%rip)
movq SYSCALL_KERNEL_RSP(%rip), %rsp

to:

swapgs
movq %rsp, %gs:16          # PerCpu.user_rsp
movq %gs:8, %rsp           # PerCpu.kernel_rsp

And symmetrically on return:

movq %gs:16, %rsp          # restore user RSP
swapgs
sysretq
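As a sanity check, the GS-relative offsets hardcoded in the stub can be pinned against the struct layout with `core::mem::offset_of!`. This is a sketch: `*mut Process` is replaced with a unit pointer so the snippet stands alone.

```rust
use core::mem::offset_of;

// Mirror of the PerCpu layout above; `*mut Process` is replaced with a
// unit pointer so this sketch is self-contained.
#[repr(C)]
struct PerCpu {
    self_ptr: *const PerCpu, // %gs:0
    kernel_rsp: u64,         // %gs:8
    user_rsp: u64,           // %gs:16
    cpu_id: u32,
    lapic_id: u32,
    current_process: *mut (),
}

fn main() {
    // These must match the offsets hardcoded in the syscall stub.
    assert_eq!(offset_of!(PerCpu, self_ptr), 0);
    assert_eq!(offset_of!(PerCpu, kernel_rsp), 8);
    assert_eq!(offset_of!(PerCpu, user_rsp), 16);
    println!("syscall stub offsets match PerCpu layout");
}
```

A static assertion of this form in the kernel keeps the asm and the struct from drifting apart.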

Per-CPU GDT, TSS, and Stacks

Each CPU needs its own:

  • GDT – the TSS descriptor encodes a physical pointer to the CPU’s TSS, so each CPU needs a GDT with its own TSS entry. The segment layout (kernel CS/DS, user CS/DS) is identical across CPUs.
  • TSS – privilege_stack_table[0] (kernel stack for interrupts from Ring 3) and IST entries (double-fault stack) must be per-CPU.
  • Kernel stack – each CPU needs its own stack for syscall/interrupt handling. Current size: 16 KB (4 pages). Same size per CPU.
  • Double-fault stack – each CPU needs its own IST stack. Current size: 20 KB (5 pages).
/// Allocate and initialize per-CPU structures for one CPU.
fn init_per_cpu(cpu_id: u32, lapic_id: u32) -> &'static PerCpu {
    // Allocate kernel stack (4 pages) and double-fault stack (5 pages)
    let kernel_stack = alloc_stack(4);
    let df_stack = alloc_stack(5);

    // Create TSS with per-CPU stacks. The GDT's TSS descriptor stores the
    // TSS address, so the TSS must be 'static, not stack-local.
    let tss = Box::leak(Box::new(TaskStateSegment::new()));
    tss.privilege_stack_table[0] = kernel_stack.top();
    tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = df_stack.top();

    // Create GDT with this CPU's TSS
    let (gdt, selectors) = create_gdt(tss);

    // Allocate and populate PerCpu struct
    let per_cpu = Box::leak(Box::new(PerCpu {
        self_ptr: core::ptr::null(),  // filled below
        kernel_rsp: kernel_stack.top().as_u64(),
        user_rsp: 0,
        cpu_id,
        lapic_id,
        current_process: core::ptr::null_mut(),
    }));
    per_cpu.self_ptr = per_cpu as *const PerCpu;
    per_cpu
}

LAPIC Initialization

Stage 5 uses the 8254 PIT (100 Hz) and 8259A PIC (IRQ0 → vector 32) for preemption on the BSP. The PIT is a single shared device and cannot deliver per-CPU timer interrupts, so Phase A migrates the BSP from the PIT to the LAPIC timer before any APs come online. Full LAPIC setup is needed for:

  • Per-CPU timer – replace PIT with LAPIC timer (required for SMP)
  • IPI – inter-processor interrupts for TLB shootdown and AP startup
  • Spurious interrupt vector – must be configured per-CPU
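
One concrete detail worth pinning down before the PIT-to-LAPIC migration: the LAPIC timer's Divide Configuration Register uses a non-contiguous 4-bit encoding (bit 2 is reserved), which is easy to get wrong. A host-checkable helper sketch; the function name is assumed, the encodings follow the Intel SDM:

```rust
/// Encode a LAPIC timer divide value into the Divide Configuration
/// Register format (bits 0, 1, and 3 of the register; bit 2 is reserved).
fn lapic_dcr_encoding(divide_by: u32) -> Option<u8> {
    Some(match divide_by {
        2 => 0b0000,
        4 => 0b0001,
        8 => 0b0010,
        16 => 0b0011,
        32 => 0b1000,
        64 => 0b1001,
        128 => 0b1010,
        1 => 0b1011, // "divide by 1" is the odd one out
        _ => return None, // not an encodable divisor
    })
}

fn main() {
    assert_eq!(lapic_dcr_encoding(16), Some(0b0011));
    assert_eq!(lapic_dcr_encoding(1), Some(0b1011));
    assert_eq!(lapic_dcr_encoding(3), None);
    println!("DCR encodings ok");
}
```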

Crate Dependencies

| Crate | Purpose | no_std |
|---|---|---|
| x2apic or manual MMIO | LAPIC/IOAPIC access | yes |

The x86_64 crate (already a dependency) provides MSR access. LAPIC register access can use the existing HHDM for MMIO, or x2apic crate for the MSR-based interface.

Migration Path

Phase A is a refactor of existing single-CPU code, not an addition:

  1. Add PerCpu struct, allocate one instance for BSP
  2. Set BSP’s KernelGSBase MSR, add swapgs to syscall entry/exit
  3. Replace SYSCALL_KERNEL_RSP/SYSCALL_USER_RSP globals with GS-relative accesses
  4. Replace scheduler’s global SCHEDULER.current with PerCpu.current_process
  5. Move GDT/TSS creation into init_per_cpu(), call it for BSP
  6. Migrate BSP from PIT to LAPIC timer (PIT initialized in Stage 5)

After Phase A, the kernel still runs on one CPU but the per-CPU plumbing is in place. Existing tests (make run) continue to pass.


Phase B: AP Startup

Bring Application Processors (APs) online. Each AP runs the same kernel code with its own per-CPU state.

Limine SMP Request

Limine provides an SMP response with per-CPU records. Each record contains the LAPIC ID and a goto_address field – writing a function pointer there starts the AP at that address.

#![allow(unused)]
fn main() {
use limine::request::SmpRequest;

#[used]
#[unsafe(link_section = ".requests")]
static SMP_REQUEST: SmpRequest = SmpRequest::new();

fn start_aps() {
    let smp = SMP_REQUEST.get_response().expect("no SMP response");
    for cpu in smp.cpus() {
        if cpu.lapic_id == smp.bsp_lapic_id {
            continue; // skip BSP
        }
        let per_cpu = init_per_cpu(cpu.id, cpu.lapic_id);
        // Limine starts the AP at ap_entry with the cpu info pointer
        cpu.goto_address.write(ap_entry);
    }
}
}

AP Entry

Each AP must:

  1. Load its per-CPU GDT and TSS
  2. Load the shared IDT
  3. Set KernelGSBase MSR to its PerCpu pointer
  4. Configure SYSCALL MSRs (STAR, LSTAR, SFMASK, EFER.SCE)
  5. Initialize its LAPIC (enable, set timer, set spurious vector)
  6. Signal “ready” to BSP (atomic flag or counter)
  7. Enter the scheduler idle loop
#![allow(unused)]
fn main() {
/// AP entry point. Called by Limine with the SMP info pointer.
unsafe extern "C" fn ap_entry(info: &limine::smp::Cpu) -> ! {
    // Hypothetical lookup populated by start_aps() before the AP is released.
    let per_cpu = per_cpu_for_lapic(info.lapic_id);

    // Load this CPU's GDT + TSS
    per_cpu.gdt.load();
    unsafe { load_tss(per_cpu.selectors.tss); }

    // Shared IDT (same across all CPUs)
    idt::load();

    // Set GS base for swapgs
    unsafe { wrmsr(IA32_KERNEL_GS_BASE, per_cpu as *const _ as u64); }

    // Configure syscall MSRs (same values as BSP)
    syscall::init_msrs();

    // Initialize local APIC
    lapic::init_local();

    // Signal ready
    AP_READY_COUNT.fetch_add(1, Ordering::Release);

    // Enter scheduler
    scheduler::idle_loop();
}
}

Scheduler Integration

Stage 5 establishes a single run queue. Phase B extends it:

  • Per-CPU run queues – each CPU pulls work from its local queue. Avoids global lock contention on the scheduler hot path.
  • Global overflow queue – when a CPU’s local queue is empty, it steals from the global queue (or from other CPUs’ queues).
  • CPU affinity – optional, not needed initially. All processes are eligible to run on any CPU.
  • Idle loop – when no work is available, hlt until the next timer interrupt or IPI.

The Process struct gains a cpu field indicating which CPU it’s currently running on (or None if queued).
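
The pull order described above (local queue first, then the global overflow queue) can be modeled on the host with std types; CpuQueue and Pid are assumed names, and the kernel version would hold the local queue in PerCpu:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

type Pid = u64;

/// One CPU's local run queue; in the kernel this would live in `PerCpu`.
struct CpuQueue {
    local: VecDeque<Pid>,
}

impl CpuQueue {
    /// Local queue first (no shared lock on the hot path), then steal
    /// from the global overflow queue, then report idle.
    fn pick_next(&mut self, global: &Mutex<VecDeque<Pid>>) -> Option<Pid> {
        if let Some(pid) = self.local.pop_front() {
            return Some(pid);
        }
        global.lock().unwrap().pop_front()
    }
}

fn main() {
    let global = Mutex::new(VecDeque::from([7u64]));
    let mut cpu = CpuQueue { local: VecDeque::from([3]) };
    assert_eq!(cpu.pick_next(&global), Some(3)); // local work first
    assert_eq!(cpu.pick_next(&global), Some(7)); // then steal from global
    assert_eq!(cpu.pick_next(&global), None);    // idle: hlt until timer/IPI
    println!("queue order ok");
}
```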

Boot Sequence

BSP: kernel init (GDT, IDT, memory, heap, caps, scheduler)
BSP: init_per_cpu(0, bsp_lapic_id)
BSP: start_aps()
  AP1: ap_entry() → init GDT/TSS/LAPIC → idle_loop()
  AP2: ap_entry() → init GDT/TSS/LAPIC → idle_loop()
  ...
BSP: wait for all APs ready
BSP: load init process, schedule it
BSP: enter scheduler

Phase C: SMP Correctness

With multiple CPUs running, shared mutable state needs careful handling.

TLB Shootdown

When the kernel modifies page tables that other CPUs may have cached in their TLBs, it must send an IPI to those CPUs to invalidate the affected entries.

Scenarios requiring shootdown:

  • Process exit – unmapping user pages. Only the CPU running the process has the mapping cached, but if the process migrated recently, stale TLB entries may exist on the old CPU.
  • Shared kernel mappings – changes to the kernel half of page tables (e.g., heap growth, MMIO mappings) require all-CPU shootdown.
  • Capability-granted shared memory – if future stages allow shared memory regions between processes, modifications require targeted shootdown.

Implementation: IPI vector + bitmap of target CPUs + invlpg on each target. Linux uses a more sophisticated batching scheme, but a simple broadcast IPI with single-page invlpg is sufficient initially.

#![allow(unused)]
fn main() {
/// Flush TLB entry on all CPUs except the caller.
fn tlb_shootdown(addr: VirtAddr) {
    // Record the address to flush
    SHOOTDOWN_ADDR.store(addr.as_u64(), Ordering::Release);

    // Send IPI to all other CPUs
    lapic::send_ipi_all_excluding_self(TLB_SHOOTDOWN_VECTOR);

    // Wait for all CPUs to acknowledge
    wait_for_shootdown_ack();
}

/// IPI handler on receiving CPU.
fn handle_tlb_shootdown_ipi() {
    let addr = VirtAddr::new(SHOOTDOWN_ADDR.load(Ordering::Acquire));
    x86_64::instructions::tlb::flush(addr);
    SHOOTDOWN_ACK.fetch_add(1, Ordering::Release);
}
}
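
The Release/Acquire pairing in the snippet above is the load-bearing part: the address store must be visible to every handler before it acknowledges, and the initiator must not proceed until every target has flushed. A host model of the same handshake, with threads standing in for IPI targets:

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::thread;

static SHOOTDOWN_ADDR: AtomicU64 = AtomicU64::new(0);
static SHOOTDOWN_ACK: AtomicUsize = AtomicUsize::new(0);

fn main() {
    const TARGET_CPUS: usize = 3;

    // Initiating CPU: publish the address with Release, then "send the
    // IPI" (modeled here by spawning the handler threads).
    SHOOTDOWN_ADDR.store(0xdead_b000, Ordering::Release);

    let handlers: Vec<_> = (0..TARGET_CPUS)
        .map(|_| {
            thread::spawn(|| {
                // IPI handler: Acquire pairs with the Release store, so
                // the published address is guaranteed visible here.
                let addr = SHOOTDOWN_ADDR.load(Ordering::Acquire);
                assert_eq!(addr, 0xdead_b000); // kernel would invlpg(addr)
                SHOOTDOWN_ACK.fetch_add(1, Ordering::Release);
            })
        })
        .collect();

    // wait_for_shootdown_ack(): joins stand in for spinning on the count.
    for h in handlers {
        h.join().unwrap();
    }
    assert_eq!(SHOOTDOWN_ACK.load(Ordering::Acquire), TARGET_CPUS);
    println!("shootdown acknowledged by {TARGET_CPUS} CPUs");
}
```

The real kernel spins on the ack counter instead of joining, and must do so with interrupts enabled on the initiator, or two concurrent shootdowns can deadlock waiting on each other.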

Lock Audit

Existing spinlocks need review for SMP safety:

| Lock | Current Use | SMP Concern |
|---|---|---|
| SERIAL | COM1 output | Safe but high contention if many CPUs print. Acceptable for debug output. |
| ALLOCATOR | Frame bitmap | Hot path. Holding the lock during a full bitmap scan is O(n). Consider per-CPU free lists. |
| KERNEL_CAPS | Kernel cap table | Low contention (init only). Safe. |
| SCHEDULER.current | Single global running-process slot | Split into PerCpu.current_process in Phase A. |

Interrupt + spinlock deadlock: if CPU A holds a spinlock and takes an interrupt whose handler tries to acquire the same lock, it deadlocks. This is already noted in REVIEW.md. Fix: disable interrupts while holding any lock that interrupt handlers may also need (frame allocator, serial), either with an interrupt-safe mutex wrapper or with manual cli/sti guards around the lock.

Allocator Scaling

The frame allocator is behind a single spinlock with O(n) bitmap scan. Under SMP, this becomes a contention bottleneck.

Options (in order of complexity):

  1. Per-CPU free list cache – each CPU maintains a small cache of free frames (e.g., 64 frames). Refill from the global allocator when empty, return batch when full. Reduces lock acquisitions by ~64x.
  2. Region partitioning – divide physical memory into per-CPU regions. Each CPU owns a bitmap partition. Cross-CPU allocation falls back to a global lock. More complex, better NUMA behavior (future).

Option 1 is recommended for initial SMP. ~50-100 lines.
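
A host-side sketch of Option 1, with a Mutex<Vec> standing in for the bitmap allocator (BATCH, FrameCache, and the hoarding threshold are assumed values):

```rust
use std::sync::Mutex;

type Frame = u64; // physical frame number

const BATCH: usize = 64;

/// Per-CPU cache: the global lock is taken once per BATCH allocations
/// instead of once per frame.
struct FrameCache {
    local: Vec<Frame>,
}

impl FrameCache {
    fn alloc(&mut self, global: &Mutex<Vec<Frame>>) -> Option<Frame> {
        if self.local.is_empty() {
            // Refill: one lock acquisition pulls up to BATCH frames.
            let mut g = global.lock().unwrap();
            let take = BATCH.min(g.len());
            self.local.extend(g.drain(..take));
        }
        self.local.pop()
    }

    fn free(&mut self, frame: Frame, global: &Mutex<Vec<Frame>>) {
        self.local.push(frame);
        if self.local.len() > 2 * BATCH {
            // Return a batch so one CPU cannot hoard all free frames.
            let mut g = global.lock().unwrap();
            g.extend(self.local.drain(..BATCH));
        }
    }
}

fn main() {
    let global = Mutex::new((0..200u64).collect::<Vec<_>>());
    let mut cache = FrameCache { local: Vec::new() };
    let f = cache.alloc(&global).unwrap();
    assert_eq!(global.lock().unwrap().len(), 200 - BATCH); // one refill
    cache.free(f, &global);
    assert!(cache.alloc(&global).is_some()); // served from the local cache
    println!("frame cache ok");
}
```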

The heap allocator (linked_list_allocator) is also behind a single lock. For a research OS this is acceptable initially – heap allocations in the kernel should be infrequent compared to frame allocations.


Cap’n Proto Schema Additions

SMP introduces a kernel-internal CpuManager capability for inspecting and controlling CPU state. This is not exposed to userspace initially but follows the “everything is a capability” principle.

interface CpuManager {
    # Number of online CPUs.
    cpuCount @0 () -> (count :UInt32);

    # Per-CPU info.
    cpuInfo @1 (cpuId :UInt32) -> (lapicId :UInt32, online :Bool);
}

This capability would be held by init (or a system monitor process) for diagnostics. It’s additive and can be deferred until the mechanism is useful.


Estimated Scope

| Phase | New/Changed Code | Depends On |
|---|---|---|
| Phase A: Per-CPU foundation | ~300-400 lines (PerCpu struct, swapgs migration, per-CPU GDT/TSS) | Stage 5 |
| Phase B: AP startup | ~200-300 lines (SmpRequest, ap_entry, scheduler extension) | Phase A |
| Phase C: SMP correctness | ~200-300 lines (TLB shootdown, allocator cache, lock audit) | Phase B |
| Total | ~700-1000 lines | |

Milestones

  • M1: Per-CPU data on BSP – swapgs-based syscall entry, per-CPU GDT/TSS, global current-process state split. Single CPU still. make run passes.
  • M2: APs running – secondary CPUs reach idle_loop(). BSP prints “N CPUs online”. make run still runs init on BSP.
  • M3: Multi-CPU scheduling – processes can run on any CPU. The existing boot-manifest service set still works, but the scheduler distributes work across CPUs once runnable processes are available (runtime spawning still depends on ProcessSpawner).
  • M4: TLB shootdown – page table modifications are safe across CPUs. Process exit on one CPU doesn’t leave stale mappings on others.

Open Questions

  1. LAPIC vs x2APIC. Modern hardware supports x2APIC (MSR-based, no MMIO). Should we require x2APIC, support both, or start with xAPIC? QEMU supports both. x2APIC is simpler (no MMIO mapping needed).

  2. Idle strategy. hlt is the simplest idle. mwait is more power-efficient and can be used to wake on memory writes. Overkill for QEMU, but worth noting for future hardware targets.

  3. CPU hotplug. Limine starts all CPUs at boot. Runtime CPU online/offline is a future concern, not needed initially.

  4. NUMA awareness. Multi-socket systems have non-uniform memory access. Per-CPU frame allocator regions could be NUMA-aware. Deferred – QEMU emulates flat memory by default.

  5. Scheduler policy. Round-robin per-CPU queues with global overflow is the simplest starting point. Work stealing, priority scheduling, and CPU affinity are future refinements.


References

Specifications

  • Limine

Prior Art

  • Redox SMP – per-CPU contexts, LAPIC timer, IPI-based TLB shootdown
  • xv6-riscv SMP – minimal multi-core OS, clean per-CPU implementation
  • Hermit SMP – Rust unikernel with SMP support via per-core data and APIC
  • BlogOS – educational x86_64 Rust OS (single-CPU, but good APIC coverage)

Other Proposals

This page keeps the mdBook sidebar compact by grouping proposal documents that are not listed individually in the main Design Proposals section.

Active Support Proposals

| Proposal | Status | Purpose |
|---|---|---|
| mdBook Documentation Site | Partially implemented | Defines the documentation site structure, status vocabulary, and curation rules for architecture, proposal, security, and research pages. |

Future Runtime and Deployment

| Proposal | Status | Purpose |
|---|---|---|
| Go Runtime | Future design | Plans a custom GOOS=capos userspace port and runtime services for Go programs. |
| Cloud Metadata | Future design | Describes cloud bootstrap inputs and manifest deltas without importing cloud-init. |
| Cloud Deployment | Future design | Tracks hardware abstraction, cloud VM support, storage/network dependencies, and aarch64 deployment direction. |
| Browser/WASM | Future design | Explores a browser-hosted capOS model using WebAssembly and workers. |

Future Security, Policy, and Lifecycle

| Proposal | Status | Purpose |
|---|---|---|
| User Identity and Policy | Future design | Defines user/session identity and policy layers over capability grants. |
| System Monitoring | Future design | Defines scoped observability capabilities for logs, metrics, traces, health, status, crash records, and audit. |
| Formal MAC/MIC | Future design | Defines a formal access-control and integrity model for later proof work. |
| Live Upgrade | Future design | Designs service replacement while preserving handles, calls, and authority. |
| GPU Capability | Future design | Sketches isolated GPU device, memory, and compute authority. |

Rejected

| Proposal | Status | Purpose |
|---|---|---|
| Cap’n Proto SQE Envelope | Rejected | Records the rejected idea of encoding SQEs themselves as Cap’n Proto messages. |
| Sleep(INF) Process Termination | Rejected | Records the rejected idea of using infinite sleep as process termination. |

Proposal: mdBook Documentation Site

Turn the existing Markdown documentation into a navigable mdBook site that explains capOS as a working system, while keeping proposals and research as deep reference material.

The current docs are useful for agents and maintainers who already know what they are looking for. They are weaker as a reader path: a new contributor has to jump between README.md, ROADMAP.md, WORKPLAN.md, proposal files, research reports, and source code before they can form an accurate model of the system. The mdBook site should fix that by adding a concise, current system manual above the existing archive.

Status: Partially implemented. The book structure exists and is deployed; this proposal remains the editorial contract for keeping the site navigable and honest about current versus future behavior.

Goals

  • Make the first reading path obvious: what capOS is, how to build it, what works today, and where the important subsystems live.
  • Separate implemented behavior from future design, rejected ideas, and research background.
  • Preserve existing long-form proposal and research documents instead of rewriting them prematurely.
  • Give architecture pages a repeatable structure so future edits do not turn into ad hoc status notes.
  • Make validation visible: each architecture page should name the host tests, QEMU smokes, fuzz targets, Kani proofs, Loom models, or manual checks that support its claims.
  • Keep the docs useful from a local clone, without requiring hosted services, databases, or custom frontend code.

Non-Goals

  • Replacing ROADMAP.md, WORKPLAN.md, or REVIEW_FINDINGS.md. Those files remain operational planning documents.
  • Turning proposals into user manuals by bulk editing every existing document. Long proposal files stay as references until a subsystem needs a targeted refresh.
  • Building a marketing site, blog, changelog, or public product page.
  • Adding MDX, React, Vue, custom components, or a JavaScript application layer.
  • Automatically generating API reference documentation from Rust or Cap’n Proto. That can be evaluated later as a separate documentation track.

Audience

The site should serve three readers:

  • New contributor: wants to build the ISO, boot QEMU, understand the current architecture, and find the right files to edit.
  • Reviewer: wants to verify whether a change preserves the intended ownership, authority, lifecycle, and validation rules.
  • Future agent: wants current project context without having to infer the system from stale proposals or source code alone.

The primary audience is maintainers and agents, not end users. This matters: accuracy, status labels, and code maps are more important than a polished external landing page.

Current State

The repository already has a substantial Markdown corpus:

  • README.md explains the project and core commands.
  • ROADMAP.md describes long-range stages and visible milestones.
  • WORKPLAN.md tracks the selected milestone and active implementation order.
  • REVIEW_FINDINGS.md tracks open remediation and verification history.
  • docs/capability-model.md is a real architecture reference.
  • docs/proposals/ contains accepted, future, exploratory, and rejected design material.
  • docs/research.md and docs/research/ contain prior-art analysis.
  • docs/*-design.md and inventory files capture targeted design/security decisions.

The weakness is not lack of content. The weakness is lack of a stable reader path and status model.

Site Shape

The mdBook site should be structured as a book, not as a mirror of the file tree.

# Summary

[Introduction](index.md)

# Start Here
- [What capOS Is](overview.md)
- [Current Status](status.md)
- [Build, Boot, and Test](build-run-test.md)
- [Repository Map](repo-map.md)

# System Architecture
- [Boot Flow](architecture/boot-flow.md)
- [Process Model](architecture/process-model.md)
- [Capability Model](capability-model.md)
- [Capability Ring](architecture/capability-ring.md)
- [IPC and Endpoints](architecture/ipc-endpoints.md)
- [Userspace Runtime](architecture/userspace-runtime.md)
- [Manifest and Service Startup](architecture/manifest-startup.md)
- [Memory Management](architecture/memory.md)
- [Scheduling](architecture/scheduling.md)

# Security and Verification
- [Trust Boundaries](security/trust-boundaries.md)
- [Verification Workflow](security/verification-workflow.md)
- [Panic Surface Inventory](panic-surface-inventory.md)
- [Trusted Build Inputs](trusted-build-inputs.md)
- [DMA Isolation](dma-isolation-design.md)
- [Authority Accounting](authority-accounting-transfer-design.md)

# Design Proposals
- [Proposal Index](proposals/index.md)
- [Service Architecture](proposals/service-architecture-proposal.md)
- [Storage and Naming](proposals/storage-and-naming-proposal.md)
- [Networking](proposals/networking-proposal.md)
- [Error Handling](proposals/error-handling-proposal.md)
- [Userspace Binaries](proposals/userspace-binaries-proposal.md)
- [Shell](proposals/shell-proposal.md)
- [SMP](proposals/smp-proposal.md)
- [Other Proposals](proposals/other.md)
  - [Security and Verification](proposals/security-and-verification-proposal.md)
  - [mdBook Documentation Site](proposals/mdbook-docs-site-proposal.md)
  - [Go Runtime](proposals/go-runtime-proposal.md)
  - [User Identity and Policy](proposals/user-identity-and-policy-proposal.md)
  - [Cloud Metadata](proposals/cloud-metadata-proposal.md)
  - [Cloud Deployment](proposals/cloud-deployment-proposal.md)
  - [Live Upgrade](proposals/live-upgrade-proposal.md)
  - [GPU Capability](proposals/gpu-capability-proposal.md)
  - [Formal MAC/MIC](proposals/formal-mac-mic-proposal.md)
  - [Browser/WASM](proposals/browser-wasm-proposal.md)
  - [Rejected: Cap'n Proto SQE Envelope](proposals/rejected-capnp-ring-sqe-proposal.md)

# Research
- [Research Index](research.md)
- [seL4](research/sel4.md)
- [Zircon](research/zircon.md)
- [Genode](research/genode.md)
- [Plan 9 and Inferno](research/plan9-inferno.md)
- [EROS, CapROS, Coyotos](research/eros-capros-coyotos.md)
- [LLVM Target](research/llvm-target.md)
- [Cap'n Proto Error Handling](research/capnp-error-handling.md)
- [OS Error Handling](research/os-error-handling.md)
- [IX-on-capOS Hosting](research/ix-on-capos-hosting.md)
- [Out-of-Kernel Scheduling](research/out-of-kernel-scheduling.md)

The exact page list may change during implementation, but the hierarchy should stay stable:

  • Start Here: reader orientation and commands.
  • System Architecture: current implementation, with code maps and invariants.
  • Security and Verification: threat boundaries, validation workflow, and security inventories.
  • Design Proposals: accepted/future/rejected design documents.
  • Research: prior art and its consequences for capOS.

Page Standard

Every architecture page should use this shape:

# Page Title

What problem this subsystem solves and why a reader should care.

**Status:** Implemented / Partially implemented / Proposal / Research note.

## Current Behavior
What exists in the repo today.

## Design
How it works, with concrete data flow.

## Invariants
Security, lifetime, ownership, ordering, or failure rules.

## Code Map
Important files and entry points.

## Validation
Relevant host tests, QEMU smokes, fuzz/Kani/Loom checks.

## Open Work
Concrete known gaps, linked to WORKPLAN or REVIEW_FINDINGS when relevant.

Architecture pages should normally stay between 100 and 300 lines. Longer background belongs in proposals or research reports.

Status Vocabulary

Use explicit inline status labels, placed after the lead paragraph near the top of proposal, research, and architecture pages, whenever the label helps distinguish current behavior from future or rejected design:

  • Implemented: behavior exists in the mainline code and has validation.
  • Partially implemented: some behavior exists, but the page also describes missing work.
  • Accepted design: intended direction, not fully implemented.
  • Future design: plausible direction, not selected for near-term work.
  • Rejected: explicitly not the chosen direction.
  • Research note: background used to inform design, not a direct plan.

Avoid status labels on orientation, index, command-reference, and workflow pages where the sidebar section or title already gives the page role. Avoid ambiguous language like “planned” without a stage, dependency, or status label. When a page mixes current and future behavior, split those sections.

Content Rules

  • Start with operational facts, not motivation.
  • Prefer concrete nouns: process, cap table, ring, endpoint, manifest, init, QEMU smoke.
  • Name source files when a claim depends on implementation.
  • State authority and ownership rules explicitly.
  • State failure behavior explicitly.
  • Link to proposals and research instead of duplicating long rationale.
  • Keep ROADMAP.md and WORKPLAN.md as planning sources, not as content to paste into the book.
  • Do not describe behavior as implemented unless validation exists or the code map makes the claim directly checkable.
  • Do not bury current limitations at the bottom of a long proposal.

Proposal Index

docs/proposals/index.md should classify proposal files instead of listing them alphabetically. A useful first classification:

  • Active or near-term:
    • service architecture
    • storage and naming
    • error handling
    • security and verification
  • Future architecture:
    • networking
    • SMP
    • userspace binaries
    • shell
    • user identity and policy
    • cloud metadata
    • cloud deployment
    • live upgrade
    • GPU capability
    • formal MAC/MIC
    • browser/WASM
  • Rejected or superseded:
    • rejected Cap’n Proto ring SQE envelope

Each proposal entry should have a one-sentence purpose and a status label.

Research Index

docs/research.md should remain the top-level research index, but it should gain a short “Design consequences for capOS” section near the top. Readers should not need to read every long report to learn which ideas were accepted.

Each long research report should eventually end with:

## Used By

- Architecture or proposal page that relies on this research.
- Concrete design decision influenced by this report.

Diagrams

Use Mermaid only where it clarifies flow or authority:

  • boot flow: firmware, Limine, kernel, manifest, init
  • capability ring: SQE submission, cap_enter, CQE completion
  • endpoint IPC: client CALL, server RECV, server RETURN
  • manifest startup: boot package, init, ProcessSpawner, child caps

Avoid diagrams that duplicate file layout or become stale when a function is renamed. Every diagram should have nearby text that states the same key invariant in prose.

Migration Plan

Phase 1: Skeleton and Reader Path

  • Add book.toml with docs as the source directory and output under target/docs-site.
  • Add docs/SUMMARY.md.
  • Add docs/index.md.
  • Add docs/overview.md.
  • Add docs/status.md.
  • Add docs/build-run-test.md.
  • Add docs/repo-map.md.
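
Under the assumptions in this list (docs as the source directory, output under target/docs-site), a minimal book.toml could look like the following sketch:

```toml
[book]
title = "capOS Documentation"
src = "docs"

[build]
# Keep generated output with other build artifacts, out of the docs tree.
build-dir = "target/docs-site"
```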

Acceptance criteria:

  • mdbook build succeeds.
  • The first section explains what capOS is, how to build it, how to boot it, and where to find the major code areas.
  • Existing proposal and research files are reachable through the sidebar.

Phase 2: Current Architecture Pages

  • Add the first architecture pages:
    • boot flow
    • process model
    • capability ring
    • IPC and endpoints
    • userspace runtime
    • manifest and service startup
    • memory management
    • scheduling
  • Keep docs/capability-model.md as a first-class architecture page.

Acceptance criteria:

  • Each architecture page has status, current behavior, invariants, code map, validation, and open work.
  • Each page distinguishes implemented behavior from future design.
  • At least boot flow, capability ring, IPC, and manifest startup include a concise Mermaid diagram.

Phase 3: Security and Verification Pages

  • Add docs/security/trust-boundaries.md.
  • Add docs/security/verification-workflow.md.
  • Link existing inventories and designs from the security section.
  • Make each security page name the relevant validation commands and review documents.

Acceptance criteria:

  • A reviewer can find the hostile-input boundaries, trusted inputs, and verification workflow without reading all proposals.
  • The security section links to REVIEW.md, REVIEW_FINDINGS.md, docs/trusted-build-inputs.md, and docs/panic-surface-inventory.md.

Phase 4: Proposal and Research Curation

  • Add docs/proposals/index.md.
  • Add docs/proposals/other.md only if the sidebar would otherwise become too noisy.
  • Add status labels to proposal files as they are touched.
  • Add “Used By” sections to research files incrementally.

Acceptance criteria:

  • Proposal status is visible before a reader opens a long document.
  • Rejected and future proposals are not confused with implemented behavior.
  • Research pages point back to the architecture or proposal pages they influence.

Maintenance Rules

  • When implementation changes a subsystem, update the corresponding architecture page in the same change when the page would otherwise become misleading.
  • When a proposal is accepted, rejected, or partially implemented, update its status and the proposal index.
  • When WORKPLAN.md changes the selected milestone, update docs/status.md only if the public current-system summary changes. Do not mirror every operational task into the docs site.
  • When validation commands change, update docs/build-run-test.md and the affected architecture page.

Tooling Follow-Up

The content proposal assumes mdBook because it matches the repo’s Rust toolchain and plain Markdown corpus. A minimal tooling follow-up should add:

  • book.toml
  • make docs
  • make docs-serve
  • optional link checking after the first site build is stable

Do not add a frontend package manager, theme framework, or generated site assets unless the content structure proves insufficient.

Open Questions

  • Should ROADMAP.md, WORKPLAN.md, and REVIEW_FINDINGS.md be included in the mdBook sidebar, or only linked from status.md?
  • Should long proposal files keep their current filenames, or should accepted designs eventually move from docs/proposals/ into docs/architecture/?
  • Should docs/status.md be manually maintained, or generated from a smaller checked-in status data file later?
  • Should Cap’n Proto schema documentation be generated into the book once the interface surface stabilizes?

The first implementation commit should be deliberately small:

  1. Add mdBook config.
  2. Add SUMMARY.md.
  3. Add the Start Here pages.
  4. Link existing proposal and research files without rewriting them.
  5. Verify mdbook build.

That gives the project a usable docs site quickly, without blocking on a full architecture rewrite.

Proposal: Go Language Support via Custom GOOS

Running Go programs natively on capOS by implementing a GOOS=capos target in the Go runtime.

Motivation

Go is the implementation language of CUE, the configuration language planned for system manifests. Beyond CUE, Go has a large ecosystem of systems software (container runtimes, network tools, observability agents) that would be valuable to run on capOS without rewriting.

The userspace-binaries proposal (Part 3) places Go in Tier 4 (“managed runtimes, much later”) and suggests WASI as the pragmatic path. This proposal explores the native alternative: a custom GOOS=capos that lets Go programs run directly on capOS hardware, without a WASM interpreter in between.

Why Go is Hard

Go’s runtime is a userspace operating system. It manages its own:

  • Goroutine scheduler — M:N threading (M OS threads, N goroutines), work-stealing, preemption via signals or cooperative yield points
  • Garbage collector — concurrent, tri-color mark-sweep, requires write barriers, stop-the-world pauses, and memory management syscalls
  • Stack management — segmented/copying stacks with guard pages, grow/shrink on demand
  • Network poller — epoll/kqueue-based async I/O for net.Conn
  • Memory allocator — mmap-based, spans, mcache/mcentral/mheap hierarchy
  • Signal handling — goroutine preemption, crash reporting, profiling

Each of these assumes a specific OS interface. The Go runtime calls ~40 distinct syscalls on Linux. capOS currently has 2.

Syscall Surface Required

The Go runtime’s Linux syscall usage, grouped by subsystem:

Memory Management (critical, blocks everything)

| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Heap allocation | mmap(MAP_ANON) | FrameAllocator cap + page table manipulation |
| Heap deallocation | munmap | Unmap + free frames |
| Stack guard pages | mmap(PROT_NONE) + mprotect | Map page with no permissions |
| GC needs contiguous arenas | mmap with hints | Allocate contiguous frames, map contiguously |
| Commit/decommit pages | madvise(DONTNEED) | Unmap or zero pages |

capOS needs: A sys_mmap-like capability or syscall that can:

  • Map anonymous pages at arbitrary user addresses
  • Set per-page permissions (R, W, X, none)
  • Allocate contiguous virtual ranges
  • Decommit without unmapping (for GC arena management)

This could be a VirtualMemory capability:

interface VirtualMemory {
    # Map anonymous pages at hint address (0 = kernel chooses)
    map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    # Unmap pages
    unmap @1 (addr :UInt64, size :UInt64) -> ();
    # Change permissions on mapped range
    protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
    # Decommit (release physical frames, keep virtual range reserved)
    decommit @3 (addr :UInt64, size :UInt64) -> ();
}

Threading (critical for goroutines)

| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Create OS thread | clone(CLONE_THREAD) | Thread cap or in-process threading primitive |
| Thread-local storage | arch_prctl(SET_FS) | Per-thread FS base (kernel sets on context switch) |
| Block thread | futex(WAIT) | Futex cap or kernel-side futex |
| Wake thread | futex(WAKE) | Futex cap |
| Thread exit | exit(thread) | Thread exit syscall |

capOS needs: Threading support within a process. Options:

Option A: Kernel threads. The kernel manages threads (multiple execution contexts sharing one address space). Each thread has its own stack, register state, and FS base, but shares page tables and cap table with the process. This is what Linux does and what Go expects.

Option B: User-level threading. The process manages its own threads (like green threads). The kernel only sees one execution context per process. Go’s scheduler already does M:N threading, so it could work with a single OS thread per process — but the GC’s stop-the-world relies on being able to stop other OS threads, and the network poller blocks an OS thread.

Option A is simpler for Go compatibility. Option B is more capability-aligned (threads are a process-internal concern) but requires Go runtime modifications.

Synchronization

| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Futex wait | futex(FUTEX_WAIT) | Futex authority cap, ABI selected by measurement |
| Futex wake | futex(FUTEX_WAKE) | Futex authority cap, ABI selected by measurement |
| Atomic compare-and-swap | CPU instructions | Already available (no kernel support needed) |

Futexes are a kernel primitive (block/wake on a userspace address). capOS should expose futex authority through a capability from the start. The ABI is still a measurement question: generic capnp/ring method if overhead is close to a compact path, otherwise a compact capability-authorized operation.
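
Whatever ABI wins the measurement, the semantics the Go runtime depends on are narrow: wait(addr, expected) must block only if the word still holds expected, checked atomically against wake so a wakeup between the check and the sleep cannot be lost. A host model of that contract for a single futex word (type and method names assumed):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// One futex word plus the kernel-side wait queue for it.
struct Futex {
    word: AtomicU32,
    queue: Mutex<()>,
    cv: Condvar,
}

impl Futex {
    /// FUTEX_WAIT: sleep only while the word still holds `expected`.
    fn wait(&self, expected: u32) {
        let mut guard = self.queue.lock().unwrap();
        // Re-check under the lock: if a wake already changed the word,
        // return immediately instead of sleeping (no lost wakeup).
        while self.word.load(Ordering::Acquire) == expected {
            guard = self.cv.wait(guard).unwrap();
        }
    }

    /// FUTEX_WAKE: update the word, then wake waiters, under the same lock.
    fn wake_all(&self, new: u32) {
        let _guard = self.queue.lock().unwrap();
        self.word.store(new, Ordering::Release);
        self.cv.notify_all();
    }
}

fn main() {
    let f = Arc::new(Futex {
        word: AtomicU32::new(0),
        queue: Mutex::new(()),
        cv: Condvar::new(),
    });
    let waiter = {
        let f = Arc::clone(&f);
        thread::spawn(move || f.wait(0)) // blocks while word == 0
    };
    f.wake_all(1); // waiter observes word != 0 and returns
    waiter.join().unwrap();
    println!("futex handshake ok");
}
```

Taking the queue lock in wake_all is what closes the race: a waiter between its word check and its sleep still holds the lock, so the wake cannot slip past it.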

Time

| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Monotonic clock | clock_gettime(MONOTONIC) | Timer cap .now() |
| Wall clock | clock_gettime(REALTIME) | Timer cap or RTC driver |
| Sleep | nanosleep or futex with timeout | Timer cap .sleep() or futex timeout |
| Timer events | timer_create / timerfd | Timer cap with callback or poll |
Timer cap already planned. Go needs monotonic time for goroutine scheduling and wall time for time.Now().

I/O

| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Network I/O | epoll_create, epoll_ctl, epoll_wait | Async cap invocation or poll cap |
| File I/O | read, write, open, close | Namespace + Store caps (via POSIX layer) |
| Stdout/stderr | write(1, ...), write(2, ...) | Console cap |
| Pipe (runtime internal) | pipe2 | IPC caps or in-process channel |
Go’s network poller (netpoll) is pluggable per-OS — each GOOS provides its own implementation. For capOS, it would use async capability invocations or a polling interface over socket caps.

Signals (for preemption)

| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Goroutine preemption | tgkill + SIGURG | Thread preemption mechanism |
| Crash handling | sigaction(SIGSEGV) | Page fault notification |
| Profiling | sigaction(SIGPROF) + setitimer | Profiling cap (optional) |

Go 1.14+ uses asynchronous preemption: the runtime sends SIGURG to a thread to interrupt a long-running goroutine. On capOS, alternatives:

  • Cooperative preemption only. Go inserts yield points at function prologues and loop back-edges. This works but means tight loops without function calls won’t yield. Acceptable for initial support.
  • Timer interrupt notification. The kernel notifies the process (via a cap invocation or a signal-like mechanism) when a time quantum expires. The notification handler in the Go runtime triggers goroutine preemption.

Implementation Strategy

Phase 1: Minimal GOOS (single-threaded, cooperative)

Fork the Go toolchain, add GOOS=capos GOARCH=amd64. Implement the minimum runtime changes:

What to implement:

  • osinit() — read Timer cap from CapSet for monotonic clock
  • sysAlloc/sysFree/sysReserve/sysMap — translate to VirtualMemory cap
  • newosproc() — stub (single OS thread, M:N scheduler still works with M=1)
  • futexsleep/futexwake — spin-based fallback (no real futex yet)
  • nanotime/walltime — Timer cap
  • write() (for runtime debug output) — Console cap
  • exit/exitThread — sys_exit
  • netpoll — stub returning “nothing ready” (no async I/O)

What to stub/disable:

  • Signals (no SIGURG preemption, cooperative only)
  • Multi-threaded GC (single-thread STW is fine initially)
  • CGo (no C interop)
  • Profiling
  • Core dumps

Deliverable: GOOS=capos go build ./cmd/hello produces an ELF that runs on capOS, prints “Hello, World!”, and exits.

Estimated effort: ~2000-3000 lines of Go runtime code (mostly in runtime/os_capos.go, runtime/sys_capos_amd64.s, runtime/mem_capos.go). Reference: runtime/os_js.go (WASM target) is ~400 lines; runtime/os_linux.go is ~700 lines. capOS sits between these.

Phase 2: Kernel Threading + Futex

Add kernel support for:

  • Multiple threads per process (shared address space, separate stacks)
  • Futex authority capability and measured wait/wake ABI
  • FS base per-thread (for goroutine-local storage)

Update Go runtime:

  • newosproc() creates a real kernel thread
  • futexsleep/futexwake use the selected futex capability ABI
  • GC runs concurrently across threads
  • Enable GOMAXPROCS > 1

Deliverable: Go programs use multiple CPU cores. GC is concurrent.

Phase 3: Network Poller

Implement runtime/netpoll_capos.go:

  • Register socket caps with the poller
  • Use an async notification mechanism (capability-based poll() or notification cap)
  • net.Dial(), net.Listen(), http.Get() work

This depends on the networking stack being available as capabilities.

Deliverable: Go HTTP client/server runs on capOS.

Phase 4: CUE on capOS

With Go working, CUE runs natively. This enables:

  • Runtime manifest evaluation (not just build-time)
  • Dynamic service reconfiguration via CUE expressions
  • CUE-based policy enforcement in the capability layer

Kernel Prerequisites

| Prerequisite | Roadmap Stage | Why |
|---|---|---|
| Capability syscalls | Stage 4 (sync path done) | Go runtime invokes caps (VirtualMemory, Timer, Console) |
| Scheduling | Stage 5 (core done) | Go needs timer interrupts for goroutine preemption fallback |
| IPC + cap transfer | Stage 6 | Go programs are service processes that export/import caps |
| VirtualMemory capability | Stage 5 | mmap equivalent for Go’s memory allocator and GC |
| Thread support | Extends Stage 5 | Multiple execution contexts per process |
| Futex authority capability | Extends Stage 5 | Go runtime synchronization |

VirtualMemory Capability

This is the biggest new kernel primitive. Go’s allocator requires:

  1. Reserve large virtual ranges without committing physical memory (Go reserves 256 TB of virtual space on 64-bit systems)
  2. Commit pages within reserved ranges (back with physical frames)
  3. Decommit pages (release frames, keep virtual range reserved)
  4. Set permissions (RW for data, none for guard pages)

The existing page table code (kernel/src/mem/paging.rs) supports mapping and unmapping individual pages. It needs to be extended with:

  • Virtual range reservation (mark ranges as reserved in some bitmap/tree)
  • Lazy commit (map as PROT_NONE initially, page fault handler commits on demand — or explicit commit via cap call)
  • Permission changes on existing mappings
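The reserve/commit distinction above can be sketched as bookkeeping separate from the page tables. This is a hedged illustration only; `VmSpace` and its methods are hypothetical names, and the real kernel would hang this state off kernel/src/mem/paging.rs rather than host collections. The invariant it shows: reservations never overlap, and only pages inside a reserved range may be committed to physical frames.

```rust
use std::collections::{BTreeMap, BTreeSet};

const PAGE: u64 = 4096;

/// Sketch of VirtualMemory-cap bookkeeping: reserved virtual ranges with
/// no frames behind them, plus the set of pages that are committed.
struct VmSpace {
    // start -> end (exclusive); ranges are kept non-overlapping.
    reserved: BTreeMap<u64, u64>,
    // Page-aligned addresses currently backed by physical frames.
    committed: BTreeSet<u64>,
}

impl VmSpace {
    fn new() -> Self {
        VmSpace { reserved: BTreeMap::new(), committed: BTreeSet::new() }
    }

    /// Reserve [start, start+len) without committing frames. Fails on overlap.
    fn reserve(&mut self, start: u64, len: u64) -> bool {
        if let Some((_, &end)) = self.reserved.range(..start + len).next_back() {
            if end > start {
                return false; // overlaps an existing reservation
            }
        }
        self.reserved.insert(start, start + len);
        true
    }

    /// Commit one page. Only legal inside a reserved range.
    fn commit(&mut self, addr: u64) -> bool {
        let inside = self
            .reserved
            .range(..=addr)
            .next_back()
            .map_or(false, |(&s, &e)| addr >= s && addr + PAGE <= e);
        if inside {
            self.committed.insert(addr);
        }
        inside
    }

    /// Release the frame but keep the virtual range reserved.
    fn decommit(&mut self, addr: u64) -> bool {
        self.committed.remove(&addr)
    }
}
```

Go's allocator leans on reserve being nearly free and decommit being cheap; whatever structure replaces the BTreeMap here has to keep both operations off the page-fault hot path.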

Thread Support

Extending the process model (kernel/src/process.rs). See the SMP proposal for the PerCpu struct layout (per-CPU kernel stack, saved registers, FS base); Thread extends this for multi-thread-per-process. See also the In-Process Threading section in ROADMAP.md for the roadmap-level view.

struct Process {
    pid: u64,
    address_space: AddressSpace,  // shared by all threads
    caps: CapTable,               // shared by all threads
    threads: Vec<Thread>,
}

struct Thread {
    tid: u64,
    state: ThreadState,
    kernel_stack: VirtAddr,
    saved_regs: RegisterState,    // rsp, rip, etc.
    fs_base: u64,                 // for thread-local storage
}

The scheduler (Stage 5) schedules threads, not processes. Each thread gets its own kernel stack and register save area. Context switch saves/restores thread state. Page table switch only happens when switching between threads of different processes.

Alternative: Go via WASI

For comparison, the WASI path from the userspace-binaries proposal:

|  | Native GOOS | WASI |
|---|---|---|
| Performance | Native speed | ~2-5x overhead (wasm interpreter/JIT) |
| Go compatibility | Full (after Phase 3) | Limited (WASI Go support is experimental) |
| Goroutines | Real M:N scheduling | Single-threaded (WASI has no threads yet) |
| Net I/O | Native async via poller | Blocking only (WASI sockets are sync) |
| Kernel work | VirtualMemory, threads, futex | None (wasm runtime handles it) |
| Go runtime fork | Yes (maintain a fork) | No (upstream GOOS=wasip1) |
| GC | Full concurrent GC | Conservative GC (wasm has no stack scanning) |
| Maintenance burden | High (track Go releases) | Low (upstream supported) |

WASI is easier but limited. Go on WASI (GOOS=wasip1) is officially supported but experimental — no goroutine parallelism, no async I/O, limited stdlib. For running CUE (which is CPU-bound evaluation, no I/O, single goroutine), WASI might be sufficient.

Native GOOS is harder but complete. Full Go with goroutines, concurrent GC, network I/O, and the entire stdlib. Required for Go network services or anything using net/http.

Recommendation: Start with WASI for CUE evaluation (Phase 4 of the WASI proposal in userspace-binaries). If Go network services become a goal, invest in the native GOOS.

Relationship to Other Proposals

  • Userspace binaries proposal — this extends Tier 4 (managed runtimes) with concrete Go implementation details. The POSIX layer (Part 4) is NOT sufficient for Go — Go doesn’t use libc on Linux, it makes raw syscalls. The GOOS approach bypasses POSIX entirely.
  • Service architecture proposal — Go services participate in the capability graph like any other process. The Go net poller (Phase 3) uses TcpSocket/UdpSocket caps from the network stack.
  • Storage and naming proposal — Go’s os.Open()/os.Read() map to Namespace + Store caps via the GOOS file I/O implementation. Go doesn’t use POSIX for this — it has its own runtime/os_capos.go with direct cap invocations.
  • SMP proposal — Go’s GOMAXPROCS uses multiple OS threads (Phase 2). Requires per-CPU scheduling from Stage 5/7.

Open Questions

  1. Fork maintenance. A GOOS=capos fork must track upstream Go releases. How much drift is acceptable? Could the capOS-specific code eventually be upstreamed (like Fuchsia’s was)?

  2. CGo support. Go’s FFI to C (cgo) requires a C toolchain and dynamic linking. Should capOS support cgo, or is pure Go sufficient? CUE doesn’t use cgo, but some Go libraries do.

  3. GOROOT on capOS. Go programs expect $GOROOT/lib at runtime for some stdlib features. Where does this live on capOS? In the Store? Baked into the binary via static compilation?

  4. Go module proxy. go get needs HTTP access. On capOS, this would use a Fetch cap. But cross-compilation on the host is more practical than building Go on capOS itself.

  5. Debugging. Go’s runtime/debug and pprof expect signals and /proc access. What debugging capabilities should capOS expose?

  6. GC tuning. Go’s GC is tuned for Linux’s mmap semantics (decommit is cheap, virtual space is nearly free). capOS’s VirtualMemory cap needs to match these assumptions or the GC will need retuning.

Estimated Scope

| Phase | New kernel code | Go runtime changes | Dependencies |
|---|---|---|---|
| Phase 1: Minimal GOOS | ~200 (VirtualMemory cap) | ~2000-3000 | Stages 4-5 |
| Phase 2: Threading | ~500 (threads, futex) | ~500 | Stage 5, SMP |
| Phase 3: Net poller | ~100 (async notification) | ~300 | Networking, Stage 6 |
| Phase 4: CUE on capOS | 0 | 0 | Phase 1 (or WASI) |
| Total | ~800 | ~2800-3800 |  |

Plus ongoing maintenance to track Go upstream releases.

Proposal: Capability-Native System Monitoring

How capOS should expose logs, metrics, health, traces, crash records, and service status without introducing global /proc, ambient log access, or a privileged monitoring daemon that bypasses the capability model.

Problem

The current system is observable mostly through serial output, QEMU exit status, smoke-test lines, CQE error codes, and a small measurement-only build feature. That is enough for early kernel work, but it is not enough for a system whose claims depend on service decomposition, explicit authority, restart policy, auditability, and later cloud operation.

Monitoring is also not harmless. A monitoring service can reveal capability topology, service names, badges, timing, crash context, request payloads, and security decisions. If capOS imports a Unix-style “read everything under /proc” or “global syslog” model, monitoring becomes an ambient authority escape hatch. If it imports a kernel-programmable tracing model too early, it adds a large privileged execution surface before the basic service graph is stable.

The design target is narrower: make operational state visible through typed, attenuable capabilities. A process should observe only the services, logs, and signals it was granted authority to inspect.

Current State

Implemented signal sources:

  • Kernel diagnostics are printed through COM1 serial via kprintln!, timestamped with the PIT tick counter. Panic and fault paths use a mutex-free emergency serial writer.
  • Userspace logging currently goes through the kernel Console capability, backed directly by serial and bounded per call.
  • Runtime panics can use an emergency console path, then exit with a fixed code.
  • Capability-ring CQEs carry structured transport results, including negative CAP_ERR_* values and serialized CapException payloads.
  • The ring tracks cq_overflow, corrupted SQ/CQ recovery, and bounded SQE dispatch, but these facts are not exported as normal metrics.
  • ProcessSpawner and ProcessHandle.wait expose basic child lifecycle observation, but restart policy, health checks, and exported-cap lifecycle are future work.
  • capos-lib::ResourceLedger tracks cap slots, outstanding calls, scratch bytes, and frame grants, but only as local accounting state.
  • The measure feature adds benchmark-only counters and TSC helpers for controlled make run-measure boots.
  • SystemConfig.logLevel exists in the schema and is printed at boot, but there is no filtering, routing, or retention policy behind it.

That means the system has useful raw signals but lacks a capability-shaped monitoring architecture.

Design Principles

  1. Observation is authority. Reading logs, status, metrics, traces, crash records, or audit entries requires a capability.
  2. No global monitoring root. SystemStatus(all), LogReader(all), and ServiceSupervisor(all) are powerful caps. Normal sessions receive scoped wrappers.
  3. Kernel facts, userspace policy. The kernel may expose bounded facts about processes, rings, resources, and faults. Retention, filtering, aggregation, health semantics, restart policy, and user-facing views belong in userspace.
  4. Separate signal classes. Logs, metrics, lifecycle events, traces, health, crash records, and audit logs have different readers, retention rules, and security properties.
  5. Bounded by construction. Every producer path has a byte, entry, or time budget. Loss is explicit and summarized.
  6. Payload capture is exceptional. Default tracing records headers, interface IDs, method IDs, sizes, result codes, badges when authorized, and timing. Capturing method payloads needs a stronger cap because payloads may contain secrets.
  7. Serial remains emergency plumbing. Early boot, panic, and recovery still need direct serial output. Normal services should receive log caps rather than broad Console.
  8. Audit is not debug logging. Audit records security-relevant decisions and capability lifecycle events. It is append-only from producers and exposed through scoped readers.
  9. Pull by default, push when justified. Status and metrics are pull-shaped (reader polls a snapshot cap). Logs, lifecycle events, crash records, and audit entries are push-shaped (producer calls into a sink). Traces are pull with an explicit arm/drain lifecycle because capture is expensive. Each direction has its own cap surface; do not generalize one shape to cover all signals.
  10. Narrow kernel stats caps over one god-cap. The kernel exposes bounded facts through several small read-only caps (ring, scheduler, resource ledger, frames, endpoints, caps, crash) rather than one KernelDiagnostics that grants everything. Narrow caps let an init-owned status service be assembled by composition, and let a broker lease a subset to an operator without handing over the rest.

Signal Taxonomy

Logs

Human-oriented diagnostic records:

  • severity, component, service name, pid, optional session/service badge, monotonic timestamp, message text;
  • rate-limited at producer and log service boundaries;
  • suitable for serial forwarding, ring-buffer retention, and later storage;
  • not a source of truth for security decisions.

Metrics

Low-cardinality numeric state:

  • per-process ring SQ/CQ occupancy, cq_overflow, invalid SQE counts, opcode counts, transport error counts;
  • scheduler runnable/blocked counts, direct IPC handoffs, cap-enter timeouts, process exits;
  • resource ledger usage: cap slots, outstanding calls, scratch bytes, frame grants, endpoint queue occupancy, VM mapped pages;
  • heap/frame allocator pressure;
  • later device, network, storage, and CPU-time counters.

Metric shape is fixed to three forms:

  • Counter — monotonic u64, reset only by reboot. Cumulative semantics make aggregation composable.
  • Gauge — i64 that moves both ways. Used for queue depths, free-frame counts, mapped-page counts.
  • Histogram — fixed bucket layout carried in the descriptor, u64 per bucket. Used for ring-dispatch duration, context-switch latency, IPC RTT.

Richer shapes (top-k tables, exponential histograms) are emitted as opaque typed payloads through the producer-scoped envelope described under “Core Interfaces”; the generic reader treats them as data, and a schema-aware viewer decodes them. Metrics should be snapshots or monotonic counters, not unbounded label streams.
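The three fixed shapes can be sketched directly; the struct names and the bucket bounds below are illustrative, not the capOS types, and the real histogram layout would travel in the metric descriptor as the text says.

```rust
/// Counter: monotonic u64, reset only by reboot.
struct Counter(u64);
impl Counter {
    fn inc(&mut self, n: u64) {
        self.0 = self.0.wrapping_add(n);
    }
}

/// Gauge: i64 that moves both ways (queue depth, free frames, mapped pages).
struct Gauge(i64);
impl Gauge {
    fn add(&mut self, delta: i64) {
        self.0 += delta;
    }
}

/// Histogram: fixed bucket layout, u64 per bucket, plus one overflow bucket.
struct Histogram {
    bounds: &'static [u64], // inclusive upper bounds, ascending
    buckets: Vec<u64>,
}
impl Histogram {
    fn new(bounds: &'static [u64]) -> Self {
        Histogram { bounds, buckets: vec![0; bounds.len() + 1] }
    }
    fn observe(&mut self, v: u64) {
        // First bound that covers v, or the trailing overflow bucket.
        let idx = self
            .bounds
            .iter()
            .position(|&b| v <= b)
            .unwrap_or(self.bounds.len());
        self.buckets[idx] += 1;
    }
}
```

Cumulative counters and fixed buckets are what make aggregation across snapshots composable: a reader can subtract two snapshots without coordinating with the producer.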

Events

Discrete lifecycle facts:

  • process spawned, started, exited, waited, killed, or failed to load;
  • service declared healthy, unhealthy, restarting, quiescing, or upgraded;
  • endpoint queue overflow, cancellation, disconnected holder, transfer rollback;
  • resource quota rejection;
  • device reset, interrupt storm, link up/down, block I/O error once devices exist.

Events are useful for supervisors and status views. They may also feed logs.

Traces

Bounded high-detail capture for debugging:

  • SQE/CQE records around one pid, service subtree, endpoint, cap id, or error class;
  • optional capnp payload capture only with explicit authority;
  • offline schema-aware viewer for reproducing and explaining a failure;
  • short retention by default.

This is the Ring as Black Box milestone from WORKPLAN.md, not full replay.

Health

Declared service state:

  • ready, starting, degraded, draining, failed, stopped;
  • last successful health check and last failure reason;
  • dependency health summaries;
  • supervisor-owned restart intent and backoff state.

Health is not inferred only from process liveness. A process can be alive and unhealthy, or intentionally draining and still useful.

Crash Records

Panic, exception, and fatal userspace runtime records:

  • boot stage, current pid if known, fault vector, RIP/CR2/error code where applicable, recent SQE context when safe, and last serial line cursor;
  • bounded, redacted, and readable through a crash/debug capability;
  • serial fallback remains mandatory when no reader exists.

Audit

Security and policy records:

  • session creation, approval request, policy decision, cap grant, cap transfer, cap release/revocation, denial, declassification/relabel operation;
  • no raw authentication proofs, private keys, bearer tokens, or full environment dumps;
  • query access is scoped by session, service subtree, or operator role.

Proposed Architecture

flowchart TD
    Kernel[Kernel primitives] --> KD[KernelDiagnostics cap]
    Kernel --> Serial[Emergency serial]

    Init[init / root supervisor] --> LogSvc[Log service]
    Init --> MetricsSvc[Metrics service]
    Init --> StatusSvc[Status service]
    Init --> AuditSvc[Audit log]
    Init --> TraceSvc[Trace capture service]

    KD --> MetricsSvc
    KD --> StatusSvc
    KD --> TraceSvc

    Services[Services and drivers] --> LogSink[Scoped LogSink caps]
    Services --> Health[Health caps]
    Services --> AuditWriter[Scoped AuditWriter caps]

    LogSink --> LogSvc
    Health --> StatusSvc
    AuditWriter --> AuditSvc

    Broker[AuthorityBroker] --> Readers[Scoped readers]
    Readers --> Shell[Shell / agent / operator tools]

    StatusSvc --> Readers
    LogSvc --> Readers
    MetricsSvc --> Readers
    TraceSvc --> Readers
    AuditSvc --> Readers

The important property is that there is no ambient monitoring namespace. The graph is assembled by init and supervisors. Readers are capabilities, not paths.

Core Interfaces

These are conceptual interfaces. They should not be added to schema/capos.capnp until the current manifest-executor work is complete and a specific implementation slice needs them.

enum Severity {
  debug @0;
  info @1;
  warn @2;
  error @3;
  critical @4;
}

struct LogRecord {
  tick @0 :UInt64;
  severity @1 :Severity;
  component @2 :Text;
  pid @3 :UInt32;
  badge @4 :UInt64;
  message @5 :Text;
}

struct LogFilter {
  minSeverity @0 :Severity;
  componentPrefix @1 :Text;
  pid @2 :UInt32;
  includeDebug @3 :Bool;
}

interface LogSink {
  write @0 (record :LogRecord) -> ();
}

interface LogReader {
  read @0 (cursor :UInt64, maxRecords :UInt32, filter :LogFilter)
      -> (records :List(LogRecord), nextCursor :UInt64, dropped :UInt64);
}

LogSink is what ordinary services receive. LogReader is what shells, operators, supervisors, and diagnostic tools receive. A scoped reader can filter to one service subtree or session before the caller ever sees the record.
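The cursor/dropped contract behind LogReader.read can be sketched as a bounded buffer with monotonic sequence numbers. This is a hedged host-side model, not the log service implementation; `LogBuffer` is a hypothetical name. The point is that loss is explicit: when eviction outruns a reader's cursor, the reader learns how many records it missed rather than silently skipping them.

```rust
use std::collections::VecDeque;

/// Bounded log store. Every record gets a monotonically increasing
/// sequence number; eviction is the only way records disappear.
struct LogBuffer {
    cap: usize,
    next_seq: u64,
    records: VecDeque<(u64, String)>, // (sequence, message)
}

impl LogBuffer {
    fn new(cap: usize) -> Self {
        LogBuffer { cap, next_seq: 0, records: VecDeque::new() }
    }

    fn write(&mut self, msg: &str) {
        if self.records.len() == self.cap {
            self.records.pop_front(); // bounded by construction
        }
        self.records.push_back((self.next_seq, msg.to_string()));
        self.next_seq += 1;
    }

    /// Mirrors read(cursor, maxRecords) -> (records, nextCursor, dropped).
    fn read(&self, cursor: u64, max: usize) -> (Vec<String>, u64, u64) {
        let oldest = self.records.front().map_or(self.next_seq, |r| r.0);
        // Records between the caller's cursor and the oldest survivor
        // were evicted; report them as dropped.
        let dropped = oldest.saturating_sub(cursor);
        let start = cursor.max(oldest);
        let out: Vec<String> = self
            .records
            .iter()
            .filter(|(seq, _)| *seq >= start)
            .take(max)
            .map(|(_, m)| m.clone())
            .collect();
        (out.clone(), start + out.len() as u64, dropped)
    }
}
```

A scoped reader wrapper would apply the LogFilter before this step, so a session-scoped caller never observes records outside its subtree.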

struct ProcessStatus {
  pid @0 :UInt32;
  serviceName @1 :Text;
  state @2 :Text;
  capSlotsUsed @3 :UInt32;
  capSlotsMax @4 :UInt32;
  outstandingCalls @5 :UInt32;
  cqReady @6 :UInt32;
  cqOverflow @7 :UInt64;
  lastExitCode @8 :Int64;
}

struct ServiceStatus {
  name @0 :Text;
  health @1 :Text;
  pid @2 :UInt32;
  restartCount @3 :UInt32;
  lastError @4 :Text;
}

interface SystemStatus {
  listProcesses @0 () -> (processes :List(ProcessStatus));
  listServices @1 () -> (services :List(ServiceStatus));
  service @2 (name :Text) -> (status :ServiceStatus);
}

SystemStatus is read-only. A broad instance can see the system; wrappers can expose one service, one supervision subtree, or one session.

enum MetricKind {
  counter @0;
  gauge @1;
  histogram @2;
}

struct MetricSample {
  # Well-known fixed-name slot for counters and gauges the aggregator
  # understands without additional schema lookup. Use this for stable
  # kernel counters to keep the hot path allocation-free.
  name @0 :Text;
  kind @1 :MetricKind;
  value @2 :Int64;
  tick @3 :UInt64;

  # Producer-scoped typed envelope for richer samples (histograms,
  # top-k tables, per-subsystem structs). Payload is a capnp message;
  # the schema is identified by `schemaHash` (capnp node id) and keyed
  # per producer. Opaque to the generic reader; a schema-aware viewer
  # decodes it.
  producerId @4 :UInt64;
  schemaHash @5 :UInt64;
  payload    @6 :Data;
}

struct MetricFilter {
  prefix @0 :Text;
  service @1 :Text;
}

interface MetricsReader {
  snapshot @0 (filter :MetricFilter, maxSamples :UInt32)
      -> (samples :List(MetricSample), truncated :Bool);
}

Early metrics should be fixed-name counters and gauges in the name/value slot. Avoid arbitrary labels until there is a concrete memory and cardinality policy. The producer-scoped envelope exists so richer samples do not force the generic reader to learn a string-key taxonomy — if a producer needs per-queue or per-device detail, it ships a typed capnp struct keyed by schemaHash rather than synthesizing name strings.

struct TraceSelector {
  pid @0 :UInt32;
  serviceName @1 :Text;
  errorCode @2 :Int32;
  includePayloadBytes @3 :Bool;
}

struct TraceRecord {
  tick @0 :UInt64;
  pid @1 :UInt32;
  opcode @2 :UInt16;
  capId @3 :UInt32;
  methodId @4 :UInt16;
  interfaceId @5 :UInt64;
  result @6 :Int32;
  flags @7 :UInt16;
  payload @8 :Data;
}

interface TraceCapture {
  arm @0 (selector :TraceSelector, maxRecords :UInt32, maxBytes :UInt32)
      -> (captureId :UInt64);
  drain @1 (captureId :UInt64, maxRecords :UInt32)
      -> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
}

Payload capture should default off. A capture cap that can read payload bytes is closer to a debug privilege than a normal status cap.

enum HealthState {
  starting @0;
  ready @1;
  degraded @2;
  draining @3;
  failed @4;
  stopped @5;
}

interface Health {
  check @0 () -> (state :HealthState, reason :Text);
}

interface ServiceSupervisor {
  status @0 () -> (status :ServiceStatus);
  restart @1 () -> ();
}

ServiceSupervisor is authority-changing. Normal monitoring readers should not receive it. A broker can mint a leased ServiceSupervisor(net-stack) for one operator action.

Kernel Diagnostics Contract

The kernel should expose a small read-only diagnostics surface for facts only the kernel can know:

  • process table snapshot: pid, state, service name if known, wait state, exit code, ring physical identity hidden or omitted;
  • ring snapshot: SQ/CQ head/tail-derived occupancy, overflow count, corrupted head/tail recovery counts, opcode/error counters;
  • resource snapshot: cap slot usage, outstanding calls, scratch reservation, frame grant pages, mapped VM pages, free frame count, heap pressure;
  • scheduler snapshot: tick count, current pid, run queue length, blocked count, direct IPC handoff count, timeout wake count;
  • crash record: last panic/fault metadata and early boot stage.

The kernel should not implement log routing, alerting, dashboards, retention policy, restart decisions, RBAC, ABAC, or text search. Those are userspace service responsibilities.

Implementation shape:

  • Maintain fixed-size counters in existing kernel structures where the source event already occurs.
  • Prefer snapshots computed from existing state over duplicate counters when the cost is bounded.
  • Expose snapshots through a small set of narrow read-only capabilities, not one KernelDiagnostics god-cap. The initial decomposition:
    • SchedStats — tick count, current pid, run queue length, blocked count, direct IPC handoff count, cap_enter timeout/wake counts.
    • FrameStats — free/used frame counts, frame-grant pages, allocator pressure histogram.
    • RingStats — per-process SQ/CQ occupancy, cq_overflow, corrupted-head recovery counts, opcode counters, transport-error counters.
    • CapTableStats — per-process slot occupancy, generation-rollover counts, insertion/remove rates.
    • EndpointStats — per-endpoint waiter depth, RECV/RETURN match rate, abort/cancellation counts.
    • CrashSnapshot — last panic/fault metadata, early boot stage, recent SQE context when safe.
  • Each narrow cap exposes snapshot() -> (sample :MetricSample) or a typed struct; none of them enumerates processes or reads cap tables beyond what the subsystem owns. A trusted status service composes the ones it needs; a broker leases a subset for operator sessions without the rest.
  • ProcessInspector (pid-scoped process table, cap-table enumeration, VM map) is a distinct, stronger cap and lives with process-management authority, not with monitoring.
  • Convert broad diagnostics into scoped userspace wrappers before handing them to shells or applications.
  • Keep panic/fault serial writes independent of any diagnostics service.

Promotion from the measure feature: the benchmark counters in kernel/src/measure.rs graduate to always-on in RingStats / SchedStats when the per-event cost is provably a single relaxed atomic add. Cycle-counter instrumentation (rdtsc/rdtscp) stays behind cfg(feature = "measure") because it is serializing and benchmark-only. The promotion threshold keeps normal dispatch builds free of instrumentation cost without forcing monitoring into a second build configuration.
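The promotion threshold can be made concrete: a counter qualifies for always-on RingStats / SchedStats membership when the hot-path cost is one relaxed atomic add. A minimal sketch, with illustrative field names rather than the actual kernel stats layout:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of an always-on stats block. Each event costs one relaxed
/// fetch_add; snapshots are loads, so readers never block the hot path.
struct RingStats {
    dispatches: AtomicU64,
    cq_overflow: AtomicU64,
}

impl RingStats {
    const fn new() -> Self {
        RingStats {
            dispatches: AtomicU64::new(0),
            cq_overflow: AtomicU64::new(0),
        }
    }

    #[inline]
    fn on_dispatch(&self) {
        // Relaxed ordering suffices: these are monotonic counters with no
        // ordering relationship to other kernel state.
        self.dispatches.fetch_add(1, Ordering::Relaxed);
    }

    #[inline]
    fn on_cq_overflow(&self) {
        self.cq_overflow.fetch_add(1, Ordering::Relaxed);
    }

    /// What a narrow read-only stats cap would return.
    fn snapshot(&self) -> (u64, u64) {
        (
            self.dispatches.load(Ordering::Relaxed),
            self.cq_overflow.load(Ordering::Relaxed),
        )
    }
}
```

Anything heavier, such as rdtsc-based timing, fails the threshold and stays behind the measure feature as described above.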

Logging Model

Early boot has only serial. After init starts the log service, ordinary services should receive LogSink rather than raw Console unless they need emergency console access.

Recommended path:

  1. Kernel serial remains for boot, panic, and fault records.
  2. Init starts a userspace log service and passes scoped LogSink caps to children.
  3. The log service forwards selected records to Console until persistent storage exists.
  4. SystemConfig.logLevel becomes an initial policy input for which records the log service forwards and retains.
  5. Session and operator tools receive scoped LogReader caps from a broker.

Services should not put secrets, raw capability payloads, full auth proofs, or arbitrary user input into logs without explicit redaction. Log records are data, not commands.

Metrics and Status

Status answers “what is alive and what state is it in.” Metrics answer “what is the numeric behavior over time.” Keeping them separate avoids a common failure mode where a human-readable status API grows into an unbounded metrics store.

Initial status fields should cover:

  • pid, service name, binary name, process state, exit code;
  • process handle wait state;
  • supervisor health and restart policy once supervision exists;
  • cap table occupancy and outstanding call count;
  • ring CQ availability and overflow;
  • endpoint queue occupancy where authorized.

Initial metrics should cover:

  • ring dispatches, SQEs processed, per-op counts, transport error counts;
  • cap-enter wait count, timeout count, wake count;
  • scheduler context switches and direct IPC handoffs;
  • frame free/used counts, frame grant pages, VM mapped pages;
  • log records accepted, suppressed, dropped, and forwarded;
  • trace records captured and dropped.

Avoid per-method, per-cap-id, per-badge, or per-user high-cardinality metrics by default. Those belong in short-lived traces or scoped logs.

Ring as Black Box

The first concrete monitoring milestone should be the existing WORKPLAN.md Ring-as-Black-Box item:

  • define a bounded capture format for SQE/CQE and endpoint transition records;
  • export capture through a debug capability or QEMU-only debug path;
  • build a host-side viewer that decodes records and capnp payloads when payload capture is authorized;
  • add one failing-call smoke whose captured log can be inspected offline.

This buys immediate debugging value without committing to durable audit, network export, service restart policy, or replay semantics.

This is inspection, not record/replay. Replay requires stronger determinism, payload retention, timer/input modeling, and capability-state checkpoints.

Capture path cost. The capture cap (working name RingTap) is feature-gated (cfg(feature = "debug_tap") analogous to measure). Every armed tap imposes a serializing fan-out on dispatch; keeping it out of the default kernel feature set prevents always-on cost. Arming a tap is itself an auditable event — the tapped process and the audit log observe it — and tap grants respect move-semantics so a tap cannot be silently cloned past its intended holder. Payload-capturing taps require a separately leased cap distinct from metadata-only capture because payloads may contain secrets.
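The arm/drain budget can be sketched as a buffer whose record and byte ceilings are fixed at arm time, with loss counted rather than hidden. `Capture` and its methods are hypothetical names for the RingTap bookkeeping, not kernel types:

```rust
/// Bounded capture buffer: budgets are set at arm time and never grow.
struct Capture {
    max_records: usize,
    max_bytes: usize,
    bytes: usize,
    dropped: u64,
    records: Vec<Vec<u8>>,
}

impl Capture {
    /// Mirrors arm(selector, maxRecords, maxBytes) from TraceCapture.
    fn arm(max_records: usize, max_bytes: usize) -> Self {
        Capture {
            max_records,
            max_bytes,
            bytes: 0,
            dropped: 0,
            records: Vec::new(),
        }
    }

    /// Called from the dispatch fan-out for each matching SQE/CQE record.
    fn record(&mut self, rec: &[u8]) {
        if self.records.len() == self.max_records
            || self.bytes + rec.len() > self.max_bytes
        {
            self.dropped += 1; // budget exhausted: count the loss, do not grow
            return;
        }
        self.bytes += rec.len();
        self.records.push(rec.to_vec());
    }

    /// Mirrors drain: hands records to the holder and reports explicit loss.
    fn drain(&mut self) -> (Vec<Vec<u8>>, u64) {
        (std::mem::take(&mut self.records), self.dropped)
    }
}
```

Because the budget is fixed up front, an armed tap cannot be used to exhaust kernel memory, which is the same bounded-by-construction rule the design principles apply to every producer path.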

Health and Supervision

Health and restart policy should live with supervisors, not in a central kernel daemon.

Each supervisor owns:

  • a narrowed ProcessSpawner;
  • child ProcessHandle caps;
  • the cap bundle needed to restart its subtree;
  • optional Health caps exported by children;
  • a LogSink and AuditWriter for its own decisions.

Status services aggregate supervisor-reported health. They should distinguish:

  • no process exists;
  • process exists but never reported ready;
  • process is alive and ready;
  • process is alive but degraded;
  • process exited normally;
  • process failed and supervisor is backing off;
  • process was intentionally stopped or draining.

Restart authority should be a separate ServiceSupervisor cap. A read-only SystemStatus cap must not be able to restart anything.

Audit Integration

Audit should share infrastructure with logging only at the storage or transport layer. Its semantics are different.

Audit producers:

  • AuthorityBroker for policy decisions and leased grants;
  • supervisors for restarts and service lifecycle actions;
  • session manager for session creation and logout;
  • kernel or status service for cap transfer/release/revocation summaries when those events become part of the exported authority graph;
  • recovery tools for repair actions.

Audit readers are scoped:

  • a user can read records for its own session;
  • an operator can read a service subtree;
  • a recovery or security role can read broader streams after policy approval.

Audit entries must avoid secrets and payload dumps. They should record object identity, service identity, policy decision summaries, and capability interface classes rather than raw data.

Security and Backpressure

Monitoring must not become the easiest denial-of-service path.

Required controls:

  • Per-process log token buckets, matching the S.9 diagnostic aggregation design.
  • Suppression summaries for repeated invalid submissions.
  • Fixed-size ring buffers with explicit dropped counts.
  • Maximum record size for logs, events, crash records, and traces.
  • Bounded formatting outside interrupt context.
  • No heap allocation in timer or panic paths.
  • No unbounded metric label creation from user-controlled strings.
  • Payload tracing disabled by default.
  • Redaction rules at producer boundaries and at reader wrappers.
  • Capability-scoped readers; no unauthenticated “debug all” endpoint.

When pressure forces dropping, preserve first-observation diagnostics and later summaries. Losing detailed logs is acceptable; corrupting scheduler progress or blocking the kernel on log I/O is not.
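The per-process token bucket and its suppression summary can be sketched as follows; the struct name and parameters are illustrative, not the S.9 implementation:

```rust
/// Per-process log rate limiter: a bucket refilled each tick, with
/// suppressed records counted so loss can be summarized later.
struct TokenBucket {
    capacity: u64,
    tokens: u64,
    refill_per_tick: u64,
    suppressed: u64,
}

impl TokenBucket {
    fn new(capacity: u64, refill_per_tick: u64) -> Self {
        TokenBucket {
            capacity,
            tokens: capacity,
            refill_per_tick,
            suppressed: 0,
        }
    }

    /// Called from the timer path; bounded work, no allocation.
    fn tick(&mut self) {
        self.tokens = (self.tokens + self.refill_per_tick).min(self.capacity);
    }

    /// Returns true if the record may pass; otherwise it is counted, not queued.
    fn try_emit(&mut self) -> bool {
        if self.tokens > 0 {
            self.tokens -= 1;
            true
        } else {
            self.suppressed += 1;
            false
        }
    }

    /// Drained periodically into a "N records suppressed" summary line.
    fn take_suppressed(&mut self) -> u64 {
        std::mem::take(&mut self.suppressed)
    }
}
```

Starting the bucket full preserves first-observation diagnostics: a burst's leading records pass, and only the sustained flood is summarized.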

Relationship to Existing Proposals

  • Service Architecture: monitoring services are ordinary userspace services spawned by init or supervisors. Logging policy and service topology stay out of the pre-init kernel path.
  • Shell: the native and agent shell should receive scoped SystemStatus and LogReader caps in daily profiles, not global supervisor authority.
  • User Identity and Policy: AuthorityBroker mints scoped readers and leased supervisor caps based on session policy; AuditLog records the decisions.
  • Error Handling: transport errors and CapException payloads are monitoring signals, but retry policy remains userspace.
  • Authority Accounting: resource ledgers provide the first metrics substrate and define quota/backpressure boundaries.
  • Security and Verification: hostile-input tests should cover log flood aggregation and bounded diagnostic paths.
  • Live Upgrade: health, audit, and service status become prerequisites for credible upgrade orchestration.

Implementation Plan

  1. Document the model. Keep monitoring as a future architecture proposal and do not disturb the current manifest-executor milestone.

  2. Ring as Black Box. Implement bounded CQE/SQE capture, host-side decoding, and one failing-call smoke. This is the first useful monitoring artifact.

  3. Userspace log service. Add LogSink and LogReader schemas, start a log service from init, forward selected records to Console, and enforce logLevel, record size, and drop summaries.

  4. Narrow kernel stats caps and SystemStatus. Add the narrow read-only caps (SchedStats, FrameStats, RingStats, CapTableStats, EndpointStats, CrashSnapshot) as bounded snapshot surfaces. A userspace SystemStatus service composes the ones it needs and exposes scoped wrappers to shells and operator tools. Leave ProcessInspector out of this step — it belongs with process-management authority, not monitoring.

  5. Metrics snapshots. Add fixed counters and gauges for ring, scheduler, resource, log, and trace state. Keep labels static until a cardinality policy exists.

  6. Health and supervisor status. Add Health and read-only supervisor status once restart policy and exported service caps are concrete. Keep restart authority in separate ServiceSupervisor caps.

  7. Audit path. Add append-only audit records for broker decisions, cap grants, releases, revocations, restarts, and recovery actions. Start serial- or memory-backed; move to storage once the storage substrate exists.

  8. Crash records. Preserve bounded panic/fault metadata across the current boot where possible; later store records durably.

  9. Device, network, and storage metrics. Add driver metrics only after those drivers exist: interrupts, DMA/bounce usage, queue depth, RX/TX/drop/error counts, block latency, and reset events.

Non-Goals

  • No global /proc or /sys equivalent with ambient read access.
  • No kernel-resident dashboard, alert manager, text search, or policy engine.
  • No programmable kernel tracing language in the first monitoring design.
  • No promise of durable log retention before storage exists.
  • No default payload tracing.
  • No service restart authority bundled into ordinary read-only status caps.
  • No network export path until networking and policy can constrain it.

Open Questions

  • Should KernelDiagnostics expose snapshots only, or also a bounded event cursor?
  • What is the minimum timestamp model before wall-clock time exists?
  • Should log records carry local cap IDs, stable object IDs, or only interface and service metadata by default?
  • How should schema-aware trace decoding find schemas before a full SchemaRegistry exists?
  • Which crash fields are safe to expose to non-recovery sessions?
  • What retention policy is acceptable before persistent storage?
  • Should MetricsReader use typed structs for each subsystem instead of generic name/value samples?
  • Where should remote monitoring export fit once network transparency exists: a dedicated exporter service, capnp-rpc forwarding, or storage replication?

Proposal: User Identity, Sessions, and Policy

How capOS should represent human users, service identities, guests, anonymous callers, and policy systems without reintroducing Unix-style ambient authority.

Problem

capOS has processes, address spaces, capability tables, object identities, badges, quotas, and transfer rules. It deliberately does not have global paths, ambient file descriptors, a privileged root bit, or Unix uid/gid authorization in the kernel.

Interactive operation still needs a way to answer practical questions:

  • Who is using this shell session?
  • Which caps should a normal daily session receive?
  • How does a service distinguish Alice, Bob, a service account, a guest, and an anonymous network caller?
  • How do RBAC, ABAC, and mandatory policy fit a capability system?
  • How does POSIX compatibility expose users without letting uid become authority?

The answer should keep the enforcement model simple: capabilities are the authority. Identity and policy decide which capabilities get minted, granted, attenuated, leased, revoked, and audited.

Design Principles

  • user is not a kernel primitive.
  • uid, gid, role, and label values do not authorize kernel operations.
  • A process is authorized only by capabilities in its table.
  • Authentication proves or selects a principal; it does not itself grant authority.
  • A session is a live policy context that receives a cap bundle.
  • A workload is a process or supervision subtree launched with explicit caps.
  • POSIX user concepts are compatibility metadata over scoped caps.
  • Guest and anonymous access are explicit policy profiles, not missing policy.

Concepts

Principal

A principal is a durable or deliberately ephemeral identity known to auth and policy services. It is useful for policy decisions, ownership metadata, audit records, and user-facing display. It is not a kernel subject.

Examples:

  • human account
  • operator account
  • service account
  • cloud instance or deployment identity
  • guest profile
  • anonymous caller
  • pseudonymous key-bound identity

enum PrincipalKind {
  human @0;
  operator @1;
  service @2;
  guest @3;
  anonymous @4;
  pseudonymous @5;
}

struct PrincipalInfo {
  id @0 :Data;             # Stable opaque ID, or random ephemeral ID.
  kind @1 :PrincipalKind;
  displayName @2 :Text;
}

PrincipalInfo is intentionally descriptive. Possessing a serialized PrincipalInfo value must not grant authority.

Session

A session is a live context derived from a principal plus authentication and policy state. Sessions carry freshness, expiry, auth strength, quota profile, audit identity, and default cap-bundle selection.

enum AuthStrength {
  none @0;
  localPresence @1;
  password @2;
  hardwareKey @3;
  multiParty @4;
}

struct SessionInfo {
  sessionId @0 :Data;
  principal @1 :PrincipalInfo;
  authStrength @2 :AuthStrength;
  createdAtMs @3 :UInt64;
  expiresAtMs @4 :UInt64;
  profile @5 :Text;
}

interface UserSession {
  info @0 () -> (info :SessionInfo);
  defaultCaps @1 (profile :Text) -> (caps :List(GrantedCap));
  auditContext @2 () -> (sessionId :Data, principalId :Data);
  logout @3 () -> ();
}

interface SessionManager {
  login @0 (method :Text, proof :Data) -> (session :UserSession);
  guest @1 () -> (session :UserSession);
  anonymous @2 (purpose :Text) -> (session :UserSession);
}

GrantedCap should be the same transport-level result-cap concept used by ProcessSpawner, not a parallel authority encoding.

Workload

A workload is a process or supervision subtree started from a session, service, or supervisor. Workloads may carry session metadata for audit and policy, but they do not run “as” a user in the Unix sense. They run with a CapSet.

Common workload shapes:

  • interactive native shell
  • agent shell
  • POSIX shell compatibility session
  • user-facing application
  • per-user service instance
  • shared service handling many user sessions
  • service account process

Capability

A capability remains the actual authority. A process can only use what is in its local capability table. Policy services can choose to mint, attenuate, lease, transfer, or revoke capabilities, but they do not create a second authorization channel.

Session Startup Flow

flowchart TD
    Input[Login, guest, or anonymous request]
    Auth[Authentication or guest policy]
    Session[UserSession cap]
    Broker[AuthorityBroker / PolicyEngine]
    Bundle[Scoped cap bundle]
    Shell[Native, agent, or POSIX shell]
    Audit[AuditLog]

    Input --> Auth
    Auth --> Session
    Session --> Broker
    Broker --> Bundle
    Bundle --> Shell
    Broker --> Audit
    Shell --> Audit

The shell proposal’s minimal daily cap set is a session bundle:

terminal        TerminalSession or Console
self            self/session introspection
status          read-only SystemStatus
logs            read-only LogReader scoped to this user/session
home            Directory or Namespace scoped to user data
launcher        restricted launcher for approved user applications
approval        ApprovalClient

The shell still cannot mint additional authority. It can ask ApprovalClient for a plan-specific grant, and a trusted broker can return a narrow leased capability if policy and authentication allow it.
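
As an illustration of the bundle table above, a broker-side sketch that assembles the daily bundle and fails closed on unknown profiles; `CapHandle` and the profile names are hypothetical stand-ins, not the real transport types:

```rust
// Stand-in for a real kernel capability slot; purely illustrative.
#[derive(Clone, Debug, PartialEq)]
struct CapHandle(&'static str);

// Assemble the minimal daily session bundle. An unknown profile yields an
// empty bundle rather than any default-broad authority.
fn daily_bundle(profile: &str) -> Vec<(&'static str, CapHandle)> {
    match profile {
        "daily" => vec![
            ("terminal", CapHandle("TerminalSession")),
            ("self", CapHandle("SessionIntrospection")),
            ("status", CapHandle("SystemStatus:ro")),
            // Scoped at mint time: the reader sees only this session's records.
            ("logs", CapHandle("LogReader:session")),
            ("home", CapHandle("Namespace:user")),
            ("launcher", CapHandle("RestrictedLauncher")),
            ("approval", CapHandle("ApprovalClient")),
        ],
        _ => Vec::new(), // fail closed
    }
}
```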

Multi-User Workloads

capOS should support two normal multi-user patterns.

Per-Session Subtree

The session owns a shell or supervisor subtree. Every child process receives an explicit CapSet assembled from the session bundle plus workflow-specific grants.

Example:

  • Alice’s shell receives home = Namespace("/users/alice").
  • Bob’s shell receives home = Namespace("/users/bob").
  • The same editor binary launched from each shell receives different home and terminal caps.
  • The editor cannot cross from Alice’s namespace into Bob’s unless a broker deliberately grants a sharing cap.

This is the right default for interactive applications and POSIX shells.

Shared Service With Per-Client Session Authority

A server process may handle many users in one address space. It should not infer authority from a caller’s self-reported user name. Instead, each client connection carries one or more of:

  • a badge on the endpoint cap identifying the client/session relation,
  • a UserSession or SessionContext cap,
  • a scoped object cap such as Directory, Namespace, LogWriter, or ApprovalClient,
  • a quota donation for server-side state.

The service uses these values to select scoped storage, enforce per-client limits, emit audit records, and return narrowed caps. This is the right shape for HTTP services, databases, log services, terminals, and shared daemons.
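
A minimal sketch of that shape, assuming a numeric badge per client connection; the types and the quota policy are illustrative:

```rust
use std::collections::HashMap;

// Per-client state is keyed by the badge on the endpoint cap, never by a
// self-reported user name.
struct ClientState {
    bytes_stored: u64,
    quota: u64, // per-client limit, e.g. from a quota donation
}

struct SharedService {
    clients: HashMap<u64 /* badge */, ClientState>,
}

impl SharedService {
    fn store(&mut self, badge: u64, payload: &[u8]) -> Result<(), &'static str> {
        // Unknown badges are rejected outright: no badge, no authority.
        let state = self.clients.get_mut(&badge).ok_or("unknown client badge")?;
        if state.bytes_stored + payload.len() as u64 > state.quota {
            return Err("per-client quota exceeded");
        }
        state.bytes_stored += payload.len() as u64;
        Ok(())
    }
}
```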

Service Accounts

Service identities are principals too. They are usually non-interactive and receive caps from init, a supervisor, or a deployment manifest rather than from a human login flow.

Service-account policy should be explicit:

  • which binary or measured package may use the identity,
  • which supervisor may spawn it,
  • which caps are in its base bundle,
  • which caps it may request from a broker,
  • which audit stream records its activity.

Anonymous, Guest, and Pseudonymous Access

These are distinct profiles.

Empty Cap Set

An untrusted ELF with an empty CapSet is not a user session. It is the roadmap’s “Unprivileged Stranger”: code with no useful authority. It can terminate itself and interact with the capability transport, but it cannot reach a resource because it has no caps.

Anonymous

Anonymous means unauthenticated and usually remote or programmatic. It should receive a random ephemeral principal ID and a very small cap bundle.

Typical properties:

  • no durable home namespace by default,
  • strict CPU, memory, outstanding-call, and log quotas,
  • short session expiry,
  • no elevation path except “authenticate” or “create account”,
  • audit records keyed by ephemeral session ID and network/service context.

Guest

Guest means an interactive local profile with weak or no authentication.

Typical properties:

  • terminal/UI access,
  • temporary namespace,
  • optional ephemeral home reset on logout,
  • restricted launcher,
  • no administrative approval path unless policy grants one explicitly,
  • clearer user-facing affordance than anonymous.

Pseudonymous

Pseudonymous means durable identity without necessarily naming a human. A public key, passkey, service token, or cloud identity can select the same principal across sessions. This can receive persistent storage and quotas while still remaining separate from a verified human account.

POSIX Compatibility

POSIX user concepts are compatibility metadata, not authority.

  • uid, gid, user names, groups, $HOME, /etc/passwd, chmod, and chown live in libcapos-posix, a filesystem service, or a profile service.
  • open("/home/alice/file") succeeds only if the process has a Directory or Namespace cap that resolves that synthetic path.
  • setuid cannot grant new caps. At most it asks a compatibility broker to replace the process’s POSIX profile or launch a new process with a different cap bundle.
  • POSIX ownership bits may influence one filesystem service’s policy, but they cannot authorize access to caps outside that service.

This lets existing programs inspect plausible user metadata without making Unix permission bits the capOS security model.

Policy Models

RBAC, ABAC, and mandatory access control fit capOS as grant-time and mint-time policy. They should mostly live in ordinary userspace services: AuthorityBroker, PolicyEngine, SessionManager, RoleDirectory, LabelAuthority, AuditLog, and service-specific attenuators.

The kernel should keep enforcing capability ownership, generation, transfer rules, revocation epochs, resource ledgers, and process isolation. It should not evaluate roles, attributes, or label lattices on every capability call.

RBAC

Role-based access control maps principals or sessions to named role sets. Roles select cap bundles and approval eligibility.

Examples:

  • developer can receive a launcher for development tools and read-only service logs.
  • net-operator can request a leased ServiceSupervisor(net-stack).
  • storage-admin can request repair caps for selected storage volumes.

Implementation shape:

interface RoleDirectory {
  rolesFor @0 (principal :Data) -> (roles :List(Text));
}

interface AuthorityBroker {
  request @0 (
    session :UserSession,
    plan :ActionPlan,
    requestedCaps :List(CapRequest),
    durationMs :UInt64
  ) -> (grant :ApprovalGrant);
}

Roles do not bypass capabilities. They only let a broker decide whether it may mint or return particular scoped caps.
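
A sketch of that gate, using the example roles above. The (role, cap) pairs are illustrative; the key property is that unlisted pairs are denied:

```rust
// Hypothetical broker-side eligibility check: roles only decide which cap
// requests the broker will consider, never grant authority directly.
fn role_may_request(roles: &[&str], requested: &str) -> bool {
    roles.iter().any(|role| match (*role, requested) {
        ("developer", "LogReader:service-ro") => true,
        ("developer", "Launcher:dev-tools") => true,
        ("net-operator", "ServiceSupervisor:net-stack") => true,
        ("storage-admin", r) if r.starts_with("Repair:volume/") => true,
        _ => false, // fail closed: unlisted pairs are ineligible
    })
}
```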

ABAC

Attribute-based access control evaluates a richer decision context:

  • subject attributes: principal kind, roles, auth strength, session age, device posture, locality,
  • action attributes: requested method, target service, destructive flag, requested duration,
  • object attributes: service name, namespace prefix, data class, owner principal, sensitivity,
  • environment attributes: time, boot mode, recovery mode, network location, cloud instance metadata, quorum state.

ABAC is useful for contextual narrowing:

  • allow log read only for the caller’s session unless break-glass policy is active,
  • issue ServiceSupervisor(net-stack) only with fresh hardware-key auth,
  • grant Namespace("/shared/project") read-write only during a maintenance window,
  • deny network caps to guest sessions.

ABAC decisions should return capabilities, wrappers, or denials. They should not create hidden ambient checks downstream.

ABAC Policy Engine Choices

Do not invent a policy language first. The capOS-native interface should be small and capability-shaped, while the broker implementation can start with a mainstream engine behind that interface.

Recommended order:

  1. Cedar for runtime authorization. Cedar’s request shape is already close to capOS: principal, action, resource, and context. It supports RBAC and ABAC in one policy set, has schema validation, and has a Rust implementation. That makes it the best fit for AuthorityBroker and MacBroker service prototypes.

  2. OPA/Rego for host-side and deployment policy. OPA is widely used for cloud, Kubernetes, infrastructure-as-code, and admission-control style checks. It is useful for validating manifests, cloud metadata deltas, package/deployment policies, and CI rules. The Wasm compilation path is worth tracking for later capOS-side execution, but OPA should not be the first low-level runtime dependency.

  3. Casbin for quick prototypes only. Casbin is useful for simple RBAC/ABAC experiments and has Rust bindings, but its model/matcher style is less attractive as a long-term capOS policy substrate than Cedar’s schema-validated authorization model.

  4. XACML for interoperability and compliance, not native policy. XACML remains the classic enterprise ABAC standard. It is useful as a conceptual reference or import/export target, but it is too heavy and XML-centric to be the native capOS policy language.

The capOS service boundary should hide the selected engine:

interface PolicyEngine {
  decide @0 (request :PolicyRequest) -> (decision :PolicyDecision);
}

struct PolicyRequest {
  principal @0 :PrincipalInfo;
  action @1 :Text;
  resource @2 :ResourceRef;
  context @3 :List(Attribute);
}

struct PolicyDecision {
  allowed @0 :Bool;
  reason @1 :Text;
  leaseMs @2 :UInt64;
  constraints @3 :List(Attribute);
}

PolicyDecision is still not authority. It is input to a broker that returns actual caps, wrapper caps, leased caps, or denial.

Mandatory Access Control

Mandatory access control is non-bypassable policy set by the system owner or deployment, not discretionary sharing by ordinary users. In capOS, MAC should be implemented as mandatory constraints on cap minting, attenuation, transfer, and service wrappers.

Examples:

  • a Secret cap labeled high cannot be transferred to a workload labeled low,
  • a LogReader for security logs cannot be granted to a guest session even if an application asks,
  • a recovery shell can inspect storage read-only but cannot write without a separate exact-target repair cap,
  • cloud user-data can add application services but cannot grant FrameAllocator, DeviceManager, or raw networking authority.

Implementation components:

enum Sensitivity {
  public @0;
  internal @1;
  confidential @2;
  secret @3;
}

struct SecurityLabel {
  domain @0 :Text;
  sensitivity @1 :Sensitivity;
  compartments @2 :List(Text);
}

interface LabelAuthority {
  labelOfPrincipal @0 (principal :Data) -> (label :SecurityLabel);
  labelOfObject @1 (object :Data) -> (label :SecurityLabel);
  canTransfer @2 (
    from :SecurityLabel,
    to :SecurityLabel,
    capInterface :UInt64
  ) -> (allowed :Bool, reason :Text);
}

For ordinary services, MAC can be enforced by brokers and wrapper caps. For high-assurance boundaries, the remaining question is whether transfer labels need kernel-visible hold-edge metadata. That should be added only for a concrete mandatory policy that cannot be enforced by controlling all grant paths through trusted services.
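
A sketch of the dominance check behind LabelAuthority.canTransfer, assuming sensitivity levels are totally ordered and compartments must be a subset; the Rust names are ours, not the schema's:

```rust
// Declaration order gives the sensitivity ordering via derived PartialOrd.
#[derive(Clone, Copy, PartialEq, PartialOrd)]
enum Sensitivity { Public, Internal, Confidential, Secret }

struct SecurityLabel {
    sensitivity: Sensitivity,
    compartments: Vec<&'static str>,
}

// `a` dominates `b` when its sensitivity is at least as high and it holds
// every compartment of `b`.
fn dominates(a: &SecurityLabel, b: &SecurityLabel) -> bool {
    a.sensitivity >= b.sensitivity
        && b.compartments.iter().all(|c| a.compartments.contains(c))
}

/// Transfer of a data-bearing cap is allowed only when the receiver's
/// clearance dominates the object's classification.
fn can_transfer(object: &SecurityLabel, receiver: &SecurityLabel) -> bool {
    dominates(receiver, object)
}
```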

GOST-Style MAC and MIC

Russian GOST framing is stricter than the generic “MAC means labels” summary. The relevant standards split at least two policies that capOS should keep separate:

  • Mandatory access control for confidentiality. ГОСТ Р 59383-2021 describes mandatory access control as classification labels on resources and clearances for subjects. ГОСТ Р 59453.1-2021 goes further: a formal model that includes users, subjects, objects, containers, access levels, confidentiality levels, subject-control relations, and information flows. The safety goal is preventing unauthorized flow from an object at a higher or incomparable confidentiality level to a lower one.

  • Mandatory integrity control for integrity. ГОСТ Р 59453.1-2021 treats this separately from confidentiality MAC. The integrity model constrains subject integrity levels, object/container integrity levels, subject-control relationships, and information flows so lower-integrity subjects cannot control or corrupt higher-integrity subjects.

For capOS, this should map to labels on sessions, objects, wrapper caps, and eventually hold edges:

struct ConfidentialityLabel {
  level @0 :Text;              # e.g. public, internal, secret.
  compartments @1 :List(Text);
}

struct IntegrityLabel {
  level @0 :Text;              # ordered by deployment policy.
  domains @1 :List(Text);
}

struct MandatoryLabel {
  confidentiality @0 :ConfidentialityLabel;
  integrity @1 :IntegrityLabel;
}

Because capOS cannot rely on generic read and write syscalls, each capability method needs a declared flow class:

  • read-like: File.read, Secret.read, LogReader.read;
  • write-like: File.write, Namespace.bind, ManifestUpdater.apply;
  • control-like: ProcessSpawner.spawn, ServiceSupervisor.restart;
  • transfer-like: CAP_OP_CALL, CAP_OP_RETURN, and result-cap insertion when they carry caps or data across labeled domains.

Initial rules can be expressed as broker/wrapper checks:

read data-bearing cap:
  subject.clearance dominates object.classification

write data-bearing cap:
  target.classification dominates source.classification
  # no write down

control process or supervisor:
  controlling subject is same label, or is an explicitly trusted subject

integrity write/control:
  writer.integrity >= target.integrity
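
The rule classes above can be sketched as a single broker-side check, assuming confidentiality and integrity levels are small ordered integers assigned by deployment policy:

```rust
// Illustrative labels: higher number = higher level, per deployment policy.
#[derive(Clone, Copy)]
struct Labels { conf: u8, integ: u8 }

enum FlowClass { ReadLike, WriteLike, ControlLike }

// Broker/wrapper check for one call from `subject` against `object`.
fn allowed(subject: Labels, object: Labels, class: FlowClass, trusted: bool) -> bool {
    match class {
        // Subject clearance must dominate the object's classification.
        FlowClass::ReadLike => subject.conf >= object.conf,
        // No write down on confidentiality; no write up on integrity.
        FlowClass::WriteLike => {
            object.conf >= subject.conf && subject.integ >= object.integ
        }
        // Same label, or an explicitly trusted subject; integrity still holds.
        FlowClass::ControlLike => {
            trusted || (subject.conf == object.conf && subject.integ >= object.integ)
        }
    }
}
```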

This is not enough for a GOST-style formal claim, because uncontrolled cap transfer can bypass the broker. A higher-assurance design needs:

  • kernel object identity for every labeled object,
  • label metadata on kernel objects or per-process hold edges,
  • transfer-time checks for copy, move, result caps, and endpoint delivery,
  • explicit trusted-subject/declassifier caps,
  • an audit trail for every label-changing or declassifying operation,
  • a formal state model covering users, subjects, objects, containers, access rights, accesses, and memory/time information flows.

The proposal therefore has two levels:

  • Pragmatic capOS MAC/MIC: userspace brokers and wrapper caps enforce labels on grants and method calls.
  • GOST-style MAC/MIC: a formal information-flow model plus kernel-visible labels/hold-edge constraints for transfers that cannot be forced through trusted wrappers. See formal-mac-mic-proposal.md for the dedicated abstract-automaton and proof track.

Composition Order

When policies compose, use this order:

  1. Mandatory policy defines the maximum possible authority.
  2. RBAC selects coarse eligibility and default bundles.
  3. ABAC narrows the decision for context, freshness, object attributes, and requested duration.
  4. The broker returns specific capabilities or denies the request.
  5. Audit records the plan, decision, grant, use, release, and revocation.

The composition result is still a CapSet, leased cap, wrapper cap, or denial.
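
The five steps can be sketched as one decision pipeline; the boolean inputs stand in for the mandatory, RBAC, and ABAC stages, and the audit step is elided:

```rust
// Illustrative outcome type: the broker returns a leased cap or a denial.
enum Outcome {
    Grant { cap: &'static str, lease_ms: u64 },
    Deny(&'static str),
}

fn compose(
    mandatory_ok: bool,         // 1. mandatory policy: the authority ceiling
    role_eligible: bool,        // 2. RBAC: coarse eligibility
    abac_lease_ms: Option<u64>, // 3. ABAC: contextual narrowing (None = deny)
    requested: &'static str,
) -> Outcome {
    if !mandatory_ok {
        return Outcome::Deny("mandatory policy ceiling");
    }
    if !role_eligible {
        return Outcome::Deny("role not eligible");
    }
    match abac_lease_ms {
        // 4. the broker returns a specific leased cap;
        // 5. audit (elided here) records plan, decision, and grant.
        Some(lease_ms) => Outcome::Grant { cap: requested, lease_ms },
        None => Outcome::Deny("context check failed"),
    }
}
```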

Service Architecture

The policy stack should be decomposed into ordinary capOS services. Init or a trusted supervisor grants broad authority only to the small services that need to mint narrower caps.

SessionManager

Creates session metadata caps:

  • guest() for local guest sessions,
  • anonymous(purpose) for ephemeral unauthenticated callers,
  • later login(method, proof) for authenticated users.

The first implementation can be boot-config backed. It does not need a persistent account database. UserSession should describe the principal, session ID, profile, auth strength, expiry, and audit context. It should not be a general-purpose authority vending machine unless it was itself minted as a narrow wrapper around a fixed cap bundle.

Safer first split:

SessionManager -> UserSession metadata cap
AuthorityBroker(session, profile) -> base cap bundle
Supervisor/Launcher -> spawn shell with that bundle

AuthorityBroker

The broker owns or receives powerful caps from init/supervisors and returns narrow caps after RBAC, ABAC, and mandatory checks.

Examples:

  • broad ProcessSpawner -> RestrictedLauncher(allowed = ["shell", "editor"]),
  • broad NamespaceRoot -> Namespace("/users/alice"),
  • broad ServiceSupervisor -> LeasedSupervisor("net-stack", expires = 60s),
  • broad BootPackage -> BinaryProvider(allowed = ["shell", "editor"]).

The broker is the normal policy decision and cap minting point.

AuditLog

Append-only audit interface. Initially this can write to serial or a bounded log buffer; later it should be Store-backed.

Record at least:

  • session creation,
  • cap request,
  • policy input summary,
  • policy decision,
  • cap grant,
  • cap release or revocation,
  • denial,
  • declassification or relabel operation.

Audit entries must not contain raw auth proofs, private keys, bearer tokens, or broad environment dumps.

RoleDirectory

Role lookup should start static and boot-config backed:

guest -> guest-shell
alice -> developer
ops -> net-operator
net-stack -> service:network

This is enough for early RBAC bundles. Dynamic role assignment can wait for persistent storage and administrative tooling.

LabelAuthority

Owns the label lattice and dominance checks. In the pragmatic phase, it is a userspace dependency of brokers and wrappers. In a GOST-style phase, the same lattice needs a kernel-visible representation for transfer checks.

Wrapper Caps

Wrappers are the main mechanism. Prefer them over per-call ACL checks in a central service:

  • RestrictedLauncher wraps ProcessSpawner.
  • ScopedNamespace wraps a broader namespace/store.
  • ScopedLogReader filters by session ID or service subtree.
  • LeasedSupervisor wraps a broader supervisor with expiry and target binding.
  • ApplicationManifestUpdater rejects kernel/device/service-manager grants.
  • LabelledEndpoint enforces declared data-flow and control-flow constraints.

This keeps authority visible in the capability graph.
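
A sketch of the wrapper pattern, using RestrictedLauncher as the example; the trait shape is an assumption, not the real ProcessSpawner interface:

```rust
// Assumed shape of the broad spawner authority.
trait ProcessSpawner {
    fn spawn(&self, binary: &str) -> Result<u32, &'static str>; // returns pid
}

// The wrapper holds the broad cap privately and exposes only an
// allow-listed subset; the caller never sees `inner`.
struct RestrictedLauncher<S: ProcessSpawner> {
    inner: S,
    allowed: Vec<&'static str>, // e.g. ["shell", "editor"]
}

impl<S: ProcessSpawner> RestrictedLauncher<S> {
    fn launch(&self, binary: &str) -> Result<u32, &'static str> {
        if self.allowed.iter().any(|a| *a == binary) {
            self.inner.spawn(binary)
        } else {
            Err("binary not in launcher allow list")
        }
    }
}

// Stub spawner for illustration only.
struct NullSpawner;
impl ProcessSpawner for NullSpawner {
    fn spawn(&self, _binary: &str) -> Result<u32, &'static str> { Ok(7) }
}
```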

Bootstrap Sequence

Early boot can be static:

init
  -> starts AuditLog
  -> starts SessionManager
  -> starts AuthorityBroker with broad caps
  -> asks broker for a system, guest, or operator shell bundle
  -> spawns shell through a restricted launcher

Before durable storage exists, policy config comes from BootPackage / manifest config. Before real authentication exists, support guest, anonymous, and localPresence only.

Revocation, Audit, and Quotas

User/session policy depends on the Stage 6 authority graph work:

  • badge metadata lets shared services distinguish session/client relations,
  • resource ledgers and session quotas prevent denial-of-service through session creation,
  • CAP_OP_RELEASE and process-exit cleanup reclaim local hold edges,
  • epoch revocation lets a broker invalidate leased or compromised caps,
  • audit logs record the cap grant and release lifecycle.

Audit should record identity and policy metadata, but it should not contain secrets, raw authentication proofs, or broad environment dumps.

Implementation Plan

  1. Document the model. Keep user identity out of the kernel architecture and link this proposal from the shell, service, storage, and roadmap docs.

  2. Session-aware native shell profile. Treat the shell proposal’s minimal daily cap set as a session bundle. Add self/session introspection and scoped logs/home caps once the underlying services exist.

  3. Authority broker and audit log. Add ActionPlan, CapRequest, ApprovalClient, leased grant records, and an append-only audit path. Start with RBAC-style profile bundles and explicit local authentication.

  4. ABAC policy engine. Extend the broker decision with session freshness, auth strength, object attributes, requested duration, and environment state. Prefer Cedar for the runtime broker interface; use OPA/Rego for host-side manifest and deployment checks. Keep decisions visible in audit records.

  5. Mandatory policy labels. Add pragmatic labels to policy-managed services and wrappers. Keep confidentiality and integrity separate. Defer kernel-visible labels until a specific MAC/MIC policy cannot be enforced by trusted grant paths.

  6. Guest and anonymous demos. Show a guest shell with terminal, tmp, and restricted launcher, and show an anonymous workload with strict quotas and no persistent storage.

  7. POSIX profile adapter. Provide synthetic uid/gid, $HOME, /etc/passwd, and cwd behavior from a session profile and granted namespace caps.

  8. GOST-style formalization checkpoint. If capOS later claims GOST-style MAC/MIC, write the abstract state model before implementation: users, subjects, objects, containers, access rights, accesses, labels, control relations, and information flows. Then decide which labels must become kernel-visible.

Non-Goals

  • No kernel uid/gid.
  • No ambient root.
  • No global login namespace in the kernel.
  • No authorization from serialized identity structs.
  • No model-visible authentication secrets.
  • No POSIX permission bits as system-wide authority.
  • No per-call role/attribute/label interpreter in the kernel fast path.
  • No claim of GOST-style MAC/MIC until the formal model and transfer enforcement story exist.

Open Questions

  • Which session interfaces are needed before persistent storage exists?
  • Should UserSession.defaultCaps() return actual caps or a plan that must be executed by ProcessSpawner?
  • Which audit store is acceptable before durable storage and replay exist?
  • Which MAC policies, if any, justify kernel-visible hold-edge labels?
  • How should remote capnp-rpc identities map into local principals?
  • Should the first broker prototype embed Cedar directly, or use a simpler hand-written evaluator until the policy surface stabilizes?

Proposal: Cloud Instance Bootstrap

Picking up instance-specific configuration — SSH keys, hostname, network config, user-supplied payload — from cloud provider metadata sources, without porting the Canonical cloud-init stack.

Problem

A capOS ISO built once has to boot on any cloud VM and adapt to its environment: different instance IDs, different public IPs, different operator-supplied SSH keys, different user-data payloads. Without this, every instance needs a custom-baked ISO, and the content-addressed-boot story (“same hash boots identically on N machines”) loses its value exactly where it would matter most for operations.

The Linux convention is cloud-init: a Python daemon that reads metadata from provider-specific sources and applies it by writing files under /etc, invoking systemctl, creating users, and running shell scripts. Porting it is a non-starter:

  • Python, POSIX, systemd-dependent.
  • Runs as root with ambient authority: parses untrusted user-data as shell scripts, mutates arbitrary system state.
  • ~100k lines covering dozens of rarely-used modules (chef, puppet, seed_random, phone_home).
  • Assumes a package manager and init system that do not exist on capOS.

capOS needs the pattern — consume provider metadata, use it to bootstrap the instance — reshaped to the capability model.

Metadata Sources

All major clouds expose instance metadata through one or more of:

  • HTTP IMDS. 169.254.169.254. AWS IMDSv2 requires a PUT token-exchange handshake; GCP and Azure accept a single GET carrying a mandatory header (Metadata-Flavor: Google and Metadata: true, respectively). Paths differ per provider. Needs a running network stack.
  • ConfigDrive. An ISO9660 filesystem attached as a block device, containing meta_data.json (or equivalent) and an optional user-data file. OpenStack, older Azure. Needs a block driver and filesystem reader, no network.
  • SMBIOS / DMI. Vendor, product, serial-number, UUID fields populated by the hypervisor. Good for provider detection before networking comes up.
  • NoCloud. Seed files baked into the image or on an attached FAT disk. Useful for development and bare-metal.

The bootstrap service should read from whichever source is present rather than hardcoding one. Provider detection via SMBIOS runs first (no dependencies), then the appropriate transport is initialized.
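
The first detection step can be sketched as a dependency-free match over SMBIOS strings. The vendor strings are ones the major hypervisors commonly populate; the enum is ours, and a real probe would also consult BIOS and chassis fields:

```rust
#[derive(Debug, PartialEq)]
enum Provider { Aws, Gcp, Azure, OpenStack, Unknown }

// Probe using only SMBIOS System Manufacturer / Product Name, available
// before any network or block transport is initialized.
fn detect(system_manufacturer: &str, product_name: &str) -> Provider {
    let m = system_manufacturer.to_ascii_lowercase();
    let p = product_name.to_ascii_lowercase();
    if m.contains("amazon") || p.contains("amazon ec2") {
        Provider::Aws
    } else if m.contains("google") {
        Provider::Gcp
    } else if m.contains("microsoft") {
        Provider::Azure
    } else if p.contains("openstack") {
        Provider::OpenStack
    } else {
        Provider::Unknown // fall back to NoCloud or manifest-seeded config
    }
}
```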

CloudMetadata Capability

A single capnp interface; one or more implementations:

interface CloudMetadata {
    # Instance identity
    instanceId    @0 () -> (id :Text);
    instanceType  @1 () -> (type :Text);
    hostname      @2 () -> (name :Text);
    region        @3 () -> (region :Text);

    # Network configuration (primary interface addresses, gateway, DNS)
    networkConfig @4 () -> (config :NetworkConfig);

    # Authentication material
    sshKeys       @5 () -> (keys :List(Text));

    # User-supplied payload. Opaque to the metadata provider.
    userData      @6 () -> (data :Data, contentType :Text);

    # Vendor-supplied payload. Separate from userData so the
    # bootstrap policy can trust them differently.
    vendorData    @7 () -> (data :Data, contentType :Text);
}

struct NetworkConfig {
    interfaces @0 :List(Interface);

    struct Interface {
        macAddress @0 :Text;
        ipv4       @1 :List(IpAddress);
        ipv6       @2 :List(IpAddress);
        gateway    @3 :Text;
        dnsServers @4 :List(Text);
        mtu        @5 :UInt16;
    }
}

Implementations:

  • HttpMetadata — fetches from 169.254.169.254; one variant per provider because paths and auth handshakes differ (AWS IMDSv2 token, GCP Metadata-Flavor: Google, Azure API version).
  • ConfigDriveMetadata — reads an ISO9660 seed disk.
  • NoCloudMetadata — reads a seed blob from the initial manifest.
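The AWS IMDSv2 handshake mentioned above is the one transport quirk worth pinning down early. A hedged sketch, shown as raw HTTP request text since the capOS HTTP client does not exist yet; header names and paths follow the public IMDSv2 documentation:

```rust
// Sketch of the IMDSv2 two-step: PUT a token request with a TTL header,
// then GET metadata paths with the returned token attached.

/// Step 1: request a session token valid for `ttl_secs` seconds.
fn token_request(ttl_secs: u32) -> String {
    format!(
        "PUT /latest/api/token HTTP/1.1\r\n\
         Host: 169.254.169.254\r\n\
         X-aws-ec2-metadata-token-ttl-seconds: {ttl_secs}\r\n\r\n"
    )
}

/// Step 2: fetch a metadata path using the token from step 1.
fn metadata_request(path: &str, token: &str) -> String {
    format!(
        "GET /latest/meta-data/{path} HTTP/1.1\r\n\
         Host: 169.254.169.254\r\n\
         X-aws-ec2-metadata-token: {token}\r\n\r\n"
    )
}
```

GCP and Azure skip step 1 entirely: GCP wants a `Metadata-Flavor: Google` header on plain GETs, Azure an API-version query parameter, which is why each provider gets its own HttpMetadata variant.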

Detection lives in a small probe service that inspects SMBIOS (System Manufacturer: Google, Amazon EC2, Microsoft Corporation, …) and grants the cloud-bootstrap service the appropriate CloudMetadata implementation as part of a manifest delta.

Bootstrap Service

A single service — cloud-bootstrap — runs once per boot:

cloud-bootstrap:
  caps:
    - metadata: CloudMetadata        # from probe service
    - manifest: ManifestUpdater      # narrow authority to extend the graph
    - network:  NetworkConfigurator  # apply interface addresses
    - ssh_keys: KeyStore             # target store for authorized keys
  user_data_handlers:
    - application/x-capos-manifest: ManifestDeltaHandler
    # operator-installed handlers for other content types

Sequence:

  1. Gather identity and declarative config (instanceId, hostname, networkConfig, sshKeys), apply through the narrow caps above.
  2. (data, ct) = metadata.userData() — dispatch by content type. If no handler is registered, log and skip.
  3. Exit.

The service never holds ProcessSpawner directly. It holds ManifestUpdater, a wrapper that accepts capnp-encoded ManifestDelta messages and applies them through the existing init spawn path. The decoder and apply path are shared with the build-time pipeline (same capos-config crate, same spawn loop). The precise shape of ManifestDelta is an open question — see “Open Questions” below — but at minimum it covers hostname, network config, SSH keys, and authorized application-level service additions:

struct ManifestDelta {
    addServices      @0 :List(ServiceEntry);
    addBinaries      @1 :List(NamedBlob);
    setHostname      @2 :Text;
    setNetworkConfig @3 :NetworkConfig;
}
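The merge routine the apply path needs can be sketched on simplified stand-ins for the capnp-generated types. All names here (`apply_delta`, the struct fields) are hypothetical; the real types live in `capos-config`.

```rust
// Sketch: SystemManifest + ManifestDelta -> services still to spawn.
// Duplicate names are rejected so a delta can only augment the base
// graph, never replace parts of it.

#[derive(Debug, Clone, PartialEq)]
pub struct ServiceEntry {
    pub name: String,
    pub binary: String,
}

#[derive(Debug, Default)]
pub struct SystemManifest {
    pub services: Vec<ServiceEntry>,
    pub hostname: Option<String>,
}

#[derive(Debug, Default)]
pub struct ManifestDelta {
    pub add_services: Vec<ServiceEntry>,
    pub set_hostname: Option<String>,
}

pub fn apply_delta(
    manifest: &mut SystemManifest,
    delta: ManifestDelta,
) -> Result<Vec<ServiceEntry>, String> {
    let mut to_spawn = Vec::new();
    for entry in delta.add_services {
        if manifest.services.iter().any(|s| s.name == entry.name) {
            return Err(format!("duplicate service: {}", entry.name));
        }
        manifest.services.push(entry.clone());
        to_spawn.push(entry);
    }
    if let Some(h) = delta.set_hostname {
        manifest.hostname = Some(h);
    }
    Ok(to_spawn)
}
```

The returned list is what ManifestUpdater would feed to the existing init spawn loop; the base manifest is mutated only additively.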

Relationship to the Build-Time Manifest Pipeline

The existing build-time pipeline (system.cue → tools/mkmanifest → manifest.bin → Limine boot module → capos-config decoder → init spawn loop) and the cloud-metadata bootstrap path are not two parallel systems. They are the same pipeline with different transports and different trust scopes.

| Stage | Build-time (baked ISO) | Runtime (cloud metadata) |
|---|---|---|
| Authoring | system.cue in the repo | user-data.cue on the operator's host |
| Compile | mkmanifest (CUE → capnp) | same tool, same output |
| Transport | Limine boot module | HTTP IMDS / ConfigDrive / NoCloud disk |
| Wire format | capnp-encoded SystemManifest | capnp-encoded ManifestDelta |
| Decoder | capos-config | capos-config |
| Apply | init spawn loop | same spawn loop, invoked via ManifestUpdater |

Three practical consequences:

  • CUE is a host-side authoring convenience, not an on-wire format. Neither kernel nor init evaluates CUE. An operator supplying user-data writes user-data.cue, runs `mkmanifest user-data.cue > user-data.bin` on their host, and ships the capnp bytes (base64 into `--metadata user-data=@user-data.bin` for GCP/AWS, or as a file on a ConfigDrive ISO).

  • NoCloud is a Limine boot module by another name. A NoCloud seed blob is the same bytes as a baked-in manifest.bin, attached via a disk or bundled into the ISO instead of handed over by the bootloader. The only difference is who hands the bytes to the parser.
  • No new schema surface. ManifestDelta is defined alongside SystemManifest in schema/capos.capnp, and sharing the decoder means ManifestUpdater’s apply path is a thin merge-and-spawn on top of code that already boots the base system.

The trust model stays clean precisely because ManifestDelta is not SystemManifest. The base manifest is inside the content-addressed ISO hash (fully trusted, reproducible). The runtime delta is applied by a narrowly-permitted service whose caps define what fields of the delta can actually take effect — the content-addressed-boot story is preserved because cloud metadata augments the base graph, it cannot replace it.

User-Data Model

User-data on the wire is a capnp blob, not a shell script. Content type application/x-capos-manifest identifies the canonical case: the payload is a ManifestDelta message produced by mkmanifest on the operator’s host and consumed directly by the bootstrap service.

For compatibility across cloud vendors, operators can install user-data dispatcher services for other content types (YAML, other capnp schemas, signed manifests, etc.). The bootstrap service holds a handler cap per content type; unknown types are logged and ignored, not executed.
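The dispatch rule is small enough to sketch directly. This is an illustrative stand-in: `Handler` here is a plain function pointer where the real system would hold a capability proxy per content type.

```rust
use std::collections::HashMap;

// Sketch: content-type dispatch with no fallback. A type without a
// registered handler is logged and skipped, never executed.

type Handler = fn(&[u8]) -> Result<(), String>;

pub struct UserDataDispatcher {
    pub handlers: HashMap<String, Handler>,
}

impl UserDataDispatcher {
    /// Returns true if a handler ran successfully, false otherwise.
    pub fn dispatch(&self, content_type: &str, data: &[u8]) -> bool {
        match self.handlers.get(content_type) {
            Some(h) => h(data).is_ok(),
            None => {
                // No handler cap for this type: log and skip. There is
                // deliberately no "try to run it as shell" branch.
                eprintln!("user-data: no handler for {content_type}, skipping");
                false
            }
        }
    }
}
```

The capability framing means the absence of a `text/x-shellscript` entry in the map is not a configuration gap; it is the default security posture.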

Shell-script user-data — the Linux default — has nowhere to run on capOS because there is no shell and no ambient-authority process to execute it under. An operator who insists on this can install a shell service and a handler that routes text/x-shellscript to it, but that is a deliberate choice, not a default fallback.

Trust Model

The capability angle earns its keep here.

  • The metadata endpoint is assumed to be as trustworthy as the hypervisor running the VM — the same assumption Linux cloud-init makes.
  • The bootstrap service holds narrow caps (ManifestUpdater, NetworkConfigurator, KeyStore), not ambient root. A bug or a malicious metadata response can at most spawn services the ManifestUpdater accepts, set network config the NetworkConfigurator accepts, and drop keys into the KeyStore. It cannot reach for arbitrary system state.
  • vendorData and userData are separated on the wire. A policy that trusts the cloud provider but not the operator (e.g., apply vendorData as-is, route userData through a signature check) is expressible by granting different handler caps to each.
  • User-data content-type dispatch is capability-mediated: the bootstrap service cannot execute a content type it wasn’t given a handler for. There is no fallback “try to run it as shell.”

Phased Implementation

Most of the manifest-handling machinery already exists from the build-time pipeline (capos-config, mkmanifest, init’s spawn loop). The new work is transports, provider detection, and the ManifestDelta merge semantics.

  1. ManifestDelta schema and ManifestUpdater cap. Add the delta type to schema/capos.capnp alongside SystemManifest, extend capos-config with a merge routine (SystemManifest + ManifestDelta → new services to spawn), and expose ManifestUpdater as a cap in init. NoCloudMetadata seeded from a test fixture is enough to demo the apply path end-to-end without any cloud dependency.
  2. Provider detection via SMBIOS. Kernel-side primitive or capability that reads SMBIOS DMI tables and exposes manufacturer / product strings. No network required.
  3. ConfigDrive support. ISO9660 reader plus ConfigDriveMetadata. Gives a working real-transport metadata source with no dependency on userspace networking. QEMU can attach one via -drive file=configdrive.iso,if=virtio for local testing.
  4. HttpMetadata per provider. Requires the userspace network stack (Stage 6+). GCP first (simplest auth), then AWS (IMDSv2 token flow), then Azure.
  5. Cross-provider Cloud Metadata demo. Same ISO hash boots under QEMU, GCP, AWS, and Azure; the only difference is the SMBIOS manufacturer string, which the probe service uses to pick the right HttpMetadata variant. This is the Cloud Metadata observable milestone.

Open Questions

Which fields of system.cue are runtime-modifiable?

system.cue today is a handful of service entries with kernel Console cap grants encoded as structured source variants. That will grow. Plausible additions as capOS matures: driver process definitions (virtio-net, virtio-blk, NVMe) with device MMIO, interrupt, and frame allocator grants; scheduler tuning (priority, budget, CPU pinning); filesystem driver services; memory-policy hooks; ACPI/SMBIOS consumers.

Most of those are either fragile (kernel-adjacent; a bad value bricks the instance), sensitive (granting kernel:frame_allocator to a user-data-declared service is effectively root), or both. A ManifestDelta with full SystemManifest equivalence hands every such knob to whoever controls user-data.

The narrowing has to happen somewhere, but there are several places it could live:

  1. Different schema. ManifestDelta is not structurally a subset of SystemManifest — it omits driver entries, scheduler config, and kernel cap sources entirely. Schema-level guarantee; rigid but unambiguous.
  2. Shared schema, policy-narrowing cap. ManifestUpdater accepts a full delta but validates at apply time: kernel source variants are rejected unless explicitly allow-listed by the cap’s parameters; additions that touch driver-level service entries fail. Flexible, but the narrowing logic is code that has to be audited, not a schema that is self-documenting.
  3. Tiered deltas. PrivilegedDelta (drivers, scheduler) and ApplicationDelta (hostname, SSH keys, app services), minted by different caps. An operator supervisor holds PrivilegedManifestUpdater; cloud-bootstrap holds only ApplicationManifestUpdater. Compositional; matches the capability-model grain but doubles the schema surface.
  4. Tag-based field permissions. Fields in ServiceEntry carry a privilege tag; ManifestUpdater is parameterized with a permitted-tag set. One schema, orthogonal policy.

Picking one prematurely would either over-constrain the cloud path (option 1 before we know what apps legitimately need) or under-constrain it (option 2 without clarity on what to check against). This proposal commits only to the shared pipeline (decoder, spawn loop, authoring tool). The shape of the public type(s) the cap accepts is deferred until system.cue has grown enough that the privileged vs. application split is visible in concrete form.

Related open question: whether kernel cap sources should be expressible in system.cue at all, or whether the build-time manifest should also declare them through a narrower mechanism so that the same discipline that protects cloud user-data also protects the baked-in manifest from accidental over-grants. If they remain expressible, they should be structured enum/union variants, not free-form strings; the associated interface TYPE_ID is only a schema compatibility check and does not identify the authority being granted.

Non-Goals

  • cloud-init compatibility. No parsing of #cloud-config YAML, no #!/bin/bash execution, no include-url, no MIME multipart handling. Operators who need these install their own dispatcher services; the base system does not.
  • Runtime package installation. The capOS equivalent of “install nginx on boot” is “include nginx in the manifest.” User-data can add services to the manifest; it cannot install packages (there is no package manager to install into).
  • Re-running on every boot. cloud-init distinguishes per-boot, per-instance, and per-once modules. The capOS bootstrap service runs once per boot; the manifest it produces is cached under the instance ID, and subsequent boots read the cache and skip the metadata round-trip. A full mode matrix is future work.
  • IPv6-only bring-up in the first iteration. Many clouds expose both; the schema supports both; the first implementations do whichever is easier per provider (typically IPv4).
  • Automatic secret rotation. Metadata often exposes short-lived credentials (IAM role tokens on AWS, service-account tokens on GCP). Refresh logic belongs to the service that consumes the credential, not to cloud-bootstrap.

Prior Art

  • cloud-init (Canonical). The Linux reference. Huge scope, shell-script-centric, assumes root and POSIX. The capOS design intentionally takes the pattern and drops everything that depends on ambient authority.
  • ignition (CoreOS/Flatcar). Runs once in initramfs, consumes a JSON spec, fails-fast if the spec can’t be applied. Closer in spirit to the capOS design — small, single-pass, declarative. Worth studying for its rollback and error-handling approach.
  • AWS IMDSv2. The token-exchange handshake is the one thing the HTTP client needs to handle that is not plain GETs. Designing the HttpMetadata interface without accounting for it up front leads to a rewrite later.

Proposal: Hardware Abstraction and Cloud Deployment

How capOS goes from “boots in QEMU” to “boots on a real cloud VM” (GCP, AWS, Azure). This covers the hardware abstraction infrastructure missing between the current QEMU-only kernel and real x86_64 hardware, plus the build system changes needed to produce deployable images.

Depends on: Kernel Networking Smoke Test (for PCI enumeration), Stage 5 (for LAPIC timer), Stage 7 / SMP proposal Phase A (for LAPIC init).

Complements: Networking proposal (extends virtio-net to cloud NICs), Storage proposal (extends virtio-blk to NVMe), SMP proposal (LAPIC infrastructure shared).


Current State

The kernel boots via Limine UEFI, outputs to COM1 serial, and uses QEMU-specific features (isa-debug-exit). No PCI, no ACPI, no interrupt controller beyond the legacy PIC (implicitly via Limine setup). The only build artifact is an ISO.

What Cloud VMs Provide

GCP (n2-standard), AWS (m6i/c7i), and Azure (Dv5) all expose:

| Resource | Cloud interface | capOS status |
|---|---|---|
| Boot firmware | UEFI (all three) | Limine UEFI works |
| Serial console | COM1 0x3F8 | Works (serial.rs) |
| Boot media | GPT disk image (raw/VMDK/VHD) | Missing (ISO only) |
| Storage | NVMe (EBS, PD, Managed Disk) | Missing |
| NIC | ENA (AWS), gVNIC (GCP), MANA (Azure) | Missing |
| Virtio NIC | GCP (fallback), some bare-metal | Missing (planned) |
| Timer | LAPIC, TSC, HPET | Missing |
| Interrupt delivery | I/O APIC, MSI/MSI-X | Missing |
| Device discovery | ACPI + PCI/PCIe | Missing |
| Display | None (headless) | N/A |

What Already Works

  • UEFI boot – Limine ISO includes BOOTX64.EFI. The boot path itself is cloud-compatible.
  • Serial output – all three clouds expose COM1. gcloud compute instances get-serial-port-output, aws ec2 get-console-output, and Azure serial console all read from it.
  • x86_64 long mode – cloud VMs are KVM-based x86_64. Architecture matches.

Phase 1: Bootable Disk Image

Goal: Produce a GPT disk image that cloud VMs can boot from, alongside the existing ISO for QEMU.

The Problem

Cloud VMs boot from disk images, not ISOs. Each cloud has a preferred format:

| Cloud | Image format | Import method |
|---|---|---|
| GCP | raw (tar.gz) | gcloud compute images create --source-uri=gs://... |
| AWS | raw, VMDK, VHD | aws ec2 import-image or register-image with EBS snapshot |
| Azure | VHD (fixed size) | az image create --source |

All require a GPT-partitioned disk with an EFI System Partition (ESP) containing the bootloader.

Disk Layout

GPT disk image (64 MB minimum)
  Partition 1: EFI System Partition (FAT32, ~32 MB)
    /EFI/BOOT/BOOTX64.EFI     (Limine UEFI loader)
    /limine.conf               (bootloader config)
    /boot/kernel               (capOS kernel ELF)
    /boot/init                 (init process ELF)
  Partition 2: (reserved for future use -- persistent store backing)

Build Tooling

New Makefile target make image using standard tools:

IMAGE := capos.img
IMAGE_SIZE := 64  # MB

image: kernel init $(LIMINE_DIR)
	# Create raw disk image
	dd if=/dev/zero of=$(IMAGE) bs=1M count=$(IMAGE_SIZE)
	# Partition with GPT + ESP
	sgdisk -n 1:2048:+32M -t 1:ef00 $(IMAGE)
	# Format ESP as FAT32, copy files
	# (mtools or loop mount + mkfs.fat)
	mformat -i $(IMAGE)@@1M -F -T 65536 ::
	mcopy -i $(IMAGE)@@1M $(LIMINE_DIR)/BOOTX64.EFI ::/EFI/BOOT/
	mcopy -i $(IMAGE)@@1M limine.conf ::/
	mcopy -i $(IMAGE)@@1M $(KERNEL) ::/boot/kernel
	mcopy -i $(IMAGE)@@1M $(INIT) ::/boot/init
	# Install Limine
	# bios-install is for hybrid BIOS/UEFI boot in local QEMU testing.
	# For cloud-only images (UEFI-only), this line can be omitted.
	$(LIMINE_DIR)/limine bios-install $(IMAGE)

New QEMU target to test disk boot locally:

run-disk: $(IMAGE)
	qemu-system-x86_64 -drive file=$(IMAGE),format=raw \
		-bios /usr/share/edk2/x64/OVMF.4m.fd \
		-display none $(QEMU_COMMON); \
	test $$? -eq 1

Cloud upload helpers (scripts, not Makefile targets):

# GCP
tar czf capos.tar.gz capos.img
gsutil cp capos.tar.gz gs://my-bucket/
gcloud compute images create capos --source-uri=gs://my-bucket/capos.tar.gz

# AWS
aws ec2 import-image --disk-containers \
  "Format=raw,UserBucket={S3Bucket=my-bucket,S3Key=capos.img}"

Dependencies

  • sgdisk (gdisk package) – GPT partitioning
  • mtools (mformat, mcopy) – FAT32 manipulation without root/loop mount

Scope

~30 lines of Makefile + a helper script for cloud uploads. No kernel changes.


Phase 2: ACPI and Device Discovery

Goal: Parse ACPI tables to discover hardware topology, interrupt routing, and PCI root complexes. This replaces QEMU-specific hardcoded assumptions.

Why ACPI

On QEMU with default settings, you can hardcode PCI config space at 0xCF8/0xCFC and assume legacy interrupt routing. On real cloud hardware:

  • PCI root complex addresses come from ACPI MCFG table (PCIe ECAM)
  • Interrupt routing comes from ACPI MADT (I/O APIC entries) and _PRT
  • CPU topology comes from ACPI MADT (LAPIC entries)
  • Timer info comes from ACPI HPET/PMTIMER tables

Limine provides the RSDP (Root System Description Pointer) address via its protocol. From there, the kernel can walk RSDT/XSDT to find specific tables.

Required Tables

| Table | Purpose | Priority |
|---|---|---|
| MADT | LAPIC and I/O APIC addresses, CPU enumeration | High (Phase 2) |
| MCFG | PCIe Enhanced Configuration Access Mechanism base | High (Phase 2) |
| HPET | High Precision Event Timer address | Medium (fallback timer) |
| FADT | PM timer, shutdown/reset methods | Low (future) |

Implementation

// kernel/src/acpi.rs

/// Minimal ACPI table parser.
/// Walks RSDP -> XSDT -> individual tables.
/// Does NOT implement AML interpretation -- static tables only.

pub struct AcpiInfo {
    pub lapics: Vec<LapicEntry>,
    pub io_apics: Vec<IoApicEntry>,
    pub iso_overrides: Vec<InterruptSourceOverride>,
    pub mcfg_base: Option<u64>,  // PCIe ECAM base address
    pub hpet_base: Option<u64>,
}

pub fn parse_acpi(rsdp_addr: u64, hhdm: u64) -> AcpiInfo { ... }

Use the acpi crate (no_std, well-maintained) for parsing rather than hand-rolling. It handles RSDP, RSDT/XSDT, MADT, MCFG, and HPET.

Limine RSDP

use limine::request::RsdpRequest;

static RSDP: RsdpRequest = RsdpRequest::new();

// In kmain:
let rsdp_addr = RSDP.response().expect("no RSDP").address() as u64;
let acpi_info = acpi::parse_acpi(rsdp_addr, hhdm_offset);

Crate Dependencies

| Crate | Purpose | no_std |
|---|---|---|
| acpi | ACPI table parsing (MADT, MCFG, etc.) | yes |

Scope

~200-300 lines of glue code wrapping the acpi crate. The crate does the heavy lifting.


Phase 3: Interrupt Infrastructure

Goal: Set up I/O APIC for device interrupt routing and MSI/MSI-X for modern PCI devices. This replaces the implicit legacy PIC setup.

I/O APIC

The I/O APIC routes external device interrupts (keyboard, serial, PCI devices) to specific LAPICs (one per CPU). Its address and configuration come from the ACPI MADT (Phase 2).

// kernel/src/ioapic.rs

pub struct IoApic {
    base: *mut u32,  // MMIO registers via HHDM
}

impl IoApic {
    /// Route an IRQ to a specific LAPIC/vector.
    pub fn route_irq(&mut self, irq: u8, lapic_id: u8, vector: u8) { ... }

    /// Mask/unmask an IRQ line.
    pub fn set_mask(&mut self, irq: u8, masked: bool) { ... }
}

The I/O APIC must respect Interrupt Source Override entries from MADT (e.g., IRQ 0 might be remapped to GSI 2 on real hardware).
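The override lookup is a one-liner worth pinning down, since getting it wrong silently breaks the timer interrupt on real hardware. A sketch, with `InterruptSourceOverride` as a simplified stand-in for the MADT entry:

```rust
// Sketch: translate a legacy ISA IRQ to the Global System Interrupt the
// I/O APIC actually routes. Identity mapping applies when no MADT
// Interrupt Source Override entry exists for that IRQ.

#[derive(Clone, Copy)]
pub struct InterruptSourceOverride {
    pub source_irq: u8,
    pub gsi: u32,
}

pub fn irq_to_gsi(irq: u8, overrides: &[InterruptSourceOverride]) -> u32 {
    overrides
        .iter()
        .find(|o| o.source_irq == irq)
        .map(|o| o.gsi)
        .unwrap_or(irq as u32)
}
```

The classic case is IRQ 0 (PIT) remapped to GSI 2; QEMU emits exactly this override, so the path is testable before touching real hardware.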

MSI/MSI-X

Modern PCI/PCIe devices (NVMe, cloud NICs) use Message Signaled Interrupts instead of pin-based IRQs routed through the I/O APIC. An MSI/MSI-X interrupt is a memory write by the device to a special physical address range (0xFEExxxxx) that the target LAPIC decodes, bypassing the I/O APIC entirely.

This is critical for cloud deployment because:

  • NVMe controllers require MSI or MSI-X (no legacy IRQ fallback on many controllers)
  • Cloud NICs (ENA, gVNIC) use MSI-X exclusively
  • MSI-X supports per-queue interrupts (one vector per virtqueue/submission queue), enabling better SMP scalability

// kernel/src/pci/msi.rs

/// Configure MSI for a PCI device.
pub fn enable_msi(device: &PciDevice, vector: u8, lapic_id: u8) { ... }

/// Configure MSI-X for a PCI device.
pub fn enable_msix(
    device: &PciDevice,
    table_bar: u8,
    entries: &[(u16, u8, u8)],  // (index, vector, lapic_id)
) { ... }

MSI/MSI-X capability structures are found by walking the PCI capability list (already needed for PCI enumeration in the networking proposal).
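The message the capability structure is programmed with follows the x86_64 format from the Intel SDM: destination APIC ID in the address, vector in the data. A minimal sketch for the common case (fixed delivery, edge-triggered, physical destination mode):

```rust
/// Sketch: compose the x86_64 MSI message address and data for
/// fixed-delivery, edge-triggered MSI aimed at one LAPIC.
pub fn msi_message(lapic_id: u8, vector: u8) -> (u64, u32) {
    // Address: bits 31:20 = 0xFEE, bits 19:12 = destination APIC ID,
    // RH/DM bits clear (physical destination mode).
    let addr = 0xFEE0_0000u64 | ((lapic_id as u64) << 12);
    // Data: bits 7:0 = vector; delivery mode 000 (fixed), edge trigger.
    let data = vector as u32;
    (addr, data)
}
```

MSI-X repeats this per table entry, which is what makes one-vector-per-queue interrupt layouts possible.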

Integration with SMP

LAPIC initialization is shared between this phase and the SMP proposal (Phase A). If SMP is implemented first, the LAPIC is already available. If this phase comes first, it initializes the BSP's LAPIC and the SMP proposal extends that to the APs.

Scope

~300-400 lines total:

  • I/O APIC driver: ~150 lines
  • MSI/MSI-X setup: ~100-150 lines
  • Integration/routing logic: ~50-100 lines

Phase 4: PCI/PCIe Infrastructure

Goal: Standalone PCI bus enumeration and device management, usable by all device drivers (virtio-net, NVMe, cloud NICs).

The networking proposal includes PCI enumeration as a substep for finding virtio-net. This phase promotes it to a reusable kernel subsystem that all device drivers build on.

PCI Configuration Access

Two mechanisms, determined by ACPI:

  1. Legacy I/O ports (0xCF8/0xCFC) – works in QEMU, limited to 256 bytes of config space per function. Insufficient for PCIe extended capabilities.
  2. PCIe ECAM (Enhanced Configuration Access Mechanism) – memory-mapped config space, 4 KB per function. Base address from ACPI MCFG table. Required for MSI-X capability parsing and NVMe BAR discovery on real hardware.

Start with legacy I/O for QEMU, add ECAM when ACPI parsing (Phase 2) is available.
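Both mechanisms reduce to an address computation; a sketch of each (pure functions, with the actual port I/O and MMIO access omitted):

```rust
/// Sketch: build the 32-bit value written to port 0xCF8 to select a
/// legacy config-space dword: enable bit | bus | device | function |
/// dword-aligned register offset.
pub fn pci_config_address(bus: u8, device: u8, function: u8, offset: u8) -> u32 {
    0x8000_0000
        | ((bus as u32) << 16)
        | (((device as u32) & 0x1F) << 11)
        | (((function as u32) & 0x07) << 8)
        | ((offset as u32) & 0xFC)
}

/// Sketch: ECAM address of a function's 4 KB config window, given the
/// MCFG base (each function gets bus << 20 | device << 15 | func << 12).
pub fn ecam_address(mcfg_base: u64, bus: u8, device: u8, function: u8) -> u64 {
    mcfg_base
        | ((bus as u64) << 20)
        | (((device as u64) & 0x1F) << 15)
        | (((function as u64) & 0x07) << 12)
}
```

Keeping both behind one config-access trait lets the enumeration code in this phase run unchanged on QEMU (legacy I/O) and cloud hardware (ECAM).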

Device Enumeration

// kernel/src/pci/mod.rs

pub struct PciDevice {
    pub bus: u8,
    pub device: u8,
    pub function: u8,
    pub vendor_id: u16,
    pub device_id: u16,
    pub class: u8,
    pub subclass: u8,
    pub bars: [Option<Bar>; 6],
    pub interrupt_pin: u8,
    pub interrupt_line: u8,
}

pub enum Bar {
    Memory { base: u64, size: u64, prefetchable: bool },
    Io { base: u16, size: u16 },
}

/// Scan all PCI buses and return discovered devices.
pub fn enumerate() -> Vec<PciDevice> { ... }

/// Find a device by vendor/device ID.
pub fn find_device(vendor: u16, device: u16) -> Option<PciDevice> { ... }

/// Walk the PCI capability list for a device.
pub fn capabilities(device: &PciDevice) -> Vec<PciCapability> { ... }

BAR Mapping

Device drivers need MMIO access to BAR regions. The kernel maps BAR physical addresses into virtual address space (via HHDM for kernel-mode drivers, or via a DeviceMmio capability for userspace drivers as described in the networking proposal).

PCI Device IDs for Cloud Hardware

| Device | Vendor:Device | Cloud |
|---|---|---|
| virtio-net | 1AF4:1000 (transitional) or 1AF4:1041 (modern) | QEMU, GCP fallback |
| virtio-blk | 1AF4:1001 (transitional) or 1AF4:1042 (modern) | QEMU |
| NVMe | 8086:various, 144D:various, etc. | All clouds (EBS, PD, Managed Disk) |
| AWS ENA | 1D0F:EC20 / 1D0F:EC21 | AWS |
| GCP gVNIC | 1AE0:0042 | GCP |
| Azure MANA | 1414:00BA | Azure |

Scope

~400-500 lines:

  • Config space access (I/O + ECAM): ~100 lines
  • Bus enumeration: ~150 lines
  • BAR parsing and mapping: ~100 lines
  • Capability list walking: ~50-100 lines

Phase 5: NVMe Driver

Goal: Basic NVMe block device driver, sufficient to read/write sectors. This is the storage equivalent of virtio-net for networking – the first real storage driver.

Why NVMe Over virtio-blk

The storage-and-naming proposal mentions virtio-blk for Phase 3 (persistent store). On cloud VMs, all three providers expose NVMe:

  • AWS EBS – NVMe interface (even for gp3/io2 volumes)
  • GCP Persistent Disk – NVMe or SCSI (NVMe is default for newer VMs)
  • Azure Managed Disks – NVMe on newer VM series (Ev5, Dv5)

virtio-blk is QEMU-only. An NVMe driver unlocks persistent storage on all cloud platforms. For QEMU testing, QEMU also emulates NVMe well: -drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0.

NVMe Architecture

NVMe is a register-level standard with well-defined queue-pair semantics:

Application
    |
    v
Submission Queue (SQ) -- ring buffer of 64-byte command entries
    |
    | doorbell write (MMIO)
    v
NVMe Controller (hardware)
    |
    | DMA completion
    v
Completion Queue (CQ) -- ring buffer of 16-byte completion entries
    |
    | MSI-X interrupt
    v
Driver processes completions

Minimum viable driver needs:

  1. Admin Queue Pair (for identify, create I/O queues)
  2. One I/O Queue Pair (for read/write commands)
  3. MSI-X for completion notification (or polling)

Implementation Sketch

// kernel/src/nvme.rs (or kernel/src/drivers/nvme.rs)

pub struct NvmeController {
    bar0: *mut u8,          // MMIO registers
    admin_sq: SubmissionQueue,
    admin_cq: CompletionQueue,
    io_sq: SubmissionQueue,
    io_cq: CompletionQueue,
    namespace_id: u32,
    block_size: u32,
    block_count: u64,
}

impl NvmeController {
    pub fn init(pci_device: &PciDevice) -> Result<Self, NvmeError> { ... }
    pub fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), NvmeError> { ... }
    pub fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), NvmeError> { ... }
    pub fn identify(&self) -> NvmeIdentify { ... }
}

DMA Considerations

NVMe uses DMA for data transfer. The controller reads/writes directly from physical memory addresses provided in commands. Requirements:

  • Buffers must be physically contiguous (or use PRP lists / SGLs for scatter-gather)
  • Physical addresses must be provided (not virtual)
  • Cache coherence is handled by hardware on x86_64 (DMA-coherent architecture)

The existing frame allocator can provide physically contiguous pages. For larger transfers, PRP (Physical Region Page) lists allow scatter-gather.
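The PRP rules can be pinned down with a small decision function. A sketch assuming 4 KiB pages and a physically contiguous buffer; the names are illustrative:

```rust
// Sketch: NVMe PRP entry selection for one read/write command.
// PRP1 is always the buffer's physical address (it may be unaligned);
// PRP2 depends on how many pages the transfer crosses.

const PAGE: u64 = 4096;

#[derive(Debug, PartialEq)]
pub enum Prp2 {
    Unused,        // transfer ends within the first page
    Page(u64),     // exactly one more page: PRP2 is its address
    ListNeeded,    // more than two pages: PRP2 must point to a PRP list
}

pub fn choose_prps(buf_phys: u64, len: u64) -> (u64, Prp2) {
    // Bytes available in the first page, starting at the given offset.
    let first_page_bytes = PAGE - (buf_phys % PAGE);
    if len <= first_page_bytes {
        (buf_phys, Prp2::Unused)
    } else if len <= first_page_bytes + PAGE {
        // Second page is the next page boundary after the buffer start.
        let second = (buf_phys & !(PAGE - 1)) + PAGE;
        (buf_phys, Prp2::Page(second))
    } else {
        (buf_phys, Prp2::ListNeeded)
    }
}
```

The `ListNeeded` branch is where the frame allocator comes in: the PRP list itself must live in a physically addressable page the controller can DMA from.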

Crate Dependencies

| Crate | Purpose | no_std |
|---|---|---|
| (none) | NVMe register-level protocol is simple enough to implement directly | N/A |

The NVMe spec is cleaner than virtio and the register interface is straightforward. A minimal driver (admin + 1 I/O queue pair, read/write) is ~500-700 lines without external dependencies.

Integration with Storage Proposal

The storage proposal’s Phase 3 (Persistent Store) specifies virtio-blk as the backing device. This can be generalized to a BlockDevice trait:

trait BlockDevice {
    fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), Error>;
    fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), Error>;
    fn block_size(&self) -> u32;
    fn block_count(&self) -> u64;
}

Both NVMe and virtio-blk implement this trait. The store service doesn’t care which backing driver it uses.

Scope

~500-700 lines for a minimal in-kernel NVMe driver (admin queue + 1 I/O queue pair, read/write, identify). Userspace decomposition follows the same pattern as the networking proposal (kernel driver first, then extract to userspace process with DeviceMmio + Interrupt caps).


Phase 6: Cloud NIC Strategy

Goal: Define the path to networking on cloud VMs, given that each cloud uses a different proprietary NIC.

The Landscape

| Cloud | Primary NIC | Virtio NIC available? | Open-source driver? |
|---|---|---|---|
| GCP | gVNIC (1AE0:0042) | Yes (fallback option) | Yes (Linux, ~3000 LoC) |
| AWS | ENA (1D0F:EC20) | No (Nitro only) | Yes (Linux, ~8000 LoC) |
| Azure | MANA (1414:00BA) | No (accelerated networking) | Yes (Linux, ~6000 LoC) |

Short term: virtio-net on GCP

GCP allows selecting VIRTIO_NET as the NIC type when creating instances. This is a first-class option, not a legacy fallback. Combined with the virtio-net driver from the networking proposal, this gives cloud networking with zero additional driver work.

gcloud compute instances create capos-test \
    --image=capos \
    --machine-type=e2-micro \
    --network-interface=nic-type=VIRTIO_NET

Medium term: gVNIC driver

gVNIC is a simpler device than ENA or MANA. The Linux driver is ~3000 lines (vs ~8000 for ENA). It uses standard PCI BAR MMIO + MSI-X interrupts. A minimal gVNIC driver (init, link up, send/receive) would be ~800-1200 lines.

gVNIC is worth prioritizing because:

  • GCP is the only cloud with a virtio-net fallback, making it the natural first target
  • Graduating from virtio-net to gVNIC on the same cloud is a clean progression
  • The gVNIC register interface is documented in the Linux driver source

Long term: ENA and MANA

ENA and MANA are more complex and less well-documented outside their Linux drivers. These should be deferred until the driver model is mature (userspace drivers with DeviceMmio caps, as described in the networking proposal Part 2).

At that point, the kernel only needs to provide PCI enumeration + BAR mapping + MSI-X routing. The actual NIC driver logic runs in a userspace process, making it feasible to port from the Linux driver source with appropriate licensing considerations.

Alternative: Paravirt Abstraction Layer

Instead of writing native drivers for each cloud NIC, an alternative is a thin paravirt layer:

Application -> NetworkManager cap -> Net Stack (smoltcp) -> NIC cap -> [driver]

Where [driver] is one of:

  • virtio-net (QEMU, GCP fallback)
  • gvnic (GCP)
  • ena (AWS)
  • mana (Azure)

All drivers implement the same Nic capability interface from the networking proposal. The network stack and applications are driver-agnostic.

This is already the architecture described in the networking proposal. The only addition is recognizing that multiple driver implementations will exist behind the same Nic interface.


Phase Summary and Dependencies

graph TD
    P1[Phase 1: Disk Image Build] --> BOOT[Boots on Cloud VM]
    P2[Phase 2: ACPI Parsing] --> P3[Phase 3: Interrupt Infrastructure]
    P2 --> P4[Phase 4: PCI/PCIe]
    P3 --> P5[Phase 5: NVMe Driver]
    P4 --> P5
    P4 --> NET[Networking Smoke Test<br>virtio-net driver]
    P3 --> NET
    P4 --> P6[Phase 6: Cloud NIC Drivers]
    P3 --> P6
    NET --> P6

    S5[Stage 5: Scheduling] --> P3
    SMP_A[SMP Phase A: LAPIC] --> P3

    style P1 fill:#2d5,stroke:#333
    style BOOT fill:#2d5,stroke:#333

| Phase | Depends on | Estimated scope | Enables |
|---|---|---|---|
| 1: Disk image | Nothing | ~30 lines Makefile | Cloud boot |
| 2: ACPI | Nothing (kernel code) | ~200-300 lines | Phases 3, 4 |
| 3: Interrupts | Phase 2, LAPIC (SMP/Stage 5) | ~300-400 lines | NVMe, cloud NICs |
| 4: PCI/PCIe | Phase 2 | ~400-500 lines | All device drivers |
| 5: NVMe | Phases 3, 4 | ~500-700 lines | Cloud storage |
| 6: Cloud NICs | Phases 3, 4, networking smoke test | ~800-1200 lines each | Cloud networking |

Minimum Path to “Boots on Cloud VM, Prints Hello”

Phase 1 only. Everything else needed (serial, UEFI) already works. This is a build-system change, not a kernel change.

Minimum Path to “Useful on Cloud VM”

Phases 1-5 (disk image + ACPI + interrupts + PCI + NVMe) plus the existing roadmap items (Stages 4-6 for capability syscalls, scheduling, IPC). With GCP’s virtio-net fallback, networking can use the existing networking proposal without Phase 6.


QEMU Testing

All phases can be tested in QEMU before deploying to cloud:

| Phase | QEMU flags |
|---|---|
| Disk image | -drive file=capos.img,format=raw -bios OVMF.4m.fd |
| ACPI | Default QEMU provides ACPI tables (MADT, MCFG, etc.) |
| I/O APIC | Default QEMU emulates I/O APIC |
| PCI/PCIe | -device ... adds PCI devices; QEMU has PCIe root complex |
| NVMe | -drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0 |
| MSI-X | Supported by QEMU's NVMe and virtio-net-pci emulation |
| Multi-CPU | -smp 4 (already works with Limine SMP) |

aarch64 and ARM Cloud Instances

This proposal focuses on x86_64 because that’s the current kernel target, but ARM-based cloud instances are significant and growing:

| Cloud | ARM offering | Instance types |
|---|---|---|
| AWS | Graviton2/3/4 | m7g, c7g, r7g, etc. |
| GCP | Tau T2A (Ampere Altra) | t2a-standard-* |
| Azure | Cobalt 100 (Arm Neoverse) | Dpsv6, Dplsv6 |

ARM cloud VMs have the same general requirements (UEFI boot, ACPI tables, PCI/PCIe, NVMe storage) but different specifics:

  • Interrupt controller: GIC (Generic Interrupt Controller) instead of APIC. GICv3 is standard on cloud ARM instances.
  • Boot: UEFI via Limine (already targets aarch64). Limine handles the architecture differences at boot time.
  • Timer: ARM generic timer (CNTPCT_EL0) instead of LAPIC/PIT/TSC.
  • Serial: PL011 UART instead of 16550 COM1. Different register interface.
  • NIC: Same PCI devices (ENA, gVNIC, MANA) with the same register interfaces – PCI/PCIe is architecture-neutral.
  • NVMe: Same NVMe register interface – PCIe is architecture-neutral.

The arch-neutral parts of this proposal (PCI enumeration, NVMe, disk image format, ACPI table parsing) apply equally to aarch64. The arch-specific parts (I/O APIC, MSI delivery address format, LAPIC) need aarch64 equivalents (GIC, ARM MSI translation).

The existing roadmap lists “aarch64 support” as a future item. For cloud deployment, aarch64 should be considered as soon as the x86_64 hardware abstraction is stable, since:

  1. Device drivers (NVMe, virtio-net, cloud NICs) are architecture-neutral – they talk to PCI config space and MMIO BARs, which are the same on both architectures
  2. The acpi crate handles both x86_64 and aarch64 ACPI tables
  3. Limine already targets aarch64
  4. AWS Graviton instances are often cheaper than x86_64 equivalents

The main aarch64 kernel work is: exception handling (EL0/EL1 instead of Ring 0/3), GIC driver (instead of APIC), ARM generic timer, PL011 serial, and the MMU setup (4-level page tables exist on both but with different register interfaces).


Open Questions

  1. ACPI scope. The acpi crate can parse static tables (MADT, MCFG, HPET, FADT). Full ACPI requires AML interpretation (for _PRT interrupt routing, dynamic device enumeration). Do we need AML, or are static tables sufficient for cloud VMs? Cloud VM firmware typically provides simple, static ACPI tables – AML interpretation is likely unnecessary initially.

  2. PCIe ECAM vs legacy. Should we support both config access methods, or require ECAM (which all cloud VMs and modern QEMU provide)? Supporting both adds ~50 lines but makes bare-metal testing on older hardware possible.

  3. NVMe queue depth. A single I/O queue pair with depth 32 is sufficient for initial use. Per-CPU queues (leveraging MSI-X per-queue interrupts) improve SMP throughput but add complexity. Defer per-CPU queues to after SMP is working.

  4. Driver model unification. Resolved: PCI enumeration is the standalone PCI/PCIe Infrastructure item in the roadmap. The networking smoke test and NVMe driver both consume this shared subsystem. The networking proposal’s Part 1 Step 1 has been updated to reference this phase.

  5. GCP vs AWS as first cloud target. GCP has virtio-net fallback, making it the easiest first target. AWS has the largest market share and EBS/NVMe is well-documented. Recommendation: GCP first (virtio-net path), then AWS (requires ENA or a workaround).


References

Specifications

Crates

  • acpi – no_std ACPI table parser
  • virtio-drivers – no_std virtio (already in networking proposal)

Prior Art

Cloud Documentation

Proposal: Live Upgrade

Replacing a running service with a new binary, without dropping outstanding capability references or losing in-flight work.

Problem

In a Linux-like system, “upgrading a service” is one of:

  • Restart: stop the old process, start the new one. Clients holding file descriptors, sockets, or pipes to the old process receive ECONNRESET or EPIPE and must reconnect. Session state is lost unless clients serialize it themselves.
  • Graceful restart (nginx -s reload, unicorn, systemd socket activation): new process starts alongside old, inherits the listening socket, old drains in-flight requests. Works only for request/response protocols where the session is the request. Does nothing for stateful sessions.
  • Live patch (kpatch, ksplice): binary-level function replacement. Narrow, fragile, no schema for state layout changes.

None of these compose with a capability OS. A CapId held by a client points at a specific process; if that process exits, the cap is dead. There is no “the service” abstraction the kernel could re-bind — the point of capabilities is that they identify a specific reference, not a name that could be redirected after the fact.

But capOS has a kernel-side primitive the Linux model lacks: the kernel already owns the authoritative table of every CapId and which process serves it. Rewriting “cap X is served by process v1” → “cap X is served by process v2” is a table update. The question is when it is safe, and how v2 inherits enough state to answer the next call.
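The "table update" framing can be made concrete with a small sketch. All names here (`CapId`, `ProcessId`, `CapTable`) are illustrative stand-ins, not the real capOS kernel types; the point is only that retargeting is a single pass over a routing table inside one critical section:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical kernel-side routing table: which process serves each CapId.
type CapId = u64;
type ProcessId = u32;

struct CapTable {
    serves: Mutex<HashMap<CapId, ProcessId>>,
}

impl CapTable {
    // Retarget every CapId served by `old` to `new` in one critical section,
    // so a client observes either the pre- or post-retarget routing, never a mix.
    fn retarget(&self, old: ProcessId, new: ProcessId) -> usize {
        let mut serves = self.serves.lock().unwrap();
        let mut moved = 0;
        for owner in serves.values_mut() {
            if *owner == old {
                *owner = new;
                moved += 1;
            }
        }
        moved
    }
}

fn main() {
    // Caps 1 and 2 served by process 10 (v1); cap 3 by an unrelated process.
    let table = CapTable {
        serves: Mutex::new(HashMap::from([(1, 10), (2, 10), (3, 99)])),
    };
    let moved = table.retarget(10, 11);
    assert_eq!(moved, 2);
    assert_eq!(table.serves.lock().unwrap()[&3], 99); // unrelated cap untouched
}
```

The hard parts the real kernel must add on top of this, covered below, are the safety conditions (quiesce, schema compatibility) and in-flight call handling.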

Three Cases

Live upgrade has three distinct cost profiles. The right design is to make each one explicit rather than pretend the hard case doesn’t exist.

Case 1: Stateless services

Each SQE is independent; the service holds no state that matters across calls. A request router, a pure codec, a logger that flushes to an external sink.

Upgrade is trivial: start v2, retarget every CapId from v1 to v2, exit v1. Clients may observe a small latency spike; no DISCONNECTED CQE fires. Only the kernel primitive is needed.

Case 2: State externalized into other caps

The service’s in-memory data is a cache or dispatch table; durable state lives behind caps the service holds (Store, SessionMap, Namespace). v1’s held caps are passed to v2 at spawn time (via the supervisor, per the manifest), kernel retargets client caps, v1 exits.

Architecturally this is the idiomatic capOS pattern: services stay thin, state is factored into dedicated holders with their own caps. The Fetch/HttpEndpoint split in the service-architecture proposal already pushes in this direction. In that world, most services fall into this bucket by construction.

Case 3: Stateful services requiring migration

The service has in-memory state that matters: a JIT’s code cache, a codec’s ring buffer, a parser’s arena, session data not yet flushed. Upgrade requires v1 to hand its state to v2.

capOS’s contribution here is that the state wire format is already capnp — the same format the service uses for IPC. v1 serializes its state as a capnp message; v2 consumes it. There is no separate serialization layer to build and no opportunity for it to drift from the IPC format.

The contract extends the service’s capnp interface:

interface Upgradable {
    # Called on v1 by the supervisor. Returns a snapshot of service
    # state and stops accepting new calls. Calls already in flight
    # complete before the snapshot returns.
    quiesce @0 () -> (state :Data);

    # Called on v2 after spawn. Loads state from the snapshot. After
    # this returns, v2 is ready to serve calls.
    resume @1 (state :Data) -> ();
}

The state schema is service-defined. Schema evolution follows capnp’s standard rules: adding fields is backward-compatible, renaming requires care, removing requires a major version bump.
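A hypothetical example of the compatible-evolution case, following standard capnp rules (the `SessionState` schema and its fields are invented for illustration; the two definitions show v1 and v2 of the same struct, not one file):

```capnp
# v1 state schema:
struct SessionState {
  sessions @0 :List(Session);   # Session is a service-defined struct
}

# v2: adding a field with a fresh ordinal is backward-compatible.
# A v2 service calling resume() on a v1 snapshot sees the new field's
# default value (0 here).
struct SessionState {
  sessions    @0 :List(Session);
  lastFlushed @1 :UInt64;
}
```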

Kernel Primitive: CapRetarget

The kernel exposes the retarget as a capability method, not a syscall:

interface ProcessControl {
    # Atomically redirect every CapId currently served by `old` to
    # be served by `new`. Requires: `new` implements a schema
    # superset of `old` (schema-id compatibility), `new` is Ready,
    # `old` is Quiesced (graceful) or the caller has permission to
    # force.
    retargetCaps @0 (old :ProcessHandle, new :ProcessHandle,
                     mode :RetargetMode) -> ();
}

enum RetargetMode {
    graceful @0;  # old must be Quiesced; in-flight calls drain on old
    force    @1;  # caps redirect immediately; in-flight calls fail
}

Only a process holding a ProcessControl cap to both processes can perform this — typically the supervisor that spawned them. The kernel never initiates upgrades.

Atomicity is per-CapId. From a client’s perspective, the retarget is a single point in time: a CALL SQE submitted before retarget goes to v1; a CALL SQE submitted after goes to v2. A CALL already dispatched to v1 either completes there (graceful) or returns a DISCONNECTED CQE (force).

Supervisor-Level Upgrade Protocol

The primitives above compose into a protocol the supervisor runs:

1. spawn v2 from the new binary in the manifest
2. Case 1 & 2: v2.resume(EMPTY_STATE)
   Case 3:     state = v1.quiesce()
               v2.resume(state)
3. kernel.retargetCaps(v1, v2, graceful)
4. wait for v1 to drain (graceful mode)
5. v1.exit()

If any step fails, the supervisor rolls back: kill v2, resume v1 (if quiesced), log the failure. Because the retarget hasn’t happened yet, clients never observe the aborted attempt.
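The protocol and its rollback path can be sketched in Rust against hypothetical supervisor traits. `Service`, `Kernel`, and `UpgradeError` are stand-ins for APIs that do not exist yet; the mocks exist only to exercise the control flow:

```rust
#[derive(Debug, PartialEq)]
enum UpgradeError { QuiesceFailed, ResumeFailed, RetargetFailed }

trait Service {
    fn quiesce(&mut self) -> Result<Vec<u8>, UpgradeError>; // Case 3 snapshot
    fn resume(&mut self, state: &[u8]) -> Result<(), UpgradeError>;
    fn exit(&mut self);
}

trait Kernel {
    fn retarget_caps(&mut self, old: u32, new: u32) -> Result<(), UpgradeError>;
}

// Steps 2-5 of the protocol for a stateful (Case 3) service. If any step
// fails before the retarget, v1 keeps serving and clients never observe
// the aborted attempt.
fn upgrade_stateful(
    kernel: &mut dyn Kernel,
    v1: &mut dyn Service, v1_pid: u32,
    v2: &mut dyn Service, v2_pid: u32,
) -> Result<(), UpgradeError> {
    let state = v1.quiesce()?;                 // step 2: snapshot v1
    if let Err(e) = v2.resume(&state) {
        let _ = v1.resume(&state);             // rollback: revive v1
        return Err(e);
    }
    kernel.retarget_caps(v1_pid, v2_pid)?;     // step 3: flip cap routing
    v1.exit();                                 // steps 4-5: drain + exit
    Ok(())
}

// Minimal in-memory mocks, enough to exercise the happy path.
struct MockService { state: Vec<u8>, running: bool }
impl Service for MockService {
    fn quiesce(&mut self) -> Result<Vec<u8>, UpgradeError> {
        self.running = false;
        Ok(self.state.clone())
    }
    fn resume(&mut self, state: &[u8]) -> Result<(), UpgradeError> {
        self.state = state.to_vec();
        self.running = true;
        Ok(())
    }
    fn exit(&mut self) { self.running = false; }
}
struct MockKernel { served_by: u32 }
impl Kernel for MockKernel {
    fn retarget_caps(&mut self, old: u32, new: u32) -> Result<(), UpgradeError> {
        assert_eq!(self.served_by, old);
        self.served_by = new;
        Ok(())
    }
}

fn main() {
    let mut kernel = MockKernel { served_by: 1 };
    let mut v1 = MockService { state: b"sessions".to_vec(), running: true };
    let mut v2 = MockService { state: Vec::new(), running: false };
    upgrade_stateful(&mut kernel, &mut v1, 1, &mut v2, 2).unwrap();
    assert_eq!(kernel.served_by, 2);            // caps now route to v2
    assert_eq!(v2.state, b"sessions".to_vec()); // v2 inherited v1's state
    assert!(!v1.running && v2.running);
}
```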

In-Flight Calls

The subtle case is a client that has already posted a CALL SQE to v1 when the retarget happens. Two options:

  • Graceful mode. v1 finishes the call, kernel routes the CQE back to the client on v1’s ring. v1 exits only after its ring is empty. This preserves call semantics; v1 and v2 coexist briefly.
  • Force mode. The in-flight CALL returns DISCONNECTED. Client retries against v2. Appropriate when v1 is wedged and quiesce won’t return.

In graceful mode the client cannot distinguish “call landed on v1” from “call landed on v2” — which is the point. Capability identity survives the upgrade; process identity does not.

Relationship to Fault Containment

Live upgrade and fault containment (driver panics → supervisor respawns) share machinery. The difference is one step of the protocol:

  • Fault containment: v1 has crashed; kernel has already marked it dead and epoch-bumped its caps. Supervisor spawns v2, issues a graceful retarget (no quiesce — v1 is gone; in-flight CALLs already delivered DISCONNECTED). Clients reconnect to v2.
  • Live upgrade: v1 is healthy; supervisor initiates quiesce → state transfer → retarget, and no CQE ever reports DISCONNECTED to any caller.

The epoch-based revocation work from Stage 6 is the foundation for both. CapRetarget is one additional primitive layered on top.

Security and Trust

Live upgrade does not expand the trust model. The supervisor already holds the authority to kill, restart, and reassign caps for services it spawned — upgrade is a refinement of that authority, not a new principal. Requirements:

  • Only a holder of ProcessControl caps to both old and new can call retargetCaps. By construction this is the supervisor that spawned them.
  • The new binary must be legitimately obtained — in practice, loaded from the same content-addressed store as everything else (ties to Content-Addressed Boot).
  • Schema compatibility (new is a superset of old) is checked by the kernel before retarget. This prevents an upgrade from silently narrowing the interface clients depend on.

Non-Goals

  • Code hot-patching. No binary-level function replacement. Upgrade is at the process boundary, not the symbol boundary.
  • Kernel live replacement. Covered by Reboot-Proof / process persistence (reboot with state preserved, not live replacement). The kernel is a single trust domain; replacing it in place needs a different design.
  • Automatic schema migration across incompatible changes. If v2’s state schema is not a capnp-evolution-compatible superset of v1’s, the service author writes the migration. The kernel does not.
  • System-wide registry of upgradable services. The supervisor knows what it spawned; there is no ambient discovery.

Phased Implementation

  1. CapRetarget primitive. Kernel operation + ProcessControl cap. Useful immediately for stateless services (Case 1) and as the foundation of Fault Containment (respawn with a new process, point its caps to a fresh instance).
  2. Upgradable interface. Schema, contract documentation, and a Rust helper in capos-rt that services derive.
  3. Graceful drain. Quiesce + in-flight call completion + v1 exit synchronization.
  4. Stateful demo. A service maintaining session state, upgraded live with zero session loss. This is the Live Upgrade observable milestone.

Prior Art
  • Erlang/OTP code_change/3 is the closest prior art: processes upgrade their behavior module in place, with a callback to migrate state. capOS differs only in that state transport goes through capnp rather than Erlang term format, and that the process boundary is an OS process rather than a BEAM process.
  • Fuchsia component updates rebind component instances in the routing graph. Similar primitive in a different mechanism.
  • nginx -s reload is graceful restart for request/response servers. The design here generalizes it by exposing the state migration point explicitly rather than relying on “the session is the request.”

Proposal: Capability-Oriented GPU/CUDA Integration

Purpose

Define a minimal, capability-safe path to integrate NVIDIA/CUDA-capable GPUs into the capOS architecture without expanding kernel trust.

The kernel keeps direct control of hardware arbitration and trust boundaries. GPU hardware interaction is performed by a dedicated userspace service that is invoked through capability calls.

Positioning Against Current Project State

capOS currently provides:

  • Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz, round-robin, context switching).
  • A global and per-process capability table with CapObject dispatch.
  • Shared-memory capability ring (io_uring-inspired) with syscall-free SQE writes. cap_enter syscall for ordinary CALL dispatch and completion waits.
  • No ACPI/PCI/interrupt infrastructure yet in-kernel.

That means GPU integration must be staged and should begin as a capability model exercise first, with real hardware I/O added after the underlying kernel subsystems exist.

Design Principles

  • Keep policy in kernel, execution in userspace.
  • Never expose raw PCI/MMIO/IRQ details to untrusted processes.
  • Make GPU access explicit through narrow capabilities.
  • Treat every stateful resource (session, buffer, queue, fence) as a capability.
  • Require revocability and bounded lifetime for every GPU-facing object.
  • Avoid a Linux-driver-in-kernel compatibility dependency.

Proposed Architecture

capOS kernel (minimal) exposes only resource and mediation capabilities. gpu-device service (userspace) receives device-specific caps and exposes a stable GPU capability surface to clients. application receives only GpuSession/GpuBuffer/GpuFence capabilities.

Kernel responsibilities

  • Discover GPUs from PCI/ACPI layers.
  • Map/register BAR windows and grant a scoped DeviceMmio capability.
  • Set up interrupt routing and expose scoped IRQ signaling capability.
  • Enforce DMA trust boundaries for process memory offered to the driver.
  • Enforce revocation when sessions are closed.
  • Handle all faulting paths that would otherwise crash the kernel.

User-space GPU service responsibilities

  • Open/initialize one GPU device from device-scoped caps.
  • Allocate and track GPU contexts and queues.
  • Implement command submission, buffer lifecycle, and synchronization.
  • Translate capability calls into driver-specific operations.
  • Expose only narrow, capability-typed handles to callers.

Capability Contract (schema additions)

Add to schema/capos.capnp:

  • GpuDeviceManager
    • listDevices() -> (devices: List(GpuDeviceInfo))
    • openDevice(capabilityIndex :UInt32) -> (session :GpuSession)
  • GpuSession
    • createBuffer(bytes :UInt64, usage :Text) -> (buffer :GpuBuffer)
    • destroyBuffer(buffer :UInt32) -> ()
    • launchKernel(program :Text, grid :UInt32, block :UInt32, bufferList :List(UInt32), fence :GpuFence) -> ()
    • submitMemcpy(dst :UInt32, src :UInt32, bytes :UInt64) -> ()
    • submitFenceWait(fence :UInt32) -> ()
  • GpuBuffer
    • mapReadWrite() -> (addr :UInt64, len :UInt64)
    • unmap() -> ()
    • size() -> (bytes :UInt64)
    • close() -> ()
  • GpuFence
    • poll() -> (status :Text)
    • wait(timeoutNanos :UInt64) -> (ok :Bool)
    • close() -> ()

Exact wire fields are intentionally flexible to keep this proposal at the interface level; method IDs and concrete argument packing should be finalized in the implementation PR.
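As an interface-level illustration only, the contract above might render into `schema/capos.capnp` roughly as follows. Ordinals, field names, and the referenced types (`GpuDeviceInfo`, for example) are placeholders to be finalized in the implementation PR:

```capnp
interface GpuDeviceManager {
  listDevices @0 () -> (devices :List(GpuDeviceInfo));
  openDevice  @1 (capabilityIndex :UInt32) -> (session :GpuSession);
}

interface GpuSession {
  createBuffer @0 (bytes :UInt64, usage :Text) -> (buffer :GpuBuffer);
  launchKernel @1 (program :Text, grid :UInt32, block :UInt32,
                   bufferList :List(UInt32), fence :GpuFence) -> ();
}

interface GpuFence {
  poll  @0 () -> (status :Text);
  wait  @1 (timeoutNanos :UInt64) -> (ok :Bool);
  close @2 () -> ();
}
```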

Implementation Phases

Phase 0 (prerequisite): Stage 4 kernel capability syscalls

  • Implement capability-call syscall ABI.
  • Add cap_id, method_id, params_ptr, params_len path.
  • Add kernel/user copy/validation of capnp messages.
  • Validate user process permissions before dispatch.

Phase 1: Device mediation foundations

  • Add kernel caps:
    • DeviceManager/DeviceMmio/InterruptHandle/DmaBuffer.
  • Add PCI/ACPI discovery enough to identify NVIDIA-compatible functions.
  • Add guarded BAR mapping and scoped grant to an init-privileged service.
  • Add minimal GpuDeviceManager service scaffold returning synthetic/empty device handles.
  • Add manifest entries for a GPU service binary and launch dependencies.

Phase 2: Service-based mock backend

  • Implement gpu-mock userspace service with same Gpu* interface.
  • Support no-op buffers and synthetic fences.
  • Prove end-to-end:
    • init spawns driver
    • process opens session
    • buffer create/map/wait flows via capability calls
  • Add regression checks in integration boot path output.

Phase 3: Real backend integration

  • Add actual backend adapter for one concrete GPU driver API available in environment.
  • Add:
    • queue lifecycle
    • fence lifecycle
    • DMA registration/validation
    • command execution path
    • interrupt completion path to service and return through caps
  • Keep backend replacement possible via trait-like abstraction in userspace service.

Phase 4: Security hardening

  • Add per-session limits for mapped pages and in-flight submissions.
  • Add bounded queue depth and timeout enforcement.
  • Add explicit revocation propagation:
    • session close => all child caps revoked.
    • driver crash => all active caps fail closed.
  • Add explicit audit hooks for submit/launch calls.

Security Model

The kernel does not grant a user process direct MMIO access.

Processes only receive:

  • GpuSession / GpuBuffer / GpuFence capabilities.

The service process receives:

  • DeviceMmio, InterruptHandle, and memory caps derived from its policy.

This ensures:

  • No userland process can program BAR registers.
  • No userland process can claim untrusted memory for DMA.
  • No userland process can observe or reset another session’s state.

Dependencies and Alignment

This proposal depends on:

  • Stage 4 capability syscalls.
  • Kernel networking/PCI/interrupt groundwork from cloud deployment roadmap.
  • Stage 6/7 for richer cross-process IPC and SMP behavior.

It complements:

  • Device and service architecture proposals.
  • Storage/service manifest execution flow.
  • In-process threading work (future queue completion callbacks).

Minimal acceptance criteria

  • make run boots and prints GPU service lifecycle messages.
  • Init spawns GPU service and grants only device-scoped caps.
  • A sample userspace client can:
    • create session
    • allocate and map a GPU buffer
    • submit a synthetic job
    • wait on a fence and receive completion
  • Attempts to submit unsupported/malformed operations return explicit capnp errors.
  • Removing service/session capabilities invalidates descendants without kernel restart.

Risks

  • Real NVIDIA closed stack integration may require vendor-specific adaptation.
  • Buffer mapping semantics can become complex with paging and fragmentation.
  • Interrupt-heavy completion paths require robust scheduling before user-visible completion guarantees.

Open Questions

  • Is CUDA support mandatory from the first integration, or should the initial surface be command-focused (GPU kernel payloads as opaque bytes), with CUDA runtime specifics added later?
  • Should memory registration support pinned physical memory only at first?
  • Which isolation level is needed for multi-tenant versus single-tenant first phase?

Proposal: Formal MAC/MIC Model and Proof Track

How capOS could move from pragmatic label checks to a formal mandatory access control and mandatory integrity control story suitable for a GOST-style claim.

Problem

Adding a label field to capabilities is not enough to claim formal mandatory access control. ГОСТ Р 59453.1-2021 frames access control through a formal model of an abstract automaton: the model describes states, subjects, objects, containers, rights, accesses, information flows, safety conditions, and proofs that unsafe accesses or flows cannot arise.

capOS should therefore separate two levels:

  • Pragmatic label policy. Userspace brokers and wrapper capabilities enforce labels at trusted grant paths and selected method calls.
  • Formal MAC/MIC. A documented abstract state machine, safety predicates, transition rules, proof obligations, and an implementation mapping. Only this second level can support a GOST-style claim.

This proposal defines the path to the second level. It is not a claim that capOS currently satisfies it.

Scope

The first formal target should be narrow:

Confidentiality:
  No transition creates an unauthorized information flow from an object at a
  higher or incomparable confidentiality label to an object at a lower label,
  except through an explicit trusted declassifier transition.

Integrity:
  No low-integrity or incomparable subject can control a higher-integrity
  subject, and no low-integrity subject can write or transfer influence into a
  higher-integrity object, except through an explicit trusted upgrader or
  sanitizer transition.

The proof should cover capability authority creation and transfer before it covers every device, filesystem, or POSIX compatibility corner. For capOS, capability transfer is the dangerous boundary.

Terminology

The Russian GOST terms to keep straight:

  • мандатное управление доступом: mandatory access control for confidentiality.
  • мандатный контроль целостности: mandatory integrity control.
  • целостность: integrity.
  • уровень целостности: integrity level.
  • уровень конфиденциальности: confidentiality level.
  • субъект доступа: access subject.
  • объект доступа: access object.

The standards separate confidentiality MAC from integrity control. capOS should not merge them into one vague label field.

Abstract State

The formal model should be intentionally smaller than the implementation. It models only the security-relevant state.

U       set of user accounts / principals
S       set of subjects: processes, sessions, services
O       set of objects: files, namespaces, endpoints, process handles, secrets
C       set of containers: namespaces, directories, stores, service subtrees
E       entities = O union C
K       kernel object identities
Cap     capability handles / hold edges
Hold    relation S -> E with metadata
Own     subject-control or ownership relation
Ctrl    subject-control relation
Flow    observed information-flow relation
Rights  abstract rights: read, write, execute, own, control, transfer
Access  realized accesses: read, write, call, return, spawn, supervise

Hold is central. In capOS, authority is represented by capability table entries and transfer records, not by global paths. A formal model that does not model capability hold edges will miss the main authority channel.

Suggested hold-edge metadata:

HoldEdge {
  subject
  entity
  interface_id
  badge
  transfer_mode
  origin
  confidentiality_label
  integrity_label
}

Label Lattices

Use deployment-defined partial orders, not hardcoded government categories.

Example confidentiality lattice:

public < internal < confidential < secret
compartments = {project-a, project-b, ops, crypto}

dominates(a, b) means:

level(a) >= level(b)
and compartments(a) includes compartments(b)

Integrity should be separate:

untrusted < user < service < trusted
domains = {boot, storage, network, auth}

The model must specify how labels compose across containers:

  • contained entity confidentiality cannot exceed what the container policy permits unless the container explicitly supports mixed labels;
  • contained entity integrity cannot exceed the container’s integrity policy;
  • a subject-associated object such as a process ring, endpoint queue, or process handle needs labels derived from the subject it controls or exposes.
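The `dominates(a, b)` relation above is small enough to sketch directly. The level ordering and compartment names are the illustrative ones from the example lattice, not a fixed capOS policy:

```rust
use std::collections::BTreeSet;

// Illustrative confidentiality lattice: derive(Ord) orders variants by
// declaration, giving Public < Internal < Confidential < Secret.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Debug)]
enum Level { Public, Internal, Confidential, Secret }

#[derive(Clone, Debug)]
struct Label {
    level: Level,
    compartments: BTreeSet<&'static str>,
}

// dominates(a, b): level(a) >= level(b) and compartments(a) includes
// compartments(b). This is a partial order: labels with disjoint
// compartments are incomparable.
fn dominates(a: &Label, b: &Label) -> bool {
    a.level >= b.level && a.compartments.is_superset(&b.compartments)
}

fn main() {
    let secret_a = Label { level: Level::Secret,
                           compartments: BTreeSet::from(["project-a"]) };
    let internal = Label { level: Level::Internal,
                           compartments: BTreeSet::new() };
    assert!(dominates(&secret_a, &internal));
    assert!(!dominates(&internal, &secret_a));
    // Incomparable: neither dominates when compartments differ.
    let secret_b = Label { level: Level::Secret,
                           compartments: BTreeSet::from(["project-b"]) };
    assert!(!dominates(&secret_a, &secret_b) && !dominates(&secret_b, &secret_a));
}
```

The integrity lattice would get its own, separate dominance check of the same shape, per the rule that confidentiality and integrity must not be merged into one label field.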

Capability Method Flow Classes

capOS cannot rely on syscall names such as read and write. Each interface method needs a flow class.

Initial categories:

ReadLike       data flows object -> subject
WriteLike      data flows subject -> object
Bidirectional  data flows both ways
ControlLike    subject controls another subject/object lifecycle
TransferLike   authority or future data path is transferred
ObserveLike    metadata/log/status observation
Declassify     trusted downgrade of confidentiality
Sanitize       trusted upgrade of integrity after validation
NoFlow         lifecycle release or local bookkeeping only

Examples:

File.read                 ReadLike
File.write                WriteLike
Namespace.bind            WriteLike + ControlLike
LogReader.read            ReadLike
ManifestUpdater.apply     WriteLike + ControlLike
ProcessSpawner.spawn      ControlLike + TransferLike
ProcessHandle.wait        ObserveLike
ServiceSupervisor.restart ControlLike
Endpoint.call             depends on endpoint declaration
Endpoint.return           depends on endpoint declaration
CAP_OP_RELEASE            NoFlow
CAP_OP_CALL transfers     TransferLike
CAP_OP_RETURN transfers   TransferLike

The flow table is part of the trusted model. Adding a new capability method without classifying its flow should fail review.
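One way to make that review rule mechanical is to encode the flow table as an exhaustive match: adding a method variant without classifying it becomes a compile error rather than a review finding. A sketch with a handful of methods from the table (the `Method` and `FlowClass` types are invented for illustration):

```rust
#[derive(Debug, PartialEq)]
enum FlowClass { ReadLike, WriteLike, ControlLike, TransferLike, ObserveLike, NoFlow }

enum Method {
    FileRead,
    FileWrite,
    NamespaceBind,
    ProcessHandleWait,
    CapOpRelease,
    CapOpCallTransfer,
}

// No catch-all arm on purpose: every Method variant must appear here, so a
// new capability method cannot be added without declaring its flow class.
fn flow_class(m: &Method) -> Vec<FlowClass> {
    match m {
        Method::FileRead => vec![FlowClass::ReadLike],
        Method::FileWrite => vec![FlowClass::WriteLike],
        Method::NamespaceBind => vec![FlowClass::WriteLike, FlowClass::ControlLike],
        Method::ProcessHandleWait => vec![FlowClass::ObserveLike],
        Method::CapOpRelease => vec![FlowClass::NoFlow],
        Method::CapOpCallTransfer => vec![FlowClass::TransferLike],
    }
}

fn main() {
    assert_eq!(flow_class(&Method::FileRead), vec![FlowClass::ReadLike]);
    assert_eq!(flow_class(&Method::NamespaceBind),
               vec![FlowClass::WriteLike, FlowClass::ControlLike]);
}
```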

Transitions

The abstract automaton should include at least these transitions:

create_session(principal, profile)
spawn(parent, child, grants)
copy_cap(sender, receiver, cap)
move_cap(sender, receiver, cap)
insert_result_cap(sender, receiver, cap)
call(subject, endpoint, payload)
return(server, client, result, result_caps)
read(subject, object)
write(subject, object)
bind(subject, namespace, name, object)
supervise(controller, target, operation)
release(subject, cap)
revoke(authority, object)
declassify(trusted_subject, source, target)
sanitize(trusted_subject, source, target)
relabel(trusted_subject, object, new_label)

Each transition needs preconditions and effects. Example:

copy_cap(sender, receiver, cap):
  pre:
    Hold(sender, cap.entity)
    cap.transfer_mode allows copy
    confidentiality_flow_allowed(cap.entity, receiver)
    integrity_flow_allowed(sender, cap.entity, receiver)
    receiver quota has free cap slot
  effect:
    Hold(receiver, cap.entity) is added
    Flow(cap.entity, receiver, transfer) is recorded when relevant

Move is not a shortcut. It has different authority effects but can still create an information/control flow into the receiver.

Safety Predicates

Confidentiality:

read_allowed(s, e):
  clearance(s) dominates classification(e)

write_allowed(s, e):
  classification(e) dominates current_confidentiality(s)

flow_allowed(src, dst):
  classification(dst) dominates classification(src)

The no-write-down property follows directly: data may flow only to destinations whose classification dominates the source’s.

Integrity:

integrity_write_allowed(s, e):
  integrity(s) >= integrity(e)

control_allowed(controller, target):
  integrity(controller) >= integrity(target)

integrity_flow_allowed(src, dst):
  integrity(src) >= integrity(dst)

The exact inequality direction must be validated against the chosen integrity semantics. The intent is that low-integrity subjects cannot modify or control high-integrity subjects or objects.

Subject control:

supervise_allowed(controller, target):
  confidentiality/control labels are compatible
  and integrity(controller) >= integrity(target)
  and Hold(controller, ServiceSupervisor(target)) exists

Authority graph:

all live authority is represented by Hold
every Hold edge has a live cap table slot or trusted kernel root
no transition creates Hold without passing transfer/spawn/broker preconditions

Proof Shape

The proof is an invariant proof over the abstract automaton:

Base:
  initial_state satisfies Safety

Step:
  for every transition T:
    if Safety(state) and Precondition(T, state),
    then Safety(apply(T, state))

The transition proof must explicitly cover:

  • spawn grants,
  • copy transfer,
  • move transfer,
  • result-cap insertion,
  • endpoint call and return,
  • namespace bind,
  • supervisor operations,
  • declassification,
  • sanitization,
  • relabel,
  • revocation and release preserving consistency.

The proof must also state what it does not cover:

  • physical side channels,
  • timing channels not modeled by Flow,
  • bugs below the abstraction boundary,
  • device DMA until DMAPool/IOMMU boundaries are modeled,
  • persistence/replay until persistent object identity is modeled.

Tooling Plan

Start with lightweight formal tools, then deepen only if the model stabilizes.

TLA+

Best first tool for capOS because capability transfer, spawn, endpoint delivery, and revocation are state transitions. Use TLA+ to model:

  • sets of subjects, objects, labels, and hold edges,
  • bounded transfer/spawn/call transitions,
  • invariants for confidentiality, integrity, and hold-edge consistency.

TLC can find counterexamples early. Apalache is worth evaluating later for symbolic checking if TLC state explosion becomes painful.

Alloy

Useful for relational counterexample search:

  • label lattice dominance,
  • container hierarchy invariants,
  • hold-edge graph consistency,
  • “can a path of transfers create forbidden flow?” queries.

Alloy complements TLA+; it does not replace transition modeling.

Coq, Isabelle, or Lean

Only after the model stops moving. These tools are appropriate for a durable machine-checked proof artifact. They are expensive if the policy surface is still changing.

Kani / Prusti / Creusot

Use these for implementation-level Rust obligations after the abstract model exists:

  • cap table generation/index invariants,
  • transfer transaction rollback,
  • label dominance helper correctness,
  • quota reservation/release balance,
  • wrapper cap narrowing properties.

They do not replace the abstract automaton proof.

Implementation Mapping

The proof track must produce implementation obligations that code review and tests can check.

Required implementation hooks:

  • every kernel object that participates in policy has stable ObjectId;
  • every labeled object has MandatoryLabel;
  • every hold edge or capability entry records enough label metadata for transfer checks;
  • every capability method has a flow class;
  • every transfer path calls one shared label/flow checker;
  • every spawn grant uses the same checker as transfer;
  • every endpoint has declared flow policy;
  • every declassifier/sanitizer is an explicit capability and audited;
  • every relabel operation is explicit and audited;
  • every wrapper cap preserves or narrows authority and labels;
  • process exit and release remove hold edges without leaving ghost authority.

The current pragmatic userspace broker model is allowed as an earlier stage, but the implementation mapping must identify where it is bypassable. Any path that lets untrusted code transfer labeled authority without the broker must move into the kernel-visible checked path before a formal MAC/MIC claim.

Testing and Review Gates

Before implementing kernel-visible labels:

  • write the TLA+ or Alloy model;
  • include at least one counterexample-driven test showing a rejected unsafe transfer in the model;
  • document every transition that is intentionally out of scope.

Before claiming pragmatic MAC/MIC:

  • broker and wrapper caps enforce labels at grant paths;
  • audit records every grant, denial, and relabel/declassify operation;
  • QEMU demo shows a denied high-to-low transfer and a permitted trusted declassification.

Before claiming GOST-style MAC/MIC:

  • abstract automaton is written;
  • safety predicates are explicit;
  • all modeled transitions preserve safety;
  • implementation obligations are mapped to code paths;
  • transfer/spawn/result-cap insertion cannot bypass label checks;
  • limitations and non-modeled channels are documented.

Integration With Existing Plans

This proposal depends on:

Non-Goals

  • No certification claim.
  • No claim that current capOS implements GOST-style MAC/MIC.
  • No attempt to model all side channels in the first version.
  • No kernel policy language interpreter.
  • No POSIX uid/gid authorization.
  • No label field without transition rules and proof obligations.

Open Questions

  • What is the smallest useful label lattice for the first demo?
  • Should labels live on objects, hold edges, or both?
  • Should endpoint flow policy be static per endpoint, per method, or per transferred cap?
  • How should declassifier and sanitizer capabilities be scoped and audited?
  • Which channels must be modeled as memory flows versus time flows?
  • Is TLA+ sufficient for the first formal artifact, or should the relational parts start in Alloy?
  • Which parts of ГОСТ Р 59453.1-2021 should be treated as direct goals versus inspiration for a capOS-native formal model?

References

Proposal: Running capOS in the Browser (WebAssembly, Worker-per-Process)

How capOS goes from “boots in QEMU” to “boots in a browser tab,” with each capOS process executing in its own Web Worker and the kernel acting as the scheduler/dispatcher across them.

The goal is a teaching and demo target, not a production runtime. It should preserve the capability model — typed endpoints, ring-based IPC, no ambient authority — while replacing the hardware substrate (page tables, IDT, preemptive timer, privilege rings) with browser primitives (Worker boundaries, SharedArrayBuffer, Atomics.wait/notify).

Depends on: Stage 5 (Scheduling), Stage 6 (IPC) — the capability ring is the only kernel/user interface we want to port. Anything still sitting behind the transitional write/exit syscalls must migrate to ring opcodes first.

Complements: userspace-binaries-proposal.md (language/runtime story), service-architecture-proposal.md (process lifecycle). A browser port stresses both: the runtime must build for wasm32-unknown-unknown, and process spawn becomes “instantiate a Worker” rather than “map an ELF.”

Non-goals:

  • Running the existing x86_64 kernel unmodified in the browser. That’s a separate question (QEMU-WASM / v86) and is a simulator, not a port.
  • Emulating the MMU, IDT, or PIT in WASM. The whole point is to replace them with primitives the browser already gives us for free.
  • Any persistence, networking, or storage beyond what a hosted demo needs.

Current State

capOS is x86_64-only. Arch-specific code lives under kernel/src/arch/x86_64/ and relies on:

| Mechanism | File | Browser equivalent |
|---|---|---|
| Page tables, W^X, user/kernel split | mem/paging.rs, arch/x86_64/smap.rs | Worker + linear-memory isolation (structural) |
| Preemptive timer (PIT @ 100 Hz) | arch/x86_64/pit.rs, idt.rs | setTimeout/MessageChannel + cooperative yield |
| Syscall entry (SYSCALL/SYSRET) | arch/x86_64/syscall.rs | Direct Atomics.notify on ring doorbell |
| Context switch | arch/x86_64/context.rs | None — each process is its own Worker, OS schedules |
| ELF loading | elf.rs, main.rs | WebAssembly.instantiate from module bytes |
| Frame allocator | mem/frame.rs | memory.grow inside each instance |
| Capability ring | capos-config/src/ring.rs, cap/ring.rs | Reused unchanged — shared via SharedArrayBuffer |
| CapTable, CapObject | capos-lib/src/cap_table.rs | Reused unchanged in kernel Worker |

The capability-ring layer is the only stable interface that survives the port intact. Everything below cap/ring.rs is arch work; everything above is schema-driven capnp dispatch that doesn’t care about the substrate.


Architecture

flowchart LR
    subgraph Tab[Browser Tab / Origin]
        direction LR
        Main[Main thread<br/>xterm.js, UI, loader]
        subgraph KW[Kernel Worker]
            Kernel[capOS kernel<br/>CapTable, scheduler,<br/>ring dispatch]
        end
        subgraph P1[Process Worker #1<br/>init]
            RT1[capos-rt] --> App1[init binary]
        end
        subgraph P2[Process Worker #2<br/>service<br/>spawned by init]
            RT2[capos-rt] --> App2[service binary]
        end
        SAB1[(SharedArrayBuffer<br/>ring #1)]
        SAB2[(SharedArrayBuffer<br/>ring #2)]
        Main <-->|postMessage| KW
        KW <-->|SAB + Atomics| SAB1
        KW <-->|SAB + Atomics| SAB2
        P1 <-->|SAB + Atomics| SAB1
        P2 <-->|SAB + Atomics| SAB2
        P1 -.spawn.-> KW
        KW -.new Worker.-> P2
    end

One Worker per capOS process. Each process is a WASM instance in its own Worker, with its own linear memory. Cross-process access is structurally impossible — postMessage and shared ring buffers are the only channels.

Kernel in a dedicated Worker. Not on the main thread: the main thread is reserved for UI (terminal, loader, error display). The kernel Worker owns the CapTable, holds the Arc<dyn CapObject> registry, dispatches SQEs, and maintains one SharedArrayBuffer per process for that process’s ring. It directly spawns init; all further processes are created via the ProcessSpawner cap it serves.

Capability ring over SharedArrayBuffer. The existing CapRingHeader/CapSqe/CapCqe layout in capos-config/src/ring.rs already uses volatile access helpers for cross-agent visibility. Mapping it onto a SharedArrayBuffer is a change of backing store, not of protocol. Both sides see the same bytes; Atomics.load/Atomics.store replace the volatile reads on the host side; on the Rust/WASM side the existing read_volatile/write_volatile lower to plain atomic loads/stores under wasm32-unknown-unknown with the atomics feature enabled.
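The change of backing store can be sketched from the JS side. This is a minimal illustration only: the word offsets (`SQ_HEAD`, `SQ_TAIL`) and slot count are hypothetical stand-ins for the real CapRingHeader layout.

```javascript
// Hypothetical word offsets into the ring header (illustrative only —
// the real layout is defined by CapRingHeader in capos-config/src/ring.rs).
const SQ_HEAD = 0, SQ_TAIL = 1;

// Atomics.load/store on an Int32Array view of the SharedArrayBuffer play
// the role of the Rust side's read_volatile/write_volatile.
function makeRingView(sab) {
  return new Int32Array(sab);
}

function publishSqe(ring, entries, writeSlot) {
  const tail = Atomics.load(ring, SQ_TAIL);
  writeSlot(tail % entries);                // fill the slot before publishing
  Atomics.store(ring, SQ_TAIL, tail + 1);   // consumer sees the slot only after tail moves
}

function drainSqes(ring, onSlot, entries) {
  let head = Atomics.load(ring, SQ_HEAD);
  const tail = Atomics.load(ring, SQ_TAIL);
  while (head !== tail) {
    onSlot(head % entries);
    head += 1;
  }
  Atomics.store(ring, SQ_HEAD, head);       // retire consumed entries
  return tail;
}
```

Both agents operate on the same bytes; only the publish/retire stores need to be atomic, which is exactly the discipline the volatile helpers already impose on native.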

cap_enter becomes Atomics.wait. The process Worker calls Atomics.wait on a doorbell word in the SAB after publishing SQEs. The kernel Worker (or its scheduler tick) calls Atomics.notify after producing completions. That is exactly the io_uring-inspired “syscall-free submit, blocking wait on completion” the ring was designed around — the browser happens to give us the primitive for free.
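A minimal sketch of the doorbell handshake, assuming a single hypothetical doorbell word at index 0 (the real offset would come from the ring layout):

```javascript
// Hypothetical doorbell: one Int32 word in the shared ring buffer.
const DOORBELL = 0;

// Process side: publish SQEs, then block until the kernel rings the bell.
// Returns 'ok' when woken, 'not-equal' if completions already arrived,
// 'timed-out' if the optional timeout expires first.
function capEnter(doorbell, timeoutMs) {
  const observed = Atomics.load(doorbell, DOORBELL);
  return Atomics.wait(doorbell, DOORBELL, observed, timeoutMs);
}

// Kernel side: after pushing CQEs, bump the doorbell and wake the waiter.
function ringDoorbell(doorbell) {
  Atomics.add(doorbell, DOORBELL, 1);
  return Atomics.notify(doorbell, DOORBELL); // number of agents woken
}
```

The load-then-wait pairing means a completion that lands between the load and the wait makes the wait return 'not-equal' immediately, so no wakeup is lost.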

No preemption inside a process. A Worker runs to completion on its event loop turn; the kernel can’t interrupt it. This is fine: each process is single-threaded in its own isolate, and the scheduler only needs to wake the next process after Atomics.wait, not forcibly remove the running one. This is closer to a cooperative capnp-rpc vat model than to the current timer-preempted kernel, and matches what the capability ring already assumes.


Mapping capOS Concepts to WASM/Browser

Process isolation

The Worker boundary replaces the page table. Two capOS processes cannot observe each other’s linear memory, cannot jump into each other’s code (code is out-of-band in WASM — not addressable as data), and cannot share globals. The SharedArrayBuffer containing the ring is the only intentional shared region, and it is created by the kernel Worker and transferred to the process Worker at spawn time.

No W^X enforcement is needed within a Worker because WASM has no writable code region to begin with — WebAssembly.Module is validated and immutable. The MMU’s job is done by the WASM type system and validator.

Address space / memory

Each Worker’s WASM instance has one linear memory. capos-rt’s fixed heap initialization uses memory.grow instead of VirtualMemory::map. The VirtualMemory capability still exists in the schema, but its implementation in the browser port is a thin wrapper over memory.grow with bookkeeping for “logical unmap” (zeroing + tracking a free list — WASM doesn’t return pages to the host).
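A sketch of that bookkeeping, under the stated assumption that WASM never returns pages to the host; the `WasmPageAllocator` name and shape are illustrative, not the real VirtualMemory implementation:

```javascript
// Illustrative VirtualMemory-style allocator over memory.grow. "Unmap"
// zeroes the page and parks it on a free list for reuse, because wasm
// linear memory only ever grows.
const PAGE = 65536; // wasm page size in bytes

class WasmPageAllocator {
  constructor(memory) {
    this.memory = memory;  // a WebAssembly.Memory
    this.free = [];        // byte offsets of logically-unmapped pages
  }
  map() {
    if (this.free.length > 0) return this.free.pop();
    const oldPages = this.memory.grow(1);   // grow returns the old size in pages
    return oldPages * PAGE;
  }
  unmap(offset) {
    new Uint8Array(this.memory.buffer, offset, PAGE).fill(0); // scrub contents
    this.free.push(offset);                 // page stays owned but is reusable
  }
}
```
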

Protection flags (PROT_READ/PROT_WRITE/PROT_EXEC) become no-ops with a documented caveat in the proposal: the browser port does not enforce intra-process protection. Cross-process protection is structural and stronger than the native build.

Syscalls

The three transitional syscalls (write, exit, cap_enter) collapse to:

  • write — already slated for removal once init is cap-native. In the browser port, do not implement it at all. Force the port to drive the existing cap-native Console ring path, which forces the rest of the tree to be cap-native too. A forcing function, not a cost.
  • exit → postMessage({type: 'exit', code}) to the kernel Worker, which terminates the Worker via worker.terminate() and reaps the process entry.
  • cap_enter → Atomics.wait on the ring doorbell after publishing SQEs, with a waitAsync variant for cooperative mode if we ever want to avoid blocking the Worker’s event loop.

Scheduler

Round-robin is gone; the browser scheduler is the OS scheduler. The kernel Worker’s “scheduler” is reduced to:

  1. A poll loop that drains each process’s SQ (the existing cap/ring.rs::process_sqes logic, called on every notify or on a setTimeout(0) tick).
  2. A completion-fanout step that pushes CQEs and calls Atomics.notify on the target Worker’s doorbell.

No context switch, no run queue, no per-process kernel stack. The code deleted here is exactly the code that smp-proposal.md says needs per-CPU structures — an orthogonal win: the browser port has no SMP problem because each process is structurally on its own agent.
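The reduced scheduler can be sketched as a single tick; the process and dispatch shapes here are illustrative stand-ins, not the real kernel types:

```javascript
// One scheduler tick in the kernel Worker: drain each process's SQ,
// fan completions out, ring the doorbell. Runs on every notify or
// on a setTimeout(0) tick.
function schedulerTick(processes, dispatchSqe) {
  for (const proc of processes) {
    // 1. Drain this process's SQ (stand-in for cap/ring.rs::process_sqes).
    const cqes = [];
    for (const sqe of proc.takeSqes()) {
      cqes.push(dispatchSqe(proc, sqe));
    }
    // 2. Push CQEs and wake the process Worker blocked in Atomics.wait.
    if (cqes.length > 0) {
      proc.pushCqes(cqes);
      Atomics.add(proc.doorbell, 0, 1);
      Atomics.notify(proc.doorbell, 0);
    }
  }
}
```
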

Process spawning

The kernel Worker spawns exactly one process Worker directly — init — with a fixed cap bundle: Console, ProcessSpawner, FrameAllocator, VirtualMemory, BootPackage, and any host-backed caps (Fetch, etc.) granted to it.

// Kernel Worker bootstrap
const initMod = await WebAssembly.compileStreaming(fetch('/init.wasm'));
const initRing = new SharedArrayBuffer(RING_SIZE);
const initWorker = new Worker('process-worker.js', {type: 'module'});
const initCapSet = buildInitCapBundle();
kernel.registerProcess(initWorker, initRing, initCapSet);
initWorker.postMessage(
    {type: 'boot', mod: initMod, ring: initRing, capSet: initCapSet,
     bootPackage: manifestBytes},
    [/* transfer */]);

All further processes come from init invoking ProcessSpawner.spawn. ProcessSpawner is served by the kernel Worker; each invocation:

  1. Compiles the referenced binary bytes (WebAssembly.compile over the NamedBlob from BootPackage).
  2. Creates a new Worker and a SharedArrayBuffer for its ring.
  3. Builds the child’s CapTable from the ProcessSpec the caller passed, applying move/copy semantics to caps transferred from the caller’s table.
  4. Returns a ProcessHandle cap.

Init composes service caps in userspace: hold Fetch, attenuate to per-origin HttpEndpoint, hand each child only the caps its ProcessSpec names. Same shape as native after Stage 6.
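Step 3 — building the child's table from the caller's with move/copy semantics — can be sketched as follows; the grant shape and the 'move'/'copy' spellings are assumptions, not the real ProcessSpec schema:

```javascript
// Illustrative cap-table construction for a spawned child. A 'copy' grant
// leaves the caller's entry intact; a 'move' grant revokes it, so the
// authority has exactly one holder afterwards.
function buildChildCaps(parentTable, grants) {
  const child = new Map();
  for (const { name, capId, mode } of grants) {
    if (!parentTable.has(capId)) throw new Error('caller does not hold cap');
    child.set(name, parentTable.get(capId));
    if (mode === 'move') parentTable.delete(capId);
  }
  return child;
}
```
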

Host-backed capability services

Some capabilities in the browser port are implemented by talking to the browser rather than to hardware. Fetch and HttpEndpoint — drafted in service-architecture-proposal.md — are the canonical example. On native capOS they run over a userspace TCP/IP stack on virtio-net/ENA/gVNIC. In the browser port, the service process is replaced by a thin implementation living in the kernel Worker (or a dedicated “host bridge” Worker) that dispatches each capnp call by calling fetch / new WebSocket and returning the response as a CQE. The attenuation story is unchanged: Fetch can reach any URL, HttpEndpoint is bound to one origin at mint time, derived from Fetch by a policy process.

This is not a back door. The capability is granted through the manifest exactly as on native. Processes without the cap cannot reach the host’s network, cannot discover it, and cannot forge one. The only difference from native is the implementation of the service behind the CapObject trait — same schema, same TYPE_ID, same error model.
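The attenuation step can be sketched as a wrapper; `mintHttpEndpoint` and its shape are illustrative, not the drafted interface:

```javascript
// Sketch of minting an origin-bound HttpEndpoint from the broader Fetch
// cap: the origin is fixed at mint time and every request is checked
// against it before delegating.
function mintHttpEndpoint(fetchImpl, origin) {
  return {
    request(url) {
      if (new URL(url).origin !== origin) {
        throw new Error('denied: endpoint bound to ' + origin);
      }
      return fetchImpl(url);   // delegate to the underlying Fetch cap
    },
  };
}
```

A policy process holding Fetch mints one such endpoint per child; the child can reach its one origin and nothing else, and cannot recover the unattenuated Fetch from the wrapper.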

The same pattern applies to anything else the browser provides natively. Candidate future interfaces (no schema yet, mentioned so the port is considered when they are designed):

  • Clipboard over navigator.clipboard
  • LocalStorage / KvStore over IndexedDB (natural Store backend for the storage proposal in the browser)
  • Display / Canvas over an OffscreenCanvas posted back to the main thread
  • RandomSource over crypto.getRandomValues — trivial but needs a cap rather than a syscall

Other drafted network interfaces — TcpSocket, TcpListener, UdpSocket, NetworkManager from networking-proposal.md — do not have a clean browser mapping. The browser exposes no raw-socket primitives, so these caps cannot be served in the browser port at all. Applications that need networking in the browser must go through Fetch/HttpEndpoint, and the POSIX shim’s socket path must detect the absence of NetworkManager and route connect("http://...") through Fetch instead (or fail closed for other schemes). CloudMetadata from cloud-metadata-proposal.md is simply not granted in the browser; there is no cloud instance to describe.
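The shim's scheme gate might look like this sketch; `routeConnect` and the capability names are illustrative:

```javascript
// Sketch of the POSIX shim's connect routing: with no NetworkManager cap
// (the browser case), http(s) goes through Fetch/HttpEndpoint and every
// other scheme fails closed.
function routeConnect(url, caps) {
  if (caps.has('NetworkManager')) return { via: 'tcp' };   // native raw-socket path
  const scheme = new URL(url).protocol;
  if ((scheme === 'http:' || scheme === 'https:') && caps.has('Fetch')) {
    return { via: 'fetch' };
  }
  throw new Error('ECONNREFUSED: no capability for ' + scheme);
}
```
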

Each host-backed cap is opt-in per-process via the manifest; each has a native counterpart that the schema is already the contract for. This is a substantial point in favor of the port: host-provided services slot into the existing capability model without widening it.

CapSet bootstrap

The read-only CapSet page at CAPSET_VADDR is replaced by a structured-clone payload in the initial postMessage. capos-rt::capset::find still parses the same CapSetHeader/CapSetEntry layout, just out of a Uint8Array placed at a known offset in the process’s linear memory by the boot shim.
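Parsing on the process side might look like this sketch; the byte layout used here (a little-endian u32 count followed by 8-byte {u32 typeId, u32 capId} entries) is a stand-in, not the real CapSetHeader/CapSetEntry layout:

```javascript
// Illustrative CapSet parse out of a Uint8Array placed at a known offset
// by the boot shim. Layout is hypothetical; see capos-rt::capset for the
// real header/entry definitions.
function parseCapSet(bytes, offset) {
  const view = new DataView(bytes.buffer, bytes.byteOffset + offset);
  const count = view.getUint32(0, true);          // header: entry count
  const entries = [];
  for (let i = 0; i < count; i++) {
    entries.push({
      typeId: view.getUint32(4 + i * 8, true),    // which interface this cap speaks
      capId: view.getUint32(8 + i * 8, true),     // handle to invoke through the ring
    });
  }
  return entries;
}
```
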


Binary Portability

Source-portable, not binary-portable. An ELF built for x86_64-unknown-capos does not run; the same source rebuilt for wasm32-unknown-unknown (with the atomics target feature) does, provided it stays inside the supported API surface.

Rust binaries on capos-rt

Port cleanly:

  • Any binary that uses only capos-rt’s public API — typed cap clients (ConsoleClient, future FileClient, etc.), ring submission/completion, CapSet::find, exit, cap_enter, alloc::*.
  • Pure computation, core/alloc containers, serde/capnp message building.

Do not port:

  1. Anything that uses core::arch::x86_64, inline asm!, or global_asm!.
  2. Binaries with a custom _start or a linker script baking in 0x200000. capos-rt owns the entry shape; the wasm entry is set by the host (WebAssembly.instantiate + an exported init), so the prologue differs.
  3. #[thread_local] relying on FS base until the wasm TLS story is decided (per-Worker globals, or the wasm threads proposal’s TLS).
  4. Code that assumes a fixed-size static heap region and reaches it with raw pointers. The wasm arch uses memory.grow; alloc::* hides this, unsafe { &mut HEAP[..] } does not.
  5. Anything that still calls the transitional write syscall shim — the browser build deliberately omits it.

Binaries mixing target features across the workspace produce silently broken atomics. A single rustflags set for the browser build is required.

POSIX binaries (when the shim lands)

The POSIX compatibility layer described in userspace-binaries-proposal.md Part 4 sits on top of capos-rt. If capos-rt builds for wasm, the shim builds for wasm, and well-behaved POSIX code rebuilt for a wasm-targeted libcapos (clang --target=wasm32-unknown-unknown + our libc) ports too.

Ports cleanly:

  • Pure computation, string/number handling, data-structure libraries.
  • stdio over Console / future File caps.
  • malloc/free, C++ new/delete, static constructors.
  • select/poll/epoll implemented over the ring (ring CQEs are exactly the event source these APIs want).
  • posix_spawn over ProcessSpawner — spawning a new process becomes “instantiate a new Worker,” which is the native shape of the browser anyway.
  • Networking via Fetch/HttpEndpoint (drafted in service-architecture-proposal.md) if the manifest grants the cap. The browser port serves these against the host’s fetch/WebSocket — not ambient authority, because only processes granted the cap can invoke it. Raw AF_INET/AF_INET6 sockets via the TcpSocket/NetworkManager interfaces in networking-proposal.md are not available in the browser (no raw-socket primitive); POSIX networking code wants URLs in practice, and a libc shim can map getaddrinfo+connect+write over Fetch/HttpEndpoint for the HTTP(S) case, failing closed otherwise.
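The select/poll point can be sketched as a single drain of the completion queue; the CQE and fd shapes are illustrative:

```javascript
// Sketch of poll() over ring completions: each CQE's user data carries
// the fd it completes, so draining the CQ yields exactly the readiness
// events poll wants to report.
function pollOverRing(fds, drainCqes) {
  const ready = new Set(drainCqes().map((cqe) => cqe.userData));
  return fds.map((fd) => ({ fd, revents: ready.has(fd) ? 'POLLIN' : 0 }));
}
```
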

Does not port without new work, possibly ever:

  1. fork. Cannot clone a Worker’s linear memory into a new Worker and resume at the fork call site — there is no COW, no MMU, no way to duplicate an opaque WASM module’s mid-execution state. This is the same reason Emscripten/WASI don’t support fork. POSIX programs that fork-then-exec can be rewritten to posix_spawn; programs that fork-for-concurrency (Apache prefork, some Redis paths) cannot.
  2. Signals. No preemption inside a Worker means no asynchronous signal delivery. SIGALRM, SIGINT, SIGSEGV all need cooperative polling at best; kill(pid, SIGKILL) maps to worker.terminate() and nothing finer. setjmp/longjmp works within a function call tree; siglongjmp out of a signal handler does not exist.
  3. mmap of files with MAP_SHARED. WASM linear memory is not file-backed and cannot be. MAP_PRIVATE | MAP_ANONYMOUS works trivially (it’s just memory.grow + a free list). File-backed mappings require a userspace emulation that reads on fault and writes back on unmap — workable for small files, a lie for the memory-mapped-database case.
  4. Threads without the wasm threads proposal. pthreads over Workers sharing a memory is the only implementation strategy, and it requires the wasm atomics/bulk-memory/shared-memory feature set plus careful runtime support. Single-threaded POSIX code works now; multithreaded POSIX code needs the in-process-threading track from the native roadmap and its wasm counterpart.
  5. Address-arithmetic tricks. Wasm validates loads/stores against the linear-memory bounds. Code that relies on unmapped trap pages (guard pages, end-of-allocation sentinels) or on specific virtual addresses fails.
  6. dlopen. A wasm module is immutable after instantiation. Dynamic loading requires loading a second module and linking via exported tables — possible with the component model, nowhere near drop-in dlopen. Static linking is the pragmatic answer.

Rough guide: if a POSIX program compiles cleanly under WASI and uses only WASI-supported syscalls, it will almost certainly port to capOS-on-wasm with the shim, because the constraints overlap. If it needs features WASI doesn’t support (fork, signals, shared mmap), the capOS browser port will not magically fix that — the limitations come from the substrate, not from the POSIX shim’s completeness.


Build Path

Four new cargo targets, no workspace restructuring required:

  1. capos-lib on wasm32-unknown-unknown. Already no_std + alloc, no arch-specific code. Should build as-is; verify under cargo check --target wasm32-unknown-unknown -p capos-lib.

  2. capos-config on wasm32-unknown-unknown. Same — pure logic, the ring structs and volatile helpers are portable.

  3. capos-rt on wasm32-unknown-unknown with atomics feature. The standalone userspace runtime currently hard-codes x86_64 syscall instructions. Introduce an arch module split:

    • arch/x86_64.rs (existing syscall.rs contents)
    • arch/wasm.rs (new — Atomics.wait via core::arch::wasm32::memory_atomic_wait32, exit via host import)

    Gate at the syscall boundary, not deeper; the ring client above it is arch-agnostic.

  4. Demos on wasm32-unknown-unknown. Same arch split applied via capos-rt. No per-demo changes expected if the split is clean.

The kernel does not build for wasm. Instead, a new crate capos-kernel-wasm/ (peer to kernel/) reuses capos-lib’s CapTable and capos-config’s ring structs and implements the dispatch loop against JS host imports for Worker management. It is, deliberately, not the same kernel binary. Trying to build kernel/ for wasm would pull in IDT/GDT/paging code that has no meaning in the browser.


Phased Plan

Phase A: Port the pure crates

  • Verify capos-lib, capos-config build clean on wasm32-unknown-unknown. CI job: cargo check --target wasm32-unknown-unknown -p capos-lib -p capos-config.
  • Add a host-side ring-tests-js harness that exercises the same invariants as tests/ring_loom.rs but with a real JS producer and a Rust/wasm consumer, both sharing a SharedArrayBuffer. Proves the volatile access helpers are portable before anything else depends on them.

Phase B: capos-rt arch split

  • Introduce capos-rt/src/arch/{x86_64,wasm}.rs behind a #[cfg(target_arch)].
  • Rewire syscall/ring/client to call through the arch module.
  • Add make capos-rt-wasm-check target. Existing make capos-rt-check stays for x86_64.

Phase C: Kernel Worker + init

  • capos-kernel-wasm/ with a Console capability that renders to xterm.js via postMessage back to the main thread.
  • Kernel Worker spawns init. Init prints “hello” through Console and exits.

Phase D: ProcessSpawner + Endpoint

  • ProcessSpawner served by the kernel Worker, granted to init.
  • Init parses its BootPackage and spawns the endpoint-roundtrip and ipc-server/ipc-client demos via ProcessSpawner.spawn. These stress capability transfer across Workers: does a cap handed from A to B via the ring land correctly in B’s ring, and does B’s subsequent invocation route back to the right holder?
  • This phase turns the port into a validation surface for the capability-transfer and badge-propagation invariants in docs/authority-accounting-transfer-design.md, and a second implementation of the Stage 6 spawn primitive.

Phase E: Integration with demos page

  • Hosted page at a project URL; xterm.js terminal; selector for which demo manifest to boot.
  • Serve .wasm artifacts as static assets.

Security Boundary Analysis

The browser port changes what is trusted and what is verified. Summary:

| Boundary | Native (x86_64) | Browser (WASM-Workers) |
|---|---|---|
| Process ↔ process | Page tables + rings | Worker agents + SAB (structural) |
| Process ↔ kernel | Syscall MSRs + SMEP/SMAP | postMessage + validated host imports |
| Code integrity | W^X + NX | WASM validator + immutable Module |
| Capability forgery | Kernel-owned CapTable | Kernel-Worker-owned CapTable |
| Capability transfer | Ring SQE validated in kernel | Ring SQE validated in kernel Worker — same code path |

The capability-forgery story is the same in both: an unforgeable 64-bit CapId is assigned by the kernel and can only be resolved through the kernel’s CapTable. A process Worker cannot synthesize a valid CapId because it never sees the CapTable; it only sees SQEs it submits and CQEs it receives. This property is what makes the port worth doing — the capability model is preserved exactly.

What weakens: no SMAP/SMEP equivalent, but also no corresponding attack surface (the “kernel” Worker has no pointer into process memory; it can only copy bytes out of the shared ring). No DMA problem. No side-channel parity with docs/dma-isolation-design.md — Spectre/meltdown in the browser is the browser’s problem, mitigated by site isolation and COOP/COEP.

Required headers: Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp — SharedArrayBuffer is gated on these. A hosted demo page must set them.


What This Port Buys Us

  1. Shareable demos. A URL that boots capOS in ~1s, with no QEMU, no local install. Valuable for documentation and recruiting.
  2. A second substrate for the capability model. If the cap-transfer protocol has a bug, reproducing it under Workers (single-threaded, deterministic scheduling) is much easier than under SMP x86_64. A second implementation of the dispatch surface is a correctness asset.
  3. Forcing function for write syscall removal. The browser port cannot support the transitional write path without importing host I/O as a back door, which is exactly the ambient authority we want to avoid. Shipping a browser demo at all requires finishing the migration to the Console capability over the ring.
  4. Teaching surface. Workers give a much clearer visual of “one process, one memory, one cap table” than a bare-metal kernel ever will. The isolation story renders in the DevTools panel.

What It Does Not Buy Us

  1. Not a validation surface for the x86_64 kernel. Page tables, IDT, context switch, SMP — none of that runs. Bugs in those subsystems will not appear in the browser build.
  2. Not a performance story. WASM + Workers + SAB is slower than native QEMU-KVM for the parts it does overlap on, and does not exercise the hardware features capOS eventually cares about (IOMMU, NVMe, virtio-net).
  3. Not a path to “capOS on Cloudflare Workers” or similar. Cloudflare’s runtime is a single isolate per request, no SAB, no threads — a different environment that would need its own proposal.

Open Questions

  1. Do we ship one capos-kernel-wasm crate, or does the kernel Worker run plain JS that imports a thin capos-dispatch wasm? JS-hosted kernel is simpler (no second wasm toolchain for the kernel side) but duplicates cap-dispatch logic. Preferred: Rust/wasm kernel Worker reusing capos-lib — dispatch code stays single-sourced.
  2. How do we surface kernel panics in the browser? Native capOS halts the CPU; the browser equivalent is posting an error to the main thread and tearing down all Workers. Should match the panic = "abort" contract — no recovery attempted.
  3. Do we implement VirtualMemory as a no-op or as a real allocator? No-op is faster to ship; a real allocator over memory.grow exercises more of the capability surface. Lean toward real, gated behind a browser-shim flag so the demo doesn’t silently diverge from the native semantics.
  4. Manifest format: keep capnp, or add JSON for hand-authored demo configs? Keep capnp. The manifest is already the contract; adding a parallel format is exactly the drift the project has been careful to avoid.

Relationship to Other Proposals

  • userspace-binaries-proposal.md — the wasm32 runtime story lives there eventually. This proposal is narrower: just enough runtime to boot the existing demo set in a browser. If the userspace proposal lands a richer runtime first, this one adopts it.
  • smp-proposal.md — structurally irrelevant to the browser port (each Worker is its own agent). The browser port does inform SMP testing, because the cap-transfer protocol under Workers is a cleaner model of “messages cross agents asynchronously” than single-CPU preempted kernels.
  • service-architecture-proposal.md — process spawn in the browser becomes Worker instantiation. The lifecycle primitives (supervise, restart, retarget) map naturally. Live upgrade (live-upgrade-proposal.md) is even more natural under Workers than under in-kernel retargeting — swap the WebAssembly.Module behind a Worker while the ring stays live.
  • security-and-verification-proposal.md — the browser port adds a CI job (wasm builds + JS-side ring tests) but does not change the verification story for the native kernel.

Rejected Proposal: Cap’n Proto SQE Envelope

Status: rejected.

Proposal

Replace the fixed C-layout CapSqe descriptor with a fixed-size padded Cap’n Proto message. Each SQ slot would contain a serialized single-segment Cap’n Proto struct with a union for call, recv, return, release, and finish, then zero padding to the chosen SQE size.

For a 128-byte slot, the rough layout would be:

+0x00  u32 segment_count_minus_one
+0x04  u32 segment0_word_count
+0x08  word root pointer
+0x10  RingSqe data words, including union discriminant
+0x??  zero padding to 128 bytes

A compact schema would need to keep fields flat to avoid pointer-heavy nested payload structs:

struct RingSqe {
  userData @0 :UInt64;
  capId @1 :UInt32;
  methodId @2 :UInt16;
  flags @3 :UInt16;
  addr @4 :UInt64;
  len @5 :UInt32;
  resultAddr @6 :UInt64;
  resultLen @7 :UInt32;
  callId @8 :UInt32;

  union {
    call @9 :Void;
    recv @10 :Void;
    return @11 :Void;
    release @12 :Void;
    finish @13 :Void;
  }
}

Potential Benefits

A Cap’n Proto SQE envelope would make the ring operation shape schema-defined instead of Rust-struct-defined. That has some real advantages:

  • The ABI documentation would live in schema/capos.capnp next to the capability interfaces.
  • Future userspace runtimes in Rust, C, Go, or another language could use generated accessors instead of hand-mirroring a packed descriptor layout.
  • The operation choice could be represented as a schema union, making it clear that fields meaningful for CALL are not meaningful for RECV or RETURN.
  • Cap’n Proto defaulting gives a familiar path for adding optional fields while letting older readers ignore fields they do not understand.
  • Ring dumps and traces could be decoded with generic Cap’n Proto tooling.
  • A single “everything crossing this boundary is Cap’n Proto” rule is architecturally simpler to explain.

Those benefits are mostly about schema uniformity, generated bindings, and tooling. They do not remove the need for an operation discriminator; they move it from an explicit fixed descriptor field to a Cap’n Proto union tag.

Rationale For Rejection

The SQE is the fixed control-plane descriptor for a hostile kernel boundary. It should be cheap to classify and validate before any operation-specific payload parsing. A Cap’n Proto SQE envelope would still have a discriminator, but would move it into generated reader state and require Cap’n Proto message validation before the kernel even knows whether the entry is a CALL, RECV, or RETURN.

Cap’n Proto framing also consumes slot space: a single-segment message needs a segment table and root pointer before the struct data. A flat 64-byte envelope would be tight and brittle; a 128-byte envelope would spend much of the slot on framing and padding. Nested payload structs are worse because they add pointers inside the ring descriptor.
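A back-of-the-envelope check of that claim, assuming the common single-segment framing (an 8-byte segment table plus an 8-byte root pointer) and the flat RingSqe fields above:

```javascript
// Rough slot-budget arithmetic for a single-segment Cap'n Proto SQE.
// Framing: one-segment table (8 bytes) + root pointer (8 bytes).
const framing = 8 + 8;
// Flat RingSqe data section, packed into 8-byte words:
//   userData(8) + addr(8) + resultAddr(8)            = 3 words
//   capId(4) + len(4) + resultLen(4) + callId(4)     = 2 words
//   methodId(2) + flags(2) + union discriminant(2)   = 1 word
const dataWords = 6;
const used = framing + dataWords * 8;   // total bytes consumed in the slot
const slack128 = 128 - used;            // padding left in a 128-byte slot
```

Under these assumptions a 64-byte slot is exactly full with zero headroom for new fields, and a 128-byte slot spends 80 of its 128 bytes on framing plus padding — the "tight and brittle" versus "mostly framing and padding" trade-off above.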

The accepted split is:

  • fixed #[repr(C)] ring descriptors for SQ/CQ control state;
  • Cap’n Proto for capability method params, results, and higher-level transport payloads where schema evolution is valuable;
  • endpoint delivery metadata in a small fixed EndpointMessageHeader followed by opaque params bytes.

There is also a layering issue. The capability ring is part of the local Cap’n Proto transport implementation: it is the mechanism that moves capnp calls, returns, and eventually release/finish/promise bookkeeping between a process and the kernel. The SQE itself is therefore below ordinary Cap’n Proto message usage. Making the transport substrate depend on parsing Cap’n Proto messages to discover which transport operation to perform would couple the transport implementation to the protocol it is supposed to carry. Method params and results are proper Cap’n Proto messages; the ring descriptor is the framing/control structure that gets the transport to the point where those messages can be interpreted.

This keeps queue geometry simple, preserves bounded hostile-input handling, and avoids running a Cap’n Proto parser on the hot descriptor path.

Rejected Proposal: Sleep(INF) Process Termination

Status: rejected.

Concern

Unix-style zombies are a poor fit for capOS. A terminated child should not keep its address space, cap table, endpoint state, or other authority alive merely because a parent has not waited yet. The remaining observable state should be a small, capability-scoped completion record, and only holders of the corresponding ProcessHandle should be able to observe it.

The current ProcessHandle.wait() -> exitCode :Int64 shape is also too weak for future lifecycle semantics. Raw numeric status cannot distinguish normal application exit from abandon, kill, fault, startup failure, runtime panic, or supervisor policy actions without inventing process-wide magic numbers.

Proposal

Introduce a system sleep operation and treat Sleep(INF) as a special terminal operation. The argument for this spelling is that a process that never wants to run again can enter an infinite sleep instead of becoming a zombie. The kernel would recognize the infinite case and handle it specially:

  • finite Sleep(duration) blocks the process and wakes it later;
  • Sleep(INF) never wakes, so the kernel tears down the process;
  • the process’s authority is released as if it had exited;
  • parent-visible process completion is either omitted or reported as a special status.

A variant also removes the dedicated sys_exit syscall and makes Sleep(INF) the only user-visible process termination primitive.

Candidate Semantics

Sleep(INF) as Exit(0)

The simplest version maps Sleep(INF) to normal successful exit.

This is rejected because it lies about intent. A program that completed successfully, a program that intentionally detached, and a program that chose to disappear without status are not the same lifecycle event. Supervisors would see the same status for all of them.

Sleep(INF) as Abandoned

A less lossy version gives Sleep(INF) a distinct terminal status:

struct ProcessStatus {
  union {
    exited @0 :ApplicationExit;
    abandoned @1 :Void;
    killed @2 :KillReason;
    faulted @3 :FaultInfo;
    startupFailed @4 :StartupFailure;
  }
}

struct ApplicationExit {
  code @0 :Int64;
}

ProcessHandle.wait() would return status :ProcessStatus instead of a bare exitCode :Int64. Normal application termination returns exited(code), while Sleep(INF) returns abandoned.
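A supervisor consuming a typed status can branch per lifecycle outcome instead of decoding magic numbers. The following Rust sketch mirrors the proposed union; the enum shape, `RestartDecision`, and the specific policy are illustrative assumptions, not capOS code.

```rust
// Hypothetical Rust mirror of the proposed ProcessStatus union.
// Names (ProcessStatus, RestartDecision, restart_decision) are
// illustrative, not part of the current capOS ABI.
#[derive(Debug, Clone, PartialEq)]
pub enum ProcessStatus {
    Exited { code: i64 },
    Abandoned,
    Killed { reason: String },
    Faulted { vaddr: u64 },
    StartupFailed { message: String },
}

/// What a supervisor does with a terminal status.
#[derive(Debug, PartialEq)]
pub enum RestartDecision {
    None,     // clean exit, deliberate detach, or policy kill: leave it down
    Restart,  // crash-like outcomes: restart per policy
    Escalate, // could not even start: report upward
}

pub fn restart_decision(status: &ProcessStatus) -> RestartDecision {
    match status {
        ProcessStatus::Exited { code: 0 } => RestartDecision::None,
        ProcessStatus::Exited { .. } => RestartDecision::Restart,
        ProcessStatus::Abandoned => RestartDecision::None,
        ProcessStatus::Killed { .. } => RestartDecision::None,
        ProcessStatus::Faulted { .. } => RestartDecision::Restart,
        ProcessStatus::StartupFailed { .. } => RestartDecision::Escalate,
    }
}
```

With a bare Int64, a nonzero exit and a kernel-detected fault would be indistinguishable; the typed union keeps the policy explicit.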

This fixes the type problem, but leaves the operation name wrong. Sleep normally means the process remains alive and keeps its authority until a wake condition. The infinite special case would instead release authority, reclaim memory, cancel endpoint state, complete process handles, and make the process impossible to wake. That is termination, not sleep.

Sleep(INF) as Detached No-Status Termination

Another version treats Sleep(INF) as detached termination and gives parents no status. That avoids inventing an exit code, but it weakens supervision. Init and future service supervisors need a definite terminal event to implement restart policy, diagnostics, dependency failure reporting, and “wait for all children” flows. A missing status is not a useful status.

Remove sys_exit Through a Typed Lifecycle Capability

Removing the dedicated sys_exit syscall is a separate, plausible future direction. The cleaner version is not Sleep(INF), but an explicit lifecycle operation:

interface ProcessSelf {
  terminate @0 (status :ProcessStatus) -> ();
  abandon @1 () -> ();
}

interface ProcessHandle {
  wait @0 () -> (status :ProcessStatus);
}

The process would receive ProcessSelf only for itself. Calling terminate would be non-returning in practice: the kernel would process the request, release process authority, complete any ProcessHandle waiter with the typed status, and not post an ordinary success completion back to the dying process.

The transport shape needs care. A generic Cap’n Proto call normally expects a completion CQE, but a self-termination operation cannot safely rely on the dying process to consume one. Viable implementations include:

  • a dedicated ring operation such as CAP_OP_EXIT targeting a self-lifecycle cap;
  • a ProcessSelf.terminate call whose method is explicitly non-returning and never posts a CQE to the caller;
  • keeping sys_exit temporarily until ring-level non-returning operations have explicit ABI and runtime support.

This path removes the ambient exit syscall without overloading sleep. It also forces terminal status to become typed before kill, abandon, restart policy, or fault reporting are added.

Rationale For Rejection

Sleep(INF) solves the wrong abstraction problem. The zombie problem is not that a process needs a forever-blocked state. The problem is retaining process resources after terminal execution. capOS should solve that by separating process lifetime from process-status observation:

  • process termination immediately releases authority and reclaims process resources;
  • a ProcessHandle is only observation authority, not ownership of the live process;
  • if a handle exists, a small completion record may remain until it is waited or released;
  • if no handle exists, terminal status can be discarded;
  • no ambient parent process table is needed.
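The separation above can be sketched as a small state table. This is a minimal model with hypothetical names (`ProcessTable`, `Terminal`): authority and memory are reclaimed at termination time, and only the observable status record persists, and only while a handle could observe it.

```rust
use std::collections::HashMap;

// Minimal sketch (hypothetical names) of the accepted model: terminal
// status survives only as long as a ProcessHandle that could observe it.

#[derive(Debug, Clone, PartialEq)]
pub enum Terminal {
    Exited(i64),
    Abandoned,
}

#[derive(Default)]
pub struct ProcessTable {
    // Completion records for terminated processes that still have a handle.
    records: HashMap<u64, Terminal>,
    // How many outstanding ProcessHandles exist per pid.
    handles: HashMap<u64, u32>,
}

impl ProcessTable {
    pub fn open_handle(&mut self, pid: u64) {
        *self.handles.entry(pid).or_insert(0) += 1;
    }

    /// Process terminates: authority and memory are reclaimed immediately;
    /// only a small status record may remain, and only if observable.
    pub fn terminate(&mut self, pid: u64, status: Terminal) {
        if self.handles.get(&pid).copied().unwrap_or(0) > 0 {
            self.records.insert(pid, status);
        } // no handle: the status is discarded, nothing lingers
    }

    /// wait() consumes the record; the handle is observation authority only.
    pub fn wait(&mut self, pid: u64) -> Option<Terminal> {
        let status = self.records.remove(&pid);
        if let Some(n) = self.handles.get_mut(&pid) {
            *n = n.saturating_sub(1);
        }
        status
    }
}
```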

Under that model, a sleeping process remains alive and authoritative, while a terminated process does not. Special-casing Sleep(INF) to perform teardown would make the name actively misleading and would create a hidden terminal operation with different semantics from finite sleep.

The accepted direction is therefore:

  • keep explicit process termination semantics;
  • replace raw exitCode :Int64 with typed ProcessStatus before adding more lifecycle states;
  • keep exit(code) as the current minimal ABI until a typed self-lifecycle capability or ring operation can replace it cleanly;
  • add future Timer.sleep(duration) only for real sleep, where the process remains alive and may wake.

Sleep(INF) remains rejected as a termination primitive. The concern it raises is valid, but the solution is typed terminal status plus status-record cleanup, not infinite sleep.

Research: Capability-Based and Microkernel Operating Systems

Survey of existing systems to inform capOS design decisions across IPC, scheduling, capability model, persistence, VFS, and language support.

Design consequences for capOS

  • Keep the flat generation-tagged capability table; seL4-style CNode hierarchy is not needed until delegation patterns demand it.
  • Treat the typed Cap’n Proto interface as the permission boundary; avoid a parallel rights-bit system that would drift from schema semantics.
  • Continue the ring transport plus direct-handoff IPC path, with shared memory reserved for bulk data once SharedBuffer/MemoryObject exists.
  • Use badge metadata, move/copy transfer descriptors, and future epoch revocation to make authority delegation explicit and reviewable.
  • Keep persistence explicit through Store/Namespace capabilities; do not adopt EROS-style transparent global checkpointing as a kernel baseline.
  • Push POSIX compatibility and VFS behavior into libraries and services rather than adding a kernel global filesystem namespace.
  • Add resource donation, scheduling-context donation, notification objects, and runtime/thread primitives only when the corresponding service or runtime path needs them.

Individual deep-dive reports:

  • seL4 – formal verification, CNode/CSpace, IPC fastpath, MCS scheduling
  • Fuchsia/Zircon – handles with rights, channels, VMARs/VMOs, ports, FIDL vs Cap’n Proto
  • Plan 9 / Inferno – per-process namespaces, 9P protocol, file-based vs capability-based interfaces
  • EROS / CapROS / Coyotos – persistent capabilities, single-level store, checkpoint/restart
  • Genode – session routing, VFS plugins, POSIX compat, resource trading, Sculpt OS
  • LLVM target customization – target triples, TLS models, Go runtime requirements
  • Cap’n Proto error handling – protocol, schema, and Rust crate error behavior used by the capOS error model
  • OS error handling – error patterns in capability systems and microkernels used by the capOS error model
  • IX-on-capOS hosting – clean integration of IX package/build model via MicroPython control plane, native template rendering, Store/Namespace, and build services
  • Out-of-kernel scheduling – whether scheduler policy can move to user space, and which dispatch/enforcement mechanisms must stay in kernel

Cross-Cutting Analysis

1. Capability Table Design

All surveyed systems store capabilities as process-local references to kernel objects. The key design variable is how capabilities are organized.

| System | Structure | Lookup | Delegation | Revocation |
|---|---|---|---|---|
| seL4 | Tree of CNodes (power-of-2 arrays with guard bits) | O(depth) | Subtree (grant CNode cap) | CDT (derivation tree), transitive |
| Zircon | Flat per-process handle table | O(1) | Transfer through channels (move) | Close handle; refcount; no propagation |
| EROS | 32-slot nodes forming trees | O(depth) | Node key passing | Forwarder keys (O(1) rescind) |
| Genode | Kernel-enforced capability references | O(1) | Parent-mediated session routing | Session close |
| capOS | Flat table with generation-tagged CapId, hold-edge metadata, and Arc<dyn CapObject> backing | O(1) | Manifest exports plus copy/move transfer descriptors through Endpoint IPC | Local release/process exit; epoch revocation not yet |

Recommendation for capOS: Keep the flat table. It is simpler than seL4’s CNode tree and sufficient for capOS’s use cases. Augment each entry with:

  1. Badge (from seL4) – u64 value delivered to the server on invocation, allowing a server to distinguish callers without separate capability objects.
  2. Generation counter (from Zircon) – upper bits of CapId detect stale references after a slot is reused. (Implemented.)
  3. Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.
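A hedged sketch of what an augmented entry and its O(1) checks might look like. The 8-bit generation split, field names, and layout are assumptions, not the current CapId encoding:

```rust
// Illustrative sketch of an augmented flat-table entry. The 8-bit
// generation split and field names are assumptions, not capOS's layout.

pub struct CapEntry {
    pub generation: u32, // must match the generation bits in the CapId
    pub badge: u64,      // delivered to the server on invocation (seL4-style)
    pub epoch: u64,      // snapshot of the object's epoch at grant time
}

pub struct CapObjectState {
    pub epoch: u64, // incremented to revoke all outstanding refs: O(1)
}

const GEN_BITS: u32 = 8;
const INDEX_MASK: u32 = (1 << (32 - GEN_BITS)) - 1;

/// Pack a slot index and generation counter into one CapId.
pub fn cap_id(index: u32, generation: u32) -> u32 {
    (generation << (32 - GEN_BITS)) | (index & INDEX_MASK)
}

pub fn split_cap_id(id: u32) -> (u32, u32) {
    (id & INDEX_MASK, id >> (32 - GEN_BITS))
}

/// O(1) validity check: slot generation and object epoch must both match.
pub fn is_valid(id: u32, entry: &CapEntry, obj: &CapObjectState) -> bool {
    let (_, generation) = split_cap_id(id);
    generation == entry.generation && entry.epoch == obj.epoch
}
```

A stale CapId (reused slot) fails the generation check; a revoked object (bumped epoch) fails the epoch check, both without touching the reference graph.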

Not adopted: per-entry rights bitmask. Zircon and seL4 use rights bitmasks (READ/WRITE/EXECUTE) because their handle/syscall interfaces are untyped. capOS uses Cap’n Proto typed interfaces where the schema defines what methods exist. Method-level access control is the interface itself – to restrict what a caller can do, grant a narrower capability (a wrapper CapObject that exposes fewer methods). A parallel rights system would create an impedance mismatch: generic flags (READ/WRITE) mapped arbitrarily onto typed methods. Meta-rights for the capability reference itself (TRANSFER/DUPLICATE) may be added when Stage 6 IPC needs them. See capability-model.md for the full rationale.

2. IPC Design

IPC is the most performance-critical kernel mechanism. Every capability invocation across processes goes through it.

| System | Model | Latency (round-trip) | Bulk data | Async |
|---|---|---|---|---|
| seL4 | Synchronous endpoint, direct context switch | ~240 cycles (ARM), ~400 cycles (x86) | Shared memory (explicit) | Notification objects (bitmask signal/wait) |
| Zircon | Channels (async message queue, 64KiB + 64 handles) | ~3000-5000 cycles | VMOs (shared memory) | Ports (signal-based notification) |
| EROS | Synchronous domain call | ~2x L4 | Through address space nodes | None (synchronous only) |
| Plan 9 | 9P over pipes (kernel-mediated) | ~5000+ cycles | Large reads/writes (iounit) | None (blocking per-fid) |
| Genode | RPC objects with session routing | Varies by kernel (uses seL4/NOVA/Linux underneath) | Shared-memory dataspaces | Signal capabilities |

Recommendation for capOS: Continue the dual-path IPC design:

Fast synchronous path (seL4-inspired, for RPC):

  • When process A calls a capability in process B and B is blocked waiting, perform a direct context switch (A -> kernel -> B, no unrelated scheduler pick). The current single-CPU direct handoff is implemented.
  • Future fastpath work can transfer small messages (<64 bytes) through registers during the switch instead of copying through ring buffers.

Async submission/completion rings (io_uring-inspired, for batching):

  • SQ/CQ in shared memory for batched capability invocations. This is the current transport for CALL/RECV/RETURN/RELEASE/NOP.
  • Support SQE chaining for Cap’n Proto promise pipelining.
  • Signal/notification delivery through CQ entries (from Zircon ports).
  • User-queued CQ entries for userspace event loop integration.
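One possible descriptor shape for such a ring is sketched below. Opcode values, field widths, and the chain flag are illustrative assumptions, not the actual capOS ring ABI:

```rust
// Illustrative fixed-size SQE shape for the async ring; names, widths,
// and opcode values are assumptions, not the real capOS descriptor.

#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CapOp {
    Nop = 0,
    Call = 1,
    Recv = 2,
    Return = 3,
    Release = 4,
}

#[repr(C)]
#[derive(Clone, Copy)]
pub struct Sqe {
    pub op: CapOp,
    pub flags: u8,       // e.g. a CHAIN bit for promise pipelining
    pub cap_id: u32,     // target capability in the caller's table
    pub user_data: u64,  // echoed back in the matching CQE
    pub payload: u64,    // offset of the capnp message in shared memory
    pub payload_len: u32,
}

pub const FLAG_CHAIN: u8 = 1 << 0;

/// Chained SQEs form one pipelined unit: an entry with FLAG_CHAIN set
/// continues into the next submitted entry.
pub fn chain_len(sqes: &[Sqe]) -> usize {
    let mut n = 0;
    for s in sqes {
        n += 1;
        if s.flags & FLAG_CHAIN == 0 {
            break;
        }
    }
    n
}
```

Note the descriptor itself carries no Cap'n Proto payload inline; the capnp message lives in shared memory and the SQE only frames it, matching the layering argument earlier in this chapter.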

Bulk data (Zircon/Genode-inspired):

  • SharedBuffer capability for zero-copy data transfer between processes.
  • Capnp messages for control plane; shared memory for data plane.
  • Critical for file I/O, networking, and GPU rendering.

3. Memory Management Capabilities

Zircon’s VMO/VMAR model is the most mature capability-based memory design. The Go runtime proposal shows why these primitives are essential.

VirtualMemory capability (baseline implemented; still central for Go and advanced allocators):

interface VirtualMemory {
    map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, size :UInt64) -> ();
    protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}

MemoryObject capability (needed for IPC bulk data, shared libraries). Zircon calls this concept a VMO (Virtual Memory Object); capOS uses the name SharedBuffer – see docs/proposals/storage-and-naming-proposal.md for the canonical interface definition.

interface MemoryObject {
    read @0 (offset :UInt64, count :UInt64) -> (data :Data);
    write @1 (offset :UInt64, data :Data) -> ();
    getSize @2 () -> (size :UInt64);
    createChild @3 (offset :UInt64, size :UInt64, options :UInt32)
        -> (child :MemoryObject);
}
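To make the windowing semantics of createChild concrete, here is an in-memory Rust mock of this interface. It is a sketch that models only the sharing behavior (child writes visible through the parent), not kernel-backed pages or mapping:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// In-memory mock of the MemoryObject interface above; a sketch of the
// intended createChild windowing semantics, not the kernel object.
#[derive(Clone)]
pub struct MemoryObject {
    backing: Rc<RefCell<Vec<u8>>>, // shared "pages"
    base: u64,                     // window start within the backing
    size: u64,                     // window length
}

impl MemoryObject {
    pub fn new(size: u64) -> Self {
        Self { backing: Rc::new(RefCell::new(vec![0; size as usize])), base: 0, size }
    }

    pub fn write(&self, offset: u64, data: &[u8]) {
        assert!(offset + data.len() as u64 <= self.size, "out of bounds");
        let start = (self.base + offset) as usize;
        self.backing.borrow_mut()[start..start + data.len()].copy_from_slice(data);
    }

    pub fn read(&self, offset: u64, count: u64) -> Vec<u8> {
        assert!(offset + count <= self.size, "out of bounds");
        let start = (self.base + offset) as usize;
        self.backing.borrow()[start..start + count as usize].to_vec()
    }

    /// A child is a bounds-checked window onto the same backing pages:
    /// writes through the child are visible through the parent.
    pub fn create_child(&self, offset: u64, size: u64) -> MemoryObject {
        assert!(offset + size <= self.size, "child exceeds parent");
        MemoryObject { backing: self.backing.clone(), base: self.base + offset, size }
    }
}
```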

4. Scheduling

| System | Model | Priority inversion solution | Temporal isolation |
|---|---|---|---|
| seL4 (MCS) | Scheduling Contexts (budget/period/priority) + Reply Objects | SC donation through IPC (caller's SC transfers to callee) | Yes (budget enforcement per SC) |
| Zircon | Fair scheduler with profiles (deadline, capacity, period) | Kernel-managed priority inheritance | Profiles provide some isolation |
| Genode | Delegated to underlying kernel (seL4/NOVA/Linux) | Depends on kernel | Depends on kernel |
| Out-of-kernel policy | Kernel dispatch/enforcement + user-space policy service | Scheduling-context donation through IPC | Kernel-enforced budgets, user-chosen policy |
| User-space runtimes | M:N work stealing, fibers, async tasks over kernel threads | Requires futexes, runtime cooperation, and OS-visible blocking events | Usually runtime-local only |

Recommendation for capOS: Start with round-robin (already done). When implementing priority scheduling:

  1. Add scheduling context donation for synchronous IPC: when process A calls process B, B inherits A’s priority and budget. Prevents inversion through the capability graph.
  2. Support passive servers (from seL4 MCS): servers without their own scheduling context that only run when called, using the caller’s budget. Natural fit for capOS’s service architecture.
  3. Add temporal isolation (budget/period per scheduling context) for the cloud deployment scenario.

For moving scheduler policy out of the kernel, see Out-of-kernel scheduling. The key finding is a split between kernel dispatch/enforcement and user-space policy: dispatch, budget enforcement, and emergency fallback remain privileged, while admission control, budgets, priorities, CPU masks, and SQPOLL/core grants can be represented as policy managed by a scheduler service. Thread creation, thread handles, scheduling contexts, and futex authority should be capability-based from the start; the remaining research task is measurement: compare generic capnp/ring calls against compact capability-authorized futex-shaped operations before deciding the futex hot-path encoding.

5. Persistence

| System | Model | Consistency | Application effort |
|---|---|---|---|
| EROS/CapROS | Transparent global checkpoint (single-level store) | Strong (global snapshot) | None (automatic) |
| Plan 9 | User-mode file servers with explicit writes | Per-file server | Full (explicit save/load) |
| Genode | Application-level (services manage own persistence) | Per-component | Full |
| capOS (planned) | Content-addressed Store + Namespace caps | Per-service | Full (explicit capnp serialize) |

Recommendation for capOS: Three phases, as informed by EROS:

  1. Explicit persistence (current plan) – services serialize state to the Store capability as capnp messages. Simple, gives services control.
  2. Opt-in Checkpoint capability – kernel captures process state (registers, memory, cap table) as capnp messages stored in the Store. Enables process migration and crash recovery for services that opt in.
  3. Coordinated checkpointing – a coordinator service orchestrates consistent snapshots across multiple services.

Persistent capability references (from EROS + Cap’n Proto):

struct PersistentCapRef {
    interfaceId @0 :UInt64;
    objectId @1 :UInt64;
    permissions @2 :UInt32;
    epoch @3 :UInt64;
}

Do NOT implement EROS-style transparent global persistence. The kernel complexity is enormous, debuggability is poor, and Cap’n Proto’s zero-copy serialization already provides near-equivalent benefits for explicit persistence.

6. Namespace and VFS

Plan 9’s per-process namespace is the closest analog to capOS’s per-process capability table. The key insight: Plan 9’s bind/mount with union semantics provides composability that capOS’s current Namespace design lacks.

Recommendation: Extend Namespace with union composition:

enum UnionMode { replace @0; before @1; after @2; }

interface Namespace {
    resolve @0 (name :Text) -> (hash :Data);
    bind @1 (name :Text, hash :Data) -> ();
    list @2 () -> (names :List(Text));
    sub @3 (prefix :Text) -> (ns :Namespace);
    union @4 (other :Namespace, mode :UnionMode) -> (merged :Namespace);
}
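To illustrate the intended lookup precedence, here is a hedged in-memory sketch of union composition. The local-map Namespace and the exact shadowing rules (reading "before" as: the other namespace is consulted first) are assumptions about the proposal, not implemented behavior:

```rust
use std::collections::BTreeMap;

// Minimal in-memory sketch of the proposed union semantics. The real
// Namespace is a capability, not a local map; types here are hypothetical.

#[derive(Clone, Copy)]
pub enum UnionMode {
    Replace,
    Before,
    After,
}

#[derive(Clone, Default)]
pub struct Namespace {
    entries: BTreeMap<String, Vec<u8>>, // name -> content hash
}

impl Namespace {
    pub fn bind(&mut self, name: &str, hash: &[u8]) {
        self.entries.insert(name.to_string(), hash.to_vec());
    }

    pub fn resolve(&self, name: &str) -> Option<&[u8]> {
        self.entries.get(name).map(|v| v.as_slice())
    }

    /// Plan 9-style union: `Before` lets `other` shadow `self`, `After`
    /// keeps `self`'s bindings first, `Replace` discards `self` entirely.
    pub fn union(&self, other: &Namespace, mode: UnionMode) -> Namespace {
        let mut merged = match mode {
            UnionMode::Replace => return other.clone(),
            UnionMode::Before => self.clone(),  // then overwrite with other
            UnionMode::After => other.clone(),  // then overwrite with self
        };
        let winner = match mode {
            UnionMode::Before => other,
            _ => self,
        };
        for (k, v) in &winner.entries {
            merged.entries.insert(k.clone(), v.clone());
        }
        merged
    }
}
```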

VFS as a library (from Genode): libcapos-posix should be an in-process library that translates POSIX calls to capability invocations. Each POSIX process receives a declarative mount table (capnp struct) mapping paths to capabilities. No VFS server needed.

FileServer capability (from Plan 9): For resources that are naturally file-like (config trees, debug introspection, /proc-style interfaces), provide a FileServer interface. Not universal (as in Plan 9) but available where the file metaphor fits.

7. Resource Accounting

Genode’s session quota model addresses a gap in capOS: without resource accounting, a malicious client can exhaust a server’s memory by creating many sessions.

Recommendation: Session-creating capability methods should accept a resource donation parameter:

interface NetworkManager {
    createTcpSocket @0 (bufferPages :UInt32) -> (socket :TcpSocket);
}

The client donates buffer memory as part of the session creation. The server allocates from donated resources, not its own.
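The quota mechanics can be sketched in a few lines. `Session` and its fields are hypothetical, but they show the invariant: allocation draws on the client's donation, never server memory, so a greedy client can only exhaust itself.

```rust
// Sketch of Genode-inspired session-quota accounting; names hypothetical.
pub struct Session {
    donated_pages: u32, // what the client donated at session creation
    used_pages: u32,
}

impl Session {
    pub fn new(donated_pages: u32) -> Self {
        Self { donated_pages, used_pages: 0 }
    }

    /// Allocate buffer pages against the client's donation only.
    pub fn alloc_pages(&mut self, n: u32) -> Result<(), &'static str> {
        if self.used_pages + n > self.donated_pages {
            return Err("quota exceeded: client must donate more");
        }
        self.used_pages += n;
        Ok(())
    }
}
```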

8. Language Support Roadmap

From the LLVM research, the recommended order:

| Step | What | Unblocks / status |
|---|---|---|
| 1 | Custom target JSON (x86_64-unknown-capos) | Optional build hygiene; not required by current no_std runtime |
| 2 | VirtualMemory capability | Done for baseline map/unmap/protect; Go allocator glue remains |
| 3 | TLS support (PT_TLS parsing, FS base save/restore) | Done for static ELF processes; runtime-controlled FS base remains |
| 4 | Futex authority capability + measured ABI | Go threads, pthreads |
| 5 | Timer capability (monotonic clock) | Go scheduler |
| 6 | Go Phase 1: minimal GOOS=capos (single-threaded) | CUE on capOS |
| 7 | Kernel threading | Go GOMAXPROCS>1 |
| 8 | C toolchain + libcapos | C programs, musl |
| 9 | Go Phase 2: multi-threaded + concurrent GC | Go network services |
| 10 | Go Phase 3: network poller | net/http on capOS |

Key decisions:

  • Keep x86_64-unknown-none for kernel, x86_64-unknown-capos for userspace.
  • Use local-exec TLS model (static linking, no dynamic linker).
  • Implement futex as capability-authorized from the start. Because it operates on memory addresses and must be fast, measure generic capnp/ring calls against a compact capability-authorized operation before fixing the ABI.
  • Go can start with cooperative-only preemption (no signals).

Recommendations by Roadmap Stage

Stage 5: Scheduling

| Source | Recommendation | Priority |
|---|---|---|
| Zircon | Generation counter in CapId (stale reference detection) | Done |
| seL4 | Add notification objects (lightweight bitmask signal/wait) | Medium |
| LLVM | Custom target JSON for userspace (x86_64-unknown-capos) | Medium |
| LLVM | Runtime-controlled FS-base operation for Go/threading | Medium |

Stage 6: IPC and Capability Transfer

| Source | Recommendation | Priority |
|---|---|---|
| seL4 | Direct-switch IPC for synchronous cross-process calls | Done baseline |
| seL4 | Badge field on capability entries (server identifies callers) | Done |
| Zircon | Move semantics for capability transfer through IPC | Done |
| Zircon | MemoryObject capability (shared memory for bulk data) | High |
| EROS | Epoch-based revocation (O(1) revoke, O(1) check) | High |
| Zircon | Sideband capability-transfer descriptors and result-cap records | Done baseline |
| Genode | SharedBuffer capability for data-plane transfers | High |
| Plan 9 | Promise pipelining (SQE chaining in async rings) | Medium |
| Genode | Session quotas / resource donation on session creation | Medium |
| seL4 | Scheduling context donation through IPC | Medium |
| Plan 9 | Namespace union composition (before/after/replace) | Low |

Post-Stage 6 / Future

| Source | Recommendation | Priority |
|---|---|---|
| seL4 | MCS scheduling (passive servers, temporal isolation) | When needed |
| EROS | Opt-in Checkpoint capability for process persistence | When needed |
| Genode | Dynamic manifest reconfiguration at runtime | When needed |
| Plan 9 | exportfs-pattern capability proxy for network transparency | When needed |
| EROS | PersistentCapRef struct in capnp for storing capability graphs | When needed |
| seL4 | Rust-native formal verification (track Verus/Prusti) | Long-term |

Design Decisions Validated

Several capOS design choices are validated by this research:

  1. Cap’n Proto as the universal wire format. Superior to FIDL (random access, zero-copy, promise pipelining, persistence-ready). The right choice. See zircon.md Section 5.

  2. Flat capability table. Simpler than seL4’s CNode tree, sufficient for capOS. Only add complexity (CNode-like hierarchy) if delegation patterns demand it. See sel4.md Section 4.

  3. No ambient authority. Every surveyed capability OS confirms this is essential. EROS proved confinement. seL4 proved integrity. capOS has this by design.

  4. Explicit persistence over transparent. EROS’s single-level store is elegant but the kernel complexity is enormous. Cap’n Proto zero-copy gives most of the benefits. See eros-capros-coyotos.md Section 6.

  5. io_uring-inspired async rings. Better than Zircon’s port model for capOS (operation-based > notification-based). See zircon.md Section 4.

  6. VFS as library, not kernel feature. Genode’s approach, matched by capOS’s planned libcapos-posix. See genode.md Section 3.

  7. No fork(). Genode has operated without fork() for 15+ years, proving it unnecessary. See genode.md Section 4.

Design Gaps Identified

  1. No bulk data path. Copying capnp messages through the kernel works for control but not for file/network/GPU data. SharedBuffer or MemoryObject capability is essential for Stage 6+.

  2. Resource accounting is fragmented. The authority-accounting design exists and current code has several local ledgers, but VirtualMemory, FrameAllocator, and process/resource quotas are not yet unified.

  3. No notification primitive. seL4 notifications (lightweight bitmask signal/wait) are needed for interrupt delivery and event notification without full capnp message overhead.

  4. No runtime-controlled FS-base/thread TLS API. Static ELF TLS and context-switch FS-base state exist, but Go and future user threads still need a way to set FS base per thread.


References

See individual deep-dive reports for full reference lists. Key primary sources:

  • Klein et al., “seL4: Formal Verification of an OS Kernel,” SOSP 2009
  • Lyons et al., “Scheduling-context capabilities,” EuroSys 2018
  • Shapiro et al., “EROS: A Fast Capability System,” SOSP 1999
  • Shapiro & Weber, “Verifying the EROS Confinement Mechanism,” IEEE S&P 2000
  • Pike et al., “The Use of Name Spaces in Plan 9,” OSR 1993
  • Feske, “Genode Foundations” (genode.org/documentation)
  • Fuchsia Zircon kernel documentation (fuchsia.dev)

seL4 Deep Dive: Lessons for capOS

Research notes on seL4’s design, covering formal verification, capability model, IPC, scheduling, and applicability to capOS.

Primary sources: “seL4: Formal Verification of an OS Kernel” (Klein et al., SOSP 2009), seL4 Reference Manual (v12.x / v13.x), “The seL4 Microkernel – An Introduction” (whitepaper, 2020), “Towards a Verified, General-Purpose Operating System Kernel” (Klein et al., 2008), “Principled Approach to Kernel Design for MCS” (Lyons et al., 2018), seL4 source code and API documentation.


1. Formal Verification Approach

What seL4 Proves

seL4 is the first general-purpose OS kernel with a machine-checked proof of functional correctness. The verification chain establishes:

  1. Functional correctness: The C implementation of the kernel refines (faithfully implements) an abstract specification written in Isabelle/HOL. Every possible execution of the C code corresponds to an allowed behavior in the abstract spec. This is not “absence of some bug class” – it is a complete behavioral equivalence between spec and code.

  2. Integrity (access control): The kernel enforces capability-based access control. A process cannot access a kernel object unless it holds a capability to it. This is proven as a consequence of functional correctness: the spec defines access rules, and the implementation provably follows them.

  3. Confidentiality (information flow): In the verified configuration, information cannot flow between security domains except through explicitly authorized channels. This proves noninterference at the kernel level.

  4. Binary correctness: The proof chain extends from the abstract spec through a Haskell executable model, then to the C implementation, and finally to the compiled ARM binary (via the verified CAmkES/CompCert chain or translation validation against GCC output). On ARM, the compiled binary is proven to behave as the C source specifies.

The Verification Chain

Abstract Specification (Isabelle/HOL)
    |
    | refinement proof
    v
Executable Specification (Haskell)
    |
    | refinement proof
    v
C Implementation (10,000 lines of C)
    |
    | translation validation / CompCert
    v
ARM Binary

Each refinement step proves that the lower-level implementation is a correct realization of the higher-level spec. The Haskell model serves as an “executable spec” – it’s precise enough to run but abstract enough to reason about.

Properties Verified

  • No null pointer dereferences – a consequence of functional correctness.
  • No buffer overflows – all array accesses are proven in-bounds.
  • No arithmetic overflow – all integer operations are proven safe.
  • No use-after-free – memory management correctness is proven.
  • No memory leaks (in the kernel) – all allocated memory is accounted for.
  • No undefined behavior – the C code is proven to avoid all UB.
  • Capability enforcement – objects are only accessible through valid capabilities, and capabilities cannot be forged.
  • Authority confinement – proven that authority does not leak beyond what capabilities allow.

Practical Implications

What verification buys you:

  • Eliminates all implementation bugs in the verified code. Not “most bugs” or “common bug classes” – literally all of them, for the verified configuration.
  • The security properties (integrity, confidentiality) hold absolutely, not probabilistically.
  • Makes the kernel trustworthy as a separation kernel / isolation boundary.

What verification does NOT cover:

  • The specification itself could be wrong (it could specify the wrong behavior). Verification proves “code matches spec,” not “spec is correct.”
  • Hardware must behave as modeled. The proof assumes a correct CPU, correct memory, no physical attacks. DMA from malicious devices can break isolation unless an IOMMU is used (and IOMMU management is proven correct).
  • Only the verified configuration is covered. seL4 has unverified configurations (e.g., SMP, RISC-V, certain platform features). Using unverified features voids the proof.
  • Performance-critical code paths (like the IPC fastpath) were initially outside the verification boundary, though significant progress has been made on verifying them.
  • The bootloader and hardware initialization code are outside the proof boundary.
  • Compiler correctness: on x86, the proof trusts GCC. On ARM, binary verification closes this gap.

Design Constraints Imposed by Verification

The requirement of formal verification has profoundly shaped seL4’s design:

  1. Small kernel: ~10,000 lines of C. Every line must be verified, so the kernel is as small as possible. Drivers, file systems, networking – everything lives in user space.

  2. No dynamic memory allocation in the kernel: The kernel does not have a general-purpose heap. All kernel memory is pre-allocated and managed through typed capabilities (Untyped memory). This eliminates an entire class of verification complexity (heap reasoning).

  3. No concurrency in the kernel: seL4 runs the kernel single-threaded, as a “big lock” model with interrupts disabled in kernel mode. SMP is handled either by running independent kernel instances on each core with explicit message passing between them (the “clustered multikernel” approach) or by a big kernel lock (the current SMP approach, which is NOT covered by the verification proof).

  4. C implementation: Written in a restricted subset of C that is amenable to Isabelle/HOL reasoning. No function pointers (mostly), no complex pointer arithmetic, no compiler-specific extensions. This makes the code more rigid than typical C but provable.

  5. Fixed system call set: The kernel API is small and fixed. Adding a new syscall requires extending the proofs – a major effort.

  6. Platform-specific verification: The proof is per-platform. ARM was verified first; x86 verification came later with additional effort. Each new platform requires new proofs.


2. Capability Transfer Model

Core Concepts

seL4’s capability model descends from the EROS/KeyKOS tradition but with significant innovations driven by formal verification requirements.

Kernel Objects: Everything the kernel manages is an object: TCBs (thread control blocks), endpoints (IPC channels), CNodes (capability storage), page tables, frames, address spaces (VSpaces), untyped memory, and more. The kernel tracks the exact type and state of every object.

Capabilities: A capability is a reference to a kernel object combined with access rights. Capabilities are stored in kernel memory, never directly accessible to user space. User space refers to capabilities by position in its capability space.

CSpaces, CNodes, and CSlots

CSlot (Capability Slot): A single storage location that can hold one capability. A CSlot is either empty or contains a capability (object pointer + access rights + badge).

CNode (Capability Node): A kernel object that is a power-of-two-sized array of CSlots. A CNode with 2^n slots has a “guard” and a “radix” of n. CNodes are the building blocks of the capability addressing tree.

CSpace (Capability Space): The complete capability namespace of a thread. A CSpace is a tree of CNodes, rooted at the thread’s CSpace root (a CNode pointed to by the TCB). Capability lookup traverses this tree.

Thread's TCB
  |
  +-- CSpace Root (CNode, 2^8 = 256 slots)
        |
        +-- slot 0: cap to Endpoint A
        +-- slot 1: cap to Frame X
        +-- slot 2: cap to another CNode (2^4 = 16 slots)
        |     |
        |     +-- slot 0: cap to Endpoint B
        |     +-- slot 1: empty
        |     +-- ...
        +-- slot 3: empty
        +-- ...

Capability Addressing (CPtr and Depth)

A CPtr (Capability Pointer) is a word-sized integer used to name a capability within a thread’s CSpace. It is NOT a memory pointer – it is an index that the kernel resolves by walking the CNode tree.

Resolution works bit-by-bit from the most significant end:

  1. Start at the CSpace root CNode.
  2. The CNode’s guard is compared against the corresponding bits of the CPtr. If they don’t match, the lookup fails. Guards allow sparse addressing without allocating huge CNode arrays.
  3. The next radix bits of the CPtr are used as an index into the CNode array.
  4. If the slot contains a CNode capability, recurse: consume the next bits of the CPtr to walk deeper.
  5. If the slot contains any other capability, the lookup is complete.
  6. The depth parameter in the syscall tells the kernel how many bits of the CPtr to consume. This disambiguates between “stop at this CNode cap” and “descend into this CNode.”

Example: A CPtr of 0x4B with a two-level CSpace:

  • Root CNode: guard = 0, radix = 4 (16 slots)
  • Bits [7:4] = 0x4 -> index into root CNode slot 4
  • Slot 4 contains a CNode cap: guard = 0, radix = 4 (16 slots)
  • Bits [3:0] = 0xB -> index into second-level CNode slot 11
  • Slot 11 contains an Endpoint cap -> lookup complete
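The resolution steps above are small enough to execute. This Rust sketch models a simplified CSpace (no rights, no badges) and reproduces the 0x4B walk; it is illustrative, not seL4's implementation:

```rust
// Simplified model of seL4-style CPtr resolution: guard check, then
// radix-bit indexing, recursing while depth bits remain. Illustrative only.

pub enum Slot {
    Empty,
    Leaf(&'static str),  // stands in for any non-CNode capability
    Node(Box<CNode>),    // a capability to a further CNode
}

pub struct CNode {
    pub guard: u64,
    pub guard_bits: u32,
    pub radix: u32,      // the node has 2^radix slots
    pub slots: Vec<Slot>,
}

/// Walk a CSpace for `cptr`, consuming exactly `depth` bits.
/// Returns the leaf label and the number of levels traversed.
pub fn resolve(root: &CNode, cptr: u64, depth: u32) -> Option<(&'static str, u32)> {
    let mut node = root;
    let mut remaining = depth;
    let mut levels = 0u32;
    loop {
        if remaining < node.guard_bits + node.radix {
            return None; // depth too small for this level
        }
        // Steps 1-2: compare the guard against the next guard_bits of the CPtr.
        if node.guard_bits > 0 {
            let g = (cptr >> (remaining - node.guard_bits)) & ((1u64 << node.guard_bits) - 1);
            if g != node.guard {
                return None; // guard mismatch: lookup fails
            }
        }
        remaining -= node.guard_bits;
        // Step 3: index into the slot array with the next radix bits.
        let idx = ((cptr >> (remaining - node.radix)) & ((1u64 << node.radix) - 1)) as usize;
        remaining -= node.radix;
        levels += 1;
        match &node.slots[idx] {
            Slot::Empty => return None,
            // Step 5: a non-CNode cap completes the lookup iff depth is spent.
            Slot::Leaf(name) => {
                return if remaining == 0 { Some((*name, levels)) } else { None };
            }
            Slot::Node(child) => {
                if remaining == 0 {
                    // Step 6: depth says stop here, at the CNode cap itself.
                    return Some(("<cnode cap>", levels));
                }
                node = child.as_ref(); // step 4: descend
            }
        }
    }
}
```

Running the worked example (two 4-bit levels, CPtr 0x4B, depth 8) lands on the second-level slot 11 leaf after two levels, exactly as described.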

Flat Table vs. Hierarchical CSpace

seL4’s hierarchical CSpace has significant implications:

Advantages of hierarchical:

  • Sparse capability spaces without wasting memory. A process can have a huge CPtr range with only a few CNodes allocated.
  • Subtree delegation: a parent can give a child a CNode cap that grants access to a subset of capabilities. The child can manage its own subtree without affecting the parent’s.
  • Guards compress address bits, allowing efficient encoding of large capability identifiers.

Disadvantages of hierarchical:

  • Lookup is slower than a flat array index – multiple memory indirections per resolution.
  • More complex kernel code (and more complex verification).
  • User space must explicitly manage CNode allocation and CSpace layout.

capOS comparison: capOS uses a flat Vec<Option<Arc<dyn CapObject>>> indexed by CapId (u32). The shared Arc lets a single kernel capability back multiple per-process slots, which is what makes cross-process IPC work when another service resolves its CapRef via CapSource::Service. The flat layout is simpler and faster for lookup (single array index), but cannot support sparse addressing or subtree delegation. For capOS’s research goals, the flat approach is adequate initially. If capOS needs hierarchical delegation later (e.g., a supervisor delegating a subset of caps to a child without copying), it could add a level of indirection without adopting seL4’s full tree model.

Capability Operations

seL4 provides these operations on capabilities:

Copy: Duplicate a capability from one CSlot to another. Both the source and destination must be in the caller’s CSpace (or the caller must have CNode caps to the relevant CNodes). The new cap has the same authority as the original, minus any rights the caller chooses to strip.

Mint: Like Copy, but also sets a badge on the new capability. A badge is a word-sized integer embedded in the capability that is delivered to the receiver when the capability is used. Badges allow a server to distinguish which client is calling – each client gets a differently-badged cap to the same endpoint, and the server sees the badge on each incoming message.

Move: Transfer a capability from one CSlot to another. The source slot becomes empty. This is an atomic transfer of authority.

Mutate: Move + modify rights or badge in one operation.

Delete: Remove a capability from a CSlot, making it empty.

Revoke: Delete a capability AND all capabilities derived from it. This is the most powerful operation – it allows a parent to withdraw authority it granted to children, transitively.
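A toy model of the slot-level effect of Copy, Mint, and Move (rights and badge reduced to plain fields; all names are invented for illustration, and the CDT bookkeeping that Revoke needs is omitted):

```rust
// Simplified capability with explicit rights mask and optional badge.
#[derive(Clone, Debug, PartialEq)]
struct Cap {
    object: u32,
    rights: u8,
    badge: Option<u64>,
}

/// Copy: new cap to the same object, same or fewer rights; source untouched.
fn copy(src: &Cap, rights_mask: u8) -> Cap {
    Cap { object: src.object, rights: src.rights & rights_mask, badge: src.badge }
}

/// Mint: a copy with a badge stamped on the new capability.
fn mint(src: &Cap, rights_mask: u8, badge: u64) -> Cap {
    Cap { badge: Some(badge), ..copy(src, rights_mask) }
}

/// Move: atomic transfer of authority; the source slot becomes empty.
fn r#move(slot: &mut Option<Cap>) -> Option<Cap> {
    slot.take()
}
```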

Capability Derivation and the CDT

seL4 tracks a Capability Derivation Tree (CDT) – a tree recording which capability was derived from which. When capability A is copied or minted to produce capability B, B becomes a child of A in the CDT.

Revoke(A) deletes all descendants of A in the CDT but leaves A itself. This gives the holder of A the power to revoke all authority derived from their own authority.

The CDT is critical for clean revocation but adds significant kernel complexity. It requires maintaining a tree structure across all capability copies throughout the system.

Untyped Memory and Retype

One of seL4’s most distinctive features is that the kernel never allocates memory on its own. All physical memory is initially represented as Untyped Memory capabilities. To create any kernel object (endpoint, CNode, TCB, page frame, etc.), user space must invoke the Untyped_Retype operation on an untyped cap, which carves out a portion of the untyped memory and creates a new typed object.

This means:

  • User space (specifically, the root task or a memory manager) controls all memory allocation.
  • The kernel has zero internal allocation – all memory it uses comes from retyped untypeds.
  • Memory exhaustion is impossible in the kernel – if a syscall needs memory, user space must have provided it in advance via retype.
  • Revoke on an untyped cap destroys ALL objects created from it, reclaiming the memory. This is the mechanism for wholesale cleanup.
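The retype discipline can be sketched as a watermark allocator over an untyped region. This is a simplified model with invented names; real seL4 additionally records each derived object in the CDT so that revoke can destroy them individually.

```rust
// Toy model of an untyped memory capability. seL4 carves objects out
// sequentially (a watermark), and revoke reclaims the region wholesale.

struct Untyped {
    base: usize,
    size: usize,
    watermark: usize, // bytes already consumed by retyped objects
}

impl Untyped {
    /// Retype: carve one object of `obj_size` bytes (assumed power of two)
    /// out of the untyped region, returning its address.
    fn retype(&mut self, obj_size: usize) -> Result<usize, &'static str> {
        // Kernel objects are naturally aligned to their size.
        let start = (self.base + self.watermark + obj_size - 1) & !(obj_size - 1);
        if start + obj_size > self.base + self.size {
            return Err("untyped exhausted"); // the kernel never allocates more
        }
        self.watermark = start + obj_size - self.base;
        Ok(start)
    }

    /// Revoke: destroying all derived objects resets the watermark.
    fn revoke(&mut self) {
        self.watermark = 0;
    }
}
```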

3. IPC Fastpath

Overview

seL4’s IPC is synchronous and endpoint-based. An endpoint is a rendezvous point: the sender blocks until a receiver is ready, or vice versa. There is no buffering in the kernel (unlike Mach ports or Linux pipes).

The IPC fastpath is a highly optimized code path for the common case of a short synchronous call/reply between two threads. It is one of seL4’s signature performance features.

How the Fastpath Works

When thread A calls seL4_Call(endpoint_cap, msg):

  1. Capability lookup: Resolve the CPtr to find the endpoint cap. In the fastpath, this is optimized to handle the common case of a direct CSlot lookup (single-level CSpace, no guard traversal needed).

  2. Receiver check: Is there a thread waiting on this endpoint? If yes, the fastpath applies. If no (receiver isn’t ready), fall to the slowpath which queues the sender.

  3. Direct context switch: Instead of the normal path (save sender registers -> return to scheduler -> pick receiver -> restore receiver registers), the fastpath performs a direct register transfer:

    • Save the sender’s register state into its TCB.
    • Copy the message registers (a small number, typically 4-8 words) from the sender’s physical registers directly into the receiver’s TCB (or leave them in registers if possible).
    • Load the receiver’s page table root (vspace) into CR3/TTBR.
    • Switch to the receiver’s kernel stack.
    • Restore the receiver’s register state.
    • Return to user mode as the receiver.

    This is a direct context switch – the kernel goes directly from the sender to the receiver without passing through the scheduler. The IPC operation IS the context switch.

  4. Reply cap: The sender’s reply cap is set up so the receiver can reply. In the classic (non-MCS) model, a one-shot reply capability is placed in the receiver’s TCB. The receiver calls seL4_Reply(msg) to send the response directly back.

Performance Characteristics

seL4 IPC is among the fastest measured:

  • ARM (Cortex-A9): ~240 cycles for a Call+Reply round-trip (including two privilege transitions, a full context switch, and message transfer).
  • x86-64: ~380-500 cycles for a Call+Reply round-trip depending on hardware generation.
  • Message size: The fastpath handles small messages (fits in registers). Longer messages require copying from IPC buffer pages and take the slowpath.

For comparison:

  • Linux pipe IPC: ~5,000-10,000 cycles for a round-trip.
  • Mach IPC (macOS XNU): ~3,000-5,000 cycles.
  • L4/Pistachio: ~700-1,000 cycles (seL4 improved on this).

Fastpath Constraints

The fastpath is only taken when ALL of these conditions hold:

  1. The operation is seL4_Call or seL4_ReplyRecv (the two most common IPC operations).
  2. The message fits in message registers (no extra caps, no long messages that require the IPC buffer).
  3. The capability lookup is “simple” – single-level CSpace, direct slot lookup, no guard bits to check.
  4. There IS a thread waiting at the endpoint (no need to block the sender).
  5. The receiver is at sufficient priority (in the non-MCS configuration, higher priority than any other runnable thread – or in MCS, the scheduling context can be donated).
  6. No capability transfer is happening in this message.
  7. Certain bookkeeping conditions are met (no pending operations on either thread, no debug traps, etc.).

When any condition fails, the kernel falls through to the slowpath, which handles the general case correctly but with more overhead (~5-10x slower than the fastpath).
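The gate can be condensed into a single predicate. The field names below are invented; the real checks are scattered through seL4's hand-tuned fastpath code and include further bookkeeping conditions.

```rust
// Condensed model of the fastpath eligibility check: every condition
// must hold, or the kernel falls through to the general slowpath.

#[derive(Clone, Copy)]
struct IpcAttempt {
    is_call_or_reply_recv: bool,  // only Call / ReplyRecv qualify
    msg_fits_in_registers: bool,  // no IPC-buffer message
    lookup_is_single_level: bool, // direct slot lookup, no guard walk
    receiver_waiting: bool,       // a thread is blocked at the endpoint
    receiver_priority_ok: bool,   // scheduling allows running it now
    transfers_caps: bool,         // cap transfer forces the slowpath
}

fn take_fastpath(a: &IpcAttempt) -> bool {
    a.is_call_or_reply_recv
        && a.msg_fits_in_registers
        && a.lookup_is_single_level
        && a.receiver_waiting
        && a.receiver_priority_ok
        && !a.transfers_caps
}
```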

Direct Switch Mechanics

The key insight is: when thread A calls thread B synchronously, A is going to block until B replies. There is no scheduling decision to make – the only correct action is to run B immediately. So the kernel skips the scheduler entirely:

Thread A (running)          Kernel              Thread B (blocked on recv)
    |                         |                        |
    | seL4_Call(ep, msg) ---> |                        |
    |                         | [fastpath]             |
    |                         | Save A's regs          |
    |                         | Copy msg A -> B        |
    |                         | Switch page tables     |
    |                         | Restore B's regs       |
    |                         | ---------------------->|
    |                         |                        | [running, processes msg]
    |                         |                        |
    |                         | <--- seL4_Reply(reply)  |
    |                         | [fastpath again]       |
    |                         | Save B's regs          |
    |                         | Copy reply B -> A      |
    |                         | Switch page tables     |
    |                         | Restore A's regs       |
    | <-----------------------|                        |
    | [running, has reply]    |                        |

The entire round-trip involves exactly two kernel entries and two context switches, with no scheduler invocation.

Implications

  1. RPC is the natural IPC pattern: seL4’s IPC is optimized for the client-server call/reply pattern. Fire-and-forget or multicast patterns require different mechanisms (notifications, shared memory).

  2. Notifications: For async signaling (like interrupts or events), seL4 provides notification objects – a lightweight word-sized bitmask that can be signaled and waited on without message transfer. These are separate from endpoints.

  3. Shared memory for bulk transfer: IPC messages are small (register-sized). For large data transfers, the standard pattern is: set up shared memory, then use IPC to synchronize. This is explicit – the kernel doesn’t transparently copy large buffers.


4. CNode/CSpace Architecture in Detail

CNode Structure

A CNode object is a contiguous array of CSlots in kernel memory. The size is always a power of two. The kernel metadata for a CNode includes:

  • Radix bits: log2 of the number of slots (e.g., radix=8 means 256 slots).
  • Guard value: a bit pattern that must match the CPtr during resolution.
  • Guard bits: the number of bits in the guard.

The total bits consumed during resolution of one CNode level is: guard_bits + radix_bits.

Multi-Level Resolution Example

Consider a two-level CSpace:

Root CNode: guard=0 (0 bits), radix=8 (256 slots)
  Slot 5 -> CNode B: guard=0x3 (2 bits), radix=6 (64 slots)
    Slot 42 -> Endpoint X

To reach Endpoint X with a 16-bit CPtr at depth 16:

  • CPtr = 0b 00000101 11 101010
  • Root CNode consumes 8 bits: 00000101 = 5 -> Slot 5 (CNode B cap)
  • CNode B guard: next 2 bits = 11 -> matches guard 0x3 -> OK
  • CNode B radix: next 6 bits = 101010 = 42 -> Slot 42 (Endpoint X)
  • Total bits consumed: 8 + 2 + 6 = 16 = depth -> resolution complete

CSpace Layout Strategies

Flat: One large root CNode with radix=N, no sub-CNodes. Simple, fast lookup (one level). Wastes memory if the CPtr space is sparse.

Two-level: Small root CNode pointing to sub-CNodes. Common for processes that need moderate capability counts.

Deep: Many levels. Useful for delegation: a supervisor gives a child a cap to a sub-CNode, and the child manages its own CSpace subtree below that point.

Comparison with capOS’s Flat Table

| Aspect | seL4 CSpace | capOS CapTable |
|---|---|---|
| Structure | Tree of CNodes | Flat Vec<Option<Arc<dyn CapObject>>> |
| Lookup cost | O(depth) memory indirections | O(1) array index |
| Sparse support | Yes (guards + tree) | No (dense array, holes via free list) |
| Subtree delegation | Yes (grant CNode cap) | No |
| Memory overhead | CNode objects are power-of-2 | Exact-sized Vec |
| Complexity | High (bit-level CPtr resolution) | Low |
| Capability identity | Position in CSpace | CapId (u32 index) |
| Verification burden | Very high | N/A (Rust safety) |

5. MCS (Mixed-Criticality Systems) Scheduling

Background

The original seL4 scheduling model is a simple priority-preemptive scheduler with 256 priority levels and round-robin within each level. This model has a known flaw: priority inversion through IPC. When a high-priority thread calls a low-priority server, the reply might be delayed indefinitely by medium-priority threads preempting the server. The classic solution (priority inheritance) is complex to verify and doesn’t compose well.

The MCS extensions redesign scheduling to solve this and provide temporal isolation.

Key Concepts

Scheduling Context (SC): A new kernel object that represents the “right to execute on a CPU.” An SC holds:

  • A budget (microseconds of CPU time per period)
  • A period
  • A priority
  • Remaining budget in the current period

A thread must have a bound SC to be runnable. Without an SC, a thread cannot execute regardless of its priority.

Reply Object: In the MCS model, the one-shot reply capability from classic seL4 is replaced by an explicit Reply kernel object. When thread A calls thread B:

  1. A’s scheduling context is donated to B.
  2. A reply object is created to hold A’s return path.
  3. B now runs on A’s scheduling context (A’s priority and budget).
  4. When B replies, the SC returns to A.

This solves priority inversion: the server (B) inherits the caller’s priority and budget automatically.

Passive servers: A server thread can exist without its own SC. It only becomes runnable when a client donates an SC via the Call operation. When it replies, it becomes passive again. This is powerful:

  • No CPU time is “reserved” for idle servers.
  • The server executes on the client’s budget – the client pays for the work it requests.
  • Multiple clients can call the same passive server; each brings its own SC.
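Donation on call/reply can be sketched as moving an `Option<SchedContext>` between threads. The types are invented for illustration; the real MCS kernel also routes the reply path through an explicit Reply object.

```rust
// Toy model of MCS scheduling-context donation. A passive server owns
// no SC of its own and runs on whatever its current caller donates.

struct SchedContext {
    budget_us: u64,
    period_us: u64,
    priority: u8,
}

struct Thread {
    sc: Option<SchedContext>,
}

/// Client calls server: the client's SC moves to the server for the
/// duration of the request, so the server runs at the client's
/// priority and on the client's budget.
fn call(client: &mut Thread, server: &mut Thread) -> Result<(), &'static str> {
    let sc = client.sc.take().ok_or("caller has no scheduling context")?;
    server.sc = Some(sc);
    Ok(())
}

/// Server replies: the donated SC returns to the client and the
/// server becomes passive again.
fn reply(server: &mut Thread, client: &mut Thread) {
    client.sc = server.sc.take();
}
```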

Temporal Isolation

MCS SCs provide temporal fault isolation:

  • Each SC has a fixed budget/period. A thread cannot exceed its budget in any period. When the budget expires, the thread is descheduled until the next period begins.
  • This is enforced by hardware timer interrupts – the kernel programs the timer to fire when the current SC’s budget expires.
  • A misbehaving (or compromised) component cannot starve other components because its SC bounds its CPU consumption.
  • This works even across IPC: if client A calls server B with A’s SC, the combined execution of A+B is bounded by A’s budget.

Comparison with capOS’s Scheduler

capOS currently has a round-robin scheduler (kernel/src/sched.rs) with no priority levels and no temporal isolation:

struct Scheduler {
    processes: BTreeMap<Pid, Process>,
    run_queue: VecDeque<Pid>,
    current: Option<Pid>,
}

Timer preemption, cap_enter blocking waits, Endpoint IPC, and a baseline direct IPC handoff are implemented. The MCS model is relevant for the next scheduling step because the same priority inversion problem arises when a high-priority client calls a low-priority server through a capability.


6. Relevance to capOS

6.1 Formal Verification

Applicability: Low in the near term. seL4’s verification is done in Isabelle/HOL over C code, which doesn’t transfer to Rust. However, the constraints that verification imposed are valuable design guidance:

  • Minimal kernel: seL4’s ~10K lines of C demonstrate how little code a microkernel actually needs. capOS should resist adding kernel features and instead move them to user space.
  • No kernel heap allocation on the critical path: seL4’s “untyped memory” approach where user space provides all memory is worth studying. capOS has removed the earlier allocation-heavy synchronous ring dispatch path, but it still uses owned kernel objects and preallocated scratch rather than a user-supplied untyped-memory model.
  • No kernel concurrency: seL4 avoids kernel-level concurrency entirely (SMP uses separate kernel instances or a big lock). capOS currently uses spin::Mutex around the scheduler and capability tables. The seL4 approach suggests this is acceptable until/unless per-CPU kernel instances are needed.

Rust alternative: Rust’s type system provides memory safety guarantees that overlap with some of seL4’s verified properties (no buffer overflows, no use-after-free, no null dereference in safe code). This is not a substitute for functional correctness proofs, but it significantly raises the bar compared to unverified C. Ongoing research in Rust formal verification (e.g., Prusti, Creusot, Verus) may eventually enable seL4-style proofs over Rust kernels.

6.2 Capability Model

CNode tree vs. flat table: capOS’s flat CapTable is the right choice for now. seL4’s CNode tree exists to support delegation (granting a subtree of your CSpace to a child) and sparse addressing. capOS’s current model gives each process its own independent flat table and now supports manifest-provided caps plus explicit copy/move transfer descriptors through Endpoint IPC. If capOS later needs fine-grained delegation (a parent granting access to a subset of its caps without copying), it can add a level of indirection:

Option A: Proxy capability objects that forward to the parent's table
Option B: A two-level table (small root array -> larger sub-arrays)
Option C: Shared capability objects with refcounting

Badge/Mint pattern: seL4’s badge mechanism is directly applicable to capOS. When multiple clients share a capability to the same server endpoint, the server needs to distinguish callers. In seL4, each client gets a differently-badged copy of the endpoint cap. The badge is delivered with each message.

capOS has implemented this by adding badge metadata to capability references and hold edges. Endpoint CALL delivery reports the invoked hold badge to the receiver, and copy/move transfer preserves badge metadata.

Current ring SQEs carry cap id and method id separately. The cap table stores badge and transfer-mode metadata alongside the object reference:

struct CapEntry {
    object: Arc<dyn CapObject>,
    badge: u64,
    transfer_mode: CapTransferMode,
}
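Server-side dispatch on the delivered badge might look like the following sketch. The `Delivered` shape and the client registry are hypothetical; the point is that identity comes from the kernel-reported badge, never from the payload.

```rust
// Hypothetical badge-based dispatch: the server maps badge values to
// client identities it assigned when minting each client's cap.

use std::collections::HashMap;

struct Delivered {
    badge: u64,       // badge of the invoked hold, reported by the kernel
    payload: Vec<u8>, // message bytes (contents untrusted)
}

fn handle(clients: &HashMap<u64, &'static str>, msg: &Delivered) -> String {
    match clients.get(&msg.badge) {
        Some(name) => format!("{}: {} bytes", name, msg.payload.len()),
        None => "unknown badge: reject".to_string(),
    }
}
```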

Revocation (CDT): seL4’s Capability Derivation Tree is its most complex internal structure. For capOS, full CDT-style transitive revocation is probably overkill initially. The service-architecture proposal already identifies simpler alternatives:

  • Generation counters: Each capability has a generation number. Bumping the generation invalidates all references without traversing a tree.
  • Proxy caps: A proxy object that can be invalidated by its creator. Callers hold the proxy, not the real capability.
  • Process-lifetime revocation: When a process dies, all caps it held are automatically invalidated (seL4 does this too, but the CDT allows more fine-grained revocation within a living process).

Untyped memory: seL4’s “no kernel allocation” model via untyped memory is elegant but probably too heavyweight for capOS’s current stage. The key takeaway is the principle: user space should control resource allocation as much as possible. capOS’s FrameAllocator capability already moves frame allocation authority into the capability model.

6.3 IPC Design

This is the most directly actionable area for capOS’s Stage 6.

seL4’s model (synchronous rendezvous + direct switch) vs. capOS’s model (async rings + Cap’n Proto wire format):

| Aspect | seL4 | capOS |
|---|---|---|
| IPC primitive | Synchronous endpoint | Async submission/completion rings |
| Message format | Untyped words in registers | Cap’n Proto serialized messages |
| Bulk transfer | Shared memory (explicit) | TBD (copy in kernel or shared memory) |
| Message size | Small (register-sized, ~4-8 words) | Variable (up to 64KB currently) |
| Scheduling integration | Direct switch (caller -> callee) | Baseline direct IPC handoff implemented |
| Batching | No (one message per syscall) | Yes (io_uring-style ring) |

Key lessons from seL4’s IPC for capOS:

  1. Direct switch for synchronous RPC: Even with async rings, capOS needs a synchronous fast path. The baseline single-CPU direct IPC handoff is implemented for the case where process A calls an Endpoint and process B is blocked waiting in RECV. Future work is register payload transfer and measured fastpath tuning.

  2. Register-based message transfer for small messages: seL4 avoids copying message bytes through kernel buffers for small messages by transferring them through registers during the context switch. capOS currently moves serialized payloads through ring buffers and bounded kernel scratch. For cross-process IPC, minimizing copies is critical. Options:

    • Small messages (<64 bytes) could be transferred in registers during direct switch.
    • Large messages could use shared memory regions (mapped into both address spaces) with IPC used only for synchronization.
    • The io_uring-style rings are already shared memory – the submission and completion ring buffers could potentially be mapped into both the caller’s and callee’s address spaces for zero-copy IPC.
  3. Separate mechanisms for sync and async: seL4 uses endpoints for synchronous IPC and notification objects for async signaling. capOS’s io_uring approach inherently supports batched async operations, but the common case of a simple RPC call-and-wait should have a fast synchronous path too. The two mechanisms complement each other.

  4. Notifications for interrupts and events: seL4’s notification objects (lightweight bitmask signal/wait) map well to capOS’s interrupt delivery model. When a hardware interrupt fires, the kernel signals a notification object, and the driver thread waiting on that notification wakes up. This is cleaner than delivering interrupts as full IPC messages.

The Cap’n Proto dimension: capOS’s use of Cap’n Proto wire format for capability messages is a significant divergence from seL4’s untyped word arrays. Tradeoffs:

  • Pro: Type safety, schema evolution, language-neutral interfaces, built-in serialization/deserialization, native support for capability references in messages (Cap’n Proto has a “capability table” concept in its RPC protocol).
  • Con: Serialization overhead. Even Cap’n Proto’s zero-copy format requires pointer validation and bounds checking that seL4’s raw register transfer does not. For very hot IPC paths, this overhead may be significant.
  • Mitigation: For the hot path, capOS could define a “small message” format that bypasses full capnp serialization – just a few raw words, similar to seL4’s register message. Fall back to full capnp for larger or more complex messages.
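One possible shape for that hybrid encoding, with an invented 32-byte threshold (four 64-bit words, roughly seL4's register-message scale):

```rust
// Sketch of a "small message" escape hatch: payloads that fit in a few
// words skip Cap'n Proto serialization entirely. Threshold is invented.

enum Wire<'a> {
    Small([u64; 4]),  // register-sized; could be copied during direct switch
    Capnp(&'a [u8]),  // full serialized message, validated on receipt
}

fn encode(bytes: &[u8]) -> Wire<'_> {
    if bytes.len() <= 32 {
        // Pack the payload little-endian into up to four words.
        let mut words = [0u64; 4];
        for (i, chunk) in bytes.chunks(8).enumerate() {
            let mut buf = [0u8; 8];
            buf[..chunk.len()].copy_from_slice(chunk);
            words[i] = u64::from_le_bytes(buf);
        }
        Wire::Small(words)
    } else {
        Wire::Capnp(bytes)
    }
}
```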

6.4 MCS Scheduling

Priority donation via IPC: Directly relevant when capOS implements cross-process capability calls. If process A (high priority) calls a capability in process B (low priority), B needs to run at A’s priority to avoid inversion. The seL4 MCS approach of “donating” the scheduling context with the IPC message is clean and composable.

For capOS, the io_uring model complicates this slightly: if submissions are batched, which submitter’s priority should the server inherit? Options:

  • Inherit the highest priority among pending submissions.
  • Each submission carries its own priority/scheduling context.
  • Use the synchronous fast-path (with donation) for priority-sensitive calls, and the async ring for bulk/background operations.

Passive servers: The MCS concept of servers that only consume CPU when called (by borrowing the caller’s scheduling context) maps well to capOS’s capability-based services. A network stack server that only runs when a client sends a request, consuming the client’s CPU budget, is a natural fit for capOS’s service architecture.

Temporal isolation: Budget/period enforcement prevents denial-of-service between capability holders. Even if process A holds a capability to process B, A cannot cause B to consume unbounded CPU time – B’s execution on behalf of A is bounded by A’s scheduling context budget. This is worth considering for capOS’s roadmap, especially for the cloud deployment scenario where isolation is critical.

6.5 Specific Recommendations for capOS

Near-term (Stages 5-6):

  1. Badge field on cap holds: Done. Manifest CapRef badge metadata is carried into cap-table hold edges, delivered to Endpoint receivers, and preserved across copy/move transfer.

  2. Implement direct-switch IPC for synchronous calls: Baseline done for Endpoint receivers blocked in RECV. Remaining work is the measured fastpath shape, especially small-message register transfer.

  3. Keep the flat CapTable: seL4’s CNode tree complexity is justified by formal verification constraints and subtree delegation. capOS’s flat table is simpler and sufficient. Add proxy/wrapper capabilities for delegation rather than restructuring the table.

  4. Add notification objects: A lightweight signaling primitive (word-sized bitmask, signal/wait operations) for interrupt delivery and event notification. Much cheaper than sending a full capnp message for “wake up, there’s work to do.”
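A notification object reduces to a single word of pending bits. The sketch below models the signal/wait semantics only; a real kernel would block the waiting thread instead of returning None.

```rust
// Minimal model of a notification object: signals OR bits into a
// pending word, and a wait drains all accumulated bits at once.

struct Notification {
    pending: u64,
}

impl Notification {
    fn new() -> Self {
        Notification { pending: 0 }
    }

    /// Signal: OR bits in. Repeated signals coalesce, which is exactly
    /// the behavior wanted for interrupt delivery.
    fn signal(&mut self, bits: u64) {
        self.pending |= bits;
    }

    /// Wait: return and clear all pending bits, or None if nothing is
    /// pending (where a real kernel would block the caller).
    fn wait(&mut self) -> Option<u64> {
        if self.pending == 0 {
            return None;
        }
        let bits = self.pending;
        self.pending = 0;
        Some(bits)
    }
}
```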

Medium-term (post-Stage 6):

  1. Scheduling context donation: When implementing priority scheduling, attach a scheduling context to IPC calls so servers inherit caller priority. This prevents priority inversion through the capability graph.

  2. Capability rights attenuation: Add a rights mask to capability references so a parent can grant a cap with reduced permissions (e.g., read-only access to a read-write capability). seL4’s rights bits are: Read, Write, Grant (can pass the cap to others), GrantReply (can pass reply cap only).

  3. Revocation via generation/epoch counters: Generation-tagged CapIds catch stale slot reuse today. Object-wide epoch revocation remains future work and is simpler than a derivation tree.

Long-term (research directions):

  1. Zero-copy IPC via shared memory: For bulk data transfer between processes, map shared memory regions (Cap’n Proto segments) into both address spaces. Use IPC only for synchronization and capability transfer. This combines seL4’s “shared memory + IPC sync” pattern with capOS’s Cap’n Proto wire format.

  2. Rust-native verification: Track developments in Verus, Prusti, and other Rust verification tools. capOS’s Rust implementation is better positioned for future formal verification than a C implementation would be, given the type system guarantees already present.

  3. Untyped memory model: Consider moving kernel object allocation entirely into capability-gated operations (like seL4’s Retype). User space provides memory for all kernel objects, ensuring the kernel never runs out of memory on its own. This is a significant architectural change but aligns with the “everything is a capability” principle.


Summary Table

| seL4 Feature | Maturity | capOS Equivalent | Recommended Action |
|---|---|---|---|
| Functional correctness proof | Production | None (Rust type safety) | Track Rust verification tools |
| CNode/CSpace tree | Production | Flat CapTable | Keep flat |
| Capability badge/mint | Production | Hold-edge badge | Done baseline |
| Revocation (CDT) | Production | Generation-tagged CapId; no epoch yet | Use epoch revocation instead of CDT |
| Untyped memory / Retype | Production | FrameAllocator cap | Consider for hardening phase |
| Synchronous IPC endpoints | Production | Endpoint CALL/RECV/RETURN | Done baseline |
| IPC fastpath (direct switch) | Production | Direct IPC handoff | Done baseline; tune register payload later |
| Notification objects | Production | None | Implement as lightweight signal primitive |
| MCS Scheduling Contexts | Production | Round-robin scheduler | Implement SC donation for IPC |
| Passive servers | Production | None | Natural fit with cap-based services |
| Temporal isolation | Production | None | Consider for cloud deployment |

References

  1. Klein, G., et al. “seL4: Formal Verification of an OS Kernel.” SOSP 2009.
  2. seL4 Reference Manual, versions 12.1.0 and 13.0.0.
  3. “The seL4 Microkernel – An Introduction.” seL4 Foundation Whitepaper, 2020.
  4. Lyons, A., et al. “Scheduling-context capabilities: A principled, light-weight operating-system mechanism for managing time.” EuroSys 2018.
  5. Heiser, G., & Elphinstone, K. “L4 Microkernels: The Lessons from 20 Years of Research and Deployment.” ACM Transactions on Computer Systems 34(1), 2016.
  6. seL4 source code: https://github.com/seL4/seL4
  7. seL4 API documentation: https://docs.sel4.systems/

Fuchsia Zircon Kernel: Research Report for capOS

Research into Zircon’s design for informing capOS capability model, IPC, virtual memory, async I/O, and interface definition decisions.

1. Handle-Based Capability Model

Overview

Zircon implements capabilities as handles. A handle is a process-local integer (similar to a Unix file descriptor) that references a kernel object and carries a bitmask of rights. The kernel maintains a per-process handle table that maps handle values to (kernel_object_pointer, rights) pairs. Processes can only interact with kernel objects through handles they hold.

There is no ambient authority in Zircon. A process cannot address kernel objects by name, path, or global ID – it must possess a handle. The initial set of handles is passed to a process at creation time by its parent (or by the component framework).

Handle Representation

Internally, a handle is:

  • A process-local 32-bit integer (the “handle value”). The low two bits encode a generation counter to detect use-after-close.
  • A reference to a kernel object (refcounted Dispatcher in Zircon’s C++).
  • A rights bitmask (zx_rights_t, a uint32_t).

The handle table is per-process, so handle value 0x1234 in process A and 0x1234 in process B refer to completely different objects (or nothing).

Rights

Rights are a bitmask that constrain what operations a handle can perform. Key rights include:

| Right | Meaning |
|---|---|
| ZX_RIGHT_DUPLICATE | Can be duplicated via zx_handle_duplicate() |
| ZX_RIGHT_TRANSFER | Can be sent through a channel |
| ZX_RIGHT_READ | Can read data (channel messages, VMO bytes) |
| ZX_RIGHT_WRITE | Can write data |
| ZX_RIGHT_EXECUTE | VMO can be mapped as executable |
| ZX_RIGHT_MAP | VMO can be mapped into a VMAR |
| ZX_RIGHT_GET_PROPERTY | Can query object properties |
| ZX_RIGHT_SET_PROPERTY | Can modify object properties |
| ZX_RIGHT_SIGNAL | Can set user signals on the object |
| ZX_RIGHT_WAIT | Can wait on the object’s signals |
| ZX_RIGHT_MANAGE_PROCESS | Can perform management ops on a process |
| ZX_RIGHT_MANAGE_THREAD | Can manage threads |

When a syscall is invoked on a handle, the kernel checks that the handle’s rights include the rights required by that syscall. For example, zx_channel_write() requires ZX_RIGHT_WRITE on the channel handle.

Rights can only be reduced, never amplified. zx_handle_duplicate() takes a rights mask and the new handle gets original_rights & requested_rights.

Handle Lifecycle

Creation: Syscalls that create kernel objects return handles. For example, zx_channel_create() returns two handles (one for each endpoint). zx_vmo_create() returns a VMO handle. The initial rights are defined per object type (e.g., a new channel endpoint gets READ|WRITE|TRANSFER|DUPLICATE|SIGNAL|WAIT).

Duplication: zx_handle_duplicate(handle, rights) -> new_handle. Creates a second handle to the same kernel object, possibly with reduced rights. The original is untouched. Requires ZX_RIGHT_DUPLICATE on the source handle.

Transfer: Handles are transferred through channels. When a message is written to a channel, handles listed in the message are moved from the sender’s handle table to a transient state inside the channel message. When the message is read, those handles are installed into the receiver’s handle table with new handle values. The original handle values in the sender become invalid. Transfer requires ZX_RIGHT_TRANSFER on each handle being sent.

Replacement: zx_handle_replace(handle, rights) -> new_handle. Atomically invalidates the old handle and creates a new one with the specified rights (must be a subset). This avoids a window where two handles exist simultaneously (unlike duplicate-then-close). Useful for reducing rights before transferring.

Closing: zx_handle_close(handle). Removes the handle from the process’s table and decrements the kernel object’s refcount. When the last handle to an object is closed, the object is destroyed (with some exceptions like the kernel itself keeping references).
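The attenuation rules can be modeled in a few lines. These are invented Rust types standing in for the C API; per the descriptions above, duplicate masks rights against the original, and replace consumes the old handle even when it fails.

```rust
// Toy model of Zircon-style rights attenuation: rights only shrink.

const RIGHT_READ: u32 = 1 << 0;
const RIGHT_WRITE: u32 = 1 << 1;
const RIGHT_DUPLICATE: u32 = 1 << 2;

#[derive(Clone, Debug, PartialEq)]
struct Handle {
    object: u64,
    rights: u32,
}

/// Duplicate: second handle to the same object, rights intersected
/// with the request. Requires RIGHT_DUPLICATE; source stays valid.
fn duplicate(h: &Handle, requested: u32) -> Result<Handle, &'static str> {
    if h.rights & RIGHT_DUPLICATE == 0 {
        return Err("missing ZX_RIGHT_DUPLICATE");
    }
    Ok(Handle { object: h.object, rights: h.rights & requested })
}

/// Replace: atomically invalidate the old handle and produce a new one
/// with a subset of its rights. The old handle is consumed either way,
/// so no window exists where two handles are live.
fn replace(slot: &mut Option<Handle>, requested: u32) -> Result<Handle, &'static str> {
    let h = slot.take().ok_or("invalid handle")?;
    if requested & !h.rights != 0 {
        return Err("rights can only be reduced");
    }
    Ok(Handle { object: h.object, rights: requested })
}
```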

Comparison to capOS

capOS’s current CapTable maps CapId (u32) to an Arc<dyn CapObject>. The shared Arc lets a single kernel capability (for example, a kernel:endpoint owned by one service and referenced by another through CapSource::Service) back multiple per-process CapTable slots for cross-process IPC. This is conceptually similar to Zircon’s handle table, but with key differences:

| Aspect | Zircon | capOS (current) |
|---|---|---|
| Rights | Bitmask per handle | None (all-or-nothing) |
| Object types | Fixed kernel types (Channel, VMO, etc.) | Extensible via CapObject trait |
| Transfer | Move semantics through channels | Copy/move descriptors through Endpoint IPC |
| Duplication | Explicit with rights reduction | Copy transfer for transferable holds |
| Revocation | Close handle; object dies with last ref | Remove from table; no propagation |
| Interface | Fixed syscall per object type | Cap’n Proto method dispatch |
| Generation counter | Low bits of handle value | Upper bits of CapId |

Recommendations for capOS:

  1. Keep method authority in typed interfaces for now. Zircon’s rights bitmask is useful for an untyped syscall surface. capOS currently uses narrow Cap’n Proto interfaces plus hold-edge transfer metadata; generic READ/WRITE flags would duplicate schema-level authority unless a concrete cross-interface need appears.

  2. Handle generation counters. Implemented: capOS encodes a generation tag in the upper bits of CapId, with lower bits selecting the table slot. This catches stale CapId use after slot reuse.

  3. Move semantics for transfer. Implemented for Endpoint CALL/RETURN sideband descriptors. Copy transfer remains explicit and requires a transferable source hold.

  4. replace operation. An atomic replace (invalidate old, create new with reduced rights) is cleaner than duplicate-then-close for rights attenuation before transfer.
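The generation-counter scheme in recommendation 2 can be sketched concretely. The following is an illustrative model only — it assumes a 32-bit CapId split into an 8-bit generation tag (upper bits) and a 24-bit slot index (lower bits); the actual bit split and table types in capOS may differ.

```rust
// Illustrative generation-tagged CapId: upper bits are a generation
// counter, lower bits select the table slot (exact split is assumed).
const SLOT_BITS: u32 = 24;
const SLOT_MASK: u32 = (1 << SLOT_BITS) - 1;

pub fn encode(slot: u32, generation: u32) -> u32 {
    debug_assert!(slot <= SLOT_MASK);
    (generation << SLOT_BITS) | (slot & SLOT_MASK)
}

pub fn slot(cap_id: u32) -> u32 {
    cap_id & SLOT_MASK
}

pub fn generation(cap_id: u32) -> u32 {
    cap_id >> SLOT_BITS
}

/// Each slot remembers the generation of its current occupant, so a
/// lookup with a stale generation fails instead of silently aliasing
/// whatever object reused the slot.
pub struct Slot {
    pub generation: u32,
    pub live: bool,
}

pub fn lookup(slots: &[Slot], cap_id: u32) -> Option<usize> {
    let idx = slot(cap_id) as usize;
    let s = slots.get(idx)?;
    if s.live && s.generation == generation(cap_id) {
        Some(idx)
    } else {
        None
    }
}

fn main() {
    let mut slots = vec![Slot { generation: 0, live: true }];
    let old = encode(0, 0);
    assert!(lookup(&slots, old).is_some());
    // Close the cap and reuse slot 0: the generation bumps to 1.
    slots[0] = Slot { generation: 1, live: true };
    assert!(lookup(&slots, old).is_none()); // stale CapId rejected
    assert!(lookup(&slots, encode(0, 1)).is_some());
}
```

The same shape works for Zircon-style handles with the generation in the low bits; only the shift and mask change.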

2. Channels

Overview

Zircon channels are the fundamental IPC primitive. A channel is a bidirectional, asynchronous message-passing pipe with two endpoints. Each endpoint is a separate kernel object referenced by a handle.

Creation and Structure

zx_channel_create(options, &handle0, &handle1) creates a channel and returns handles to both endpoints. Each endpoint can be independently transferred to different processes. When one endpoint is closed, the other becomes “peer-closed” (signaled with ZX_CHANNEL_PEER_CLOSED).

Message Format

A channel message consists of:

  • Data: Up to 65,536 bytes (64 KiB) of arbitrary byte payload.
  • Handles: Up to 64 handles transferred with the message.

Messages are discrete and ordered (FIFO). There is no streaming and there are no partial reads – you receive a complete message or nothing.

Write and Read Syscalls

Write: zx_channel_write(handle, options, bytes, num_bytes, handles, num_handles)

  • Copies bytes into the kernel message queue.
  • Moves each handle in the handles array from the caller’s handle table into the message. If any handle is invalid or lacks ZX_RIGHT_TRANSFER, the entire write fails and no handles are moved.
  • The write is non-blocking. If the peer has been closed, returns ZX_ERR_PEER_CLOSED.

Read: zx_channel_read(handle, options, bytes, handles, num_bytes, num_handles, actual_bytes, actual_handles)

  • Dequeues the next message. Copies data into bytes, installs handles into the caller’s handle table, writing new handle values into the handles array.
  • If the buffer is too small, returns ZX_ERR_BUFFER_TOO_SMALL and fills actual_bytes/actual_handles so the caller can retry with a larger buffer.
  • Non-blocking by default.

zx_channel_call: A synchronous call primitive. Writes a message to the channel, then blocks waiting for a reply with a matching transaction ID. This is the primary mechanism for client-server RPC. The kernel optimizes this path to avoid unnecessary scheduling: if the server thread is waiting to read, the kernel can directly switch to it (similar to L4 IPC optimizations).

Handle Transfer Mechanics

When handles are sent through a channel:

  1. The kernel validates all handles (exist, have TRANSFER right).
  2. Handles are atomically removed from the sender’s table.
  3. Handle objects are stored inside the kernel message structure.
  4. On read, handles are inserted into the receiver’s table with fresh handle values.
  5. If the channel is destroyed with unread messages containing handles, those handles are closed (objects’ refcounts decremented).

This is critical: handle transfer is move, not copy. The sender loses the handle. To keep a copy, the sender must duplicate before sending.
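The validate-then-move discipline above can be sketched as a small model. This is illustrative pseudostructure, not Zircon or capOS code: the type names (`HandleTable`, `Rights`) and the two-phase shape are assumptions made for the example.

```rust
use std::collections::HashMap;

// Illustrative Zircon-style move transfer: validate every handle before
// removing any, so a failed write moves nothing (all-or-nothing).
#[derive(Clone, Copy)]
pub struct Rights {
    pub transfer: bool,
}

pub struct HandleTable {
    entries: HashMap<u32, Rights>,
}

impl HandleTable {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }
    pub fn insert(&mut self, h: u32, r: Rights) {
        self.entries.insert(h, r);
    }
    pub fn contains(&self, h: u32) -> bool {
        self.entries.contains_key(&h)
    }

    /// All-or-nothing move of `handles` out of this table into a message.
    pub fn take_for_message(&mut self, handles: &[u32]) -> Result<Vec<Rights>, &'static str> {
        // Phase 1: validate (handle exists, has TRANSFER right) without mutating.
        for h in handles {
            match self.entries.get(h) {
                None => return Err("bad handle"),
                Some(r) if !r.transfer => return Err("missing TRANSFER right"),
                Some(_) => {}
            }
        }
        // Phase 2: remove; the sender's handle values become invalid.
        Ok(handles.iter().map(|h| self.entries.remove(h).unwrap()).collect())
    }
}

fn main() {
    let mut t = HandleTable::new();
    t.insert(1, Rights { transfer: true });
    t.insert(2, Rights { transfer: false });
    // One bad handle fails the whole write and moves nothing.
    assert!(t.take_for_message(&[1, 2]).is_err());
    assert!(t.contains(1));
    // A valid transfer removes the handle from the sender's table.
    assert!(t.take_for_message(&[1]).is_ok());
    assert!(!t.contains(1));
}
```

capOS's sideband transfer descriptors need the same all-or-nothing property if a single SQE can carry multiple capability transfers.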

Signals

Each channel endpoint has associated signals:

  • ZX_CHANNEL_READABLE – at least one message is queued.
  • ZX_CHANNEL_PEER_CLOSED – the other endpoint was closed.

Processes can wait on these signals using zx_object_wait_one(), zx_object_wait_many(), or by binding to a port (see Section 4).

FIDL Relationship

Channels carry raw bytes + handles. FIDL (Section 5) provides the structured protocol layer on top: it defines how bytes are laid out (message header with transaction ID, ordinal, flags; then the payload) and how handles in the message correspond to protocol-level concepts (client endpoints, server endpoints, VMOs, etc.).

Every FIDL protocol communication happens over a channel. A FIDL “client end” is a channel endpoint handle where the client sends requests and reads responses. A “server end” is the other endpoint where the server reads requests and sends responses.

Comparison to capOS

capOS currently uses shared submission/completion rings with Endpoint objects for cross-process CALL/RECV/RETURN routing. Same-process capabilities dispatch directly through the holder’s table; cross-process Endpoint calls queue to the server ring and can trigger a direct IPC handoff when the receiver is blocked.

| Aspect | Zircon Channels | capOS |
|---|---|---|
| Topology | Point-to-point, 2 endpoints | Endpoint-routed capability calls |
| Async | Non-blocking read/write + signal waits | Shared SQ/CQ rings |
| Handle/cap transfer | Embedded in messages | Sideband transfer descriptors |
| Message format | Raw bytes + handles | Cap’n Proto serialized |
| Size limits | 64 KiB data, 64 handles | 64 KiB params (current limit) |
| Buffering | Kernel-side message queue | Endpoint queues plus per-process rings |

Recommendations for capOS:

  1. Capability transfer alongside capnp messages. Zircon embeds handles as out-of-band data alongside message bytes. capOS has adopted the same separation with ring sideband transfer descriptors and result-cap records. That keeps the kernel from parsing arbitrary Cap’n Proto payload graphs.

  2. Two-endpoint channels vs. Endpoint calls. Zircon’s channels are general-purpose pipes. capOS uses a lighter Endpoint CALL/RECV/RETURN model where a capability invocation is routed to the serving process rather than requiring a channel object per connection.

  3. Message size limits. Zircon’s 64 KiB limit has been a pain point (large data must go through VMOs). capOS’s capnp messages naturally handle this because large data can be a separate VMO-like capability referenced in the message. Keep the per-message limit reasonable (64 KiB is a good default) and use capability references for bulk data.

3. VMARs and VMOs

Virtual Memory Objects (VMOs)

A VMO is a kernel object representing a contiguous region of virtual memory that can be mapped into address spaces. VMOs are the fundamental unit of memory in Zircon.

Types:

  • Paged VMO: Backed by the page fault handler. Pages are allocated on demand. This is the default.
  • Physical VMO: Backed by a specific contiguous range of physical memory. Used for device MMIO.
  • Contiguous VMO: Like a paged VMO but guarantees physically contiguous pages. Used for DMA.

Key operations:

  • zx_vmo_create(size, options) -> handle: Create a paged VMO.
  • zx_vmo_read(handle, buffer, offset, length): Read bytes from a VMO.
  • zx_vmo_write(handle, buffer, offset, length): Write bytes to a VMO.
  • zx_vmo_get_size() / zx_vmo_set_size(): Query/resize.
  • zx_vmo_op_range(): Operations like commit (force-allocate pages), decommit (release pages back to system), cache ops.

VMOs can be read/written directly via syscalls without mapping them. This is useful for small transfers but less efficient than mapping for large data.

Copy-on-Write (CoW) Cloning

zx_vmo_create_child(handle, options, offset, size) -> child_handle

Creates a child VMO that is a CoW clone of a range within the parent. Several clone types exist:

  • Snapshot (ZX_VMO_CHILD_SNAPSHOT): Point-in-time snapshot. Both parent and child see CoW pages. Writes to either side trigger page copies. The child is fully independent after creation – closing the parent does not affect committed pages in the child.

  • Slice (ZX_VMO_CHILD_SLICE): A window into the parent. No CoW – writes to the slice are visible through the parent and vice versa. The child cannot outlive the parent.

  • Snapshot-at-least-on-write (ZX_VMO_CHILD_SNAPSHOT_AT_LEAST_ON_WRITE): Like snapshot but allows the implementation to share unchanged pages between parent and child more aggressively (pages only diverge when written).

CoW cloning is central to how Fuchsia implements fork()-like semantics for memory (though Fuchsia doesn’t have fork()) and how it shares immutable data (e.g., shared libraries are CoW-cloned VMOs).
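The snapshot sharing/divergence rule can be illustrated with a toy model. This is not how the kernel implements CoW (which tracks page state in hardware-assisted page tables); it only demonstrates the semantics: pages are shared after cloning and copied privately on first write. All names here are invented for the example.

```rust
use std::rc::Rc;

// Toy snapshot-CoW model: a VMO is a list of reference-counted pages.
// A snapshot child clones the Rc pointers (cheap); a write to a shared
// page copies it privately first, so the other side keeps the old data.
type Page = Rc<Vec<u8>>;

pub struct Vmo {
    pages: Vec<Page>,
}

impl Vmo {
    pub fn new(num_pages: usize) -> Self {
        Self {
            pages: (0..num_pages).map(|_| Rc::new(vec![0u8; 4096])).collect(),
        }
    }
    pub fn snapshot_child(&self) -> Self {
        // Point-in-time snapshot: share every page, copy nothing yet.
        Self { pages: self.pages.clone() }
    }
    pub fn write(&mut self, page: usize, offset: usize, byte: u8) {
        // Rc::make_mut copies the page only if it is currently shared.
        Rc::make_mut(&mut self.pages[page])[offset] = byte;
    }
    pub fn read(&self, page: usize, offset: usize) -> u8 {
        self.pages[page][offset]
    }
}

fn main() {
    let mut parent = Vmo::new(4);
    let mut child = parent.snapshot_child();
    parent.write(0, 0, 7); // parent's page 0 diverges
    child.write(1, 0, 9);  // child's page 1 diverges
    assert_eq!(parent.read(0, 0), 7);
    assert_eq!(child.read(0, 0), 0); // child still sees snapshot state
    assert_eq!(child.read(1, 0), 9);
    assert_eq!(parent.read(1, 0), 0);
}
```

A slice child, by contrast, would keep sharing the same pages on write — no `make_mut`, so both sides observe each other's writes.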

Virtual Memory Address Regions (VMARs)

A VMAR represents a contiguous range of virtual address space within a process. VMARs form a tree rooted at the process’s root VMAR, which covers the entire user-accessible address space.

Hierarchy:

Root VMAR (entire user address space)
  +-- Sub-VMAR A (e.g., 0x1000..0x10000)
  |     +-- Mapping of VMO X at offset 0x1000
  |     +-- Sub-VMAR B (0x5000..0x8000)
  |           +-- Mapping of VMO Y at offset 0x5000
  +-- Sub-VMAR C (0x20000..0x30000)
        +-- Mapping of VMO Z at offset 0x20000

Key operations:

  • zx_vmar_map(vmar, options, offset, vmo, vmo_offset, len) -> addr: Map a VMO (or a range of it) into the VMAR at a specific offset or let the kernel choose (ASLR).
  • zx_vmar_unmap(vmar, addr, len): Remove a mapping.
  • zx_vmar_protect(vmar, options, addr, len): Change permissions (read/write/execute) on a mapped range.
  • zx_vmar_allocate(vmar, options, offset, size) -> child_vmar, addr: Create a sub-VMAR.
  • zx_vmar_destroy(vmar): Recursively unmap everything and destroy all sub-VMARs. Prevents new mappings.

ASLR: Zircon implements address space layout randomization through VMARs. When ZX_VM_OFFSET_IS_UPPER_LIMIT or no specific offset is given, the kernel randomizes placement within the VMAR.

Permissions: Mapping permissions (R/W/X) are constrained by the VMO handle’s rights. A VMO handle without ZX_RIGHT_EXECUTE cannot be mapped as executable, regardless of what the zx_vmar_map() call requests.
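The rule is a subset check: every permission the mapping requests must already be present on the VMO handle. A minimal sketch, using invented bit values rather than Zircon's actual right constants:

```rust
// Illustrative rights subset check for mapping: a request is granted
// only if all requested permissions are present on the handle.
// Bit values are invented for the example, not Zircon's real ones.
const READ: u8 = 1 << 0;
const WRITE: u8 = 1 << 1;
const EXECUTE: u8 = 1 << 2;

pub fn can_map(handle_rights: u8, requested: u8) -> bool {
    // `requested` must be a subset of `handle_rights`:
    // no bit may be set in `requested` that is clear in `handle_rights`.
    requested & !handle_rights == 0
}

fn main() {
    let rw_handle = READ | WRITE; // no EXECUTE right
    assert!(can_map(rw_handle, READ));
    assert!(can_map(rw_handle, READ | WRITE));
    assert!(!can_map(rw_handle, READ | EXECUTE)); // executable mapping denied
}
```

If capOS adds a MemoryObject capability, the same subset check applies between the capability's allowed permissions and the per-mapping flags.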

Why VMARs Matter

VMARs provide:

  1. Sandboxing within a process. A component can be given a sub-VMAR handle instead of the root VMAR, limiting where it can map memory.
  2. Hierarchical cleanup. Destroying a VMAR recursively unmaps everything beneath it.
  3. Controlled mapping. The parent decides the address space layout for child components by allocating sub-VMARs and passing only sub-VMAR handles.

Comparison to capOS

capOS currently has AddressSpace plus a VirtualMemory capability for anonymous map/unmap/protect operations. There is no VMO-like shared memory object yet; FrameAllocator still exposes raw physical frame grants.

| Aspect | Zircon | capOS (current) |
|---|---|---|
| Memory objects | VMO (paged, physical, contiguous) | Raw frames plus anonymous VirtualMemory mappings |
| CoW | VMO child clones (snapshot, slice) | Not implemented |
| Address space | VMAR tree | Flat AddressSpace plus VirtualMemory cap |
| Sharing | Map same VMO in multiple processes | Not implemented |
| Permissions | Per-mapping + per-handle rights | Per-page flags at mapping time |

Recommendations for capOS:

  1. VMO-equivalent capability. A “MemoryObject” capability that represents a range of memory (backed by demand-paging or physical pages). This becomes the unit of sharing: pass a MemoryObject cap through IPC, and the receiver maps it into their address space. Define it in schema/capos.capnp.

  2. Sub-VMAR capabilities for sandboxing. When spawning a process, instead of granting access to the full address space, grant a sub-region capability. This limits where the process can map memory.

  3. CoW cloning is valuable but not urgent. The primary use case (shared libraries, fork) may not apply to capOS’s early stages. Design the VMO interface to support cloning later.

  4. VMO read/write without mapping. Zircon allows reading/writing VMO contents via syscall without mapping. This is useful for small IPC data and avoids TLB pressure. Consider supporting this in capOS’s MemoryObject.

4. Async Model (Ports)

Overview

Zircon’s async I/O model is built around ports – kernel objects that receive event packets. A port is similar to Linux’s epoll but with important differences. It is the foundation for all async programming in Fuchsia.

Port Basics

A port is a kernel object with a queue of packets (zx_port_packet_t). Packets arrive either from signal-based waits or from direct user queuing.

Key operations:

  • zx_port_create(options) -> handle: Create a port.
  • zx_port_wait(port, deadline) -> packet: Dequeue the next packet, blocking until one is available or the deadline expires.
  • zx_port_queue(port, packet): Manually enqueue a user packet.
  • zx_port_cancel(port, source, key): Cancel pending waits.

Signal-Based Async (Object Wait Async)

zx_object_wait_async(object, port, key, signals, options):

This is the primary mechanism. It tells the kernel: “when object has any of these signals asserted, deliver a packet to port with this key.”

Two modes:

  • One-shot (ZX_WAIT_ASYNC_ONCE): The wait fires once and is automatically removed. The user must re-register after handling.
  • Edge-triggered (ZX_WAIT_ASYNC_EDGE): Fires each time a signal transitions from deasserted to asserted. Stays registered.

Packet Format

typedef struct zx_port_packet {
    uint64_t key;              // User-defined key (set during wait_async)
    uint32_t type;             // ZX_PKT_TYPE_SIGNAL_ONE, ZX_PKT_TYPE_USER, etc.
    zx_status_t status;        // Result status
    union {
        zx_packet_signal_t signal;   // Which signals triggered
        zx_packet_user_t user;       // User-queued packet payload (32 bytes)
        zx_packet_guest_bell_t guest_bell;
        // ... other packet types
    };
} zx_port_packet_t;

The signal variant includes trigger (which signals were waited on), observed (current signal state), and a count (for edge-triggered, how many transitions).

Async Dispatching (libasync)

Fuchsia’s userspace async libraries (libasync / async-loop, together with the FIDL bindings) provide a higher-level event loop:

  1. async::Loop: An event loop that owns a port and dispatches events to registered handlers.
  2. async::Wait: Wraps zx_object_wait_async() with a callback. When the signal fires, the loop calls the handler.
  3. async::Task: Runs a closure on the loop’s dispatcher.
  4. FIDL bindings: The async FIDL bindings register channel-readable waits on the loop’s port. When a message arrives, the FIDL dispatcher decodes it and calls the appropriate protocol method handler.

The typical pattern:

loop = async::Loop()
loop.port -> zx_port_create()

// Register interest in channel readability
zx_object_wait_async(channel, loop.port, key, ZX_CHANNEL_READABLE)

// Event loop
while True:
    packet = zx_port_wait(loop.port)
    handler = lookup(packet.key)
    handler(packet)
    // Re-register if one-shot

Comparison to Linux io_uring

| Aspect | Zircon Ports | Linux io_uring |
|---|---|---|
| Model | Event notification (signals) | Operation submission/completion |
| Submission | No SQ; operations are separate syscalls | SQ ring: batch operations |
| Completion | Port packet queue | CQ ring in shared memory |
| Kernel transitions | One per wait_async + one per port_wait | One per io_uring_enter (batched) |
| Memory sharing | No shared ring buffers | SQ/CQ are mmap’d shared memory |
| Zero-copy | Not for port packets | Registered buffers, fixed files |
| Batching | No inherent batching | Core design: submit N ops, one syscall |
| Chaining | Not supported | SQE linking (sequential/parallel) |
| Scope | Signal notification only | Full I/O operations (read, write, send, recv, fsync, …) |

Key differences:

  1. Ports are notification-based; io_uring is operation-based. A port tells you “something happened” (a signal was asserted), then you do separate syscalls to act on it (read the channel, accept the socket, etc.). io_uring lets you submit the actual I/O operation and the kernel does it asynchronously, returning the result in the completion ring.

  2. io_uring avoids syscalls for submission. The submission queue is shared memory – userspace writes SQEs and the kernel reads them without a syscall (in polling mode) or with a single io_uring_enter() for a batch of operations. Ports require a syscall per wait_async registration.

  3. io_uring supports chaining. SQE linking allows dependent operations (e.g., “read from file, then write to socket”) without returning to userspace between steps.

  4. Ports are simpler. The signal model is straightforward and composes well with Zircon’s object model. io_uring’s complexity (dozens of opcodes, registered buffers, fixed files, kernel-side polling) is much higher.

Performance Tradeoffs

Ports:

  • Pro: Simple, well-integrated with kernel object model, easy to reason about.
  • Con: Extra syscalls per operation (wait_async to register, port_wait to receive, then the actual operation syscall). At least 3 syscalls per async operation.

io_uring:

  • Pro: Can batch many operations in a single syscall. Shared-memory rings avoid copies. Kernel-side polling can eliminate syscalls entirely.
  • Con: Complex API surface, security attack surface (many kernel bugs have been in io_uring), complex state management.

Comparison to capOS’s Planned Async Rings

capOS plans io_uring-inspired capability rings: an SQ where userspace submits capnp-serialized capability invocations and a CQ where the kernel posts completions.

| Aspect | Zircon Ports | capOS Planned Rings |
|---|---|---|
| Submission | Separate syscalls | SQ in shared memory |
| Completion | Port packet queue (kernel-owned) | CQ in shared memory |
| Operation scope | Signal notification only | Full capability invocations |
| Batching | None | Natural (fill SQ, single syscall) |
| Wire format | Fixed packet struct | Cap’n Proto messages |

Recommendations for capOS:

  1. The io_uring model is better than ports for capOS’s use case. Since every operation in capOS is a capability invocation (not just a signal notification), putting the full operation in the submission ring eliminates the extra round-trip that ports require. This is the right choice.

  2. Keep a signal/notification mechanism too. Even with async rings, capOS needs a way to wait for events (e.g., “data available on this channel”, “process exited”). Consider a simple signal/wait mechanism alongside the capability rings – perhaps signal delivery goes through the CQ as a special completion type.

  3. Study io_uring’s SQE linking. Chaining dependent capability calls (e.g., “read from FileStore, then write to Console”) without returning to userspace is powerful. This maps naturally to Cap’n Proto promise pipelining: “call method A on cap X, then call method B on the result’s capability” – the kernel can chain these internally.

  4. Registered/fixed capabilities. io_uring has “fixed files” (registered fd set for faster lookup). capOS could have a “hot set” of capabilities pinned in the SQ context for faster dispatch (avoid per-call table lookup).

  5. Completion ordering. io_uring completions can arrive out of order. capOS’s CQ should also support out-of-order completion (each SQE has a user_data tag echoed in the CQE) to enable true async pipelining.
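The user_data tagging in recommendation 5 works as follows: the submitter records what each tag means, and because the completion echoes the tag, entries can be dispatched in whatever order they arrive. A minimal sketch with invented SQE/CQE layouts (not the actual capOS ring format):

```rust
use std::collections::HashMap;

// Illustrative SQE/CQE shapes: each submission carries a user_data tag
// that the completion echoes, so out-of-order completion is harmless.
#[derive(Clone, Copy)]
pub struct Sqe {
    pub user_data: u64,
    pub op: u32,
}

#[derive(Clone, Copy)]
pub struct Cqe {
    pub user_data: u64,
    pub result: i32,
}

fn main() {
    // The submitter remembers the continuation for each in-flight tag.
    let mut inflight: HashMap<u64, &'static str> = HashMap::new();
    let submissions = [Sqe { user_data: 1, op: 10 }, Sqe { user_data: 2, op: 11 }];
    inflight.insert(1, "read");
    inflight.insert(2, "write");

    // The kernel completes them in the opposite order; the echoed tags
    // still route each CQE to the right continuation.
    let completions = [
        Cqe { user_data: submissions[1].user_data, result: 0 },
        Cqe { user_data: submissions[0].user_data, result: 4096 },
    ];
    for cqe in completions {
        match inflight.remove(&cqe.user_data).expect("unknown tag") {
            "write" => assert_eq!(cqe.result, 0),
            "read" => assert_eq!(cqe.result, 4096),
            _ => unreachable!(),
        }
    }
    assert!(inflight.is_empty()); // every submission got exactly one completion
}
```

This is the same contract io_uring gives: the CQE order carries no meaning, only the echoed tag does.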

5. FIDL (Fuchsia Interface Definition Language)

Overview

FIDL is Fuchsia’s IDL for defining protocols that communicate over channels. It serves a similar role to Cap’n Proto schemas in capOS: defining the contract between client and server.

FIDL vs. Cap’n Proto: Schema Language

FIDL example:

library fuchsia.example;

type Color = strict enum : uint32 {
    RED = 1;
    GREEN = 2;
    BLUE = 3;
};

protocol Painter {
    SetColor(struct { color Color; }) -> ();
    DrawLine(struct { x0 float32; y0 float32; x1 float32; y1 float32; }) -> ();
    -> OnPaintComplete(struct { num_pixels uint64; });
};

Equivalent Cap’n Proto:

enum Color { red @0; green @1; blue @2; }

interface Painter {
    setColor @0 (color :Color) -> ();
    drawLine @1 (x0 :Float32, y0 :Float32, x1 :Float32, y1 :Float32) -> ();
}

Key differences in the schema language:

| Feature | FIDL | Cap’n Proto |
|---|---|---|
| Unions | flexible union, strict union | Anonymous unions in structs |
| Enums | strict enum, flexible enum | enum (always strict) |
| Optionality | box<T>, nullable types | Default values, union with Void |
| Evolution | flexible keyword for forward compat | Field numbering, @N ordinals |
| Tables | table (like protobuf, sparse) | struct with default values |
| Events | -> EventName(...) server-sent | No built-in events |
| Error syntax | -> () error uint32 | Must encode in return struct |
| Capability types | client_end:P, server_end:P | interface P as field type |
FIDL’s table type is analogous to Cap’n Proto structs in terms of evolvability (can add fields without breaking), but Cap’n Proto structs are more compact on the wire (fixed-size inline section + pointers) while FIDL tables use an envelope-based encoding.

Wire Format Comparison

FIDL wire format:

  • Little-endian, 8-byte aligned.
  • Messages have a 16-byte header: txid (4 bytes), flags (3 bytes), magic byte (0x01), ordinal (8 bytes).
  • Structs are laid out inline with natural alignment and explicit padding.
  • Out-of-line data (strings, vectors, tables) uses offset-based indirection via “envelopes” (inline 8-byte entry: 4 bytes num_bytes, 2 bytes num_handles, 2 bytes flags).
  • Handles are out-of-band. The wire format contains ZX_HANDLE_PRESENT (0xFFFFFFFF) or ZX_HANDLE_ABSENT (0x00000000) markers where handles appear. The actual handles are in the channel message’s handle array, consumed in order of appearance in the linearized message.
  • Encoding is done into a contiguous byte buffer + a separate handle array, matching the channel write API.
  • No pointer arithmetic. FIDL v2 uses a “depth-first traversal order” encoding where out-of-line objects are laid out sequentially. Offsets are not stored; the decoder walks the type schema to find boundaries.

Cap’n Proto wire format:

  • Little-endian, 8-byte aligned (word-based).
  • Messages have a segment table header listing segment sizes.
  • Structs have a fixed data section + pointer section. Pointers are relative offsets (self-relative, in words).
  • Uses pointer-based random access: can read any field without parsing the entire message.
  • Capabilities are indexed. Cap’n Proto’s RPC protocol assigns capability table indices to interface references in messages. The actual capability (file descriptor, handle, etc.) is transferred out-of-band.
  • Supports multi-segment messages (FIDL is always single-segment).
  • Zero-copy read: can read directly from the wire buffer without deserialization.

Key wire format differences:

| Property | FIDL | Cap’n Proto |
|---|---|---|
| Random access | No (sequential decode) | Yes (pointer-based) |
| Zero-copy read | Partial (decode-on-access for some types) | Full (read from buffer) |
| Segments | Single contiguous buffer | Multi-segment |
| Pointers | Implicit (traversal order) | Explicit (relative offsets) |
| Size overhead | Smaller (no pointer words) | Larger (pointer section) |
| Decode cost | Must validate sequentially | Can validate lazily |
| Handle/cap encoding | Presence markers + out-of-band array | Cap table indices + out-of-band |

FIDL Capability Transfer

FIDL has first-class syntax for capability transfer in protocols:

protocol FileSystem {
    Open(resource struct {
        path string:256;
        flags uint32;
        object server_end:File;
    }) -> ();
};

protocol File {
    Read(struct { count uint64; }) -> (struct { data vector<uint8>:MAX; });
    GetBuffer(struct { flags uint32; }) -> (resource struct { buffer zx.Handle:VMO; });
};

  • server_end:File – a channel endpoint where the server will serve the File protocol. The client creates a channel, keeps the client end, and sends the server end through this call.
  • client_end:File – a channel endpoint for a client of the File protocol.
  • zx.Handle:VMO – a handle to a specific kernel object type (VMO).
  • The resource keyword marks types that contain handles (and thus cannot be copied, only moved).

The FIDL compiler tracks handle ownership: types containing handles are “resource types” with move semantics. This is enforced at the language binding level (e.g., in C++, resource types are move-only; in Rust, they implement Drop but not Clone).
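The Rust shape of such a resource type can be sketched directly: a wrapper that owns a raw handle value, implements Drop but deliberately not Clone, so the compiler itself enforces move semantics. This is an illustrative pattern, not Fuchsia's actual binding code; the names are invented.

```rust
// Illustrative move-only resource type: owns a raw handle value,
// closes it on Drop, and cannot be cloned, so passing it by value
// transfers ownership exactly once.
pub struct Handle(pub u32);

impl Handle {
    pub fn raw(&self) -> u32 {
        self.0
    }
    /// Consume the wrapper when transferring the handle elsewhere.
    pub fn into_raw(self) -> u32 {
        let v = self.0;
        std::mem::forget(self); // ownership transferred; skip Drop's close
        v
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        // A real binding would close the kernel handle here.
    }
}

fn send(h: Handle) -> u32 {
    h.into_raw() // caller can no longer use `h`: it was moved in
}

fn main() {
    let h = Handle(42);
    let raw = send(h);
    assert_eq!(raw, 42);
    // Using `h` here would be a compile error: value moved into `send`.
}
```

A capOS userspace binding for capability-carrying capnp types could use the same pattern so that sending a capability through an Endpoint call statically invalidates the sender's local reference.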

Comparison to capOS’s Cap’n Proto Usage

Cap’n Proto natively supports capability transfer through its interface types:

interface FileSystem {
    open @0 (path :Text, flags :UInt32) -> (file :File);
}

interface File {
    read @0 (count :UInt64) -> (data :Data);
    getBuffer @1 (flags :UInt32) -> (buffer :MemoryObject);
}

In standard Cap’n Proto RPC, file :File in the return type means “a capability to a File interface.” The RPC system assigns a capability table index, transfers it out-of-band, and the receiver gets a live reference to invoke further methods.

Recommendations for capOS:

  1. Use out-of-band capability transfer beside Cap’n Proto payloads. Cap’n Proto RPC has capability descriptors indexed into a capability table, but capOS currently keeps kernel transfer semantics in ring sideband records so the kernel can treat Cap’n Proto payload bytes as opaque. Promise pipelining should build on that sideband result-cap namespace rather than requiring general payload traversal in the kernel.

  2. No need to switch to FIDL. Cap’n Proto’s wire format is superior for capOS’s use case:

    • Random access means runtimes and services can inspect specific fields without full deserialization. The kernel should keep using bounded sideband metadata for transport decisions.
    • Zero-copy read means less allocation in userspace protocol handling.
    • Multi-segment messages allow avoiding large contiguous allocations.
    • Promise pipelining is native to Cap’n Proto RPC, aligning with capOS’s planned async ring chaining.
  3. FIDL’s resource keyword is worth imitating. Mark capnp types that contain capabilities differently from pure-data types. This could be done at the schema level (Cap’n Proto already distinguishes interface fields) or as a convention. This enables the kernel to fast-path messages that contain no capabilities (no need to scan for capability descriptors).

  4. FIDL’s table type for evolution. Cap’n Proto structs already support adding fields, but capOS should be aware that FIDL tables are more explicitly designed for cross-version compatibility. For system interfaces that will evolve, consider using Cap’n Proto groups or designing structs with generous ordinal spacing.

6. Synthesis: Relevance to capOS

Handle Model vs. Typed Capability Dispatch

Zircon’s handle model is untyped at the handle level – a handle is just (object_ref, rights). The type comes from the object. All operations go through fixed syscalls (zx_channel_write, zx_vmo_read, etc.).

capOS’s model is typed at the capability level – each capability implements a Cap’n Proto interface with method dispatch. Operations go through ring SQEs such as CAP_OP_CALL, with Cap’n Proto params and results carried in userspace buffers.

Both are valid. Zircon’s approach is lower overhead (no serialization for simple operations like vmo_read), while capOS’s approach gives uniformity (every operation has the same wire format, enabling persistence and network transparency).

Hybrid recommendation: For performance-critical operations (memory mapping, signal waiting), consider adding “fast-path” syscalls that bypass capnp serialization, similar to how Zircon has dedicated syscalls per object type. The capnp path remains the general mechanism and the “canonical” interface.

Async Rings vs. Ports: The Right Call

capOS’s io_uring-inspired async rings are a better fit than Zircon’s port model for a capability OS:

  1. Ports require separate syscalls for registration, waiting, and the actual operation. Async rings batch everything.
  2. Cap’n Proto’s promise pipelining maps naturally to SQE chaining.
  3. The shared-memory ring design avoids kernel-side queuing overhead.

However, learn from ports:

  • The signal model (each object has a signal set, watchers are notified) is clean and composable. Consider making “wait for signal” a CQ event type.
  • zx_port_queue() (user-initiated packets) is useful for waking up event loops from user code. Support user-initiated CQ entries.

VMO/VMAR vs. capOS Memory Model

capOS should implement VMO-equivalent capabilities after the current Endpoint and transfer baseline:

  • IPC already has shared rings, but bulk data still needs explicit shared memory objects.
  • Capability transfer of memory regions (passing a MemoryObject cap through IPC) is the standard pattern for bulk data transfer.
  • CoW cloning enables efficient process creation.

Proposed capability interfaces:

interface MemoryObject {
    read @0 (offset :UInt64, count :UInt64) -> (data :Data);
    write @1 (offset :UInt64, data :Data) -> ();
    getSize @2 () -> (size :UInt64);
    setSize @3 (size :UInt64) -> ();
    createChild @4 (offset :UInt64, size :UInt64, options :UInt32) -> (child :MemoryObject);
}

interface AddressRegion {
    map @0 (offset :UInt64, vmo :MemoryObject, vmoOffset :UInt64, len :UInt64, flags :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, len :UInt64) -> ();
    protect @2 (addr :UInt64, len :UInt64, flags :UInt32) -> ();
    allocateSubRegion @3 (offset :UInt64, size :UInt64) -> (region :AddressRegion, addr :UInt64);
}

FIDL vs. Cap’n Proto: Stay with Cap’n Proto

Cap’n Proto is the right choice for capOS. The advantages over FIDL:

  1. Language-independent standard. FIDL is Fuchsia-only. Cap’n Proto has implementations in C++, Rust, Go, Python, Java, etc.
  2. Zero-copy random access. The kernel can inspect message fields without full deserialization.
  3. Promise pipelining. Native to capnp-rpc, enabling the async ring chaining that capOS plans.
  4. Persistence. Cap’n Proto messages are self-describing (with schema) and suitable for on-disk storage – important for capOS’s planned capability persistence.

The one thing FIDL does better: tight integration of handle/capability metadata in the type system (the resource keyword, client_end/server_end syntax, handle type constraints). capOS should ensure its capnp schemas clearly distinguish capability-carrying types and that the kernel enforces capability transfer semantics.

Concrete Action Items for capOS

Ordered by priority and dependency:

  1. Keep typed-interface authority model. Do not add a Zircon-style generic rights bitmask until a concrete method-attenuation need beats narrow wrapper capabilities and transfer-mode metadata.

  2. Handle generation counters. Done: upper bits of CapId detect stale references.

  3. Design MemoryObject/SharedBuffer capability. Define and implement the shared-memory object that replaces raw-frame transfer for bulk IPC.

  4. Design AddressRegion capability (Stage 5). Sub-VMAR-like sandboxing. The root VMAR handle is part of the initial capability set.

  5. Capability transfer sideband. Baseline CALL/RETURN copy and move transfer is implemented; promise-pipelined result-cap mapping still needs a precise rule before pipeline dispatch lands.

  6. Async rings with signal delivery. SQ/CQ capability rings are implemented for transport; notification objects and promise pipelining remain future work.

  7. User-queued CQ entries (with async rings). Allow userspace to post wake-up events to its own CQ, enabling pure-userspace event loop integration.

Appendix: Key Zircon Syscall Reference

For reference, the most architecturally significant Zircon syscalls:

| Syscall | Purpose |
|---|---|
| zx_handle_close | Close a handle |
| zx_handle_duplicate | Duplicate with rights reduction |
| zx_handle_replace | Atomic replace with new rights |
| zx_channel_create | Create channel pair |
| zx_channel_read | Read message + handles from channel |
| zx_channel_write | Write message + handles to channel |
| zx_channel_call | Synchronous write-then-read (RPC) |
| zx_port_create | Create async port |
| zx_port_wait | Wait for next packet |
| zx_port_queue | Enqueue user packet |
| zx_object_wait_async | Register signal wait on port |
| zx_object_wait_one | Synchronous wait on one object |
| zx_vmo_create | Create virtual memory object |
| zx_vmo_read / zx_vmo_write | Direct VMO access |
| zx_vmo_create_child | CoW clone |
| zx_vmar_map | Map VMO into address region |
| zx_vmar_unmap | Unmap |
| zx_vmar_allocate | Create sub-VMAR |
| zx_process_create | Create process (with root VMAR) |
| zx_process_start | Start process execution |

Genode OS Framework: Research Report for capOS

Research on Genode’s capability-based component framework, session routing, VFS architecture, and POSIX compatibility – with lessons for capOS.

1. Capability-Based Component Framework

Core Abstraction: RPC Objects

Genode’s fundamental abstraction is the RPC object. Every service in the system is implemented as an RPC object that can be invoked by clients holding a capability to it. The capability is an unforgeable reference – a kernel-protected token that names a specific RPC object and grants the holder the right to invoke its methods.

Genode supports multiple microkernels (NOVA, seL4, Fiasco.OC, a custom base-hw kernel). The capability model is consistent across all of them, though the kernel-level implementation details differ. The framework abstracts kernel capabilities into its own uniform model.

Key properties of Genode capabilities:

  • Unforgeable. A capability can only be obtained by delegation from a holder or creation by the kernel. There is no mechanism to synthesize a capability from an integer or address.
  • Typed. Each capability refers to an RPC object with a specific interface. The C++ type system enforces interface contracts at compile time.
  • Delegatable. A capability holder can pass it to another component via RPC arguments, allowing authority to flow through the system graph.
  • Revocable. Capabilities can be revoked (invalidated). When an RPC object is destroyed, all capabilities pointing to it become invalid.

Capability Types in Genode

Genode distinguishes several kinds of capabilities based on what they refer to:

  1. Session capabilities. The most common type. A session capability refers to a service session – an ongoing relationship between a client and a server. Example: a Log_session capability lets a client write log messages to a specific log session on a LOG server.

  2. Parent capability. Every component holds an implicit capability to its parent. This is the channel through which it requests resources and sessions. The parent capability is never explicitly passed – it’s built into the component framework.

  3. Dataspace capabilities. Represent shared-memory regions. A Ram_dataspace capability grants access to a specific region of physical memory. Dataspaces are the mechanism for bulk data transfer between components (the RPC path is for small messages and control).

  4. Signal capabilities. Used for asynchronous notifications. A signal source produces signals; holders of the signal capability can register handlers. Signals are Genode’s primary async notification mechanism – they don’t carry data, just wake up the receiver.

Sessions: The Service Contract

A session is the central concept of Genode’s inter-component communication. It represents an established relationship between a client component and a server component, with negotiated resource commitments.

Session lifecycle:

  1. Request. A client asks its parent to create a session of a specific type (e.g., Gui::Session, File_system::Session, Nic::Session). The request includes a label string and optional session arguments.

  2. Routing. The parent routes the session request according to its policy (see Section 2). The request may traverse multiple levels of the component tree.

  3. Creation. The server creates a session object, allocates resources for it (e.g., a shared-memory buffer), and returns a session capability to the client.

  4. Use. The client invokes RPC methods on the session capability. The server handles the calls. Both sides can use shared dataspaces for bulk data.

  5. Close. Either side can close the session. Resources committed to the session are released back.

This model is fundamentally different from Unix IPC (anonymous pipes/sockets). Every session is:

  • Typed – the interface is known at compile time.
  • Named – sessions carry a label used for routing and policy.
  • Resource-accounted – the client explicitly donates RAM to the server via a “session quota” to fund the server-side state for this session. This prevents denial-of-service through resource exhaustion.

Resource Trading

Genode’s resource model is unique and worth studying closely. Resources (primarily RAM) flow through the component tree:

  • The kernel grants a fixed RAM budget to core (the root component).
  • Core grants budgets to its children (typically just init).
  • Init grants budgets to its children according to the deployment config.
  • Each component can donate RAM to servers when opening sessions.

The session_quota mechanism works as follows: when a client opens a session, it specifies how much RAM it donates. This RAM transfer goes from the client’s budget to the server’s budget. The server uses this donated RAM to allocate server-side state for the session. When the session closes, the RAM flows back.
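The transfer described above can be sketched as a pair of budget counters. This is an illustrative Rust example, not Genode (or capOS) API; `Budget` and `donate` are invented names.

```rust
#[derive(Debug, PartialEq)]
enum QuotaError { InsufficientQuota }

struct Budget { avail: u64 }

// Move quota from one budget to another at session creation; calling it in
// the reverse direction at session close returns the quota to the client.
fn donate(from: &mut Budget, to: &mut Budget, amount: u64) -> Result<(), QuotaError> {
    if from.avail < amount {
        return Err(QuotaError::InsufficientQuota);
    }
    from.avail -= amount;
    to.avail += amount;
    Ok(())
}
```

The key property is conservation: `donate` never creates or destroys quota, it only moves it, which is what makes the accounting system closed.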

This creates a closed accounting system:

  • No component can use more RAM than it was granted.
  • Servers don’t need their own large budgets – clients fund their sessions.
  • Resource exhaustion is contained: a misbehaving client can only exhaust its own budget, not the server’s.

Capability Invocation vs. Delegation

Genode distinguishes two fundamental operations on capabilities:

Invocation: calling an RPC method on the capability. The caller sends a message to the RPC object named by the capability, the server processes it and returns a result. This is synchronous in Genode – the caller blocks until the server replies. (Asynchronous interaction uses signals and shared memory.)

Delegation: passing a capability as an argument in an RPC call. When a capability appears as a parameter or return value, the kernel transfers the capability reference to the receiving component. The receiver now holds an independent reference to the same RPC object. This is how authority propagates through the system.

Example: when a client opens a File_system::Session, the session creation returns a session capability. If the file system server needs to allocate memory, it calls back to the client’s RAM service using a RAM capability that was delegated during session setup.

Capabilities in Genode RPC are transferred by the kernel during the IPC operation – the framework marshals them into a special “capability argument” slot in the IPC message, and the kernel copies the capability reference into the receiver’s capability space. This is transparent to application code: capabilities appear as typed C++ objects in the RPC interface.

2. Session Routing

The Problem Session Routing Solves

In a traditional OS, services are found via well-known names in a global namespace (D-Bus addresses, socket paths, service names). This creates ambient authority – any process can connect to any service if it knows the name.

Genode has no global service namespace. A component can only obtain sessions through its parent. The parent decides which server to route each session request to. This means:

  • Service visibility is controlled structurally.
  • A component can only reach services its parent explicitly allows.
  • Different children of the same parent can be routed to different servers for the same service type.

Parent-Child Relationship

Every Genode component (except core) has exactly one parent. The parent:

  1. Created the child (spawned it with an initial set of resources).
  2. Intercepts all session requests from the child.
  3. Routes requests according to its routing policy.
  4. Can deny requests entirely (the child gets an error).

This creates a tree structure where authority flows downward. A child cannot bypass its parent to reach a service the parent didn’t approve.

Init’s Routing Configuration

The init process (Genode’s init) reads an XML configuration that specifies which services to start and how to route their session requests. This is the core of system policy.

A minimal init config:

<config>
  <parent-provides>
    <service name="LOG"/>
    <service name="ROM"/>
    <service name="CPU"/>
    <service name="RAM"/>
    <service name="PD"/>
  </parent-provides>

  <start name="timer">
    <resource name="RAM" quantum="1M"/>
    <provides> <service name="Timer"/> </provides>
    <route>
      <service name="ROM"> <parent/> </service>
      <service name="LOG"> <parent/> </service>
      <service name="CPU"> <parent/> </service>
      <service name="RAM"> <parent/> </service>
      <service name="PD">  <parent/> </service>
    </route>
  </start>

  <start name="test-log">
    <resource name="RAM" quantum="1M"/>
    <route>
      <service name="Timer"> <child name="timer"/> </service>
      <service name="LOG">   <parent/> </service>
      <!-- remaining services routed to parent by default -->
      <any-service> <parent/> </any-service>
    </route>
  </start>
</config>

Key routing directives:

  • <parent/> – route to the parent (upward delegation).
  • <child name="x"/> – route to a specific child (sibling routing).
  • <any-child/> – route to any child that provides the service.
  • <any-service> – catch-all for unspecified service types.

Label-Based Routing

Labels are strings attached to session requests. They carry context about who is requesting and what they want, enabling fine-grained routing decisions.

When a client requests a session, it attaches a label. As the request traverses the routing tree, each intermediate component (typically init) can prepend its own label. By the time the request reaches the server, the label encodes the full path through the component tree.

Example: a component named my-app inside an init subsystem named apps requests a File_system session with label "data". The composed label arriving at the file system server is: "apps -> my-app -> data".

The server can use this label for:

  • Access control. Grant different permissions based on who is asking.
  • Isolation. Store data in different directories per client.
  • Logging. Identify which component generated a message.
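The composition step itself is trivial string prefixing. A minimal sketch (illustrative, not Genode's implementation) using the "a -> b" convention from the example above:

```rust
// Each intermediate component prepends its own name for the requester as the
// session request travels up and across the routing tree.
fn prepend(hop: &str, label: &str) -> String {
    format!("{} -> {}", hop, label)
}
```

Applied twice, the client's original label "data" accumulates the path through the tree, and the server parses or pattern-matches the composed label for policy.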

Label-based routing in init config:

<start name="fs">
  <provides> <service name="File_system"/> </provides>
  <route> ... </route>
</start>

<start name="app-a">
  <route>
    <service name="File_system" label="data">
      <child name="fs"/>
    </service>
    <service name="File_system" label="config">
      <child name="config-fs"/>
    </service>
  </route>
</start>

Here, app-a’s file system requests are split: requests labeled "data" go to one server, requests labeled "config" go to another. The application code is unchanged – the routing is entirely a deployment decision.

Routing as Policy

The critical insight is that routing IS access control. There is no separate permission system. If a component’s route config doesn’t include a path to a network service, that component has no network access – period. It cannot discover the network service because it has no way to name it.

This replaces:

  • Firewall rules (routing controls which network services are reachable)
  • File permissions (routing controls which file system sessions are available)
  • Process isolation policies (routing controls everything)

The routing configuration is equivalent to a whitelist of allowed service connections for each component. Adding or removing access means editing the init config, not modifying the component’s code or the server’s access control lists.

Dynamic Routing and Sculpt

In the static case (Genode’s test scenarios), routing is defined once in init’s config. In Sculpt OS (Section 6), the routing configuration can be modified at runtime, allowing users to install applications and connect them to services dynamically.

3. VFS on Top of Capabilities

The VFS Layer

Genode’s VFS (Virtual File System) is a library-level abstraction, not a kernel feature. It provides a path-based file-like interface implemented as a plugin architecture within a component’s address space.

The VFS exists because many existing applications (and libc) expect file-like access patterns. Rather than forcing all code to use Genode’s native session/capability model, the VFS provides a translation layer.

Architecture:

Application code
  |
  |  POSIX: open(), read(), write()
  v
libc (Genode's port of FreeBSD libc)
  |
  |  VFS API: vfs_open(), vfs_read(), vfs_write()
  v
VFS library (in-process)
  |
  |  Plugin dispatch based on mount point
  v
VFS plugins (in-process)
  |
  +--> ram_fs plugin (in-memory file system)
  +--> <fs> plugin (delegates to File_system session)
  +--> <terminal> plugin (delegates to Terminal session)
  +--> <log> plugin (delegates to LOG session)
  +--> <nic> plugin (delegates to Nic session, for socket layer)
  +--> <block> plugin (delegates to Block session)
  +--> <dir> plugin (combines subtrees)
  +--> <tar> plugin (read-only tar archive)
  +--> <import> plugin (populate from ROM)
  +--> <pipe> plugin (in-process pipe pair)
  +--> <rtc> plugin (system clock)
  +--> <zero> plugin (/dev/zero equivalent)
  +--> <null> plugin (/dev/null equivalent)
  ...

VFS Plugin Architecture

Each VFS plugin is a dynamically loadable library (or statically linked module) that implements a file-system-like interface. Plugins handle:

  • open/close – create/destroy file handles
  • read/write – data transfer
  • stat – metadata queries
  • readdir – directory enumeration
  • ioctl – device-specific control (limited)

Plugins are composed by the VFS configuration, which is XML embedded in the component’s config:

<config>
  <vfs>
    <dir name="dev">
      <log/>
      <null/>
      <zero/>
      <terminal name="stdin" label="input"/>
      <inline name="rtc">2024-01-01 00:00</inline>
    </dir>
    <dir name="tmp"> <ram/> </dir>
    <dir name="data"> <fs label="persistent"/> </dir>
    <dir name="socket"> <lxip dhcp="yes"/> </dir>
  </vfs>
  <libc stdout="/dev/log" stderr="/dev/log" stdin="/dev/stdin"
        rtc="/dev/rtc" socket="/socket"/>
</config>

This config creates a virtual filesystem tree:

  • /dev/log – writes go to the LOG session
  • /dev/null, /dev/zero – standard synthetic files
  • /dev/stdin – reads from a Terminal session
  • /tmp/ – in-memory filesystem (RAM-backed)
  • /data/ – delegates to a File_system session labeled “persistent”
  • /socket/ – network sockets via lwIP stack (in-process)

The <fs> plugin is the bridge from VFS to Genode’s capability world. When the application does open("/data/foo.txt"), the <fs> plugin translates this into a File_system::Session RPC call to the external file system server that the component’s routing connects to.
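Before any plugin sees the call, the VFS must decide which mount owns the path. A plausible sketch of that dispatch step, assuming longest-prefix matching at path-component boundaries (mount entries and backend names here are illustrative, not Genode's data structures):

```rust
// Returns the backend owning `path` plus the path remainder to hand to it.
fn resolve<'a>(mounts: &[(&'a str, &'a str)], path: &str) -> Option<(&'a str, String)> {
    let mut best: Option<(&'a str, &'a str)> = None;
    for &(mp, backend) in mounts {
        // A mount matches only at a path-component boundary, so "/data"
        // matches "/data/foo.txt" but not "/database".
        let matches = path.starts_with(mp)
            && (path.len() == mp.len() || path.as_bytes()[mp.len()] == b'/');
        if matches && best.map_or(true, |(b, _)| mp.len() > b.len()) {
            best = Some((mp, backend));
        }
    }
    best.map(|(mp, backend)| (backend, path[mp.len()..].to_string()))
}
```

With the config above, "/data/foo.txt" would resolve to the <fs> plugin with remainder "/foo.txt", which the plugin then turns into File_system session RPCs.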

File System Components

Genode has several file system server components:

  • ram_fs – in-memory file system server. Multiple components can share files through it by routing their File_system sessions to it.
  • vfs_server (previously vfs) – a file system server backed by the VFS plugin architecture itself. This enables recursive composition: a VFS server can mount another VFS server.
  • fatfs – FAT file system driver over a Block session.
  • ext2_fs – ext2/3/4 via a ported file system implementation running in a rump kernel (NetBSD-derived).
  • store_fs / recall_fs – content-hash-based storage (experimental in some Genode releases).

The file system server is a regular Genode component. It receives a Block session (from a block device driver), provides File_system sessions, and the routing determines who can access what:

block_driver -> provides Block session
       |
       v
fatfs -> consumes Block session, provides File_system session
       |
       v
application -> consumes File_system session via VFS <fs> plugin

Libc Integration

Genode ports a substantial subset of FreeBSD’s libc. The integration point is the VFS: libc’s file operations are implemented by calling the VFS layer, which dispatches to plugins, which invoke Genode sessions as needed.

The libc port modifies FreeBSD libc minimally. Most changes are in the “backend” layer that replaces kernel syscalls with VFS calls:

  • open() -> vfs_open() -> VFS plugin dispatch
  • read() -> vfs_read() -> VFS plugin
  • socket() -> via VFS socket plugin (<lxip> or <lwip>)
  • mmap() -> supported for anonymous mappings and file-backed read-only
  • fork() -> NOT supported (no fork() in Genode)
  • exec() -> NOT supported (no in-place process replacement)
  • pthreads -> supported via Genode’s Thread API
  • select()/poll() -> supported via VFS notification mechanism
  • signal() -> partial support (SIGCHLD, basic signal delivery)

The key architectural decision: libc talks to the VFS library (in-process), the VFS talks to Genode sessions (cross-process RPC). Application code never directly touches Genode capabilities – the VFS mediates everything.

4. POSIX Compatibility

The Noux Approach (Historical)

Genode’s early POSIX approach was Noux, a process runtime that emulated Unix-like process semantics (fork, exec, pipe) on top of Genode. Noux ran as a single Genode component containing multiple “Noux processes” that shared an address space but had separate VFS views.

Noux supported:

  • fork() via copy-on-write within the Noux address space
  • exec() via in-place program replacement
  • pipe() for inter-process communication
  • A shared file system namespace

Noux was eventually deprecated because:

  1. It conflated multiple processes in one address space, undermining Genode’s isolation model.
  2. Fork emulation was fragile and slow.
  3. The libc-based VFS approach (Section 3) achieved better compatibility with less complexity.

Current Approach: libc + VFS

The current POSIX compatibility strategy:

  1. FreeBSD libc port. Provides standard C library functions. Modified to use Genode’s VFS instead of kernel syscalls.

  2. VFS plugins as POSIX backends. Each POSIX I/O pattern maps to a VFS plugin:

    • File I/O -> <fs> plugin -> File_system session
    • Sockets -> <lxip> or <lwip> plugin -> Nic session (in-process TCP/IP stack)
    • Terminal I/O -> <terminal> plugin -> Terminal session
    • Device access -> custom VFS plugins

  3. No fork(). The most significant POSIX omission. Programs that require fork() must be modified to use posix_spawn() or Genode’s native child-spawning mechanism. In practice, many programs use fork() only for daemon patterns or subprocess creation, and can be adapted.

  4. No exec(). Related to no fork(): there’s no in-place process replacement. New processes are created as new Genode components.

  5. Signals. Basic support – enough for SIGCHLD notification and simple signal handling. Complex signal semantics (real-time signals, signal-driven I/O) are not supported.

  6. pthreads. Fully supported via Genode’s native threading.

  7. mmap. Anonymous mappings and read-only file-backed mappings work. MAP_SHARED with write semantics is limited.

What Works in Practice

Genode has successfully ported:

  • Qt5/Qt6 – the full widget toolkit, including QtWebEngine (Chromium). This is the basis of Sculpt’s GUI.
  • VirtualBox – full x86 virtualization (runs Windows, Linux guests).
  • Mesa/Gallium – GPU-accelerated 3D graphics.
  • curl, wget, fetchmail – network utilities.
  • GCC toolchain – compiler, assembler, linker running on Genode.
  • bash – with limitations (no job control via signals, no fork-heavy patterns). Works for simple scripting.
  • vim, nano – terminal editors.
  • OpenSSL/LibreSSL – cryptographic libraries.
  • Various system utilities – ls, cp, rm, etc. via Coreutils port.

Applications that don’t port well:

  • Anything deeply dependent on fork+exec patterns (e.g., traditional Unix shells for complex scripting).
  • Programs relying on procfs, sysfs, or Linux-specific interfaces.
  • Daemons using inotify or Linux-specific async I/O.
  • Programs that assume global file system namespace visibility.

Practical Porting Effort

For most POSIX applications, porting involves:

  1. Build the application using Genode’s ports system (downloads upstream source, applies patches, builds with Genode’s toolchain).
  2. Write a VFS configuration that provides the file-like resources the application expects.
  3. Write a routing configuration that connects the application to required services.
  4. Patch fork() calls if present (usually replacing with posix_spawn() or restructuring to avoid subprocess creation).

The VFS configuration is where the “impedance mismatch” between POSIX expectations and Genode capabilities is resolved. The application thinks it’s accessing /etc/resolv.conf – the VFS plugin infrastructure translates this to capability-mediated access.

5. Component Architecture

Core, Init, and User Components

Core (or base-hw/base-nova/etc.): the lowest-level component, running directly on the microkernel. Core provides the fundamental services: RAM allocation, CPU time (CPU sessions), protection domains (PD sessions), ROM access (boot modules), IRQ delivery, and I/O memory access. Core is the only component with direct hardware access. Everything else goes through core.

Init: the first user-level component, child of core. Init reads its XML configuration and manages the component tree. Init’s responsibilities:

  • Parse <start> entries and spawn components.
  • Route session requests between components according to <route> rules.
  • Manage component lifecycle (restart policies, resource reclamation).
  • Propagate configuration changes (dynamic reconfiguration in Sculpt).

User components: all other components. They can be:

  • Servers that provide sessions (drivers, file systems, network stacks).
  • Clients that consume sessions (applications).
  • Both simultaneously (a network stack consumes NIC sessions and provides socket-level sessions).
  • Sub-inits – components that run their own init-like management for a subtree of components.

Resource Trading in Practice

Resources in Genode flow through the tree. A concrete example:

  1. Core has 256 MB RAM total.
  2. Core grants 250 MB to init, keeps 6 MB for kernel structures.
  3. Init grants 10 MB to the timer driver, 50 MB to the GUI subsystem, 20 MB to the network subsystem, 5 MB to a log server.
  4. When the GUI subsystem starts a framebuffer driver, it donates 8 MB from its 50 MB budget to the driver as a session quota.
  5. The framebuffer driver uses this donated RAM for the frame buffer allocation.

If the GUI subsystem wants more RAM for a new application, it can reclaim RAM by closing sessions (getting donated RAM back) or requesting more from its parent (init).

The accounting is strict: at any point, the sum of all RAM budgets across all components equals the total system RAM. There is no over-commit. This prevents the “OOM killer” problem – each component knows exactly how much RAM it can use.
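The invariant is checkable with simple arithmetic. Using the megabyte figures from the worked example above (core keeps 6, init retains 165 after granting 10 + 50 + 20 + 5 to children), a sketch:

```rust
// Closed accounting: the budgets of all components, including what each one
// keeps for itself, always sum to total system RAM. No over-commit.
fn budgets_closed(total_ram: u64, budgets: &[u64]) -> bool {
    budgets.iter().sum::<u64>() == total_ram
}
```

Any transfer (grant, session quota, reclaim) moves quota between entries without changing the sum, so the predicate holds at every instant.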

Practical Component Patterns

Driver components follow a common pattern:

  • Receive: Platform session (for I/O port/memory access), IRQ session
  • Provide: A device-specific session (NIC, Block, GPU, Audio, etc.)
  • Stateless: all per-client state funded by session quota

Multiplexer components:

  • Receive: one instance of a service
  • Provide: multiple instances to clients
  • Example: NIC router receives one NIC session, provides multiple sessions with packet routing between clients

Proxy components:

  • Forward one session type, possibly filtering or transforming
  • Example: nic_bridge, nitpicker (GUI multiplexer), VFS server

Subsystem inits:

  • A component running its own init for a group of related components
  • Isolates the subtree: crash of the subsystem doesn’t affect siblings
  • Example: Sculpt’s drivers subsystem, network subsystem

6. Sculpt OS

What Sculpt Demonstrates

Sculpt OS is Genode’s demonstration desktop operating system. It turns the component framework into a usable system where:

  • Users install and run applications at runtime.
  • Each application runs in its own isolated component with explicitly configured capabilities.
  • A GUI lets users connect applications to services (routing).
  • The entire system is reconfigurable without reboot.

Architecture

Sculpt’s component tree:

core
  |
  init
    |
    +--> drivers subsystem (sub-init)
    |      +--> platform_drv (PCI, IOMMU)
    |      +--> fb_drv (framebuffer)
    |      +--> usb_drv (USB host controller)
    |      +--> wifi_drv (wireless)
    |      +--> ahci_drv (SATA)
    |      +--> nvme_drv (NVMe)
    |      +--> ...
    |
    +--> runtime subsystem (sub-init, user-managed)
    |      +--> (user-installed applications)
    |
    +--> leitzentrale (management GUI)
    |      +--> system shell
    |      +--> config editor
    |
    +--> nitpicker (GUI multiplexer)
    +--> nic_router (network multiplexer)
    +--> ram_fs (shared file system)
    +--> ...

User Experience of Capabilities

In Sculpt, installing an application means:

  1. Download the package (a Genode component archive).
  2. Edit a “deploy” configuration that specifies which services the application can access (routing rules).
  3. The runtime subsystem spawns the component with the specified routing.

A text editor gets: File_system session (to read/write files), GUI session (for display), Terminal session (optionally). It does NOT get: network access, block device access, or access to other applications’ file systems.

A web browser gets: GUI session, Nic session (for network), GPU session (for rendering), File_system session (for downloads). Each service connection is an explicit choice.

The deploy config is the security policy. A user can see exactly what authority each application has, and can change it by editing the config.

Lessons from Sculpt

  1. Capabilities need a management UI. Raw capability graphs are incomprehensible to users. Sculpt provides a GUI that presents service connections in an understandable way (though it’s still oriented toward power users).

  2. Routing is the killer feature. Being able to route the same session type to different servers for different clients is extremely powerful. One application’s “file system” is local storage; another’s is a network share – same code, different routing.

  3. Sub-inits provide failure isolation. The drivers subsystem can crash and restart without affecting applications. Sculpt’s robustness comes from this hierarchical isolation.

  4. Dynamic reconfiguration is essential. A static boot config (like capOS’s current manifest) is fine for servers and embedded systems, but a general-purpose OS needs to add/remove/reconfigure components at runtime.

  5. Package management is a routing problem. Installing an application in Sculpt is not “copy binary to disk” – it’s “add a component to the runtime subsystem with specific routing rules.” The binary is almost secondary to the routing.

  6. POSIX compat through VFS works. Sculpt runs real desktop applications (Qt-based apps, VirtualBox, web browser) using the VFS-mediated POSIX layer. The capability model doesn’t prevent running complex existing software – it just requires explicit service configuration.

7. Relevance to capOS

VFS Capability Design

Genode’s approach: The VFS is an in-process library with a plugin architecture. It mediates between libc/POSIX and Genode sessions. The VFS configuration is per-component XML.

Lessons for capOS:

  1. Don’t put the VFS in the kernel. Genode’s VFS is entirely userspace, which is correct for a capability OS. capOS should do the same – the VFS is a library linked into processes that need POSIX compatibility, not a kernel subsystem.

  2. Plugin model maps well to Cap’n Proto. Each Genode VFS plugin bridges to a specific session type. In capOS, each VFS “backend” would bridge to a specific capability interface:

    Genode VFS plugin and its capOS VFS backend equivalent:

      • <fs> -> File_system session maps to FsBackend -> Namespace + Store caps
      • <terminal> -> Terminal session maps to TerminalBackend -> Console cap
      • <lxip> -> Nic session maps to NetBackend -> TcpSocket/UdpSocket caps
      • <log> -> LOG session maps to LogBackend -> Console cap
      • <ram> -> in-process RAM maps to RamBackend -> in-process (no cap needed)

  3. VFS config should be declarative. Rather than hardcoding mount points, capOS processes using libcapos-posix should receive a VFS mount table as part of their initial capability set. This could be a Cap’n Proto struct:

    struct VfsMountTable {
        mounts @0 :List(VfsMount);
    }
    
    struct VfsMount {
        path @0 :Text;           # mount point, e.g. "/data"
        union {
            namespace @1 :Void;  # use the Namespace cap named in capName
            console @2 :Void;    # use a Console cap
            ram @3 :Void;        # in-memory filesystem
            socket @4 :Void;     # socket interface
        }
        capName @5 :Text;        # name of the cap in CapSet backing this mount
    }
    

    This separates the VFS topology (a deployment decision) from the application code (which just calls open()).

  4. Genode’s <fs> plugin is the key analog. capOS’s Namespace capability is equivalent to Genode’s File_system session. The libcapos-posix path resolution layer (open() -> namespace.resolve()) is exactly Genode’s <fs> VFS plugin. The existing capOS design in docs/proposals/userspace-binaries-proposal.md is already on the right track.

  5. Consider streaming for large files. Genode uses shared-memory dataspaces for bulk data transfer in file system sessions. capOS’s current Store interface returns Data (a capnp blob), which means the entire object is copied per get() call. For large files, a streaming interface (with a shared-memory buffer and cursor) would be more efficient. This is capOS’s Open Question #4.
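The cursor-based alternative in item 5 can be sketched in a few lines. This is an illustrative Rust example under the assumption of an already-mapped shared buffer; `Stream` is an invented name, not a capOS interface.

```rust
struct Stream<'a> {
    data: &'a [u8], // e.g. a mapped shared-memory buffer
    pos: usize,     // read cursor advanced by each chunk
}

impl<'a> Stream<'a> {
    // Copy the next chunk into `buf`; returns bytes read (0 at end).
    // A large object is consumed incrementally instead of being copied
    // whole into a capnp Data blob on every get() call.
    fn read(&mut self, buf: &mut [u8]) -> usize {
        let n = buf.len().min(self.data.len() - self.pos);
        buf[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        n
    }
}
```

The per-call cost becomes proportional to the chunk size, not the object size, which is the efficiency argument behind Genode's dataspace-based bulk transfer.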

Session Routing Patterns

Genode’s approach: XML-configured routing in init, label-based dispatch, parent mediates all session requests.

Lessons for capOS:

  1. The manifest IS the routing config. capOS’s SystemManifest with structured CapRef source entries such as { service = { service = "net-stack", export = "nic" } } is functionally equivalent to Genode’s init routing config. The capOS design already handles the static case well.

  2. Label-based routing is valuable. Genode’s ability to route different requests from the same client to different servers (based on labels) maps directly to capOS’s capability naming. capOS already does this implicitly – a process can receive separate Namespace caps for “config” and “data”. The key insight is that this should be a deployment-time decision, not an application-time decision.

  3. Consider dynamic routing. capOS’s current manifest is static (baked into the ISO). For a more flexible system, init should support runtime reconfiguration:

    • Reload the manifest from a Store cap.
    • Add/remove services without reboot.
    • Re-route sessions when services restart.

    Genode achieves this via init’s config ROM, which can be updated at runtime. capOS could achieve it by having init watch a Namespace cap for manifest updates.

  4. Parent-mediated routing has costs. In Genode, every session request traverses the component tree. This adds latency and complexity. capOS’s direct capability passing (a process holds a cap directly, not through its parent) avoids this overhead. The tradeoff: capOS has less runtime control over routing (once a cap is passed, the parent can’t intercept invocations on it).

    This is a deliberate design choice. capOS favors direct caps (lower overhead, simpler) over proxied caps (more control). Genode’s session routing is powerful but adds a layer of indirection that may not be worth it for capOS’s use case.

  5. Service export needs a protocol. Genode’s session model has server components explicitly announce what services they provide. capOS’s ProcessHandle.exported() mechanism serves the same purpose. The manifest’s exports field pre-declares what a service will export, which helps init plan the dependency graph before spawning anything.

POSIX Compatibility Without Compromising Capabilities

Genode’s approach: libc port + VFS + per-component VFS config. No global namespace. No fork(). Applications see a curated file tree, not the real system.

Lessons for capOS:

  1. The VFS is a capability adapter, not a capability. The VFS library runs inside the process that needs POSIX compatibility. It doesn’t weaken the capability model because it can only access capabilities the process was granted. This matches capOS’s libcapos-posix design exactly.

  2. musl over FreeBSD libc. Genode uses FreeBSD libc because of its clean backend interface. capOS plans to use musl, which has an even cleaner __syscall() interface. This is a good choice. Genode’s experience shows that the libc implementation matters less than the VFS/backend layer quality.

  3. No fork() is fine. Genode has operated without fork() for over 15 years and runs complex software (Qt, VirtualBox, Chromium). The applications that truly need fork() are rare and usually need only posix_spawn() semantics. capOS should not attempt to implement fork() – focus on posix_spawn() backed by ProcessSpawner cap.

  4. Sockets via in-process TCP/IP stack. Genode’s <lxip> VFS plugin runs an lwIP TCP/IP stack inside the application process, using the NIC session for raw packet I/O. This avoids the overhead of routing every socket call through a separate network stack component.

    capOS could offer a similar choice:

    • Out-of-process: socket calls go to the network stack component via TcpSocket/UdpSocket caps (safer, more isolated, more overhead).
    • In-process: an lwIP/smoltcp library runs inside the application, consuming a raw Nic cap (less isolation, less overhead, more authority).

    For most applications, out-of-process sockets via caps are fine. For high-performance networking (database, web server), an in-process stack over a raw NIC cap may be needed.

  5. select/poll/epoll need async caps. Genode implements select/poll via VFS notifications (signals on file readiness). capOS needs the async capability rings (io_uring-inspired) from Stage 4 before select/poll can work. This is a natural fit: each polled fd maps to a pending capability invocation in the completion ring.

Component Patterns for Cap’n Proto Interfaces

Genode’s patterns and their capOS/Cap’n Proto equivalents:

  1. Session creation = factory method on a capability.

    Genode: client requests a Nic::Session from its parent, which routes to a NIC driver server.

    capOS: client holds a NetworkManager cap and calls create_tcp_socket() to get a TcpSocket cap. The factory pattern is the same, but capOS does it via direct cap invocation instead of parent-mediated session requests.

    Cap’n Proto naturally supports this via interfaces that return interfaces:

    interface NetworkManager {
        createTcpSocket @0 () -> (socket :TcpSocket);
        createUdpSocket @1 () -> (socket :UdpSocket);
        createTcpListener @2 (addr :IpAddress, port :UInt16)
            -> (listener :TcpListener);
    }
    
  2. Resource quotas in session creation.

    Genode: session requests include a RAM quota donated from client to server.

    capOS should consider this pattern. Currently, capOS processes receive a FrameAllocator cap for memory. If a server needs to allocate memory per-client, the client should fund it. Cap’n Proto schema could encode this:

    interface FileSystem {
        open @0 (path :Text, bufferPages :UInt32)
            -> (file :File);
        # bufferPages: number of pages the client donates for
        # server-side buffering. Server allocates from a shared
        # FrameAllocator or the client passes frames explicitly.
    }
    

    This prevents the denial-of-service problem where a client opens many sessions, exhausting the server’s memory.
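    The server-side accounting a donation field enables can be sketched as a per-client ledger. Class and method names below are hypothetical; in capOS the pages would come from a shared FrameAllocator or frames the client passes explicitly:

    ```python
    # Sketch of per-client quota accounting: a server refuses allocations
    # beyond what the client has donated, so a greedy client exhausts only
    # its own quota, never the server's memory. Names are illustrative.

    class QuotaExceeded(Exception):
        pass

    class ClientLedger:
        """Per-client ledger: allocations may not exceed donated pages."""
        def __init__(self):
            self._donated = {}

        def donate(self, client: str, pages: int) -> None:
            self._donated[client] = self._donated.get(client, 0) + pages

        def allocate(self, client: str, pages: int) -> None:
            if self._donated.get(client, 0) < pages:
                raise QuotaExceeded(client)  # the client, not the server, runs dry
            self._donated[client] -= pages

    ledger = ClientLedger()
    ledger.donate("shell", 4)
    ledger.allocate("shell", 3)   # fine: 1 page of quota remains
    ```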

  3. Multiplexer components.

    Genode: nic_router takes one NIC session, provides many. nitpicker takes one framebuffer, provides many GUI sessions.

    capOS equivalent: a process that consumes a Nic cap and provides multiple TcpSocket/UdpSocket caps. This is already what the network stack component does in capOS’s service architecture proposal. Cap’n Proto’s interface model makes this natural – the multiplexer implements one interface (NetworkManager) using another (Nic).

  4. Attenuation = capability narrowing.

    Genode: servers can return restricted capabilities (e.g., a read-only file handle from a read-write file system session).

    capOS: already planned via Fetch -> HttpEndpoint narrowing, Store -> read-only Store, Namespace -> scoped Namespace. The pattern is sound. Cap’n Proto interfaces make the attenuation explicit in the schema.
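    A minimal attenuation sketch: the narrowed object holds the full capability privately and re-exports only the permitted methods. The class names are illustrative stand-ins for the planned Store -> read-only Store narrowing:

    ```python
    # Attenuation by wrapping: the restriction is structural, not a
    # permission check -- ReadOnlyStore simply has no put() to call.
    # Class names are hypothetical sketches, not capOS types.

    class Store:
        def __init__(self):
            self._data = {}
        def get(self, key):
            return self._data[key]
        def put(self, key, value):
            self._data[key] = value

    class ReadOnlyStore:
        """Holds the full store privately; exports only get()."""
        def __init__(self, store: Store):
            self._store = store
        def get(self, key):
            return self._store.get(key)

    full = Store()
    full.put("motd", "hello")
    narrowed = ReadOnlyStore(full)   # hand this to less-trusted code
    ```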

  5. Dataspace pattern for bulk data.

    Genode uses shared-memory dataspaces for efficient bulk transfer (file contents, network packets, framebuffers). The RPC path carries only small control messages and capability references.

    capOS currently moves Cap’n Proto control messages through capability rings and bounded kernel scratch, with no zero-copy bulk-data object yet. For bulk data, capOS should add a SharedBuffer capability:

    interface SharedBuffer {
        # Map a shared memory region into caller's address space
        map @0 () -> (addr :UInt64, size :UInt64);
        # Notify that data has been written to the buffer
        signal @1 (offset :UInt64, length :UInt64) -> ();
    }
    

    File system and network operations would use SharedBuffer for data transfer and capability invocations for control, matching Genode’s split between RPC and dataspaces.

  6. Sub-init pattern for failure domains.

    Genode: a sub-init manages a subtree of components. If the subtree crashes, only the sub-init restarts it.

    capOS: a supervisor process (not necessarily init) holds a ProcessSpawner cap and manages a group of services. This is already described in the service architecture proposal’s supervision tree. The key addition from Genode: make sub-supervisors a first-class pattern with their own manifest fragments, not just ad-hoc supervision loops.

Summary of Key Takeaways for capOS

| Area | Genode approach | capOS adaptation |
|------|-----------------|------------------|
| Capability model | Kernel-enforced caps to RPC objects | Kernel-enforced caps to Cap’n Proto objects (aligned) |
| Service discovery | Parent-mediated session routing | Manifest-driven cap passing at spawn (simpler, less dynamic) |
| VFS | In-process library with plugin architecture | libcapos-posix with mount table from CapSet (same pattern) |
| POSIX | FreeBSD libc + VFS backends | musl + libcapos-posix backends (same architecture) |
| fork() | Not supported | Not supported (use posix_spawn -> ProcessSpawner) |
| Bulk data | Shared-memory dataspaces | SharedBuffer design exists; implementation pending |
| Resource accounting | Session quotas (RAM donated per session) | Authority-accounting design exists; unified ledgers pending |
| Routing labels | String labels on session requests, routed by init | Cap naming in manifest serves same purpose |
| Dynamic reconfig | Init config ROM updated at runtime | Manifest reload via Store cap (future) |
| Failure isolation | Sub-inits as failure domains | Supervisor processes (same concept, different mechanism) |
| Async notification | Signal capabilities | Async cap rings / io_uring model (more general) |

Top Recommendations

  1. Add session quotas / resource trading. This is the most important Genode pattern capOS hasn’t adopted yet. Without it, a malicious client can exhaust a server’s memory by opening many capability sessions. Design resource donation into the Cap’n Proto schema for session-creating interfaces.

  2. Design a SharedBuffer capability. Copying capnp messages through the kernel works for control messages but not for bulk data. A shared-memory mechanism (like Genode’s dataspaces) is essential for file I/O, networking, and GPU rendering.

  3. Keep VFS as a library, not a service. Genode’s in-process VFS is the right pattern. capOS’s libcapos-posix should work the same way – a library that translates POSIX calls to capability invocations within the process. No VFS server component needed (though a file system server implementing the Namespace/Store interface is separate).

  4. Add a declarative VFS mount table to process init. Each POSIX-compat process should receive a mount table (as a capnp struct) that maps paths to capabilities. This separates deployment policy from application code, matching Genode’s per-component VFS config.

  5. Plan for dynamic reconfiguration. The static manifest is fine for now, but Sculpt shows that a usable capability OS needs runtime service management. Design init so it can accept manifest updates through a cap, not just from the boot image.

  6. Don’t over-engineer routing. Genode’s parent-mediated session routing is powerful but complex. capOS’s direct capability passing is simpler and sufficient for most use cases. Add proxy/mediator patterns only when specific needs arise (e.g., capability revocation, load balancing).

References

  • Genode Foundations book (genode.org/documentation/genode-foundations/) – the authoritative source for architecture, session model, routing, VFS, and component composition.
  • Norman Feske, “Genode Operating System Framework” (2008-2025) – release notes and design documentation at genode.org.
  • Sculpt OS documentation at genode.org/download/sculpt – practical deployment of the capability model.
  • Genode source repository: github.com/genodelabs/genode – reference implementations of VFS plugins, file system servers, libc port.

Research: Plan 9 from Bell Labs and Inferno OS

Lessons for a capability-based OS using Cap’n Proto wire format.

Table of Contents

  1. Per-Process Namespaces
  2. The 9P Protocol
  3. File-Based vs Capability-Based Interfaces
  4. 9P as IPC
  5. Inferno OS
  6. Relevance to capOS

1. Per-Process Namespaces

Overview

Plan 9’s most significant architectural contribution is per-process namespaces. Every process has its own view of the file hierarchy – not a shared global filesystem tree as in Unix. A process’s namespace is a mapping from path names to file servers (channels to 9P-speaking services). Two processes running on the same machine can see completely different contents at /dev, /net, /proc, or any other path.

Namespaces are inherited by child processes – shared by default, or copied when the child is created with rfork(RFNAMEG) – and a copied namespace can be modified independently afterward. This provides a form of resource isolation that is orthogonal to traditional access control: a process simply cannot name resources that aren’t in its namespace.

The Three Namespace Operations

Plan 9 provides three system calls for namespace manipulation:

bind(name, old, flags) – Takes an existing file or directory name already visible in the namespace and makes it also accessible at path old. This is purely a namespace-level alias – no new file server is involved. The name argument must resolve to something already in the namespace.

Example: bind("#c", "/dev", MREPL) makes the console device (#c is a kernel device designator) appear at /dev. The # prefix addresses kernel devices directly before they have been bound into the namespace.

mount(fd, old, flags, aname) – Like bind, but the source is a file descriptor connected to a 9P server rather than an existing namespace path. The kernel speaks 9P over fd to serve requests for paths under old. The aname parameter selects which file tree the server should export (a single server can serve multiple trees).

Example: mount(fd, "/net", MREPL, "") where fd is a connection to the network stack’s file server, makes the TCP/IP interface appear at /net.

unmount(name, old) – Removes a previous bind or mount from the namespace.

Flags and Union Directories

The flags argument to bind and mount controls how the new binding interacts with existing content at the mount point:

  • MREPL (replace) – The new binding completely replaces whatever was at the mount point. Only the new server’s files are visible.
  • MBEFORE (before) – The new binding is placed before the existing content. When looking up a name, the new binding is searched first. If not found there, the old content is searched.
  • MAFTER (after) – The new binding is placed after the existing content. The old content is searched first.
  • MCREATE – Combined with MBEFORE or MAFTER, controls which component of the union receives create operations.

Union directories are the result of stacking multiple bindings at one mount point. When a directory has multiple bindings, a directory listing returns the union of all names from all components. A lookup walks the bindings in order and returns the first match.

This is how Plan 9 constructs /bin: multiple directories (for different architectures, local overrides, etc.) are union-mounted at /bin. The shell finds commands by simple path lookup – no $PATH variable needed.

bind /rc/bin /bin          # shell built-ins (MAFTER)
bind /386/bin /bin         # architecture binaries (MAFTER)
bind $home/bin/386 /bin    # personal overrides (MBEFORE)

A lookup for /bin/ls searches the personal directory first, then the architecture directory, then the shell builtins – all via a single path.
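The lookup order can be modeled as an ordered list of directory components where the first match wins. A toy model (plain Python dictionaries, not Plan 9 code):

```python
# Toy model of a union directory: an ordered list of component directories.
# MBEFORE bindings sit at the front, MAFTER at the back. Lookup returns the
# first match; listing returns the union of all names, in binding order.

def union_lookup(components, name):
    for directory in components:
        if name in directory:
            return directory[name]
    raise FileNotFoundError(name)

def union_list(components):
    names = []
    for directory in components:
        for n in directory:
            if n not in names:
                names.append(n)
    return names

bin_union = [
    {"ls": "$home/bin/386/ls"},                   # personal overrides (MBEFORE)
    {"ls": "/386/bin/ls", "rc": "/386/bin/rc"},   # architecture binaries
    {"cd": "/rc/bin/cd"},                         # shell built-ins (MAFTER)
]
```

Here `union_lookup(bin_union, "ls")` finds the personal override, shadowing the architecture binary, while a listing still shows every name once.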

Namespace Inheritance and Isolation

The rfork system call controls what the child inherits:

  • RFNAMEG – Child gets a copy of the parent’s namespace. Subsequent modifications by either side are independent.
  • RFCNAMEG – Child starts with a clean (empty) namespace.
  • Without either flag, parent and child share the namespace (modifications by one affect the other).

This gives fine-grained control: a shell can construct a restricted namespace for a sandboxed command, or a server can create an isolated namespace for each client connection.
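The three inheritance modes can be sketched with ordinary dictionaries standing in for namespaces (flag values here are illustrative, not the real Plan 9 constants):

```python
# Toy model of rfork namespace flags: share by default, copy with RFNAMEG,
# start clean with RFCNAMEG. Flag values are illustrative.
import copy

RFNAMEG, RFCNAMEG = 0x1, 0x2

def child_namespace(parent_ns: dict, flags: int) -> dict:
    if flags & RFCNAMEG:
        return {}                        # clean slate
    if flags & RFNAMEG:
        return copy.deepcopy(parent_ns)  # private copy, diverges freely
    return parent_ns                     # shared: same object as the parent

parent = {"/dev": "#c", "/proc": "#p"}
shared = child_namespace(parent, 0)
copied = child_namespace(parent, RFNAMEG)
shared["/net"] = "#I"                    # visible to the parent too
copied["/srv"] = "#s"                    # invisible to the parent
```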

Namespace Construction at Boot

Plan 9’s boot process constructs the initial namespace step by step:

  1. The kernel provides “kernel devices” accessed via # designators: #c (console), #e (environment), #p (proc), #I (IP stack), etc.
  2. The boot script binds these into conventional paths: bind "#c" /dev, bind "#p" /proc, etc.
  3. Network connections mount remote file servers: the CPU server’s file system, the user’s home directory, etc.
  4. Per-user profile scripts further customize the namespace.

The result is that the “standard” file hierarchy is a convention, not a kernel requirement. Any process can rearrange it.

Namespace as Security Boundary

Plan 9 namespaces provide a form of capability-like access control:

  • A process cannot access resources outside its namespace
  • A parent can restrict a child’s namespace before exec
  • There is no way to “escape” a namespace – there is no .. that crosses a mount boundary unexpectedly, and # designators can be restricted

However, this is not a formal capability system:

  • The namespace contains string paths, which are ambient authority within the namespace
  • Any process can open("/dev/cons") if /dev/cons is in its namespace – there is no per-open-call authorization
  • The isolation depends on correct namespace construction, not structural properties

2. The 9P Protocol

Overview

9P (and its updated version 9P2000) is the protocol spoken between clients and file servers. Every resource in Plan 9 is accessed through 9P – local kernel devices, remote file systems, user-space services, and network resources all speak the same protocol.

9P is a request-response protocol with fixed message types. It is connection-oriented: a client establishes a session, authenticates, walks paths to obtain file handles (fids), and then reads/writes through those handles.

Message Types (9P2000)

9P2000 defines the following message pairs (T = request from client, R = response from server):

Session management:

  • Tversion / Rversion – Negotiate protocol version and maximum message size. Must be the first message. The client proposes a version string (e.g., "9P2000") and a msize (maximum message size in bytes). The server responds with the agreed version and msize.
  • Tauth / Rauth – Establish an authentication fid. The client provides a user name and an aname (the file tree to access). The server returns an afid that the client reads/writes to complete an authentication exchange.
  • Tattach / Rattach – Attach to a file tree. The client provides a fresh fid, the afid from authentication, a user name, and the aname. The server returns a qid (unique file identifier) for the root of the tree, and the fid becomes the client’s handle for the root directory.

Navigation:

  • Twalk / Rwalk – Walk a path from an existing fid. The client provides a starting fid, a new fid to bind to the result, and a sequence of name components (up to 16 per walk). The server returns the qids of each component successfully walked; on full success, the new fid refers to the final element. Walk is how you traverse directories – there is no open-by-path operation.

File operations:

  • Topen / Ropen – Open an existing file (by fid, obtained via walk). The client specifies a mode (read, write, read-write, exec, truncate). The server returns the qid and an iounit (maximum I/O size for atomic operations).
  • Tcreate / Rcreate – Create a new file in a directory fid. The client specifies name, permissions, and mode.
  • Tread / Rread – Read count bytes at offset from an open fid. The server returns the data.
  • Twrite / Rwrite – Write count bytes at offset to an open fid. The server returns the number of bytes actually written.
  • Tclunk / Rclunk – Release a fid. The server frees associated state. Equivalent to close().
  • Tremove / Rremove – Remove the file referenced by a fid and clunk the fid.
  • Tstat / Rstat – Get file metadata (name, size, permissions, access times, qid, etc.).
  • Twstat / Rwstat – Modify file metadata.

Error handling:

  • Rerror – Any T-message can receive an Rerror instead of its normal response. Contains a text error string (9P2000) or an error number (9P2000.u).

Message Format

Every 9P message starts with a 4-byte length (little-endian, including the length field itself), a 1-byte type, and a 2-byte tag. The tag is chosen by the client and echoed in the response, enabling multiplexed operations over a single connection.

[4 bytes: size][1 byte: type][2 bytes: tag][... type-specific fields ...]

Field types are simple: 1/2/4/8-byte integers (little-endian), counted strings (2-byte length prefix + UTF-8), and counted data blobs (4-byte length prefix + raw bytes).
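The framing is simple enough to sketch directly. The helper below builds a Tversion message (type code 100 in 9P2000, tag NOTAG) – a toy encoder, not a full 9P client:

```python
# Minimal 9P framing sketch: size[4] type[1] tag[2], all little-endian,
# with the size field counting itself. Strings carry a 2-byte length prefix.
import struct

NOTAG = 0xFFFF
TVERSION = 100  # 9P2000 message type code for Tversion

def pack_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

def pack_message(mtype: int, tag: int, payload: bytes) -> bytes:
    size = 4 + 1 + 2 + len(payload)   # size includes the size field itself
    return struct.pack("<IBH", size, mtype, tag) + payload

# Tversion payload: msize[4] version[s]
msg = pack_message(TVERSION, NOTAG,
                   struct.pack("<I", 8192) + pack_string("9P2000"))
size, mtype, tag = struct.unpack_from("<IBH", msg)
```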

Qids and File Identity

A qid is a server-assigned 13-byte file identifier:

[1 byte: type][4 bytes: version][8 bytes: path]

  • type – Bits indicating directory, append-only, exclusive-use, authentication file, etc.
  • version – Incremented when the file is modified. The client can detect changes by comparing versions.
  • path – A unique identifier for the file within the server. Typically a hash or inode number.

Qids allow clients to detect file identity (same path through different walks = same qid) and staleness (version changed = re-read needed).
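The fixed 13-byte layout decodes with a single fixed-format read. A minimal sketch (the QTDIR bit value follows Plan 9 convention):

```python
# Qid layout: type[1] version[4] path[8], little-endian.
# "<" in the format string disables padding, so the struct is exactly 13 bytes.
import struct

QTDIR = 0x80  # directory bit in qid.type

def unpack_qid(raw: bytes):
    qtype, version, path = struct.unpack("<BIQ", raw)
    return qtype, version, path

raw = struct.pack("<BIQ", QTDIR, 7, 0xCAFE)
qtype, version, path = unpack_qid(raw)
```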

Authentication

9P2000 authentication is pluggable. The protocol provides the Tauth/Rauth mechanism to establish an authentication fid, but the actual authentication exchange happens by reading and writing this fid – the protocol itself is agnostic to the authentication method.

Plan 9’s standard mechanism is p9sk1, a shared-secret protocol using an authentication server. The flow:

  1. Client sends Tauth to get an afid
  2. Client and server exchange challenge-response messages by reading/writing the afid, mediated by the authentication server
  3. Once authentication succeeds, the client uses the afid in Tattach

The key insight: authentication is just another read/write conversation over a special fid. New authentication methods can be implemented without changing the protocol.

Concurrency

9P supports concurrent operations through tags. A client can send multiple T-messages without waiting for responses. Each has a unique tag, and the server can respond out of order. The client matches responses to requests by tag.

A special tag value NOTAG (0xFFFF) is used for Tversion, which must complete before any other messages.

Exclusive-use files (marked with the DMEXCL mode bit) can be open by only one client at a time. This provides a simple locking primitive in Plan 9 file servers.

Fids are per-connection, not global. Different clients on different connections have independent fid spaces. A server maintains per-connection state.
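The client side of tag multiplexing is a small bookkeeping table: requests go out with unique tags, responses may arrive in any order and are paired back up by tag. A toy sketch:

```python
# Sketch of client-side tag matching for a multiplexed 9P connection.
# Requests are recorded under their tag; a response pops the matching entry.

class TagTable:
    def __init__(self):
        self._next = 0
        self._pending = {}

    def send(self, request):
        tag = self._next
        self._next += 1
        self._pending[tag] = request
        return tag

    def receive(self, tag, response):
        request = self._pending.pop(tag)   # match response to its request
        return request, response

t = TagTable()
a = t.send("Twalk fid=0 name=net")
b = t.send("Tread fid=3")
# Responses arrive out of order; the tag pairs them back up:
req_b, _ = t.receive(b, "Rread ...")
req_a, _ = t.receive(a, "Rwalk ...")
```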

Maximum Message Size

The msize negotiated in Tversion bounds all subsequent messages. A typical default is 8192 or 65536 bytes. The iounit returned by Topen tells the client the maximum useful count for read/write on that fid, which may be less than msize minus the message header overhead.

This bounding is important for resource management – a server can limit memory consumption per connection.


3. File-Based vs Capability-Based Interfaces

Plan 9: Everything is a File

Plan 9 takes Unix’s “everything is a file” philosophy further than Unix itself ever did:

  • Network stack – TCP connections are managed by reading/writing files in /net/tcp: clone (allocate a connection), ctl (write commands like connect 10.0.0.1!80), data (read/write payload), status (read connection state).
  • Window system – The rio window manager exports a file system: each window has a cons, mouse, winname, etc. A program draws by writing to /dev/draw/*.
  • Process control – /proc/<pid>/ contains ctl (write kill to signal), status (read state), mem (read/write process memory), text (read executable), note (signals).
  • Hardware devices – Kernel devices export file interfaces directly. The audio device is files, the graphics framebuffer is files, etc.

The interface contract is: open a file, read/write bytes, stat for metadata. The semantics of those bytes are defined by the file server – there is no ioctl().

Strengths of the file model:

  • Universal tools work everywhere: cat /net/tcp/0/status, echo kill > /proc/1234/ctl
  • Shell scripts can compose services trivially
  • Network transparency is automatic: mount a remote file server, same tools work
  • The interface is self-documenting: ls shows available operations
  • Simple tools like cat, echo, grep become universal adapters

Weaknesses of the file model:

  • Type erasure. Everything is bytes. The protocol cannot express structured data without conventions layered on top (text formats, fixed layouts, etc.). A read() returns raw bytes – the client must know the expected format.
  • Limited operation set. The only verbs are open, read, write, stat, create, remove. Complex operations must be encoded as write-command / read-response sequences (e.g., echo "connect 10.0.0.1!80" > /net/tcp/0/ctl). Error handling is ad-hoc.
  • No schema or type checking. Nothing prevents writing garbage to a ctl file. Errors are detected at runtime, often with cryptic messages.
  • No structured errors. 9P errors are text strings. No error codes, no machine-parseable error metadata.
  • Byte-stream orientation. 9P read/write are offset-based byte operations. This fits files naturally but is awkward for RPC-style request/response interactions. File servers work around this with conventions (write a command, read the response from offset 0).
  • No pipelining of operations. You cannot say “open this file, then read it, and if that succeeds, write to this other file” atomically. Each step is a separate round-trip (though 9P’s tag multiplexing helps amortize latency).

Capability Systems: Everything is a Typed Interface

In a capability system like capOS, resources are accessed through typed interface references:

interface Console {
    write @0 (data :Data) -> ();
    writeLine @1 (text :Text) -> ();
}

interface NetworkManager {
    createTcpSocket @0 (addr :Text, port :UInt16) -> (socket :TcpSocket);
}

interface TcpSocket {
    read @0 (count :UInt32) -> (data :Data);
    write @1 (data :Data) -> (written :UInt32);
    close @2 () -> ();
}

Strengths of the capability model:

  • Type safety. The interface contract is machine-checked. You cannot call write on a NetworkManager – the type system prevents it.
  • Rich operations. Interfaces can define arbitrary methods with typed parameters and return values. No need to encode everything as byte read/writes.
  • Structured errors. Return types can include error variants. Capabilities can define error enums in the schema.
  • Schema evolution. Cap’n Proto supports backwards-compatible schema changes (adding fields, adding methods). Both old and new clients/servers interoperate.
  • No ambient authority. A process has precisely the capabilities it was granted. No path-based discovery, no /proc to enumerate.
  • Attenuation. A broad capability can be narrowed to a restricted version (e.g., Fetch -> HttpEndpoint). The restriction is structural, not a permission check.

Weaknesses of the capability model:

  • No universal tools. cat and echo do not work on capabilities. Each interface needs its own client tool or library. Debugging requires interface-aware tools.
  • Harder composition. Shell pipes compose byte streams trivially. Capability composition requires typed adapters or a capability-aware shell.
  • Discovery problem. ls shows files. What shows capabilities? A management-only CapabilityManager.list() call, but that requires holding the manager cap and a tool that can render the result.
  • Steeper learning curve. A new developer can ls /net to understand the network stack. Understanding a capability interface requires reading the schema definition.
  • Verbosity. Opening a TCP connection in Plan 9 is four file operations (clone, ctl, data, status). In a capability system, it is one typed method call. But defining the interface in the schema is more upfront work than just exporting files.

Synthesis

The file model and the capability model are not opposed – they are different points on a trade-off curve between universality and type safety. Plan 9 chose maximal universality (everything reduces to bytes + paths). Capability systems choose maximal type safety (everything has a schema).

The interesting question is whether a capability system can recover the ergonomic benefits of the file model while maintaining type safety. This is addressed in section 6.


4. 9P as IPC

File Servers as Services

In Plan 9, a “service” is simply a process that speaks 9P. When a client mounts a file server’s connection at some path, all file operations on that path become 9P messages to the server. This is the universal IPC mechanism – there are no Unix-domain sockets, no D-Bus, no shared memory primitives for service communication. Everything goes through 9P.

Examples of services-as-file-servers:

  • exportfs – Re-exports a subtree of the current namespace over a network connection, letting remote clients mount it.
  • ramfs – A RAM-backed file server. Mount it and you have a tmpfs.
  • ftpfs – Mounts a remote FTP server as a local directory. Programs read/write files; the file server translates to FTP protocol.
  • mailfs – Presents a mail spool as a directory of messages. Each message is a directory with header, body, rawbody, etc.
  • plumber – The inter-application message router exports a file interface: write a message to /mnt/plumb/send, and it arrives in the target application’s plumb port.
  • acme – The Acme editor exports its entire UI as a file system: windows, buffers, tags, event streams. External programs can control Acme by reading/writing these files.

The srv Device and Connection Passing

The kernel #s (srv) device provides a namespace for posting file descriptors. A server process creates a pipe, starts serving 9P on one end, and posts the other end as /srv/myservice. Other processes open /srv/myservice to get a connection to the server, then mount it into their namespace.

# Server side:
pipe = pipe()
post(pipe[0], "/srv/myfs")
serve_9p(pipe[1])

# Client side:
fd = open("/srv/myfs", ORDWR)
mount(fd, "/mnt/myfs", MREPL, "")
# Now /mnt/myfs/* are served by the server process

This decouples service registration from namespace mounting. Multiple clients can mount the same service at different paths in their own namespaces.

Performance and Overhead

9P’s overhead compared to direct function calls or shared memory:

  1. Serialization – Every operation is a 9P message: header parsing, field encoding/decoding. Messages are simple binary (not XML/JSON), so this is fast but nonzero.
  2. Copying – Data passes through the kernel (pipe or network): user buffer -> kernel pipe buffer -> server process buffer (and back for responses). This is at least two copies per direction.
  3. Context switches – Each request/response is a write (client) + read (server) + write (server) + read (client): four system calls and at least two context switches per round-trip.
  4. No zero-copy – 9P does not support shared memory or page remapping. Large data transfers pay the full copy cost.

For metadata-heavy operations (stat, walk, open/close), the overhead is dominated by context switches, not data copying. Plan 9 is designed for networks where latency matters – the protocol’s simplicity and multiplexability help here.

For bulk data, the overhead is significant. Plan 9 compensates somewhat with the iounit mechanism (encouraging large reads/writes to amortize per-call costs) and the fact that most I/O is streaming (sequential reads/writes, not random access).

In practice, Plan 9 systems are not optimized for raw throughput on local IPC. The design prioritizes simplicity and network transparency over local performance. The assumption is that the network is the bottleneck, so local protocol overhead is acceptable.

Network Transparency

9P’s power lies in its network transparency. The same protocol runs over:

  • Pipes – Local IPC between processes on the same machine.
  • TCP connections – Remote file access across the network.
  • Serial lines – Early Plan 9 terminals connected to CPU servers.
  • TLS/SSL – Encrypted connections (added later).

A CPU server is accessed by mounting its file system over the network. The Plan 9 cpu command:

  1. Connects to a remote CPU server over TCP
  2. Authenticates
  3. Exports the local namespace (via exportfs) to the remote side
  4. The remote side mounts the local namespace, overlaying it with its own kernel devices
  5. A shell runs on the remote CPU, but with access to local files

The result: you work on the remote machine but your files, windows, and devices are local. This is more powerful than SSH because the integration is at the namespace level, not the terminal level.

Factoid: In the Plan 9 computing model, terminals were intentionally underpowered. The expensive hardware was the CPU server. Users mounted the CPU server’s filesystem and ran programs there, with the terminal providing I/O devices (keyboard, mouse, display) exported as files back to the CPU server.


5. Inferno OS

What Inferno Adds Beyond Plan 9

Inferno (also from Bell Labs, originally by the same team) took the Plan 9 architecture and adapted it for portable, networked computing. It can run as a native OS on bare hardware, as a hosted application on other OSes (Linux, Windows, macOS), or as a virtual machine.

Key additions and differences:

  1. Dis virtual machine – All user-space code runs on a register-based VM, not native machine code.
  2. Limbo language – A type-safe, garbage-collected, concurrent language, influenced by C, CSP, Newsqueak, and Alef. All applications are written in Limbo.
  3. Styx protocol – Inferno’s name for its 9P variant (functionally identical to 9P2000 with minor encoding differences in early versions, later fully aligned with 9P2000).
  4. Portable execution – The same Limbo bytecode runs on any platform where the Dis VM is available. No recompilation needed.
  5. Built-in cryptography – TLS, certificate-based authentication, and signed modules are integrated into the system, not bolted on.

The Dis Virtual Machine

Dis is a register-based virtual machine (unlike the JVM, which is stack-based). Key characteristics:

  • Memory model – Dis uses a module-based memory model. Each loaded module has its own data segment (frame). Instructions reference memory operands by offset within the current module’s frame, the current function’s frame, or a literal (mp, fp, or immediate addressing).
  • Instruction set – CISC-inspired, with three-address instructions: add src1, src2, dst. Opcodes cover arithmetic, comparison, branching, string operations, channel operations, and system calls. Around 80-90 opcodes.
  • Type descriptors – Each allocated block has a type descriptor that identifies which words are pointers. This enables exact garbage collection (no conservative scanning).
  • Garbage collection – Reference counting with cycle detection. Deterministic deallocation for acyclic structures (important for resource management), with periodic cycle collection.
  • Module loading – Dis modules are loaded on demand. A module declares its type signature (exported functions and their types), and the loader verifies type compatibility at link time.
  • JIT compilation – On supported architectures (x86, ARM, MIPS, SPARC, PowerPC), Dis bytecode is compiled to native code at load time. This removes the interpretation overhead for hot code.
  • Concurrency – Dis natively supports concurrent threads of execution within a module. Threads communicate via typed channels (from CSP/Limbo).

The Limbo Language

Limbo is Inferno’s application language. Its design reflects the system’s values:

  • Type-safe – No pointer arithmetic, no unchecked casts, no buffer overflows. The type system is enforced at compile time and verified at module load time.
  • Garbage collected – Programmers do not manage memory. Reference counting provides deterministic resource cleanup.
  • Concurrent – First-class chan types (typed channels) and spawn for creating threads. This is CSP-style concurrency, predating (and influencing) Go’s goroutines and channels.
  • Module system – Modules declare interfaces (like header files with type signatures). A module imports another module’s interface, and the runtime verifies type compatibility at load time.
  • ADTs – Algebraic data types with pick (tagged unions). Pattern matching over variants.
  • Tuples – First-class tuple types for returning multiple values.
  • No inheritance – Limbo has ADTs and modules, not objects and classes.

Example – a simple file server in Limbo:

implement Echo;

include "sys.m";
include "draw.m";
include "styx.m";

sys: Sys;

Echo: module {
    init: fn(nil: ref Draw->Context, argv: list of string);
};

init(nil: ref Draw->Context, argv: list of string)
{
    sys = load Sys Sys->PATH;
    # ... set up Styx server, handle read/write on echo file
}

Limbo and the Namespace Model

Limbo programs interact with the namespace through the Sys module’s file operations (open, read, write, mount, bind, etc.) – the same operations as in Plan 9. The namespace model is identical:

  • Each process group has its own namespace
  • bind and mount manipulate the namespace
  • File servers (Styx servers) provide services
  • Union directories compose multiple servers

The difference is that Limbo’s type safety extends to the file descriptors and channels used to communicate. A Sys->FD is a reference type, not a raw integer. You cannot fabricate a file descriptor from nothing.

Limbo’s channel type (chan of T) provides typed communication between concurrent threads within a process. Channels are a local IPC mechanism complementary to Styx, which handles inter-process and inter-machine communication.

Styx (Inferno’s 9P)

Styx is Inferno’s name for the 9P2000 protocol. In the current version of Inferno, Styx and 9P2000 are wire-compatible – the same byte format, the same message types, the same semantics. The renaming reflects Inferno’s origin as a commercial product from Vita Nuova (and before that, Lucent Technologies) with its own branding.

The Inferno kernel includes a Styx library (Styx and Styxservers modules) that makes implementing file servers straightforward in Limbo. The Styxservers module provides a framework: you implement a navigator (for walk/stat) and a file handler (for read/write), and the framework handles the protocol boilerplate.

include "styx.m";
include "styxservers.m";

styx: Styx;
styxservers: Styxservers;

Srv: adt {
    # ... file tree definition
};

# The framework calls navigator.walk(), navigator.stat() for metadata
# and file.read(), file.write() for data operations.

Inferno also provides the 9srvfs utility for mounting external 9P servers and the mount command for attaching Styx servers to the namespace – the same patterns as Plan 9.

Security Model

Inferno’s security model builds on namespaces with additional mechanisms:

  • Signed modules – Dis modules can be cryptographically signed. The loader can verify signatures before executing code.
  • Certificate-based authentication – Inferno uses a certificate infrastructure (not Kerberos like Plan 9) for authenticating connections.
  • Namespace restriction – The wm/sh shell and other supervisory programs can construct restricted namespaces for untrusted code.
  • Type safety as security – Since Limbo prevents pointer forgery and buffer overflows, type safety is a security boundary. A Limbo program cannot escape its type system to forge file descriptors or access arbitrary memory.

6. Relevance to capOS

6.1 Namespace Composition via Capabilities

Plan 9 lesson: Per-process namespaces are a powerful isolation and composition mechanism. A process’s “view of the world” is constructed by its parent through bind/mount operations. The child cannot escape this view.

capOS parallel: Per-process capability tables serve an analogous role. A process’s “view of the world” is its set of granted capabilities. The child cannot discover or access capabilities outside its table.

What capOS could adopt:

The existing Namespace interface in the storage proposal (docs/proposals/storage-and-naming-proposal.md) already captures some of this – resolve, bind, list, and sub provide name-to-capability mappings. But Plan 9’s namespace model suggests a more dynamic composition pattern:

interface Namespace {
    # Resolve a name to a capability reference
    resolve @0 (name :Text) -> (capId :UInt32, interfaceId :UInt64);

    # Bind a capability at a name in this namespace
    bind @1 (name :Text, capId :UInt32) -> ();

    # Create a union: multiple capabilities behind one name
    union @2 (name :Text, capId :UInt32, position :UnionPosition) -> ();

    # List available names
    list @3 () -> (entries :List(NamespaceEntry));

    # Get a restricted sub-namespace
    sub @4 (prefix :Text) -> (ns :Namespace);
}

enum UnionPosition {
    before @0;   # searched first (like Plan 9 MBEFORE)
    after @1;    # searched last (like Plan 9 MAFTER)
    replace @2;  # replaces existing (like Plan 9 MREPL)
}

struct NamespaceEntry {
    name @0 :Text;
    interfaceId @1 :UInt64;
    label @2 :Text;
}

The key insight from Plan 9 is union composition – multiple capabilities can be bound at the same name, searched in order. This is useful for overlay patterns: a local cache capability layered before a remote store capability, or a per-user config namespace layered before a system-wide default.
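The ordered-search semantics can be sketched in a few lines of Rust (a toy model – `CapId`, `Namespace`, and the method names are stand-ins for the schema above, not the kernel’s implementation):

```rust
use std::collections::HashMap;

// Stand-in for the schema's capId :UInt32.
type CapId = u32;

#[derive(Clone, Copy, PartialEq, Debug)]
enum UnionPosition { Before, After, Replace }

// Each name maps to an ordered list of capabilities, searched front-to-back.
#[derive(Default)]
struct Namespace {
    bindings: HashMap<String, Vec<CapId>>,
}

impl Namespace {
    // Insert a capability at a name according to the requested union position.
    fn union(&mut self, name: &str, cap: CapId, pos: UnionPosition) {
        let slot = self.bindings.entry(name.to_string()).or_default();
        match pos {
            UnionPosition::Before => slot.insert(0, cap), // searched first
            UnionPosition::After => slot.push(cap),       // searched last
            UnionPosition::Replace => { slot.clear(); slot.push(cap); }
        }
    }

    // Resolution returns the first capability in search order.
    fn resolve(&self, name: &str) -> Option<CapId> {
        self.bindings.get(name).and_then(|v| v.first().copied())
    }
}
```

With this shape, the cache-over-store overlay is two `union` calls: bind the remote store with `After`, then the local cache with `Before`, and `resolve` finds the cache first.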

Differences from Plan 9:

Plan 9 namespaces map names to file servers. capOS namespaces map names to typed capabilities. The advantage: capOS can verify at bind time that the capability matches the expected interface. Plan 9 cannot – you mount a file server and discover at runtime whether it exports the files you expect.

6.2 Cap’n Proto RPC vs 9P

Protocol comparison:

| Aspect | 9P2000 | Cap’n Proto RPC |
|---|---|---|
| Message format | Fixed binary fields, counted strings/data | Capnp wire format (pointer-based, zero-copy decode) |
| Operations | Fixed set (walk, open, read, write, stat, …) | Arbitrary per-interface (schema-defined methods) |
| Typing | Untyped bytes | Strongly typed (schema-checked) |
| Multiplexing | Tag-based (16-bit tags) | Question ID-based (32-bit) |
| Pipelining | Not supported (each op is independent) | Promise pipelining (call a method on a not-yet-returned result) |
| Authentication | Pluggable via auth fid | Application-level (not protocol-specified) |
| Capabilities | No (file fids are unforgeable handles, but no transfer/attenuation) | Native capability passing and attenuation |
| Maximum message | Negotiated msize | No inherent limit (segmented messages) |
| Schema evolution | N/A (fixed protocol) | Forward/backward compatible schema changes |
| Network transparency | Native design goal | Native design goal |

Key differences for capOS:

  1. Promise pipelining – This is capnp RPC’s strongest advantage over 9P. In 9P, opening a TCP connection requires: walk to /net/tcp -> walk to clone -> open clone -> read (get connection number) -> walk to ctl -> open ctl -> write “connect …” -> walk to data -> open data. Nine operations, nearly all of them round-trips. With capnp pipelining: net.createTcpSocket("10.0.0.1", 80) returns a promise, and you can immediately call .write(data) on the promise – the runtime chains the calls without waiting for the first to complete. One logical round-trip.

  2. Typed interfaces – 9P’s strength is that cat works on any file. Capnp’s strength is that the compiler catches console.allocFrame() at compile time. capOS should not try to make everything a “file” – typed interfaces are the right abstraction for a capability system. But a FileServer capability interface could provide Plan 9-like flexibility where needed (see below).

  3. Capability passing – 9P has no way to pass a fid through a file server to a third party. (The srv device is a workaround, not a protocol feature.) Capnp RPC natively supports passing capability references in messages. This is fundamental to capOS’s model.
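The pipelining point above can be modeled in a few lines of Rust (a toy, not the capnp-rust API): a call on an unresolved promise only extends a queued chain, and the whole dependency chain ships in one batch.

```rust
// Toy model of promise pipelining. A promise records the chain of method
// calls queued against its not-yet-returned result; nothing here waits.
struct Promise {
    chain: Vec<String>,
}

// Start a chain with an initial call, e.g. rpc("createTcpSocket").
fn rpc(method: &str) -> Promise {
    Promise { chain: vec![method.to_string()] }
}

impl Promise {
    // Calling a method on an unresolved promise just extends the chain.
    fn call(mut self, method: &str) -> Promise {
        self.chain.push(method.to_string());
        self
    }

    // The entire chain ships together: one logical round trip, regardless
    // of chain length. Without pipelining, each call would cost one.
    fn round_trips(&self) -> usize {
        1
    }
}
```

The contrast with 9P is the point: the nine-step connection dance above is nine round-trips precisely because each step needs the previous step’s result before it can be sent.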

6.3 File Server Pattern as a Capability

Plan 9’s file server pattern is useful and should not be discarded just because capOS is capability-based. Instead, define a generic FileServer capability interface:

interface FileServer {
    walk @0 (names :List(Text)) -> (fid :FileFid);
    list @1 (fid :FileFid) -> (entries :List(DirEntry));
}

interface FileFid {
    open @0 (mode :OpenMode) -> (iounit :UInt32);
    read @1 (offset :UInt64, count :UInt32) -> (data :Data);
    write @2 (offset :UInt64, data :Data) -> (written :UInt32);
    stat @3 () -> (info :FileInfo);
    close @4 () -> ();
}

A FileServer capability enables:

  • /proc-like introspection – A debugging service exports process state as a file tree. Tools read files to inspect state.
  • Config storage – A configuration namespace can be exposed as files for tools that work with text.
  • POSIX compatibility – The POSIX shim layer maps open()/read()/write() to FileServer capability calls.
  • Shell scripting – A capability-aware shell could mount FileServer caps and use cat/echo-style tools on them.

The point: FileServer is one capability interface among many. It is not the universal abstraction (as in Plan 9), but it is available where the file metaphor is natural.
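As a sketch of the data operations such an interface implies, here is a minimal in-memory file honoring offset-based read/write, with zero-filled sparse writes (simplified Rust stand-ins for the capnp-generated types):

```rust
// In-memory backing for the FileFid read/write operations above.
struct MemFile {
    data: Vec<u8>,
}

impl MemFile {
    // read(offset, count): returns at most `count` bytes, short at EOF.
    fn read(&self, offset: u64, count: u32) -> Vec<u8> {
        let start = (offset as usize).min(self.data.len());
        let end = (start + count as usize).min(self.data.len());
        self.data[start..end].to_vec()
    }

    // write(offset, data): extends the file with zero fill if the write
    // lands past the current end, and reports bytes written.
    fn write(&mut self, offset: u64, data: &[u8]) -> u32 {
        let start = offset as usize;
        let end = start + data.len();
        if self.data.len() < end {
            self.data.resize(end, 0);
        }
        self.data[start..end].copy_from_slice(data);
        data.len() as u32
    }
}
```

A real FileServer implementation would sit behind the capability boundary; this only shows the offset semantics the schema commits to.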

6.4 IPC Lessons

Plan 9 lesson: 9P works as universal IPC because the protocol is simple and the kernel handles the plumbing (mount, pipe, network). The cost is per-message overhead (copies, context switches).

capOS implications:

  1. Minimize copies. 9P’s two-copies-per-direction (user -> kernel pipe buffer -> server) is acceptable for networks but expensive for local IPC. capOS should investigate shared-memory regions for bulk data transfer between co-located processes, with capnp messages as the control plane. The roadmap’s io_uring-inspired submission/completion rings already point in this direction.

  2. Direct context switch. The L4/seL4 IPC fast-path (direct switch from caller to callee without choosing an unrelated runnable process) now exists as a baseline for blocked Endpoint receivers. Plan 9 does not do this – every 9P round-trip goes through the kernel’s pipe/network layer. capOS can tune this further because capability calls have a known target process.

  3. Batching. Plan 9 mitigates round-trip costs through large reads/writes (the iounit mechanism). Capnp’s promise pipelining is the typed equivalent – batch multiple logical operations into a dependency chain that executes without intermediate round-trips.
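The ring direction in point 1 can be illustrated with the index arithmetic of a bounded ring (a single-threaded toy: real submission/completion rings live in shared memory and synchronize head/tail with atomics):

```rust
// Toy bounded ring with monotonically increasing head/tail indices --
// the shape of an io_uring-style submission ring.
struct Ring<T> {
    slots: Vec<Option<T>>,
    head: usize, // next slot to consume
    tail: usize, // next slot to fill
}

impl<T> Ring<T> {
    fn new(capacity: usize) -> Self {
        Ring { slots: (0..capacity).map(|_| None).collect(), head: 0, tail: 0 }
    }

    // Producer side: submit an entry, refusing when the ring is full.
    fn push(&mut self, item: T) -> bool {
        if self.tail - self.head == self.slots.len() {
            return false; // full
        }
        let idx = self.tail % self.slots.len();
        self.slots[idx] = Some(item);
        self.tail += 1;
        true
    }

    // Consumer side: take the oldest entry, if any.
    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None; // empty
        }
        let idx = self.head % self.slots.len();
        self.head += 1;
        self.slots[idx].take()
    }
}
```

The capOS control plane would carry capnp messages through such slots while bulk payloads travel in separately mapped shared-memory regions.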

6.5 Inferno Lessons

Dis VM / type safety: Inferno’s bet on a managed runtime (Dis + Limbo) gives it type safety as a security boundary. capOS, being written in Rust for kernel code and targeting native binaries, does not have this luxury for arbitrary user-space code. However:

  • WASI support (on the roadmap) provides a sandboxed execution environment with type-checked interfaces, similar in spirit to Dis.
  • Cap’n Proto schemas provide interface-level type safety even for native code. The schema is the contract, enforced at message boundaries.

Channel-based concurrency: Limbo’s chan of T type is a local IPC mechanism within a process. capOS does not currently have this (it relies on kernel-mediated capability calls for all IPC). For in-process threading (on the roadmap), typed channels between threads could be useful – implemented as a library on top of shared memory + futex, without kernel involvement.
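A library-level sketch of such typed channels, here using Rust’s std::sync::mpsc as a stand-in for the shared-memory + futex implementation the roadmap envisions:

```rust
// Typed channels between threads in the spirit of Limbo's `chan of T`:
// once the channel exists, per-message traffic needs no kernel capability
// call (mpsc stands in for a shared-memory + futex channel).
use std::sync::mpsc;
use std::thread;

fn sum_across_threads(values: Vec<u64>) -> u64 {
    let (tx, rx) = mpsc::channel::<u64>(); // only u64 fits through this channel
    let handle = thread::spawn(move || {
        let mut total = 0;
        for v in rx {
            total += v; // iterates until all senders are dropped
        }
        total
    });
    for v in values {
        tx.send(v).unwrap();
    }
    drop(tx); // close the channel so the receiver loop terminates
    handle.join().unwrap()
}
```

The type parameter is the point: as with `chan of T`, the channel's element type is checked at compile time, so a thread cannot receive something it did not agree to.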

Portable execution: Inferno’s ability to run the same bytecode everywhere is appealing but orthogonal to capOS’s goals. The WASI runtime item on the roadmap serves this purpose for capOS.

6.6 Concrete Recommendations

Based on this research, the following items are most relevant to capOS development:

  1. Add a Namespace capability with union semantics. Extend the existing Namespace design (from the storage proposal) with Plan 9-style union composition (before/after/replace). This enables overlay patterns for configuration, caching, and modularity.

  2. Implement a FileServer capability interface. Not as the universal abstraction, but as one interface for resources that are naturally file-like (config trees, debug introspection, POSIX compatibility). A FileServer cap is just another capability – no special kernel support needed.

  3. Prioritize promise pipelining. This is capnp’s killer feature over 9P and the biggest performance advantage for IPC-heavy workloads. Multiple logical operations collapse into one network/IPC round-trip. Async rings are in place; the remaining work is the Stage 6 pipeline dependency/result-cap mapping rule.

  4. Plan 9-style namespace construction in init. The boot manifest already describes which capabilities each service receives. Consider adding namespace-level composition to the manifest: “this service sees capability X as data/primary and capability Y as data/cache, with cache searched first” – union directory semantics expressed in capability terms.

  5. Study 9P’s exportfs pattern for network transparency. Plan 9’s exportfs re-exports a namespace subtree over the network. The capOS equivalent would be a proxy service that takes a set of local capabilities and makes them available as capnp RPC endpoints on the network. This is the “network transparency” roadmap item – 9P’s design proves it is achievable, and capnp’s richer type system makes it more robust.

  6. Do not replicate 9P’s weaknesses. The untyped byte-stream interface, the lack of structured errors, and the fixed operation set are 9P’s costs for universality. capOS pays none of these costs with Cap’n Proto. The temptation to “make everything a file for simplicity” should be resisted – typed capabilities are strictly more powerful, and the FileServer interface provides the file metaphor where needed without compromising the rest of the system.


Summary

| Plan 9 / Inferno Concept | capOS Equivalent | Gap / Action |
|---|---|---|
| Per-process namespace (bind/mount) | Per-process capability table | Add Namespace cap with union semantics |
| 9P protocol (file operations) | Cap’n Proto RPC (typed method calls) | capnp is strictly superior for typed IPC; FileServer cap provides file semantics where needed |
| Union directories | No current equivalent | Add union composition to Namespace interface |
| File servers as services | Capability-implementing processes | Already the model; manifest-driven service graph is close to Plan 9’s boot namespace construction |
| Network transparency via 9P | Network transparency via capnp RPC | Same goal; capnp adds promise pipelining and typed interfaces |
| exportfs (namespace re-export) | Capability proxy service | Not yet designed; high-value future work |
| Styx/9P as universal IPC | Capnp messages as universal IPC | Already the model; prioritize fast-path and pipelining |
| Dis VM (portable, type-safe execution) | WASI runtime (roadmap) | Same goal, different mechanism |
| Limbo channels (typed local IPC) | Not yet present | Consider for in-process threading |
| Authentication via auth fid | Not yet designed | Cap’n Proto RPC has no built-in auth; needs design |

References

  • Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, Phil Winterbottom. “Plan 9 from Bell Labs.” Computing Systems, Vol. 8, No. 3, Summer 1995, pp. 221-254.
  • Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey, Phil Winterbottom. “The Use of Name Spaces in Plan 9.” Operating Systems Review, Vol. 27, No. 2, April 1993, pp. 72-76.
  • Plan 9 Manual: intro(1), bind(1), mount(1), intro(5) (the 9P manual section).
  • Russ Cox, Eric Grosse, Rob Pike, Dave Presotto, Sean Quinlan. “Security in Plan 9.” USENIX Security 2002.
  • Sean Dorward, Rob Pike, Dave Presotto, Dennis Ritchie, Howard Trickey, Phil Winterbottom. “The Inferno Operating System.” Bell Labs Technical Journal, Vol. 2, No. 1, Winter 1997.
  • Phil Winterbottom, Rob Pike. “The Design of the Inferno Virtual Machine.” Bell Labs, 1997.
  • Vita Nuova. “The Dis Virtual Machine Specification.” 2003.
  • Vita Nuova. “The Limbo Programming Language.” 2003.
  • Sape Mullender (editor). “The 9P2000 Protocol.” Plan 9 manual, section 5 (intro(5)).
  • Kenichi Okada. “9P Resource Sharing Protocol.” IETF Internet-Draft, 2010.

Research: EROS, CapROS, and Coyotos

Deep analysis of persistent capability operating systems and their relevance to capOS.

1. EROS (Extremely Reliable Operating System)

1.1 Overview

EROS was designed and implemented by Jonathan Shapiro and collaborators at the University of Pennsylvania, starting in the late 1990s. It is a pure capability system descended from KeyKOS (developed at Key Logic in the 1980s). EROS’s defining feature is orthogonal persistence: the entire system state – processes, memory, capabilities – is transparently persistent. There is no distinction between “in memory” and “on disk.”

Key papers:

  • Shapiro, J. S., Smith, J. M., & Farber, D. J. “EROS: A Fast Capability System” (SOSP 1999)
  • Shapiro, J. S. “EROS: A Capability System” (PhD dissertation, 1999)
  • Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism” (IEEE S&P 2000)

1.2 The Single-Level Store

In a conventional OS, memory and storage are separate address spaces with different APIs (read/write vs mmap/file I/O). The programmer is responsible for explicitly loading data from disk into memory, modifying it, and writing it back. This creates an impedance mismatch that is the source of enormous complexity (serialization, caching, crash consistency, etc.).

EROS eliminates this distinction with a single-level store:

  • All objects (processes, memory pages, capability nodes) exist in a unified persistent object space.
  • There is no “file system” and no “load/save.” Objects simply exist.
  • The system periodically checkpoints the entire state to disk. Between checkpoints, modified pages are held in memory. After a crash, the system restores to the last consistent checkpoint.
  • From the application’s perspective, memory IS storage. There is no API for persistence – it happens automatically.

The single-level store in EROS operates on two primitive object types:

  1. Pages – 4KB data pages (the equivalent of both memory pages and file blocks).
  2. Nodes – 32-slot capability containers (the equivalent of both process state and directory entries).

Every page and node has a persistent identity (an Object ID, or OID). The kernel maintains an in-memory object cache and demand-pages objects from disk as needed. Modified objects are written back during checkpoints.

1.3 Checkpoint/Restart

EROS uses a consistent checkpoint mechanism inspired by KeyKOS:

How it works:

  1. The kernel periodically initiates a checkpoint (KeyKOS used a 5-minute interval; EROS used a configurable interval, typically seconds to minutes).
  2. All processes are momentarily frozen.
  3. The kernel snapshots the current state:
    • All dirty pages are marked for write-back.
    • All node state (capability tables, process descriptors) is serialized.
    • A consistent snapshot of the entire system is captured.
  4. Processes resume immediately – they continue modifying their own copies of pages (copy-on-write semantics ensure the checkpoint image is stable while new modifications accumulate).
  5. The snapshot is written to disk asynchronously while processes continue running.
  6. Once the write completes, the checkpoint is atomically committed (a checkpoint header on disk is updated).
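The dirty-set economics of this scheme can be sketched as follows (a toy model: `Oid` and the in-memory “disk” stand in for EROS’s object ids and checkpoint area, and the write-back here is synchronous rather than asynchronous):

```rust
use std::collections::{HashMap, HashSet};

// Stand-in for EROS's persistent Object ID.
type Oid = u64;

struct Store {
    memory: HashMap<Oid, Vec<u8>>, // the in-memory object cache
    dirty: HashSet<Oid>,           // objects modified since last checkpoint
    disk: HashMap<Oid, Vec<u8>>,   // last committed checkpoint image
}

impl Store {
    fn new() -> Self {
        Store { memory: HashMap::new(), dirty: HashSet::new(), disk: HashMap::new() }
    }

    fn write(&mut self, oid: Oid, data: Vec<u8>) {
        self.memory.insert(oid, data);
        self.dirty.insert(oid); // tracked for the next checkpoint
    }

    // Freeze phase: snapshot only the dirty set (cost O(dirty), not O(total)),
    // then clear it so new writes accumulate toward the next checkpoint.
    fn checkpoint(&mut self) -> usize {
        let snapshot: Vec<(Oid, Vec<u8>)> = self
            .dirty
            .iter()
            .map(|oid| (*oid, self.memory[oid].clone()))
            .collect();
        self.dirty.clear();
        let written = snapshot.len();
        for (oid, data) in snapshot {
            self.disk.insert(oid, data); // asynchronous in EROS
        }
        written
    }

    // Crash recovery: resume from the last committed image, nothing else.
    fn recover(&mut self) {
        self.memory = self.disk.clone();
        self.dirty.clear();
    }
}
```

The snapshot taken before clearing the dirty set plays the role of the copy-on-write image: processes (here, subsequent `write` calls) keep mutating while the captured state is written back.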

What state is captured:

  • All memory pages (dirty pages since last checkpoint).
  • All nodes (capability slots, process registers, scheduling state).
  • The kernel’s object table (mapping OIDs to disk locations).
  • The capability graph (which process holds which capabilities).

Recovery after crash:

  • On boot, the kernel reads the last committed checkpoint header.
  • The system resumes from that exact state. All processes continue as if nothing happened (they may have lost a few seconds of work since the last checkpoint).
  • No fsck, no journal replay, no application-level recovery logic.

Performance characteristics:

  • Checkpoint cost is proportional to the number of dirty pages since the last checkpoint, not total system size.
  • Copy-on-write minimizes pause time – processes are frozen only long enough to mark pages, not to write them.
  • EROS achieved checkpoint times of a few milliseconds for the freeze phase, with asynchronous write-back taking longer depending on dirty set size.
  • The 1999 SOSP paper reported IPC performance within 2x of L4 (the fastest microkernel at the time) despite the persistence overhead.

1.4 Capabilities: Keys, Nodes, and Domains

EROS (following KeyKOS) uses a specific capability model with three fundamental concepts:

Keys (capabilities):

A key is an unforgeable reference to an object. Keys are the ONLY way to access anything in the system. There are several types:

  • Page keys – reference a persistent page. Can be read-only or read-write.
  • Node keys – reference a node (a 32-slot capability container). Can be read-only.
  • Process keys (called “domain keys” in KeyKOS) – reference a process, allowing control operations (start, stop, set registers).
  • Number keys – encode a 96-bit value directly in the key (no indirection). Used for passing constants through the capability mechanism.
  • Device keys – reference hardware device registers.
  • Forwarder keys – indirection keys used for revocation (see below).
  • Void keys – null/invalid keys, used as placeholders.

Nodes:

A node is a persistent container with a fixed number of key slots (16 in KeyKOS, 32 in EROS). Nodes serve multiple purposes:

  • Address space description: A tree of nodes with page keys at the leaves defines a process’s virtual address space. The kernel walks this tree to resolve virtual addresses to physical pages (analogous to page tables, but persistent and capability-based).
  • Capability storage: A process’s “capability table” is a node tree.
  • General-purpose data structure: Any capability-based data structure (directories, lists, etc.) is built from nodes.

Domains (processes):

A domain is EROS’s equivalent of a process. It consists of:

  • A domain root node with specific slots for:
    • Slot 0-15: general-purpose key registers (the process’s capability table)
    • Address space key (points to the root of the address space node tree)
    • Schedule key (determines CPU time allocation)
    • Brand key (identity for authentication)
    • Other control keys
  • The domain’s register state (general-purpose registers, IP, SP, flags)
  • A state (running, waiting, available)

The entire domain state is captured during checkpoint because it’s all stored in persistent nodes and pages.

1.5 The Keeper Mechanism

Each domain has a keeper key – a capability to another domain that acts as its fault handler. When a domain faults (page fault, capability fault, exception), the kernel invokes the keeper:

  1. The faulting domain is suspended.
  2. The kernel sends a message to the keeper describing the fault.
  3. The keeper can inspect and modify the faulting domain’s state (via the domain key), fix the fault (e.g., map a page, supply a capability), and restart it.

This is EROS’s equivalent of signal handlers or exception ports, but capability-mediated and fully general. Keepers enable:

  • Demand paging (the space bank keeper maps pages on fault)
  • Capability interposition (a keeper can wrap/restrict capabilities)
  • Process supervision (restart on crash)

1.6 Capability Revocation

Capability revocation – the ability to invalidate all copies of a capability – is one of the hardest problems in capability systems. EROS solves it with forwarder keys (realized as wrapper objects in EROS terminology):

How forwarders work:

  1. Instead of giving a client a direct key to a resource, the server creates a forwarder node.
  2. The forwarder contains a key to the real resource in one of its slots.
  3. The client receives a key to the forwarder, not the resource.
  4. When the client invokes the forwarder key, the kernel transparently redirects to the real resource.
  5. To revoke: the server rescinds the forwarder (sets a bit on the forwarder node). All outstanding forwarder keys become void keys. Invocations fail immediately.

Properties:

  • Revocation is O(1) – flip a bit on the forwarder node. No need to scan all processes for copies.
  • Revocation is transitive – if the revoked key was used to derive other keys (via further forwarders), those are also invalidated.
  • The client cannot distinguish a forwarder key from a direct key (the kernel handles the indirection transparently).
  • Revocation is immediate and permanent.
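The mechanism reduces to a single indirection plus a rescind bit, sketched here in Rust (illustrative names, not EROS’s kernel structures):

```rust
use std::cell::Cell;
use std::rc::Rc;

// The forwarder holds the only direct reference to the target.
struct Forwarder<T> {
    target: T,
    rescinded: Cell<bool>,
}

// Clients hold (and may freely copy) keys to the forwarder, never to the
// target itself.
#[derive(Clone)]
struct Key<T>(Rc<Forwarder<T>>);

impl<T> Key<T> {
    fn new(target: T) -> Self {
        Key(Rc::new(Forwarder { target, rescinded: Cell::new(false) }))
    }

    // Invocation transparently follows the indirection; a rescinded
    // forwarder behaves like a void key.
    fn invoke<R>(&self, f: impl FnOnce(&T) -> R) -> Result<R, &'static str> {
        if self.0.rescinded.get() {
            return Err("void key");
        }
        Ok(f(&self.0.target))
    }

    // O(1) revocation: one bit flip invalidates every outstanding copy.
    fn rescind(&self) {
        self.0.rescinded.set(true);
    }
}
```

Note what is absent: no scan over holders, no bookkeeping of who copied the key. Every copy shares the one rescind bit.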

Space banks and revocation:

EROS uses space banks (inspired by KeyKOS) to manage resource allocation. A space bank is a capability that allocates pages and nodes. When a space bank is destroyed, ALL objects allocated from it are reclaimed. This provides bulk revocation of an entire subsystem.

1.7 Confinement

EROS provides a formally verified confinement mechanism. A confined subsystem cannot leak information to the outside world except through channels explicitly provided to it. Shapiro and Weber (IEEE S&P 2000) proved that EROS can construct a confined subsystem using:

  1. A constructor creates the confined process.
  2. The confined process receives ONLY the capabilities explicitly granted to it. It has no ambient authority, no access to timers (to prevent timing channels), and no access to storage (to prevent storage channels).
  3. The constructor verifies that no covert channels exist in the granted capability set.

This is relevant to capOS’s capability model: the same structural properties that make EROS confinement possible (no ambient authority, capabilities as the only access mechanism) are present in capOS’s design.


2. CapROS

2.1 Relationship to EROS

CapROS (Capability-based Reliable Operating System) is the direct successor to EROS. It was started by Charles Landau (who also worked on KeyKOS) and continues development based on the EROS codebase. CapROS is essentially “EROS in production” – the same architecture with engineering improvements.

2.2 Improvements Over EROS

Practical engineering focus:

  • EROS was a research system; CapROS aims to be deployable.
  • CapROS added support for modern hardware (PCI, USB, networking).
  • Improved build system and development toolchain.

Persistence improvements:

  • CapROS refined the checkpoint mechanism for better performance with modern disk characteristics (SSDs change the cost model significantly – random writes are cheap, so the checkpoint layout can be optimized differently than for spinning disks).
  • Added support for larger persistent object spaces.
  • Improved crash recovery speed.

Device driver model:

  • CapROS runs device drivers as user-space processes (like EROS), each receiving only the device capabilities they need.
  • A device driver receives: device register keys (MMIO access), interrupt keys (to receive interrupts), and DMA buffer keys.
  • The driver CANNOT access other devices, other processes’ memory, or arbitrary I/O ports. It is confined to its specific device.
  • This is directly analogous to capOS’s planned device capability model (see the networking and cloud deployment proposals).

Linux compatibility layer:

  • CapROS includes a partial Linux kernel compatibility layer that allows some Linux device drivers to be compiled and run as CapROS user-space drivers. This pragmatically addresses the “driver availability” problem without compromising the capability model.

2.3 Current Status

CapROS development continued into the 2010s but has been relatively quiet. The codebase exists and runs on real x86 hardware. It is not widely deployed and remains primarily a research/demonstration system. The key contribution is demonstrating that the EROS/KeyKOS persistent capability model is viable on modern hardware and can support real device drivers and applications.

2.4 Device Drivers and Hardware Access

CapROS’s device driver isolation is worth examining in detail because capOS faces the same design decisions:

Device capability model:

Kernel
  │
  ├── DeviceManager capability
  │     │
  │     ├── grants DeviceMMIO(base, size) to driver
  │     ├── grants InterruptCap(irq_number) to driver
  │     └── grants DMAPool(phys_range) to driver
  │
  └── Driver process
        │
        ├── uses DeviceMMIO to read/write registers
        ├── uses InterruptCap to wait for interrupts
        ├── uses DMAPool to allocate DMA-safe buffers
        └── exports higher-level capability (e.g., NIC, Block)

The driver has no way to access memory outside its granted ranges. A buggy NIC driver cannot corrupt disk I/O or access other processes’ pages.
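The confinement falls out of construction: a driver type that can only be built from its granted capabilities has no code path to anything else. A Rust sketch with illustrative types (the names and ranges are assumptions, not capOS or CapROS structures):

```rust
// Illustrative capability grants; in a real system these would be
// unforgeable kernel objects, not plain structs.
struct DeviceMmio { base: usize, size: usize }
struct InterruptCap { irq: u32 }
struct DmaPool { phys_base: usize, len: usize }

// The driver owns exactly its grants and nothing else: there is no field,
// and hence no code path, reaching other devices or other processes' memory.
struct NicDriver {
    mmio: DeviceMmio,
    irq: InterruptCap,
    dma: DmaPool,
}

impl NicDriver {
    fn new(mmio: DeviceMmio, irq: InterruptCap, dma: DmaPool) -> Self {
        NicDriver { mmio, irq, dma }
    }

    // Every register access is checked against the granted MMIO window.
    fn reg_in_window(&self, offset: usize) -> bool {
        offset < self.mmio.size
    }
}
```

A bug in such a driver can at worst corrupt its own device's registers and DMA buffers, which is exactly the isolation property the diagram above describes.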


3. Coyotos

3.1 Design Philosophy

Coyotos was Jonathan Shapiro’s next-generation project after EROS, started around 2004. Where EROS was an implementation of the KeyKOS model in C, Coyotos aimed to be a formally verifiable capability OS from the ground up.

Key differences from EROS:

  • Verification-oriented design: Every kernel mechanism was designed to be amenable to formal verification. If a feature couldn’t be verified, it was redesigned or removed.
  • BitC language: A new programming language (BitC) was designed specifically for writing verified systems software.
  • Simplified object model: Coyotos reduced the number of primitive object types compared to EROS, making the verification target smaller.
  • No inline assembly in the verified core: The verified kernel core was to be written entirely in BitC, with a thin hardware abstraction layer underneath.

3.2 BitC Language

BitC was an ambitious attempt to create a language suitable for both systems programming and formal verification:

Design goals:

  • Type safety: Sound type system that prevents memory errors at compile time.
  • Low-level control: Direct memory layout control, no garbage collector, suitable for kernel code.
  • Formal reasoning: Type system designed so that proofs about programs could be mechanically checked.
  • Mutability control: Explicit distinction between mutable and immutable references (predating Rust’s borrow checker by several years).

Relationship to capability verification:

The key insight was that if the kernel is written in a language with a sound type system, and capabilities are represented as typed references in that language, then many capability safety properties (no forgery, no amplification) follow from type safety rather than requiring separate proofs.

Specifically:

  • Capabilities are opaque typed references – the type system prevents construction of capabilities from raw integers.
  • The lack of arbitrary pointer arithmetic prevents capability forgery.
  • Type-based access control means a read-only capability reference cannot be cast to a read-write one.

Outcome:

BitC was never completed. The language design proved extremely difficult – combining low-level systems programming with formal verification requirements created unsolvable tensions in the type system. Shapiro eventually acknowledged that the BitC approach was overambitious and shelved the project. (Rust, which appeared later, solved many of the same problems with a different approach – borrowing and lifetimes rather than full dependent types.)

3.3 Formal Verification Approach

Coyotos aimed to verify several key properties:

  1. Capability safety: No process can forge, modify, or amplify a capability. This was to be proved as a consequence of BitC’s type safety.
  2. Confinement: A confined subsystem cannot leak information except through authorized channels. EROS proved this informally; Coyotos aimed for machine-checked proofs.
  3. Authority propagation: Formal model of how authority flows through the capability graph, allowing static analysis of security policies.
  4. Memory safety: The kernel never accesses memory it shouldn’t, never double-frees, never uses after free. Type safety + linear types in BitC were intended to guarantee this.

The verification approach influenced later work on seL4, which successfully achieved formal verification of a capability microkernel (though in C with Isabelle/HOL proofs, not in a verification-oriented language).

3.4 Coyotos Memory Model

Coyotos simplified the EROS memory model while retaining persistence:

Objects:

  • Pages: 4KB data pages (same as EROS).
  • CapPages: Pages that hold capabilities instead of data. This replaced EROS’s fixed-size nodes with variable-size capability containers.
  • GPTs (Guarded Page Tables): A unified abstraction for address space construction. Instead of EROS’s separate node trees for address spaces, Coyotos uses GPTs that combine guard bits (for sparse address space construction, similar to Patricia trees) with page table semantics.
  • Processes: Similar to EROS domains but with a cleaner structure.
  • Endpoints: IPC communication endpoints (similar to L4 endpoints, replacing EROS’s direct domain-to-domain calls).

GPTs (Guarded Page Tables):

This was Coyotos’s most innovative memory model contribution. A GPT node has:

  • A guard value and guard length (for address space compression).
  • Multiple capability slots pointing to sub-GPTs or pages.
  • Hardware-independent address space description that the kernel translates to actual page tables on TLB miss.

The guard mechanism allows sparse address spaces without allocating intermediate page table levels. For example, a process that uses only two memory regions at addresses 0x1000 and 0x7FFF_F000 needs only a few GPT nodes, not a full 4-level page table tree.

Persistence:

Coyotos retained EROS’s checkpoint-based persistence but with a cleaner separation between the persistent object store and the in-memory cache. The simpler object model (fewer object types) made the checkpoint logic easier to verify.

3.5 Current Status

Coyotos was never completed. The BitC language proved too difficult, and Shapiro moved on to other work. However, Coyotos’s design documents and specifications remain valuable as a carefully reasoned evolution of the EROS model. The key ideas (GPTs, endpoint-based IPC, verification-oriented design) influenced other systems work.


4. Single-Level Store: Deep Dive

4.1 The Core Concept

The single-level store unifies two traditionally separate abstractions:

| Traditional OS | Single-Level Store |
|---|---|
| Virtual memory (RAM, volatile) | Unified persistent object space |
| File system (disk, persistent) | Same unified space |
| mmap (bridge between the two) | No bridge needed |
| Serialization (convert objects to bytes for storage) | Objects are always in storable form |
| Crash recovery (fsck, journal replay) | Checkpoint restore |

In a single-level store, the programmer never thinks about persistence. Objects are created, modified, and eventually garbage collected. The system ensures they survive power failure without any explicit save operation.

4.2 Implementation in EROS

EROS’s single-level store works as follows:

Object storage on disk:

  • The disk is divided into two regions: the object store and the checkpoint log.
  • The object store holds the canonical copy of all objects (pages and nodes), indexed by OID.
  • The checkpoint log holds the most recently checkpointed versions of modified objects.

Object lifecycle:

  1. An object is created (allocated from a space bank). It receives a unique OID.
  2. The object exists in the in-memory object cache. It may be modified arbitrarily.
  3. During checkpoint, if the object is dirty, its current state is written to the checkpoint log.
  4. After the checkpoint commits, the logged version may be migrated to the object store (or left in the log until the next checkpoint).
  5. If the object is evicted from memory (memory pressure), it can be demand-paged back from disk.

Demand paging:

When a process accesses a virtual address that isn’t currently in physical memory:

  1. Page fault occurs.
  2. The kernel looks up the OID for that virtual page (by walking the address space capability tree).
  3. If the object is on disk, the kernel reads it into the object cache.
  4. The page is mapped into the process’s address space.
  5. The process continues, unaware that I/O occurred.

This is similar to demand paging in a conventional OS, but with a critical difference: the “backing store” is the persistent object store, not a swap partition. There is no separate swap space.
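The fault path above can be simulated in ordinary Rust, with hash maps standing in for the address-space tree, the object cache, and the persistent store. All names here (`System`, `handle_fault`) are hypothetical, a sketch of the mechanism rather than kernel code:

```rust
use std::collections::HashMap;

type Oid = u64;
type Page = Vec<u8>;

/// Toy single-level store: the "disk" is the canonical object store,
/// keyed by OID. There is no separate swap space.
struct System {
    store: HashMap<Oid, Page>,     // persistent object store
    cache: HashMap<Oid, Page>,     // in-memory object cache
    mappings: HashMap<usize, Oid>, // virtual page number -> OID
}

impl System {
    /// Resolve a faulting virtual page: look up its OID, demand-page the
    /// object from the store into the cache if needed, return the data.
    fn handle_fault(&mut self, vpn: usize) -> Option<&Page> {
        let oid = *self.mappings.get(&vpn)?; // walk the address-space mapping
        if !self.cache.contains_key(&oid) {
            let page = self.store.get(&oid)?.clone(); // "I/O" from the store
            self.cache.insert(oid, page);
        }
        self.cache.get(&oid) // now resident; the process resumes unaware
    }
}

fn demo() -> Vec<u8> {
    let mut sys = System {
        store: HashMap::from([(47293, vec![1, 2, 3])]),
        cache: HashMap::new(),
        mappings: HashMap::from([(1, 47293)]), // vpn 1 = address 0x1000
    };
    sys.handle_fault(1).expect("mapped page").clone()
}

fn main() {
    assert_eq!(demo(), vec![1, 2, 3]);
}
```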

4.3 Performance Implications

Advantages:

  • No serialization overhead for persistence. Objects are stored in their in-memory format.
  • No double-buffering. A conventional OS may have a page in both the page cache and a file buffer; EROS has one copy.
  • Checkpoint cost is proportional to mutation rate, not data size.
  • Recovery is instantaneous – resume from last checkpoint, no log replay.

Disadvantages:

  • Checkpoint pause: Even with copy-on-write, there is a brief pause to snapshot the system state. KeyKOS/EROS measured this at milliseconds, but it can grow with the number of dirty pages.
  • Write amplification: Every modified page must be written to the checkpoint log, even if only one byte changed. This is worse than a log-structured filesystem that can coalesce small writes.
  • Memory pressure: The object cache competes with application working sets. Under heavy memory pressure, the system may thrash between paging objects in and checkpointing them out.
  • Large object stores: The OID-to-disk-location mapping must be kept in memory (or itself paged, adding complexity). For very large stores, this overhead grows.
  • No partial persistence: You can’t choose to make some objects transient and others persistent. Everything is persistent. This wastes disk bandwidth on objects that don’t need persistence (temporary buffers, caches, etc.).

4.4 Relationship to Persistent Memory (PMEM/Optane)

Intel Optane (3D XPoint, now discontinued but conceptually important) and other persistent memory technologies provide byte-addressable storage that survives power loss. This is remarkably close to what EROS simulates in software:

| EROS Single-Level Store | PMEM Hardware |
|---|---|
| Software checkpoint to disk | Hardware persistence on every write |
| Object cache in DRAM | Data in persistent memory |
| Demand paging from disk | Direct load/store to persistent media |
| Crash = lose since last checkpoint | Crash = lose in-flight stores (cache lines) |

PMEM makes the single-level store cheaper:

  • No checkpoint writes needed for objects stored in PMEM – they’re already persistent.
  • No demand paging from disk – PMEM is directly addressable.
  • Consistency requires cache line flush + fence (much cheaper than disk I/O).

But PMEM doesn’t eliminate the need for the store abstraction:

  • PMEM capacity is limited (compared to SSDs/HDDs). The object store may still need to tier between PMEM and block storage.
  • PMEM has higher latency than DRAM. The object cache still has value as a fast-path.
  • Crash consistency with PMEM requires careful ordering of writes (cache line flushes). The checkpoint model actually simplifies this – you don’t need per-object crash consistency, just per-checkpoint consistency.

Relevance to capOS:

Even without PMEM hardware, understanding the single-level store model informs how capOS can design its persistence layer. The key insight is that separating “in-memory format” from “on-disk format” creates unnecessary complexity. Cap’n Proto’s zero-copy serialization already blurs this line – a capnp message in memory has the same byte layout as on disk.


5. Persistent Capabilities

5.1 How Persistent Capabilities Survive Restarts

In EROS/KeyKOS, capabilities survive restarts because they are part of the checkpointed state:

  1. A capability is stored as a key in a node slot.
  2. The key contains: (object type, OID, permissions, other metadata).
  3. During checkpoint, all nodes (including their key slots) are written to disk.
  4. On restart, nodes are restored. Keys reference objects by OID. Since objects are also restored, the key resolves to the same object.

The critical property: capabilities are named by the persistent identity of their target, not by a volatile address. A key says “page #47293” not “memory address 0x12345.” Since page #47293 is persistent, the key is meaningful across restarts.

5.2 Consistency Model

EROS guarantees checkpoint consistency: the entire system is restored to the state at the last committed checkpoint. This means:

  • If process A sent a message to process B, and both the send and receive completed before the checkpoint, both see the message after restart.
  • If the send completed but the receive didn’t (checkpoint happened between them), both are rolled back to before the send. The message is lost, but the system is consistent.
  • There is no scenario where A thinks it sent a message but B never received it (or vice versa). The checkpoint captures a consistent global snapshot.

This is analogous to database transaction atomicity but applied to the entire system state.

5.3 Volatile State and Capabilities

Some capabilities reference inherently volatile state. EROS handles this through the object re-creation pattern:

Hardware devices:

  • Device keys reference hardware registers that don’t survive reboot.
  • On restart, the kernel re-initializes device state and re-creates device keys.
  • Processes that held device keys get valid keys again (pointing to the re-initialized device), but the device state itself is reset.
  • The process’s device driver is responsible for re-initializing the device to the desired state (this is application logic, not kernel logic).

Network connections:

  • EROS doesn’t have a native networking stack in the kernel, so this is handled at the application level.
  • A network service process re-establishes connections on restart.
  • Clients that held capabilities to network endpoints would invoke them, and the network service would transparently reconnect.
  • The capability abstraction hides the reconnection – the client’s code doesn’t change.

General pattern:

When a capability references state that can’t survive restart:

  1. The capability itself persists (it’s in a node slot, checkpointed).
  2. On restart, invoking the capability may trigger re-initialization.
  3. The keeper mechanism handles this: the target object’s keeper detects the stale state and re-initializes before completing the call.
  4. The client is unaware of the restart (or sees a transient error if re-initialization fails).

5.4 The Space Bank Model

Persistent capabilities create a garbage collection problem: when is it safe to reclaim a persistent object? EROS solves this with space banks:

  • A space bank is a capability that allocates objects (pages and nodes).
  • Every object is allocated from exactly one space bank.
  • Space banks can be hierarchical (a bank allocates from a parent bank).
  • Destroying a space bank reclaims ALL objects allocated from it.

This provides:

  • Bulk deallocation: Terminate a subsystem by destroying its bank.
  • Resource accounting: Each bank tracks how much space it has consumed.
  • Revocation: Destroying a bank revokes all capabilities to objects allocated from it (the objects cease to exist).

The space bank model avoids the need for a global garbage collector scanning the capability graph. Instead, resource lifetimes are explicitly managed through the bank hierarchy.
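The bank-hierarchy lifecycle can be sketched in a few lines of Rust. The names (`Banks`, `destroy_bank`) and the flat parent map are illustrative stand-ins for EROS's kernel structures:

```rust
use std::collections::HashMap;

/// Toy space-bank hierarchy: every object records the bank it was
/// allocated from, and destroying a bank reclaims its objects and
/// those of all descendant banks (illustrative, not EROS/capOS API).
struct Banks {
    parent: HashMap<u32, Option<u32>>, // bank id -> parent bank id
    objects: HashMap<u64, u32>,        // OID -> owning bank id
    next_oid: u64,
}

impl Banks {
    fn new() -> Self {
        Banks { parent: HashMap::from([(0, None)]), objects: HashMap::new(), next_oid: 1 }
    }
    fn create_bank(&mut self, id: u32, parent: u32) {
        self.parent.insert(id, Some(parent));
    }
    fn allocate(&mut self, bank: u32) -> u64 {
        let oid = self.next_oid;
        self.next_oid += 1;
        self.objects.insert(oid, bank);
        oid
    }
    fn is_descendant(&self, mut bank: u32, ancestor: u32) -> bool {
        loop {
            if bank == ancestor { return true; }
            match self.parent.get(&bank) {
                Some(Some(p)) => bank = *p,
                _ => return false,
            }
        }
    }
    /// Bulk deallocation: destroying a bank reclaims ALL objects
    /// allocated from it or from any bank beneath it.
    fn destroy_bank(&mut self, bank: u32) {
        let doomed: Vec<u32> = self.parent.keys().copied()
            .filter(|&b| self.is_descendant(b, bank)).collect();
        self.objects.retain(|_, owner| !doomed.contains(owner));
        for b in &doomed { self.parent.remove(b); }
    }
}

/// Returns how many objects survive after destroying the subsystem bank.
fn demo() -> usize {
    let mut banks = Banks::new();
    banks.create_bank(1, 0); // subsystem bank under the root
    banks.create_bank(2, 1); // child bank inside the subsystem
    banks.allocate(0);       // root-owned object: survives
    banks.allocate(1);
    banks.allocate(2);
    banks.destroy_bank(1);   // terminate the whole subsystem
    banks.objects.len()
}

fn main() {
    assert_eq!(demo(), 1);
}
```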


6. Relevance to capOS

6.1 Cap’n Proto as Persistent Capability Format

EROS stores capabilities as (type, OID, permissions) tuples in fixed-size node slots. capOS can do something analogous but more naturally, because Cap’n Proto already provides a serialization format for structured data:

A persistent capability in capOS could be a capnp struct:

struct PersistentCapRef {
  interfaceId @0 :UInt64;   # which capability interface
  objectId @1 :UInt64;      # persistent object identity
  permissions @2 :UInt32;   # bitmask of allowed methods
  epoch @3 :UInt64;         # revocation epoch (see below)
}

Why this works well with Cap’n Proto:

  • Zero-copy persistence: A capnp message in memory has the same byte layout as on disk. No serialization/deserialization step for persistence. This is the closest a modern system can get to EROS’s single-level store without hardware support.
  • Schema evolution: Cap’n Proto’s backwards-compatible schema evolution means persistent capability formats can evolve without breaking existing stored capabilities.
  • Cross-machine references: The same PersistentCapRef can reference a local or remote object. The objectId can include a machine/node identifier for distributed capabilities.
  • Type safety: The interfaceId field provides runtime type checking that EROS’s keys lacked (EROS keys are untyped references; the type is determined by the target object, not the key).

Difference from EROS:

EROS capabilities are kernel objects – the kernel knows about every key and mediates every invocation. In capOS, PersistentCapRef could be a user-space construct – a serialized reference that is resolved by the kernel (or a userspace capability manager) when invoked. This is a deliberate trade-off: less kernel complexity, more flexibility, but the kernel must validate references on use rather than at creation time.

6.2 Checkpoint/Restart Patterns for capOS

EROS’s checkpoint model provides several patterns capOS could adopt:

Pattern 1: Application-Level Explicit Persistence (Phase 1)

This is what capOS’s storage proposal already describes: services serialize their own state to the Store capability. It is simpler than EROS’s transparent persistence but requires application cooperation.

```text
Service state → capnp serialize → Store.put(data) → persistent hash
On restart: Store.get(hash) → capnp deserialize → restore state
```
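A minimal sketch of this round trip, with a hash map standing in for the Store capability and plain little-endian bytes standing in for capnp serialization. `Store::put`/`Store::get` and `ServiceState` are hypothetical names, not the proposal's API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Mock content-addressed Store capability: put returns a hash,
/// get resolves the hash back to the bytes.
struct Store {
    blobs: HashMap<u64, Vec<u8>>,
}

impl Store {
    fn put(&mut self, data: Vec<u8>) -> u64 {
        let mut h = DefaultHasher::new();
        data.hash(&mut h);
        let key = h.finish();
        self.blobs.insert(key, data);
        key
    }
    fn get(&self, key: u64) -> Option<&Vec<u8>> {
        self.blobs.get(&key)
    }
}

/// Stand-in for a service's capnp-serializable state.
#[derive(Debug, PartialEq)]
struct ServiceState {
    counter: u32,
}

impl ServiceState {
    fn serialize(&self) -> Vec<u8> {
        self.counter.to_le_bytes().to_vec()
    }
    fn deserialize(bytes: &[u8]) -> Self {
        ServiceState { counter: u32::from_le_bytes(bytes.try_into().unwrap()) }
    }
}

fn demo() -> ServiceState {
    let mut store = Store { blobs: HashMap::new() };
    // Service state -> serialize -> Store.put -> persistent hash.
    let hash = store.put(ServiceState { counter: 42 }.serialize());
    // "On restart": Store.get(hash) -> deserialize -> restored state.
    ServiceState::deserialize(store.get(hash).unwrap())
}

fn main() {
    assert_eq!(demo(), ServiceState { counter: 42 });
}
```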

Advantages over EROS transparent persistence:

  • No kernel complexity for checkpointing.
  • Services control what is persistent and what is transient.
  • No “checkpoint pause” – services choose when to persist.
  • Natural fit with Cap’n Proto (state is already capnp).

Disadvantages:

  • Every service must implement save/restore logic.
  • No automatic consistency across services (each saves independently).
  • Programmer error can lead to inconsistent state after restart.

Pattern 2: Kernel-Assisted Checkpointing (Phase 2)

Add a Checkpoint capability that captures process state:

```capnp
interface Checkpoint {
  # Save the calling process's state (registers, memory, cap table)
  save @0 () -> (handle :Data);
  # Restore a previously saved state
  restore @1 (handle :Data) -> ();
}
```

This is analogous to CRIU (Checkpoint/Restore in Userspace) on Linux but capability-mediated:

  • The kernel captures the process’s address space, register state, and capability table.
  • State is serialized as capnp messages and stored via the Store capability.
  • Restore creates a new process from the saved state.

Advantages:

  • Transparent to the application (no save/restore logic needed).
  • Can capture the full capability graph of a process.
  • Enables process migration between machines.

Disadvantages:

  • Kernel complexity for state capture.
  • Must handle capabilities that reference volatile state (open network connections, device handles).
  • Memory overhead for copy-on-write snapshots.

Pattern 3: Consistent Multi-Process Checkpointing (Phase 3)

EROS’s global checkpoint extended to capOS:

  • A CheckpointCoordinator service initiates a distributed snapshot.
  • All participating services freeze, checkpoint their state, then resume.
  • The coordinator records a consistent cut across all services.
  • Recovery restores all services to the same consistent point.

This requires:

  • A coordination protocol (similar to distributed database commit).
  • Services must participate in the protocol (register with the coordinator, respond to freeze/checkpoint/resume signals).
  • The coordinator must handle failures during the checkpoint itself.

This is the most complex option but provides the strongest consistency guarantees. It’s appropriate for capOS’s later stages when multi-service reliability matters.
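The freeze/checkpoint/resume cycle can be sketched as an in-process simulation. The phases and names (`Phase`, `coordinate`) are illustrative; a real coordinator would be a separate service driving other services over IPC:

```rust
/// Toy consistent-cut coordinator: every participating service freezes
/// before any service checkpoints, so the recorded states form one
/// consistent snapshot (illustrative names, not capOS API).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Phase {
    Running,
    Frozen,
    Checkpointed,
}

struct Service {
    name: &'static str,
    phase: Phase,
    state: u32,
}

fn coordinate(services: &mut [Service]) -> Vec<(&'static str, u32)> {
    // Phase 1: freeze everyone, so no mutation straddles the cut.
    for s in services.iter_mut() {
        s.phase = Phase::Frozen;
    }
    // Phase 2: record each frozen service's state as the consistent cut.
    let cut: Vec<_> = services
        .iter_mut()
        .map(|s| {
            s.phase = Phase::Checkpointed;
            (s.name, s.state)
        })
        .collect();
    // Phase 3: resume normal operation.
    for s in services.iter_mut() {
        s.phase = Phase::Running;
    }
    cut
}

fn main() {
    let mut svcs = [
        Service { name: "store", phase: Phase::Running, state: 7 },
        Service { name: "net", phase: Phase::Running, state: 9 },
    ];
    let cut = coordinate(&mut svcs);
    assert_eq!(cut, vec![("store", 7), ("net", 9)]);
    assert!(svcs.iter().all(|s| s.phase == Phase::Running));
}
```

Recovery would restore every service from the same cut, which is what gives the "all services roll back together" guarantee.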

6.3 Capability-Native Filesystem Design

EROS’s model and capOS’s Store proposal can be synthesized into a capability-native filesystem design:

Hybrid approach: Content-Addressed Store + Capability Metadata

capOS’s current Store proposal uses content-addressed storage (hash-based). This is good for immutable data but awkward for capability references (a capability’s target may change without the capability itself changing).

A better model, informed by EROS:

Persistent Object = (ObjectId, Version, CapnpData, CapSlots[])

Where:

  • ObjectId is a persistent identity (like EROS’s OID).
  • Version is a monotonic counter (for optimistic concurrency).
  • CapnpData is the object’s data payload as a capnp message.
  • CapSlots[] is a list of capability references embedded in the object (like EROS’s node slots).

This separates data from capability references, which is important because:

  • Data can be content-addressed (deduplicated by hash).
  • Capability references must be identity-addressed (two identical-looking references to different objects are different).
  • Revocation operates on capability references, not data.
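The data/identity split can be made concrete with a small Rust sketch of the tuple above. The field names mirror the model; the hash function (FNV-1a) is a stand-in for whatever content hash the Store uses:

```rust
/// Sketch of the persistent-object layout: data is content-addressed,
/// capability slots are identity-addressed (illustrative names).
struct CapSlot {
    object_id: u64, // identity of the target, not a hash of its bytes
    epoch: u64,     // revocation epoch
}

struct PersistentObject {
    object_id: u64,          // persistent identity (like an EROS OID)
    version: u64,            // monotonic counter for optimistic concurrency
    data: Vec<u8>,           // capnp payload, deduplicable by hash
    cap_slots: Vec<CapSlot>, // embedded capability references
}

/// FNV-1a as a stand-in content hash.
fn content_hash(data: &[u8]) -> u64 {
    data.iter()
        .fold(0xcbf29ce484222325u64, |h, b| (h ^ *b as u64).wrapping_mul(0x100000001b3))
}

fn demo() -> (bool, bool) {
    let a = PersistentObject {
        object_id: 1, version: 1, data: vec![9],
        cap_slots: vec![CapSlot { object_id: 10, epoch: 0 }],
    };
    let b = PersistentObject {
        object_id: 2, version: 1, data: vec![9],
        cap_slots: vec![CapSlot { object_id: 20, epoch: 0 }],
    };
    // Identical payloads share a data hash (dedupable), but the objects
    // and their embedded capabilities remain distinct identities.
    (content_hash(&a.data) == content_hash(&b.data), a.object_id == b.object_id)
}

fn main() {
    let (same_data, same_identity) = demo();
    assert!(same_data && !same_identity);
}
```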

The Namespace as Directory

capOS’s Namespace capability is the capability-native equivalent of a directory:

| Unix | EROS | capOS |
|---|---|---|
| Directory (inode + dentries) | Node with keys in slots | Namespace capability |
| Path traversal | Node tree walk | Namespace.resolve() chain |
| Permission bits | Key type + slot permissions | Capability attenuation |
| Hard links | Multiple keys to same object | Multiple refs to same hash |
| Symbolic links | Forwarder keys | Redirect capabilities |

Journaling and Crash Consistency

EROS avoids journaling by using checkpoint-based consistency. capOS’s Store service needs its own consistency story:

Option A: Checkpoint-based (EROS-style)

  • Store service maintains an in-memory cache of recent modifications.
  • Periodically flushes a consistent snapshot to disk.
  • On crash, recovers to last flush point.
  • Simple but may lose recent writes.

Option B: Log-structured (modern)

  • All writes go to an append-only log.
  • A background compaction process builds indexed snapshots from the log.
  • On crash, replay the log from the last snapshot.
  • More complex but no data loss window.

Option C: Hybrid

  • Capability metadata (the namespace bindings) uses a write-ahead log for crash consistency.
  • Object data (capnp blobs in the content-addressed store) uses checkpoint-based consistency (losing a few blobs is tolerable; losing a namespace binding is not).

Option C is recommended for capOS: it provides strong consistency for the critical metadata while keeping the data path simple.
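Option C's asymmetry — bindings are never lost, blob data rolls back to the last checkpoint — can be demonstrated with a toy store. All structures here are illustrative, not the Store service's actual design:

```rust
use std::collections::HashMap;

/// Toy Option C store: namespace bindings go through a write-ahead log
/// (replayed on recovery), blob data is only as durable as the last
/// checkpoint (illustrative sketch).
#[derive(Default)]
struct HybridStore {
    wal: Vec<(String, u64)>,                   // durable: append-only binding log
    checkpointed_blobs: HashMap<u64, Vec<u8>>, // durable: last checkpoint
    blobs: HashMap<u64, Vec<u8>>,              // volatile: in-memory blobs
}

impl HybridStore {
    fn bind(&mut self, name: &str, hash: u64, blob: Vec<u8>) {
        self.wal.push((name.to_string(), hash)); // logged before use
        self.blobs.insert(hash, blob);           // not yet durable
    }
    fn checkpoint(&mut self) {
        self.checkpointed_blobs = self.blobs.clone();
    }
    /// Crash + recovery: bindings replay from the WAL; blob data
    /// rolls back to the last checkpoint.
    fn recover(&self) -> (HashMap<String, u64>, HashMap<u64, Vec<u8>>) {
        let mut names = HashMap::new();
        for (name, hash) in &self.wal {
            names.insert(name.clone(), *hash);
        }
        (names, self.checkpointed_blobs.clone())
    }
}

fn main() {
    let mut s = HybridStore::default();
    s.bind("init", 1, vec![1]);
    s.checkpoint();
    s.bind("late", 2, vec![2]); // bound after the checkpoint
    let (names, blobs) = s.recover();
    assert_eq!(names.len(), 2);       // no binding is ever lost...
    assert!(!blobs.contains_key(&2)); // ...but the late blob's data is
}
```

A recovered binding whose blob was lost becomes a dangling reference the Store can surface as an error, which is the tolerable failure mode the text describes.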

6.4 Transparent vs Explicit Persistence: Tradeoffs

| Aspect | EROS Transparent | capOS Explicit | Hybrid |
|---|---|---|---|
| Application complexity | None (automatic) | High (must implement save/restore) | Medium (opt-in transparency) |
| Kernel complexity | Very high (checkpoint, COW, object store) | Low (just IPC and memory) | Medium (checkpoint capability) |
| Consistency | Strong (global checkpoint) | Weak (per-service) | Medium (coordinator) |
| Control | None (everything persists) | Full (choose what to save) | Selective |
| Performance | Checkpoint pauses | No pauses, explicit I/O cost | Configurable |
| Volatile state | Keeper mechanism handles | Service handles reconnection | Annotated capabilities |
| Debuggability | Hard (system is a black box) | Easy (state is explicit capnp) | Medium |
| Cap’n Proto fit | Neutral | Excellent (state = capnp) | Good |

Recommendation for capOS:

Start with explicit persistence (Phase 1 in the storage proposal) because:

  1. It’s dramatically simpler to implement.
  2. Cap’n Proto makes serialization nearly free anyway.
  3. It gives services control over what is persistent.
  4. It aligns with capOS’s existing Store/Namespace design.
  5. The kernel stays simple.

Then add opt-in kernel-assisted checkpointing (like the Checkpoint capability described above) for services that want transparent persistence. This gives the benefits of EROS’s model without forcing it on everything.

Never implement EROS’s fully transparent global persistence – the kernel complexity is enormous, the debugging experience is poor, and modern systems (with fast SSDs and capnp zero-copy serialization) don’t need it. The explicit model with good tooling is strictly better for a research OS.

6.5 Capability Revocation in capOS

EROS’s forwarder key model translates directly to capOS:

Epoch-based revocation:

Each capability includes a revocation epoch. The kernel (or capability manager) maintains a per-object epoch counter. When a capability is invoked:

  1. Check that the capability’s epoch matches the object’s current epoch.
  2. If it doesn’t match, the capability has been revoked – return an error.
  3. To revoke all capabilities to an object, increment the object’s epoch.

This is O(1) revocation (increment a counter) with O(1) check per invocation (compare two integers). It’s simpler than EROS’s forwarder mechanism and fits naturally into a capnp-serialized capability reference:

```capnp
struct CapRef {
  objectId @0 :UInt64;
  epoch @1 :UInt64;        # revocation epoch
  permissions @2 :UInt32;  # method bitmask
  interfaceId @3 :UInt64;  # type of the capability
}
```
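The check-and-revoke cycle is small enough to sketch in full. `Registry` stands in for the kernel's object table; the names are illustrative, not capOS API:

```rust
use std::collections::HashMap;

/// Epoch-based revocation sketch: O(1) revoke (bump a counter),
/// O(1) check per invocation (compare two integers).
#[derive(Clone, Copy)]
struct CapRef {
    object_id: u64,
    epoch: u64,
}

#[derive(Default)]
struct Registry {
    epochs: HashMap<u64, u64>, // per-object current epoch
}

impl Registry {
    /// Minting stamps the capability with the object's current epoch.
    fn mint(&mut self, object_id: u64) -> CapRef {
        let epoch = *self.epochs.entry(object_id).or_insert(0);
        CapRef { object_id, epoch }
    }
    /// Invocation checks the cap's epoch against the object's epoch.
    fn invoke(&self, cap: CapRef) -> Result<(), &'static str> {
        match self.epochs.get(&cap.object_id) {
            Some(e) if *e == cap.epoch => Ok(()),
            Some(_) => Err("capability revoked"),
            None => Err("no such object"),
        }
    }
    /// Revoking every outstanding cap to an object is one increment.
    fn revoke_all(&mut self, object_id: u64) {
        *self.epochs.entry(object_id).or_insert(0) += 1;
    }
}

fn main() {
    let mut reg = Registry::default();
    let cap = reg.mint(42);
    assert!(reg.invoke(cap).is_ok());
    reg.revoke_all(42);
    assert_eq!(reg.invoke(cap), Err("capability revoked"));
    let fresh = reg.mint(42); // caps minted afterwards carry the new epoch
    assert!(reg.invoke(fresh).is_ok());
}
```

Note that revocation is per-object, not per-capability: bumping the epoch invalidates every outstanding reference at once, exactly the bulk-revoke semantics the space-bank analog below also needs.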

Space bank analog:

capOS can implement EROS’s space bank pattern using the Store:

  • Each “bank” is a Namespace prefix in the Store.
  • Objects allocated by a service are stored under its namespace.
  • Destroying the service’s namespace revokes access to all its objects.
  • Resource accounting is done by the Store service (track bytes per namespace).
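A toy version of the namespace-as-bank pattern, with prefixed keys and per-namespace byte accounting. The `Store` shape here is hypothetical, a sketch of the pattern rather than the Store service's interface:

```rust
use std::collections::HashMap;

/// Space-bank-style reclamation over a namespace: every object a
/// service stores lives under its namespace prefix, and destroying
/// the prefix revokes access to all of them at once (illustrative).
#[derive(Default)]
struct Store {
    entries: HashMap<String, Vec<u8>>,  // "namespace/key" -> data
    bytes_used: HashMap<String, usize>, // per-namespace accounting
}

impl Store {
    fn put(&mut self, namespace: &str, key: &str, data: Vec<u8>) {
        *self.bytes_used.entry(namespace.to_string()).or_insert(0) += data.len();
        self.entries.insert(format!("{namespace}/{key}"), data);
    }
    /// Destroying the namespace reclaims every object under it and
    /// drops its accounting: the analog of destroying a space bank.
    fn destroy_namespace(&mut self, namespace: &str) {
        let prefix = format!("{namespace}/");
        self.entries.retain(|k, _| !k.starts_with(&prefix));
        self.bytes_used.remove(namespace);
    }
}

fn main() {
    let mut store = Store::default();
    store.put("svc.net", "conn-table", vec![0; 16]);
    store.put("svc.log", "ring", vec![0; 8]);
    store.destroy_namespace("svc.net"); // terminate the subsystem
    assert_eq!(store.entries.len(), 1); // only svc.log survives
    assert_eq!(store.bytes_used.get("svc.net"), None);
}
```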

6.6 Summary of Recommendations

| EROS/CapROS/Coyotos Concept | capOS Recommendation |
|---|---|
| Single-level store | Don’t implement (too complex for research OS). Use Cap’n Proto zero-copy as a lightweight equivalent. |
| Checkpoint/restart | Phase 1: application-level (explicit capnp save/restore). Phase 2: Checkpoint capability for opt-in transparent persistence. |
| Persistent capabilities | Use capnp PersistentCapRef struct with objectId + epoch. Store capability graph in the Store service. |
| Capability revocation | Epoch-based revocation (increment counter, check on invocation). Simpler than EROS forwarders, same O(1) cost. |
| Space banks | Map to Store namespaces. Destroying a namespace reclaims all objects. |
| Keeper/fault handler | Map to capOS’s supervisor mechanism (service-architecture proposal). Supervisor receives fault notifications and can restart/repair. |
| GPTs (Coyotos) | Not needed – capOS uses hardware page tables directly. The sparse address-space idea remains relevant for future SharedBuffer/AddressRegion work beyond the current VirtualMemory cap. |
| Confinement | capOS already has the structural prerequisites (no ambient authority). Formal confinement proofs are a future research direction. |
| Device isolation | Already planned in capOS (device capabilities with MMIO/interrupt/DMA grants). CapROS validates this approach works in practice. |

Key References

  • Shapiro, J. S., Smith, J. M., Farber, D. J. “EROS: A Fast Capability System.” Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), 1999.
  • Shapiro, J. S. “EROS: A Capability System.” PhD dissertation, University of Pennsylvania, 1999.
  • Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism.” IEEE Symposium on Security and Privacy, 2000.
  • Hardy, N. “The Confused Deputy.” ACM SIGOPS Operating Systems Review, 1988. (Motivates capability-based access control.)
  • Hardy, N. “KeyKOS Architecture.” Operating Systems Review, 1985.
  • Landau, C. R. “The Checkpoint Mechanism in KeyKOS.” Proceedings of the Second International Workshop on Object Orientation in Operating Systems, 1992.
  • Shapiro, J. S. et al. “Coyotos Microkernel Specification.” Technical report, Johns Hopkins University, 2004-2008.
  • Shapiro, J. S. et al. “BitC Language Specification.” Technical report, Johns Hopkins University, 2004-2008.
  • Dennis, J. B. & Van Horn, E. C. “Programming Semantics for Multiprogrammed Computations.” Communications of the ACM, 1966. (Original capability concept.)
  • Levy, H. M. “Capability-Based Computer Systems.” Digital Press, 1984. (Comprehensive survey of capability systems including CAP, Hydra, iAPX 432, StarOS.)

LLVM Target Customization for capOS

Deep research report on creating custom LLVM/Rust/Go targets for a capability-based OS.

Status as of 2026-04-22: capOS still builds kernel and userspace with x86_64-unknown-none plus linker-script/build flags. A checked-in x86_64-unknown-capos custom target does not exist yet. Since this report was first written, PT_TLS parsing, userspace TLS block setup, FS-base save/restore, the VirtualMemory capability, and a #[thread_local] QEMU smoke have landed. Thread creation, a user-controlled FS-base syscall, futexes, a timer capability, and a Go port remain future work.

Table of Contents

  1. Custom OS Target Triple
  2. Calling Conventions
  3. Relocations
  4. TLS (Thread-Local Storage) Models
  5. Rust Target Specification
  6. Go Runtime Requirements
  7. Relevance to capOS

1. Custom OS Target Triple

Target Triple Format

LLVM target triples follow the format <arch>-<vendor>-<os> or <arch>-<vendor>-<os>-<env>:

  • arch: x86_64, aarch64, riscv64gc, etc.
  • vendor: unknown, apple, pc, etc. (often unknown for custom OSes)
  • os: linux, none, redox, hermit, fuchsia, etc.
  • env (optional): gnu, musl, eabi, etc.

For capOS, the eventual userspace target triple should be x86_64-unknown-capos. The kernel should keep using a freestanding target (x86_64-unknown-none) unless a kernel-specific target file becomes useful for build hygiene.

What LLVM Needs

LLVM’s target description consists of:

  1. Target machine: Architecture (instruction set, register file, calling conventions). x86_64 already exists in LLVM.
  2. Object format: ELF, COFF, Mach-O. capOS uses ELF.
  3. Relocation model: static, PIC, PIE, dynamic-no-pic.
  4. Code model: small, kernel, medium, large.
  5. OS-specific ABI details: Stack alignment, calling convention defaults, TLS model, exception handling mechanism.

LLVM does NOT need kernel-level knowledge of your OS. It needs to know how to generate correct object code for the target environment. The OS name in the triple primarily affects:

  • Default calling convention selection
  • Default relocation model
  • TLS model selection
  • Object file format and flags
  • C library assumptions (relevant for C compilation, less for Rust no_std)

Creating a New OS in LLVM (Upstream Path)

To add capos as a recognized OS in LLVM itself:

  1. Add the OS to llvm/include/llvm/TargetParser/Triple.h (the OSType enum)
  2. Add string parsing in llvm/lib/TargetParser/Triple.cpp
  3. Define ABI defaults in the relevant target (llvm/lib/Target/X86/)
  4. Update Clang’s driver for the new OS (clang/lib/Driver/ToolChains/, clang/lib/Basic/Targets/)

This is significant upstream work and not necessary initially. The pragmatic path is using Rust’s custom target JSON mechanism (see Section 5).

What Other OSes Do

| OS | LLVM status | Approach |
|---|---|---|
| Redox | Upstream in Rust; no dedicated LLVM OS enum in current LLVM | Full triple x86_64-unknown-redox, Tier 2 in Rust |
| Hermit | Upstream in LLVM and Rust | x86_64-unknown-hermit, Tier 3, unikernel |
| Fuchsia | Upstream in LLVM and Rust | x86_64-unknown-fuchsia, Tier 2 |
| Theseus | Custom target JSON | Uses x86_64-unknown-theseus JSON spec, not upstream |
| Blog OS (phil-opp) | Custom target JSON | Uses JSON target spec, targets x86_64-unknown-none base |
| seL4/Robigalia | Custom target JSON | Modified from x86_64-unknown-none |
Recommendation for capOS: keep the kernel on x86_64-unknown-none. Introduce a userspace-only custom target JSON when cfg(target_os = "capos") or toolchain packaging becomes valuable. Do not upstream a capos OS triple until the userspace ABI is stable.
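For concreteness, a userspace x86_64-unknown-capos target JSON might look like the sketch below, derived from the x86_64-unknown-none defaults with the userspace-appropriate choices called out in this report (small code model, red zone enabled). Every field value here is an assumption for illustration, not a checked-in file; the data-layout string in particular must match the pinned LLVM version exactly.

```
{
  "llvm-target": "x86_64-unknown-none-elf",
  "arch": "x86_64",
  "os": "capos",
  "target-pointer-width": "64",
  "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
  "linker-flavor": "ld.lld",
  "linker": "rust-lld",
  "panic-strategy": "abort",
  "relocation-model": "static",
  "code-model": "small",
  "disable-redzone": false
}
```

Building against it would use `cargo build --target x86_64-unknown-capos.json -Zbuild-std=core,alloc`, which is the same custom-target workflow Theseus and Blog OS use.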


2. Calling Conventions

LLVM Calling Conventions

LLVM supports numerous calling conventions. The ones relevant to capOS:

| CC | LLVM ID | Description | Relevance |
|---|---|---|---|
| C | 0 | Default C calling convention (System V AMD64 ABI on x86_64) | Primary for interop |
| Fast | 8 | Optimized for internal use, passes in registers | Rust internal use |
| Cold | 9 | Rarely-called functions, callee-save heavy | Error paths |
| GHC | 10 | Glasgow Haskell Compiler, everything in registers | Not relevant |
| HiPE | 11 | Erlang HiPE, similar to GHC | Not relevant |
| WebKit JS | 12 | JavaScript JIT | Not relevant |
| AnyReg | 13 | Dynamic register allocation | JIT compilers |
| PreserveMost | 14 | Caller saves almost nothing | Interrupt handlers |
| PreserveAll | 15 | Caller saves nothing | Context switches |
| Swift | 16 | Swift self/error registers | Not relevant |
| CXX_FAST_TLS | 17 | C++ TLS access optimization | TLS wrappers |
| X86_StdCall | 64 | Windows stdcall | Not relevant |
| X86_FastCall | 65 | Windows fastcall | Not relevant |
| X86_RegCall | 95 | Register-based calling | Performance-critical code |
| X86_INTR | 83 | x86 interrupt handler | IDT handlers |
| Win64 | 79 | Windows x64 calling convention | Not relevant |

System V AMD64 ABI (The Default for capOS)

On x86_64, the System V AMD64 ABI (CC 0, “C”) is the standard:

  • Integer args: RDI, RSI, RDX, RCX, R8, R9
  • Float args: XMM0-XMM7
  • Return: RAX (integer), XMM0 (float)
  • Caller-saved: RAX, RCX, RDX, RSI, RDI, R8-R11, XMM0-XMM15
  • Callee-saved: RBX, RBP, R12-R15
  • Stack alignment: 16-byte at call site
  • Red zone: 128 bytes below RSP (unavailable in kernel mode)

capOS already uses this convention – the syscall handler in kernel/src/arch/x86_64/syscall.rs maps syscall registers to System V registers before calling syscall_handler.

Customizing for a New OS Target

For a custom OS, calling convention customization is usually minimal:

  1. Kernel code: Disable the red zone (capOS already does this via x86_64-unknown-none which sets "disable-redzone": true). The red zone is unsafe in interrupt/syscall contexts.

  2. Userspace code: Standard System V ABI is fine. The red zone is safe in userspace.

  3. Syscall convention: This is an OS design choice, not an LLVM CC. capOS uses: RAX=syscall number, RDI-R9=args (matching System V for easy dispatch). Linux uses a slightly different register mapping (R10 instead of RCX for arg4, because SYSCALL clobbers RCX).

  4. Interrupt handlers: Use X86_INTR (CC 83) or manual save/restore. capOS currently uses manual asm stubs.

Cross-Language Interop Implications

| Languages | Convention | Notes |
|---|---|---|
| Rust <-> Rust | Rust ABI (unstable) | Internal to a crate, not stable across crates |
| Rust <-> C | extern "C" (System V) | Stable, well-defined. Used for libcapos API |
| Rust <-> Go | Complex (see Section 6) | Go has its own internal ABI (ABIInternal) |
| C <-> Go | extern "C" via cgo | Go’s cgo bridge, heavy overhead |
| Any <-> Kernel | Syscall convention | Register-based, OS-defined, not a CC |

Key point: The System V AMD64 ABI is the lingua franca. All languages can produce extern "C" functions. capOS should standardize on System V for all cross-language boundaries and capability invocations.

Go’s internal ABI (ABIInternal, using R14 as the g register) is different from System V. Go functions called from outside Go must go through a trampoline. This is handled by the Go runtime, not something capOS needs to solve at the LLVM level.


3. Relocations

LLVM Relocation Models

| Model | Flag | Description |
|---|---|---|
| static | -relocation-model=static | All addresses resolved at link time. No GOT/PLT. |
| pic | -relocation-model=pic | Position-independent code. Uses GOT for globals, PLT for calls. |
| dynamic-no-pic | -relocation-model=dynamic-no-pic | Like static but with dynamic linking support (macOS legacy). |
| ropi | -relocation-model=ropi | Read-only position-independent (ARM embedded). |
| rwpi | -relocation-model=rwpi | Read-write position-independent (ARM embedded). |
| ropi-rwpi | -relocation-model=ropi-rwpi | Both ROPI and RWPI (ARM embedded). |

Code Models (x86_64)

| Model | Flag | Address Range | Use Case |
|---|---|---|---|
| small | -code-model=small | 0 to 2GB | Userspace default |
| kernel | -code-model=kernel | Top 2GB (negative 32-bit) | Higher-half kernel |
| medium | -code-model=medium | Code in low 2GB, data anywhere | Large data sets |
| large | -code-model=large | No assumptions | Maximum flexibility, worst performance |

What capOS Currently Uses

From .cargo/config.toml:

[target.x86_64-unknown-none]
rustflags = ["-C", "link-arg=-Tkernel/linker-x86_64.ld", "-C", "code-model=kernel", "-C", "relocation-model=static"]

  • Kernel: code-model=kernel + relocation-model=static. Correct for a higher-half kernel at 0xffffffff80000000. All kernel symbols are in the top 2GB of virtual address space, so 32-bit sign-extended addressing works.

  • Init/demos/capos-rt userspace: The standalone userspace crates also target x86_64-unknown-none, pass -Crelocation-model=static, and select their linker scripts through per-crate build.rs files. The binaries are loaded at 0x200000. The pinned local toolchain (rustc 1.97.0-nightly, LLVM 22.1.2) prints x86_64-unknown-none with llvm-target = "x86_64-unknown-none-elf", code-model = "kernel", soft-float ABI, inline stack probes, and static PIE-capable defaults. A future x86_64-unknown-capos userspace target should set code-model = "small" explicitly instead of inheriting the freestanding kernel-oriented default.

Kernel vs. Userspace Requirements

Kernel:

  • Static relocations, kernel code model.
  • No PIC overhead needed – the kernel is loaded at a known address.
  • The linker script places everything in the higher half.
  • This is the correct and standard approach (Linux kernel does the same).

Userspace (current – static binaries):

  • Static relocations. A future custom userspace target should choose the small code model explicitly.
  • Simple, no runtime relocator needed.
  • Binary is loaded at a fixed address (0x200000).
  • Works perfectly for single-binary-per-address-space.

Userspace (future – if shared libraries or ASLR desired):

  • PIE (Position-Independent Executable): the executable itself is built from PIC code; a static PIE additionally combines PIC with static linking to avoid a dynamic linker.
  • Requires a dynamic loader or kernel-side relocator.
  • Enables ASLR (Address Space Layout Randomization) for security.
  • Adds GOT indirection overhead (typically < 5% performance impact).

Position-Independent Code in a Capability Context

PIC/PIE is relevant to capOS for several reasons:

  1. ASLR: PIE enables loading binaries at random addresses, making ROP attacks harder. Even in a capability system, defense-in-depth matters.

  2. Shared libraries: If capOS ever supports shared objects (e.g., a shared libcapos.so), PIC is required for the shared library.

  3. WASI/Wasm: Not relevant – Wasm has its own memory model.

  4. Multiple instances: With static linking, two instances of the same binary can share read-only pages (text, rodata) if loaded at the same address. PIC/PIE allows sharing even at different addresses (copy-on-write for the GOT).

Recommendation for capOS: Keep static relocation for now. Consider PIE for userspace when implementing ASLR (after threading and IPC are stable). The kernel should remain static forever.


4. TLS (Thread-Local Storage) Models

LLVM TLS Models

LLVM supports four TLS models, in order from most dynamic to most constrained:

| Model | Description | Runtime Requirement | Performance |
|---|---|---|---|
| general-dynamic | Any module, any time | Full __tls_get_addr via dynamic linker | Slowest (function call per access) |
| local-dynamic | Same module, any time | __tls_get_addr for module base, then offset | Slow (one call per module per thread) |
| initial-exec | Only modules loaded at startup | GOT slot populated by dynamic linker | Fast (one memory load) |
| local-exec | Main executable only | Direct FS/GS offset, known at link time | Fastest (single instruction) |

How TLS Works on x86_64

On x86_64, TLS is accessed via the FS segment register:

  1. The OS sets the FS base address for each thread (via MSR_FS_BASE or arch_prctl(ARCH_SET_FS)).
  2. TLS variables are accessed as offsets from FS base:
    • local-exec: mov %fs:OFFSET, %rax (offset known at link time)
    • initial-exec: mov x@gottpoff(%rip), %rax; mov %fs:(%rax), %rdx (TLS offset loaded from a GOT slot at runtime, then applied to the FS base)
    • general-dynamic: call __tls_get_addr (returns pointer to TLS block)

Which Model for capOS?

Kernel:

  • The kernel does not use compiler TLS. Current TLS support is for loaded userspace ELF images only.
  • For SMP: per-CPU data via GS segment register (the standard approach). Set MSR_GS_BASE on each CPU to point to a PerCpu struct. swapgs on kernel entry switches between user and kernel GS base.
  • Kernel TLS model: Not applicable (per-CPU data is accessed via GS, not the compiler’s TLS mechanism).

Userspace (static binaries, no dynamic linker):

  • local-exec is the only correct choice. There’s no dynamic linker to resolve TLS relocations, so general-dynamic and initial-exec won’t work.
  • Implemented for the current single-threaded process model: the ELF parser records PT_TLS, the loader maps a Variant II TLS block plus TCB self pointer, and the scheduler saves/restores FS base on context switch.
  • Still missing for future threading and Go: a syscall or capability-authorized operation equivalent to arch_prctl(ARCH_SET_FS) so a runtime can set each OS thread’s FS base itself.

Userspace (with dynamic linker, future):

  • initial-exec for the main executable and preloaded libraries.
  • general-dynamic for dlopen()-loaded libraries.
  • Requires implementing __tls_get_addr in the dynamic linker.

TLS Initialization Sequence

For a statically-linked userspace binary with local-exec TLS:

1. Kernel creates thread
2. Kernel allocates TLS block (size from ELF TLS program header)
3. Kernel copies .tdata (initialized TLS) into TLS block
4. Kernel zeros .tbss (uninitialized TLS) in TLS block
5. Kernel sets FS base = TLS block address (writes MSR_FS_BASE)
6. Thread starts executing; %fs:OFFSET accesses TLS directly

The ELF file contains two TLS sections:

  • .tdata (PT_TLS segment, initialized thread-local data)
  • .tbss (zero-initialized thread-local data, like .bss but per-thread)

The PT_TLS program header tells the loader:

  • Virtual address and file offset of .tdata
  • p_memsz = total TLS size (including .tbss)
  • p_filesz = size of .tdata only
  • p_align = required alignment
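The Variant II layout math that follows from these PT_TLS fields can be sketched as plain arithmetic: the TLS block sits below the thread control block (TCB), FS base points at the TCB, and the TCB's first word is a self pointer. The struct and function names here are illustrative, not the actual capos-lib or kernel API.

```rust
// Sketch of x86_64 (Variant II) static TLS layout computed from PT_TLS.
// Allocation layout: [ .tdata | .tbss | TCB self-pointer ], with FS base
// pointing at the TCB word. Names are hypothetical.

struct PtTls {
    p_memsz: u64,  // total TLS size: .tdata + .tbss
    p_filesz: u64, // .tdata only
    p_align: u64,  // required alignment
}

struct TlsLayout {
    block_size: u64, // total allocation: TLS data + TCB word
    tcb_offset: u64, // offset of the TCB word (FS base = block base + this)
    tdata_size: u64, // bytes to copy from the ELF file
    tbss_size: u64,  // bytes to zero after .tdata
}

fn layout(tls: &PtTls) -> TlsLayout {
    // Round the TLS size up to the segment alignment so that negative
    // offsets from FS base (local-exec accesses) stay properly aligned.
    let align = tls.p_align.max(8);
    let aligned = (tls.p_memsz + align - 1) & !(align - 1);
    TlsLayout {
        block_size: aligned + 8, // + one pointer-sized self-pointer slot
        tcb_offset: aligned,
        tdata_size: tls.p_filesz,
        tbss_size: tls.p_memsz - tls.p_filesz,
    }
}

fn main() {
    let l = layout(&PtTls { p_memsz: 0x30, p_filesz: 0x10, p_align: 16 });
    assert_eq!(l.tcb_offset, 0x30);
    assert_eq!(l.tdata_size, 0x10);
    assert_eq!(l.tbss_size, 0x20);
    assert_eq!(l.block_size, 0x38);
}
```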

FS/GS Base Register Usage Plan

| Register | Used By | Purpose |
|---|---|---|
| FS | Userspace threads | Thread-local storage (set per-thread by kernel) |
| GS | Kernel (via swapgs) | Per-CPU data (set per-CPU during boot) |

This is the standard Linux convention and what Go expects (Go uses arch_prctl(ARCH_SET_FS) to set the FS base for each OS thread).

What capOS Has and Still Needs

  1. Implemented: parse PT_TLS in capos-lib/src/elf.rs.
  2. Implemented: allocate/map a TLS block during process image load in kernel/src/spawn.rs.
  3. Implemented: copy .tdata, zero .tbss, and write the TCB self pointer for the current Variant II static TLS layout.
  4. Implemented: save/restore FS base through kernel/src/sched.rs and kernel/src/arch/x86_64/tls.rs.
  5. Still needed: arch_prctl(ARCH_SET_FS) equivalent for Go settls() and future multi-threaded userspace.

5. Rust Target Specification

How Custom Targets Work

Rust supports custom targets via JSON specification files. The workflow:

  1. Create a <target-name>.json file
  2. Pass it to rustc: --target path/to/x86_64-unknown-capos.json
  3. Use with cargo via -Zbuild-std to build core/alloc/std from source

Target lookup priority:

  1. Built-in target names
  2. File path (if the target string contains / or .json)
  3. RUST_TARGET_PATH environment variable directories

The Rust target JSON schema is explicitly unstable. Generate examples from the pinned compiler with rustc -Z unstable-options --print target-spec-json and validate against that same compiler’s target-spec-json-schema before checking in a target file.

Viewing Existing Specs

# Print the JSON spec for a built-in target:
rustc +nightly -Z unstable-options --target=x86_64-unknown-none --print target-spec-json

# Print the JSON schema for all available fields:
rustc +nightly -Z unstable-options --print target-spec-json-schema

Example: x86_64-unknown-capos Kernel Target

Based on the current x86_64-unknown-none target, with capOS-specific adjustments. This is a sketch; regenerate from the pinned rustc schema before using it.

{
    "llvm-target": "x86_64-unknown-none-elf",
    "metadata": {
        "description": "capOS kernel (x86_64)",
        "tier": 3,
        "host_tools": false,
        "std": false
    },
    "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
    "arch": "x86_64",
    "cpu": "x86-64",
    "target-endian": "little",
    "target-pointer-width": 64,
    "target-c-int-width": 32,
    "os": "none",
    "env": "",
    "vendor": "unknown",
    "linker-flavor": "gnu-lld",
    "linker": "rust-lld",
    "pre-link-args": {
        "gnu-lld": ["-Tkernel/linker-x86_64.ld"]
    },
    "features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float",
    "disable-redzone": true,
    "panic-strategy": "abort",
    "code-model": "kernel",
    "relocation-model": "static",
    "rustc-abi": "softfloat",
    "executables": true,
    "exe-suffix": "",
    "has-thread-local": false,
    "position-independent-executables": false,
    "static-position-independent-executables": false,
    "plt-by-default": false,
    "max-atomic-width": 64,
    "stack-probes": { "kind": "inline" }
}

Example: x86_64-unknown-capos Userspace Target

{
    "llvm-target": "x86_64-unknown-none-elf",
    "metadata": {
        "description": "capOS userspace (x86_64)",
        "tier": 3,
        "host_tools": false,
        "std": false
    },
    "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
    "arch": "x86_64",
    "cpu": "x86-64",
    "target-endian": "little",
    "target-pointer-width": 64,
    "target-c-int-width": 32,
    "os": "capos",
    "env": "",
    "vendor": "unknown",
    "linker-flavor": "gnu-lld",
    "linker": "rust-lld",
    "pre-link-args": {
        "gnu-lld": ["-Tinit/linker.ld"]
    },
    "features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float",
    "disable-redzone": false,
    "panic-strategy": "abort",
    "code-model": "small",
    "relocation-model": "static",
    "rustc-abi": "softfloat",
    "executables": true,
    "exe-suffix": "",
    "has-thread-local": true,
    "position-independent-executables": false,
    "static-position-independent-executables": false,
    "max-atomic-width": 64,
    "plt-by-default": false,
    "stack-probes": { "kind": "inline" },
    "tls-model": "local-exec"
}

Key JSON Fields

| Field | Purpose | Typical Values |
|---|---|---|
| llvm-target | LLVM triple for code generation | x86_64-unknown-none-elf (reuse existing backend) |
| os | OS name (affects cfg(target_os = "...")) | "none", "capos", "linux" |
| arch | Architecture name | "x86_64", "aarch64" |
| data-layout | LLVM data layout string | Copy from same-arch target |
| linker-flavor | Which linker to use | "gnu-lld", "gcc", "msvc" |
| linker | Linker binary | "rust-lld", "ld.lld" |
| features | CPU features to enable/disable | Disable SIMD/FPU until context switching saves that state |
| disable-redzone | Disable System V red zone | true for kernel, false for userspace |
| code-model | LLVM code model | "kernel", "small" |
| relocation-model | LLVM relocation model | "static", "pic" |
| panic-strategy | How to handle panics | "abort", "unwind" |
| has-thread-local | Enable #[thread_local] | true for userspace now that PT_TLS/FS base works |
| tls-model | Default TLS model | "local-exec" for static binaries |
| max-atomic-width | Largest atomic type (bits) | 64 for x86_64 |
| pre-link-args | Arguments passed to linker before user args | Linker script path |
| position-independent-executables | Generate PIE by default | false for now |
| exe-suffix | Executable file extension | "" for ELF |
| stack-probes | Stack overflow detection mechanism | {"kind": "inline"} in the current freestanding x86_64 spec |

no_std vs std Support Path

Current state: capOS uses no_std + alloc. This works with any target, including x86_64-unknown-none.

Path to std support (what Redox, Hermit, and Fuchsia did):

  1. Phase 1: Custom target with os: "capos" (current report). Use -Zbuild-std=core,alloc to build core and alloc. No std.

  2. Phase 2: Add capOS to Rust’s std library. This requires:

    • Adding mod capos under library/std/src/sys/ with OS-specific implementations of: filesystem, networking, threads, time, stdio, process spawning, etc.
    • Each of these maps to capOS capabilities
    • Use cfg(target_os = "capos") throughout std
    • Build with -Zbuild-std=std
  3. Phase 3: Upstream the target (optional). Submit the target spec and std implementations to the Rust project. Requires sustained maintenance.

What Redox did: Redox implemented a full POSIX-like userspace (relibc) and added std support by implementing the sys module in terms of relibc syscalls. This made Redox a Tier 2 target with pre-built std artifacts.

What Hermit did: Hermit is a unikernel, so std is implemented directly in terms of Hermit’s kernel-level APIs. Tier 3, community maintained.

What Fuchsia did: Fuchsia implemented std using Fuchsia’s native zircon syscalls (handles, channels, VMOs – similar in spirit to capabilities). Tier 2.

Recommendation for capOS: Stay on no_std + alloc with the custom target JSON. std support is a large effort that should wait until the syscall surface is stable and threading works. When the time comes, Fuchsia’s approach (std over native capability syscalls) is the best model, since Fuchsia’s handle-based API is conceptually close to capOS’s capabilities.

Other OS Projects Reference

| OS | Target | Tier | std | Approach |
|---|---|---|---|---|
| Redox | x86_64-unknown-redox | 2 | Yes | relibc (custom libc) over Redox syscalls |
| Hermit | x86_64-unknown-hermit | 3 | Yes | std directly over kernel API |
| Fuchsia | x86_64-unknown-fuchsia | 2 | Yes | std over zircon handles (capability-like) |
| Theseus | x86_64-unknown-theseus | N/A | No | Custom JSON, no_std, research OS |
| Blog OS | Custom JSON | N/A | No | Based on x86_64-unknown-none |
| MOROS | Custom JSON | N/A | No | Simple hobby OS |

6. Go Runtime Requirements

Go’s Runtime Architecture

Go’s runtime is essentially a userspace operating system. It manages goroutine scheduling, garbage collection, memory allocation, and I/O multiplexing. The runtime interfaces with the actual OS through a narrow set of functions that each GOOS must implement.

Minimum OS Interface for a Go Port

Based on analysis of runtime/os_linux.go, runtime/os_plan9.go, and runtime/os_js.go, here is the minimum interface:

Tier 1: Absolute Minimum (single-threaded, like GOOS=js)

These functions are needed for “Hello, World!”:

func osinit()                                    // OS initialization
func write1(fd uintptr, p unsafe.Pointer, n int32) int32  // stdout/stderr output
func exit(code int32)                            // process termination
func usleep(usec uint32)                         // sleep (can be no-op initially)
func readRandom(r []byte) int                    // random data (for maps, etc.)
func goenvs()                                    // environment variables
func mpreinit(mp *m)                             // pre-init new M on parent thread
func minit()                                     // init new M on its own thread
func unminit()                                   // undo minit
func mdestroy(mp *m)                             // destroy M resources

Plus memory management (in runtime/mem_*.go):

func sysAllocOS(n uintptr) unsafe.Pointer        // allocate memory (mmap)
func sysFreeOS(v unsafe.Pointer, n uintptr)       // free memory (munmap)
func sysReserveOS(v unsafe.Pointer, n uintptr) unsafe.Pointer  // reserve VA range
func sysMapOS(v unsafe.Pointer, n uintptr)        // commit reserved pages
func sysUsedOS(v unsafe.Pointer, n uintptr)       // mark as used
func sysUnusedOS(v unsafe.Pointer, n uintptr)     // mark as unused (madvise)
func sysFaultOS(v unsafe.Pointer, n uintptr)      // remove access
func sysHugePageOS(v unsafe.Pointer, n uintptr)   // hint: use huge pages

Tier 2: Multi-threaded (real goroutines)

func newosproc(mp *m)                            // create OS thread (clone)
func exitThread(wait *atomic.Uint32)             // exit current thread
func futexsleep(addr *uint32, val uint32, ns int64)  // futex wait
func futexwakeup(addr *uint32, cnt uint32)        // futex wake
func settls()                                     // set FS base for TLS
func nanotime1() int64                            // monotonic nanosecond clock
func walltime() (sec int64, nsec int32)           // wall clock time
func osyield()                                    // sched_yield

Tier 3: Full Runtime (signals, profiling, network poller)

func sigaction(sig uint32, new *sigactiont, old *sigactiont)
func signalM(mp *m, sig int)                      // send signal to thread
func setitimer(mode int32, new *itimerval, old *itimerval)
func netpollopen(fd uintptr, pd *pollDesc) uintptr
func netpoll(delta int64) (gList, int32)
func netpollBreak()

Linux Syscalls Used by Go Runtime (Complete List)

From runtime/sys_linux_amd64.s:

| Syscall | # | Go Wrapper | capOS Equivalent |
|---|---|---|---|
| read | 0 | runtime.read | Store cap |
| write | 1 | runtime.write1 | Console cap |
| close | 3 | runtime.closefd | Cap drop |
| mmap | 9 | runtime.sysMmap | VirtualMemory cap |
| munmap | 11 | runtime.sysMunmap | VirtualMemory.unmap |
| brk | 12 | runtime.sbrk0 | VirtualMemory cap |
| rt_sigaction | 13 | runtime.rt_sigaction | Signal cap (future) |
| rt_sigprocmask | 14 | runtime.rtsigprocmask | Signal cap (future) |
| sched_yield | 24 | runtime.osyield | sys_yield |
| mincore | 27 | runtime.mincore | VirtualMemory.query |
| madvise | 28 | runtime.madvise | Future VirtualMemory decommit/query semantics, or unmap/remap policy |
| nanosleep | 35 | runtime.usleep | Timer cap |
| setitimer | 38 | runtime.setitimer | Timer cap |
| getpid | 39 | runtime.getpid | Process info |
| clone | 56 | runtime.clone | Thread cap |
| exit | 60 | runtime.exit | sys_exit |
| sigaltstack | 131 | runtime.sigaltstack | Not needed initially |
| arch_prctl | 158 | runtime.settls | sys_arch_prctl (set FS base) |
| gettid | 186 | runtime.gettid | Thread info |
| futex | 202 | runtime.futex | sys_futex |
| sched_getaffinity | 204 | runtime.sched_getaffinity | CPU info |
| timer_create | 222 | runtime.timer_create | Timer cap |
| timer_settime | 223 | runtime.timer_settime | Timer cap |
| timer_delete | 226 | runtime.timer_delete | Timer cap |
| clock_gettime | 228 | runtime.nanotime1 | Timer cap |
| exit_group | 231 | runtime.exit | sys_exit |
| tgkill | 234 | runtime.tgkill | Thread signal (future) |
| openat | 257 | runtime.open | Namespace cap |
| pipe2 | 293 | runtime.pipe2 | IPC cap |

Go’s TLS Model

Go uses arch_prctl(ARCH_SET_FS, addr) to set the FS segment base for each OS thread. The convention:

  • FS base points to the thread’s m.tls array
  • Goroutine pointer g is stored at -8(FS) (ELF TLS convention)
  • In Go’s ABIInternal, R14 is cached as the g register for performance
  • On signal entry or thread start, g is loaded from TLS into R14

Go does NOT use the compiler’s TLS mechanisms (no __thread or thread_local!). It manages TLS entirely in its own runtime via the FS register.

For capOS, this means the kernel needs:

  1. arch_prctl(ARCH_SET_FS) equivalent syscall
  2. The kernel must save/restore FS base on context switch
  3. Each thread’s FS base must be independently settable
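The three requirements above can be modeled as a small amount of per-thread state. In the real kernel the "hardware" value lives in MSR_FS_BASE (or is written with `wrfsbase`); here a plain `u64` stands in for it so the logic is host-runnable. The structure and names are illustrative, not the contents of kernel/src/sched.rs.

```rust
// Simplified model of per-thread FS base handling across context switches,
// plus an arch_prctl(ARCH_SET_FS)-style operation. `hw_fs_base` stands in
// for MSR_FS_BASE; all names are hypothetical.

struct Thread {
    fs_base: u64, // saved FS base for this thread
}

struct Cpu {
    hw_fs_base: u64, // stands in for the MSR the CPU actually uses
}

impl Cpu {
    // Context switch: save the outgoing thread's FS base, load the incoming one.
    fn switch(&mut self, from: &mut Thread, to: &Thread) {
        from.fs_base = self.hw_fs_base;
        self.hw_fs_base = to.fs_base;
    }

    // arch_prctl(ARCH_SET_FS) equivalent: the *current* thread retargets
    // its own FS base (what Go's settls() needs on each new M).
    fn set_fs(&mut self, current: &mut Thread, base: u64) {
        current.fs_base = base;
        self.hw_fs_base = base;
    }
}

fn main() {
    let mut cpu = Cpu { hw_fs_base: 0x1000 };
    let mut a = Thread { fs_base: 0x1000 };
    let mut b = Thread { fs_base: 0x2000 };
    cpu.set_fs(&mut a, 0x1234);         // thread A retargets its TLS
    cpu.switch(&mut a, &b);             // switch A -> B
    assert_eq!(cpu.hw_fs_base, 0x2000);
    cpu.switch(&mut b, &a);             // switch back
    assert_eq!(cpu.hw_fs_base, 0x1234); // A's base survived the round trip
}
```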

Adding GOOS=capos to Go

Files that need to be created/modified in a Go fork:

src/runtime/
    os_capos.go           // osinit, newosproc, futexsleep, etc.
    os_capos_amd64.go     // arch-specific OS functions
    sys_capos_amd64.s     // syscall wrappers in assembly
    mem_capos.go          // sysAlloc/sysFree/etc. over VirtualMemory cap
    signal_capos.go       // signal stubs (no real signals initially)
    stubs_capos.go        // misc stubs
    netpoll_capos.go      // network poller (stub initially)
    defs_capos.go         // OS-level constants
    vdso_capos.go         // VDSO stubs (no VDSO)

src/syscall/
    syscall_capos.go      // Go's syscall package
    zsyscall_capos_amd64.go

src/internal/platform/
    (modifications to supported.go, zosarch.go)

src/cmd/dist/
    (modifications to add capOS to known OS list)

Estimated: ~2000-3000 lines for Phase 1 (single-threaded).

Feasibility Assessment

| Feature | Difficulty | Blocked On |
|---|---|---|
| Hello World (write + exit) | Easy | Console capability plus exit syscall |
| Memory allocator (mmap) | Medium | VirtualMemory capability exists; Go glue and any missing query/decommit semantics remain |
| Single-threaded goroutines (M=1) | Medium | VirtualMemory cap + timer |
| Multi-threaded (real threads) | Hard | Kernel thread support, futex, runtime-controlled FS base |
| Network poller | Hard | Async cap invocation, networking stack |
| Signal-based preemption | Hard | Signal delivery mechanism |
| Full stdlib | Very Hard | POSIX layer or native cap wrappers |

7. Relevance to capOS

Practical Scope of Work

Phase 1: Custom Target JSON (Low effort, high value)

What: Create a userspace x86_64-unknown-capos.json target spec. Keep the kernel on x86_64-unknown-none unless a kernel JSON proves useful.

Why: Replaces the current approach of using x86_64-unknown-none with rustflags overrides. Makes the build cleaner, enables cfg(target_os = "capos") for conditional compilation, and is the foundation for everything else.

Effort: 1-2 hours for an initial file, plus recurring maintenance because Rust target JSON fields are not stable.

Blockers: None. Not required for the current no_std runtime path.

Phase 2: TLS Support (mostly landed, required for Go)

What: Parse PT_TLS from ELF, allocate per-thread TLS blocks, set FS base on context switch, add arch_prctl-equivalent syscall.

Why: Required for Go runtime (Go’s settls() sets FS base), for Rust #[thread_local] in userspace, and for C’s __thread.

Current state: PT_TLS parsing, static TLS mapping, FS-base context-switch state, and a Rust #[thread_local] smoke are implemented. Remaining work is the runtime-controlled FS-base operation and the thread model that makes it per-thread rather than per-process.

Blockers: Thread support for the multi-threaded case.

Phase 3: VirtualMemory Capability (implemented baseline, required for Go)

What: Implement the VirtualMemory capability interface. The current schema has map, unmap, and protect; Go may need decommit/query semantics later.

Why: Go’s memory allocator (sysAlloc, sysReserve, sysMap, etc.) needs mmap-like functionality. This is the single biggest kernel-side requirement for Go.

Current state: VirtualMemoryCap implements map/unmap/protect over the existing page-table code with ownership tracking and quota checks. Go-specific work still has to map runtime sysAlloc/sysReserve/sysMap expectations onto that interface.

Blockers: None for the baseline capability; timer/futex/threading still block useful Go.
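The ownership-tracking and quota-checking behavior described for the baseline capability can be modeled on the host without any page-table code. Region bookkeeping is real; the actual mapping work is elided, and the type and error names are illustrative rather than the schema/capos.capnp interface.

```rust
use std::collections::HashMap;

// Host-runnable model of VirtualMemory map/unmap bookkeeping: per-capability
// ownership of regions plus a page quota. Names are hypothetical.

struct VirtualMemoryCap {
    quota_pages: u64,
    used_pages: u64,
    regions: HashMap<u64, u64>, // base vaddr -> page count
}

#[derive(Debug, PartialEq)]
enum VmError {
    QuotaExceeded,
    NotMapped,
}

impl VirtualMemoryCap {
    fn map(&mut self, vaddr: u64, pages: u64) -> Result<(), VmError> {
        if self.used_pages + pages > self.quota_pages {
            return Err(VmError::QuotaExceeded);
        }
        self.used_pages += pages;
        self.regions.insert(vaddr, pages);
        Ok(())
    }

    fn unmap(&mut self, vaddr: u64) -> Result<(), VmError> {
        // Ownership check: only regions this capability mapped may be unmapped.
        let pages = self.regions.remove(&vaddr).ok_or(VmError::NotMapped)?;
        self.used_pages -= pages;
        Ok(())
    }
}

fn main() {
    let mut vm = VirtualMemoryCap { quota_pages: 4, used_pages: 0, regions: HashMap::new() };
    assert!(vm.map(0x4000_0000, 3).is_ok());
    assert_eq!(vm.map(0x5000_0000, 2), Err(VmError::QuotaExceeded));
    assert_eq!(vm.unmap(0x6000_0000), Err(VmError::NotMapped));
    assert!(vm.unmap(0x4000_0000).is_ok());
    assert!(vm.map(0x5000_0000, 2).is_ok()); // quota freed by the unmap
}
```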

Phase 4: Futex Operation (Low-medium effort, required for Go threading)

What: Implement futex(WAIT) and futex(WAKE) as a fast capability-authorized kernel operation.

Why: Go’s runtime synchronization (lock_futex.go) is built on futexes. The entire goroutine scheduler depends on futex-based sleeping.

Effort: ~100-200 lines for the first private-futex path. A wait queue keyed by address-space + userspace address is enough initially.

Blockers: Futex wait-queue design and, for full Go threading, the thread scheduler.
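A wait queue keyed by address-space plus userspace address, as proposed above, can be sketched as follows. The compare-and-enqueue and the wake path are the real algorithm; actual thread blocking is modeled as a Vec of thread ids, and all names are illustrative rather than a capOS interface. In a real kernel the value check and the enqueue must happen atomically with respect to wakers.

```rust
use std::collections::HashMap;

// Sketch of a private-futex wait queue: sleepers keyed by
// (address-space id, userspace address). Names are hypothetical.

type Tid = u64;

struct FutexTable {
    waiters: HashMap<(u64, u64), Vec<Tid>>, // (asid, uaddr) -> sleeping tids
}

impl FutexTable {
    // futex(WAIT): sleep only if the value at uaddr still equals `expected`.
    // `current` is the value read from userspace; returns true if enqueued.
    fn wait(&mut self, asid: u64, uaddr: u64, current: u32, expected: u32, tid: Tid) -> bool {
        if current != expected {
            return false; // value changed: caller retries instead of sleeping
        }
        self.waiters.entry((asid, uaddr)).or_default().push(tid);
        true
    }

    // futex(WAKE): wake up to `count` sleepers, returning how many woke.
    fn wake(&mut self, asid: u64, uaddr: u64, count: usize) -> usize {
        match self.waiters.get_mut(&(asid, uaddr)) {
            Some(q) => {
                let n = count.min(q.len());
                q.drain(..n).count() // each drained tid would be rescheduled
            }
            None => 0,
        }
    }
}

fn main() {
    let mut f = FutexTable { waiters: HashMap::new() };
    assert!(!f.wait(1, 0x1000, 7, 0, 10)); // value mismatch: no sleep
    assert!(f.wait(1, 0x1000, 0, 0, 10));
    assert!(f.wait(1, 0x1000, 0, 0, 11));
    assert_eq!(f.wake(1, 0x1000, 1), 1);   // one waiter woken
    assert_eq!(f.wake(1, 0x1000, 8), 1);   // only one remained
}
```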

Phase 5: Kernel Threading (High effort, required for Go GOMAXPROCS>1)

What: Multiple threads per process sharing address space and cap table.

Why: Go’s newosproc() creates OS threads via clone(). Without real threads, Go is limited to GOMAXPROCS=1.

Effort: ~500-800 lines. Major scheduler extension.

Blockers: Scheduler, per-CPU data, SMP support.

Biggest Blockers for Go

In priority order after the 2026-04-22 TLS and VirtualMemory work:

  1. Timer / monotonic clock – Go’s scheduler needs nanotime() for goroutine scheduling decisions. Without a timer, Go cannot preempt goroutines or manage timeouts.

  2. Runtime-controlled FS base – Go calls arch_prctl(ARCH_SET_FS) on every new thread. capOS can load static ELF TLS today, but Go still needs a way to set the runtime’s own TLS base.

  3. Futex – Go’s M:N scheduler depends on futex for sleeping/waking OS threads. Without futex, Go falls back to spin-waiting (wasteful) or simply cannot block.

  4. Thread creation – Required for GOMAXPROCS > 1. Phase 1 Go can work single-threaded.

  5. Go runtime port glue – map sysAlloc/write1/exit/random/env/time to capOS runtime and capabilities.

Biggest Blockers for C

C is much simpler than Go:

  1. Linker and toolchain setup – Need a cross-compilation toolchain targeting capOS (Clang with the custom target, or GCC cross-compiler).
  2. libcapos.a with C headers – Rust library with extern "C" API.
  3. musl integration (optional) – For full libc, replace musl’s __syscall() with capability invocations.

Dependency Order

1. Custom userspace target JSON          [optional build hygiene]
     |
2. VirtualMemory capability              [done: baseline map/unmap/protect]
     |
3. TLS support (PT_TLS, FS base)         [done for static ELF processes]
     |
4. Futex authority cap + measured ABI    [extends scheduler]
     |
5. Timer capability (monotonic clock)    [extends PIT/HPET driver]
     |
6. Go Phase 1: minimal GOOS=capos       [single-threaded, M=1]
     |
7. Kernel threading                      [major scheduler work]
     |
8. Go Phase 2: multi-threaded           [GOMAXPROCS>1, concurrent GC]
     |
9. C toolchain + libcapos               [parallel with Go work]
     |
10. Go Phase 3: network poller          [depends on networking stack]

Steps 1-5 are kernel prerequisites. Step 6 is the Go fork. Steps 7-10 are incremental improvements that can proceed in parallel.

Key Architectural Decisions for capOS

  1. Keep x86_64-unknown-none for kernel, x86_64-unknown-capos for userspace. The kernel does not benefit from a custom OS target (it’s freestanding). Userspace benefits from cfg(target_os = "capos").

  2. Use local-exec TLS model for static binaries. No dynamic linker means no general-dynamic or initial-exec TLS. local-exec is zero-overhead.

  3. Implement FS base save/restore early. Both Go and Rust #[thread_local] need it. It’s a small addition to context switch code.

  4. VirtualMemory cap stays on the Go critical path. The baseline exists; the Go port still needs exact runtime allocator semantics and any missing query/decommit behavior.

  5. Futex is the synchronization primitive. Both Go and any future pthreads implementation need futex. Keep authority capability-based, but measure whether the hot path should use a compact transport operation rather than generic Cap’n Proto method dispatch.

  6. Signals can be deferred. Go can start with cooperative-only preemption (no SIGURG). Signal delivery is complex and can come much later.

Cap’n Proto Error Handling: Research Notes

Research on how Cap’n Proto handles errors at the protocol, schema, and Rust crate levels. Used as input for the capOS error handling proposal.


1. Protocol-Level Exception Model (rpc.capnp)

The Cap’n Proto RPC protocol defines an Exception struct used in three positions: Message.abort, Return.exception, and Resolve.exception.

struct Exception {
  reason @0 :Text;
  type @3 :Type;
  enum Type {
    failed @0;        # deterministic bug/invalid input; retrying won't help
    overloaded @1;    # temporary lack of resources; retry with backoff
    disconnected @2;  # connection to necessary capability was lost
    unimplemented @3; # server doesn't implement the method
  }
  obsoleteIsCallersFault @1 :Bool;
  obsoleteDurability @2 :UInt16;
  trace @4 :Text;     # stack trace from the remote server
}

The four exception types describe client response strategy, not error semantics:

| Type | Client response |
|---|---|
| failed | Log and propagate. Don’t retry. |
| overloaded | Retry with exponential backoff. |
| disconnected | Re-establish connection, retry. |
| unimplemented | Fall back to alternative methods. |

2. Rust capnp Crate (v0.25.x)

Core error types

pub type Result<T> = ::core::result::Result<T, Error>;

#[derive(Debug, Clone)]
pub struct Error {
    pub kind: ErrorKind,
    pub extra: String,  // human-readable description (requires `alloc`)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[non_exhaustive]
pub enum ErrorKind {
    // Four RPC-mapped kinds (match Exception.Type)
    Failed,
    Overloaded,
    Disconnected,
    Unimplemented,

    // Wire format validation errors (~40 more variants)
    BufferNotLargeEnough,
    EmptyBuffer,
    MessageContainsOutOfBoundsPointer,
    MessageIsTooDeeplyNested,
    ReadLimitExceeded,
    TextContainsNonUtf8Data(core::str::Utf8Error),
    // ... etc
}

Constructor functions: Error::failed(s), Error::overloaded(s), Error::disconnected(s), Error::unimplemented(s).

The NotInSchema(u16) type handles unknown enum values or union discriminants.

std::io::Error mapping

When std feature is enabled, From<std::io::Error> maps:

  • TimedOut -> Overloaded
  • BrokenPipe/ConnectionRefused/ConnectionReset/ConnectionAborted/NotConnected -> Disconnected
  • UnexpectedEof -> PrematureEndOfFile
  • Everything else -> Failed

3. capnp-rpc Rust Crate Error Mapping

Bidirectional conversion between wire Exception and capnp::Error:

Sending (Error -> Exception):

fn from_error(error: &Error, mut builder: exception::Builder) {
    let typ = match error.kind {
        ErrorKind::Failed => exception::Type::Failed,
        ErrorKind::Overloaded => exception::Type::Overloaded,
        ErrorKind::Disconnected => exception::Type::Disconnected,
        ErrorKind::Unimplemented => exception::Type::Unimplemented,
        _ => exception::Type::Failed,  // all validation errors -> Failed
    };
    builder.set_type(typ);
    builder.set_reason(&error.extra);
}

Receiving (Exception -> Error): Maps exception::Type back to ErrorKind, preserving the reason string.

Server traits return Promise<(), capnp::Error>. Client gets Promise<Response<Results>, capnp::Error>.

4. Cap’n Proto Error Handling Philosophy

From KJ library documentation and Kenton Varda:

“KJ exceptions are meant to express unrecoverable problems or logistical problems orthogonal to the API semantics; they are NOT intended to be used as part of your API semantics.”

“In the Cap’n Proto world, ‘checked exceptions’ (where an interface explicitly defines the exceptions it throws) do NOT make sense.”

Exceptions: infrastructure failures (network down, bug, overload). Application errors: should be modeled in the schema return types.

5. Schema Design Patterns for Application Errors

Generic Result pattern

struct Error {
    code @0 :UInt16;
    message @1 :Text;
}

struct Result(Ok) {
    union {
        ok @0 :Ok;
        err @1 :Error;
    }
}

interface MyService {
    doThing @0 (input :Text) -> (result :Result(Text));
}

Constraint: generic type parameters bind only to pointer types (Text, Data, structs, lists, interfaces), not primitives (UInt32, Bool). So Result(UInt64) doesn’t work – need a wrapper struct.

Per-method result unions

interface FileSystem {
    open @0 (path :Text) -> (result :OpenResult);
}

struct OpenResult {
    union {
        file @0 :File;
        notFound @1 :Void;
        permissionDenied @2 :Void;
        error @3 :Text;
    }
}

Unions must be embedded in structs (no free-standing unions). This allows adding new fields later without breaking compatibility.

6. How Other Cap’n Proto Systems Handle Errors

Sandstorm

Uses the exception mechanism for infrastructure errors. Capabilities report errors through disconnection. The grain.capnp schema does not define explicit error types. util.capnp documents errors as “It will throw an exception if any error occurs.”

Cloudflare Workers (workerd)

Uses Cap’n Proto for internal RPC. JavaScript Error.message and Error.name are preserved across RPC; stack traces and custom properties are stripped. Does not model errors in capnp schema – relies on exception propagation.

OCapN (Open Capability Network)

Adopted the same four-kind exception model for cross-system compatibility. Diagnostic information is non-normative. Security concern: exception objects may leak sensitive information (stack traces, paths) at CapTP boundaries.

Kenton Varda expressed reservations about unimplemented (ambiguity about whether the direct method or callees failed) and disconnected (requires catching at specific stack frames for meaningful retry).

7. Relevance to capOS

capOS uses the capnp crate but not capnp-rpc. Manual dispatch goes through CapObject::call() with caller-provided params/result buffers. Current error handling:

  • capnp::Error::failed() for semantic errors
  • capnp::Error::unimplemented() for unknown methods
  • ? for deserialization errors (naturally produce capnp::Error)
  • Transport errors become CQE status codes.
  • Kernel-produced CapException values are serialized into result buffers for capability-level failures and decoded by capos-rt.

The capnp::Error type carries the information needed for CapException: kind maps to ExceptionType, and extra maps to message.
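That mapping can be sketched as a small table lookup. The enum and function names below are illustrative, not the actual capos-rt definitions; only the four kind names come from the RPC model:

```python
# Hypothetical sketch: mapping a capnp::Error-like (kind, extra) pair onto a
# CapException-style (type, message) pair. ExceptionType mirrors the four RPC
# exception kinds; the real capos-rt types may differ.
from enum import Enum

class ExceptionType(Enum):
    FAILED = 0
    OVERLOADED = 1
    DISCONNECTED = 2
    UNIMPLEMENTED = 3

def to_cap_exception(kind: str, extra: str) -> tuple:
    mapping = {
        "failed": ExceptionType.FAILED,
        "overloaded": ExceptionType.OVERLOADED,
        "disconnected": ExceptionType.DISCONNECTED,
        "unimplemented": ExceptionType.UNIMPLEMENTED,
    }
    # Unknown kinds degrade to FAILED, the catch-all kind in the RPC model.
    return (mapping.get(kind, ExceptionType.FAILED), extra)
```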


Sources

  • Cap’n Proto RPC Protocol: https://capnproto.org/rpc.html
  • Cap’n Proto C++ RPC: https://capnproto.org/cxxrpc.html
  • Cap’n Proto Schema Language: https://capnproto.org/language.html
  • Cap’n Proto FAQ: https://capnproto.org/faq.html
  • KJ exception.h: https://github.com/capnproto/capnproto/blob/master/c%2B%2B/src/kj/exception.h
  • rpc.capnp schema: https://github.com/capnproto/capnproto/blob/master/c%2B%2B/src/capnp/rpc.capnp
  • OCapN error handling discussion: https://github.com/ocapn/ocapn/issues/10
  • Cap’n Proto usage patterns: https://github.com/capnproto/capnproto/discussions/1849
  • capnp-rpc Rust crate: https://crates.io/crates/capnp-rpc
  • Cloudflare Workers RPC errors: https://developers.cloudflare.com/workers/runtime-apis/rpc/error-handling/
  • Sandstorm util.capnp: https://docs.rs/crate/sandstorm/0.0.5/source/schema/util.capnp

OS Error Handling in Capability Systems: Research Notes

Research on error handling patterns in capability-based and microkernel operating systems. Used as input for the capOS error handling proposal.


1. seL4

Error Codes

seL4 defines 11 kernel error codes in errors.h:

typedef enum {
    seL4_NoError            = 0,
    seL4_InvalidArgument    = 1,
    seL4_InvalidCapability  = 2,
    seL4_IllegalOperation   = 3,
    seL4_RangeError         = 4,
    seL4_AlignmentError     = 5,
    seL4_FailedLookup       = 6,
    seL4_TruncatedMessage   = 7,
    seL4_DeleteFirst        = 8,
    seL4_RevokeFirst        = 9,
    seL4_NotEnoughMemory    = 10,
} seL4_Error;

Error Return Mechanism

  • Capability invocations (kernel object operations) return seL4_Error directly.
  • IPC messages use seL4_MessageInfo_t with label, length, extraCaps, capsUnwrapped. The label is copied unmodified – kernel doesn’t interpret it.
  • MR0 (Message Register 0) carries return codes for kernel object invocations via seL4_Call.

Error Propagation

Fault handler mechanism: each TCB has a fault endpoint capability. On fault (capability fault, VM fault, etc.):

  1. Kernel blocks the faulting thread.
  2. Kernel sends an IPC to the fault endpoint with fault-type-specific fields.
  3. Fault handler (separate process) receives, fixes, and replies.
  4. Kernel resumes the faulting thread.

Design Choices

  • seL4_NBSend on invalid capability: silently fails (prevents covert channels).
  • seL4_Send/seL4_Call on invalid capability: returns seL4_FailedLookup.
  • No application-level error convention – user servers choose their own protocol.
  • Partial capability transfer: if some caps in a multi-cap transfer fail, already-transferred caps succeed; extraCaps reflects the successful count.

Sources

  • seL4 errors.h: https://github.com/seL4/seL4/blob/master/libsel4/include/sel4/errors.h
  • seL4 IPC tutorial: https://docs.sel4.systems/Tutorials/ipc.html
  • seL4 fault handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
  • seL4 API reference: https://docs.sel4.systems/projects/sel4/api-doc.html

2. Fuchsia / Zircon

zx_status_t

Signed 32-bit integer. Negative = error, ZX_OK (0) = success.

Categories:

| Category | Examples |
| --- | --- |
| General | ZX_ERR_INTERNAL, ZX_ERR_NOT_SUPPORTED, ZX_ERR_NO_RESOURCES, ZX_ERR_NO_MEMORY |
| Parameter | ZX_ERR_INVALID_ARGS, ZX_ERR_WRONG_TYPE, ZX_ERR_BAD_HANDLE, ZX_ERR_BUFFER_TOO_SMALL |
| State | ZX_ERR_BAD_STATE, ZX_ERR_NOT_FOUND, ZX_ERR_TIMED_OUT, ZX_ERR_ALREADY_EXISTS, ZX_ERR_PEER_CLOSED |
| Permission | ZX_ERR_ACCESS_DENIED |
| I/O | ZX_ERR_IO, ZX_ERR_IO_REFUSED, ZX_ERR_IO_DATA_INTEGRITY, ZX_ERR_IO_DATA_LOSS |

FIDL Error Handling (Three Layers)

Layer 1: Transport errors – the channel broke. Currently all transport-level FIDL errors close the channel; the client observes ZX_ERR_PEER_CLOSED.

Layer 2: Epitaphs (RFC-0053). Server sends a special final message before closing a channel, explaining why. Wire format: ordinal 0xFFFFFFFF, error status in the reserved uint32 of the FIDL message header. After sending, server closes the channel.

Layer 3: Application errors (RFC-0060). Methods declare error types:

Method() -> (string result) error int32;

Serialized as:

union MethodReturn {
    MethodResult result;
    int32 err;
};

Error types constrained to int32, uint32, or an enum thereof. Deliberately no standard error enum – each service defines its own error domain. Rationale: standard error enums “try to capture more detail than we think is appropriate.”

C++ binding: zx::result<T> (specialization of fit::result<zx_status_t, T>).
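A minimal Python analogue of that status-or-value shape, assuming only the documented convention (negative status = error, 0 = success); the class and method names are illustrative, not the Fuchsia API:

```python
# Minimal analogue of zx::result<T>: either an error status or a success value.
# Follows the Zircon convention: negative = error, ZX_OK (0) = success.
ZX_OK = 0

class ZxResult:
    def __init__(self, status, value=None):
        self.status = status
        self.value = value

    def is_ok(self):
        return self.status == ZX_OK

    def unwrap(self):
        # Mirrors the "must check before use" discipline of fit::result.
        if not self.is_ok():
            raise RuntimeError(f"status {self.status}")
        return self.value
```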

Sources

  • Zircon errors: https://fuchsia.dev/fuchsia-src/concepts/kernel/errors
  • RFC-0060 error handling: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0060_error_handling
  • RFC-0053 epitaphs: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0053_epitaphs

3. EROS / KeyKOS / Coyotos

KeyKOS Invocation Message Format

KC (Key, Order_code)
   STRUCTFROM(arg_structure)
   KEYSFROM(arg_key_slots)
   STRUCTTO(reply_structure)
   KEYSTO(reply_key_slots)
   RCTO(return_code_variable)

  • Order code: small integer selecting the operation (method selector).
  • Return code: integer returned by the invoked object via RCTO.
  • Data string: bulk data parameter (up to ~4KB).
  • Keys: up to 4 capability parameters in each direction.

Invocation Primitives

  • CALL: send + block for reply. Kernel synthesizes a resume key (capability to resume caller) as 4th key parameter to callee.
  • RETURN: reply using a resume key + go back to waiting.
  • FORK: send and continue (fire-and-forget).

Keeper Error Handling

Every domain has a domain keeper slot. On hardware trap (illegal instruction, divide-by-zero, protection fault):

  1. Kernel invokes the keeper as if the domain had issued a CALL.
  2. Keeper receives fault information in the message.
  3. Keeper can fix and resume (via resume key) or terminate.
  4. A non-zero return code from a key invocation triggers the keeper mechanism.

Coyotos (EROS Successor) – Formalized Error Model

Cleanly separates invocation-level vs application-level exceptions:

Invocation-level (before the target processes the message): MalformedSyscall, InvalidAddress, AccessViolation, DataAccessTypeError, CapAccessTypeError, MalformedSpace, MisalignedReference

Application-level: signaled via OPR0.ex flag bit in the reply control word. If set, remaining parameter words contain a 64-bit exception code plus optional info.

Sources

  • KeyKOS architecture: https://dl.acm.org/doi/pdf/10.1145/858336.858337
  • Coyotos spec: https://hydra-www.ietfng.org/capbib/cache/shapiro:coyotosspec.html
  • EROS (SOSP 1999): https://sites.cs.ucsb.edu/~chris/teaching/cs290/doc/eros-sosp99.pdf

4. Plan 9 / 9P

9P2000 Rerror Format

size[4] Rerror tag[2] ename[s]

  • ename[s]: variable-length UTF-8 string describing the error.
  • No Terror message – only servers send errors.
  • String-based, not numeric. Conventional strings (“permission denied”, “file not found”) but no fixed taxonomy.
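The Rerror wire format is small enough to encode by hand. A sketch using the 9P2000 type code for Rerror (107) and little-endian fields; the 4-byte size prefix counts the whole message, including itself:

```python
import struct

RERROR = 107  # 9P2000 message type code for Rerror

def encode_rerror(tag: int, ename: str) -> bytes:
    """size[4] type[1] tag[2] ename[s], where s = count[2] + UTF-8 bytes."""
    name = ename.encode("utf-8")
    body = struct.pack("<BH", RERROR, tag) + struct.pack("<H", len(name)) + name
    return struct.pack("<I", 4 + len(body)) + body

def decode_rerror(msg: bytes):
    size, mtype, tag, slen = struct.unpack_from("<IBHH", msg)
    assert size == len(msg) and mtype == RERROR
    ename = msg[9:9 + slen].decode("utf-8")  # string bytes start after the 9-byte header
    return tag, ename
```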

9P2000.u Extension (Unix compatibility)

size[4] Rerror tag[2] ename[s] errno[4]

Adds a 4-byte Unix errno as a hint. Clients should prefer the string. ERRUNDEF sentinel when Unix errno doesn’t apply.

Design Rationale

Avoids “errno fragmentation” where different Unix variants assign different numbers to the same condition. The string is authoritative; the number is an optimization for Unix-compatibility clients.

Sources

  • 9P2000 RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.html
  • 9P2000.u RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.u.html

5. Genode

RPC Exception Propagation

GENODE_RPC_THROW(func_type, ret_type, func_name,
                 GENODE_TYPE_LIST(Exception1, Exception2, ...),
                 arg_type...)

Only the exception type crosses the boundary – exception objects (fields, messages) are not transferred. Server encodes a numeric Rpc_exception_code, client reconstructs a default-constructed exception of the matching type.

Undeclared exceptions: undefined behavior (server crash or hung RPC).

Infrastructure-Level Errors

  • RPC_INVALID_OPCODE: dispatched operation code doesn’t match.
  • Rpc_exception_code: integral type, computed as RPC_EXCEPTION_BASE - index_in_exception_list.
  • Ipc_error: kernel IPC failure (server unreachable).
  • Server death: capabilities become invalid, subsequent invocations produce Ipc_error.
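The exception-code arithmetic round-trips as a simple subtraction, since only the index of the declared exception type crosses the boundary. A sketch with an illustrative base value (the real constant lives in Genode's headers):

```python
# Sketch of Genode's Rpc_exception_code arithmetic. RPC_EXCEPTION_BASE here is
# an illustrative placeholder, not Genode's actual value.
RPC_EXCEPTION_BASE = -1000

def encode_exception(index_in_exception_list: int) -> int:
    return RPC_EXCEPTION_BASE - index_in_exception_list

def decode_exception(code: int) -> int:
    # Inverse of encode: recover the index, from which the client
    # default-constructs an exception of the matching declared type.
    return RPC_EXCEPTION_BASE - code
```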

Sources

  • Genode RPC: https://genode.org/documentation/genode-foundations/20.05/functional_specification/Remote_procedure_calls.html
  • Genode IPC: https://genode.org/documentation/genode-foundations/23.05/architecture/Inter-component_communication.html

6. Cross-System Comparison: Transport vs Application Errors

Every capability/microkernel IPC system separates two failure modes:

  1. Transport errors – the invocation mechanism failed before the target processed the request (bad handle, insufficient rights, target dead, malformed message, timeout).

  2. Application errors – the service processed the request and returned a meaningful error (not found, resource exhausted, invalid operation).

| System | Transport errors | Application errors |
| --- | --- | --- |
| seL4 | seL4_Error (11 values) from syscall | IPC message payload (user-defined) |
| Zircon | zx_status_t (~30 values) from syscall | FIDL per-method error type |
| EROS/Coyotos | Invocation exceptions (kernel) | OPR0.ex flag + code in reply |
| Plan 9 | Connection loss | Rerror with string |
| Genode | Ipc_error + RPC_INVALID_OPCODE | C++ exceptions via GENODE_RPC_THROW |
| Cap’n Proto RPC | disconnected/unimplemented | failed/overloaded or schema types |

Common pattern: small kernel error code set for transport + typed service-specific errors for application.
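The split can be made concrete with a sketch: transport failures are reported by the mechanism before any payload exists, while application errors arrive inside a successfully delivered reply. All names below are illustrative:

```python
# Two-layer error model common to the systems above (illustrative names).
from dataclasses import dataclass
from typing import Union

@dataclass
class TransportError:      # layer 1: the mechanism failed before the target ran
    reason: str            # e.g. "bad handle", "peer closed", "timeout"

@dataclass
class AppError:            # layer 2: the service ran and returned a typed error
    code: str              # scoped to one interface, not a global errno space
    detail: str = ""

@dataclass
class Ok:
    value: object

Reply = Union[TransportError, AppError, Ok]

def classify(reply: Reply) -> str:
    if isinstance(reply, TransportError):
        return "transport"     # retry/reconnect decisions live here
    if isinstance(reply, AppError):
        return "application"   # handled by service-specific logic
    return "ok"
```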


7. POSIX errno: Strengths and Weaknesses for Capability Systems

Strengths

  • Simple (single integer, zero overhead on success).
  • Universal (every Unix developer knows it).
  • Low overhead (no allocation on error path).

Weaknesses for Capability Systems

  • Ambient authority assumption: EACCES/EPERM assume ACL-style access control. In capability systems, having the capability IS the permission.
  • Global flat namespace: all errors share one integer space. Capability systems have typed interfaces; errors should be scoped per-interface.
  • No structured information: just an integer, no “which argument” or “how much memory needed.”
  • Thread-local state: clobbered by intermediate calls, breaks down with async IPC or promise pipelining.
  • No transport/application distinction: EBADF (transport) and ENOENT (application) in the same space.
  • Not composable across trust boundaries: callee’s errno meaningless in caller’s address space without explicit serialization.

No capability system uses a POSIX-style global errno namespace.

IX-on-capOS Hosting Research

Research note on using IX as a package corpus and content-addressed build model for a more mature capOS system. It explains what IX provides, why it is useful for capOS, and how to extract the most value from it without importing CPython/POSIX assumptions as an architectural dependency.

What IX Is

IX is a source-based package/build system. It describes packages as templates, expands those templates into build descriptors and shell scripts, fetches and verifies source inputs, executes dependency-ordered builds, stores outputs in a content-addressed store, and publishes usable package environments through realm mappings.

For capOS, IX should be treated as three separable assets:

  • a package corpus with thousands of package definitions and accumulated build knowledge;
  • a content-addressed build/store model that already fits reproducible artifact management;
  • a compact Python control plane that can be adapted once authority-bearing operations move behind capOS services.

IX should not be treated as a requirement to reproduce Unix inside capOS. Its current implementation uses CPython, Jinja2, subprocesses, shell tools, filesystem paths, symlinks, hardlinks, signals, and process groups because it runs on Unix-like hosts today. Those are implementation assumptions, not the part worth preserving unchanged.

Why IX Is Useful for capOS

capOS needs a credible path from isolated demos to a useful userspace closure. IX is useful because it supplies a package/build corpus and model that can exercise the exact system boundaries capOS needs to grow:

  • process spawning with explicit argv, env, cwd, stdio, and exit status;
  • fetch, archive extraction, and content verification as auditable services;
  • Store and Namespace capabilities instead of ambient global filesystem authority;
  • build sandboxing with explicit input, scratch, output, network, and resource policies;
  • static-tool bootstrapping before a full dynamic POSIX environment exists;
  • differential testing against the existing host IX implementation.

The main value is leverage. IX can give capOS real package metadata, real build scripts, and real toolchain pressure without making CPython or a broad POSIX personality the first required userspace milestone.

Best Way to Get the Most from IX

The optimal strategy is to preserve IX’s package corpus and build semantics while replacing the Unix-shaped execution boundary with capability-native services.

The high-value path is:

  1. Run upstream IX on the host first to build and validate early capOS artifacts.
  2. Use CPython/Jinja2 on the host as a reference oracle, not as the in-system foundation.
  3. Render IX templates through a Rust ix-template component that implements the subset IX actually uses.
  4. Run the adapted IX planner/control plane on native MicroPython once capOS has enough runtime support.
  5. Move fetch, extract, build, Store commit, Namespace publish, and process lifecycle into typed capOS services.

This gets most of IX’s value: package knowledge, reproducible build structure, and a practical self-hosting path. It avoids the lowest-value part: spending early capOS effort on a large CPython/POSIX compatibility layer just to preserve upstream implementation details.

Position

CPython is not an architectural prerequisite for IX-on-capOS.

It is a compatibility shortcut for running upstream IX with minimal changes. For a clean capOS-native integration, the better design is:

  • keep IX’s package corpus and content-addressed build model;
  • adapt IX’s Python control-plane code instead of preserving every CPython and POSIX assumption;
  • run the adapted control plane on a native MicroPython port;
  • move build execution, fetching, archive extraction, store mutation, and sandboxing into typed capOS services;
  • render IX templates through a Rust template service or tightly scoped IX template engine, not full Jinja2 on MicroPython;
  • keep CPython on the host as a differential test oracle and bootstrap tool, not as a required foundation layer for capOS.

MicroPython is a credible sweet spot only with that boundary. It is not a credible sweet spot if the requirement is “make upstream Jinja2, subprocess, fcntl, process groups, and Unix filesystem behavior all work inside MicroPython.”

Sources Inspected

  • Upstream IX repository: https://github.com/pg83/ix
  • IX package guide: PKGS.md
  • IX core: core/
  • IX templates: pkgs/die/
  • Bundled IX template deps: deps/jinja-3.1.6/, deps/markupsafe-3.0.3/
  • MicroPython library docs: https://docs.micropython.org/en/latest/library/index.html
  • MicroPython CPython-difference docs: https://docs.micropython.org/en/latest/genrst/
  • MicroPython porting docs: https://docs.micropython.org/en/latest/develop/index.html
  • Jinja docs: https://jinja.palletsprojects.com/en/latest/intro/
  • MiniJinja docs: https://docs.rs/minijinja/latest/minijinja/

Upstream IX Shape

IX is a source-based, content-addressed package/build system. Package definitions are Jinja templates under pkgs/, mostly named ix.sh, and the template hierarchy under pkgs/die/ expands those package descriptions into JSON descriptors and shell build scripts.

The inspected clone has:

  • 3788 package ix.sh files;
  • 66 files under pkgs/die;
  • a template chain centered on base.json, ix.json, script.json, sh0.sh, sh1.sh, sh2.sh, sh.sh, base.sh, std/ix.sh, and language/build-system templates for C, Rust, Go, Python, CMake, Meson, Ninja, WAF, GN, Kconfig, and shell-only generated packages.

The IX template surface is broad but not arbitrary Jinja. In the package tree surveyed, the Jinja tags used were:

| Tag | Count |
| --- | --- |
| block | 14358 |
| endblock | 14360 |
| extends | 3808 |
| if / endif | 451 / 451 |
| include | 344 |
| else | 123 |
| set / endset | 52 / 52 |
| for / endfor | 49 / 49 |
| elif | 23 |

No macro, import, from, with, filter, raw, or call tags were found in the inspected tree. That matters: IX’s template needs are probably a finite subset around inheritance, blocks, self.block(), super(), includes, conditionals, loops, assignments, expressions, and custom filters.
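A tag survey like this is straightforward to reproduce with a regex pass over the raw template sources. A sketch (approximate, since it ignores comments and raw blocks):

```python
# Count {% tag %} keywords across template sources. Approximate: a plain regex
# pass does not parse Jinja, so {% raw %} bodies and comments are counted too.
import re
from collections import Counter

TAG_RE = re.compile(r"\{%-?\s*(\w+)")

def count_jinja_tags(template_sources):
    counts = Counter()
    for src in template_sources:
        counts.update(TAG_RE.findall(src))
    return counts

# Example over a tiny ix.sh-style template:
sample = '{% extends "//die/c/ix.sh" %}\n{% block ver %}1.0{% endblock %}'
```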

IX’s own Jinja wrapper is small. core/j2.py defines:

  • custom loader with // root handling;
  • include inlining;
  • filters such as b64e, b64d, jd, jl, group_by, basename, dirname, ser, des, lines, eval, defined, field, pad, add, preproc, parse_urls, parse_list, list_to_json, and fjoin.

That makes the template layer replaceable. The risk is not “Jinja is impossible.” The risk is “full upstream Jinja2 drags in a CPython-shaped runtime just to implement a template subset IX mostly uses in a disciplined way.”

Current IX Runtime Surface

The IX Python core uses ordinary host-scripting features:

  • os, os.path, json, hashlib, base64, random, string, functools, itertools, platform, getpass;
  • shutil.which, shutil.rmtree, shutil.move;
  • subprocess.run, check_call, check_output;
  • os.execvpe, os.kill, os.setpgrp, signal.signal;
  • fcntl.fcntl to reset stdout flags;
  • asyncio for graph scheduling;
  • multiprocessing.cpu_count;
  • contextvars fallback support for asyncio.to_thread;
  • tarfile, zipfile;
  • ssl, urllib3, usually only to suppress certificate warnings while fetchers are shell-driven;
  • os.symlink, os.link, os.rename, os.makedirs, open, and file tests.

core/execute.py is the important boundary. It schedules a DAG, prepares output directories, calls shell commands with environment variables and stdin, checks output touch files, and kills the process group on failure.

core/cmd_misc.py and core/shell_cmd.py cover fetch, extraction, hash checking, archive unpacking, and hardlinking fetched inputs.

core/realm.py maps build outputs into realm names using symlinks and metadata under /ix/realm.

core/ops.py selects an execution mode. Today the modes are local, system, fake, and molot. A capOS executor mode is the correct integration point.

CPython Path

CPython is the obvious route for upstream compatibility:

  • upstream Jinja2 is designed for modern Python and uses normal CPython-style standard library facilities;
  • IX’s current Python code assumes subprocess, asyncio, fcntl, shutil, archive modules, and process semantics;
  • CPython plus libcapos-posix would let a large fraction of that code run with limited changes.

That does not make CPython the right product dependency for IX-on-capOS. CPython pulls in a large libc/POSIX surface and encourages preserving Unix process and filesystem assumptions that capOS should make explicit through capabilities.

CPython should be used in two places:

  1. Host-side bootstrap and reference evaluation.
  2. Optional compatibility mode once libcapos-posix is mature.

It should not be the required path for a clean IX-capOS integration.

If CPython is needed later, capOS has two routes:

  1. Native CPython through musl plus libcapos-posix.
  2. CPython compiled to WASI and run through a native WASI runtime.

The native POSIX route is the only route that makes sense for IX-style build workloads. It needs fd tables, path lookup, read/write/close/lseek, directory iteration, rename/unlink/mkdir, time, memory mapping, posix_spawn, pipes, exit status, and eventually sockets. That is the same compatibility work needed for shell tools and build systems, so it should arrive as part of the general userspace-compatibility track, not as an IX-specific dependency.

The WASI route is useful for sandboxed or compute-heavy Python, but it is a poor fit for IX package builds because IX fundamentally drives external tools, filesystem trees, fetchers, and process lifecycles. WASI CPython can be useful as a script sandbox, not as the main IX appliance runtime.

MicroPython Path

MicroPython is attractive because capOS needs an embeddable system scripting runtime before it needs a full desktop Python environment.

The upstream docs frame MicroPython as a Python implementation with a smaller, configurable library set. The latest library docs list micro versions of modules relevant to IX, including asyncio, gzip, hashlib, json, os, platform, random, re, select, socket, ssl, struct, sys, time, zlib, and _thread, while warning that most standard modules are subsets and that port builds may include only part of the documented surface.

That is a good fit for capOS. It means a capOS port can expose a deliberately chosen OS surface instead of pretending to be Linux.

MicroPython should host:

  • package graph traversal;
  • package metadata parsing;
  • target/config normalization;
  • dependency expansion;
  • high-level policy;
  • command graph generation;
  • calls into capOS-native services.

MicroPython should not own:

  • generic subprocess emulation;
  • shell execution internals;
  • process groups or Unix signals;
  • TLS/network fetching;
  • archive formats beyond small helper cases;
  • hardlink/symlink implementation;
  • content store mutation;
  • build sandboxing;
  • parallel job scheduling if that wants kernel-visible resource control.

Those belong in capOS services.

Native MicroPython Port Shape

A capOS MicroPython port should be a new MicroPython platform port, not the Unix port with a large compatibility shim underneath.

The port should provide:

  • VM startup through capos-rt;
  • heap allocation from a fixed initial heap first, then VirtualMemory when growth is available;
  • stdin/stdout/stderr backed by granted stream or Console capabilities;
  • module import from a read-only Namespace plus frozen modules;
  • a small VFS adapter over Store/Namespace for scripts and package metadata;
  • native C/Rust extension modules for capOS capabilities;
  • deterministic error mapping from capability exceptions to Python exceptions.

The initial built-in surface should be deliberately small:

  • sys with argv/path/modules;
  • os path and file operations backed by a granted namespace;
  • time backed by a clock capability;
  • hashlib, json, binascii/base64, random, struct;
  • optional asyncio if the planner keeps Python-level concurrency;
  • no general-purpose subprocess until the service boundary proves it is necessary.

For IX, the MicroPython port should ship frozen planner modules and native bindings to ix-template, BuildCoordinator, Store, Namespace, Fetcher, and Archive. That keeps the trusted scripting surface small and avoids import-time dependency drift.

Jinja2 and MicroPython

Full Jinja2 compatibility on MicroPython remains unproven and is probably not the optimal target.

Current Jinja docs say Jinja supports Python 3.10 and newer, depends on MarkupSafe, and compiles templates to optimized Python code. The bundled IX Jinja tree imports modules such as typing, weakref, importlib, contextlib, inspect, ast, types, collections, itertools, io, and MarkupSafe. Some of these can be ported or stubbed, but that is a CPython compatibility project, not a small MicroPython extension.

The better path is to treat IX’s template language as an input format and render it with a capOS-native component.

Recommended template strategy:

  1. Build an ix-template Rust component using MiniJinja or a smaller IX-specific template subset.
  2. Register IX’s custom filters from core/j2.py.
  3. Implement IX’s loader semantics: // package-root paths, relative includes, and cached sources.
  4. Reject unsupported Jinja constructs with deterministic errors.
  5. Keep CPython/Jinja2 as a host-side oracle for differential testing until the capOS renderer matches the package corpus.

MiniJinja is a practical candidate because it is Rust-native, based on Jinja2 syntax/behavior, supports custom filters and dynamic objects, and has feature flags for trimming unused template features. IX needs multi-template support because it uses extends, include, and block.

If MiniJinja compatibility is insufficient, the fallback is not CPython by default. The fallback is an IX-template subset evaluator that implements the constructs actually used by pkgs/.

Optimal Architecture

The clean design is an IX-capOS build appliance, not a Unix personality layer that happens to run IX.

flowchart TD
    CLI[ix CLI or build request] --> Planner[ix planner on MicroPython]
    Planner --> Template[ix-template renderer]
    Planner --> Graph[normalized build graph]
    Template --> Graph

    Graph --> Coordinator[capOS BuildCoordinator service]
    Coordinator --> Fetcher[Fetcher service]
    Coordinator --> Extractor[Archive service]
    Coordinator --> Store[Store service]
    Coordinator --> Sandbox[BuildSandbox service]

    Fetcher --> Store
    Extractor --> Store
    Sandbox --> Proc[ProcessSpawner]
    Sandbox --> Scratch[writable scratch namespace]
    Sandbox --> Inputs[read-only input namespaces]
    Proc --> Tools[sh, make, cc, cargo, go, coreutils]
    Sandbox --> Output[write-once output namespace]
    Output --> Store
    Store --> Realm[Namespace snapshot / realm publish]

The planner remains small and scriptable. The authority-bearing work happens in services:

  • BuildCoordinator: owns graph execution and job state.
  • Store: content-addressed objects and output commits.
  • Namespace: names, realms, snapshots, and package environments.
  • Fetcher: network-capable source acquisition with explicit TLS and cache policy.
  • Archive: deterministic extraction and path-safety checks.
  • BuildSandbox: constructs per-build capability sets.
  • ProcessSpawner: starts shell/tools with controlled argv, env, cwd, stdio, and granted capabilities.
  • Toolchain packages: statically linked tools built externally first, then eventually by IX itself.

The adapted IX planner should call service APIs instead of shelling out for operations that are native capOS concepts.

Control-Plane Boundary

MicroPython should see a narrow, high-level API. It should not synthesize Unix from first principles.

Example shape:

import ixcapos
import ixtemplate

pkg = ixcapos.load_package("bin/minised")
desc = ixtemplate.render_package(pkg.name, pkg.context)
graph = ixcapos.plan(desc, target="x86_64-unknown-capos")
result = ixcapos.build(graph)
ixcapos.publish_realm("dev", result.outputs)

The Python layer can still look like IX. The implementation behind it should be capability-native.

Service API Sketch

The exact schema should follow the project schema style, but this is the shape of the boundary:

interface BuildCoordinator {
  plan @0 (package :Text, target :Text, options :BuildOptions)
      -> (graph :BuildGraph);
  build @1 (graph :BuildGraph) -> (result :BuildResult);
  publish @2 (realm :Text, outputs :List(OutputRef))
      -> (namespace :Namespace);
}

interface BuildSandbox {
  run @0 (command :Command, inputs :List(Namespace),
          scratch :Namespace, output :Namespace, policy :SandboxPolicy)
      -> (status :ExitStatus, log :BlobRef);
}

interface Fetcher {
  fetch @0 (url :Text, sha256 :Data, policy :FetchPolicy)
      -> (blob :BlobRef);
}

interface Archive {
  extract @0 (archive :BlobRef, policy :ExtractPolicy)
      -> (tree :Namespace);
}

Important policy fields:

  • network allowed or denied;
  • wall-clock and CPU budgets;
  • maximum output bytes;
  • allowed executable namespaces;
  • allowed output path policy;
  • whether timestamps are normalized;
  • whether symlinks are preserved, rejected, or translated;
  • whether hardlinks become store references or copied files.
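As a sketch, those policy fields could surface to the planner as a plain record. The field names and defaults below mirror the bullet list and are illustrative, not the actual schema:

```python
# Illustrative shape for SandboxPolicy; fields mirror the policy list above.
# Names and defaults are assumptions, not the capos schema.
from dataclasses import dataclass, field

@dataclass
class SandboxPolicy:
    allow_network: bool = False
    wall_clock_secs: int = 3600
    cpu_secs: int = 3600
    max_output_bytes: int = 1 << 30
    exec_namespaces: list = field(default_factory=list)  # allowed executable namespaces
    output_path_policy: str = "output-namespace-only"
    normalize_timestamps: bool = True
    symlink_policy: str = "translate"    # "preserve" | "reject" | "translate"
    hardlink_policy: str = "store-ref"   # "store-ref" | "copy"
```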

Store and Realm Mapping

IX’s /ix/store maps well to capOS Store.

IX’s realms should not be literal symlink trees in capOS. They should be named Namespace snapshots:

| IX concept | capOS mapping |
| --- | --- |
| /ix/store/&lt;uid&gt;-name | Store object/tree with stable content hash and metadata |
| build output dir | write-once output namespace |
| build temp dir | scratch namespace with cleanup policy |
| realm | named Namespace snapshot |
| symlink from realm to output | Namespace binding or bind manifest |
| hardlinked source cache | Store reference or copy-on-write blob binding |
| touch output sentinel | build-result metadata, optionally synthetic file for compatibility |

This preserves IX’s reproducibility model without importing global Unix authority.
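The content-addressing idea behind that mapping can be sketched in a few lines; the key format here is an illustrative stand-in for IX's `<uid>-name` convention:

```python
# Derive a stable content-addressed store key from object bytes.
# The hash-prefix length and separator are illustrative choices.
import hashlib

def store_key(content: bytes, name: str) -> str:
    uid = hashlib.sha256(content).hexdigest()[:16]
    return f"{uid}-{name}"

# Identical bytes always map to the same key, so rebuilds deduplicate;
# any change to the input bytes yields a new key.
```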

Process and Filesystem Requirements

A mature capOS needs these primitives before IX builds can run natively:

  • ProcessSpawner and ProcessHandle;
  • argv/env/cwd/stdin/stdout/stderr passing;
  • exit status;
  • pipes or stream capabilities;
  • fd-table support in the POSIX layer for ported tools;
  • read-only input namespaces;
  • writable scratch namespaces;
  • write-once output namespaces;
  • directory listing, create, rename, unlink, and metadata;
  • symlink translation or explicit rejection policy;
  • hardlink translation or store-reference fallback;
  • monotonic time;
  • resource limits;
  • cancellation.

For package builds, the tool surface is larger than IX’s Python surface:

  • sh;
  • find, sed, grep, awk, sort, xargs, install, cp, mv, rm, ln, chmod, touch, cat;
  • tar, gzip, xz, zstd, zip, unzip;
  • make, cmake, ninja, meson, pkg-config;
  • C compiler/linker/archive tools;
  • cargo and Rust toolchains;
  • Go toolchain;
  • Python only for packages that build with Python.

IX’s static-linking bias helps because the early tool closure can be imported as statically linked binaries.

What to Patch Out of IX

For a clean capOS fit, patch or replace these upstream assumptions:

| Upstream assumption | capOS replacement |
| --- | --- |
| subprocess.run everywhere | BuildSandbox.run() or ProcessSpawner |
| process groups and SIGKILL | ProcessHandle.killTree() or sandbox cancellation |
| fcntl stdout flag reset | remove or make no-op |
| chrt, nice | scheduler/resource policy on sandbox |
| sudo, su, chown | no permission-bit authority; use capability grants |
| unshare, tmpfs, jail | BuildSandbox with explicit caps |
| /ix/store global path | Store capability plus namespace mount view |
| /ix/realm symlink tree | Namespace snapshot/publish |
| hardlinks for fetched files | Store refs or copy fallback |
| curl/wget subprocess fetch | Fetcher service |
| Python tarfile/zipfile | Archive service |
| asyncio executor | BuildCoordinator scheduler |

This is more invasive than a “light patch”, but it is cleaner. The IX package corpus and target/build knowledge are preserved; Unix process plumbing is not.

MicroPython Port Scope

The MicroPython port should be sized around IX planner needs plus general system scripting:

Native modules:

  • capos: bootstrap capabilities, typed capability calls, errors.
  • ixcapos: package graph and build-service client bindings.
  • ixtemplate: template render calls if the renderer is an embedded Rust/C component.
  • ixstore: Store and Namespace helpers.

Python/micro-library requirements:

  • json;
  • hashlib;
  • base64 or binascii;
  • os.path subset;
  • random;
  • time;
  • small shutil subset for path operations if old IX code remains;
  • small asyncio only if planner concurrency remains in Python.

Avoid implementing:

  • general subprocess;
  • general fcntl;
  • full signal;
  • full multiprocessing;
  • full tarfile;
  • full zipfile;
  • full ssl/urllib3;
  • full Jinja2.

Those are symptoms of preserving the wrong boundary.

CPython Still Has a Role

CPython remains useful even if it is not a capOS prerequisite:

  • run upstream IX on the development host;
  • compare rendered descriptors from CPython/Jinja2 against ix-template;
  • generate fixtures for the capOS renderer;
  • bootstrap the first static tool closure;
  • serve as a later optional POSIX compatibility demo.

Differential testing should be explicit:

flowchart LR
    Pkg[IX package] --> Cpy[Host CPython + Jinja2]
    Pkg --> Cap[capOS ix-template]
    Cpy --> A[descriptor A]
    Cap --> B[descriptor B]
    A --> Diff[normalized diff]
    B --> Diff
    Diff --> Corpus[compatibility corpus]

This makes CPython a test oracle, not a trusted runtime dependency inside capOS.
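The normalization step can be sketched on toy data. The key=value line format below is an illustrative stand-in, not IX's real descriptor encoding; the point is only that both renderers' output is canonicalized before comparison:

```rust
// Toy sketch of the "normalized diff" step: canonicalize two rendered
// descriptors and report lines that appear in only one of them. The
// key=value format is an assumption for illustration.
fn normalize(descriptor: &str) -> Vec<String> {
    let mut lines: Vec<String> = descriptor
        .lines()
        .map(|l| l.trim().to_string()) // ignore indentation differences
        .filter(|l| !l.is_empty())     // ignore blank-line differences
        .collect();
    lines.sort();                      // ignore ordering differences
    lines
}

/// Lines present in exactly one of the two normalized descriptors.
fn diff(a: &str, b: &str) -> Vec<String> {
    let (na, nb) = (normalize(a), normalize(b));
    let mut out = Vec::new();
    for l in &na {
        if !nb.contains(l) { out.push(format!("- {l}")); }
    }
    for l in &nb {
        if !na.contains(l) { out.push(format!("+ {l}")); }
    }
    out
}
```

A real harness would normalize the actual JSON/script output before diffing, but the oracle structure is the same: any non-empty diff becomes a compatibility-corpus entry.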

Staged Plan

Stage A: Host IX builds capOS artifacts

Run IX on Linux host first. Add a capos target and recipes for static capOS ELFs. This validates package metadata, target triples, linker flags, and static closure assumptions before capOS hosts any of it.

Outputs:

  • x86_64-unknown-capos target model in IX;
  • recipes for libcapos, capos-rt, shell/coreutils candidates, MicroPython, and archive/fetch helpers;
  • static artifacts imported into the boot image or Store.

Stage B: Template compatibility harness

Build ix-template on the host. Render a package corpus through CPython/Jinja2 and through ix-template. Normalize JSON/script output and record divergences.

Outputs:

  • supported IX template subset;
  • custom filter implementation;
  • fixture corpus;
  • list of unsupported packages or constructs.

Stage C: Native MicroPython port

Port MicroPython to capOS as a normal native userspace program using capos-rt and a small libc/POSIX subset only where needed.

Outputs:

  • REPL or script runner;
  • frozen IX planner modules;
  • native capos, ixcapos, and ixtemplate modules;
  • no promise of full CPython compatibility.

Stage D: BuildCoordinator and sandboxed execution

Implement capOS-native build services and run simple package builds using externally supplied static tools.

Outputs:

  • build graph execution;
  • per-build scratch/output namespaces;
  • deterministic logs and output commits;
  • cancellation and resource policies.

Stage E: IX package corpus migration

Patch IX templates for capOS target semantics. Start with simple C/static packages, then Rust, then Go.

Outputs:

  • C/static package subset;
  • regular Rust package support once the corresponding runtime/toolchain work is ready;
  • Go package support when GOOS=capos or imported Go toolchain support is credible;
  • WASI packages as a separate target family where useful.

Stage F: Self-hosting

Run the IX-capOS appliance inside capOS to rebuild a meaningful part of its own userspace closure.

Outputs:

  • build the MicroPython IX planner inside capOS;
  • build core shell/coreutils/archive tools inside capOS;
  • build libcapos and selected static service binaries;
  • eventually build Rust and Go runtime/toolchain pieces.

Why This Is Better Than “CPython First”

The CPython-first route optimizes for running upstream IX quickly. The MicroPython-plus-services route optimizes for capOS’s actual design:

  • capability authority stays typed and explicit;
  • build isolation is native instead of Linux namespace emulation;
  • Store/Namespace are first-class rather than hidden behind /ix;
  • fetch/archive/build operations are auditable services;
  • the scripting runtime remains small;
  • the system does not need full CPython before it can have a package manager;
  • CPython can still be added later through the POSIX layer without blocking IX-capOS.

The tradeoff is that IX-capOS becomes a real port/fork at the control-plane boundary. That is acceptable for a clean capability-native fit.

Risks

Template compatibility is the main technical risk. IX uses a restricted-looking Jinja subset, but exact self.block(), super(), whitespace, expression, and undefined-value behavior must match closely enough for package hashes to remain stable. This needs corpus testing, not confidence.

Build-script compatibility is the largest scope risk. Even if IX planning is native, the package corpus still executes conventional build systems. capOS must provide enough shell, coreutils, archive, compiler, and filesystem behavior for those tools.

Toolchain bootstrapping is a long dependency chain. The first useful IX-capOS system will import statically linked tools from a host. Native self-hosting is late-stage work.

Store semantics need care around directories, symlinks, hardlinks, mtimes, and executable bits. These details affect build reproducibility and package compatibility.

MicroPython must not grow into a bad CPython clone. If many missing modules are implemented only to satisfy upstream IX assumptions, the design boundary has failed.

Recommendation

Adopt IX as a package corpus and build model, not as a CPython/POSIX program to preserve unchanged.

The optimal capOS-native solution is:

  1. Host-side upstream IX remains available for bootstrap and oracle tests.
  2. ix-template in Rust renders the actual IX template subset.
  3. Native MicroPython runs the adapted IX planner/control plane.
  4. capOS services execute all authority-bearing operations: fetch, extract, build sandbox, Store commit, Namespace publish, and process lifecycle.
  5. CPython is deferred to general POSIX compatibility and optional tooling.

This makes MicroPython the sweet spot for the in-system IX control plane while avoiding the trap of turning MicroPython into CPython.

Research: Out-of-Kernel Scheduling

A survey of whether capOS can move the CPU scheduler implementation out of the kernel: which parts are normally kept privileged, and which policy has been moved to user-space services or loadable policy modules in prior systems.

Scope

“User-space scheduler” is an overloaded term. The question here is narrower than language/runtime scheduling: can the OS CPU scheduler itself be moved out of the kernel?

This report separates the relevant models:

| Model | Schedules | Kernel sees | Examples |
|---|---|---|---|
| User-controlled kernel scheduling | Kernel threads / scheduling contexts | Privileged mechanism plus user policy inputs | L4 user-level scheduling, seL4 MCS, ARINC partition schedulers on seL4 |
| Dynamic in-kernel policy | Kernel threads | Policy loaded from user space but executed in kernel | Linux sched_ext, Ekiben, Bossa |
| Whole-machine core arbitration | Cores granted to applications/runtimes | Kernel threads pinned, parked, or revoked | Arachne, Shenango, Caladan |
| In-process M:N runtime | Goroutines, virtual threads, fibers, async tasks | A smaller set of OS threads | Go, Java virtual threads, Erlang, Tokio |
| User-level thread package | User-level threads or tasklets | One or more kernel execution contexts | Capriccio, Argobots |
| Kernel-assisted two-level runtime scheduling | User threads plus kernel events | Virtual processors / activations | Scheduler activations, Windows UMS |

The common boundary in prior systems is: the kernel allocates protected execution resources, handles blocking and preemption, and enforces isolation. User space supplies domain policy: which goroutine, actor, task, request, or coroutine runs next.

Feasibility Assessment

Moving the entire scheduler out of the kernel is not viable in a protected, preemptive system if “scheduler” means the code that runs on timer interrupts, chooses an immediately runnable kernel thread, saves/restores CPU state, changes page tables, updates per-CPU state, and enforces CPU-time isolation. That mechanism is part of the CPU protection boundary.

Moving scheduler policy out of the kernel is viable. A capOS-like kernel can act as a small CPU driver that enforces runnable-state invariants, capability-authorized scheduling contexts, budgets, priorities, CPU affinity, timeout faults, and IPC donation. A privileged user-space scheduler service can own admission control, budgets, priorities, placement, CPU partitioning, and service-specific policy.

The design point supported by the surveyed systems is not “no scheduler in kernel.” It is “minimal kernel dispatch and enforcement, user-space policy.”

Executive Conclusions

  1. The next-thread dispatch path is normally kept in kernel mode. It runs when the current user process may be untrusted, blocked, faulting, or out of budget.
  2. User space can own policy if the kernel exposes scheduling contexts as capability-controlled CPU-time objects. Thread creation and thread handles should follow the same capability-first model.
  3. Consulting a user-space scheduler server on every timer tick adds context switches to the hottest path and creates a bootstrap problem when the scheduler server itself is not runnable.
  4. seL4 MCS is the most directly comparable model: scheduling contexts are explicit objects, budgets are enforced by the kernel, and passive servers can run on caller-donated scheduling contexts.
  5. L4 user-level scheduling experiments show that user-directed scheduling is possible, with reported overhead from 0 to 10 percent compared with a pure in-kernel scheduler for their workload. That is plausible for policy changes, not for every dispatch decision.
  6. seL4 user-mode partition schedulers show the downside: a prototype partitioned scheduler measured substantial overhead because each scheduling event crosses the user/kernel boundary.
  7. sched_ext and Ekiben are useful evidence for pluggable scheduler policy, but they still execute policy in or near the kernel. They do not prove that the dispatch mechanism can be a normal user process.
  8. Whole-machine core arbiters such as Arachne, Shenango, and Caladan support a different split: the kernel still schedules threads, while a privileged control plane grants, revokes, and places cores at coarser granularity.
  9. Direct-switch IPC and scheduling-context donation reduce the priority inversion and dispatch-overhead risks that appear when capability servers are scheduled only by per-process priorities.
  10. Pure M:1 user-level threads are insufficient for capOS as the only threading story. They are fast, but one blocking syscall, page fault wait, or long CPU loop can stall unrelated user threads unless every blocking operation is converted to async form.
  11. M:N runtimes need a small OS contract: capability-created kernel threads, TLS/FS-base state, capability-authorized futex-style wait/wake, monotonic timers, async I/O/event notification, and a way to detect or avoid kernel blocking.
  12. Scheduler activations solved the right conceptual problem but exposed a complicated upcall contract. A capability OS can get most of the benefit with simpler primitives: async capability rings, notification objects, futexes, and explicit thread objects.
  13. Work-stealing with per-worker local queues is the dominant general-purpose runtime design. It gives locality and scale, but it needs explicit fairness guards and I/O polling integration.
  14. SQPOLL-style polling is a scheduling decision. It trades a core for lower submission latency and depends on SMP plus explicit CPU ownership.
  15. A generic language scheduler in the kernel is a separate design from out-of-kernel CPU policy. Go, Rust async, actor runtimes, and POSIX layers need kernel mechanisms that let them implement their own policy.

Privileged Mechanisms

The following responsibilities are mechanism, not policy. Moving them to a normal user process either breaks protection or puts a user/kernel round trip on the critical path:

  • Save and restore CPU register context.
  • Switch page tables / address spaces.
  • Update per-CPU current-thread state, kernel stack, TSS/RSP0, and syscall stack state.
  • Handle timer interrupts and IPIs.
  • Maintain a safe runnable/blocked/exited state machine.
  • Enforce CPU budgets and preempt a thread that exceeds its budget.
  • Choose an emergency runnable thread when the policy owner is dead, blocked, or malicious.
  • Run idle and halt safely when no runnable work exists.
  • Integrate scheduling with blocking syscalls, page faults, futex waits, and IPC wakeups.
  • Preserve invariants under SMP races.

These are exactly the parts currently concentrated in kernel/src/sched.rs and the x86 context-switch path. They can be simplified and made more generic, but they remain required somewhere privileged.

Policy Surface

The following are policy examples that can be owned by a privileged user-space service once scheduling contexts exist:

  • Admission control: which process/thread is allowed to consume CPU time.
  • Priority assignment and dynamic priority changes.
  • Budget/period selection for temporal isolation.
  • CPU affinity and CPU partitioning decisions.
  • Core grants for SQPOLL, device polling, network stacks, and latency-sensitive services.
  • Overload handling policy.
  • Per-service or per-tenant fair-share policy.
  • Instrumentation-driven tuning.
  • Runtime-specific hints, such as “latency-sensitive”, “batch”, “driver”, or “poller”.

This split gives a capOS-like system policy freedom while preserving a small, auditable kernel CPU mechanism.

Viable Architectures

1. Minimal Kernel Scheduler Plus User Policy Service

This is one capOS-compatible design point.

The kernel implements:

  • Thread states and per-CPU run queues.
  • Priority/budget-aware dispatch.
  • Scheduling-context objects.
  • Timer-driven budget accounting.
  • Timeout faults or notifications.
  • Capability-checked operations to bind/unbind scheduling contexts to threads.
  • Emergency fallback policy.

A user-space sched service implements:

  • System policy loaded from the boot manifest.
  • Resource partitioning between services.
  • Priority/budget updates.
  • CPU pinning and SQPOLL grants.
  • Diagnostics and policy reload.

The policy service is invoked on configuration changes and timeout faults, not on every context switch.
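The division of labor can be sketched as a pick-next routine whose eligibility rules (priorities, budgets) are configured from user space but whose execution stays in the kernel, including the emergency fallback. All types here are illustrative, not capOS kernel definitions:

```rust
// Sketch of kernel-side dispatch: policy inputs come from user space,
// but the selection itself never leaves the kernel. Illustrative types,
// not the actual contents of kernel/src/sched.rs.
#[derive(Clone, Copy, PartialEq)]
enum State { Runnable, Blocked, Exited }

struct Thread { id: u64, priority: u16, remaining_budget_ns: u64, state: State }

/// Pick the highest-priority runnable thread that still has budget.
/// If policy leaves nothing eligible (e.g. the policy owner misconfigured
/// every budget), fall back to any runnable thread, then to idle.
fn pick_next(run_queue: &[Thread], idle_id: u64) -> u64 {
    run_queue.iter()
        .filter(|t| t.state == State::Runnable && t.remaining_budget_ns > 0)
        .max_by_key(|t| t.priority)
        .or_else(|| run_queue.iter().find(|t| t.state == State::Runnable)) // emergency policy
        .map(|t| t.id)
        .unwrap_or(idle_id) // nothing runnable: run idle
}
```

The emergency branch is what keeps the system alive when the user-space policy service is dead, blocked, or malicious.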

2. seL4-MCS-Style Scheduling Contexts

seL4 MCS makes CPU time a first-class kernel object. A thread needs a scheduling context to run. A scheduling context carries budget, period, and priority. The kernel enforces the budget with a sporadic-server model. Passive servers can block without their own scheduling context; callers donate their scheduling context through synchronous IPC, and the context returns on reply.

This maps directly to capOS:

SchedContext {
    budget_ns
    period_ns
    priority
    cpu_mask
    remaining_budget
    timeout_endpoint
}

Kernel responsibilities:

  • Enforce budget and period.
  • Dispatch runnable threads with eligible scheduling contexts.
  • Donate and return contexts across direct-switch IPC.
  • Notify user space on timeout or depletion.

User-space responsibilities:

  • Create and distribute scheduling-context capabilities.
  • Decide budgets and priorities.
  • Build passive service topologies.
  • React to timeout faults.

This moves scheduling policy out without moving the hot dispatch mechanism out.
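The enforcement half of this split can be sketched as budget accounting on the scheduling context above. This is a deliberate simplification: it refills the whole budget once per period, whereas seL4 MCS maintains a sporadic-server replenishment list:

```rust
// Budget-enforcement sketch for the SchedContext above. Simplified:
// the full budget is replenished once per period, rather than seL4's
// per-chunk sporadic-server replenishments.
struct SchedContext {
    budget_ns: u64,
    period_ns: u64,
    remaining_ns: u64,
    period_start_ns: u64,
}

impl SchedContext {
    /// Charge consumed CPU time against the context. Returns true when
    /// the context is depleted and the kernel should preempt the thread
    /// and signal the timeout endpoint.
    fn charge(&mut self, now_ns: u64, consumed_ns: u64) -> bool {
        if now_ns - self.period_start_ns >= self.period_ns {
            self.period_start_ns = now_ns;     // new period: replenish
            self.remaining_ns = self.budget_ns;
        }
        self.remaining_ns = self.remaining_ns.saturating_sub(consumed_ns);
        self.remaining_ns == 0
    }
}
```

User space chooses budget_ns and period_ns through capability calls; the kernel runs charge() on the timer path, which is why that half cannot move out.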

3. Hierarchical User-Level Scheduler

L4 research evaluated exporting scheduling to user level through a hierarchical user-level scheduling architecture. The reported application overhead was 0 to 10 percent compared with a pure in-kernel scheduler in their evaluation, and the design enabled user-directed scheduling.

This is possible, but the cost model is sensitive:

  • Every policy decision that requires a scheduler-server round trip is expensive.
  • The scheduler server needs guaranteed CPU time, or the system can deadlock.
  • Faults and interrupts still need kernel fallback.
  • SMP multiplies races around run queues, CPU ownership, and migration.

This architecture is viable for coarse-grained partition scheduling, VM scheduling, or policy control. As a first general dispatch path, it has higher latency and bootstrap risk than an in-kernel dispatcher.

4. Dynamic In-Kernel Policy

Linux sched_ext lets user space load BPF scheduler programs, but the policy runs inside the kernel scheduler framework. The kernel preserves integrity by falling back to the fair scheduler if the BPF scheduler errors or stalls runnable tasks. Ekiben similarly targets high-velocity Linux scheduler development with safe Rust policies, live upgrade, and userspace debugging.

This model is a later-stage option for dynamic scheduler experiments, but it is not “scheduler in user space.” It also adds verifier/runtime complexity.

5. Core Arbiter / Resource Manager

Arachne, Shenango, and Caladan move high-level core allocation decisions out of the ordinary kernel scheduler path. Applications or runtimes know which cores they own, while an arbiter grants and revokes cores based on load or interference.

This model is useful for capOS after SMP:

  • grant cores to NIC drivers, network stacks, or SQPOLL workers;
  • revoke poller cores under CPU pressure;
  • isolate latency-sensitive services from batch work;
  • expose CPU ownership through capabilities.

It does not remove the kernel dispatcher. It changes the granularity of policy from “which thread next” to “which service owns this CPU budget.”

Classic Problem: Kernel Threads vs User Threads

The scheduler activations paper is still the cleanest statement of the core problem: kernel threads integrate correctly with blocking and preemption, while user-level threads offer cheaper context switching and better policy control. The failure mode of user-level threads layered naively on kernel threads is that kernel events are hidden from the runtime: a kernel thread can block in the kernel while runnable user threads exist, and the kernel can preempt a kernel thread without telling the runtime which user thread was stopped.

Scheduler activations address this by giving each address space a “virtual multiprocessor.” The kernel allocates processors to address spaces and vectors events to the user scheduler when processors are added, preempted, blocked, or unblocked. The activation is both an execution context and a notification vehicle.

The lesson for capOS is not to copy the full activation API. The durable idea is the split:

  • Kernel owns physical CPU allocation, protection, preemption, and blocking.
  • Runtime owns which application-level work item runs on a granted execution context.
  • Kernel-visible blocking must create a runtime-visible event, or it must be avoided by making the operation async.

For capOS, async capability rings already avoid many blocking syscalls. The remaining hard cases are futex waits, page faults that require I/O, synchronous IPC, and preemption of long-running runtime tasks.

Runtime Schedulers in Practice

Go

Go uses an M:N scheduler with three central concepts:

  • G: goroutine.
  • M: worker thread.
  • P: processor token required to execute Go code.

The Go runtime distributes runnable goroutines over worker threads, keeps per-P queues for scalability, uses global queues and netpoller integration for fairness and I/O, and parks/unparks OS threads conservatively to avoid wasting CPU. Its own source comments call out why centralized state and direct handoff were rejected: centralization hurts scalability, while eager handoff hurts locality and causes thread churn.

Preemption is mixed. Go has synchronous safe points and asynchronous preemption using OS mechanisms such as signals. The runtime can only safely stop a goroutine at points where stack and register state can be scanned.

Implications for capOS:

  • Initial GOOS=capos can run with GOMAXPROCS=1 and cooperative preemption, but useful Go requires kernel threads, futexes, FS-base/TLS, a monotonic timer, and an async network poller.
  • A signal clone is not strictly required if capOS provides a runtime-visible timer/preemption notification and the Go port accepts cooperative-first behavior.
  • The kernel must schedule threads, not processes, before Go can use multiple cores.

Java Virtual Threads

JDK virtual threads use M:N scheduling: many virtual threads are mounted on a smaller number of platform threads. The default scheduler is a FIFO-mode work-stealing ForkJoinPool; the platform thread currently carrying a virtual thread is called its carrier.

The design is intentionally not pure cooperative scheduling from the application’s perspective: most JDK blocking operations unmount the virtual thread, freeing the carrier. But some operations pin the virtual thread to the carrier, notably native calls and some synchronized regions. The JEP also notes that the scheduler does not currently implement CPU time-sharing for virtual threads.

Implications for capOS:

  • “Blocking” compatibility requires library/runtime cooperation, not just a scheduler. The runtime needs blocking operations to yield carriers.
  • Native calls and pinned regions remain a general M:N hazard. capOS cannot make that disappear in the kernel.

Tokio and Rust Async Executors

Tokio represents the async executor model rather than stackful green threads. Tasks run until they return Poll::Pending, so fairness depends on cooperative yield points and wakeups. Tokio’s multi-thread scheduler uses one global queue, per-worker local queues, work stealing, an event interval for I/O/timer checks, and a LIFO slot optimization for locality.

Implications for capOS:

  • A capos-rt async executor can integrate capability-ring completions, notification objects, and timers as wake sources.
  • A cooperative budget is mandatory. A future that never awaits can monopolize a worker until kernel preemption takes the whole OS thread away.
  • A single global CQ per process can become an executor bottleneck if many worker threads consume completions. Per-thread or sharded wake queues are likely needed after SMP.
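The per-worker-queue-plus-stealing shape can be sketched single-threaded; the LIFO slot, the global queue, and the atomics that make this safe across threads are all omitted, and the names are illustrative:

```rust
use std::collections::VecDeque;

// Single-threaded sketch of the work-stealing shape: each worker pops
// from its own queue and, when empty, steals roughly half of a victim's
// tasks. A real executor needs lock-free queues and a global fallback.
struct Worker { local: VecDeque<u32> }

impl Worker {
    fn pop(&mut self) -> Option<u32> { self.local.pop_front() }

    /// Steal about half of the victim's tasks, oldest first, so the
    /// victim keeps its most recently pushed (cache-warm) work.
    fn steal_from(&mut self, victim: &mut Worker) {
        let n = victim.local.len() / 2;
        for _ in 0..n {
            if let Some(task) = victim.local.pop_front() {
                self.local.push_back(task);
            }
        }
    }
}
```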

Erlang/BEAM

BEAM schedulers run lightweight Erlang processes on scheduler threads. The runtime exposes scheduler count and binding controls, and Erlang processes are preempted by reductions rather than OS timer slices. This shows a different point in the design space: the language VM owns fairness because it controls execution of bytecode.

Implications for capOS:

  • Managed runtimes can implement stronger fairness than native async libraries because they control instruction dispatch or compiler-inserted safe points.
  • Native Rust/C userspace cannot rely on that unless the compiler/runtime inserts yield or safe-point checks.
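Reduction counting can be sketched as an interpreter loop that charges every operation against a slice budget. The loop below is a toy, not BEAM's actual dispatch; it only shows why a VM that controls instruction dispatch can enforce fairness that native code cannot:

```rust
// Reduction-counting sketch: the VM charges each operation against a
// per-slice budget and stops executing when the budget is spent, forcing
// a reschedule. Native code has no such charge point, which is the
// limitation noted above.
const REDUCTIONS_PER_SLICE: u32 = 4;

/// Execute ops until the process finishes or its reductions run out.
/// Returns how many ops ran in this slice; the caller requeues the
/// process if ops remain.
fn run_slice(ops: &mut Vec<&str>, mut reductions: u32) -> usize {
    let mut executed = 0;
    while reductions > 0 {
        match ops.pop() {
            Some(_op) => { executed += 1; reductions -= 1; } // run one op
            None => break,                                   // process done
        }
    }
    executed
}
```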

Capriccio and Argobots

Capriccio showed that a user-level thread package can scale to very high concurrency by combining cooperative user-level threads, asynchronous I/O, O(1) thread operations, linked stacks, and resource-aware scheduling. The important lesson is that the thread abstraction can survive high concurrency when the runtime controls stacks and blocking.

Argobots generalizes lightweight execution units into user-level threads and tasklets over execution streams. It is designed as a substrate for higher-level systems such as OpenMP and MPI, with customizable schedulers. This is directly relevant to capOS because it argues for low-level runtime mechanisms, not one global scheduling policy.

Lithe

Lithe targets composition of parallel libraries. Its thesis is that a universal task abstraction or one global scheduler does not compose well when multiple parallel libraries are nested. Instead, physical hardware threads are shared through an explicit resource interface, while each library keeps its own task representation and scheduling policy.

Implications for capOS:

  • Avoid oversubscription by making CPU grants visible to user space.
  • A future CpuSet or scheduling-context capability could let runtimes know how much parallelism they are actually allowed to use.
  • Nested runtimes benefit from the ability to donate or yield execution resources without going through a process-global policy singleton.

Kernel Interfaces That Matter

Futexes

Futexes are the standard split-lock design: user space does the uncontended fast path with atomics, and the kernel only participates to sleep or wake threads. Linux also has priority-inheritance futex operations for cases where the kernel must manage lock-owner priority propagation.

For capOS:

  • Implement futex as a capability-authorized primitive. Do not assume generic Cap’n Proto method encoding is acceptable for the hot path; measure it against a compact operation before fixing the ABI.
  • Key futex wait queues by (address_space, user_virtual_address) for private futexes. Shared-memory futexes eventually need a memory-object identity plus offset.
  • Support timeout against monotonic time first. Requeue and PI futexes can wait.
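The split-lock shape can be sketched with the classic three-state mutex. The capability-authorized wait/wake primitive does not exist yet in capOS, so the slow path below spins as a stand-in at exactly the points where futex_wait/futex_wake would be invoked:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Futex-style split lock: the uncontended path is pure user-space
// atomics; the kernel would only be entered at the marked transitions.
// States: 0 = unlocked, 1 = locked, 2 = locked with waiters.
struct FutexMutex { state: AtomicU32 }

impl FutexMutex {
    fn lock(&self) {
        if self.state.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed).is_ok() {
            return; // fast path: no kernel involvement
        }
        // Slow path: mark contended, then sleep until the state is free.
        while self.state.swap(2, Ordering::Acquire) != 0 {
            std::hint::spin_loop(); // stand-in for futex_wait(&state, 2)
        }
    }

    fn unlock(&self) {
        if self.state.swap(0, Ordering::Release) == 2 {
            // Waiters may exist: futex_wake(&state, 1) would run here.
        }
    }
}
```

The key property is visible in the states: a lock that never sees contention never leaves user space, which is why the kernel primitive only needs sleep and wake.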

Restartable Sequences

Linux rseq lets user space maintain per-CPU data without heavyweight atomics and lets a thread cheaply read its current CPU/node. The current kernel docs also describe scheduler time-slice extensions for short critical sections.

For capOS:

  • rseq-style current-CPU access becomes useful after SMP and per-CPU run queues.
  • It is not a first threading prerequisite. Futex, TLS, and kernel threads come first.
  • If added, expose a small per-thread ABI page with cpu_id, node_id, and an abort-on-migration critical-section protocol.
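Such an ABI page might look like the following; the layout and field names are assumptions for illustration, not an existing capOS interface:

```rust
// Hypothetical per-thread ABI page for rseq-style support. The kernel
// would update cpu_id/node_id on dispatch and migration; user space
// reads them with plain loads and uses the abort fields to implement
// the abort-on-migration critical-section protocol.
#[repr(C)]
struct ThreadAbiPage {
    cpu_id: u32,      // current CPU, written by the kernel on dispatch
    node_id: u32,     // NUMA node of cpu_id
    cs_abort_ip: u64, // abort target if migrated inside a critical section
    cs_start_ip: u64, // critical-section start address, 0 when inactive
}
```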

io_uring SQPOLL

SQPOLL moves submission from syscall-driven to polling-driven. A kernel thread polls the submission queue and submits work as soon as userspace publishes SQEs. This reduces submission latency and syscall overhead for sustained I/O, but it burns CPU and needs careful affinity.

capOS already has an io_uring-inspired capability ring, so the analogy is direct:

  • Current tick-driven ring processing is correct for a toy system but couples invocation latency to timer frequency.
  • A kernel-side SQ polling thread interacts badly with single-CPU systems. On a single CPU it competes with the application it is supposed to accelerate.
  • Make SQPOLL a scheduling/capability decision: the process donates or is granted a CPU budget for the poller.
  • Completion handling remains a separate problem. A runtime still needs to poll CQs or block on notifications.
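The poller's core loop can be sketched with mock queues. The idle-park threshold mirrors io_uring's idea of an SQPOLL idle timeout, after which user space must explicitly wake the poller; all names here are illustrative:

```rust
use std::collections::VecDeque;

// SQPOLL-style poller sketch: drain the submission queue without
// syscalls, and park after an idle threshold so a granted core is not
// burned forever. Plain VecDeque stands in for the shared-memory ring.
struct Poller {
    sq: VecDeque<u64>,  // submission entries published by user space
    idle_spins: u32,    // consecutive empty polls
    needs_wakeup: bool, // set when parked; user space must kick the poller
}

const IDLE_LIMIT: u32 = 3;

impl Poller {
    /// One poll iteration. Returns the entry submitted, if any.
    fn poll_once(&mut self) -> Option<u64> {
        match self.sq.pop_front() {
            Some(sqe) => { self.idle_spins = 0; Some(sqe) } // submit sqe
            None => {
                self.idle_spins += 1;
                if self.idle_spins >= IDLE_LIMIT {
                    self.needs_wakeup = true; // park; release the CPU grant
                }
                None
            }
        }
    }
}
```

The needs_wakeup flag is what turns SQPOLL into a scheduling decision: while it is set, submission latency falls back to an explicit wakeup, and the CPU budget can go elsewhere.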

sched_ext

Linux sched_ext is not a normal user-level thread scheduler. It is a scheduler class whose behavior is defined by BPF programs loaded from user space. The kernel docs emphasize that sched_ext can be enabled and disabled dynamically, can group CPUs freely, and falls back to the default scheduler if the BPF scheduler misbehaves. The docs also warn that the scheduler API has no stability guarantee.

For capOS:

  • The relevant idea is safe, dynamically replaceable policy with kernel integrity fallback.
  • Copying the BPF ABI is not required. capOS can get a smaller version through privileged scheduler-policy capabilities later.
  • Keep early scheduling policy in kernel Rust until the invariants are clear.

Whole-Machine User-Space/Core Schedulers

Arachne

Arachne is a user-level thread system for very short-lived threads. It is core-aware: applications know which cores they own and control placement of work on those cores. A central arbiter reallocates cores among applications. The published results report strong memcached and RAMCloud improvements, and the implementation requires no Linux kernel modifications.

Takeaway: user-level scheduling gets much better when the runtime has explicit core ownership. Blindly creating more kernel threads and hoping the OS scheduler does the right thing is a weaker contract.

Shenango

Shenango targets datacenter services with microsecond-scale tail-latency goals. It uses kernel-bypass networking and an IOKernel on a dedicated core to steer packets and reallocate cores across applications every 5 microseconds. The key policy is rapid core reallocation based on whether queued work is waiting long enough to imply congestion.

Takeaway: a dedicated scheduling/control core can be worthwhile when latency SLOs are tighter than normal kernel scheduling reaction times. It is expensive and only justified for sustained latency-sensitive workloads.

Caladan

Caladan extends the idea from load to interference. It uses a centralized scheduler core and kernel module to monitor and react to memory hierarchy and hyperthread interference at microsecond scale. Its main claim is that static partitioning of cores, caches, and memory bandwidth is neither necessary nor sufficient for rapidly changing workloads.

Takeaway: CPU scheduling is not only “which runnable thread next.” On modern machines it is also placement relative to caches, sibling SMT threads, memory bandwidth, and bursty workload phase changes.

Design Axes

| Axis | Options | Practical conclusion |
|---|---|---|
| Stack model | Stackless tasks, segmented/growing stacks, fixed stacks | Rust async uses stackless futures; Go/Java need runtime-managed stacks; POSIX threads need fixed or growable user stacks |
| Preemption | Cooperative, safe-point, signal/upcall, timer-forced OS preemption | Kernel preemption alone protects the system; runtime fairness needs safe points or cooperative budgets |
| Blocking | Convert all operations to async, add carriers, kernel upcalls | Async caps reduce blocking; Go/POSIX still need kernel threads and futexes |
| Queueing | Global queue, per-worker queues, work stealing, priority queues | Per-worker queues plus stealing are the default; add global fairness escape hatches |
| CPU ownership | Invisible OS scheduling, affinity hints, explicit CPU grants | Explicit grants matter for high-performance runtimes and SQPOLL |
| Cross-process calls | Queue through scheduler, direct switch, scheduling donation | Direct switch and scheduling-context donation reduce sync IPC overhead and inversion |
| Isolation | Best-effort fairness, priorities, budget/period contexts | Cloud-oriented capOS eventually needs budget/period scheduling contexts |

capOS Design Options

Option: Minimal Kernel Mechanism Plus User Policy

This option keeps dispatch and enforcement in the kernel, replaces the current round-robin process scheduler with a minimal kernel CPU mechanism, and moves policy to user space through scheduling-context capabilities.

The kernel side covers:

  • dispatching the next runnable thread on each CPU;
  • enforcing budget/period/priority invariants;
  • handling interrupts, blocking, wakeups, and exits;
  • direct-switch IPC and scheduling-context donation;
  • an emergency fallback policy.

The user-space scheduler service covers:

  • policy configuration from the manifest;
  • per-service budgets, periods, priorities, and CPU masks;
  • admission control for new processes and threads;
  • SQPOLL/core grants;
  • response to timeout faults and overload telemetry.

This gives a capOS-like system the exokernel/microkernel benefit of policy freedom without putting a user-space server on the context-switch fast path.

Possible Implementation Sequence

  1. Thread scheduler in kernel. Convert from process scheduling to thread scheduling, with per-thread kernel stack, saved registers, FS base, and shared process address space/cap table.
  2. Scheduling contexts. Add kernel objects that carry budget, period, priority, CPU mask, and timeout endpoint. Initially assign one default context per thread.
  3. ThreadSpawner and ThreadHandle capabilities. Expose thread creation and lifecycle through capabilities from the start. Bootstrap grants init the initial authority; init or a scheduler service delegates it under quota.
  4. Scheduling-context donation for IPC. Baseline direct-switch IPC handoff exists for blocked Endpoint receivers; add budget/priority donation and return once scheduling contexts exist.
  5. User-space policy service. Let init or a sched service create and update scheduling contexts via capabilities.
  6. SMP core ownership. After per-CPU run queues and TLB shootdown exist, allow the scheduler service to manage CPU masks and SQPOLL/poller grants.
  7. Optional dynamic policy. Much later, consider sched_ext-like policy modules if Rust/verifier infrastructure exists. This is not a prerequisite.

Minimal Kernel API Sketch

interface SchedulerControl {
    createContext @0 (budgetNs :UInt64, periodNs :UInt64, priority :UInt16)
        -> (context :SchedulingContext);
    setCpuMask @1 (context :SchedulingContext, mask :Data) -> ();
    bind @2 (thread :ThreadHandle, context :SchedulingContext) -> ();
    unbind @3 (thread :ThreadHandle) -> ();
    setTimeoutEndpoint @4 (context :SchedulingContext, endpoint :Endpoint) -> ();
    stats @5 (context :SchedulingContext) -> (consumedNs :UInt64, throttled :Bool);
}

interface SchedulingContext {
    yieldTo @0 (thread :ThreadHandle) -> ();
    consumed @1 () -> (consumedNs :UInt64);
}

interface ThreadSpawner {
    create @0 (
        entry :UInt64,
        stackTop :UInt64,
        arg :UInt64,
        context :SchedulingContext,
        flags :UInt64
    ) -> (thread :ThreadHandle);
}

interface ThreadHandle {
    join @0 (timeoutNs :UInt64) -> (status :Int64);
    exitCode @1 () -> (exited :Bool, status :Int64);
    bind @2 (context :SchedulingContext) -> ();
}

The hot path does not invoke these methods; they are control-plane operations.
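The intended flow through these control-plane methods can be sketched as a toy model: init (or a sched service) creates a context, binds a thread to it, and polls stats. The Python classes below mimic the sketched capnp interfaces; none of this is real capOS code.

```python
# Toy model of the SchedulerControl control-plane flow sketched above.

class SchedulingContextCap:
    def __init__(self, budget_ns, period_ns, priority):
        self.budget_ns = budget_ns
        self.period_ns = period_ns
        self.priority = priority
        self.cpu_mask = b"\x01"   # default: CPU 0 only
        self.consumed_ns = 0
        self.throttled = False

class ThreadHandleCap:
    def __init__(self):
        self.context = None       # unbound until SchedulerControl.bind

class SchedulerControlCap:
    def create_context(self, budget_ns, period_ns, priority):
        return SchedulingContextCap(budget_ns, period_ns, priority)

    def set_cpu_mask(self, ctx, mask):
        ctx.cpu_mask = mask

    def bind(self, thread, ctx):
        thread.context = ctx

    def unbind(self, thread):
        thread.context = None

    def stats(self, ctx):
        return ctx.consumed_ns, ctx.throttled
```

In the real system each of these calls would be a capnp method invocation through the capability ring; only the binding survives into the dispatch fast path.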

Dependency: In-Process Threading

Kernel threads inside a process are a prerequisite for sophisticated user-level thread support:

  • Thread object with saved registers, per-thread kernel stack, user stack pointer, FS base, state, and parent process reference.
  • Scheduler runs threads, not processes.
  • Process owns address space and cap table; threads share both.
  • Process context switch saves/restores FS base today; thread scheduling must make that state per-thread.
  • Thread creation is exposed first as a ThreadSpawner capability; bootstrap grants initial authority to init, and later policy delegates it through the capability graph.
  • Thread exit reclaims the thread stack and wakes joiners if join exists.

This directly unblocks Go phase 2, POSIX pthread compatibility, native thread-local storage, and any multi-worker Rust async executor.
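The thread/process split in the list above can be modeled minimally: threads carry per-thread register state and FS base, while the process owns the address space and capability table that all its threads share. This is an illustrative sketch, not kernel code.

```python
# Toy model of the thread/process split: per-thread FS base is
# saved/restored on switch; address space and cap table are shared.

class Process:
    def __init__(self):
        self.address_space = object()   # stands in for shared page tables
        self.cap_table = {}             # shared capability table

class Thread:
    def __init__(self, process, entry, stack_top):
        self.process = process
        self.regs = {"rip": entry, "rsp": stack_top}
        self.fs_base = 0                # per-thread TLS base
        self.kernel_stack = bytearray(4096)
        self.state = "runnable"

def context_switch(cpu, next_thread):
    """Save the outgoing thread's FS base, restore the incoming one's."""
    prev = cpu.get("current")
    if prev is not None:
        prev.fs_base = cpu["fs_base_msr"]
    cpu["fs_base_msr"] = next_thread.fs_base
    cpu["current"] = next_thread
```

The key change from process scheduling is exactly this: FS base and register state move from the process to the thread, while the cap table stays process-wide.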

Dependency: Futex and Timer

A minimal capability-authorized futex primitive has this shape:

futex_wait(futex_space, uaddr, expected, timeout_ns) -> Result
futex_wake(futex_space, uaddr, max_count) -> usize

Required semantics:

  • wait checks that *uaddr == expected while holding the futex wait-lock equivalent, then blocks the current thread.
  • wake makes up to max_count waiters runnable.
  • Timeouts use monotonic ticks or a timer wheel/min-heap.
  • Return values must distinguish woken, timed out, interrupted, and value mismatch.

The authority should be capability-based from the start, for example through a FutexSpace, WaitSet, or memory-object-derived capability. The encoding is still a measurement question. Generic capnp capability calls may be acceptable if measured overhead is close to a compact operation; otherwise futex should use a dedicated compact capability-authorized operation because the primitive sits on the runtime parking path.
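The wait/wake semantics above can be captured in a small single-threaded model: wait checks the word against the expected value and enqueues the caller, wake dequeues up to max_count waiters. Result names and the FutexSpace shape are illustrative; timeouts and interruption are omitted.

```python
# Toy model of the futex semantics above. Memory is a dict of words;
# waiters queue in FIFO order per address.

WOKEN, TIMED_OUT, MISMATCH = "woken", "timed_out", "mismatch"

class FutexSpace:
    def __init__(self):
        self.waiters = {}   # uaddr -> list of waiting thread ids

    def wait(self, mem, uaddr, expected, thread_id):
        # Value check and enqueue must be atomic with respect to wake
        # (the "wait-lock equivalent"); this model is single-threaded.
        if mem[uaddr] != expected:
            return MISMATCH          # caller retries its fast path
        self.waiters.setdefault(uaddr, []).append(thread_id)
        return None                  # blocked; outcome delivered on wake

    def wake(self, uaddr, max_count):
        queue = self.waiters.get(uaddr, [])
        woken, self.waiters[uaddr] = queue[:max_count], queue[max_count:]
        return woken                 # kernel marks these runnable
```

The mismatch return is what makes user-space mutexes race-free: if the lock word changed between the failed CAS and the wait, the waiter retries instead of sleeping through a wakeup.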

Measure this before fixing the ABI:

  • CAP_OP_NOP: ring validation plus CQE post, with no cap lookup or capnp.
  • Empty and small NullCap calls through normal cap lookup, method dispatch, capnp param decode, and capnp result encode.
  • Futex-shaped compact operation carrying cap_id, uaddr, expected, and timeout/max_count, initially returning without blocking.
  • Later, real blocking paths: failed wait, wake with no waiters, wait-to-block, wake-to-runnable, and wake-to-resume.

The useful decision is not “capability or syscall”; it is “generic capnp method or compact capability-authorized scheduler primitive.” Authority remains in the capability model either way.

Near Term: Runtime Event Integration

For capos-rt, design the executor around kernel completion sources:

  • Capability-ring CQ entries wake tasks waiting on cap invocations.
  • Notification objects wake tasks waiting on interrupts, timers, or service events.
  • Futex wakes resume parked worker threads.
  • Timers can be integrated as wakeups instead of periodic polling.

The executor policy can start simple:

  • One worker per kernel thread.
  • Local FIFO queue per worker.
  • One global injection queue.
  • Work stealing when local and global queues are empty.
  • Cooperative operation budget, then requeue.
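The five policy points above fit in a compact toy model: local FIFO per worker, a global injection queue, tail-stealing when both are empty, and a cooperative budget per scheduling round. This is an illustrative sketch, not capos-rt code.

```python
# Toy model of the starter executor policy: local FIFO queues, one
# global injection queue, work stealing, cooperative budget.

from collections import deque

class Executor:
    def __init__(self, workers, budget=4):
        self.local = [deque() for _ in range(workers)]
        self.global_q = deque()
        self.budget = budget

    def inject(self, task):
        self.global_q.append(task)          # from outside any worker

    def next_task(self, worker):
        if self.local[worker]:
            return self.local[worker].popleft()
        if self.global_q:
            return self.global_q.popleft()
        for victim in range(len(self.local)):
            if victim != worker and self.local[victim]:
                return self.local[victim].pop()  # steal from the tail
        return None                          # park on futex/CQ wakeup

    def run_worker(self, worker):
        done = []
        for _ in range(self.budget):        # cooperative budget, then requeue
            task = self.next_task(worker)
            if task is None:
                break
            done.append(task)
        return done
```

Stealing from the victim's tail while the victim pops from its head keeps contention low and preserves rough FIFO order for the victim's own work.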

Stage 6: IPC Scheduling

For synchronous IPC, direct switch was introduced before priority scheduling:

  • If client A calls server B and B is blocked in receive, switch A -> B directly without picking an unrelated runnable thread. This is implemented for the current single-CPU Endpoint path.
  • Mark A blocked on reply.
  • Future fastpath work can transfer a small message inline; use shared buffers for large data.

Scheduling-context donation then adds the budget/priority transfer:

  • The server runs the request using the caller’s scheduling context.
  • The caller’s budget covers client + server work.
  • Passive servers can exist without independent CPU budget and only run when a caller donates one.

This avoids priority inversion through the capability graph and matches the service architecture better than per-process priorities alone.
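Direct switch plus donation can be modeled in a few lines: on call, the server borrows the caller's scheduling context (a reference, not a copy); on reply, the served time is charged to that context. A passive server never needs a context of its own. Names are illustrative, not capOS code.

```python
# Toy model of direct-switch IPC with scheduling-context donation:
# the caller's budget pays for the server's work on its behalf.

class Context:
    def __init__(self, budget_ns):
        self.budget_ns = budget_ns
        self.consumed_ns = 0

class Thread:
    def __init__(self, name, context=None):
        self.name = name
        self.context = context   # None => passive server, no own budget
        self.state = "runnable"
        self.donated = None

def call(client, server):
    """Direct switch: client blocks on reply, server borrows its context."""
    assert server.state == "blocked_in_receive"
    client.state = "blocked_on_reply"
    server.donated = client.context     # donation is a reference, not a copy
    server.state = "running"

def reply(server, client, served_ns):
    server.donated.consumed_ns += served_ns  # charged to the caller
    server.donated = None
    server.state = "blocked_in_receive"
    client.state = "runnable"
```

Because the server runs on the caller's context, a low-priority caller cannot make a shared server preempt a high-priority caller's work, which is the priority-inversion property claimed above.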

Stage 7: SMP and Core Ownership

Once per-CPU scheduler queues exist, these become policy surfaces:

  • CPU affinity depends on correct migration and TLB shootdown.
  • A CpuSet or SchedulingContext capability can describe allowed CPUs, budget, period, and priority.
  • Cheap current-CPU exposure depends on a stable per-thread ABI page.
  • SQPOLL can be gated on available CPU budget to avoid unlimited poller creation.

Risks and Failure Modes

  • M:1 green threads do not provide Go or POSIX compatibility by themselves.
  • A normal user-space process choosing the next thread on every timer tick puts a context-switch round trip on the hot path.
  • Recovery from scheduler-service failure cannot depend solely on the scheduler service being runnable.
  • A Go-like G/M/P scheduler in the kernel couples language runtime policy to the kernel.
  • Generic Cap’n Proto capability calls may be too heavy for every synchronization primitive. Measure generic calls against compact capability-authorized operations before fixing the futex ABI.
  • sched_ext-like dynamic policy loading depends on mature scheduler invariants and verifier/runtime machinery.
  • SQPOLL on a single-core system can compete with the application it is meant to accelerate.

Open Questions

  1. Does capOS need scheduler-activation-style upcalls? Async caps and notification objects cover many of the same cases with less machinery.
  2. How can runtime preemption work without Unix signals? Options are cooperative-only, timer notification to a runtime handler, or a kernel forced safe-point ABI. Cooperative-only preemption is a workable first option for Go support.
  3. How are shared-memory futex keys represented? Private futexes can key on address space and virtual address. Shared futexes need memory-object identity and offset.
  4. Should futex wait/wake use generic capnp capability methods or a compact capability-authorized operation? The answer should come from the measurement plan above, not from assumption.
  5. How much policy belongs in the boot manifest versus a long-running sched service? Static embedded systems can use manifest policy. Cloud or developer systems need runtime policy updates.
  6. What is the emergency fallback if the scheduler service exits? Options are a tiny kernel round-robin fallback for privileged recovery threads, a pinned immortal scheduler thread, or panic. The first is the only robust development choice.
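Open question 3 has a concrete shape that is easy to illustrate: a private key is (address space, virtual address), while a shared key resolves the virtual address through the mapping to (memory object, offset), so two processes mapping the same object at different addresses still collide on the same futex. The mapping layout below is a hypothetical sketch.

```python
# Toy model of futex key representation for open question 3.

def private_key(asid, vaddr):
    """Private futex: keyed on the address space itself."""
    return ("private", asid, vaddr)

def shared_key(mappings, asid, vaddr):
    """Shared futex: resolve vaddr to (memory object, offset).
    mappings: asid -> list of (base, length, memobj_id, obj_offset)."""
    for base, length, memobj_id, obj_offset in mappings[asid]:
        if base <= vaddr < base + length:
            return ("shared", memobj_id, obj_offset + (vaddr - base))
    raise KeyError("no mapping covers vaddr")
```

In capOS terms, memobj_id would be the identity of the memory-object capability backing the mapping, which is exactly the information a per-address-space key cannot carry.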

Source Notes

  • Anderson et al., “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism” (SOSP 1991): https://polaris.imag.fr/vincent.danjean/papers/anderson.pdf
  • “Towards Effective User-Controlled Scheduling for Microkernel-Based Systems” (L4 user-level scheduling): https://os.itec.kit.edu/21_738.php
  • Åsberg and Nolte, “Towards a User-Mode Approach to Partitioned Scheduling in the seL4 Microkernel”: https://www.es.mdh.se/pdf_publications/2641.pdf
  • Kang et al., “A User-Mode Scheduling Mechanism for ARINC653 Partitioning in seL4”: https://link.springer.com/chapter/10.1007/978-981-10-3770-2_10
  • L4Re overview: https://l4re.org/doc/l4re_intro.html
  • Liedtke, “On micro-kernel construction”: https://elf.cs.pub.ro/soa/res/lectures/papers/lietdke-1.pdf
  • seL4 MCS tutorial: https://docs.sel4.systems/Tutorials/mcs.html
  • seL4 design principles: https://microkerneldude.org/2020/03/11/sel4-design-principles/
  • Linux kernel sched_ext documentation: https://www.kernel.org/doc/html/next/scheduler/sched-ext.html
  • Arun et al., “Agile Development of Linux Schedulers with Ekiben”: https://arxiv.org/abs/2306.15076
  • Williams, “An Implementation of Scheduler Activations on the NetBSD Operating System” (USENIX 2002): https://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa.html
  • Microsoft, “User-Mode Scheduling”: https://learn.microsoft.com/en-us/windows/win32/procthread/user-mode-scheduling
  • Go runtime scheduler source: https://go.dev/src/runtime/proc.go
  • Go preemption source: https://go.dev/src/runtime/preempt.go
  • OpenJDK JEP 444, “Virtual Threads”: https://openjdk.org/jeps/444
  • Tokio runtime scheduling documentation: https://docs.rs/tokio/latest/tokio/runtime/
  • von Behren et al., “Capriccio: Scalable Threads for Internet Services” (SOSP 2003): https://web.stanford.edu/class/archive/cs/cs240/cs240.1046/readings/capriccio-sosp-2003.pdf
  • Argobots paper page: https://www.anl.gov/argonne-scientific-publications/pub/137165
  • Argobots project: https://www.argobots.org/
  • Pan et al., “Lithe: Enabling Efficient Composition of Parallel Libraries” (HotPar 2009): https://www.usenix.org/legacy/event/hotpar09/tech/full_papers/pan/pan_html/
  • Linux futex(2) manual: https://man7.org/linux/man-pages/man2/futex.2.html
  • Linux kernel restartable sequences documentation: https://docs.kernel.org/userspace-api/rseq.html
  • io_uring_sqpoll(7) manual: https://manpages.debian.org/testing/liburing-dev/io_uring_sqpoll.7.en.html
  • Qin et al., “Arachne: Core-Aware Thread Management” (OSDI 2018): https://www.usenix.org/conference/osdi18/presentation/qin
  • Ousterhout et al., “Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads” (NSDI 2019): https://www.usenix.org/conference/nsdi19/presentation/ousterhout
  • Fried et al., “Caladan: Mitigating Interference at Microsecond Timescales” (OSDI 2020): https://www.usenix.org/conference/osdi20/presentation/fried