capOS Documentation
capOS is a research operating system where every kernel service and every
cross-process service is a typed Cap’n Proto capability invoked through a
shared-memory ring. There is no ambient authority, no global path
namespace, and the only remaining syscalls are cap_enter and exit. The
current implementation boots on x86_64 QEMU, loads a Cap’n Proto boot
manifest, starts manifest-declared services, and exercises ring-native IPC,
capability transfer, and init-driven spawning through QEMU smoke binaries.
Use this book as the current system manual. It separates implemented behavior from proposals, research notes, and operational planning files. What capOS Is has the short version of what makes the design unusual.
Start Here
- What capOS Is describes the implemented system model and the main authority boundaries.
- Current Status lists what works today, what is partial, and what remains future work.
- Build, Boot, and Test gives the commands used to build the ISO, boot QEMU, and run host-side validation.
- Repository Map maps the main subsystems to source files.
Deeper References
- Capability Model explains the capability-table and invocation model.
- Authority Accounting records the current transfer/accounting design.
- DMA Isolation, Trusted Build Inputs, and Panic Surface Inventory cover security and verification inventories.
- Research Index links prior-art notes used to shape the design.
- mdBook Documentation Site Proposal defines the documentation structure and status vocabulary.
Operational planning still lives outside the book in ROADMAP.md,
WORKPLAN.md, and REVIEW_FINDINGS.md. Treat those as live planning and review
records, not stable architecture pages.
What capOS Is
A research kernel that boots on x86_64 QEMU. The rest of this page is about why it looks the way it does — the specific design bets behind the code — not a feature inventory. For the feature-by-feature matrix, see Current Status.
Status: Partially implemented.
What Makes capOS Different
capOS is a research vehicle for a few specific design bets. Each is unusual on its own; the combination is the point.
- Everything is a typed capability. System resources are accessed through Cap’n Proto interfaces defined in schema/capos.capnp. There is no ambient authority — no global path namespace, no open-by-name, no implicit inherit. A process can only invoke objects present in its local capability table.
- The interface IS the permission. Instead of a parallel READ/WRITE/EXEC rights bitmask (Zircon, seL4), attenuation is a narrower capability: a wrapper CapObject exposing fewer methods, or an Endpoint client facet that cannot RECV/RETURN. The kernel just dispatches; policy lives in interfaces. See Capability Model.
- io_uring-style shared-memory ring for every call. Every process owns a submission/completion queue page. Userspace writes SQEs with a normal memory store; the kernel processes them through cap_enter. New operations are SQE opcodes (CALL, RECV, RETURN, RELEASE, NOP), not new syscalls. The remaining syscall surface is cap_enter and exit.
- Release is transport, not an application method. Dropping the last owned handle in capos-rt submits a CAP_OP_RELEASE SQE; the kernel removes the slot. No close() method on every interface, no mutable table self-reference during dispatch.
- Capability transfer is first-class. Copy and move descriptors ride sideband on CALL/RETURN SQEs. Move reserves the sender slot until the receiver accepts and preflight checks pass, then commits or rolls back atomically — no lost, duplicated, or half-inserted authority.
- Cap’n Proto wire format end-to-end. The same encoding describes the boot manifest, runtime method calls, and future persistence/remote transparency. The CQE log is itself a serialized capnp message stream, which opens the door to record/replay, audit, and migration as OS primitives rather than external tooling.
- Host-testable pure logic. Cap-table, frame-bitmap, ELF parser, frame ledger, lazy buffers, and the ring model all live in capos-lib/capos-config and run under cargo test-lib, Miri, Loom, Kani, and proptest without any kernel scaffolding. Kernel glue stays thin.
- Schema-first boot. system.cue is compiled to a Cap’n Proto SystemManifest embedded as the single Limine boot module. The manifest carries binaries, capability grants, exports, badges, and restart metadata as typed structured data — not shell scripts or baked environment variables.
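The ring-native invocation model above can be sketched on the host. This is an illustrative mock, not the real capos-config ABI: the opcode values, field names, and the `Ring` type here are assumptions chosen only to show the shape of "new operations are opcodes, not syscalls".

```rust
// Illustrative host-side mock of the submission/completion ring model.
// Opcode set follows the documented transport operations; layout and
// status values are hypothetical.

#[derive(Clone, Copy, PartialEq, Debug)]
enum Opcode { Call, Recv, Return, Release, Nop }

#[derive(Clone, Copy)]
struct Sqe { opcode: Opcode, cap_slot: u16 }

#[derive(Clone, Copy, Debug, PartialEq)]
struct Cqe { cap_slot: u16, status: i32 }

#[derive(Default)]
struct Ring { sq: Vec<Sqe>, cq: Vec<Cqe> }

impl Ring {
    /// Userspace side: submitting is a plain memory store into the SQ page.
    fn submit(&mut self, sqe: Sqe) { self.sq.push(sqe); }

    /// Kernel side: a cap_enter-style drain. Adding an operation means
    /// adding a match arm for a new opcode, not a new syscall.
    fn cap_enter(&mut self) {
        for sqe in self.sq.drain(..) {
            let status = match sqe.opcode {
                Opcode::Nop => 0,
                Opcode::Release => 0, // kernel would free the cap-table slot here
                Opcode::Call | Opcode::Recv | Opcode::Return => 0,
            };
            self.cq.push(Cqe { cap_slot: sqe.cap_slot, status });
        }
    }
}

fn main() {
    let mut ring = Ring::default();
    ring.submit(Sqe { opcode: Opcode::Call, cap_slot: 3 });
    ring.submit(Sqe { opcode: Opcode::Release, cap_slot: 3 });
    ring.cap_enter();
    assert_eq!(ring.cq.len(), 2);
    println!("completions: {:?}", ring.cq);
}
```

The real ring is a fixed shared page with producer/consumer indices rather than growable vectors; the mock keeps only the opcode-dispatch idea.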
Execution Model
Each process owns an address space, a local capability table, a mapped
capability-ring page, and a read-only CapSet page that enumerates its
bootstrap handles. The kernel enters Ring 3 with iretq and returns through
cap_enter or the timer. Ordinary capability calls progress only via
cap_enter; timer-side polling handles non-CALL ring work and call targets
that are explicitly safe for interrupt dispatch. Details in
Process Model,
Capability Ring, and
Scheduling.
Boot Flow
The kernel receives exactly one Limine module — a Cap’n Proto
SystemManifest compiled from system.cue — validates it, loads the
referenced ELFs, builds per-service capability tables and CapSet pages, and
starts the scheduler. The default boot still wires the service graph in the
kernel; the selected milestone is to move generic manifest execution into
init through ProcessSpawner. Full walkthrough in
Boot Flow and
Manifest and Service Startup.
Authority Boundaries
Authority is carried by cap-table hold edges with generation-tagged
CapIds. Ring 0 ↔ Ring 3, capability table ↔ kernel object, endpoint IPC,
copy/move transfer, manifest/boot-package, and process spawn are the
boundaries reviewers care about; each one fails closed at hostile input. See
Trust Boundaries for the boundary table and
Authority Accounting for the
transfer and quota invariants.
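The generation-tagged CapIds mentioned above can be modeled in a few lines. This is a minimal sketch, not the capos-lib implementation: slot layout, field names, and the string payload are assumptions; the point is only that a stale CapId fails closed when its slot is reused.

```rust
// Illustrative generation-tagged capability table. A CapId names a slot plus
// the generation at grant time; releasing the slot bumps the generation, so
// an outstanding stale CapId can never reach the slot's new occupant.
// All names are hypothetical.

#[derive(Clone, Copy, PartialEq, Debug)]
struct CapId { slot: usize, generation: u32 }

struct Slot { generation: u32, object: Option<&'static str> }

struct CapTable { slots: Vec<Slot> }

impl CapTable {
    fn new(n: usize) -> Self {
        CapTable { slots: (0..n).map(|_| Slot { generation: 0, object: None }).collect() }
    }

    fn insert(&mut self, object: &'static str) -> CapId {
        let slot = self.slots.iter().position(|s| s.object.is_none()).expect("table full");
        self.slots[slot].object = Some(object);
        CapId { slot, generation: self.slots[slot].generation }
    }

    /// Release bumps the generation so every outstanding CapId for this
    /// slot goes stale.
    fn release(&mut self, id: CapId) {
        if self.slots[id.slot].generation == id.generation {
            self.slots[id.slot].object = None;
            self.slots[id.slot].generation += 1;
        }
    }

    /// Fails closed: a stale generation never resolves to the reused slot.
    fn lookup(&self, id: CapId) -> Option<&'static str> {
        let s = &self.slots[id.slot];
        if s.generation == id.generation { s.object } else { None }
    }
}

fn main() {
    let mut table = CapTable::new(4);
    let old = table.insert("console");
    table.release(old);
    let new = table.insert("frame-allocator"); // reuses the freed slot
    assert_eq!(new.slot, old.slot);
    assert_eq!(table.lookup(old), None); // stale id fails closed
    assert_eq!(table.lookup(new), Some("frame-allocator"));
}
```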
What capOS Is Not
A POSIX clone, a microkernel-shaped Linux replacement, or a production OS. It is a place to try the above choices and see which ones survive contact with real workloads. See Build, Boot, and Test to run it.
Current Status
This page describes current repository behavior, not the full long-term design.
For operational priority and open review items, read WORKPLAN.md and
REVIEW_FINDINGS.md.
Implemented
Boot and Kernel Baseline
- Limine boots the x86_64 kernel in QEMU.
- The kernel initializes serial output, GDT, IDT, PIC, PIT, syscall MSRs, memory management, page tables, heap allocation, and the global capability registry.
- The kernel creates its own page tables with per-section permissions and keeps the higher-half direct map for physical memory access.
- SMEP/SMAP are enabled when the QEMU CPU advertises support.
Code: kernel/src/main.rs, kernel/src/arch/x86_64/, kernel/src/mem/.
Validation: cargo build --features qemu, make run.
Process and Userspace Runtime
- Processes have isolated address spaces, per-process kernel stacks, CapSet bootstrap pages, capability rings, and local capability tables.
- ELF loading supports static no_std userspace binaries and TLS setup.
- capos-rt owns the userspace entry path, allocator initialization, ring-client access, typed clients, result-cap parsing, and owned-handle release.
Code: kernel/src/spawn.rs, kernel/src/process.rs, capos-rt/src/,
init/src/main.rs, demos/.
Validation: make capos-rt-check, make run, make run-spawn.
Capability Ring and IPC
- The shared ring ABI supports CALL, RECV, RETURN, RELEASE, and NOP transport operations.
- cap_enter processes submissions and can block until completions arrive or a timeout expires.
- Endpoints route ring-native IPC between processes.
- Direct IPC handoff lets a blocked receiver run before unrelated round-robin work after a matching CALL arrives.
- Transport errors and application exceptions are surfaced through CQEs and typed runtime client errors.
Code: capos-config/src/ring.rs, kernel/src/cap/ring.rs,
kernel/src/cap/endpoint.rs, capos-rt/src/ring.rs,
capos-rt/src/client.rs.
Validation: cargo test-ring-loom, make run.
Capabilities
Implemented kernel capabilities include:
- Console for serial output.
- FrameAllocator for physical frame grants.
- Endpoint for IPC rendezvous.
- VirtualMemory for anonymous user page map, unmap, and protect operations.
- ProcessSpawner and ProcessHandle for init-driven child process creation and wait semantics.
Code: kernel/src/cap/console.rs, kernel/src/cap/frame_alloc.rs,
kernel/src/cap/endpoint.rs, kernel/src/cap/virtual_memory.rs,
kernel/src/cap/process_spawner.rs.
Validation: make run, make run-spawn, cargo test-lib.
Capability Transfer and Release
- IPC CALL and RETURN support sideband transfer descriptors.
- Copy and move transfer are implemented.
- Move transfer reserves the sender slot until destination insertion and commit.
- Transfer result caps carry interface ids to userspace.
- CAP_OP_RELEASE removes local capability-table slots and is integrated with runtime owned-handle drop.
Code: kernel/src/cap/transfer.rs, kernel/src/cap/ring.rs,
capos-lib/src/cap_table.rs, capos-rt/src/ring.rs.
Validation: cargo test-lib, make run.
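The reserve-then-commit discipline described above can be shown as a tiny state machine. This is a sketch, not the real kernel/src/cap/transfer.rs API: the types and method names are hypothetical, and a real transaction operates on cap-table slots rather than a single tracked value.

```rust
// Illustrative reserve/commit/rollback discipline for move transfer. The cap
// is modeled as a state machine so the "no lost, duplicated, or half-inserted
// authority" invariant is visible at every step. All names are hypothetical.

#[derive(Debug, PartialEq)]
enum SenderSlot { Owned(&'static str), Reserved(&'static str), Empty }

struct MoveTx { sender: SenderSlot, receiver: Option<&'static str> }

impl MoveTx {
    /// Reserve: the sender can no longer use the cap, but it is not yet gone.
    fn begin(cap: &'static str) -> Self {
        MoveTx { sender: SenderSlot::Reserved(cap), receiver: None }
    }
    /// Receiver accepted and preflight checks passed: exactly one owner after.
    fn commit(&mut self) {
        if let SenderSlot::Reserved(cap) = std::mem::replace(&mut self.sender, SenderSlot::Empty) {
            self.receiver = Some(cap);
        }
    }
    /// Preflight failed: authority returns to the sender, nothing is lost.
    fn rollback(&mut self) {
        if let SenderSlot::Reserved(cap) = std::mem::replace(&mut self.sender, SenderSlot::Empty) {
            self.sender = SenderSlot::Owned(cap);
        }
    }
}

fn main() {
    // Preflight failure: rollback restores the sender slot.
    let mut tx = MoveTx::begin("endpoint");
    tx.rollback();
    assert_eq!(tx.sender, SenderSlot::Owned("endpoint"));
    assert_eq!(tx.receiver, None);

    // Receiver accepts: commit moves the cap exactly once.
    let mut tx = MoveTx::begin("endpoint");
    tx.commit();
    assert_eq!(tx.sender, SenderSlot::Empty);
    assert_eq!(tx.receiver, Some("endpoint"));
}
```

In every reachable state the cap has exactly one owner or is explicitly reserved, which is the atomicity claim the transfer section makes.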
Manifest Tooling and Smokes
- tools/mkmanifest turns system.cue into a Cap’n Proto boot manifest.
- The build uses repo-pinned Cap’n Proto and CUE tool paths through the Makefile; direct mkmanifest invocation also rejects missing, unpinned, or version-mismatched CUE compilers.
- Default QEMU smoke services cover CapSet bootstrap, Console paths, ring corruption handling, reserved opcodes, NOP, ring fairness, TLS, VirtualMemory, FrameAllocator cleanup, Endpoint cleanup, and cross-process IPC.
- system-spawn.cue drives the ProcessSpawner smoke where init spawns endpoint, IPC, VirtualMemory, and FrameAllocator cleanup children and checks hostile spawn inputs.
Code: tools/mkmanifest/, system.cue, system-spawn.cue, demos/.
Validation: cargo test-mkmanifest, make generated-code-check, make run,
make run-spawn.
Partially Implemented
Init-Owned Service Startup
init can use ProcessSpawner in the spawn smoke, but default make run still
uses the kernel boot path to create all manifest services. The selected
milestone is to make default boot use init to validate and execute the service
graph.
Current blockers are tracked in WORKPLAN.md and REVIEW_FINDINGS.md.
Manifest schema-version guardrails, BootPackage authority exposure to init,
init-side manifest graph validation, ProcessSpawner badge attenuation, direct
manifest-tool CUE pin enforcement, generic manifest spawning, and child-local
FrameAllocator/VirtualMemory grants are in place for the spawn-manifest path.
Remaining milestone work is legacy kernel service-graph retirement for the
default boot path.
Hardware and Networking
The QEMU virtio-net path has legacy PCI config-space enumeration and a
make run-net boot target. A virtio-net driver, smoltcp integration, ICMP, and
TCP smoke are not implemented.
Code: kernel/src/pci.rs, kernel/src/arch/x86_64/pci_config.rs,
tools/qemu-net-harness.sh.
Validation: make run-net, make qemu-net-harness for the existing PCI smoke
path.
Security and Verification Track
The repo has Miri, proptest, fuzz, Loom, Kani, generated-code, dependency policy, trusted-build-input, panic-surface, and DMA-isolation work. Coverage is not complete for every trust boundary.
References: Trusted Build Inputs, Panic Surface Inventory, DMA Isolation, and Security and Verification Proposal.
Future Work
Future architecture includes generic manifest execution in init, service restart policy, capability-scoped system monitoring, notification objects, promise pipelining, epoch revocation, shared-buffer capabilities, scheduling-context donation, session quotas, SMP, storage and naming, userspace networking, cloud boot support, user identity, policy enforcement, boot-to-shell authentication, text shell launch, and broader language/runtime support.
Design references:
- Service Architecture
- Storage and Naming
- Networking
- SMP
- Userspace Binaries
- Shell
- Boot to Shell
- System Monitoring
- User Identity and Policy
Build, Boot, and Test
The commands below are the current local workflow for the x86_64 QEMU target.
The root Cargo configuration defaults to x86_64-unknown-none, so host tests
must use the repo aliases instead of bare cargo test.
Prerequisites
Expected host tools:
- Rust nightly from rust-toolchain.toml
- make
- qemu-system-x86_64
- xorriso
- curl, sha256sum, and standard build tools for pinned tool downloads
- Go, used by the Makefile to install the pinned CUE compiler when needed
- Optional policy and proof tools for extended checks: cargo-deny, cargo-audit, cargo-fuzz, cargo-miri, and cargo-kani
The Makefile pins and verifies:
- Limine at the commit recorded in Makefile
- Cap’n Proto compiler version 1.2.0
- CUE version 0.16.0
Pinned tools are installed under the clone-shared .capos-tools directory next
to the git common directory.
Build the ISO
make
This builds:
- the kernel with the default bare-metal target;
- init as a standalone release userspace binary;
- release-built demo service binaries under demos/;
- the capos-rt-smoke binary;
- manifest.bin from system.cue;
- capos.iso with Limine boot files.
Relevant files: Makefile, limine.conf, system.cue, tools/mkmanifest/.
Boot QEMU
make run
This builds the ISO with the qemu feature, boots QEMU with serial on stdio,
and uses the isa-debug-exit device so a clean kernel halt exits QEMU with the
expected status.
The default smoke path should print kernel startup diagnostics, manifest
service creation, demo output, and final halt. Current default smokes include
CapSet bootstrap, capos-rt-smoke, Console paths, ring corruption recovery,
reserved opcode handling, NOP, ring fairness, TLS, VirtualMemory,
FrameAllocator cleanup, Endpoint cleanup, and cross-process IPC.
Spawn Smoke
make run-spawn
This boots with system-spawn.cue. Only init is boot-launched by the manifest;
init uses ProcessSpawner to launch endpoint, IPC, VirtualMemory, and
FrameAllocator cleanup demo children, wait for ProcessHandles, and exercise
hostile spawn inputs.
This is the current validation path for init-driven process creation. It is not yet the default manifest executor.
Networking and Measurement Targets
make run-net
make qemu-net-harness
make run-measure
- make run-net attaches a QEMU virtio-net PCI device and exercises current PCI enumeration diagnostics.
- make qemu-net-harness runs the scripted net smoke path.
- make run-measure enables the separate measure feature for benchmark-only counters and cycle measurements. Do not treat it as the normal dispatch build.
Formatting and Generated Code
make fmt
make fmt-check
make generated-code-check
- make fmt formats the kernel workspace plus the standalone init, demos, and capos-rt crates.
- make fmt-check verifies formatting without modifying files.
- make generated-code-check verifies checked-in Cap’n Proto generated code against the repo-pinned compiler path.
Host Tests
cargo test-config
cargo test-ring-loom
cargo test-lib
cargo test-mkmanifest
make capos-rt-check
- cargo test-config runs shared config, manifest, ring, and CapSet tests on the host target.
- cargo test-ring-loom runs the bounded Loom model for SQ/CQ protocol invariants.
- cargo test-lib runs host tests for pure shared logic such as ELF parsing, capability tables, frame allocation, and related property tests.
- cargo test-mkmanifest runs host tests for manifest generation.
- make capos-rt-check builds the standalone runtime smoke binary with the userspace relocation flags used by the boot image.
Extended Verification
make dependency-policy-check
make fuzz-build
make fuzz-smoke
make kani-lib
cargo miri-lib
These require optional tools. Use them when changing dependency policy, manifest parsing, ELF parsing, capability-table/frame logic, or proof-covered shared code. See the Security and Verification Proposal for the rationale behind the extended verification tiers.
Validation Rule
For behavior changes, a clean build is not enough. The relevant QEMU process
must exercise the behavior and print observable output that proves the path
works. make run is the default end-to-end gate; make run-spawn,
make run-net, or make run-measure are additional gates for their specific
features.
Repository Map
This map names the main source locations for the current system. It is not an ownership file; use it to find the code behind architecture and validation claims.
Root Files
- README.md gives the compact project overview.
- ROADMAP.md records long-range stages and broad feature direction.
- WORKPLAN.md records the current selected milestone and implementation ordering.
- REVIEW_FINDINGS.md records open review findings and verification history.
- REVIEW.md defines review expectations.
- Makefile builds pinned tools, userspace binaries, manifests, ISO images, QEMU targets, formatting checks, generated-code checks, and policy checks.
- rust-toolchain.toml pins the Rust toolchain.
- .cargo/config.toml sets the default bare-metal target and useful cargo aliases.
Schema and Shared ABIs
- schema/capos.capnp defines capability interfaces, manifest structures, exceptions, ProcessSpawner, ProcessHandle, and transfer-related schema.
- capos-config/src/manifest.rs defines the host and no_std manifest model.
- capos-config/src/ring.rs defines CapRingHeader, SQE/CQE structures, opcodes, flags, and transport error constants shared by kernel and userspace.
- capos-config/src/capset.rs defines the read-only bootstrap CapSet ABI.
- capos-config/src/cue.rs supports evaluated CUE-style manifest data.
- capos-config/tests/ring_loom.rs models bounded ring protocol behavior with Loom.
Validation: cargo test-config, cargo test-ring-loom,
make generated-code-check.
Shared Pure Logic
- capos-lib/src/elf.rs parses ELF64 images for kernel loading and host tests.
- capos-lib/src/cap_table.rs implements CapId, capability-table storage, stale-generation checks, grant preparation, transfer transaction helpers, commit, and rollback.
- capos-lib/src/frame_bitmap.rs implements the host-testable physical frame bitmap core.
- capos-lib/src/frame_ledger.rs tracks outstanding FrameAllocator grants.
- capos-lib/src/lazy_buffer.rs provides bounded lazy buffers used by ring scratch paths.
Validation: cargo test-lib, cargo miri-lib, make kani-lib, fuzz targets
under fuzz/fuzz_targets/.
Kernel
- kernel/src/main.rs is the boot entry point, hardware setup sequence, manifest parsing path, and boot-launched service creation path.
- kernel/src/spawn.rs loads user ELF images, creates process state, maps bootstrap pages, and enqueues spawned processes.
- kernel/src/process.rs defines Process, process states, kernel stacks, and initial userspace CPU context.
- kernel/src/sched.rs implements the single-CPU scheduler, timer-driven preemption, blocking cap_enter, direct IPC handoff, and deferred cancellation wakeups.
- kernel/src/serial.rs implements COM1 output and kernel print macros.
- kernel/src/pci.rs implements the current QEMU virtio-net PCI enumeration smoke path.
Validation: cargo build --features qemu, make run, make run-spawn,
make run-net.
Kernel Architecture
- kernel/src/arch/x86_64/gdt.rs sets up kernel/user segments and TSS state.
- kernel/src/arch/x86_64/idt.rs handles exceptions and timer interrupts.
- kernel/src/arch/x86_64/syscall.rs implements syscall MSR setup and entry.
- kernel/src/arch/x86_64/context.rs defines timer context-switch state.
- kernel/src/arch/x86_64/pic.rs and pit.rs configure legacy interrupt hardware.
- kernel/src/arch/x86_64/smap.rs enables SMEP/SMAP and brackets user memory access.
- kernel/src/arch/x86_64/tls.rs handles FS-base/TLS support.
- kernel/src/arch/x86_64/pci_config.rs provides legacy PCI config I/O.
Kernel Memory
- kernel/src/mem/frame.rs wraps the shared frame bitmap with Limine memory map initialization and global kernel access.
- kernel/src/mem/paging.rs manages page tables, address spaces, permissions, user mappings, W^X enforcement, and address-space teardown.
- kernel/src/mem/heap.rs initializes the kernel heap.
- kernel/src/mem/validate.rs validates user buffers before kernel access.
Related docs: DMA Isolation, Trusted Build Inputs.
Kernel Capabilities
- kernel/src/cap/mod.rs initializes kernel capabilities and resolves manifest service capability tables.
- kernel/src/cap/table.rs re-exports shared capability-table logic and owns the kernel-global table.
- kernel/src/cap/ring.rs validates and dispatches ring SQEs.
- kernel/src/cap/transfer.rs validates transfer descriptors and prepares transfer transactions.
- kernel/src/cap/endpoint.rs implements Endpoint CALL, RECV, RETURN, queued state, cleanup, and cancellation behavior.
- kernel/src/cap/console.rs implements the serial Console.
- kernel/src/cap/frame_alloc.rs implements FrameAllocator.
- kernel/src/cap/virtual_memory.rs implements per-process anonymous memory operations.
- kernel/src/cap/process_spawner.rs implements ProcessSpawner and ProcessHandle.
- kernel/src/cap/null.rs implements the measurement-only NullCap.
Related docs: Capability Model, Authority Accounting.
Userspace
- init/ is the standalone init process. In the spawn smoke, it uses ProcessSpawner, grants initial child capabilities, waits on ProcessHandles, and checks hostile spawn inputs.
- capos-rt/src/entry.rs owns the runtime entry path and bootstrap validation.
- capos-rt/src/alloc.rs initializes the userspace heap.
- capos-rt/src/syscall.rs provides raw syscall wrappers.
- capos-rt/src/capset.rs provides typed CapSet lookup helpers.
- capos-rt/src/ring.rs implements the safe single-owner ring client, out-of-order completion handling, transfer descriptor packing, and result-cap parsing.
- capos-rt/src/client.rs implements typed clients for Console, ProcessSpawner, and ProcessHandle.
- capos-rt/src/bin/smoke.rs is the runtime smoke binary packaged by the default manifest.
Validation: make capos-rt-check, make run, make run-spawn.
Demo Services
demos/ is a nested userspace smoke-test workspace. Each demo is a release-built
service binary packaged into the boot manifest:
- capset-bootstrap
- console-paths
- ring-corruption
- ring-reserved-opcodes
- ring-nop
- ring-fairness
- unprivileged-stranger
- tls-smoke
- virtual-memory
- frame-allocator-cleanup
- endpoint-roundtrip
- ipc-server
- ipc-client
Shared demo support lives in demos/capos-demo-support/src/lib.rs.
Validation: make run, make run-spawn.
Manifest and Tooling
- system.cue is the default manifest source.
- system-spawn.cue is the ProcessSpawner smoke manifest source.
- tools/mkmanifest/ evaluates manifest input, embeds binaries, validates manifest shape, and writes Cap’n Proto bytes.
- tools/check-generated-capnp.sh verifies checked-in generated schema output.
- tools/qemu-net-harness.sh runs the current QEMU net harness.
- fuzz/ contains fuzz targets for manifest Cap’n Proto decoding, mkmanifest JSON conversion/validation, and ELF parsing.
Validation: cargo test-mkmanifest, make generated-code-check,
make fuzz-build, make fuzz-smoke.
Documentation
- docs/capability-model.md is the current capability architecture reference.
- docs/*-design.md files record targeted implemented or accepted designs.
- docs/proposals/ contains accepted, future, exploratory, and rejected designs.
- docs/research.md and docs/research/ summarize prior art.
- docs/proposals/mdbook-docs-site-proposal.md defines the documentation site structure and status vocabulary used by these Start Here pages.
Boot Flow
Boot flow defines the trusted path from firmware-owned machine state to the first user processes. It establishes memory management, interrupt/syscall entry, capability tables, process rings, and the boot manifest authority graph.
Status: Partially implemented. Limine boot, kernel initialization, manifest parsing, ELF loading, process
creation, and QEMU halt-on-success are implemented. The current default boot
still lets the kernel interpret the whole service graph. The selected milestone
in WORKPLAN.md is to move manifest graph execution into init.
Current Behavior
Firmware loads Limine, Limine loads the kernel and exactly one module, and the
kernel treats that module as a Cap’n Proto SystemManifest. The kernel rejects
boots with any module count other than one.
kmain initializes serial output, x86_64 descriptor tables, memory, paging,
SMEP/SMAP, the kernel capability table, the idle process, PIC, and PIT. It then
parses and validates the manifest, loads each service ELF into a fresh
AddressSpace, builds per-service capability tables and read-only CapSet
pages, enqueues the processes, and starts the scheduler.
flowchart TD
Firmware[UEFI or QEMU firmware] --> Limine[Limine bootloader]
Limine --> Kernel[kmain]
Limine --> Module[manifest.bin boot module]
Kernel --> Arch[serial, GDT, IDT, syscall MSRs]
Kernel --> Memory[frame allocator, heap, paging, SMEP/SMAP]
Kernel --> Manifest[parse and validate SystemManifest]
Manifest --> Images[parse and map service ELFs]
Manifest --> Caps[build CapTables and CapSet pages]
Images --> Processes[create Process structs and rings]
Caps --> Processes
Processes --> Scheduler[start round-robin scheduler]
Scheduler --> User[enter first user process]
The invariant is that no user service starts until manifest binary references, authority graph structure, and bootstrap capability source/interface checks have passed.
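That invariant can be sketched as a validate-before-spawn ordering. The structures and check names below are hypothetical stand-ins for the real SystemManifest validation, kept only to show the fail-closed shape: if any check on the whole graph fails, zero services start.

```rust
// Illustrative fail-closed boot validation: no service is created unless
// every check over the whole manifest has passed. Types and the set of
// recognized grants are assumptions, not the real schema.

struct Service { name: &'static str, binary: &'static str, grants: Vec<&'static str> }
struct Manifest { binaries: Vec<&'static str>, services: Vec<Service> }

fn validate(m: &Manifest) -> Result<(), String> {
    for svc in &m.services {
        // Binary references must resolve inside the manifest itself.
        if !m.binaries.contains(&svc.binary) {
            return Err(format!("{}: unknown binary {}", svc.name, svc.binary));
        }
        // Bootstrap grants must name a known capability source.
        for g in &svc.grants {
            if !matches!(*g, "console" | "endpoint" | "frame-allocator") {
                return Err(format!("{}: unknown grant {}", svc.name, g));
            }
        }
    }
    Ok(())
}

fn boot(m: &Manifest) -> Result<Vec<&'static str>, String> {
    validate(m)?; // nothing is spawned if any check fails
    Ok(m.services.iter().map(|s| s.name).collect())
}

fn main() {
    let bad = Manifest {
        binaries: vec!["init"],
        services: vec![Service { name: "svc", binary: "missing", grants: vec![] }],
    };
    assert!(boot(&bad).is_err()); // fails closed: zero services started

    let good = Manifest {
        binaries: vec!["init"],
        services: vec![Service { name: "init", binary: "init", grants: vec!["console"] }],
    };
    assert_eq!(boot(&good).unwrap(), vec!["init"]);
}
```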
Design
The boot path is deliberately single-shot. The kernel receives a single packed manifest and validates the graph before creating any process. ELF parsing is cached per binary name, but each service gets its own address space, user stack, TLS mapping if present, ring page, and CapSet mapping.
The default manifest (system.cue) packages init, capos-rt-smoke, and the
demo services directly. The spawn manifest (system-spawn.cue) packages only
init as an initial service and grants it ProcessSpawner plus endpoint caps;
init then spawns selected child services.
Future behavior is narrower: the kernel should start only init with fixed
bootstrap authority and a manifest or boot-package capability. init should
validate and execute the service graph through ProcessSpawner.
Invariants
- Limine must provide exactly one boot module, and that module is the manifest.
- Manifest validation must complete before any declared service is enqueued.
- Service ELF load failures roll back frame allocations before boot continues or fails.
- Kernel page tables are active and HHDM user access is stripped before SMEP/SMAP are enabled.
- The kernel passes _start(ring_addr, pid, capset_addr) in RDI, RSI, and RDX.
- CapSet metadata is read-only user memory; the ring page is writable user memory.
- QEMU-feature boots halt through isa-debug-exit when no runnable processes remain.
Code Map
- kernel/src/main.rs - kmain, manifest module handling, validation, service image loading, process enqueue, halt path.
- kernel/src/spawn.rs - ELF-to-address-space loading, fixed user stack, TLS mapping, Process construction helpers.
- kernel/src/process.rs - process bootstrap context, ring page mapping, CapSet page mapping.
- kernel/src/cap/mod.rs - manifest capability resolution and CapSet entry construction.
- capos-config/src/manifest.rs - manifest decode, schema-version guardrails, and graph/source/binary validation.
- tools/mkmanifest/src/lib.rs - host-side manifest validation and binary embedding.
- system.cue and system-spawn.cue - default and spawn-focused boot graphs.
- limine.conf and Makefile - bootloader config, ISO construction, QEMU targets.
Validation
- make run validates the default manifest, kernel-side service startup, process creation, scheduler entry, and clean QEMU halt.
- make run-spawn validates init-owned spawning through ProcessSpawner for the current transition path.
- cargo test-config covers manifest decode, roundtrip, and validation logic.
- cargo test-mkmanifest covers host-side manifest conversion and embedding checks.
- make generated-code-check verifies checked-in Cap’n Proto generated output.
Open Work
- Move default service graph interpretation from the kernel into init.
- Retire kernel-side service graph wiring after default make run proves the init-owned path.
Process Model
The process model defines how capOS represents isolated user programs, how they receive authority, how they enter and leave the scheduler, and how a parent can observe a child.
Status: Partially implemented. Processes, isolated address spaces, ELF loading, fixed bootstrap ABI, exit cleanup, process handles, and init-driven child spawning are implemented. Restart policy, kill, generic post-spawn grants, and init-side manifest graph execution remain open.
Current Behavior
A Process owns a user address space, a per-process capability table, a ring
scratch area, a kernel stack, a saved CPU context, a mapped capability ring, and
an optional read-only CapSet page. Process IDs are assigned by an atomic counter.
ELF images are loaded into fresh user address spaces. PT_LOAD segments are
mapped with page permissions derived from ELF flags, the user stack is fixed at
0x40_0000, and PT_TLS data is mapped into a per-process TLS area below the
ring page. The process starts from a synthetic CpuContext that returns to
Ring 3 with iretq.
ProcessSpawner lets a holder spawn packaged boot binaries, grant selected
caps to the child, and receive a non-transferable ProcessHandle result cap.
ProcessHandle.wait either completes immediately for an already-exited child
or registers one waiter.
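The wait semantics just described (immediate completion for an already-exited child, otherwise at most one registered waiter) can be modeled directly. This is a sketch with hypothetical type and variant names, not the kernel's ProcessHandle implementation.

```rust
// Illustrative model of ProcessHandle.wait: an already-exited child completes
// immediately; otherwise exactly one waiter may register, and a second wait
// is rejected. Names and error shapes are hypothetical.

#[derive(Debug, PartialEq)]
enum WaitResult { Exited(i64), Registered, AlreadyWaited }

struct ChildState { exit_code: Option<i64>, waiter: Option<u64> }

impl ChildState {
    fn wait(&mut self, waiter_pid: u64) -> WaitResult {
        match self.exit_code {
            Some(code) => WaitResult::Exited(code), // immediate completion
            None if self.waiter.is_none() => {
                self.waiter = Some(waiter_pid);     // register the single waiter
                WaitResult::Registered
            }
            None => WaitResult::AlreadyWaited,      // second waiter is rejected
        }
    }

    /// On child exit the kernel completes the pending waiter, if any.
    fn exit(&mut self, code: i64) -> Option<u64> {
        self.exit_code = Some(code);
        self.waiter.take()
    }
}

fn main() {
    let mut child = ChildState { exit_code: None, waiter: None };
    assert_eq!(child.wait(1), WaitResult::Registered);
    assert_eq!(child.wait(2), WaitResult::AlreadyWaited);
    assert_eq!(child.exit(0), Some(1)); // pending waiter 1 is woken
    assert_eq!(child.wait(3), WaitResult::Exited(0));
}
```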
Design
Process construction separates image loading from capability-table assembly.
The kernel first maps all boot-launched service images, then builds capability
tables for all services so service-sourced caps can resolve against declared
exports. Spawned children use the same image loading and Process creation
helpers, but their grants are supplied by the calling process through
ProcessSpawner.
Each process starts with three machine arguments:
- RDI - fixed ring virtual address (RING_VADDR).
- RSI - process ID.
- RDX - fixed CapSet virtual address, or zero if no CapSet is mapped.
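The bootstrap handoff can be sketched as a plain function with the same contract. The real entry in capos-rt is a bare _start receiving these values in RDI, RSI, and RDX; here the contract is modeled as an ordinary extern "C" function so it runs on the host. The RING_VADDR value is a made-up placeholder; only the zero-means-no-CapSet rule follows the text above.

```rust
// Illustrative model of the three-argument bootstrap ABI. The address
// constant is hypothetical; the checks mirror what an entry path might do
// before touching any mapped page.

const RING_VADDR: usize = 0x0000_7000_0000_0000; // hypothetical fixed address

extern "C" fn bootstrap(ring_addr: usize, pid: u64, capset_addr: usize) -> bool {
    let ring_ok = ring_addr == RING_VADDR; // ring page must be at the fixed address
    let has_capset = capset_addr != 0;     // zero means no CapSet page was mapped
    println!("pid {pid}: ring_ok={ring_ok} has_capset={has_capset}");
    ring_ok
}

fn main() {
    assert!(bootstrap(RING_VADDR, 7, 0x5000));
    assert!(!bootstrap(0, 8, 0)); // wrong ring address is rejected
}
```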
Exit releases authority before the Process storage is dropped. The scheduler
switches to the kernel page table before address-space teardown, cancels
endpoint state for the exiting pid, completes any pending process waiter, and
defers the final process drop until execution is on another kernel stack.
Future process lifecycle work should keep authority transfer explicit: parents should not gain ambient access to child internals, and child grants should come from named caps plus interface checks.
Invariants
- A process cannot access a resource unless its local CapTable holds a cap.
- Bootstrap CapSet metadata is immutable from userspace.
- A stale CapId generation must not name a reused cap-table slot.
- ProcessSpawner raw grants require a copy-transferable cap or an endpoint owner cap; client-endpoint grants attenuate endpoint authority.
- ProcessSpawner kernel-source grants are limited to fresh child-local address-space-bound caps; they cannot be badged or exported from init.
- ProcessHandle caps are non-transferable.
- At most one waiter may be registered on a ProcessHandle.
- Process exit releases cap-table authority before the kernel stack frame is freed.
Code Map
- kernel/src/process.rs - Process, bootstrap CPU context, ring/CapSet mapping, exit capability cleanup.
- kernel/src/spawn.rs - ELF mapping, stack mapping, TLS mapping, process construction helpers.
- kernel/src/sched.rs - process table, process handles, wait completion, exit path.
- kernel/src/cap/process_spawner.rs - ProcessSpawnerCap, ProcessHandleCap, spawn grant validation, child-local kernel grants, child CapSet construction.
- capos-lib/src/cap_table.rs - CapId generation and cap-table operations.
- capos-config/src/capset.rs - fixed CapSet page ABI.
- schema/capos.capnp - ProcessSpawner, ProcessHandle, and CapGrant.
- init/src/main.rs - current init-side spawn smoke and hostile spawn checks.
Validation
- make run validates kernel-launched service processes, CapSet bootstrap, exit cleanup, and clean halt.
- make run-spawn validates ProcessSpawner, ProcessHandle.wait, child grants, init-spawned IPC demos, and hostile spawn failures.
- cargo test-lib covers CapTable generation, stale-slot, and transfer primitives.
- cargo test-config covers CapSet and manifest metadata used to build process grants.
- cargo build --features qemu verifies the kernel and QEMU-only paths compile.
Open Work
- Make default boot launch only
initand execute the validated service graph through init. - Add lifecycle operations such as kill and post-spawn grants only after their authority semantics are explicit.
- Implement restart policy outside the kernel-side static boot graph.
Capability Model
How capabilities work in capOS.
Status: Partially implemented. Generation-tagged cap tables, typed schema interface IDs, manifest/CapSet grants, badges, transport-level release, and Endpoint copy/move transfer are implemented. Revocation propagation, persistence, and bulk-data capabilities remain future work.
What is a Capability
A capability in capOS is a reference to a kernel object that carries:
- An interface (what methods can be called), defined by a Cap’n Proto schema
- A permission (the object it references, enforced by the kernel)
- A wire format (Cap’n Proto serialized messages for all invocations)
A process can only access a resource if it holds a capability to it. There is no ambient authority – no global namespace, no “open by path” syscall, no implicit resource access.
Schema as Contract
Capability interfaces are defined in .capnp schema files under schema/.
The schema is the canonical interface definition. Currently defined:
interface Console {
write @0 (data :Data) -> ();
writeLine @1 (text :Text) -> ();
}
interface FrameAllocator {
allocFrame @0 () -> (physAddr :UInt64);
freeFrame @1 (physAddr :UInt64) -> ();
allocContiguous @2 (count :UInt32) -> (physAddr :UInt64);
}
interface VirtualMemory {
map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
unmap @1 (addr :UInt64, size :UInt64) -> ();
protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
interface Endpoint {}
interface ProcessSpawner {
spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}
interface ProcessHandle {
wait @0 () -> (exitCode :Int64);
}
interface BootPackage {
manifestSize @0 () -> (size :UInt64);
readManifest @1 (offset :UInt64, maxBytes :UInt32) -> (data :Data);
}
# Management-only introspection. Ordinary handle release uses the system
# transport opcode CAP_OP_RELEASE, not a method here.
interface CapabilityManager {
list @0 () -> (capabilities :List(CapabilityInfo));
# grant is planned for Stage 6 (IPC and Capability Transfer)
}
Each interface has a unique 64-bit TYPE_ID generated by the Cap’n Proto
compiler. TYPE_ID is the schema constant. interface_id is the runtime
metadata used by CapSet/bootstrap descriptions and endpoint delivery headers.
Method dispatch uses the interface assigned to the capability entry plus
method_id; method_id selects a method inside that schema.
This is not capability identity. A CapId is the authority-bearing handle in
a process table, analogous to an fd. Multiple capabilities can expose the same
interface:
- `cap_id=3` -> serial-backed `Console`
- `cap_id=4` -> log-buffer-backed `Console`
- `cap_id=5` -> `Console` proxy served by another process
All three use the same Console TYPE_ID, but they are different objects
with different authority. The manifest/CapSet should record the expected schema
TYPE_ID as interface metadata for typed handle construction. Normal CALL SQEs
do not need to repeat it because the kernel or serving transport can derive it
from the target capability entry. CapSqe keeps reserved tail padding for ABI
stability.
The kernel exposes the initial CapSet to each process as a read-only
4 KiB page mapped at capos_config::capset::CAPSET_VADDR and passes its
address in RDX to _start. The page starts with a
CapSetHeader { magic, version, count } and is followed by
CapSetEntry { cap_id, name_len, interface_id, name: [u8; 32] } records
in manifest declaration order. Userspace looks up caps by the manifest
name rather than by numeric index (capos_config::capset::find), so
grants can be reordered in system.cue without breaking clients. The
mapping is installed without WRITABLE so userspace cannot mutate its
own bootstrap authority map.
Security invariant: a CapTable entry exposes one public interface. If the
same backing state must be available through multiple interfaces, mint multiple
capability entries, each wrapping the same state with a narrower interface.
Do not grant one handle that accepts unrelated interface_id values; that
makes hidden authority easy to miss during review.
Invocation Path
Capabilities are invoked via a shared-memory capability ring (io_uring-inspired). Each process has a submission queue (SQ) and completion queue (CQ) mapped into its address space. Two invocation paths exist:
Caller builds capnp params message
→ serialize to bytes (write_message_to_words)
→ write CALL SQE to SQ ring (pure userspace memory write)
→ advance SQ tail
→ caller invokes cap_enter for ordinary capability methods
(timer polling only runs explicitly interrupt-safe CALL targets)
→ kernel reads SQE, validates user buffers
→ CapTable.call(cap_id, method_id, bytes)
→ kernel writes CQE to CQ ring
... caller reads CQE after cap_enter, or spin-polls only for
interrupt-safe/non-CALL ring work ...
→ caller reads CQE result
CapObject::call does not receive a caller-supplied interface ID. The cap
table derives the invoked interface from the target entry before invoking the
object. The SQE carries only the capability handle and method ID because each
capability entry owns one public interface:
pub trait CapObject: Send + Sync {
    fn interface_id(&self) -> u64;
    fn label(&self) -> &str;
    fn call(
        &self,
        method_id: u16,
        params: &[u8],
        result: &mut [u8],
        reply_scratch: &mut dyn ReplyScratch,
    ) -> capnp::Result<CapInvokeResult>;
}
All communication goes through serialized capnp messages, even when caller and callee are in the same address space. This ensures the wire format is always exercised and makes the transition to cross-address-space IPC seamless.
The result buffer is supplied by the caller (the user-validated SQE result
region). Implementations serialize directly into it and return the number of
bytes written, so the kernel’s dispatch path does not allocate an intermediate
Vec<u8> per invocation.
Capability Table
Each process has its own capability table (CapTable), created at process
startup. The kernel also maintains a global table (KERNEL_CAPS) for
kernel-internal use. Each table maps a CapId (u32) to a boxed CapObject.
CapId encoding: [generation:8 | index:24]. The generation counter increments
when a slot is freed, so stale CapIds (from a previous occupant of the slot)
are rejected with CapError::StaleGeneration rather than accidentally
referring to a different capability.
Operations:
- `insert(obj)` – register a new capability, returns its CapId
- `get(id)` – look up a capability by ID (validates generation)
- `remove(id)` – revoke a capability, bumps slot generation
- `call(id, method_id, params)` – dispatch a method call against the interface assigned to the capability entry
Each service receives capabilities from cap::create_all_service_caps(),
which runs a two-pass resolution over the whole manifest: pass 1 materializes
each service’s kernel-sourced caps as Arc<dyn CapObject> and records its
declared exports; pass 2 assembles each service’s CapTable in declaration
order, cloning the exported Arc when another service’s CapRef resolves
via CapSource::Service. Declaration order is preserved because numeric
CapIds are assigned by insertion order and smoke tests depend on specific
indices. CapRef.source is a structured capnp union, not an authority string:
struct CapRef {
name @0 :Text;
expectedInterfaceId @1 :UInt64;
union {
unset @2 :Void; # invalid; keeps omitted sources fail-closed
kernel @3 :KernelCapSource;
service @4 :ServiceCapSource;
}
}
enum KernelCapSource {
console @0;
endpoint @1;
frameAllocator @2;
virtualMemory @3;
}
struct ServiceCapSource {
service @0 :Text;
export @1 :Text;
}
The source selector chooses the object or authority to grant. The
expectedInterfaceId value is a schema compatibility check against the
constructed object, not the authority selector itself. This distinction matters
because different objects can implement the same interface.
Transport-Level Capability Lifetime
Cap’n Proto applications do not usually model capability lifetime as an application method on every interface. The RPC transport owns capability reference bookkeeping.
The standard Cap’n Proto RPC protocol is stateful per connection. Each side
keeps four tables: questions, answers, imports, and exports. Import/export IDs
are connection-local, not global object names. When an exported capability is
sent over the connection, the export reference count is incremented. When the
importing side drops its last local reference, the transport sends Release
to decrement the remote export count. Implementations may batch these releases.
If the connection is lost, in-flight questions fail, imports become broken, and
exports/answers are implicitly released. Persistent capabilities, when
implemented, are a separate SturdyRef mechanism and should not be treated as
owned pointers.
This distinction matters for capOS:
- `close()` is application protocol. A `File.close()` method can flush dirty state, commit metadata, or tell a server that a session should end.
- `Release` / cap drop is transport protocol. It removes one reference from the caller’s local capability namespace and eventually lets the serving side reclaim the object if no references remain.
- Process exit is bulk transport cleanup. Dropping the process must release all caps in its table, cancel pending calls, and wake peers waiting on those calls.
capOS therefore needs a system transport layer in the userspace runtime
(capos-rt / later language runtimes), not just raw SQE helpers. That transport
should own typed client handles, local reference counts, promise-pipelined
answers, and broken-cap state. When the last local handle is dropped, it should
submit a transport-level release operation to the kernel ring.
Ordinary handle release is a transport concern, not an application method.
The target design: the generated client drops the last local handle
(RAII / GC / finalizer), the runtime transport submits the CAP_OP_RELEASE
ring opcode, and the kernel removes the caller’s CapTable slot with mutable
access to that table. Encoding release as a
regular method call on CapabilityManager was rejected because it would
mutate the same table used to dispatch the call; CapabilityManager is
therefore management-only (list(), later grant()), not the default
release path. CAP_OP_FINISH remains reserved in the same transport opcode
namespace for application-level “end of work” signals that the transport must
deliver reliably, so the kernel can tell them apart from a truly malformed
opcode.
Current status: the kernel dispatches CAP_OP_RELEASE as a local cap-table
slot removal and fails closed for stale or non-owned cap IDs. capos-rt
bootstrap handles remain explicitly non-owning, while adopted owned handles
queue CAP_OP_RELEASE on final drop. Result-cap adoption validates the
kernel-supplied interface ID before producing an owned typed handle.
CAP_OP_FINISH remains reserved and returns CAP_ERR_UNSUPPORTED_OPCODE.
Process exit remains the fallback cleanup path for unreleased local slots.
Access Control: Interfaces, Not Rights Bitmasks
capOS deliberately does not use a rights bitmask (READ/WRITE/EXECUTE) on capability entries, despite this being standard in Zircon and seL4. The reason is that Cap’n Proto typed interfaces already serve as the access control mechanism, and a parallel rights system creates an impedance mismatch.
Why rights bitmasks exist in other systems: Zircon and seL4 use rights
because their syscall interfaces are untyped – a handle is an opaque reference
to a kernel object, and the kernel needs something to decide which fixed
syscalls are allowed. capOS has typed interfaces where the .capnp schema
defines exactly what methods exist.
capOS’s approach: the interface IS the permission. To restrict what a caller can do, grant a narrower capability:
- `Fetch` (full HTTP) → `HttpEndpoint` (scoped to one origin)
- `Store` (read-write) → `Store` wrapper that rejects write methods
- `Namespace` (full) → `Namespace` scoped to a prefix
The “restricted” capability is a different CapObject implementation that
wraps the original. The kernel doesn’t know or care – it dispatches to
whatever CapObject is in the slot. Attenuation is userspace/schema logic,
not a kernel mechanism.
When transfer control is needed (Stage 6): meta-rights for the capability itself (can it be transferred? duplicated?) may be added as a small bitmask. These are about the reference, not the referenced object, and don’t overlap with interface-level method access control.
See research.md for the cross-system analysis that led to this decision (§1 Capability Table Design).
Planned Enhancements (from research)
Tracked in ROADMAP.md
Stages 5-6:
- Badge (from seL4) – u64 value per capability entry, delivered to the server on invocation. Implemented for manifest cap refs, IPC transfer, and `ProcessSpawner` endpoint-client minting so servers can distinguish callers without separate capability objects per client.
- Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.
Current Limitations
- Blocking wait exists, but waits are still process-level. `cap_enter(min_complete, timeout_ns)` processes pending SQEs and can block the current process until enough CQEs exist or a finite timeout expires. It is not yet a general futex/thread wait primitive; in-process threading and futex-shaped measurements are tracked separately.
- No persistence. Capabilities exist only at runtime.
- Capability transfer is implemented for Endpoint CALL/RECV/RETURN. Transfer descriptors on the capability ring let callers and receivers copy or move transferable local caps through IPC messages. See storage-and-naming-proposal.md “IPC and Capability Transfer” for the full design.
- Transfer ABI (3.6.0 draft). Sideband transfer descriptors are defined in `capos-config/src/ring.rs` as `CapTransferDescriptor`:
  - `cap_id` is the sender-side local capability-table handle.
  - `transfer_mode` is either `CAP_TRANSFER_MODE_COPY` or `CAP_TRANSFER_MODE_MOVE`.
  - `xfer_cap_count` in `CapSqe` is the descriptor count.
  - For CALL/RETURN, descriptors are packed at `addr + len` after the payload bytes and must be aligned to `CAP_TRANSFER_DESCRIPTOR_ALIGNMENT`.
  - Result-cap insertion semantics are defined by `CapCqe`: `result` reports normal payload bytes, while `cap_count` reports how many `CapTransferResult { cap_id, interface_id }` records were appended immediately after those payload bytes in `result_addr` when `CAP_CQE_TRANSFER_RESULT_CAPS` is set. User space must bound-check `result + cap_count * CAP_TRANSFER_RESULT_SIZE` against its requested `result_len`.
  - Transfer-bearing SQEs are fail-closed:
    - unsupported-by-kernel transfer path: `CAP_ERR_TRANSFER_NOT_SUPPORTED` (until sideband transfer is enabled),
    - malformed descriptor metadata (invalid mode, reserved bits, non-zero `_reserved0`, misalignment, overflow): `CAP_ERR_INVALID_TRANSFER_DESCRIPTOR`,
    - all other reserved-field misuse remains `CAP_ERR_INVALID_REQUEST`.
- No revocation propagation. Removing a table entry doesn’t invalidate copies or derived capabilities. Epoch-based revocation is planned.
- No bulk data path. All data goes through capnp message copy. SharedBuffer / MemoryObject capability needed for file I/O, networking, GPU data plane. See storage-and-naming-proposal.md “Shared Memory for Bulk Data” for the interface design.
Future Directions
- Capability transfer. Cross-process capability calls already go through the kernel via Endpoint objects with RECV/RETURN SQE opcodes on the existing per-process capability ring (no new syscalls). The remaining transfer work will carry capability references with sideband descriptors and install result caps in the receiver’s local table. See storage-and-naming-proposal.md for how this enables `Directory.open()` returning File caps, `Namespace.sub()` returning scoped Namespace caps, etc.
- Persistence. Serialize capability state to storage using capnp format. Restore capabilities across reboots.
- Network transparency. Forward capability calls to remote machines using the same capnp wire format. A remote Console capability looks identical to a local one.
Capability Ring
The capability ring is the userspace-to-kernel transport for capability invocation. It avoids one syscall per operation while preserving a typed Cap’n Proto method boundary and explicit completion reporting.
Status: Implemented. The shared-memory ring, cap_enter, CALL/RECV/RETURN/RELEASE/NOP dispatch,
structured transport errors, bounded scratch buffers, and Loom ring model are
implemented. FINISH, promise pipelining, multishot, link, drain, and SQPOLL
remain future work.
Current Behavior
Each non-idle process gets one 4 KiB ring page mapped at RING_VADDR. The page
contains a volatile header, a 16-entry submission queue, and a 32-entry
completion queue. Userspace writes CapSqe records, advances sq_tail, and
uses cap_enter(min_complete, timeout_ns) to make ordinary calls progress.
sequenceDiagram
participant U as Userspace runtime
participant R as Ring page
participant K as Kernel ring dispatcher
participant C as Capability object
U->>R: write CapSqe and advance sq_tail
U->>K: cap_enter(min_complete, timeout_ns)
K->>R: read sq_head..sq_tail
K->>K: validate SQE fields and user buffers
K->>C: call method or endpoint operation
C-->>K: completion, pending, or error
K->>R: write CapCqe and advance cq_tail
K-->>U: return available CQE count
U->>R: read matching CapCqe
Timer polling also processes each current process’s ring before preemption, but
only non-CALL operations and CALL targets that explicitly allow interrupt
dispatch may run there. Ordinary CALLs wait for cap_enter.
Why ordinary CALL waits for `cap_enter`: submitting a `CALL` SQE is only a shared-memory write. The kernel still needs a safe execution point to drain the ring and run capability code. Timer polling runs in interrupt context, so it must not execute arbitrary capability methods that may allocate, block on locks, mutate page tables, spawn processes, parse Cap’n Proto messages, or perform IPC side effects. `cap_enter` is the normal process-context drain point: it processes pending SQEs, posts CQEs, and then either returns the available completion count or blocks until enough completions arrive. The design keeps SQE publication syscall-free and batchable, keeps the syscall ABI limited to `exit` and `cap_enter`, and avoids turning the timer interrupt into a general capability executor. A future SQPOLL-style path can remove the explicit syscall from the hot path only by running dispatch in a worker context, not from arbitrary timer interrupt execution.
Design
CapSqe is a fixed 64-byte ABI record. CAP_OP_CALL names a local cap-table
slot and method ID plus parameter/result buffers. CAP_OP_RECV and
CAP_OP_RETURN implement endpoint IPC. CAP_OP_RELEASE removes a local
cap-table slot through the transport. CAP_OP_NOP measures the fixed ring
path. CAP_OP_FINISH is ABI-reserved and currently returns
CAP_ERR_UNSUPPORTED_OPCODE.
The kernel copies user params into preallocated per-process scratch, dispatches
capability methods, serializes results directly into caller-provided result
buffers, and posts CapCqe. A successful method returns non-negative bytes
written. Transport failures are negative CAP_ERR_* codes. Application
exceptions are serialized CapException payloads with
CAP_ERR_APPLICATION_EXCEPTION.
Transfer-bearing CALL and RETURN SQEs pack CapTransferDescriptor records
after the params/result payload. Successful result-cap transfers append
CapTransferResult records after normal result bytes.
Future behavior should use the reserved SQE fields for system transport features, not ad hoc per-interface extensions.
Invariants
- SQ and CQ sizes are powers of two and fixed by the ABI.
- Unknown opcodes fail closed; `FINISH` is reserved, not silently accepted.
- Reserved fields must be zero for currently implemented opcodes.
- `cap_enter` rejects `min_complete > CQ_ENTRIES`.
- User buffers must be in user address space with page permissions matching read/write intent before copy.
- Timer dispatch must not run capabilities that allocate, block on locks, or mutate page tables unless the cap explicitly opts in.
- Per-dispatch SQE processing is bounded by `SQ_ENTRIES`.
- Transfer descriptors must be aligned, valid, and bounded by `MAX_TRANSFER_DESCRIPTORS`.
Code Map
- `capos-config/src/ring.rs` - shared ring ABI, opcodes, errors, SQE/CQE structs, endpoint message headers, transfer records.
- `kernel/src/cap/ring.rs` - kernel dispatcher, SQE validation, CQE posting, cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.
- `kernel/src/arch/x86_64/syscall.rs` - `cap_enter` syscall.
- `kernel/src/sched.rs` - timer polling, cap-enter blocking, direct IPC wake.
- `kernel/src/process.rs` - ring page allocation and mapping.
- `capos-rt/src/ring.rs` - runtime ring client, pending calls, transfer packing, result-cap parsing.
- `capos-rt/src/entry.rs` - single-owner runtime ring client token and release queue flushing.
- `capos-config/tests/ring_loom.rs` - bounded producer/consumer model.
Validation
- `cargo test-ring-loom` validates SQ/CQ producer-consumer behavior, capacity, FIFO, CQ overflow/drop behavior, and corrupted SQ recovery.
- `make run` exercises Console CALLs, reserved opcode rejection, ring corruption recovery, NOP, fairness, transfers, and endpoint IPC.
- `make run-measure` exercises measurement-only NOP and NullCap baselines.
- `cargo test-config` covers shared ring layout and helper invariants.
- `make capos-rt-check` checks userspace runtime ring code under the bare-metal target.
Open Work
- Implement `CAP_OP_FINISH` as part of the system Cap’n Proto transport.
- Add promise pipelining using `pipeline_dep` and `pipeline_field`.
- Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
- Add SQPOLL after SMP gives the kernel a spare execution context.
IPC and Endpoints
Endpoints let one process serve capability calls to another process without adding a separate IPC syscall surface. The same ring transport carries ordinary kernel capability calls and cross-process endpoint calls.
Status: Partially implemented. Ring-native endpoint CALL/RECV/RETURN, client endpoint attenuation, badges, copy and move capability transfer, direct IPC handoff, and cleanup for many exit paths are implemented. Notification objects, promise pipelining, shared buffers, revocation, and transfer-path cleanup refactoring remain open.
Current Behavior
An Endpoint is a kernel capability object with queues for pending client
calls, pending server receives, and in-flight calls awaiting RETURN. A service
that owns the raw endpoint can receive and return. Importers receive a
ClientEndpoint facet that can CALL but cannot RECV or RETURN.
sequenceDiagram
participant Client
participant ClientRing as Client ring
participant Endpoint
participant ServerRing as Server ring
participant Server
Server->>ServerRing: submit RECV on raw endpoint
Client->>ClientRing: submit CALL on client facet
ClientRing->>Endpoint: deliver params and caller result target
Endpoint->>ServerRing: complete RECV with EndpointMessageHeader and params
ServerRing-->>Server: cap_enter returns completion
Server->>ServerRing: submit RETURN with call_id and result
ServerRing->>Endpoint: take in-flight target
Endpoint->>ClientRing: post caller CQE with result and badge
ClientRing-->>Client: wait returns matching completion
If a CALL arrives before a RECV, the endpoint queues bounded params. If a RECV arrives before a CALL, the endpoint queues the receive request. Delivered calls move into the in-flight queue until the server returns or cleanup cancels them.
Design
Endpoint IPC is capability-oriented. The manifest can export a raw endpoint from one service; importers get a narrowed client facet. This keeps server-only authority out of clients without introducing rights bitmasks.
CALL and RETURN may carry sideband transfer descriptors. Copy transfers insert a new cap into the receiver while preserving the sender. Move transfers reserve the sender slot, insert the destination, then remove the source on commit. RETURN-side transfers append result-cap records after the normal result payload.
Badges are stored on cap-table hold edges and delivered to servers with endpoint invocation metadata, so one endpoint can distinguish callers without one object per caller.
Future IPC should add notification objects for lightweight signaling and promise pipelining for Cap’n Proto-style dependent calls.
Invariants
- Only raw endpoint holders may RECV or RETURN.
- Imported endpoint caps are `ClientEndpoint` facets and must reject RECV and RETURN from userspace.
- Endpoint queues are bounded by call count, receive count, in-flight count, per-call params, and total queued params.
- Each in-flight call has a kernel-assigned non-zero `call_id`.
- CALL delivery copies params into kernel-owned queued storage before the caller can resume.
- Move transfer commit must not leave both source and destination live.
- Transfer rollback must preserve source authority if destination insertion or result delivery fails.
- Process exit must cancel queued state involving that pid and wake affected peers when possible.
Code Map
- `kernel/src/cap/endpoint.rs` - endpoint queues, client facet, call IDs, cancellation by pid.
- `kernel/src/cap/ring.rs` - endpoint CALL/RECV/RETURN dispatch, result copying, deferred cancellation CQEs.
- `kernel/src/cap/transfer.rs` - transfer descriptor loading and transaction preparation.
- `capos-lib/src/cap_table.rs` - cap-table transfer primitives and rollback.
- `kernel/src/cap/mod.rs` - manifest export resolution and client-facet construction.
- `capos-config/src/ring.rs` - `EndpointMessageHeader`, transfer descriptors, transfer result records, endpoint opcodes.
- `demos/capos-demo-support/src/lib.rs` - endpoint, IPC, transfer, and hostile IPC smoke routines.
- `demos/endpoint-roundtrip`, `demos/ipc-server`, `demos/ipc-client` - QEMU smoke binaries.
Validation
- `make run` validates same-process endpoint RECV/RETURN, cross-process IPC, endpoint exit cleanup, badged calls, transfer success/failure paths, and clean halt.
- `make run-spawn` validates init-spawned endpoint-roundtrip, server, and client processes.
- `cargo test-lib` covers cap-table transfer preflight, provisional insertion, commit, rollback, stale generation, and slot exhaustion cases.
- `cargo test-ring-loom` covers ring queue behavior that endpoint IPC depends on for completion delivery.
Open Work
- Add notification objects for signal-style events.
- Add Cap’n Proto promise pipelining after endpoint routing can resolve dependent answers.
- Add shared-buffer or memory-object capabilities for bulk data transfer.
- Add epoch-based revocation if broad authority invalidation becomes necessary.
Userspace Runtime
The userspace runtime owns the repeated mechanics that every service needs: bootstrap validation, heap initialization, typed capability lookup, ring submission, completion matching, application exception decoding, and handle lifetime.
Status: Partially implemented. capos-rt provides a no_std entry ABI, fixed heap, typed CapSet lookup, a
single-owner ring client, typed Console and ProcessSpawner clients, result-cap
adoption, and release-on-drop for owned handles. Full generated bindings,
promise pipelining, language runtimes, and broad transport semantics remain
future work.
Current Behavior
Runtime-owned _start receives (ring_addr, pid, capset_addr), initializes a
fixed heap, validates the ring address, reads the read-only CapSet page, installs
an emergency Console panic path when available, calls capos_rt_main(runtime),
and exits with the returned code.
The Runtime lends out at most one RuntimeRingClient at a time. The client
wraps the raw ring page, keeps request buffers alive until completions are
matched, handles out-of-order completions, packs copy-transfer descriptors, and
parses result-cap records. Owned runtime handles queue CAP_OP_RELEASE when the
last local reference is dropped; the release queue flushes when a ring client is
borrowed or dropped.
Design
The runtime separates non-owning bootstrap references from owned local handles.
CapSet entries produce typed Capability<T> values only when the interface ID
matches the requested type. Result-cap adoption performs the same interface
check before producing OwnedCapability<T>.
Typed clients are thin wrappers over the ring client. They encode Cap’n Proto
params, submit CALL SQEs, wait for a matching CQE, decode transport errors, and
decode kernel-produced CapException payloads into client errors.
Future generated clients should preserve this split: transport lifetime and completion matching belong in the runtime, while interface-specific encoding belongs in generated or handwritten client wrappers.
Invariants
- `ring_addr` must equal `RING_VADDR`; runtime bootstrap rejects any other address.
- The CapSet header magic/version must validate before lookup.
- CapSet handles are non-owning unless explicitly adopted.
- Only one runtime ring client may be live at a time for a process.
- Request params and result buffers must outlive their matching CQE.
- A result cap can be consumed only once and only with the expected interface ID.
- Dropping the final owned handle queues exactly one local `CAP_OP_RELEASE`.
- Release flushing treats stale or already-removed caps as non-fatal cleanup.
Code Map
- `capos-rt/src/entry.rs` - `_start`, `Runtime`, bootstrap validation, single-owner ring token, release queue flushing.
- `capos-rt/src/alloc.rs` - fixed userspace heap initialization.
- `capos-rt/src/capset.rs` - typed CapSet lookup wrappers.
- `capos-rt/src/ring.rs` - ring client, pending calls, completion matching, copy-transfer packing, result-cap parsing.
- `capos-rt/src/client.rs` - Console, ProcessSpawner, ProcessHandle clients and exception decoding.
- `capos-rt/src/lib.rs` - typed capability marker types and owned handle reference counting.
- `capos-rt/src/panic.rs` - emergency Console output path.
- `init/src/main.rs` and `capos-rt/src/bin/smoke.rs` - current runtime users.
Validation
- `make capos-rt-check` builds the runtime smoke binary with userspace relocation constraints.
- `make run` validates runtime entry, typed Console calls, exception decoding, owned handle release, result-cap parsing through IPC, and clean process exit.
- `make run-spawn` validates `ProcessSpawnerClient`, `ProcessHandleClient`, result-cap adoption, and release behavior under init spawning.
- `cd capos-rt && cargo test --lib --target x86_64-unknown-linux-gnu` covers host-testable runtime invariants when run explicitly.
Open Work
- Replace duplicated demo support ring helpers with `capos-rt` where practical.
- Implement promise/answer transport semantics beyond current placeholders.
- Define release behavior for queued handles when a process exits before the release queue flushes.
Manifest and Service Startup
The manifest is the boot-time authority graph. It names binaries, services, initial capabilities, exported service caps, restart policy metadata, badges, and system config.
Status: Partially implemented. Manifest parsing, Cap’n Proto encoding, CUE conversion, binary embedding,
kernel-side service startup, service exports, endpoint client facets, badges,
BootPackage manifest exposure to init, init-side graph validation, and generic
init-side spawning through ProcessSpawner are implemented for
system-spawn.cue. Retiring the default kernel-side service graph is the
remaining selected milestone work.
Current Behavior
tools/mkmanifest requires the repo-pinned CUE compiler, evaluates
system.cue, embeds declared binaries, validates binary references and
authority graph structure, serializes SystemManifest, and places
manifest.bin into the ISO. The kernel receives that file as the single
Limine module.
```mermaid
flowchart TD
    Cue[system.cue or system-spawn.cue] --> Mkmanifest[tools/mkmanifest]
    Binaries[release userspace binaries] --> Mkmanifest
    Mkmanifest --> Manifest[manifest.bin SystemManifest]
    Manifest --> Limine[Limine boot module]
    Limine --> Kernel[kernel parse and validate]
    Kernel --> Tables[CapTables and CapSet pages]
    Tables --> BootServices[default: kernel-enqueued services]
    Tables --> Init[spawn manifest: init gets ProcessSpawner and BootPackage]
    Init --> BootPackage[BootPackage.readManifest chunks]
    BootPackage --> Plan[capos-config ManifestBootstrapPlan validation]
    Init --> Spawner[ProcessSpawner.spawn]
    Spawner --> Children[init-spawned child processes]
```
The default manifest currently starts all services from the kernel. The spawn
manifest starts only init, grants it ProcessSpawner plus a read-only
BootPackage cap, and lets init read bounded manifest chunks into a
metadata-only capos-config::ManifestBootstrapPlan. Init validates binary
references, authority graph structure, exports, cap sources, and interface IDs
before spawning the endpoint, IPC, VirtualMemory, and FrameAllocator cleanup
demos. Spawn grants carry explicit requested badges: raw parent-capability
grants must preserve the source hold badge, endpoint-client grants may mint the
requested badge only from an endpoint-owner source, and kernel-source
FrameAllocator/VirtualMemory grants mint fresh child-local caps without badges.
Design
Manifest validation has three layers:
- Binary references: binary names are unique, service binary references resolve, and referenced binary payloads are non-empty.
- Authority graph: service names, cap names, export names, and service-sourced references are unique and resolvable; re-exporting service-sourced caps is rejected.
- Bootstrap cap sources: expected interface IDs match kernel sources or declared service exports.
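The three layers can be sketched as chained fail-closed checks, each returning an error before any later layer runs. This is an illustrative shape only; the struct and function names below are hypothetical and do not match the actual capos-config types.

```rust
// Hypothetical shape of layer 1 (binary references); names do not match capos-config.
use std::collections::HashSet;

pub struct Binary { pub name: String, pub bytes: Vec<u8> }
pub struct Service { pub name: String, pub binary: String }

pub fn validate_binaries(bins: &[Binary], svcs: &[Service]) -> Result<(), String> {
    let mut names = HashSet::new();
    for b in bins {
        // Binary names must be unique and payloads non-empty.
        if !names.insert(b.name.as_str()) {
            return Err(format!("duplicate binary {}", b.name));
        }
        if b.bytes.is_empty() {
            return Err(format!("empty binary {}", b.name));
        }
    }
    for s in svcs {
        // Every service binary reference must resolve to a declared binary.
        if !names.contains(s.binary.as_str()) {
            return Err(format!("service {} references unknown binary {}", s.name, s.binary));
        }
    }
    Ok(())
}
```

Layers 2 (authority graph) and 3 (cap sources) follow the same pattern: collect declarations into sets, then reject duplicates and dangling references.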
Kernel startup resolves caps in two passes. Pass 1 creates kernel-sourced caps
and records declared exports. Pass 2 resolves service-sourced imports against
the export registry, attenuating endpoint exports to client-only facets for
importers. Declaration order is preserved because CapIds are assigned by
insertion order and CapSet entries mirror that order.
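A minimal model of the two-pass resolution, assuming an export registry keyed by (service, export) pairs. The types here are stand-ins for the kernel's, not its actual API; note that output order follows declaration order, matching the CapId-by-insertion invariant above.

```rust
// Illustrative two-pass cap resolution; enum and field names are assumptions.
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Resolved { Kernel(u32), Import { exporter_cap: u32 } }

pub enum Source { Kernel(u32), ServiceExport { service: &'static str, export: &'static str } }

pub fn resolve(
    caps: &[(&'static str, Source)],
    exports: &[(&'static str, &'static str, u32)],
) -> Result<Vec<(String, Resolved)>, String> {
    // Pass 1 analogue: record every declared export in a registry.
    let mut registry: HashMap<(&str, &str), u32> = HashMap::new();
    for &(svc, name, cap) in exports {
        registry.insert((svc, name), cap);
    }
    // Pass 2 analogue: resolve each declared cap in declaration order.
    let mut out = Vec::new();
    for (name, src) in caps {
        match src {
            Source::Kernel(id) => out.push((name.to_string(), Resolved::Kernel(*id))),
            Source::ServiceExport { service, export } => {
                // Service-sourced imports must hit a declared export, else fail closed.
                let cap = registry
                    .get(&(*service, *export))
                    .ok_or_else(|| format!("unresolved export {service}.{export}"))?;
                out.push((name.to_string(), Resolved::Import { exporter_cap: *cap }));
            }
        }
    }
    Ok(out)
}
```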
Future behavior should make the kernel parse only enough boot information to
launch init with manifest/boot-package authority. init should use
BootPackage.readManifest to validate the service graph, call
ProcessSpawner, grant caps, and wait for services where policy requires it.
Invariants
- The manifest is schema data, not shell script or ambient namespace.
- Omitted cap sources fail closed.
- Cap names within one service are unique and are the names userspace sees in CapSet.
- Service exports must name caps declared by the same service.
- Service-sourced imports must reference a declared service export.
- Endpoint exports to importers must be attenuated to client-only facets.
- `expectedInterfaceId` checks compatibility; it is not the authority selector.
- Badges travel with cap-table hold edges and endpoint invocation metadata. Spawn-time client endpoint minting carries the requested child badge instead of copying the parent’s owner-hold badge.
Code Map
- `schema/capos.capnp` - `SystemManifest`, `ServiceEntry`, `CapRef`, `KernelCapSource`, `ServiceCapSource`, `RestartPolicy`.
- `capos-config/src/manifest.rs` - manifest structs, CUE conversion, capnp encode/decode, metadata-only `ManifestBootstrapPlan`, schema-version guardrails, validation.
- `tools/mkmanifest/src/lib.rs` and `tools/mkmanifest/src/main.rs` - host-side manifest build pipeline and binary embedding.
- `kernel/src/main.rs` - kernel manifest module parse and validation.
- `kernel/src/cap/mod.rs` - service cap creation, exports, endpoint attenuation, CapSet entry construction.
- `kernel/src/cap/boot_package.rs` - read-only manifest-size and chunked manifest-read capability.
- `kernel/src/cap/process_spawner.rs` - init-callable spawn path for packaged boot binaries.
- `capos-rt/src/client.rs` - typed BootPackage and ProcessSpawner clients.
- `init/src/main.rs` - current spawn manifest executor smoke.
- `system.cue` and `system-spawn.cue` - current manifests.
Validation
- `cargo test-config` validates manifest decode, CUE conversion, graph checks, source checks, and binary reference checks.
- `cargo test-mkmanifest` validates host-side manifest conversion, embedded binary handling, and pinned CUE path/version checks.
- `make run` validates default kernel-side manifest execution.
- `make run-spawn` validates `system-spawn.cue`, init-side BootPackage manifest reads, init-side manifest graph validation, init-side spawning, hostile spawn failures, child grants, process waits, and cap-table exhaustion checks.
- `make generated-code-check` validates that schema-generated Rust stays in sync.
Open Work
- Retire the default kernel-side service graph and make the normal boot path launch only init before service startup.
Memory Management
Memory management gives the kernel controlled ownership of physical frames, separates user processes, enforces page permissions, and exposes memory authority only through explicit capabilities.
Status: Partially implemented. Physical frame allocation, heap initialization, kernel page-table remapping, per-process address spaces, ELF/user stack/TLS mappings, user-buffer validation, FrameAllocator, and VirtualMemory caps are implemented. Broader quota unification, SMP-safe validation, shared buffers, huge pages, and hardware I/O memory isolation remain open.
Current Behavior
The frame allocator builds a bitmap from the Limine memory map, marks all non-usable frames as used, reserves frame zero, and reserves its own bitmap frames. The heap is initialized separately for kernel allocation.
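The allocator's core idea can be shown with a small bitmap sketch: start with every frame marked used, free the usable ranges, then allocate first-fit. Field names and the scan strategy below are assumptions; the real, host-testable core lives in `capos-lib/src/frame_bitmap.rs`.

```rust
// Minimal frame-bitmap sketch (one bit per 4 KiB frame; set bit = used).
pub struct FrameBitmap { bits: Vec<u64>, total: usize }

impl FrameBitmap {
    pub fn new_all_used(total: usize) -> Self {
        // Boot starts pessimistic: everything used until the memory map frees it.
        FrameBitmap { bits: vec![u64::MAX; (total + 63) / 64], total }
    }
    pub fn set_free(&mut self, frame: usize) {
        // Frame zero stays reserved, matching the behavior described above.
        assert!(frame < self.total && frame != 0, "frame 0 stays reserved");
        self.bits[frame / 64] &= !(1 << (frame % 64));
    }
    pub fn alloc(&mut self) -> Option<usize> {
        // First-fit scan; the caller multiplies the index by 4096 for an address.
        for (i, word) in self.bits.iter_mut().enumerate() {
            if *word != u64::MAX {
                let bit = word.trailing_ones() as usize;
                let frame = i * 64 + bit;
                if frame >= self.total { return None; }
                *word |= 1 << bit;
                return Some(frame);
            }
        }
        None
    }
}
```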
Paging initialization builds a new kernel PML4, remaps kernel sections with section-specific permissions, copies upper-half mappings with NX applied and user access stripped, switches CR3, then enables page-global support. SMEP/SMAP are enabled after those mappings are active.
Each user AddressSpace owns its lower-half page tables and clones the
kernel’s upper-half mappings. Dropping an address space walks the user half and
frees mapped frames and page-table frames. VirtualMemory lets a process map,
unmap, and protect anonymous user pages, with a 256-page per-address-space
tracking limit.
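The per-address-space tracking can be modeled as an owned-page set with a hard quota. The 256-page limit is from this page; the structure below is an assumption, not the kernel's representation.

```rust
// Hedged sketch of VirtualMemory page tracking with a 256-page cap.
use std::collections::BTreeSet;

pub struct VmTracking { owned: BTreeSet<u64>, limit: usize }

impl VmTracking {
    pub fn new() -> Self { VmTracking { owned: BTreeSet::new(), limit: 256 } }

    pub fn map(&mut self, page: u64) -> Result<(), &'static str> {
        if self.owned.len() >= self.limit { return Err("vm tracking limit reached"); }
        if !self.owned.insert(page) { return Err("page already mapped"); }
        Ok(())
    }

    pub fn unmap(&mut self, page: u64) -> Result<(), &'static str> {
        // Unmap/protect only succeeds for pages this address space owns,
        // mirroring the invariant listed below.
        if self.owned.remove(&page) { Ok(()) } else { Err("page not owned") }
    }
}
```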
Design
The kernel keeps physical allocation host-testable by placing bitmap logic in
capos-lib and wrapping it with kernel HHDM access in kernel/src/mem/frame.rs.
Page-table manipulation stays in the kernel because it is architecture-specific.
ELF loading and VirtualMemory both use page-table flags to preserve W^X:
non-executable data gets NX, writable mappings are explicit, and userspace
pages must be USER_ACCESSIBLE. The CapSet and ring bootstrap pages occupy
reserved virtual pages; VirtualMemory rejects ranges that overlap either one.
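The W^X rule reduces to a small flag-selection policy: writable and executable are mutually exclusive, and every user mapping carries the user bit. The flag names below are stand-ins, not the kernel's page-table types.

```rust
// Illustrative W^X policy for user mappings.
#[derive(Debug, PartialEq)]
pub struct PageFlags { pub writable: bool, pub no_execute: bool, pub user: bool }

pub fn user_flags(writable: bool, executable: bool) -> Result<PageFlags, &'static str> {
    // W^X: a page may be writable or executable, never both.
    if writable && executable { return Err("W^X violation"); }
    // Non-executable data gets NX; userspace pages are always USER_ACCESSIBLE.
    Ok(PageFlags { writable, no_execute: !executable, user: true })
}
```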
User-buffer validation checks that user pointers stay below the user address
limit and that page-table permissions match the requested access. SMAP
UserAccessGuard brackets kernel copy operations into or out of user pages.
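The address-range half of that validation can be sketched as an overflow-safe bounds check. The limit value below is an assumption for illustration; the real check also walks page tables to confirm permissions.

```rust
// Sketch of user-range validation: reject overflow, then compare against the limit.
const USER_LIMIT: u64 = 0x0000_8000_0000_0000; // assumed canonical lower-half boundary

pub fn validate_user_range(ptr: u64, len: u64) -> Result<(), &'static str> {
    if len == 0 { return Ok(()); }
    // Checked add first: a hostile (ptr, len) pair must not wrap past the limit.
    let end = ptr.checked_add(len).ok_or("length overflows address space")?;
    if end > USER_LIMIT { return Err("range extends past user limit"); }
    Ok(())
}
```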
Future memory work should unify quotas across frame grants, VM mappings, shared buffers, and DMA resources rather than adding one-off counters per cap.
Invariants
- Frame addresses are 4 KiB aligned.
- The frame bitmap’s own frames are never returned as free frames.
- Upper-half kernel mappings are not user-accessible.
- Kernel text is RX, rodata is read-only NX, and data/bss are RW NX.
- User address spaces own only lower-half page-table frames.
- CapSet is read-only/no-execute; ring is writable/no-execute.
- `VirtualMemory` cannot map, unmap, or protect the ring or CapSet pages.
- `VirtualMemory` protect/unmap only succeeds for pages tracked as owned by the cap’s address space.
Code Map
- `capos-lib/src/frame_bitmap.rs` - host-testable physical frame bitmap core.
- `capos-lib/src/frame_ledger.rs` - outstanding grant ledger for FrameAllocator cleanup.
- `kernel/src/mem/frame.rs` - Limine memory-map integration and global frame allocator wrapper.
- `kernel/src/mem/heap.rs` - kernel heap setup.
- `kernel/src/mem/paging.rs` - kernel remap, `AddressSpace`, page mapping, VM-cap page tracking, user copy helpers.
- `kernel/src/mem/validate.rs` - user-buffer validation.
- `kernel/src/cap/frame_alloc.rs` - FrameAllocator capability and cleanup.
- `kernel/src/cap/virtual_memory.rs` - VirtualMemory capability.
- `kernel/src/spawn.rs` - ELF, stack, and TLS user mappings.
- `kernel/src/arch/x86_64/smap.rs` - SMEP/SMAP setup and user access guard.
Validation
- `cargo test-lib` covers frame bitmap, frame ledger, ELF parser, and cap-table pure logic.
- `cargo miri-lib` runs host-testable `capos-lib` tests under Miri when installed.
- `make kani-lib` proves bounded frame bitmap, cap ID, and ELF parser invariants when Kani is installed.
- `make run` validates ELF mapping, process teardown, FrameAllocator cleanup, TLS, VirtualMemory map/protect/unmap/quota/release smoke, and clean halt.
- `make run-spawn` validates ELF load failure rollback and frame exhaustion handling through `ProcessSpawner`.
Open Work
- Resolve quota fragmentation across FrameAllocator, VirtualMemory, and future shared memory.
- Harden user-buffer validation for SMP-era page-table races.
- Add shared-buffer or memory-object capabilities for zero-copy data paths.
- Add DMA isolation and device memory capability boundaries before userspace drivers.
- Add huge-page handling only with explicit ownership and teardown rules.
Scheduling
Scheduling decides which process runs, preserves CPU state across preemption and blocking, and integrates capability-ring progress with process execution.
Status: Partially implemented. Single-CPU preemptive round-robin scheduling, PIT timer interrupts, full context
switches, cap_enter blocking waits, user-mode idle, process exit, and direct
IPC handoff are implemented. SMP, per-CPU data, kernel-mode idle, priority, and
restart policy are future work.
Current Behavior
The scheduler stores processes in a BTreeMap<Pid, Process> and ready pids in
a VecDeque. PIT fires at roughly 100 Hz through IRQ0. On each timer tick, the
kernel wakes timed-out or satisfied cap_enter waiters, processes the current
process’s ring in timer mode, saves the current context, rotates ready
processes, switches CR3, updates TSS.RSP0 and the syscall kernel stack, restores
FS base, and returns to the next user context.
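The queue rotation at the heart of that tick can be reduced to a few lines over a `VecDeque`. This is a toy model of the round-robin step only; the real path also switches CR3, updates TSS.RSP0, and restores the FS base.

```rust
// Toy round-robin rotation: re-queue the preempted pid, pick the next head.
use std::collections::VecDeque;

pub fn rotate_ready(ready: &mut VecDeque<u32>, current: u32) -> Option<u32> {
    // The preempted process goes to the back of the ready queue.
    ready.push_back(current);
    // The next runnable process comes off the front.
    ready.pop_front()
}
```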
cap_enter(min_complete, timeout_ns) processes pending SQEs immediately. If
the requested completion count is not available and the timeout permits
blocking, the current process enters Blocked(CapEnter { ... }) and the syscall
entry path switches to another process.
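The block-or-return decision can be modeled as a pure function of the completion count and timeout. The names are illustrative, and treating a zero timeout as "do not block" is an assumption about how the timeout permits blocking.

```rust
// Hedged model of the cap_enter blocking decision.
#[derive(Debug, PartialEq)]
pub enum CapEnterOutcome { Return, Block { min_complete: u32, timeout_ns: u64 } }

pub fn cap_enter_decision(completions_available: u32, min_complete: u32,
                          timeout_ns: u64) -> CapEnterOutcome {
    // Pending SQEs were already processed; now the completion count gates blocking.
    if completions_available >= min_complete || timeout_ns == 0 {
        CapEnterOutcome::Return
    } else {
        CapEnterOutcome::Block { min_complete, timeout_ns }
    }
}
```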
When endpoint delivery satisfies a blocked server RECV, the scheduler can set a direct IPC target. The next scheduling decision runs that server before ordinary round-robin work when it is ready.
Design
The implementation keeps ring dispatch outside the global scheduler lock. Timer dispatch extracts ring/cap/scratch handles, releases the scheduler lock, processes bounded SQEs, then reacquires the scheduler lock to choose the next process. This prevents Cap’n Proto decode, serial output, and capability method bodies from running under the global scheduler lock.
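That lock discipline follows a three-phase pattern, sketched here with `std::sync::Mutex` as a stand-in for the kernel's scheduler lock (the types are hypothetical): extract handles under the lock, dispatch with the lock released, then relock only to choose the next process.

```rust
// Three-phase dispatch: lock, extract; unlock, process; relock, schedule.
use std::sync::Mutex;

pub struct Sched { pub ready: Vec<u32>, pub ring_handle: u32 }

pub fn timer_dispatch(sched: &Mutex<Sched>, process: impl Fn(u32) -> u32) -> Option<u32> {
    // Phase 1: take what dispatch needs while holding the lock.
    let handle = { sched.lock().unwrap().ring_handle };
    // Phase 2: process bounded SQEs with the lock released, so decode and
    // capability method bodies never run under the scheduler lock.
    let _completions = process(handle);
    // Phase 3: reacquire the lock only to choose the next process.
    let mut s = sched.lock().unwrap();
    s.ready.pop()
}
```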
The idle task is currently a user-mode process with one code page and one stack page. It exists because the timer return path assumes interrupts entered from CPL3. A future kernel-mode idle loop requires distinct IRQ entry/restore handling for CPL0 and CPL3 frames.
Exit switches to the kernel PML4 before tearing down the exiting address space, releases capability authority, completes process waiters, and defers final drop until the scheduler is running on another kernel stack.
Invariants
- The idle process must never block in `cap_enter` or exit.
- Ring dispatch must not hold the scheduler lock.
- Timer dispatch runs with the current process CR3, so user buffers are accessible only for that process.
- Blocked `cap_enter` waiters wake when enough CQEs are available or their finite timeout expires.
- Direct IPC handoff is a scheduling preference, not a bypass of process state checks.
- The scheduler must update TSS.RSP0 and syscall kernel RSP on each switch.
- FS base is saved and restored across context switches for TLS.
- The final drop of an exiting process must not occur on its own kernel stack.
Code Map
- `kernel/src/sched.rs` - process table, run queue, blocking, wakeups, timer scheduling, exit, direct IPC target.
- `kernel/src/arch/x86_64/context.rs` - CPU context layout, timer entry/restore, tick counter.
- `kernel/src/arch/x86_64/idt.rs` - timer interrupt handler wiring.
- `kernel/src/arch/x86_64/pic.rs` and `kernel/src/arch/x86_64/pit.rs` - PIC remap and PIT setup.
- `kernel/src/arch/x86_64/gdt.rs` - TSS and kernel stack updates.
- `kernel/src/arch/x86_64/syscall.rs` - blocking syscall transition for `cap_enter`.
- `kernel/src/arch/x86_64/tls.rs` - FS base save/restore.
- `kernel/src/process.rs` - process state, kernel stacks, idle process.
Validation
- `make run` validates timer preemption, ring fairness, direct IPC handoff, blocked `cap_enter` wakeups, process exit, and clean halt.
- `make run-spawn` validates process wait blocking and child exit completion through `ProcessHandle.wait`.
- `cargo build --features qemu` verifies QEMU-only scheduler and halt paths.
- QEMU smoke output for IPC includes direct handoff diagnostics when the server is woken from a blocked RECV.
Open Work
- Replace the user-mode idle process with a kernel/per-CPU idle context after interrupt restore paths support CPL0 timer entries.
- Implement SMP with per-CPU scheduler state, per-CPU syscall stacks, and TLB shootdown.
- Add priority or policy scheduling only once the current authority and IPC semantics are stable.
- Add service restart policy outside the static boot graph.
Trust Boundaries
This page gives reviewers one place to find the hostile-input boundaries, trusted inputs, and current isolation assumptions that matter for capOS security review.
Current Boundaries
| Boundary | Trust rule | Current enforcement | Validation and review source |
|---|---|---|---|
| Ring 0 to Ring 3 | The kernel trusts no userspace register, pointer, SQE, CapSet, or result buffer field. | kernel/src/arch/x86_64/syscall.rs, kernel/src/mem/validate.rs, and kernel/src/cap/ring.rs validate syscall arguments, user buffers, opcodes, and capability table lookups before privileged use. | ../panic-surface-inventory.md, REVIEW.md |
| Capability table to kernel object | A process acts only through a live table-local CapId with matching generation and interface. | capos-lib/src/cap_table.rs owns generation-tagged slots; kernel capability dispatch goes through CapObject::call. | cargo test-lib, QEMU ring and IPC smokes recorded in REVIEW_FINDINGS.md |
| Capability ring shared memory | Userspace owns SQ writes, but the kernel owns validation, dispatch, completion, and failure semantics. | SQ/CQ headers and entries live in capos-config/src/ring.rs; kernel dispatch bounds indexes, buffer ranges, opcodes, transfer descriptors, and CQ posting. | cargo test-ring-loom, QEMU ring corruption, reserved opcode, fairness, IPC, and transfer smokes |
| Endpoint IPC and transfer | IPC cannot create or destroy authority except through explicit copy, move, release, or spawn transactions. | kernel/src/cap/endpoint.rs, kernel/src/cap/transfer.rs, and capos-lib/src/cap_table.rs implement queued calls, RECV/RETURN, copy/move transfer, badge propagation, and rollback. | ../authority-accounting-transfer-design.md, open transfer findings in REVIEW_FINDINGS.md |
| Manifest and boot package | Boot manifest bytes and embedded binaries are untrusted inputs until parsed and validated. Only holders of the read-only BootPackage cap can request chunked manifest bytes; ordinary services receive no default boot-package authority. | tools/mkmanifest, capos-config/src/manifest.rs, kernel/src/cap/boot_package.rs, ELF parsing in capos-lib/src/elf.rs, and kernel load paths validate graph references, paths, CapSet layout, interface IDs, manifest-read bounds, ELF bounds, and load ranges. | cargo test-config, cargo test-mkmanifest, cargo test-lib, manifest and ELF fuzz targets, make run-spawn |
| Process spawn inputs | Parent-supplied spawn params, ELF bytes, grants, badges, and result-cap insertion must fail closed. | ProcessSpawner currently validates ELF load, grants, explicit badge attenuation, frame exhaustion, and parent cap-slot exhaustion. Manifest schema-version guardrails reject unknown manifest vintages before graph validation. | Spawn QEMU smoke evidence and open findings in REVIEW_FINDINGS.md |
| Host tools and filesystem | Manifest/config input must not escape intended source directories or invoke unconstrained host commands. | tools/mkmanifest validates references and path containment, rejects unpinned CUE compilers, and Makefile targets route CUE and Cap’n Proto through pinned tool paths. | ../trusted-build-inputs.md, make generated-code-check, make dependency-policy-check |
| Generated code and schema | Schema, generated bindings, and no_std patches are trusted build inputs. | schema/capos.capnp, build scripts, tools/generated/capos_capnp.rs, and tools/check-generated-capnp.sh make generated-code drift review-visible. | ../trusted-build-inputs.md, make generated-code-check |
| Device DMA and MMIO | Current userspace receives no raw DMA buffer, device physical address, virtqueue pointer, or BAR mapping. | The QEMU virtio-net path is allowed only through kernel-owned bounce buffers until typed DMAPool, DeviceMmio, and Interrupt capabilities exist. | ../dma-isolation-design.md, make run-net |
| Panic and emergency paths | Hostile input should produce controlled errors, not panic, allocate unexpectedly, or expose stale state. | Ring dispatch is mostly controlled-error; remaining panic surfaces are classified by reachability and tracked as hardening work. | ../panic-surface-inventory.md, REVIEW.md |
Security Invariants
- All authority is represented by capability-table hold edges; no syscall or host tool path should bypass the capability graph.
- The interface is the permission: method authority is expressed by the typed Cap’n Proto interface or by a narrower wrapper capability, not by ambient process identity.
- Kernel operations at hostile boundaries validate structure, bounds, ownership, generation, interface ID, and resource availability before mutating privileged state.
- Failed transfer, spawn, manifest, and DMA setup paths must leave ledgers, cap tables, frame ownership, and in-flight call state unchanged or explicitly rolled back.
- Trusted build inputs must be pinned or drift-review-visible before their output becomes part of the boot image or generated source baseline.
Open Work
- Unify fragmented resource ledgers into the authority-accounting model so reviewers can audit quotas without following parallel counters.
- Harden open panic-surface entries that become more exposed as spawn, lifecycle, SMP, or userspace drivers expand hostile input reachability.
- Keep DMA in kernel-owned bounce-buffer mode until the `DMAPool`, `DeviceMmio`, and `Interrupt` transition gates have code and QEMU proof.
Verification Workflow
This page maps capOS claims to the commands, QEMU smokes, fuzz targets, proof tools, and review documents that currently support them.
Local Command Set
Use the repo aliases and Makefile targets instead of bare host commands. The
workspace default Cargo target is x86_64-unknown-none, so host tests rely on
aliases that set the host target explicitly.
| Scope | Command | What it checks |
|---|---|---|
| Formatting | make fmt-check | Rust formatting across kernel, shared crates, standalone userspace crates, and demos. |
| Config and manifest logic | cargo test-config | Cap’n Proto manifest encode/decode, CUE value handling, CapSet layout, and config validation. |
| Ring concurrency model | cargo test-ring-loom | Bounded SQ/CQ producer-consumer invariants and corrupted-SQ recovery behavior. |
| Shared library logic | cargo test-lib | ELF parser, frame bitmap, frame ledger, capability table, and property-test coverage. |
| Manifest tool | cargo test-mkmanifest | Host-side manifest conversion and validation behavior. |
| Userspace runtime | make capos-rt-check | capos-rt build path, entry ABI, typed clients, ring helpers, and no_std constraints. |
| Kernel build | cargo build --features qemu | Kernel build with the QEMU exit feature enabled. |
| Generated bindings | make generated-code-check | Cap’n Proto compiler path/version, generated output equality, no_std patch anchors, and checked-in baseline drift. |
| Dependency policy | make dependency-policy-check | cargo-deny and cargo-audit policy across root and standalone lockfiles. |
| Full image build | make | Kernel, userspace demos, runtime smoke binaries, manifest, Limine artifacts, and ISO packaging. |
| Default QEMU smoke | make run | End-to-end boot, userspace process output, capability ring, IPC, transfer, VirtualMemory, TLS, cleanup, and final halt paths included in the default manifest. |
| Spawn QEMU smoke | make run-spawn | Init-owned spawn flow, ProcessSpawner hostile cases, child grants, waits, and cleanup. |
| Networking smoke | make run-net | QEMU virtio-net attachment and kernel PCI/device-discovery path. |
| Kani proofs | make kani-lib | Bounded proofs for selected capos-lib invariants when cargo-kani is installed. |
Do not claim full verification unless the relevant command actually ran in the
current change. For doc-only changes, use an appropriately narrower check such
as `mdbook build`.
Review Workflow
- Identify the changed trust boundary, or state that the change is docs-only.
- Read `REVIEW.md` for the applicable security, unsafe, memory, performance, capability, and emergency-path checklist.
- Read `REVIEW_FINDINGS.md` before judging correctness so known open findings are not treated as solved behavior.
- For system-design work, list the concrete design and research files read; reviewers should reject vague grounding such as “docs” or “research”.
- Run the smallest command set that exercises the changed behavior, then add QEMU proof for user-visible kernel or runtime behavior.
- Record unresolved non-critical findings in `REVIEW_FINDINGS.md` with concrete remediation context before treating the task as reviewed.
Evidence by Claim
| Claim type | Required evidence |
|---|---|
| Parser or manifest validation | Host tests for valid and malformed input; fuzz target when arbitrary bytes can reach the parser. |
| Kernel/user pointer safety | QEMU hostile-pointer smoke plus code review of address, length, permissions, and validation-to-use windows. |
| Ring or IPC transport behavior | Host model/property tests where possible, plus QEMU process output proving success and failure paths. |
| Capability transfer or release | Rollback tests for copy/move/release failure, cap-slot exhaustion, stale caps, and process-exit cleanup. |
| Resource accounting | Tests that prove quota rejection, matched release on success and failure, and process-exit cleanup. |
| Generated code or schema changes | make generated-code-check and a checked-in baseline diff generated by the pinned compiler. |
| Dependency or toolchain changes | Dependency-class review plus make dependency-policy-check; update ../trusted-build-inputs.md when trust assumptions change. |
| Device or DMA work | make run-net or a targeted QEMU smoke; no userspace-driver transition without the gates in ../dma-isolation-design.md. |
| Panic-surface hardening | Updated ../panic-surface-inventory.md when reachability or classification changes. |
Fuzzing and Proof Tracks
The current fuzz corpus lives under fuzz/ and covers manifest Cap’n Proto
input, exported JSON conversion for mkmanifest, and arbitrary ELF parser
input. Run fuzzers when a change alters those parsers, schema shape, or
validation rules.
Kani coverage is intentionally narrow and lives in capos-lib, where pure
logic can be bounded without hardware state. Add or refresh Kani harnesses for
ledger, cap-table, bitmap, and parser invariants when those invariants become
part of a security claim.
Loom coverage belongs in shared ring logic. Extend cargo test-ring-loom when
SQ/CQ ownership, ordering, corruption recovery, or wake semantics change.
Documentation Sources
- `REVIEW.md`: rules for security, unsafe code, capability invariants, resource accounting, and emergency paths.
- `REVIEW_FINDINGS.md`: open remediation backlog and latest verification records.
- `../trusted-build-inputs.md`: trusted compiler, generated-code, dependency, bootloader, manifest, and host-tool inputs.
- `../panic-surface-inventory.md`: classified panic-like surfaces and commands used to generate the inventory.
- `../authority-accounting-transfer-design.md`: authority graph, quota, transfer, rollback, and ProcessSpawner accounting invariants.
Panic-Surface Inventory
Scope: `panic!`, `assert!`, `debug_assert!`, `.unwrap()`, `.expect()`,
`todo!`, and `unreachable!` surfaces relevant to boot manifest loading, ELF
loading, SQE handling, params/result buffers, IPC, and future spawn inputs.
Classification terms:
- `trusted-internal`: depends on kernel/shared-code invariants, static ABI layout, or host build/test code; not directly controlled by a service.
- `boot-fatal`: reached during boot/package setup before mutually untrusted services run. Bad platform/package state can halt the system.
- `untrusted-input reachable`: reachable from userspace-controlled SQEs, Cap’n Proto params/result buffers, IPC state, manifest/package data, or future spawn-controlled service/binary data.
Summary
No current `panic!`/`assert!`/`unwrap()`/`expect()` site in the kernel ring
dispatch path directly consumes raw SQE fields or user params/result-buffer
pointers. Those paths mostly return CQE errors through
`kernel/src/cap/ring.rs`.
The remaining relevant surfaces are boot-fatal setup assumptions, scheduler
internal invariants that would become more exposed once untrusted spawn/lifecycle
inputs can create or destroy processes dynamically, one IPC queue invariant,
and a manifest validation .expect() guarded by a prior graph-validation call.
Manifest And Future Spawn Inputs
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/main.rs:308 run_init | MODULES.response().expect("no modules from bootloader") | Boot package/module table | boot-fatal | Missing Limine modules abort before manifest validation. |
capos-config/src/manifest.rs:328 validate_bootstrap_cap_sources | .expect("graph validation checked service exists") | Manifest service-source caps after validate_manifest_graph() | untrusted-input reachable, guarded | The call is safe only when callers preserve the current validation order in kernel/src/main.rs:346-351. Future spawn/package validation must not call this independently on unchecked manifests. |
kernel/src/main.rs:381 run_init | elf_cache.get(...).ok_or_else(...) | Manifest service binary reference | untrusted-input reachable, controlled error | Not a panic surface. Included because it is the future spawn shape to preserve: unknown or unparsed binaries return an error. |
kernel/src/main.rs:405 run_init | Process::new(...).map_err(...) | Manifest-spawned process creation | untrusted-input reachable, controlled error | Current boot path converts allocation/mapping failures into boot errors. Future ProcessSpawner should keep this shape instead of adding unwraps. |
kernel/src/cap/mod.rs:278 create_all_service_caps | unreachable!("kernel source resolved in pass 1") | Manifest cap source resolution | trusted-internal | Depends on the two-pass enum construction in the same function. Not directly controlled after pattern matching on CapSource::Kernel, but future dynamic grants should avoid relying on this internal sentinel. |
ELF Inputs
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/main.rs:202 load_elf | debug_assert!(stack_top % 16 == 0) | ELF load path | trusted-internal | Constant stack layout invariant, not ELF-controlled. |
kernel/src/main.rs:303 align_up | debug_assert!(align.is_power_of_two()) | TLS mapping from parsed ELF | trusted-internal | elf::parse rejects non-power-of-two TLS alignment; load_tls also caps the size before calling align_up. |
capos-lib/src/elf.rs parser | no runtime panic surfaces outside tests/Kani | Boot manifest ELF bytes; future spawn ELF bytes | untrusted-input reachable, controlled error | Parser uses checked offsets/ranges and returns Err(&'static str). Test-only assertions/unwraps are excluded from runtime classification. |
kernel/src/main.rs:167 load_elf | slice init_data[src_offset..] | Parsed ELF PT_LOAD file range | untrusted-input reachable, guarded | Not matched by the panic-token grep, but it is an index panic candidate if parser invariants are bypassed. elf::parse checks segment file ranges before load_elf. |
kernel/src/main.rs:290-293 load_tls | slice &init_data[init_start..init_end] | Parsed ELF TLS file range | untrusted-input reachable, guarded | Not matched by the panic-token grep, but it is an index panic candidate if parser invariants are bypassed. elf::parse checks TLS file bounds before load_tls. |
SQE And Params/Result Buffers
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/cap/ring.rs process_ring / dispatch_call / dispatch_recv / dispatch_return | no matched panic-like surfaces | Userspace SQEs, params, result buffers | untrusted-input reachable, controlled error | SQ corruption, unsupported fields/opcodes, oversized buffers, invalid user buffers, and CQ pressure return transport errors or defer consumption. |
capos-config/src/ring.rs:147-149 | const assert! layout checks | Shared ring ABI | trusted-internal | Compile-time ABI guard; not runtime input reachable. |
capos-config/src/capset.rs:53-55 | const assert! layout checks | Shared CapSet ABI | trusted-internal | Compile-time ABI/page-fit guard; not runtime input reachable. |
capos-lib/src/frame_bitmap.rs:87, capos-lib/src/frame_bitmap.rs:149 | .try_into().unwrap() on 8-byte bitmap windows | Frame allocation, including work triggered by manifest/process creation and capability methods | trusted-internal | Guarded by frame + 64 <= total or i + 64 <= to, assuming the caller-provided bitmap covers total_frames. Kernel constructs that bitmap at boot. |
IPC
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/cap/endpoint.rs:202 Endpoint::endpoint_call | pending_recvs.pop_front().unwrap() | Cross-process CALL delivered to pending RECV | untrusted-input reachable, guarded | Guarded by !inner.pending_recvs.is_empty() under the same lock. It is still on an IPC path driven by service SQEs, so S.8.4 should convert this to an explicit error/rollback path if panic-free IPC is required. |
kernel/src/cap/endpoint.rs:343-345 endpoint_restore_recv_front | unchecked push_front growth | IPC rollback path | untrusted-input reachable, non-panic today | VecDeque::push_front can allocate/panic if spare capacity assumptions are broken. Current pending recv queue is pre-reserved and bounded on normal insert; rollback paths should keep the bound explicit when hardened. |
Scheduler And Process Lifecycle
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
| kernel/src/sched.rs:55 init_idle | Process::new_idle().expect(...) | Boot scheduler init | boot-fatal | Idle creation OOM/mapping failure panics before services run. |
| kernel/src/sched.rs:206-211 block_current_on_cap_enter | current.expect, assert!, process-table expect | cap_enter(min_complete > 0) path | untrusted-input reachable, internal invariant | Userspace can request blocking, but these unwraps assert scheduler state, not user values. Future process lifecycle/spawn changes increase this exposure. |
| kernel/src/sched.rs:252-264 capos_block_current_syscall | current.expect, idle assert!, table expect, panic! if not blocked | Blocking syscall continuation | untrusted-input reachable, internal invariant | Triggered after cap_enter chooses to block. User controls the request, but panic requires kernel state inconsistency. |
| kernel/src/sched.rs:279, kernel/src/sched.rs:376 | run_queue references missing process expect | Scheduling after queue selection | trusted-internal now; future spawn/lifecycle sensitive | A stale run-queue PID panics. Dynamic spawn/exit must preserve run-queue/process-table invariants. |
| kernel/src/sched.rs:407-422 exit_current | current.expect, idle assert!, remove(...).unwrap(), next-process unwrap() | Ambient exit syscall and future process exit | untrusted-input reachable, internal invariant | Any service can exit itself. Panic requires scheduler corruption or idle misuse, but future spawn/process APIs should harden this boundary. |
| kernel/src/sched.rs:468-475 current_ring_and_caps | current.expect, process-table expect | cap_enter flush path | untrusted-input reachable, internal invariant | User can call cap_enter; panic requires no current process or missing table entry. |
| kernel/src/sched.rs:493-517 start | initial run-queue expect, process-table unwrap, CR3 expect | Boot service start | boot-fatal | Manifest with zero services is rejected earlier, and process creation errors out; panics indicate scheduler/CR3 invariant breakage. |
| kernel/src/arch/x86_64/context.rs:59-60 | CR3 expect("invalid CR3 from scheduler") | Timer interrupt scheduling | trusted-internal; future lifecycle sensitive | Scheduler should only return page-aligned CR3s from AddressSpace. |
Boot Platform And Memory Setup
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
| kernel/src/main.rs:36 | assert!(BASE_REVISION.is_supported()) | Limine boot protocol | boot-fatal | Platform/bootloader contract check. |
| kernel/src/main.rs:41-44 | memory-map and HHDM expect | Limine boot protocol | boot-fatal | Missing bootloader responses halt before untrusted services. |
| kernel/src/main.rs:74 | cap::init().expect(...) | Kernel cap table bootstrap | boot-fatal | Fails on kernel-internal cap-table exhaustion. |
| kernel/src/mem/frame.rs:39 | frame-bitmap region expect | Boot memory map | boot-fatal | Bad or too-small memory map halts. |
| kernel/src/mem/frame.rs:115 | free_frame uses try_free_frame(...).expect(...) | Kernel-owned frame teardown | trusted-internal | Capability handlers use try_free_frame; this panic surface is for kernel-owned frames and rollback/Drop paths. |
| kernel/src/mem/frame.rs:139 | assert!(offset != 0) | HHDM cache use before frame init | trusted-internal | Initialization-order invariant. |
| kernel/src/mem/heap.rs:11 | heap allocation expect | Boot heap init | boot-fatal | Fails if the frame allocator cannot provide the fixed kernel heap. |
| kernel/src/mem/paging.rs:32, kernel/src/mem/paging.rs:58, kernel/src/mem/paging.rs:70 | page-alignment .unwrap() / paging initialized assert! | Kernel frame/page-table internals | trusted-internal | frame::alloc_frame returns page-aligned addresses. |
| kernel/src/mem/paging.rs:106, kernel/src/mem/paging.rs:189, kernel/src/mem/paging.rs:194 | kernel PML4/map remap expects | Kernel page-table setup | boot-fatal | Assumes kernel image is mapped in bootloader tables and enough frames exist. |
| kernel/src/arch/x86_64/syscall.rs:49 | STAR selector expect | Syscall init | boot-fatal | GDT selector layout invariant. |
| kernel/src/sched.rs:299, kernel/src/sched.rs:450, kernel/src/sched.rs:517 | CR3 expect("invalid CR3") | Context switch/exit/start | trusted-internal; future lifecycle sensitive | Scheduler should only carry page-aligned address-space roots. |
Verification Notes
Inventory commands run:
rg -n "\b(panic!|assert!|assert_eq!|assert_ne!|debug_assert!|debug_assert_eq!|debug_assert_ne!|unwrap\(|expect\(|todo!|unreachable!)" kernel capos-lib capos-config init demos tools schema system.cue Makefile docs -g '*.rs' -g '*.cue' -g '*.md' -g 'Makefile'
rg -n "\b(panic!|assert!|assert_eq!|assert_ne!|debug_assert!|debug_assert_eq!|debug_assert_ne!|unwrap\(|expect\(|todo!|unreachable!)" kernel/src capos-lib/src capos-config/src init/src demos/capos-demo-support/src demos/*/src tools/mkmanifest/src -g '*.rs'
Code tests were not run for this doc-only inventory because the write scope is
limited to docs/panic-surface-inventory.md and no Rust code, schema, manifest,
or build configuration changed.
Trusted Build Inputs
This inventory covers the build inputs currently trusted by the capOS boot image, generated bindings, host tooling, and verification paths. It started as the S.10.0 inventory and now also records the S.10.2 generated-code drift check and the S.10.3 dependency policy.
Summary
| Input | Current source | Pinning status | Drift-review status |
|---|---|---|---|
| Limine bootloader binaries | Makefile:5-10, Makefile:34-49 | Git commit and selected binary SHA-256 values are pinned. | make limine-verify fails if the checked-out commit or copied bootloader artifacts drift. |
| Rust toolchain | rust-toolchain.toml:1-3 | Floating nightly channel with target triples only. | No repo-visible date, hash, or installed component audit. The current local resolver reported rustc 1.96.0-nightly (2972b5e59 2026-04-03). |
| Workspace cargo dependencies | Cargo.toml:1-9, crate Cargo.toml files, Cargo.lock | Lockfile pins exact crate versions and checksums for the root workspace. Manifest requirements remain semver ranges. | make dependency-policy-check runs cargo deny check plus cargo audit against the root workspace and lockfile in CI. |
| Standalone cargo dependencies | init/Cargo.lock, demos/Cargo.lock, tools/mkmanifest/Cargo.lock, capos-rt/Cargo.lock, fuzz/Cargo.lock | Each standalone workspace has its own lockfile. | make dependency-policy-check runs the shared deny/audit baseline against every standalone manifest and lockfile. Cross-workspace version drift remains review-visible and intentional where lockfiles differ. |
| Cap’n Proto compiler | Makefile:12-80, kernel/build.rs, capos-config/build.rs, tools/check-generated-capnp.sh | Official capnproto-c++-1.2.0.tar.gz source tarball URL, version, and SHA-256 are pinned in Makefile; make capnp-ensure builds a shared .capos-tools/capnp/1.2.0/bin/capnp under the git common-dir parent so linked worktrees reuse it. The build rule patches the distributed CLI version placeholder to the pinned version before compiling. | Build scripts default to the clone-shared pinned path and reject CAPOS_CAPNP when it points elsewhere. Make targets export the pinned path and CI persists it through $GITHUB_ENV. make generated-code-check verifies both the exact compiler path and Cap'n Proto version 1.2.0 before regenerating bindings through Cargo. |
| Cap’n Proto Rust runtime/codegen crates | capos-config/Cargo.toml:9, capos-config/Cargo.toml:15, kernel/Cargo.toml:12, kernel/Cargo.toml:21, Cargo.lock | Cargo manifests use exact capnp = "=0.25.4" and capnpc = "=0.25.3" requirements where declared; lockfiles pin exact crate versions and checksums. | S.10.3 now requires dependency-class and no_std review before these changes are accepted. |
| Generated capnp bindings | capos-config/src/lib.rs:10-12, kernel/src/main.rs:15-18, tools/generated/capos_capnp.rs, tools/check-generated-capnp.sh | Generated into Cargo OUT_DIR; the expected patched output is checked in under tools/generated/. | make generated-code-check regenerates both crate outputs through Cargo and fails if either output differs from the checked-in baseline. |
| no_std patching of generated bindings | kernel/build.rs:13-30, capos-config/build.rs:10-25, tools/check-generated-capnp.sh | Patch anchors are asserted in both build scripts. | make generated-code-check verifies the patched output contains the expected no_std imports for both crates. |
| Linker script build scripts | kernel/build.rs:2, init/build.rs:2-5, demos/*/build.rs | Source-controlled scripts and linker scripts. | Build rerun boundaries are explicit; generated link args are not independently audited. |
| CUE manifest compiler | Makefile:13-91, tools/mkmanifest/src/main.rs:65-130, tools/mkmanifest/src/lib.rs:30-80, .github/workflows/ci.yml | make cue-ensure installs cuelang.org/go/cmd/cue pinned to v0.16.0 into the clone-shared .capos-tools/cue/0.16.0/bin/cue path. | Make exports CAPOS_CUE to tools/mkmanifest, and CI records that exact path through $GITHUB_ENV before QEMU smoke. mkmanifest also derives the same clone-shared path, rejects missing or non-canonical CAPOS_CUE, and checks cue version v0.16.0 before export. |
| mdBook documentation tools | Makefile, book.toml | GitHub release assets for mdBook v0.5.0 and mdbook-mermaid v0.17.0 are pinned by version and SHA-256 under the clone-shared .capos-tools path. | make docs and make cloudflare-pages-build verify the tarball checksums and executable versions, refresh the Mermaid assets, and build target/docs-site. |
| QEMU and firmware | Makefile:67-83 | Host-installed qemu-system-x86_64; OVMF path is hard-coded for UEFI. | No repo-visible version or firmware checksum. Current local host reported QEMU 10.2.2. |
| ISO and host filesystem tools | Makefile:51-65 | Host-installed xorriso, sha256sum, git, make, shell utilities. | No version capture except ad hoc local inspection. |
| Boot manifest and embedded binaries | system.cue:1-144, tools/mkmanifest/src/lib.rs:82-115, Makefile:28-29, Makefile:51-65 | Source manifest is checked in; embedded ELF payloads are build artifacts. | Manifest validation checks references and path containment, but final manifest.bin is generated and not checksum-recorded. |
| Build downloads | Makefile, Cargo lockfiles, rust-toolchain.toml | Limine and documentation tool tarballs are explicitly fetched; Cargo, Go, and rustup downloads are implicit when caches/toolchains are absent. | Limine artifacts and documentation tool tarballs are verified. Cargo, Go, and rustup downloads rely on upstream tooling and lockfiles, with no repo policy. |
S.10.3 Dependency Policy
Dependency changes are accepted only if they satisfy this policy and are recorded in the owning task checklist.
Dependency classes
Use these classes when reviewing a dependency change:
- Kernel-critical no_std: crates used directly by `kernel`, `capos-lib`, and `capos-config`.
- Userspace-runtime no_std: crates used by `init`, `demos`, and `capos-rt`.
- Host/build: crates used by `tools/*`, `build.rs` helpers, and generated output pipelines.
- Test/fuzz/dev: crates gated by `dev-dependencies` or target-specific sections for fuzz/proptests/smoke support.
Required pre-merge criteria
For any added dependency (or bump in any class):
- Manifest and features are explicit. Dependency entries must include explicit feature choices; avoid `default-features = true` unless justified.
- No_std compatibility is proven for no_std classes. Kernel-critical and userspace-runtime dependencies must compile in a `#![no_std]` mode with `alloc` where expected. `cargo build -p <crate> --target x86_64-unknown-none` must succeed for every kernel/no_std crate affected.
- Security policy checks run and pass. CI-equivalent checks for the touched workspace are required through `make dependency-policy-check`, which runs `cargo deny check` on every Cargo manifest and `cargo audit` on every lockfile.
- Dependency class change is justified in review. PR text must include target class, ownership rationale, transitive graph impact, and why the crate is not a transitive replacement for an already-allowed dependency.
- Lockfile behavior is explicit. Update only intended lockfiles and record intentional cross-workspace drift in this document if workspace purpose differs.
No_std add/edit checklist
- Reject crates that require `std`, OS I/O, or unsupported platform APIs in the dependency path intended for kernel classes.
- Reject dependencies that re-export broad platform facades or large unsafe surface unless there is a replacement with smaller scope and better audit visibility.
- Record a license and supply-chain review result (via policy checks) before merge.
- Confirm no `unsafe` contract escapes are added without a review surface note in the relevant module.
Standing requirements
- Add `S.10.3` checks to the target branch plan item for any kernel/no_std crate dependency change and document the exact pass command set.
- Keep lockfile deltas review-visible in normal PR flow; lockfile pinning is the minimum bar, not the gate.
- Keep transitive drift in sync with the trust class: class-wide divergence across lockfiles requires explicit justification.
Remaining gaps after S.10.3 policy
- Continue Rust toolchain pinning work (date/hash pin, reproducible host compiler inputs) as a separate build-reproducibility task.
- Decide whether final ISO/payload hashes become policy-grade inputs in production-hardening stages.
Bootloader and ISO Inputs
The Makefile now pins Limine at commit
aad3edd370955449717a334f0289dee10e2c5f01 and verifies these copied artifacts:
| Artifact | Checksum reference |
|---|---|
| limine/limine-bios.sys | Makefile:7 |
| limine/limine-bios-cd.bin | Makefile:8 |
| limine/limine-uefi-cd.bin | Makefile:9 |
| limine/BOOTX64.EFI | Makefile:10 |
make limine-ensure clones https://github.com/limine-bootloader/limine.git
only when limine/.git is absent, fetches the pinned commit if needed, checks
it out detached, and runs make inside the Limine tree (Makefile:34-40).
make limine-verify then checks the repository HEAD and artifact checksums
(Makefile:42-49). The ISO copies the kernel, generated manifest.bin,
Limine config, and verified Limine artifacts into iso_root/, runs xorriso,
then runs limine bios-install (Makefile:51-65).
Remaining reproducibility gap: Limine source is pinned, but the Limine build host compiler and environment are not pinned or recorded.
Rust Toolchain
rust-toolchain.toml specifies:
- `channel = "nightly"`
- `targets = ["x86_64-unknown-none", "aarch64-unknown-none"]`
This is a floating channel pin, not a reproducible toolchain pin. A future
rustup resolution can move the compiler even when the repository is
unchanged. The current local host resolved to:
- `rustc 1.96.0-nightly (2972b5e59 2026-04-03)`
- `cargo 1.96.0-nightly (888f67534 2026-03-30)`
- host target `x86_64-unknown-linux-gnu`
The Makefile derives HOST_TARGET from rustc -vV (Makefile:12) and uses
that for tools/mkmanifest (Makefile:28-29). Cargo aliases in
.cargo/config.toml:4-22 hard-code x86_64-unknown-linux-gnu for host tests.
Remaining reproducibility gap: pin the nightly by date or exact toolchain hash, and record required components. Until then, compiler drift can change codegen, linking, lints, and generated bindings without a repository diff.
Cargo Dependencies
The root workspace members are capos-config, capos-lib, and kernel
(Cargo.toml:1-4). init/, demos/, tools/mkmanifest/, and fuzz/ are
standalone workspaces with their own lockfiles.
Important direct dependencies and current root-lock resolutions:
| Dependency | Manifest references | Root lock resolution |
|---|---|---|
| capnp | capos-config/Cargo.toml:8, capos-lib/Cargo.toml:7, kernel/Cargo.toml:11 | 0.25.4 in Cargo.lock |
| capnpc | capos-config/Cargo.toml:14, kernel/Cargo.toml:19 | 0.25.3 in Cargo.lock |
| limine crate | kernel/Cargo.toml:7 | 0.6.3 in Cargo.lock |
| spin | kernel/Cargo.toml:8 | 0.9.8 in Cargo.lock |
| x86_64 | kernel/Cargo.toml:9 | 0.15.4 in Cargo.lock |
| linked_list_allocator | kernel/Cargo.toml:10 | 0.10.6 in Cargo.lock |
| loom | capos-config/Cargo.toml:17 | 0.7.2 in Cargo.lock |
| proptest | capos-lib/Cargo.toml:9-10 | 1.11.0 in Cargo.lock |
Standalone lockfile drift observed during this inventory:
| Lockfile | Notable direct/runtime resolution |
|---|---|
| init/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.5 |
| demos/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 |
| tools/mkmanifest/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, serde_json 1.0.149 |
| capos-rt/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 |
| fuzz/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, libfuzzer-sys 0.4.12 |
Cargo lockfiles pin exact crate versions and crates.io checksums, so ordinary crate upgrades are review-visible through lockfile diffs. They do not, by themselves, define whether a dependency is acceptable for kernel/no_std use, whether multiple lockfiles must converge, or whether advisories/licenses block the build.
S.10.3 policy gate:
- `deny.toml` defines the shared license, advisory, ban, and source baseline.
- `make dependency-policy-check` runs `cargo deny check` on the root workspace, `init`, `demos`, `tools/mkmanifest`, `capos-rt`, and `fuzz`.
- The same target runs `cargo audit --deny warnings` on every checked-in lockfile.
- Local packages are marked `publish = false` so cargo-deny treats them as private, and local path dependencies include `version = "0.1.0"` so registry wildcard requirements can remain denied.
- CI installs pinned `cargo-deny 0.19.4` and `cargo-audit 0.22.1` and runs the target.
Remaining dependency-policy gap: decide whether standalone lockfiles may
intentionally drift from the root lockfile, especially for capnp and
allocator crates used by userspace.
Cap’n Proto Compiler, Runtime, and Generated Bindings
The trusted Cap’n Proto inputs are:
- `schema/capos.capnp`, the source schema.
- Repo-local pinned `capnp`, invoked through the `capnpc` Rust build dependency via `CAPOS_CAPNP`.
- The `capnp` runtime crate with `default-features = false` and `alloc`.
- The `capnpc` codegen crate.
- Generated `capos_capnp.rs` written to Cargo `OUT_DIR`.
- Local no_std patching applied after generation.
kernel/build.rs and capos-config/build.rs both run
capnpc::CompilerCommand over ../schema/capos.capnp, then read the generated
capos_capnp.rs, assert that the expected #![allow(unused_variables)] anchor
is present, and inject:
```rust
#![allow(unused_imports)]
use ::alloc::boxed::Box;
use ::alloc::string::ToString;
```
The generated code used by builds is included from OUT_DIR in
capos-config/src/lib.rs:10-12 and kernel/src/main.rs:15-18. The expected
patched output is checked in as tools/generated/capos_capnp.rs, so schema,
compiler, capnpc crate, and patch-output changes must update that baseline and
become review-visible as a source diff.
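The anchor-check-and-inject step both build scripts perform can be sketched as a small host-runnable function. `patch_generated` and the example input are hypothetical names for illustration; the real logic lives in kernel/build.rs and capos-config/build.rs and operates on the generated file under `OUT_DIR`.

```rust
// Hedged sketch of the build-script patching step. The anchor and the
// injected import lines come from the description above; the function
// shape is an assumption, not the actual build-script code.
const ANCHOR: &str = "#![allow(unused_variables)]";
const NO_STD_PATCH: &str = "use ::alloc::boxed::Box;\nuse ::alloc::string::ToString;\n";

fn patch_generated(generated: &str) -> String {
    // Fail loudly if the capnpc anchor disappears: that would mean the
    // code generator changed and the patch assumptions need review.
    assert!(
        generated.contains(ANCHOR),
        "capnpc anchor missing from generated bindings"
    );
    // Inject the no_std imports immediately after the anchor, once.
    generated.replacen(ANCHOR, &format!("{ANCHOR}\n{NO_STD_PATCH}"), 1)
}

fn main() {
    // Hypothetical generated-file prefix standing in for capnpc output.
    let generated = "#![allow(unused_variables)]\npub mod capos_capnp {}\n";
    let patched = patch_generated(generated);
    assert!(patched.contains("::alloc::boxed::Box"));
    assert!(patched.contains(ANCHOR)); // anchor survives for drift checks
}
```

Keeping the anchor in the output is what lets `make generated-code-check` later verify both the capnpc anchor and the patch imports in one pass.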
S.10.2 generated-code drift check:
- `make generated-code-check` runs `tools/check-generated-capnp.sh`.
- The script invokes the actual Cargo build-script path for `capos-config` and `capos-kernel` in an isolated target directory, so it checks the generated artifacts that those crates would include from `OUT_DIR`.
- The script verifies that each patched file still contains the capnpc anchor plus the local no_std patch imports, compares the two crate outputs byte-for-byte, and then compares both outputs against `tools/generated/capos_capnp.rs`.
- Any intentional schema/codegen/patch change must update the checked-in baseline in the same review, making generated output drift review-visible.
- `make check` runs `fmt-check` plus `generated-code-check` for a single local or CI entry point.
- The current pinned compiler source is `capnproto-c++-1.2.0.tar.gz` from `https://capnproto.org/` with SHA-256 `ed00e44ecbbda5186bc78a41ba64a8dc4a861b5f8d4e822959b0144ae6fd42ef`. The checked-in `tools/generated/capos_capnp.rs` baseline must be regenerated with that compiler when schema or codegen behavior intentionally changes. The current pinned baseline SHA-256 is `224b4ec2296f800bd577b75cfbd679ebb78e7aa2f813ad9893061f4867c9dd3d`.
Remaining gaps for S.10.3:
- The no_std patching logic still lives in both build scripts. The baseline and pairwise output comparison catch divergent results, but a future cleanup could move the patch helper into shared source to reduce duplication.
Cargo Build Scripts
Build scripts currently do these trusted operations:
| Script | Behavior |
|---|---|
| kernel/build.rs | Watches kernel/linker-x86_64.ld, schema/capos.capnp, and itself; generates and patches capnp bindings. Checked by make generated-code-check. |
| capos-config/build.rs | Watches schema/capos.capnp; generates and patches capnp bindings. Checked by make generated-code-check. |
| init/build.rs | Emits a linker script argument for init/linker.ld. |
| demos/*/build.rs | Emits a linker script argument for demos/linker.ld. |
The linker build scripts derive CARGO_MANIFEST_DIR from Cargo and only emit
link arguments plus rerun directives. The capnp build scripts read and rewrite
generated code under OUT_DIR. None of these scripts fetch network resources.
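The linker-script build scripts described above reduce to two Cargo directives. The sketch below factors that into a hypothetical `link_directives` helper so the output is testable; the real scripts read `CARGO_MANIFEST_DIR` from the environment and print directly.

```rust
// Illustrative shape of the linker build scripts: derive the script path
// and emit only a rerun directive plus a link argument, nothing else.
// `link_directives` is an assumed helper name for demonstration.
fn link_directives(manifest_dir: &str, script_name: &str) -> Vec<String> {
    let script = format!("{manifest_dir}/{script_name}");
    vec![
        // Rebuild when the linker script itself changes.
        format!("cargo:rerun-if-changed={script}"),
        // Pass the script to the linker via -T.
        format!("cargo:rustc-link-arg=-T{script}"),
    ]
}

fn main() {
    // A real build.rs would use std::env::var("CARGO_MANIFEST_DIR") here.
    for line in link_directives("init", "linker.ld") {
        println!("{line}");
    }
}
```

Because the scripts only print directives derived from source-controlled inputs, their trust surface is the linker script file itself.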
S.10.2 coverage: make generated-code-check exercises both capnp build
scripts through Cargo, validates the patched generated files, and fails if the
two crate outputs drift apart or no longer match the checked-in generated
baseline.
Manifest, Embedded Binaries, and Downloaded Artifacts
system.cue declares named binaries and services. Makefile:54-55 builds
manifest.bin by running tools/mkmanifest on the host. mkmanifest runs:
- Resolve the clone-shared pinned CUE compiler from git state, reject missing or mismatched `CAPOS_CUE`, check `cue version v0.16.0`, then run `cue export system.cue --out json` (tools/mkmanifest/src/main.rs:65-128).
- JSON-to-`CueValue` conversion and manifest validation (tools/mkmanifest/src/main.rs:13-23).
- Binary embedding from relative paths (tools/mkmanifest/src/lib.rs:135-180).
- Binary-reference validation and Cap’n Proto serialization (tools/mkmanifest/src/main.rs:37-49).
Path handling rejects absolute paths, parent traversal, non-normal components,
and canonicalized paths that escape the manifest directory
(tools/mkmanifest/src/lib.rs:182-217). The generated manifest.bin is copied
into the ISO as /boot/manifest.bin (Makefile:117) and loaded by Limine via
limine.conf:5.
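The lexical half of those path checks can be sketched with the standard library's path components. `is_contained` is a hypothetical name; the real code in tools/mkmanifest/src/lib.rs additionally canonicalizes the path and verifies the result stays under the manifest directory, a step omitted here.

```rust
use std::path::{Component, Path};

// Hedged sketch of the lexical path-containment checks described above:
// reject absolute paths, parent traversal, and any non-normal component.
fn is_contained(rel: &Path) -> bool {
    // Every component must be a plain name: RootDir (absolute paths),
    // ParentDir (`..`), CurDir (`.`), and prefixes are all rejected.
    !rel.as_os_str().is_empty()
        && rel.components().all(|c| matches!(c, Component::Normal(_)))
}

fn main() {
    assert!(is_contained(Path::new("init/target/init.elf")));
    assert!(!is_contained(Path::new("../escape.elf")));      // parent traversal
    assert!(!is_contained(Path::new("/boot/manifest.bin"))); // absolute path
    assert!(!is_contained(Path::new("./init.elf")));         // non-normal leading component
}
```

The canonicalize-and-compare step matters because a lexically clean relative path can still escape the manifest directory through a symlink; the lexical check alone only fails fast on obvious abuse.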
Downloaded or generated artifacts in the current build:
| Artifact | Producer | Pinning/drift status |
|---|---|---|
| limine/ checkout | git clone/git fetch in Makefile:34-40 | Commit-pinned and artifact-verified. |
| Cargo registry crates | cargo build, cargo run, tests, fuzz | Lockfile-pinned checksums plus CI-enforced deny/audit checks through make dependency-policy-check. |
| Rust toolchain and targets | rustup from rust-toolchain.toml when absent | Floating nightly channel. |
| target/ kernel and host artifacts | Cargo | Generated, not checked in. |
| init/target/ and demos/target/ ELFs | Cargo standalone builds | Generated, embedded into manifest.bin; no final payload checksums recorded in source. |
| manifest.bin | tools/mkmanifest | Generated from system.cue plus ELF payloads; not checked in. |
| iso_root/ and capos.iso | Makefile, xorriso, Limine installer | Generated and gitignored; Limine inputs verified, full ISO checksum not source-recorded. |
Remaining gaps for S.10.2/S.10.3:
- Decide whether CI should record or compare hashes for `manifest.bin`, embedded ELF payloads, or the final ISO for reproducible-build tracking.
- Pin or record `xorriso`, `qemu-system-x86_64`, OVMF firmware, and other host tools used by build and boot verification with the same strictness as `capnp` and `cue`.
- Decide whether CI should record the pinned `cue export` JSON or final `manifest.bin` bytes if manifest reproducibility becomes release-critical.
Host Tools
Current local host versions observed during this inventory:
| Tool | Observed version | Build role |
|---|---|---|
| capnp | 1.2.0 | Repo-local schema compiler built by make capnp-ensure from a SHA-256-pinned official source tarball into the shared .capos-tools cache for this clone. |
| cue | v0.16.0 | Repo-local manifest compiler installed by make cue-ensure into the shared .capos-tools cache for this clone. |
| qemu-system-x86_64 | 10.2.2 | Boot verification via make run and make run-uefi. |
| xorriso | 1.5.8 | ISO generation. |
| make | 4.4.1 | Build orchestration. |
| git | 2.53.0 | Limine checkout/fetch and review workflow. |
These are environment observations, not repository pins. make run-uefi also
trusts /usr/share/edk2/x64/OVMF.4m.fd (Makefile:82-83) without a checksum.
Remaining gap for S.10.3: decide the minimum supported host tool versions and whether they are enforced by CI, a container/devshell, or explicit preflight checks.
Verification Used for This Inventory
This was a documentation-only inventory. Code tests and QEMU boot were not run because no source, build, runtime, or generated-code behavior was changed.
Scoped read-only verification commands used:
- `git status --short --branch`
- `rg -n "S\\.10|trusted|supply|Limine|limine|capnp|capnpc|QEMU|qemu|download|curl|git clone|wget|build\\.rs|rust-toolchain|Cargo\\.lock" ...`
- `rg --files`
- `cargo metadata --locked --format-version 1 --no-deps`
- `rg -n '^name = |^version = |^checksum = ' Cargo.lock init/Cargo.lock demos/Cargo.lock tools/mkmanifest/Cargo.lock fuzz/Cargo.lock`
- `command -v rustc cargo capnp cue qemu-system-x86_64 xorriso sha256sum git make`
- `rustc -Vv`, `cargo -V`, `capnp --version`, `cue version`, `qemu-system-x86_64 --version`, `xorriso -version`, `make --version`, `git --version`
DMA Isolation Design
S.11 gates PCI, virtio, and later userspace device-driver work on an explicit DMA authority model. The immediate goal is narrow: let the kernel bring up a QEMU virtio-net smoke without creating a user-visible raw physical-memory escape hatch.
Short-Term Decision
Use kernel-owned bounce buffers for the first in-kernel QEMU virtio-net smoke.
The kernel allocates DMA-capable pages from its own frame allocator, owns the virtqueue descriptor tables and packet buffers, programs the device with the corresponding physical addresses, and copies packet payloads between those buffers and the networking stack. No userspace process receives a DMA buffer capability, a physical address, a virtqueue pointer, or a BAR mapping for this smoke.
This is deliberately conservative:
- It works before ACPI/DMAR or AMD-Vi parsing, IOMMU page-table management, MSI/MSI-X routing, and userspace driver lifecycle supervision exist.
- It keeps all physical-address programming inside the kernel, where the same code that allocates the frames also bounds the descriptors that reference them.
- It does not make the current `FrameAllocator` capability part of the DMA path. `FrameAllocator` can expose raw frames today and is already tracked in `REVIEW_FINDINGS.md`; DMA must not build new untrusted-driver semantics on that interface.
- It gives the smoke a disposable implementation path. When NIC or block drivers move to userspace, bounce-buffer authority becomes a typed `DMAPool` object instead of an ad hoc physical-address grant.
An IOMMU-backed DMA-domain model remains the target for direct device access from mutually untrusted userspace drivers, but it is not a prerequisite for the first QEMU smoke. Without an IOMMU, a malicious bus-mastering device can still DMA to arbitrary RAM at the hardware level; the short-term smoke assumes QEMU-provided virtio hardware and protects against confused or untrusted userspace, not hostile hardware.
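The kernel-owned bounce-buffer TX path described above can be sketched as follows. The names (`BounceBuffer`, `TxDescriptor`, `stage_tx`) and the `Vec`-backed storage are illustrative stand-ins; the actual kernel uses its frame allocator and virtqueue structures.

```rust
// Hedged sketch of the bounce-buffer copy path: the kernel copies the
// payload into memory it owns and programs the device with a physical
// address that userspace never sees.
struct BounceBuffer {
    phys: u64,     // device-visible physical address, never shown to userspace
    data: Vec<u8>, // kernel-owned backing storage standing in for DMA pages
}

struct TxDescriptor {
    addr: u64,
    len: u32,
}

impl BounceBuffer {
    fn new(phys: u64, capacity: usize) -> Self {
        Self { phys, data: vec![0u8; capacity] }
    }

    /// Copy the payload into kernel-owned memory and build the descriptor
    /// the device will read. Oversized payloads fail closed.
    fn stage_tx(&mut self, payload: &[u8]) -> Option<TxDescriptor> {
        if payload.len() > self.data.len() {
            return None;
        }
        self.data[..payload.len()].copy_from_slice(payload);
        Some(TxDescriptor { addr: self.phys, len: payload.len() as u32 })
    }
}

fn main() {
    let mut buf = BounceBuffer::new(0x0010_0000, 2048);
    let desc = buf.stage_tx(b"ping").expect("payload fits");
    assert_eq!((desc.addr, desc.len), (0x0010_0000, 4));
    assert!(buf.stage_tx(&[0u8; 4096]).is_none()); // bound enforced before doorbell
}
```

The extra copy is the cost of the conservative design: the same code that allocated the frames bounds every descriptor that references them.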
Authority Model
Device authority is split into three independent capabilities:
- `DMAPool`: authority to allocate, expose, and revoke device-visible memory within a kernel-owned physical range or IOMMU domain.
- `DeviceMmio`: authority to map and access one device’s register windows.
- `Interrupt`: authority to wait for and acknowledge one interrupt source.
Holding one of these capabilities never implies the others. A driver needs all three for a normal device, but the kernel and init can grant, revoke, and audit them separately.
DMAPool Invariants
DMAPool is the only future userspace-facing authority that may cause a
device-visible DMA address to exist.
- Authority: A holder may allocate buffers only from the pool object it was granted. It may not request arbitrary physical frames, import caller virtual memory by address, or derive another pool.
- Physical range: Every exported device address must resolve to pages owned by the pool. The kernel records the allowed host-physical page set and validates every descriptor mapping against that set before a device can use it. If an IOMMU domain backs the pool, the exported address is an IOVA, not raw host physical memory.
- Ownership: Each DMA buffer has one pool owner, one device-domain owner, and explicit CPU mappings. Sharing a buffer with another process requires a later typed memory-object transfer; copying packet data is the default until that object exists.
- No raw grants: Userspace never receives an unrestricted host-physical address. A driver may receive an opaque DMA handle or an IOVA meaningful only to its `DMAPool`/device domain. It cannot turn that value into access to unrelated RAM.
- Bounds: Buffer length, alignment, segment count, and queue depth are bounded by the pool. Descriptor chains that point outside an allocated buffer, wrap arithmetic, exceed device limits, or reference freed buffers fail closed before doorbell writes.
- Revocation: Revoking the pool first quiesces the device path using it, prevents new descriptors, waits for or cancels in-flight descriptors, then removes IOMMU mappings or invalidates bounce-buffer handles before freeing pages.
- Reset: If in-flight DMA cannot be proven stopped, revocation escalates to device reset through the owning device object before pages are reused.
- Residual state: Pages returned from a pool are zeroed or otherwise scrubbed before reuse by a different owner. Receive buffers are treated as device-written untrusted input until validated by the driver or stack.
For the in-kernel QEMU smoke, the kernel is the only DMAPool holder. The
same invariants apply internally even though no userspace capability object is
exposed yet.
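The physical-range and bounds invariants above can be illustrated as a single validation routine. `DmaPool`, its fields, and the constants are assumptions for this sketch, not the kernel's types.

```rust
use std::collections::BTreeSet;

// Hedged sketch: every page a descriptor touches must belong to the pool's
// recorded page set, and zero-length, oversized, or wrapping descriptors
// fail closed before any doorbell write.
const PAGE_SIZE: u64 = 4096;

struct DmaPool {
    owned_pages: BTreeSet<u64>, // page-aligned device-visible page addresses
    max_segment: u64,           // per-descriptor length bound from the pool
}

impl DmaPool {
    fn validate_descriptor(&self, addr: u64, len: u64) -> bool {
        if len == 0 || len > self.max_segment {
            return false; // zero-length and oversized segments fail closed
        }
        let end = match addr.checked_add(len - 1) {
            Some(end) => end, // reject wrapping descriptor arithmetic
            None => return false,
        };
        // Walk every page the descriptor touches and check pool ownership.
        let mut page = addr & !(PAGE_SIZE - 1);
        while page <= end {
            if !self.owned_pages.contains(&page) {
                return false; // descriptor escapes the pool's page set
            }
            match page.checked_add(PAGE_SIZE) {
                Some(next) => page = next,
                None => break, // last page of the address space already checked
            }
        }
        true
    }
}

fn main() {
    let pool = DmaPool {
        owned_pages: [0x0001_0000u64, 0x0001_1000].into_iter().collect(),
        max_segment: 8192,
    };
    assert!(pool.validate_descriptor(0x0001_0010, 100)); // inside owned pages
    assert!(!pool.validate_descriptor(0x0001_2000, 16)); // outside the pool
    assert!(!pool.validate_descriptor(u64::MAX - 8, 64)); // wrap fails closed
}
```

If an IOMMU domain backs the pool, the same check runs against IOVAs instead of host-physical pages; the shape of the validation does not change.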
DeviceMmio Invariants
DeviceMmio is register authority, not memory authority.
- Authority: A holder may map only BARs or subranges recorded in the claimed device object. It may not map PCI config space globally, another function’s BAR, RAM, ROM, or synthetic kernel pages.
- Physical range: Each mapping is bounded to the BAR’s decoded physical range, page-rounded by the kernel, and tagged as device memory with cache attributes appropriate for MMIO. Partial BAR grants must preserve page-level isolation; otherwise the grant must cover the whole page-aligned register window and be treated as that much authority.
- Ownership: At most one mutable driver owner controls a device function’s MMIO at a time. Management capabilities may inspect topology, but register writes require the claimed `DeviceMmio` object.
- No DMA implication: Mapping registers does not grant any DMA buffer, frame allocation, interrupt, or config-space authority. Doorbell writes are accepted only as effects of register access; descriptor validity is enforced by `DMAPool` before queues are made visible to the device.
- Revocation: Revocation unmaps the driver’s register pages, marks the device object unavailable for new calls, and invalidates outstanding MMIO handles. Stale mappings or calls fail closed.
- Reset: Revoking the final mutable `DeviceMmio` owner resets or disables the device unless a higher-level device manager explicitly transfers ownership without exposing it to an untrusted holder.
Interrupt Invariants
Interrupt is event authority for one routed source.
- Authority: A holder may wait for, mask/unmask where supported, and acknowledge only its assigned vector, line, or MSI/MSI-X table entry. It may not reprogram arbitrary interrupt controllers or claim another source.
- Ownership: Each interrupt source has one delivery owner at a time. Shared legacy lines must be represented as a kernel-demultiplexed object with explicit device membership, not as ambient access to the whole line.
- Range: The capability records the hardware source, vector, trigger mode, polarity, and target CPU/routing state. User-visible operations are checked against that record.
- Revocation: Revocation masks or detaches the source, drains pending notifications for the old holder, invalidates waiters, and prevents stale acknowledgements from affecting a new owner.
- Reset: If the source cannot be detached cleanly, the owning device is reset or disabled before the interrupt is reassigned.
- No MMIO or DMA implication: Interrupt delivery does not grant register access, DMA buffers, or packet memory.
Revocation Ordering
Device revocation must follow a fixed order:
1. Stop new submissions by invalidating the driver’s user-visible handles.
2. Revoke MMIO write authority by write-blocking or unmapping BAR pages, or by disabling the device before any DMA teardown starts.
3. Mask or detach interrupts.
4. Quiesce virtqueues or device command queues.
5. Reset or disable the device if in-flight DMA cannot be accounted for.
6. Remove IOMMU mappings or invalidate bounce-buffer handles.
7. Scrub and free DMA pages.
This order prevents a stale driver from racing revocation with doorbell writes, interrupt acknowledgement, or descriptor reuse. Logical handle invalidation is not sufficient while a BAR remains mapped; register-write authority must be removed or the device must be disabled before descriptor or DMA-buffer ownership is reclaimed.
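The fixed revocation order above can be encoded so that teardown code can assert it rather than rely on convention. This encoding is illustrative; the kernel's implementation may sequence the steps differently in code.

```rust
// Hedged sketch: the revocation sequence as an explicit, assertable order.
// Step names mirror the list above; none of these are real kernel types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RevokeStep {
    InvalidateHandles,  // stop new submissions
    RevokeMmio,         // remove register-write authority
    MaskInterrupts,     // mask or detach interrupt sources
    QuiesceQueues,      // quiesce virtqueues / command queues
    ResetIfUnaccounted, // reset if in-flight DMA cannot be accounted for
    RemoveMappings,     // remove IOMMU mappings / invalidate bounce handles
    ScrubAndFree,       // scrub and free DMA pages
}

const REVOCATION_ORDER: [RevokeStep; 7] = [
    RevokeStep::InvalidateHandles,
    RevokeStep::RevokeMmio,
    RevokeStep::MaskInterrupts,
    RevokeStep::QuiesceQueues,
    RevokeStep::ResetIfUnaccounted,
    RevokeStep::RemoveMappings,
    RevokeStep::ScrubAndFree,
];

fn position(step: RevokeStep) -> usize {
    REVOCATION_ORDER.iter().position(|s| *s == step).unwrap()
}

fn main() {
    // Register-write authority must be gone before DMA pages are reclaimed.
    assert!(position(RevokeStep::RevokeMmio) < position(RevokeStep::RemoveMappings));
    // Mappings must be removed before pages are scrubbed and freed.
    assert!(position(RevokeStep::RemoveMappings) < position(RevokeStep::ScrubAndFree));
}
```

Making the order a first-class value gives hostile-smoke tests (S.11.2.7) something concrete to check when they race revocation against doorbell writes.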
Future Userspace-Driver Transition Criteria
Moving NIC or block drivers out of the kernel is gated by S.11.2. The gate is only open when all rows below are implemented and demonstrated.
| Gate item | Required state | Must-have proof |
|---|---|---|
| S.11.2.0 DMA-objected buffers | DMAPool owns every driver-visible DMA mapping. | A driver receives opaque buffer handles or IOVA-only values; no path hands out raw host physical addresses. |
| S.11.2.1 Bounds checks | Allocation, descriptor chain length, alignment, segment length, and ring depth are bounded and constant-time validated before ring submission. | Ring submissions fail closed on overflow, wrap, stale-handle, and freed-handle reuse attempts. |
| S.11.2.2 Explicit remap/ownership | DeviceMmio can only grant claimed BAR pages; cache attributes and write policy are enforced. | Driver cannot access unclaimed BARs, ROM, RAM pages, config-space globals, or stale mappings after revoke. |
| S.11.2.3 Interrupt correctness | Interrupt owns exactly one logical source at a time and drains/waits only for that source. | Reassigning an owner invalidates old waiters and masks or detaches the source first. |
| S.11.2.4 Quiesce + reset contract | Device manager can force reset/disable on failed revoke or teardown. | No in-flight descriptor may continue touching freed buffers after driver removal. |
| S.11.2.5 Process lifecycle | Capability release, process exit, and process-spawn cleanup paths cannot leak DMA pages/MMIO/intr ownership. | Crash-path teardown removes holds and invalidates user-visible handles before page free. |
| S.11.2.6 Isolation and accounting | S.9 quota and authority ledgers include DMA, MMIO, and interrupt hold edges. | A malicious or buggy driver cannot consume more than its allocated authority budget. |
| S.11.2.7 Hostile-smoke coverage | QEMU/CI smokes cover stale handles, descriptor abuse, revoke races, and exit-under-dma. | Smoke output has explicit closed-case proof lines for each failure mode above. |
For each row, the transition requires an owner, implementation notes, and a CI-backed
verification path. Until all rows pass, Phase 4.2 NIC/block drivers remain in-kernel for
functionality, and only kernel-mapped bounce-buffer mode is allowed for prototype DMA.
S.11.2 Decision Record
S.11.2 is not complete until the kernel has a dedicated device manager object model
that can produce, transfer, and revoke DMAPool, DeviceMmio, and Interrupt in a
single ownership transaction for a driver process.
Current status: transition remains blocked pending implementation of the conditions above.
S.9 Design: Authority Graph and Resource Accounting for Transfer
This document defines the concrete S.9 design gate for:
- WORKPLAN 3.6 capability transfer (`xfer_cap_count`, copy/move, rollback)
- WORKPLAN 5.2 ProcessSpawner prerequisites (spawn quotas and result-cap insertion)
S.9 is complete when this design contract is concrete enough to guide implementation. The invariants and acceptance criteria below are implementation gates for later work in 3.6/5.2/S.8/S.12, not requirements for declaring the S.9 design artifact complete.
1. Authority Graph Model
Authority is modeled as a directed multigraph:
- Nodes:
  - `Process(Pid)`
  - `Object(ObjectId)` (kernel object identity, independent of per-process `CapId`)
- Edges:
  - `Hold(Pid -> ObjectId)` with metadata: `cap_id` (table-local handle), `interface_id`, `badge`, `transfer_mode` (`copy`, `move`, `non_transferable`), `origin` (`kernel`, `spawn_grant`, `ipc_transfer`, `result_cap`)
Security invariant A1: all authority is represented by Hold edges; no
operation can create object authority outside this graph.
Security invariant A2: each process mutates only its own CapTable edges except
through explicit transfer/spawn transactions validated by the kernel.
Security invariant A3: for every live Hold edge there is exactly one
cap_id slot in one process table referencing the object generation.
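Under the stated assumptions (illustrative types, not the kernel's), a Hold edge with the metadata listed above might look like:

```rust
// Hypothetical encoding of the S.9 authority multigraph edge. Processes
// and objects are nodes; all authority is a Hold edge (invariant A1).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TransferMode { Copy, Move, NonTransferable }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Origin { Kernel, SpawnGrant, IpcTransfer, ResultCap }

#[derive(Debug)]
struct Hold {
    pid: u64,          // owning process (Process node)
    object: u64,       // kernel object identity (Object node)
    cap_id: u32,       // table-local handle; invariant A3: exactly one
                       // slot in one process table per live edge
    interface_id: u64,
    badge: u64,
    transfer_mode: TransferMode,
    origin: Origin,
}
```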
2. Per-Process Resource Ledger and Quotas
Each process owns a kernel-maintained ResourceLedger with hard limits.
Enforcement is fail-closed at reservation time (before side effects).
```
ResourceLedger {
    cap_slots_used / cap_slots_max
    endpoint_queue_used / endpoint_queue_max
    outstanding_calls_used / outstanding_calls_max
    scratch_bytes_used / scratch_bytes_max
    frame_grant_pages_used / frame_grant_pages_max
    log_bytes_window_used / log_bytes_per_window (token bucket)
    cpu_time_us_window_used / cpu_budget_us_per_window (token bucket)
}
```
Initial quota profile for Stage 6/5.2 bring-up (tunable by kernel config):
- `cap_slots_max`: 256
- `endpoint_queue_max`: 128 messages
- `outstanding_calls_max`: 64
- `scratch_bytes_max`: 256 KiB
- `frame_grant_pages_max`: 4096 pages (16 MiB at 4 KiB pages)
- `log_bytes_per_window`: 64 KiB/sec with 256 KiB burst
- `cpu_budget_us_per_window`: 10,000 us per 100,000 us window
Security invariant Q1: no counter may exceed its max.
Security invariant Q2: every resource reservation has a matched release on all success, error, timeout, process-exit, and rollback paths.
Security invariant Q3: quota checks for transfer/spawn happen before mutating sender or receiver capability state.
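A minimal sketch of the fail-closed reservation discipline for one ledger counter; names are illustrative, not the kernel's:

```rust
// Sketch of fail-closed reservation: the check-and-reserve happens
// before any side effect, and every reserve has a matching release
// (invariant Q2). Overflow-checked so a huge request cannot wrap.
#[derive(Default)]
struct Counter {
    used: u64,
    max: u64,
}

impl Counter {
    /// Q1: `used` never exceeds `max`; on failure nothing is mutated.
    fn reserve(&mut self, n: u64) -> Result<(), ()> {
        match self.used.checked_add(n) {
            Some(next) if next <= self.max => {
                self.used = next;
                Ok(())
            }
            _ => Err(()), // fail closed, no partial mutation
        }
    }

    /// Matched release on success, error, timeout, exit, and rollback.
    fn release(&mut self, n: u64) {
        self.used = self.used.saturating_sub(n);
    }
}
```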
3. Diagnostic Rate Limiting and Aggregation
Repeated invalid ring/cap submissions are aggregated per process and error key.
- Key: `(pid, error_code, opcode, cap_id_bucket)`
- Buckets: `cap_id_bucket = exact cap id` for stale/invalid cap failures; `cap_id_bucket = 0` for structural ring errors
- Per-key token bucket: allow first `N=4` emissions/sec, then suppress.
- Suppressed counts are flushed once per second as one summary line: `pid=X invalid submissions suppressed=Y last_err=...`
Security invariant D1: invalid submission floods cannot consume unbounded serial bandwidth or scheduler time in log formatting.
Security invariant D2: suppression never hides first-observation diagnostics for
a new (pid,error,opcode,cap bucket) key.
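The per-key bucket and once-per-second suppression flush can be sketched as follows. This is a host-side model, not the kernel implementation; `N=4` comes from the policy above:

```rust
// Sketch of per-key diagnostic rate limiting: the first N=4 emissions
// per window pass through, later ones are counted and flushed as one
// summary per window (invariants D1/D2).
use std::collections::HashMap;

#[derive(Hash, PartialEq, Eq, Clone, Copy, Debug)]
struct ErrKey {
    pid: u32,
    error_code: u16,
    opcode: u8,
    cap_id_bucket: u32,
}

struct RateLimiter {
    tokens: HashMap<ErrKey, u32>,
    suppressed: HashMap<ErrKey, u64>,
}

impl RateLimiter {
    fn new() -> Self {
        Self { tokens: HashMap::new(), suppressed: HashMap::new() }
    }

    /// Returns true if this diagnostic should be emitted now.
    /// D2: a new key always gets its first observations through.
    fn allow(&mut self, key: ErrKey) -> bool {
        let t = self.tokens.entry(key).or_insert(4); // N=4 per window
        if *t > 0 {
            *t -= 1;
            true
        } else {
            *self.suppressed.entry(key).or_insert(0) += 1;
            false
        }
    }

    /// Called once per second: refill tokens, drain suppressed counts
    /// so they can be printed as one summary line per key.
    fn flush_window(&mut self) -> Vec<(ErrKey, u64)> {
        self.tokens.clear();
        self.suppressed.drain().collect()
    }
}
```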
4. Transfer and Rollback Semantics
Transfers (xfer_cap_count > 0) use a kernel transfer transaction
(TransferTxn) scoped to a single SQE dispatch. The current ring ABI does not
provide kernel-owned SQE sequence numbers or a durable transaction table, so
userspace replay of a copy-transfer SQE is repeatable: each replay is treated
as a new copy grant. Move-transfer replay fails closed after the source slot is
removed or reserved by the first successful dispatch.
Future exactly-once replay suppression requires transaction identity scoped to
(sender_pid, call_id, sqe_seq) and a monotonic transfer epoch. Until that
exists, exactly-once claims apply only within one dispatch attempt, not across
malicious rewrites of shared SQ ring indexes.
Phases:
1. Prepare:
   - validate SQE transport fields and `xfer_cap_count`
   - validate sender ownership/generation/transferability for each exported cap
   - reserve receiver quota (`cap_slots`, `outstanding_calls`, scratch if needed)
   - pin sender entries in txn state (no sender table mutation yet)
2. Commit:
   - insert destination edges exactly once
   - for `copy`: increment object refcount/export ref
   - for `move`: remove sender slot only after destination insertion succeeds
   - publish completion/result
3. Finalize:
   - release transient reservations
   - mark txn terminal (`committed` or `aborted`)
On any error before Commit, rollback is full:
- receiver inserts are not visible
- sender slots/refcounts unchanged
- reservations released
- CQE returns transfer failure (`CAP_ERR_TRANSFER_ABORTED` / subtype)
On error during Commit, kernel executes compensating rollback to preserve
exactly-once visibility: either all inserts are visible with matching sender
state transition, or none are visible.
Security invariant T1: each transfer descriptor is applied at most once within a single SQE dispatch attempt.
Security invariant T2: move transfer is atomic from observer perspective; no state exists where both sender and receiver lose authority due to partial apply.
Security invariant T3: copy-transfer SQE replay is explicitly repeatable until kernel-owned transaction identity exists. Move-transfer replay fails closed after source removal or source reservation.
Security invariant T4: CAP_OP_RELEASE removes one local hold edge only from
the caller table and decrements remote export refs exactly once.
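A toy model of the move-transfer ordering, using plain vectors as stand-ins for CapTables; it demonstrates the destination-insert-before-source-removal rule (T2), not the real kernel structures:

```rust
// Illustrative model of the prepare/commit shape: all validation and
// quota reservation happens before any mutation, and on move the
// destination edge is inserted before the source slot is removed, so a
// failure before commit leaves both tables unchanged.
#[derive(Debug, PartialEq)]
enum TxnState { Preparing, Committed, Aborted }

struct TransferTxn {
    state: TxnState,
}

fn move_transfer(
    sender: &mut Vec<u64>,   // stand-in for sender CapTable slots
    receiver: &mut Vec<u64>, // stand-in for receiver CapTable slots
    object: u64,
    receiver_max: usize,     // receiver cap_slots quota
) -> Result<TransferTxn, ()> {
    // Prepare: validate sender ownership, reserve receiver quota.
    let src = sender.iter().position(|&o| o == object).ok_or(())?;
    if receiver.len() >= receiver_max {
        return Err(()); // fail closed: no state touched yet
    }
    // Commit: destination insert first, then source removal.
    receiver.push(object);
    sender.remove(src);
    Ok(TransferTxn { state: TxnState::Committed })
}
```

Note how the quota-exhausted path returns before either table is mutated, matching the "full rollback before Commit" rule.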
5. Integration with 3.6 Capability Transfer
3.6 implementation must consume this design directly:
- `CALL` and `RETURN` validate all currently-reserved transfer fields fail-closed when unsupported.
- The `xfer_cap_count` path is wired through `TransferTxn` (no ad hoc direct inserts).
- Badge propagation is explicit in transfer descriptors and copied into destination edge metadata.
- `CAP_OP_RELEASE` uses the same authority ledger and refcount bookkeeping.
3.6 acceptance criteria:
- Copy transfer produces one new receiver edge and retains sender edge.
- Move transfer produces one new receiver edge and deletes sender edge atomically.
- Any transfer failure leaves sender and receiver `CapTable`s unchanged.
- Copy replay is an explicit repeatable-grant policy until a kernel-owned transaction identity is added; move replay fails closed after source removal or reservation.
- `CAP_OP_RELEASE` on a stale/non-owned cap fails closed without mutating other process tables.
6. Integration with 5.2 ProcessSpawner Prerequisites
5.2 must use the same accounting and transfer machinery:
- `spawn()` preflights child quotas (`cap_slots`, `outstanding_calls`, `scratch`, `frame_grant_pages`, endpoint queue baseline) before mapping child memory or scheduling.
- Parent-provided `CapGrant` entries are inserted via the same transfer transaction semantics (copy for initial grants in 5.2.2).
- Returned `ProcessHandle` is inserted through the standard result-cap insertion path and accounted as a normal cap slot.
- Child setup rollback must unwind:
  - address space mappings
  - ring page
  - CapSet page
  - kernel stack
  - allocated frames
  - provisional capability edges/reservations
5.2 acceptance criteria:
- Spawn failure at any step leaves no child-visible process and no leaked ledger usage.
- Successful spawn accounts all child bootstrap resources within quotas.
- Parent and child cap-table accounting remains balanced under repeated spawn/exit cycles.
- `ProcessHandle.wait` and exit cleanup release outstanding-call/scratch/frame usage deterministically.
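A sketch of the preflight-then-rollback shape, with frames standing in for the full ledger; names and quotas are illustrative:

```rust
// Illustrative spawn preflight: all child quotas are reserved up front,
// before any child-visible state is created. Any later setup failure
// unwinds through one rollback path so nothing leaks.
struct ChildQuotas {
    cap_slots: u64,
    scratch_bytes: u64,
    frame_pages: u64,
}

fn spawn_preflight(
    free_frames: &mut u64,
    q: &ChildQuotas,
) -> Result<(), &'static str> {
    // Reserve frames before mapping child memory or scheduling; a real
    // implementation would also reserve cap slots, scratch, and the
    // endpoint queue baseline here.
    if q.frame_pages > *free_frames {
        return Err("quota preflight failed; no child-visible state created");
    }
    *free_frames -= q.frame_pages;
    // ... map address space, ring page, CapSet page, kernel stack ...
    // On any failure after this point, rollback must return the frames:
    //     *free_frames += q.frame_pages;
    Ok(())
}
```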
7. Implementation Notes for Verification Tracks
This design unblocks:
- S.8 hostile-input tests for quota and invalid-transfer failures.
- S.12 Kani bounds refresh for ledger and transfer invariants.
- Target 12 in `docs/proposals/security-and-verification-proposal.md` with explicit allocator hooks and fail-closed exhaustion behavior.
Proposal Index
This page classifies proposal documents by current role so readers do not confuse implemented behavior, active design direction, future architecture, and rejected alternatives.
Active or Near-Term
| Proposal | Status | Purpose |
|---|---|---|
| Service Architecture | Partially implemented | Defines authority-at-spawn, service composition, exported capabilities, and the init-owned service graph direction. |
| Storage and Naming | Accepted design | Defines capability-native storage, namespaces, boot-package structure, and future persistence instead of a global filesystem. |
| Error Handling | Partially implemented | Defines the two-level transport/application error model and the current CQE transport error namespace. |
| Security and Verification | Partially implemented | Defines the security review vocabulary, trust-boundary checklist, and practical verification tracks used by capOS. |
| mdBook Documentation Site | Partially implemented | Defines the documentation site structure, status vocabulary, and curation rules for architecture, proposal, security, and research pages. |
Future Architecture
| Proposal | Status | Purpose |
|---|---|---|
| Networking | Partially implemented | Plans the in-kernel QEMU virtio-net smoke and the later userspace NIC, network stack, and socket capability architecture. |
| SMP | Future design | Defines the future multi-core scheduler, per-CPU state, AP startup, and TLB shootdown direction. |
| Userspace Binaries | Partially implemented | Describes native userspace binaries, capos-rt, language support, POSIX compatibility, and runtime authority handling. |
| Go Runtime | Future design | Plans a custom GOOS=capos path, runtime services, memory growth, TLS, scheduling, and network integration for Go. |
| Shell | Future design | Describes native, agent-oriented, and POSIX shell models over explicit capabilities instead of ambient paths. |
| Boot to Shell | Queued future milestone | Defines text-only console and web-terminal login/setup, password-verifier and passkey authentication, and the authenticated native-shell launch path, gated on credible manifest-execution, terminal-input, native-shell, session, broker, audit, and credential-storage prerequisites. |
| System Monitoring | Future design | Defines capability-scoped logs, metrics, health, traces, crash records, and audit/status views. |
| User Identity and Policy | Future design | Defines users, sessions, guest profiles, and policy layers for RBAC, ABAC, and MAC over capability grants. |
| Cloud Metadata | Future design | Describes cloud instance bootstrap through metadata/config-drive capabilities and manifest deltas. |
| Cloud Deployment | Future design | Plans hardware abstraction, cloud VM support, storage/network boot dependencies, and later aarch64 deployment work. |
| Live Upgrade | Future design | Defines service replacement without dropping capabilities or in-flight calls through retargeting and quiesce/resume protocols. |
| GPU Capability | Future design | Sketches capability-oriented GPU, CUDA, memory, and driver isolation models. |
| Formal MAC/MIC | Future design | Defines a formal mandatory-access and mandatory-integrity model plus future proof obligations. |
| Browser/WASM | Future design | Explores running capOS concepts in a browser using WebAssembly and worker-per-process isolation. |
Rejected or Superseded
| Proposal | Status | Purpose |
|---|---|---|
| Cap’n Proto SQE Envelope | Rejected | Records why ring SQEs stay fixed-layout transport records instead of becoming Cap’n Proto messages themselves. |
| Sleep(INF) Process Termination | Rejected | Records why infinite sleep should not replace explicit process termination, while preserving typed status and future sys_exit removal as separate lifecycle work. |
Maintenance
When a proposal becomes implemented, rejected, or stale, update this index in the same change that changes the proposal or corresponding implementation. Long proposal files may describe target behavior; this index is the first status checkpoint before a reader opens those documents.
Proposal: Capability-Based Service Architecture
How capOS processes receive authority, compose into services, and expose layered capabilities — without a service manager daemon.
Problem
Traditional OSes grant processes ambient authority (file system, network, IPC namespaces) and then restrict it via sandboxing (seccomp, namespaces, AppArmor). Service managers like systemd handle dependencies, lifecycle, and resource limits through a central daemon with a massive configuration surface.
capOS inverts this: processes start with zero authority and receive only the capabilities they need. The capability graph implicitly encodes service dependencies, resource limits, and access control. No central daemon required.
Process Startup Model
A process receives its entire authority as a set of named capabilities at spawn time. There is no ambient authority to fall back on — if a capability wasn’t granted, the operation is impossible.
The child process sees its granted capabilities by name. It cannot discover or request capabilities it wasn’t given.
Capability Layering
Each process consumes lower-level capabilities and exports higher-level ones. Authority narrows at every layer:
```
Kernel
  │
  ├─ Nic cap (raw frame send/receive for one device)
  ├─ Timer cap (monotonic clock)
  ├─ DeviceMmio cap (one device's BAR regions)
  └─ Interrupt cap (one IRQ line)
       │
       v
NIC Driver Process
  │
  └─ Nic cap ──> Network Stack Process
                   │
                   ├─ TcpSocket cap (one connection)
                   ├─ UdpSocket cap (one socket)
                   └─ NetworkManager cap (create sockets)
                        │
                        v
                 HTTP Service Process
                   │
                   ├─ Fetch cap (any URL)
                   │    │
                   │    v
                   │  Trusted Process (holds Fetch, mints scoped caps)
                   │
                   └─ HttpEndpoint cap (one origin)
                        │
                        v
                 Application Process
```
The application at the bottom holds an HttpEndpoint cap scoped to a single
origin. It cannot make raw TCP connections, send arbitrary packets, or touch
any device. The capability is the security policy.
HTTP Capabilities
Two levels of HTTP capability: Fetch (general) and HttpEndpoint (scoped).
HttpEndpoint is implemented by a process that holds a Fetch cap and
restricts it.
Fetch
Unrestricted HTTP access — equivalent to the browser Fetch API. The holder can make requests to any URL. This is the base capability that HTTP service processes use internally.
```capnp
interface Fetch {
  # General-purpose HTTP request to any URL.
  request @0 (url :Text, method :Text, headers :List(Header), body :Data)
      -> (status :UInt16, headers :List(Header), body :Data);
}

struct Header {
  name @0 :Text;
  value @1 :Text;
}
```
Fetch is powerful — granting it is roughly equivalent to granting arbitrary
outbound network access. It should only be held by service processes that need
to make requests on behalf of others, not by application code directly.
HttpEndpoint
A restricted view of Fetch, scoped to a single origin. The holder can only
make requests within the bounds encoded in the capability.
```capnp
interface HttpEndpoint {
  # Request scoped to this endpoint's origin.
  # Path is relative (e.g., "/v1/users").
  request @0 (method :Text, path :Text, headers :List(Header), body :Data)
      -> (status :UInt16, headers :List(Header), body :Data);
}
```
Note: same request() signature as Fetch, but path instead of url.
The origin is implicit — bound into the capability at mint time.
Attenuation
A process holding Fetch mints HttpEndpoint caps by narrowing authority.
The core restriction is always origin — Fetch can reach any URL,
HttpEndpoint is locked to one host. Additional constraints (path prefixes,
method restrictions, rate limits) are possible but are userspace policy
details, not OS-level concerns.
This is the standard object-capability attenuation pattern: same interface,
less authority. The application code is identical whether it holds a broad or
narrow HttpEndpoint.
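A userspace-flavored sketch of the pattern, assuming hypothetical trait and struct names (the real capabilities are Cap'n Proto interfaces invoked over the ring, not local traits):

```rust
// Hypothetical sketch of attenuation: HttpEndpoint is a proxy that
// holds a Fetch capability and prepends its fixed origin. The holder
// of the endpoint never sees the broad Fetch authority.
trait Fetch {
    /// Simplified request shape for illustration; returns a status.
    fn request(&self, url: &str, method: &str) -> u16;
}

struct HttpEndpoint<F: Fetch> {
    inner: F,       // the broad capability, hidden from the holder
    origin: String, // bound at mint time; the holder cannot change it
}

impl<F: Fetch> HttpEndpoint<F> {
    /// Same request shape, but path instead of url: origin is implicit.
    fn request(&self, method: &str, path: &str) -> u16 {
        self.inner.request(&format!("{}{}", self.origin, path), method)
    }
}
```

Same interface, less authority: the proxy cannot be coaxed into reaching any URL outside the origin it was minted with.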
Boot and Initialization Sequence
The kernel doesn’t know about services. It boots, creates a handful of
kernel-provided caps, and spawns exactly one process: init. Everything else
is init’s responsibility.
Current State vs Target State
The implementation is in transition. The default system.cue path still lets
the kernel spawn every service listed in the manifest and wire cross-service
caps through kernel/src/cap/mod.rs::create_all_service_caps. The
system-spawn.cue path now sets config.initExecutesManifest; in that mode
the kernel validates the full manifest, boots only the first init service,
grants init BootPackage and ProcessSpawner, and lets init resolve the
remaining ServiceEntry graph through ProcessSpawner.
The target model removes the kernel-side service graph entirely. The manifest stops being a kernel authority graph and becomes a boot package delivered to init:
- List of embedded binaries (init needs them before any storage service exists; they can’t be fetched from a filesystem that hasn’t started).
- Init’s config blob (CUE-encoded tree; what to spawn, with what attenuations, with what restart policy).
- Kernel boot parameters (memory limits, feature flags) consumed by the kernel itself, not forwarded to init.
The kernel spawns exactly one userspace process (init) with a fixed cap bundle:
- `Console` — kernel serial wrapper (may be replaced later by a userspace log service, with init retaining a direct console cap for emergency use).
- `ProcessSpawner` — only init and its delegated supervisors hold this.
- `FrameAllocator` — physical frame authority for init’s own allocations.
- `VirtualMemory` — per-process address-space authority for init.
- `DeviceManager` — enumerate/claim devices; init delegates device-specific slices to drivers.
- `Timer` — monotonic clock.
- `BootPackage` — read-only cap exposing the embedded binaries and the config blob.
Everything else — drivers, net-stack, filesystems, supervisors, apps —
init spawns at runtime via ProcessSpawner with appropriate attenuation.
No manifest ServiceEntry, no cross-service CapRef, no manifest exports.
Pre-Init Boundary After Stage 6
Rule of thumb: no userspace service runs before init. The kernel’s job is primitive cap synthesis and a single-process handoff; init’s job is the whole service graph. Concretely, after Stage 6:
- Stays in kernel pre-init: memory map ingest, frame allocator, heap, paging, GDT/IDT/TSS, serial for kernel diagnostics, scheduler, ring dispatch, kernel-cap `CapObject` impls, ELF loading for init, boot package measurement (if attested boot is added).
- Stays in manifest: binaries list + init config blob + kernel boot params. Schema-wise, `ServiceEntry` and `CapSource::Service` disappear; `SystemManifest` shrinks to `binaries + initConfig + kernelParams`.
- Moves to init: service topology, cross-service cap wiring, attenuation, restart policies, dynamic spawn, cap export/import, supervision trees. Anything a service manager would do.
- Moves to init or later services: logging policy, config store, secrets, filesystem mounts, network configuration, device binding.
Edge cases that might look like they want a pre-init service but don’t:
- Early crash / panic handling. Kernel-side panic handler, no service needed.
- Recovery shell. Kernel fallback: if init fails to reach a healthy state within a timeout (e.g. exits immediately, or never issues a liveness SQE), kernel optionally spawns a “recovery” binary from the boot package with the same cap bundle. Still just one userspace process at a time pre-supervisor-loop.
- Attested/measured boot. Kernel hashes binaries in the boot package before handing `BootPackage` to init. The measurement agent, if any, runs as a normal service spawned by init with a cap to the sealed measurements.
- Early-boot console. Kernel owns serial and exposes `Console` to init. A userspace log service can layer on top later; it is not pre-init.
Legacy Manifest Fields After Stage 6
ServiceEntry.caps, CapSource::Service, and ServiceEntry.exports are
transitional. ProcessSpawner and the generic init-side spawn loop are now in
place for system-spawn.cue; the remaining cleanup is to remove these fields
from the kernel bootstrap contract:
- Delete `ServiceEntry` and `CapSource::Service` from `schema/capos.capnp`.
- Collapse `SystemManifest.services` into `initConfig: CueValue`.
- Remove `create_all_service_caps`, the two-pass resolver, and the manifest authority-graph validator (`validate_manifest_graph`).
- Kernel spawns one process from `initConfig.initBinary` with the fixed cap bundle described above plus `BootPackage`.
The re-export restriction added in capos-config::validate_manifest_graph
(service A exports cap sourced from B.ep) becomes moot at that point
because there are no manifest exports at all. It stays as defensive
validation while the transitional schema exists.
Init Binary Embedding
Init is part of the kernel’s bootstrap contract, not a configuration
choice: the cap bundle handed to init is a kernel ABI, the _start(ring, pid, …) entry shape is a kernel ABI, and a version-mismatched init is a
footgun with no payoff in a single-init research OS. So the init ELF
ships inside the kernel binary via include_bytes!, not as a separate
manifest entry or Limine module.
Shape:
- `init/` stays a standalone crate with its own linker script and code model (user-space base `0x200000`, `static` relocation model, 4 KiB alignment). Not a workspace member; different build flags than the kernel.
- `kernel/build.rs` drives `init/`’s build (or depends on the prebuilt artifact at a known path) and emits an `include_bytes!("…")` into a `kernel::boot::INIT_ELF: &[u8]` static.
- Kernel bootstrap parses `INIT_ELF` through the same `capos_lib::elf` path used for service binaries, creates the init address space via `AddressSpace::new_user()`, loads segments, populates the cap bundle (including `BootPackage`), and jumps. No Limine module lookup for init.
- `SystemManifest.binaries` stops containing an “init” entry. Its `binaries` list is services-only. `BootPackage` exposes only what init hands out to children.
- Measured-boot attestation (if added) covers the kernel ELF, which transitively covers init’s bytes. Service binaries are hashed separately by the kernel before handing `BootPackage` to init.
What this does not change:
- Init still runs in Ring 3 with its own page tables; embedding is byte packaging, not privilege merging.
- Init is still ELF-parsed at boot — the same loader and W^X enforcement apply. The only thing different is where the bytes came from.
- Service binaries (everything spawned after init) stay in the boot package as distinct blobs, exposed to init via `BootPackage`. They are not linked into the kernel; their lifecycle is independent of the kernel’s.
What option was rejected: fully linking init into the kernel crate (shared
compilation unit, shared text). That collapses the kernel/user build
boundary, couples linker scripts and code models, and puts init’s
panics/UB inside the kernel’s compilation context. The process-isolation
boundary survives that arrangement — but the build-time separation that
makes the boundary trustworthy does not. include_bytes! preserves the
separation; static linking destroys it.
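Under the stated assumptions (hypothetical paths and env-variable names, not the repo's actual ones), the embedding shape could look like:

```rust
// kernel/build.rs — sketch only. The real script may instead invoke
// init/'s build directly before emitting these directives. The path
// and INIT_ELF_PATH name are illustrative.
fn build_directives(init_elf: &str) -> Vec<String> {
    vec![
        // Rebuild the kernel when the prebuilt init ELF changes.
        format!("cargo:rerun-if-changed={init_elf}"),
        // Expose the path so kernel source can include_bytes! it.
        format!("cargo:rustc-env=INIT_ELF_PATH={init_elf}"),
    ]
}

fn main() {
    for d in build_directives("../init/target/init.elf") {
        println!("{d}");
    }
}
```

The kernel side would then consume the artifact with something like `pub static INIT_ELF: &[u8] = include_bytes!(env!("INIT_ELF_PATH"));`, keeping the byte packaging in one place while preserving the separate build.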
```
Kernel boot
  │
  ├─ Create kernel caps: Console, Timer, DeviceManager, ProcessSpawner
  │
  └─ Spawn init with all kernel caps
       │
     init process (PID 1)
       │
       ├─ Phase 1: Core services (sequential — each depends on previous)
       │   ├─ DeviceManager.enumerate() → list of devices
       │   ├─ Spawn NIC driver with device-specific caps
       │   ├─ Wait for NIC driver to export Nic cap
       │   ├─ Spawn net-stack with Nic + Timer caps
       │   └─ Wait for net-stack to export NetworkManager cap
       │
       ├─ Phase 2: Higher-level services (can be parallel)
       │   ├─ Spawn http-service with TcpSocket cap from net-stack
       │   ├─ Spawn dns-resolver with UdpSocket cap
       │   └─ ...
       │
       └─ Phase 3: Applications
           ├─ Spawn app-a with HttpEndpoint("api.example.com")
           ├─ Spawn app-b with Fetch cap (trusted)
           └─ ...
```
The Init Process in Detail
Init is a regular userspace process with privileged caps. It is the only
process that holds ProcessSpawner (the right to create new processes) and
DeviceManager (the right to enumerate and claim devices). It can delegate
subsets of these to child supervisors.
```rust
// init/src/main.rs — this IS the system configuration
fn main(caps: CapSet) {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let devices = caps.get::<DeviceManager>("devices");
    let timer = caps.get::<Timer>("timer");
    let console = caps.get::<Console>("console");

    // === Phase 1: Hardware drivers ===

    // Find the NIC
    let nic_device = devices.find("virtio-net")
        .expect("no network device found");

    // Spawn NIC driver — gets ONLY its device's MMIO + IRQ
    let nic_driver = spawner.spawn(SpawnRequest {
        binary: "/sbin/virtio-net",
        caps: caps![
            "device_mmio" => nic_device.mmio(),
            "interrupt" => nic_device.interrupt(),
            "log" => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    // The driver exports a Nic cap once initialized
    let nic: Cap<Nic> = nic_driver.exported("nic").wait();

    // === Phase 2: Network stack ===
    let net_stack = spawner.spawn(SpawnRequest {
        binary: "/sbin/net-stack",
        caps: caps![
            "nic" => nic,
            "timer" => timer.clone(),
            "log" => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });
    let net_mgr: Cap<NetworkManager> = net_stack.exported("net").wait();

    // === Phase 3: HTTP service ===
    let tcp = net_mgr.create_tcp_pool();
    let http_service = spawner.spawn(SpawnRequest {
        binary: "/sbin/http-service",
        caps: caps![
            "tcp" => tcp,
            "log" => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });
    let fetch: Cap<Fetch> = http_service.exported("fetch").wait();

    // === Phase 4: Applications ===

    // Trusted telemetry agent — gets full Fetch
    spawner.spawn(SpawnRequest {
        binary: "/sbin/telemetry",
        caps: caps![
            "fetch" => fetch.clone(),
            "log" => console.clone(),
        ],
        restart: RestartPolicy::OnFailure,
    });

    // Sandboxed app — gets scoped HttpEndpoint
    let api_cap = fetch.attenuate(EndpointPolicy {
        origin: "https://api.example.com",
        paths: Some("/v1/users/*"),
        methods: Some(&["GET", "POST"]),
    });
    spawner.spawn(SpawnRequest {
        binary: "/app/my-service",
        caps: caps![
            "api" => api_cap,
            "log" => console.clone(),
        ],
        restart: RestartPolicy::OnFailure,
    });

    // Init stays alive as the root supervisor
    supervisor_loop(&spawner);
}
```
Key Mechanisms
Cap export. A spawned process can export capabilities back to its parent
via the ProcessHandle (see Spawn Mechanism section). This is how the NIC
driver makes its Nic cap available to the network stack — init spawns the
driver, waits for it to export "nic", then passes that cap to the next
process.
Restart policy. Encoded in SpawnRequest, enforced by the supervisor
loop in the spawning process. When a child exits unexpectedly:
- Old caps held by the child are automatically revoked (kernel invalidates the process’s cap table on exit)
- Supervisor re-spawns with the same
SpawnRequest - New instance gets fresh caps — same authority, new identity
Dependency ordering. Sequential in code: wait() on exported caps
blocks until the dependency is ready. No declarative dependency graph
needed — Rust’s control flow is the dependency graph.
Service Taxonomy
Concrete categories of userspace services capOS expects to run. All spawned by init (or a supervisor init delegates to) after Stage 6. None are pre-init.
Hardware Drivers
One process per managed device. Each holds exactly the caps for its own
hardware: a DeviceMmio slice, the corresponding Interrupt cap, and
optionally a DmaRegion cap carved out of the frame allocator. Exports a
typed device cap (Nic, BlockDevice, Framebuffer, Gpu, …). Examples:
virtio-net, virtio-blk, NVMe, AHCI, framebuffer/GPU.
Platform Services
- Logger / journal — accepts `Log` cap writes, forwards to console and/or durable storage. Init and kernel bootstrap use a direct `Console` cap until the logger is up; afterwards new services get `Log` caps only.
- Filesystem — one per mounted volume. Consumes a `BlockDevice` cap, exports `Directory`/`File` caps. FAT, ext4, overlay, tmpfs.
- Store — capability-native content-addressed storage backing persistent capability state (storage-and-naming-proposal.md).
- Network stack — userspace TCP/IP (networking-proposal.md). Consumes `Nic` + `Timer`, exports `NetworkManager`, `TcpSocket`, `UdpSocket`, `TcpListener`.
- DNS resolver — consumes a `UdpSocket`, exports `Resolver`.
- Config / secrets store — reads the initial config from `BootPackage`, exposes runtime `Config` and `Secret` caps with per-key attenuation.
- Cloud metadata agent — detects IMDS / ConfigDrive / SMBIOS on cloud boot and delivers a `ManifestDelta` (cloud-metadata-proposal.md).
- Upgrade manager — orchestrates `CapRetarget` for live service replacement (live-upgrade-proposal.md).
- Capability proxy — makes local caps reachable over the network with capnp-rpc transport (Plan 9’s `exportfs` equivalent).
- Measurement / attestation agent — consumes sealed kernel hashes from `BootPackage`, exposes `Quote` caps for remote attestation.
Supervisors
Per-subsystem restart managers that hold a narrowed ProcessSpawner plus
the caps of the subtree they own. If any child crashes, the supervisor
tears down and re-spawns the set. Example: net-supervisor owns NIC
driver + net-stack + DHCP client.
Application Services
User-facing or user-spawned processes: HTTP servers, API gateways, worker
pools, shells, interactive tools. Hold only the narrow caps the supervisor
grants (HttpEndpoint for one origin, Directory for one mount, etc.).
Human users, service accounts, guests, and anonymous callers are represented
by session/profile services that grant scoped cap bundles; they are not kernel
subjects or ambient process credentials. See
user-identity-and-policy-proposal.md.
What Does Not Become a Service
- Console / serial — stays in the kernel as a `CapObject` wrapper. Small enough, needed for kernel diagnostics, no benefit from userspace isolation. A userspace log service can layer on top.
- Frame allocator, virtual memory, scheduler, ring dispatch — kernel primitives, exposed as caps but not as services.
- Interrupt delivery, DMA mapping — kernel mechanisms, exposed to drivers as caps.
- Boot measurement — if added, happens in the kernel before `BootPackage` exists; the measurement agent (userspace) only reports the measurements.
Supervision
Supervision Tree
Init doesn’t have to supervise everything directly. It can delegate:
init (root supervisor)
├─ net-supervisor (holds: spawner subset, device caps)
│ ├─ virtio-net driver
│ ├─ net-stack
│ └─ http-service
└─ app-supervisor (holds: spawner subset, service caps)
├─ my-service
└─ another-app
Each supervisor is a process that holds a ProcessSpawner cap (possibly
restricted to specific binaries) and the caps it needs to grant to children.
If net-supervisor crashes, init restarts it, and it re-spawns the entire
networking subtree.
Supervisor Loop
#![allow(unused)]
fn main() {
fn supervisor_loop(children: &[SpawnRequest], spawner: &ProcessSpawner) {
let mut handles: Vec<ProcessHandle> = children.iter()
.map(|req| spawner.spawn(req.clone()))
.collect();
loop {
// Wait for any child to exit
let (index, exit_code) = wait_any(&handles);
let req = &children[index];
match req.restart {
RestartPolicy::Always => {
handles[index] = spawner.spawn(req.clone());
}
RestartPolicy::OnFailure if exit_code != 0 => {
handles[index] = spawner.spawn(req.clone());
}
_ => {
// Process exited normally, don't restart
}
}
}
}
}
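The restart decision inside that loop can be isolated into a small pure function. The sketch below mirrors the RestartPolicy variants used above; the `should_restart` helper name is an illustration, not a function in the tree:

```rust
// Hypothetical mirror of the RestartPolicy used by the supervisor loop.
#[derive(Clone, Copy, PartialEq, Debug)]
enum RestartPolicy {
    Never,
    OnFailure,
    Always,
}

/// Decide whether the supervisor should re-spawn a child that exited
/// with `exit_code` under the given policy.
fn should_restart(policy: RestartPolicy, exit_code: i64) -> bool {
    match policy {
        RestartPolicy::Always => true,
        RestartPolicy::OnFailure => exit_code != 0,
        RestartPolicy::Never => false,
    }
}

fn main() {
    // A crashing child (non-zero exit) is restarted under OnFailure...
    assert!(should_restart(RestartPolicy::OnFailure, 1));
    // ...but a clean exit is not.
    assert!(!should_restart(RestartPolicy::OnFailure, 0));
    // Always restarts regardless of exit code.
    assert!(should_restart(RestartPolicy::Always, 0));
    println!("restart policy checks passed");
}
```

Factoring the decision out keeps the loop itself purely about handle bookkeeping.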
Socket Activation
systemd pre-creates a socket and passes the fd to the service on first connection. In capOS, the supervisor does the same with caps:
Eager (default): supervisor spawns the child immediately with a
TcpListener cap. Child calls accept() and blocks.
Lazy: supervisor holds the TcpListener cap itself. On first incoming
connection (or on first accept() from a proxy cap), it spawns the child
and transfers the cap. The child code is identical in both cases.
#![allow(unused)]
fn main() {
// Lazy activation — supervisor holds the listener until needed
let listener = net_mgr.create_tcp_listener();
listener.bind([0,0,0,0], 8080);
// This blocks until a connection arrives
let _conn = listener.accept();
// Now spawn the actual service, giving it the listener
spawner.spawn(SpawnRequest {
binary: "/app/web-server",
caps: caps!["listener" => listener, "log" => console.clone()],
restart: RestartPolicy::Always,
});
}
Configuration
See docs/proposals/storage-and-naming-proposal.md for the full storage, naming, and configuration model.
Summary: the system topology is currently defined in a capnp-encoded
system manifest baked into the boot image. tools/mkmanifest compiles the
human-authored system.cue or system-spawn.cue source into the binary
manifest. The transition path already lets init validate and execute that
manifest through ProcessSpawner; the default boot path still needs to retire
the legacy kernel-side service graph. Runtime configuration lives in a
capability-based store service once that service exists.
Comparison with Traditional Approaches
| Concern | systemd/Linux | capOS |
|---|---|---|
| Service dependencies | Wants=, After=, Requires= | Implicit in cap graph |
| Sandboxing | seccomp, namespaces, AppArmor | Default: zero ambient authority |
| Socket activation | ListenStream=, fd passing protocol | Pass TcpListener cap |
| Restart policy | Restart=on-failure | Supervisor process loop |
| Logging | journald, StandardOutput=journal | Log cap in granted set |
| Resource limits | cgroups, MemoryMax=, CPUQuota= | Bounded allocator caps |
| Network access control | firewall rules (iptables/nftables) | Scoped HttpEndpoint / TcpSocket caps |
| Config format | INI-like unit files (~1500 directives) | Rust code or minimal manifest |
| Trusted computing base | systemd PID 1 (~1.4M lines) | Init process (hundreds of lines) |
Spawn Mechanism
Spawning is a capability-gated operation. The kernel provides a
ProcessSpawner capability — only the holder can create new processes.
Implemented Kernel Slice
The kernel now provides:
- ProcessSpawner capability — a CapObject impl in kernel/src/cap/process_spawner.rs. Methods: spawn(name, binaryName, grants) -> handleIndex — resolve a boot-package binary, load ELF, create address space (builds on the existing elf.rs loader and AddressSpace::new_user() in mem/paging.rs), populate the initial cap table, schedule the process, and return the ProcessHandle through the ring result-cap list. The returned ProcessHandle cap lets the parent wait for child exit in the first slice; exported caps and kill semantics are later lifecycle work.
- Initial cap passing — at spawn time, the kernel copies permitted parent cap references into the child’s cap table or mints authorized child-local kernel caps. Raw grants preserve the source badge, endpoint-client grants can mint a requested badge from an owner endpoint, and child-local FrameAllocator/VirtualMemory grants are created for the child’s address space. The parent’s references are unaffected.
- Cap export — future lifecycle work will let a child register a cap by name in its ProcessHandle, making it available to the parent (or anyone holding the handle). This is the mechanism behind nic_driver.exported("nic").wait() once exported-cap lookup is added.
Schema
interface ProcessSpawner {
spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}
struct CapGrant {
name @0 :Text;
capId @1 :UInt32;
interfaceId @2 :UInt64;
mode @3 :CapGrantMode;
badge @4 :UInt64;
source @5 :CapGrantSource;
}
struct CapGrantSource {
union {
capability @0 :Void;
kernel @1 :KernelCapSource;
}
}
enum CapGrantMode {
raw @0;
clientEndpoint @1;
}
interface ProcessHandle {
wait @0 () -> (exitCode :Int64);
}
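The badge rules for the two grant modes can be stated as a tiny pure function. This is a hedged sketch of the semantics described above (raw grants preserve the source badge, endpoint-client grants mint the requested badge); the `granted_badge` helper and enum names are illustrative, not the kernel's actual types:

```rust
// Hedged sketch of the grant-mode badge rules; names are illustrative.
#[derive(Clone, Copy, PartialEq, Debug)]
enum CapGrantMode {
    Raw,
    ClientEndpoint,
}

/// Compute the badge the child's cap table entry receives.
/// Raw grants copy the parent's badge unchanged; endpoint-client grants
/// mint the badge requested in the CapGrant from the owner endpoint.
fn granted_badge(mode: CapGrantMode, source_badge: u64, requested_badge: u64) -> u64 {
    match mode {
        CapGrantMode::Raw => source_badge,
        CapGrantMode::ClientEndpoint => requested_badge,
    }
}

fn main() {
    // A raw grant keeps the parent's badge even if another is requested.
    assert_eq!(granted_badge(CapGrantMode::Raw, 7, 99), 7);
    // An endpoint-client grant mints the badge the manifest asked for.
    assert_eq!(granted_badge(CapGrantMode::ClientEndpoint, 7, 99), 99);
    println!("badge rules hold");
}
```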
Note on capability passing: Capabilities are referenced by cap table
slot IDs (UInt32), not by Cap’n Proto’s native capability table mechanism.
spawn() returns the ProcessHandle through the ring result-cap list;
handleIndex identifies that transferred cap in the completion. The first
slice passes a boot-package binaryName instead of raw ELF bytes so the
request stays within the bounded ring parameter buffer. kill, post-spawn
grants, and exported-cap lookup remain future lifecycle work until their
teardown and authority semantics are implemented.
capOS uses manual capnp dispatch (CapObject trait with raw message bytes,
not capnp-rpc), so cap references are plain integers and typed result caps use
the ring transfer-result metadata. See
userspace-binaries-proposal.md Part 7 for
the surrounding userspace bootstrap schema context.
Relationship to Existing Code
The current kernel has these pieces in place:
- ELF loading (kernel/src/elf.rs) — parses PT_LOAD segments, validates alignment, and feeds the reusable spawn primitive behind ProcessSpawner.
- Address space creation (kernel/src/mem/paging.rs) — AddressSpace::new_user() creates isolated page tables with the kernel mapped in the upper half.
- Cap table (kernel/src/cap/table.rs) — CapTable with insert(), get(), remove(), transfer preflight, provisional insert, commit, and rollback helpers. Each Process owns one local table.
- Process struct and scheduler (kernel/src/process.rs, kernel/src/sched.rs) — a process table plus round-robin run queue are in place for both legacy manifest-spawned services and init-spawned children.
Generic capability transfer/release and the reusable ProcessSpawner
lifecycle path are complete enough for init-owned service startup. Remaining
lifecycle gaps are kill, post-spawn grants, runtime exported-cap lookup,
restart supervision, and retiring the default kernel-side service graph.
Prerequisites
| Prerequisite | Status | Why |
|---|---|---|
| ELF loading + address spaces | Done (Stage 2-3) | elf.rs, AddressSpace::new_user() |
| Capability ring + cap_enter | Done (Stage 4/6 foundation) | Ring-based cap invocation with blocking waits |
| Scheduling + preemption (core) | Done (Stage 5) | Round-robin, PIT 100 Hz, context switch |
| Cross-process Endpoint IPC | Done (Stage 6 foundation) | CALL/RECV/RETURN routing through Endpoint objects |
| Generic cap transfer/release | Done (Stage 6, 2026-04-22) | Copy/move transfer, result-cap insertion, CAP_OP_RELEASE; epoch revocation still future |
| ProcessSpawner + ProcessHandle | Done (Stage 6, 2026-04-22) | Init-driven spawn with grants, wait completion, hostile-input coverage; kill/post-spawn grants still future |
| Authority graph + quota design (S.9) | Done (2026-04-21) | Defines transfer/spawn invariants, per-process quotas, and rollback rules; see docs/authority-accounting-transfer-design.md |
This proposal describes the target architecture. Individual pieces (like
Fetch/HttpEndpoint) are additive — they’re userspace processes that
compose existing caps into higher-level ones. No kernel changes needed
beyond Stages 4-6.
First Step After Transfer and ProcessSpawner — done 2026-04-22
The minimal demonstration of this architecture landed together with capability
transfer and ProcessSpawner:
- ProcessSpawner cap in kernel/src/cap/process_spawner.rs wraps ELF loading and address-space creation behind a typed capability.
- Init spawns children — make run-spawn boots a manifest with config.initExecutesManifest set; the kernel boots only init, then init spawns endpoint-roundtrip, ipc-server, and ipc-client from manifest entries through ProcessSpawner, grants endpoint owner/client facets, and waits on each ProcessHandle.
- Cross-process cap invocation — spawned client invokes the server’s Endpoint cap, server replies, both print to console.
This exercises: spawn cap, initial cap passing, manifest-declared export
recording, cross-process cap invocation, hostile-input rejection, and
per-process resource exhaustion paths. Generic manifest execution exists for
the system-spawn.cue transition path. Making it the default make run path
and deleting the legacy kernel resolver is the selected follow-on milestone in
WORKPLAN.md.
Open Questions
- Cap revocation. If a service is restarted, its old caps should be invalidated. Planned approach (from research): epoch-based revocation (EROS-inspired, O(1) invalidate) plus generation counters on CapId (Zircon-inspired, stale-reference detection). See ROADMAP.md Stages 5-6.
- Cap discovery. How does a process learn what caps it was given? Resolved: name→(cap_id, interface_id) mapping passed at spawn via a well-known page (CapSet). See userspace-binaries-proposal.md Part 2. cap_id is the authority-bearing table handle. interface_id is the transported capnp TYPE_ID used by typed clients to check that the handle speaks the expected interface.
- Lazy spawning. Should the init process start everything eagerly, or should caps be backed by lazy proxies that spawn the backing service on first invocation?
- Cap persistence. If the system reboots, should the cap graph be reconstructable from saved state? Or is it always rebuilt from init code?
- Delegation depth. Can an application further delegate its HttpEndpoint cap to a subprocess? If so, the HTTP gateway needs to support fan-out. If not, how is this restriction enforced?
Proposal: Storage, Naming, and Persistence
What replaces the filesystem in a capability OS where Cap’n Proto is the universal wire format.
The Problem with Filesystems
In Unix, the filesystem is the universal namespace. Everything is a path:
/dev/sda, /etc/config, /proc/self/fd/3, /run/dbus/system_bus_socket.
Paths are ambient authority — any process can open /etc/passwd if the
permission bits allow. The filesystem conflates naming, access control,
persistence, and device abstraction into one mechanism.
capOS has capabilities instead of paths. Access control is structural (you can only use what you were granted), not advisory (permission bits checked at open time). This means:
- No global namespace needed — each process sees only its granted caps
- No path-based access control — the cap IS the access
- No distinction between “file”, “device”, “socket” — everything is a typed capability interface
A traditional VFS would reintroduce ambient authority through the back door. Instead, capOS needs a storage and naming model native to capabilities and Cap’n Proto.
Core Insight: Cap’n Proto Everywhere
Cap’n Proto is already used in capOS for:
- Interface definitions —
.capnpschemas define capability contracts - IPC messages — capability invocations are capnp messages
- Serialization — capnp wire format crosses process boundaries
If we extend this to storage, then:
- Stored objects are capnp messages
- Configuration is capnp structs
- Binary images are capnp-wrapped blobs
- The boot manifest is a capnp message describing the initial capability graph
No format conversion anywhere. The same tools (schema compiler, serializer, validator) work for IPC, storage, config, and network transfer.
Architecture
Three Layers
Target architecture after the manifest executor and process-spawner work:
Boot Image (read-only, baked into ISO)
│
│ capnp-encoded manifest + binaries
│
v
Kernel (creates initial caps from manifest)
│
│ grants caps to init
│
v
Init (builds live capability graph)
│
├──> Filesystem services (FAT, ext4 — wrap BlockDevice as Directory/File)
│
├──> Store service (capability-native content-addressed storage)
│ backed by: virtio-blk, RAM, or network
│
└──> All other services (receive Directory, Store, or Namespace caps)
Layer 1: Boot Image
The boot image (ISO/disk) contains a capnp-encoded system manifest loaded as a Limine module alongside the kernel. The manifest describes:
struct SystemManifest {
# Binaries available at boot, keyed by name
binaries @0 :List(NamedBlob);
# Initial service graph — what to spawn and with what caps
services @1 :List(ServiceEntry);
# Static configuration values as an evaluated CUE-style tree
config @2 :CueValue;
}
struct NamedBlob {
name @0 :Text;
data @1 :Data;
}
struct ServiceEntry {
name @0 :Text;
binary @1 :Text; # references a NamedBlob by name
caps @2 :List(CapRef); # what caps this service receives
restart @3 :RestartPolicy;
exports @4 :List(Text); # cap names this service is expected to export
}
struct CapRef {
name @0 :Text; # local name in the child's cap table
expectedInterfaceId @1 :UInt64; # generated .capnp TYPE_ID for validation
union {
unset @2 :Void; # invalid; keeps omitted sources fail-closed
kernel @3 :KernelCapSource;
service @4 :ServiceCapSource;
}
}
enum KernelCapSource {
console @0;
endpoint @1;
frameAllocator @2;
virtualMemory @3;
}
struct ServiceCapSource {
service @0 :Text;
export @1 :Text;
}
enum RestartPolicy {
never @0;
onFailure @1;
always @2;
}
struct CueValue {
union {
null @0 :Void;
boolean @1 :Bool;
intValue @2 :Int64;
uintValue @3 :UInt64;
text @4 :Text;
bytes @5 :Data;
list @6 :List(CueValue);
fields @7 :List(CueField);
}
}
struct CueField {
name @0 :Text;
value @1 :CueValue;
}
Capability source identity is already structured in the bootstrap manifest, so source selection does not depend on parsing authority strings:
struct CapRef {
name @0 :Text; # local name in the child's CapSet
expectedInterfaceId @1 :UInt64; # generated .capnp TYPE_ID for validation
union {
unset @2 :Void; # invalid; keeps omitted sources fail-closed
kernel @3 :KernelCapSource;
service @4 :ServiceCapSource;
}
}
enum KernelCapSource {
console @0;
endpoint @1;
frameAllocator @2;
virtualMemory @3;
}
struct ServiceCapSource {
service @0 :Text;
export @1 :Text;
}
KernelCapSource / ServiceCapSource select the authority to grant. The
expectedInterfaceId field carries the generated Cap’n Proto interface
TYPE_ID and only checks that the granted object speaks the expected schema.
It cannot replace source identity: many different objects may expose the same
interface while representing different authority.
The build system (Makefile) generates this manifest from a human-authored
description and packs it into the ISO as manifest.bin. Current code embeds
every SystemManifest.binaries entry into that manifest as NamedBlob data,
including the release-built init and smoke-demo ELFs. Exposing the manifest to
init as a read-only BootPackage capability (rather than letting the kernel
parse and act on the service graph) is the selected follow-on milestone.
Using a CueValue tree instead of AnyPointer keeps the manifest directly
decodable in no_std userspace without depending on Cap’n Proto reflection.
Transitional Schema Note
ServiceEntry, CapSource::Service, and ServiceEntry.exports are
transitional. ProcessSpawner and copy/move cap transfer are implemented
(2026-04-22), but the default make run boot path still has the kernel
spawn every declared service and wire cross-service caps. Once init owns
generic manifest execution, the manifest loses the service graph entirely:
struct SystemManifest {
# Binaries available at boot, keyed by name
binaries @0 :List(NamedBlob);
# Init's config blob (replaces the service graph)
initConfig @1 :CueValue;
# Kernel boot parameters (memory limits, feature flags)
kernelParams @2 :CueValue;
}
ServiceEntry / CapRef disappear from the schema and become plain CUE
fields inside initConfig. Init reads them at runtime and calls
ProcessSpawner directly. validate_manifest_graph,
validate_bootstrap_cap_sources, and create_all_service_caps all retire
once that happens. See docs/proposals/service-architecture-proposal.md —
“Legacy Manifest Fields After Stage 6” for the deprecation plan.
Layer 2: Kernel Bootstrap
Target design for the kernel’s boot role:
- Parse the system manifest (read-only capnp message from Limine module).
- Hash the embedded binaries for optional measured-boot attestation.
- Create kernel-provided capabilities:
Console,Timer,DeviceManager,ProcessSpawner,FrameAllocator,VirtualMemory(per-process), and a read-onlyBootPackagecap exposingSystemManifest.binariesandinitConfig. - Spawn init — exactly one userspace process — with that cap bundle.
Current code has not reached this split for the default boot: the kernel
still parses the manifest and creates one process per ServiceEntry.
The transition path exists in system-spawn.cue: it sets
config.initExecutesManifest, the kernel validates the full manifest but
boots only init, and init spawns endpoint, IPC, VirtualMemory, and
FrameAllocator cleanup demo children through ProcessSpawner. Retiring the
legacy kernel resolver for default make run is the selected follow-on
milestone tracked in WORKPLAN.md.
Layer 3: Init and the Live Capability Graph
Target init reads initConfig from the BootPackage cap and executes it:
fn main(caps: CapSet) {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let boot = caps.get::<BootPackage>("boot");
    let config = boot.init_config()?; // CueValue
    let mut running_services = BTreeMap::new();
    // Walk service entries from the config and spawn in dependency order
    for entry in config.field("services")?.iter()? {
        let binary = boot.binary(entry.field("binary")?.as_str()?)?;
        let granted = resolve_caps(entry.field("caps")?, &running_services, &caps);
        let handle = spawner.spawn(binary, granted, entry.field("restart")?.into())?;
        running_services.insert(entry.field("name")?.as_str()?.into(), handle);
    }
    supervisor_loop(&running_services);
}
In this target model, init is a generic manifest executor rather than a
hardcoded service graph. The system topology is defined in the boot
package’s initConfig, not in init’s source code. Changing what services
run means rebuilding the boot image with a different config blob, not
recompiling init. Manifest graph resolution stops being a kernel concern.
The current transition still uses SystemManifest.services as the service
graph instead of initConfig; init reads the BootPackage manifest, validates a
metadata-only ManifestBootstrapPlan, resolves kernel and service cap sources,
records exported caps, spawns children in manifest order, and waits for their
ProcessHandles.
Two Storage Models
capOS supports two complementary storage models, both exposed as typed capabilities:
Filesystem Capabilities (Directory, File)
For accessing traditional block-based filesystems (FAT, ext4, ISO9660) and
for POSIX compatibility. A filesystem service wraps a BlockDevice and
exports Directory/File capabilities.
BlockDevice (raw sectors)
│
└──> Filesystem service (FAT, ext4, ...)
│
├──> Directory caps (namespace over files)
└──> File caps (read/write byte streams)
This model maps naturally to USB flash drives, NVMe partitions, and
network-mounted filesystems. The open() and sub() operations return new
capabilities via IPC cap transfer (see “IPC and Capability Transfer” below).
Capability-Native Store (Store, Namespace)
For capOS-native data: configuration, service state, content-addressed object
storage. A store service wraps a BlockDevice and exports Store/Namespace
capabilities.
BlockDevice (raw sectors)
│
└──> Store service
│
├──> Store cap (content-addressed put/get)
└──> Namespace caps (mutable name→hash mappings)
Content-addressing provides automatic deduplication, verifiable integrity, and immutable references. Namespaces add mutable bindings on top.
Bridging the Two Models
The models are composable. An adapter service can bridge between them:
- FsStore adapter: exposes a Directory tree as a content-addressed Store (hash each file’s contents, directory listings become capnp-encoded objects)
- StoreFS adapter: exposes Store/Namespace as a Directory tree (each name maps to a File whose contents are the stored object)
- Import/export: a utility service reads files from a Directory and stores them in a Store, or materializes Store objects as files in a Directory
In both cases the adapter is a userspace service holding caps to both subsystems. No kernel mechanism needed — just capability composition.
File I/O Interfaces
Directory, File, Store, and Namespace caps may be scoped to a user session, guest profile, anonymous request, or service identity, but the cap remains the authority. POSIX ownership metadata is compatibility data inside these services, not a system-wide authorization channel. See user-identity-and-policy-proposal.md.
BlockDevice
Raw sector access, served by device drivers (virtio-blk, NVMe, USB mass
storage). The driver receives hardware capabilities (MMIO, IRQ,
FrameAllocator for DMA) and exports a BlockDevice cap.
interface BlockDevice {
readBlocks @0 (startLba :UInt64, count :UInt32) -> (data :Data);
writeBlocks @1 (startLba :UInt64, count :UInt32, data :Data) -> ();
info @2 () -> (blockSize :UInt32, blockCount :UInt64, readOnly :Bool);
flush @3 () -> ();
}
For bulk transfers, readBlocks/writeBlocks accept a SharedBuffer
capability instead of inline Data (see “Shared Memory for Bulk Data”
below). The inline-Data variants work for metadata reads and small
operations; the SharedBuffer variants avoid copies for large I/O.
File
Byte-stream access to a single file. Served by filesystem services. Created
dynamically when a client calls Directory.open() — the filesystem service
creates a File CapObject for the opened file and transfers it to the
caller via IPC cap transfer.
interface File {
read @0 (offset :UInt64, length :UInt32) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> (written :UInt32);
stat @2 () -> (size :UInt64, created :UInt64, modified :UInt64);
truncate @3 (length :UInt64) -> ();
sync @4 () -> ();
close @5 () -> ();
}
close releases the server-side state for this file (open cluster chain
cache, dirty buffers). The kernel-side CapTable entry is removed by the system
transport via CAP_OP_RELEASE when the local holder releases it; generated
capos-rt handle drop still needs RAII integration before ordinary userspace
handles submit that opcode automatically. CapabilityManager is
management-only (list(), later grant()); it does not expose a drop()
method because ordinary handle lifetime belongs to the transport, not to an
application call on the same table that dispatches it.
Attenuation: a read-only File wraps the original and rejects write,
truncate, sync calls. An append-only File rejects write at offsets
other than the current size.
Directory
Namespace over files on a filesystem. Served by filesystem services.
open() and sub() return new capabilities via IPC cap transfer.
interface Directory {
open @0 (name :Text, flags :UInt32) -> (file :File);
list @1 () -> (entries :List(DirEntry));
mkdir @2 (name :Text) -> (dir :Directory);
remove @3 (name :Text) -> ();
sub @4 (name :Text) -> (dir :Directory);
}
struct DirEntry {
name @0 :Text;
size @1 :UInt64;
isDir @2 :Bool;
}
sub() returns a Directory scoped to a subdirectory — the analog of chroot.
The caller cannot traverse upward or see the parent directory. open() with
create flags creates a new file if it doesn’t exist.
The flags field in open() is a bitmask: CREATE = 1, TRUNCATE = 2,
APPEND = 4. No READ/WRITE flags — those are determined by the
Directory cap’s attenuation (a read-only Directory returns read-only Files).
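The flag bits from the paragraph above can be written out directly. The constants match the stated values; the `has_flag` helper is just an illustration of decoding the bitmask:

```rust
// The open() flag bits stated in the text.
const CREATE: u32 = 1;
const TRUNCATE: u32 = 2;
const APPEND: u32 = 4;

fn has_flag(flags: u32, bit: u32) -> bool {
    flags & bit != 0
}

fn main() {
    // Open-or-create, then append; no truncation.
    let flags = CREATE | APPEND;
    assert!(has_flag(flags, CREATE));
    assert!(has_flag(flags, APPEND));
    assert!(!has_flag(flags, TRUNCATE));
    println!("flags = {flags:#b}"); // prints "flags = 0b101"
}
```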
Syscall Trace: Reading a File from a FAT USB Drive
Four userspace processes: App, FAT service, USB mass storage, xHCI driver.
With promise pipelining (one submission):
Cap’n Proto promise pipelining lets the App chain dependent calls without waiting for intermediate results. The App submits a single pipelined request: “open this file, then read from the result”:
# Single pipelined submission (SQEs with PIPELINE flag):
# call 0: dir.open("report.pdf") → promise P0
# call 1: P0.file.read(offset=0, len=4096) → depends on P0
cap_submit([
{cap=2, method=OPEN, params={"report.pdf", flags=0}},
{cap=PIPELINE(0, field=file), method=READ, params={offset:0, length:4096}},
])
→ kernel routes call 0 to FAT service via Endpoint
→ FAT service reads directory entry from BlockDevice
→ FAT service creates FileCapObject, replies with File cap
→ kernel sees pipelined call 1 targeting the File cap from call 0
→ kernel dispatches call 1 to the same FAT service (or direct-invokes
the new File CapObject if it's a local endpoint)
→ FAT service maps offset → cluster chain → LBA
→ FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
→ USB mass storage → xHCI → hardware → back up
← completion: {data: [4096 bytes]}, File cap installed as cap_id=5
One app-to-kernel transition. The kernel resolves the pipeline dependency internally — the App never sees the intermediate File cap until the whole chain completes (though the cap is installed and usable afterward).
This is a core Cap’n Proto feature: by expressing “call method on the
not-yet-resolved result of another call,” the client avoids a round-trip
for each link in the chain. For deeper chains (e.g., dir.sub("a").sub("b").open("file").read(0, 4096)), the savings compound — one submission instead
of four sequential syscalls.
Without pipelining (two sequential ring submissions):
Without promise pipelining, the App submits two separate CALL SQEs via the ring, blocking on each completion before submitting the next:
# 1. Open file (App holds Directory cap, cap_id=2)
# App writes CALL SQE: {cap=2, method=OPEN, params={"report.pdf", flags=0}}
cap_enter(min_complete=1, timeout=MAX)
→ kernel routes CALL to FAT service via Endpoint
→ FAT service reads directory entry from BlockDevice
→ FAT service creates FileCapObject for this file
→ FAT service posts RETURN SQE with [FileCapObject] in xfer_caps
→ kernel installs File cap in App's table → cap_id=5
← App reads CQE: result={file: cap_index=0}, new_caps=[5]
# 2. Read 4096 bytes from offset 0
# App writes CALL SQE: {cap=5, method=READ, params={offset:0, length:4096}}
cap_enter(min_complete=1, timeout=MAX)
→ kernel routes CALL to FAT service
→ FAT service maps offset → cluster chain → LBA
→ FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
→ kernel routes to USB mass storage
→ mass storage submits CALL SQE: {cap=usb_cap, method=BULK_TRANSFER, params={scsi_cmd}}
→ kernel routes to xHCI driver
→ xHCI programs TRBs, waits for interrupt
← returns raw sector data
← returns sector data
← FAT service extracts file bytes, posts RETURN SQE with {data: [4096 bytes]}
This works but costs two round-trips where pipelining needs one. The synchronous path is useful for simple cases and bootstrapping; pipelining is the intended steady-state model.
In both cases, the intermediate IPC hops (FAT → USB mass storage → xHCI) are invisible to the App.
Capability-Native Store
The Store Capability
Once the system is running, persistent storage is provided by a userspace service — the store. It’s backed by a block device (virtio-blk), and exposes a content-addressed object store where objects are capnp messages.
interface Store {
# Store a capnp message, returns its content hash
put @0 (data :Data) -> (hash :Data);
# Retrieve by hash
get @1 (hash :Data) -> (data :Data);
# Check existence
has @2 (hash :Data) -> (exists :Bool);
# Delete (if caller has authority — see note below)
delete @3 (hash :Data) -> ();
}
Note on delete: In a content-addressed store, deleting a hash can break
references from other namespaces pointing to the same object. delete on the
base Store interface is dangerously broad — a StoreAdmin interface
(separate from Store) may be more appropriate, with delete restricted to a
GC service that can verify no live references exist. Open Question #3 (GC)
should be resolved before implementing delete. The attenuation table below
lists Store (full) as “Read, write, delete any object” — in practice, most
callers should receive a Store attenuated to put/get/has only.
Content-addressed means:
- Deduplication is automatic (same content = same hash)
- Integrity is verifiable (hash the data, compare)
- References between objects are just hashes embedded in capnp messages
- No mutable paths — “updating a file” means storing a new version and updating the reference
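Those properties fall directly out of keying storage by content hash. The sketch below is an in-memory stand-in for the Store interface, with FNV-1a substituting for a real cryptographic hash (an assumption made purely to keep the example self-contained):

```rust
use std::collections::HashMap;

// FNV-1a stands in for a real cryptographic hash in this sketch.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

#[derive(Default)]
struct Store {
    objects: HashMap<u64, Vec<u8>>, // hash -> object bytes
}

impl Store {
    fn put(&mut self, data: &[u8]) -> u64 {
        let hash = fnv1a(data);
        // Same content always yields the same hash: dedup is automatic.
        self.objects.entry(hash).or_insert_with(|| data.to_vec());
        hash
    }
    fn get(&self, hash: u64) -> Option<&[u8]> {
        self.objects.get(&hash).map(Vec::as_slice)
    }
    fn has(&self, hash: u64) -> bool {
        self.objects.contains_key(&hash)
    }
}

fn main() {
    let mut store = Store::default();
    let h1 = store.put(b"config-v1");
    let h2 = store.put(b"config-v1"); // duplicate content
    assert_eq!(h1, h2);               // deduplicated: one object stored
    assert_eq!(store.objects.len(), 1);
    assert_eq!(store.get(h1), Some(&b"config-v1"[..]));
    assert!(!store.has(fnv1a(b"other")));
    println!("store dedup ok");
}
```

"Updating" an object means `put`ting the new version and rebinding a name to the new hash; the old version stays retrievable until garbage-collected.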
Mutable References: Namespaces
A Namespace capability provides mutable name-to-hash mappings on top of
the immutable store:
interface Namespace {
# Resolve a name to a store hash
resolve @0 (name :Text) -> (hash :Data);
# Bind a name to a hash (if caller has write authority)
bind @1 (name :Text, hash :Data) -> ();
# List names (if caller has list authority)
list @2 () -> (names :List(Text));
# Get a sub-namespace (attenuated — restricted to a prefix)
sub @3 (prefix :Text) -> (ns :Namespace);
}
A Namespace cap scoped to "config/" can only see and modify names under
that prefix. This is the analog of a chroot — but structural, not a kernel
hack. The sub() method returns a new Namespace cap via IPC cap transfer.
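Prefix attenuation is easy to model. The sketch below is illustrative only: it snapshots the bindings into the sub-namespace (a real service would share live state through the cap), but it shows the structural property that names outside the prefix are unreachable:

```rust
use std::collections::BTreeMap;

// Illustrative Namespace with prefix attenuation via sub().
struct Namespace {
    bindings: BTreeMap<String, u64>, // full name -> content hash
    prefix: String,                  // "" for the root namespace
}

impl Namespace {
    fn resolve(&self, name: &str) -> Option<u64> {
        self.bindings.get(&format!("{}{}", self.prefix, name)).copied()
    }
    fn bind(&mut self, name: &str, hash: u64) {
        self.bindings.insert(format!("{}{}", self.prefix, name), hash);
    }
    /// Attenuate: return a namespace restricted to `prefix`.
    /// (Copies bindings to keep the sketch simple; a real service
    /// would hand out a live view over shared state.)
    fn sub(&self, prefix: &str) -> Namespace {
        let full = format!("{}{}", self.prefix, prefix);
        Namespace {
            bindings: self
                .bindings
                .iter()
                .filter(|(k, _)| k.starts_with(&full))
                .map(|(k, v)| (k.clone(), *v))
                .collect(),
            prefix: full,
        }
    }
}

fn main() {
    let mut root = Namespace { bindings: BTreeMap::new(), prefix: String::new() };
    root.bind("config/net", 0xaaaa);
    root.bind("secrets/key", 0xbbbb);

    let config = root.sub("config/");
    assert_eq!(config.resolve("net"), Some(0xaaaa));
    // Names outside the prefix are structurally unreachable —
    // there is no parent to traverse back to.
    assert_eq!(config.resolve("../secrets/key"), None);
    println!("namespace attenuation ok");
}
```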
Future: union composition. The research survey recommends
extending Namespace with Plan 9-inspired union semantics — a union(other, mode) method that merges two namespaces with before/after/replace ordering.
This adds composability without a global mount table. See
research.md §6.
IPC and Capability Transfer
Several storage operations return new capabilities: Directory.open()
returns a File, Directory.sub() returns a Directory, Namespace.sub()
returns a Namespace. This requires dynamic capability management — the kernel
must install new capabilities in a process’s CapTable at runtime as part of
IPC.
The Capability Ring
All kernel-userspace interaction goes through a shared-memory ring pair (submission queue + completion queue), inspired by io_uring. SQE opcodes map to capnp-rpc Level 1 message types. The ring is allocated per-process at spawn time and mapped into the process’s address space.
Syscall surface: 2 syscalls. New capabilities, operations, and transfer mechanisms are expressed as new SQE opcodes instead of expanding the syscall ABI.
| # | Syscall | Purpose |
|---|---|---|
| 1 | exit(code) | Terminate process |
| 2 | cap_enter(min_complete, timeout_ns) | Process pending SQEs, then wait until enough CQEs exist or the timeout expires |
Writing SQEs is syscall-free, but ordinary capability CALLs make progress
through cap_enter. Timer polling handles non-CALL ring work and only CALL
targets that explicitly opt into interrupt-context dispatch. cap_enter
flushes pending SQEs and can block the process until min_complete
completions are available or a finite timeout expires. An indefinite wait uses
timeout_ns = u64::MAX; timeout_ns = 0 keeps the call non-blocking. A future
SQPOLL-style worker can reintroduce a zero-syscall CALL-completion hot path
without running arbitrary capability methods from timer interrupt context.
The ring structs and synchronous CALL dispatch are implemented and working.
See capos-config/src/ring.rs for the shared ring structs and
kernel/src/cap/ring.rs for kernel-side processing.
Ring Layout
One 4 KiB page per process, mapped into both kernel (HHDM) and user space:
┌─────────────────────────┐ offset 0
│ Ring Header │ SQ/CQ head, tail, mask, flags
├─────────────────────────┤ offset 128
│ SQE Array (16 × 64B) │ submission queue entries
├─────────────────────────┤ offset 1152
│ CQE Array (32 × 32B) │ completion queue entries
└─────────────────────────┘
SQ: userspace owns tail (producer), kernel owns head (consumer)
CQ: kernel owns tail (producer), userspace owns head (consumer)
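The layout arithmetic above can be checked with a few constants mirroring the diagram. This is a sketch of the offset math only (power-of-two masking for slot selection), not the actual struct definitions in capos-config/src/ring.rs:

```rust
// Constants mirror the diagram above: header, 16 x 64 B SQEs, 32 x 32 B CQEs,
// all inside one 4 KiB page.
const HEADER_SIZE: usize = 128;
const SQ_ENTRIES: usize = 16;
const SQE_SIZE: usize = 64;
const CQ_ENTRIES: usize = 32;
const CQE_SIZE: usize = 32;

const SQE_OFFSET: usize = HEADER_SIZE;                        // 128
const CQE_OFFSET: usize = SQE_OFFSET + SQ_ENTRIES * SQE_SIZE; // 1152
const RING_END: usize = CQE_OFFSET + CQ_ENTRIES * CQE_SIZE;   // 2176, fits in 4096

/// Byte offset of an SQE slot; free-running indices wrap via the mask.
fn sqe_offset(index: usize) -> usize {
    SQE_OFFSET + (index & (SQ_ENTRIES - 1)) * SQE_SIZE
}
```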
SQE Opcodes
Six opcodes handle everything — client calls, server dispatch, capability transfer, pipelining, lifecycle, and timing:
| Opcode | capnp-rpc analog | Purpose |
|---|---|---|
| CALL | Call | Invoke a method on a capability |
| RETURN | Return | Respond to an incoming call (server side) |
| RECV | (implicit) | Wait for incoming calls on an Endpoint |
| RELEASE | Release | Drop a capability reference |
| FINISH | Finish | Release pipeline answer state |
| TIMEOUT | — | Post a CQE after N nanoseconds (io_uring-inspired) |
TIMEOUT is an alternative to the timeout_ns argument on cap_enter:
it works with zero-syscall polling (kernel fires the CQE on a timer tick)
and composes with LINK/DRAIN for deadline-based chains.
SQE flags: PIPELINE (cap_id is a promise reference), LINK (chain to
next SQE), MULTISHOT (keep generating CQEs), DRAIN (barrier).
Promise Pipelining
A CALL SQE can target either a concrete CapId or a PromisedAnswer
reference (via the PIPELINE flag + pipeline_dep/pipeline_field fields).
The kernel resolves the dependency chain internally:
SQE[0]: CALL dir.open("report.pdf") → user_data=100
SQE[1]: CALL [PIPELINE: dep=100, field=0].read(0, 4096) → user_data=101
One cap_enter call. The kernel dispatches SQE[0], extracts the File cap
from the result, and dispatches SQE[1] against it — all without returning
to userspace between steps.
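The substitution step the kernel performs can be sketched as a lookup from a pipelined target to the cap extracted from an earlier answer. The types here (Target, the answers map) are illustrative, not the kernel's actual data structures:

```rust
use std::collections::HashMap;

// Sketch of pipelined-target resolution: a PIPELINE CALL names an earlier
// SQE's user_data; the kernel substitutes the cap extracted from that answer.
enum Target {
    Cap(u64),              // concrete CapId
    Pipeline { dep: u64 }, // PIPELINE flag set: dep = earlier CALL's user_data
}

/// answers: user_data of completed CALLs -> CapId extracted from their results.
fn resolve(target: Target, answers: &HashMap<u64, u64>) -> Option<u64> {
    match target {
        Target::Cap(id) => Some(id),
        Target::Pipeline { dep } => answers.get(&dep).copied(),
    }
}
```

A None result would correspond to a dangling dependency (the referenced answer never produced a cap), which the real kernel must fail closed.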
The Endpoint Kernel Object
For cross-process IPC, an Endpoint connects client-side proxy caps to a server’s receive loop:
Client's CapTable                        Server's CapTable
┌─────────────────┐                      ┌──────────────────┐
│ cap 2: Proxy    │                      │ cap 0: Endpoint  │
│  → endpoint ────┼──► Endpoint ◄────────┼── RECV SQE       │
│  badge: 42      │    (kernel obj)      │                  │
└─────────────────┘                      └──────────────────┘
The server posts a RECV SQE (with MULTISHOT flag). Incoming calls appear
as CQEs with badge, interface_id, method_id, and a kernel-assigned call_id.
The server responds by posting a RETURN SQE referencing the call_id.
interface_id is the transported schema ID for the interface being invoked.
It should equal the generated TYPE_ID for that capnp interface. cap_id is
the authority-bearing table handle; interface_id is only the protocol tag.
The target capability entry owns one public interface; method_id selects a
method inside that interface, while cap_id identifies the object being
invoked. If the same backing state needs another interface, the transport
should mint a separate capability entry for that interface rather than letting
one handle accept multiple unrelated interface_id values.
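The one-interface-per-entry rule above implies a simple dispatch check: reject any CALL whose interface_id does not match the entry's interface. A minimal sketch, with hypothetical names (CapEntry, dispatch) rather than the kernel's actual types:

```rust
// Sketch of the one-interface-per-entry rule: interface_id is only a
// protocol tag and must match the capability entry being invoked.
struct CapEntry {
    interface_id: u64, // generated TYPE_ID of the entry's single interface
}

fn dispatch(
    entry: &CapEntry,
    interface_id: u64,
    method_id: u16,
) -> Result<(u64, u16), &'static str> {
    if interface_id != entry.interface_id {
        // Mint a separate capability entry instead of multiplexing interfaces.
        return Err("interface_id mismatch");
    }
    Ok((interface_id, method_id))
}
```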
Direct-Switch IPC
When a client’s CALL targets a cap served by a blocked server (waiting on RECV), the kernel marks that server as the direct IPC handoff target so the next context-switch path runs the callee before unrelated round-robin work. The current implementation still uses the ordinary saved-context restore path; small-message register transfer remains a future fastpath after measurement. See research.md §2.
Capability Transfer via Ring
Capabilities travel as sideband arrays (CapTransferDescriptor) alongside capnp
message bytes:
- CALL params: the params buffer contains the capnp message bytes followed by
  xfer_cap_count transfer descriptors packed at addr + len, which must be
  aligned to CAP_TRANSFER_DESCRIPTOR_ALIGNMENT.
- RETURN results: server result buffers carry the capnp reply bytes and may
  carry return transfer descriptors at addr + len; the kernel appends
  destination capability records to the caller’s result buffer after the
  normal result bytes. The count is reported in CQE cap_count, and the records
  are written as CapTransferResult { cap_id, interface_id } values at
  result_addr + result. The requested result buffer (result_len) must be large
  enough for both the normal reply bytes and all appended cap_count records.
xfer_cap_count > 0 with malformed descriptor metadata (bad mode bits, reserved
bits, _reserved0, or misalignment) fails closed as
CAP_ERR_INVALID_TRANSFER_DESCRIPTOR. Kernels that have not yet enabled transfer
handling should return CAP_ERR_TRANSFER_NOT_SUPPORTED for transfer-bearing SQEs.
The capnp wire format’s WirePointerKind::Other encodes capability indices
in messages. The sideband arrays map these indices to actual CapIds. The
kernel does not parse capnp messages — it transfers a list of caps alongside
the opaque message bytes.
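The fail-closed descriptor checks described above can be sketched as a single validation pass. Mode values, the alignment constant, and field names here are illustrative, not the real capos-config ABI:

```rust
// Sketch of fail-closed sideband descriptor validation. All constants and
// field names are illustrative, not the real ABI.
const MODE_COPY: u8 = 0;
const MODE_MOVE: u8 = 1;
const DESC_ALIGN: u64 = 8; // stand-in for CAP_TRANSFER_DESCRIPTOR_ALIGNMENT
const ERR_INVALID_TRANSFER_DESCRIPTOR: i32 = -7;

struct XferDesc {
    mode: u8,
    reserved0: u8, // must be zero
    addr: u64,
    len: u64,
}

fn validate(d: &XferDesc) -> Result<(), i32> {
    // Range must not overflow before we can even reason about placement.
    let end = d
        .addr
        .checked_add(d.len)
        .ok_or(ERR_INVALID_TRANSFER_DESCRIPTOR)?;
    let ok = (d.mode == MODE_COPY || d.mode == MODE_MOVE)
        && d.reserved0 == 0
        && end % DESC_ALIGN == 0; // descriptors packed at addr + len
    if ok { Ok(()) } else { Err(ERR_INVALID_TRANSFER_DESCRIPTOR) }
}
```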
Dynamic Capability Management
Every open(), sub(), or resolve() creates and transfers a new
capability at runtime. The kernel’s CapTable insert() and remove() are
the primitives. Capabilities flow through RETURN SQE sideband arrays (and
through the manifest at boot). No separate cap_grant mechanism needed —
authority flow follows the ring’s IPC graph.
The CapTable generation counter handles stale references: when a File cap is
closed (slot freed, generation bumps), any cached CapId returns
StaleGeneration instead of accidentally hitting a new occupant.
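The generation check can be sketched as follows. The slot/generation layout and method names are hypothetical; the real CapTable lives in the kernel:

```rust
// Sketch of generation-checked capability lookup: freeing a slot bumps its
// generation, so cached ids fail with StaleGeneration instead of reaching
// the slot's new occupant.
#[derive(Debug, PartialEq)]
enum LookupError {
    EmptySlot,
    StaleGeneration,
}

struct Slot {
    generation: u32,
    object: Option<&'static str>, // stand-in for a real CapObject
}

struct CapTable {
    slots: Vec<Slot>,
}

impl CapTable {
    fn lookup(&self, slot: usize, generation: u32) -> Result<&'static str, LookupError> {
        let s = &self.slots[slot];
        match s.object {
            None => Err(LookupError::EmptySlot),
            Some(obj) if s.generation == generation => Ok(obj),
            Some(_) => Err(LookupError::StaleGeneration),
        }
    }

    fn remove(&mut self, slot: usize) {
        self.slots[slot].object = None;
        self.slots[slot].generation += 1; // invalidate all cached CapIds
    }

    fn insert(&mut self, slot: usize, obj: &'static str) {
        self.slots[slot].object = Some(obj);
    }
}
```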
Shared Memory for Bulk Data
Copying file data through capnp Data fields works for metadata and small
reads, but is impractical for anything above a few KB. A 1 MB read through
a capability CALL copies data four times: device → driver heap → capnp
message → kernel buffer → client buffer.
SharedBuffer Capability
A SharedBuffer (also called MemoryObject, listed in ROADMAP.md Stage 6)
is a kernel object backed by physical pages that can be mapped into multiple
address spaces simultaneously. Zero copies between processes.
interface SharedBuffer {
# Map into caller's address space (returns virtual address and size)
map @0 () -> (addr :UInt64, size :UInt64);
# Unmap from caller's address space
unmap @1 () -> ();
# Size of the buffer
size @2 () -> (bytes :UInt64);
}
The kernel creates SharedBuffer objects on request (via a kernel-provided
BufferAllocator capability). The pages are reference-counted — the buffer
persists as long as any process holds a cap to it.
File I/O with SharedBuffer
File and BlockDevice interfaces support both inline-Data and SharedBuffer modes:
# Small read (< ~4 KB): inline in capnp message
file.read(offset=0, length=256) → {data: [256 bytes]}
# Large read: caller provides SharedBuffer, server fills it
let buf = buf_alloc.create(1048576); # 1 MB SharedBuffer
file.readBuf(offset=0, buf, length=1048576) → {bytesRead: 1048576}
# Data is now in buf's mapped pages — no copy through kernel
Extended File interface with SharedBuffer support:
interface File {
read @0 (offset :UInt64, length :UInt32) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> (written :UInt32);
readBuf @2 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (bytesRead :UInt32);
writeBuf @3 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (written :UInt32);
stat @4 () -> (size :UInt64, created :UInt64, modified :UInt64);
truncate @5 (length :UInt64) -> ();
sync @6 () -> ();
close @7 () -> ();
}
The readBuf/writeBuf methods accept a SharedBuffer cap (transferred
via IPC). The server maps the buffer, performs DMA or memory copies into it,
then returns. The caller reads directly from the mapped pages.
For BlockDevice, the same pattern applies — the driver maps the SharedBuffer, programs DMA descriptors pointing to its physical pages, and the device writes directly into the shared memory.
When to Use Each Mode
| Scenario | Mechanism | Why |
|---|---|---|
| Reading a 64-byte config value | File.read() inline Data | Copy overhead negligible |
| Reading a 10 MB binary | File.readBuf() SharedBuffer | Avoids 4× copy overhead |
| FAT directory entry (32 bytes) | BlockDevice.readBlocks() inline | Small metadata read |
| Streaming video frames | File.readBuf() + ring of SharedBuffers | Continuous zero-copy |
| Network packet buffers | SharedBuffer ring between NIC driver and net stack | DMA-capable pages |
Attenuation
Storage services mint restricted capabilities using wrapper CapObjects:
| Capability | Authority |
|---|---|
| Directory (full) | Open, list, mkdir, remove, sub |
| Directory (read-only) | Open (returns read-only Files), list, sub only |
| File (full) | Read, write, truncate, sync |
| File (read-only) | Read and stat only |
| File (append-only) | Read, stat, write at end only |
| Store (full) | Read, write, delete any object |
| Store (read-only) | Get and has only |
| Namespace (full) | Resolve, bind, list under prefix |
| Namespace (read-only) | Resolve and list only |
| Blob (single object) | Read one specific hash |
| SharedBuffer (read-only) | Map as read-only (page table: R, no W) |
An application that only needs to read its config gets a read-only
Directory scoped to its config path. It can’t write, can’t see other
apps’ directories, can’t access the raw BlockDevice.
Naming Without Paths
Traditional OS: process opens /var/lib/myapp/data.db — a global path.
capOS: process receives a Directory or Namespace cap at spawn time,
opens "data.db" within it. The process has no idea where on disk this
lives. It can’t traverse upward. There is no global root.
# Traditional: global path namespace
/
├── etc/
│ └── myapp/
│ └── config.toml
├── var/
│ └── lib/
│ └── myapp/
│ └── data.db
└── sbin/
└── myapp
# capOS: per-process capability set (no global namespace)
Process "myapp" sees:
"config" → Directory(read-only, scoped to myapp's config files)
"data" → Directory(read-write, scoped to myapp's data files)
"state" → Namespace(read-write, scoped to myapp's store objects)
"log" → Console cap
"api" → HttpEndpoint cap
The process doesn’t know or care about the backing storage layout. It just uses the capabilities it was granted.
Configuration
Build-Time Config (Boot Manifest)
The system manifest is authored at build time. The human-writable source
could be any format — TOML, CUE, or even a Makefile target that generates
the capnp binary. What matters is that it compiles to a SystemManifest
capnp message baked into the ISO.
Example source (TOML, compiled to capnp by a build tool):
[services.virtio-net]
binary = "virtio-net"
restart = "always"
caps = [
{ name = "device_mmio", source = { kernel = "device_mmio" } },
{ name = "interrupt", source = { kernel = "interrupt" } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["nic"]
[services.net-stack]
binary = "net-stack"
restart = "always"
caps = [
{ name = "nic", source = { service = { service = "virtio-net", export = "nic" } } },
{ name = "timer", source = { kernel = "timer" } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["net"]
[services.fat-fs]
binary = "fat-fs"
restart = "always"
caps = [
{ name = "blk", source = { service = { service = "usb-storage", export = "block-device" } } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["root-dir"]
[services.my-app]
binary = "my-app"
restart = "on-failure"
caps = [
{ name = "api", source = { service = { service = "http-service", export = "api" } } },
{ name = "docs", source = { service = { service = "fat-fs", export = "root-dir" } } },
{ name = "data", source = { service = { service = "store", export = "namespace" } } },
{ name = "log", source = { kernel = "console" } },
]
A build tool validates this against the capnp schemas (does virtio-net
actually export "nic"? does http-service support endpoint() minting?)
and produces the binary manifest.
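The export check can be sketched as a pass over the service graph: every cap sourced from another service must name one of that service's declared exports. The data model here (Service, validate) is hypothetical — the real tool validates against the capnp schemas:

```rust
use std::collections::HashMap;

// Sketch of static manifest validation: each `source.service` reference
// must resolve to a declared export of the named service.
struct Service {
    exports: Vec<&'static str>,
    needs: Vec<(&'static str, &'static str)>, // (service name, export name)
}

fn validate(services: &HashMap<&str, Service>) -> Result<(), String> {
    for (name, svc) in services {
        for (dep, export) in &svc.needs {
            let ok = services
                .get(dep)
                .map_or(false, |d| d.exports.contains(export));
            if !ok {
                return Err(format!("{name}: no export '{export}' on service '{dep}'"));
            }
        }
    }
    Ok(())
}
```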
Runtime Config (via Store)
Once the store service is running, configuration can be stored there and updated without rebuilding the ISO. The store is just another capability — a config-management service could watch for changes and signal services to reload.
Connection to Network Transparency
If capabilities are the only abstraction, and capnp is the only wire format, then the transport is irrelevant:
- Local IPC: capnp message copied between address spaces by kernel
- Local store: capnp message written to block device
- Remote IPC: capnp message sent over TCP to another machine
- Remote store: capnp message fetched from a remote store service
A capability reference doesn’t encode where the backing service lives. The kernel (or a proxy) handles routing. This means:
- A Directory cap could be backed by local FAT or a remote 9P server
- A Namespace cap could be backed by local storage or a remote store
- A Fetch cap could route through a local HTTP service or a remote proxy
- A ProcessSpawner cap could spawn locally or on a remote machine
The system manifest could describe services that run on different machines, and the capability graph spans the network. This is the “network transparency” item in the roadmap — it falls out naturally from the model.
Persistence of the Capability Graph
The live capability graph (which process holds which caps) is ephemeral — it exists in kernel memory and is lost on reboot. The system manifest describes the intended graph, and init rebuilds it on each boot.
For true persistence (resume after reboot without re-initializing):
- Each service serializes its state to the store before shutdown
- On next boot, the manifest includes “restore from store hash X” hints
- Services read their saved state from the store and resume
This is application-level persistence, not kernel-level. The kernel doesn’t snapshot the capability graph — services are responsible for their own state. This avoids the complexity of EROS-style transparent persistence while still allowing stateful services.
Phases
Phase 1: Boot Manifest (parallel with Stage 4)
- Define SystemManifest schema in schema/
- Build tool (tools/mkmanifest) that compiles system.cue into a capnp-encoded manifest and packs it into the ISO as a Limine module
- Kernel parses the manifest and currently creates one process per ServiceEntry
- Kernel passes the manifest to init as bytes or a Manifest capability without interpreting the child service graph in the system-spawn.cue transition path
- Init becomes a generic manifest executor instead of a demo parser for the system-spawn.cue transition path
- No persistent storage yet — boot image is the only data source
Phase 2: File I/O Interfaces in Schema (parallel with Stage 6)
Depends on: IPC (Stage 6) for cross-process cap transfer.
- Add BlockDevice, File, Directory, DirEntry, SharedBuffer to schema/capos.capnp
- Implement kernel Endpoint and RECV/RETURN SQE opcodes
- Capability transfer in IPC replies (RETURN SQE xfer_caps installs caps in the caller’s table)
- Demo: two-process file server (in-memory File/Directory service + client)
Phase 3: RAM-backed Store (after Phase 2)
Depends on: IPC (Stage 6) for cross-process store access.
- Implement Store and Namespace as a userspace service
- Backed by RAM (no disk driver yet; data lost on reboot)
- Services can store and retrieve capnp objects at runtime
- Demonstrates the naming model without requiring a block device driver
- Namespace.sub() returns new caps via IPC cap transfer
Phase 4: BlockDevice Drivers and Filesystem (after virtio infrastructure)
- virtio-blk driver (userspace, reuses virtqueue infrastructure from the networking smoke test)
- BlockDevice trait implementation
- FAT filesystem service: wraps BlockDevice, exports Directory/File caps
- SharedBuffer integration for bulk reads (depends on Stage 6 MemoryObject)
- Store service uses BlockDevice for persistence
- System state survives reboot via store + manifest restore hints
Phase 5: Network Store (after networking)
- Store service can replicate to or fetch from a remote store
- Capability references transparently span machines
- Directory cap backed by a remote filesystem (9P-style)
Relationship to Other Proposals
- Networking proposal — the NIC driver and net stack are services described in the manifest, not hardcoded. The store could be backed by network storage once networking works. A remote Directory cap (9P over capnp) reuses the same File/Directory interfaces.
- Service architecture proposal — the manifest replaces code-as-config for init. ProcessSpawner, supervision, and cap export work as described there, but driven by manifest data instead of compiled Rust code. IPC Endpoints are the mechanism for service export.
- Capability model — IPC cap transfer (Endpoint + RETURN SQE) is the mechanism that makes open() and resolve() work. SharedBuffer is the bulk data path that makes file I/O practical. Both are tracked in ROADMAP.md Stage 6.
Open Questions
- Manifest validation. How much can the build tool verify statically? Cap export names depend on the runtime behavior of services. Should services declare their exports in their own metadata (like a package manifest)?
- Schema evolution. When a service’s capnp interface changes, stored objects referencing the old schema need migration. Cap’n Proto has backwards-compatible schema evolution, but breaking changes need a story.
- Garbage collection. A content-addressed store accumulates unreferenced objects. Who GCs? A separate service with Store read + delete authority? Reference counting in the namespace layer?
- Large objects. Storing multi-megabyte binaries as single capnp Data fields is wasteful (capnp allocates contiguously). SharedBuffer partially addresses this for I/O, but the Store’s put/get interface still takes Data. Options: chunked storage (Merkle tree of hashes), a streaming Blob interface, or SharedBuffer-aware Store methods.
- Trust model for the manifest. The boot manifest has full authority to define the system. Who signs it? How do you prevent a tampered ISO from granting excessive caps? Secure boot integration?
- File locking and concurrent access. Multiple processes opening the same file through the same filesystem service need coordination. Options: mandatory locking in the filesystem service (rejecting conflicting opens), advisory locking via a separate Lock capability, or single-writer enforcement at the Directory level (open with an exclusive flag).
- RETURN+RECV atomicity. When a server posts a RETURN SQE followed by a RECV SQE, there must be no window where a client call can arrive while the server isn’t listening. SQE LINK chaining (RETURN → RECV) should provide this atomicity — the kernel processes both SQEs as a unit.
Proposal: Userspace TCP/IP Networking
How capOS gets from “kernel boots” to “userspace process opens a TCP connection.”
This document has two parts: a kernel-internal smoke test (actionable now) and a userspace networking architecture (blocked on Stages 4-6).
Part 1: Kernel-Internal Networking (Phase A)
Prove that capOS can send and receive TCP/IP traffic. Everything runs in-kernel — no IPC, no capability syscalls, no multiple processes needed.
What’s Needed
- PCI enumeration — scan config space, find virtio-net device. Uses the standalone PCI/PCIe subsystem described in cloud-deployment-proposal.md Phase 4 (~200 lines of glue code on top of the shared PCI infrastructure)
- virtio-net driver — init virtqueues, send/receive raw Ethernet frames. Use the virtio-drivers crate or implement manually (~600-800 lines)
- Timer — PIT or LAPIC timer for smoltcp’s poll loop (retransmit timeouts, Instant::now() support). Not a full scheduler — just a monotonic clock (~50-100 lines)
- smoltcp integration — implement the phy::Device trait over the in-kernel driver, create an Interface with a static IP, ICMP ping, then TCP
- QEMU flags — add -netdev user,id=n0 -device virtio-net-pci,netdev=n0 to the Makefile
Milestones
- Ping: ICMP echo to QEMU gateway (10.0.2.2 with default user-mode net)
- HTTP: TCP connection to a host-side server, send GET, receive response
Estimated Scope
~1000-1500 lines of new kernel code. ~200 more for TCP on top of ping.
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| smoltcp | TCP/IP stack | yes (features: medium-ethernet, proto-ipv4, socket-tcp) |
| virtio-drivers | virtio device abstraction | yes (optional — can implement manually) |
Timer Source Decision
Resolved: PIT is already configured at 100 Hz from Stage 5. A monotonic
TICK_COUNT (AtomicU64 in kernel/src/arch/x86_64/context.rs) increments on
each timer interrupt, providing ~10ms resolution — sufficient for TCP timeouts.
Switch to LAPIC timer when SMP lands (see smp-proposal.md
Phase A).
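The tick clock described above can be sketched in a few lines. The names here mirror the description (TICK_COUNT, 100 Hz), but the actual kernel code in kernel/src/arch/x86_64/context.rs may differ:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the 100 Hz monotonic tick clock: an AtomicU64 bumped from the
// timer interrupt, read back as nanoseconds for protocol timeouts.
static TICK_COUNT: AtomicU64 = AtomicU64::new(0);
const NS_PER_TICK: u64 = 10_000_000; // 100 Hz => 10 ms per tick

/// Called from the timer interrupt handler (sketch).
fn timer_interrupt() {
    TICK_COUNT.fetch_add(1, Ordering::Relaxed);
}

/// Monotonic time with ~10 ms resolution — coarse, but enough for
/// TCP retransmit timeouts.
fn now_ns() -> u64 {
    TICK_COUNT.load(Ordering::Relaxed) * NS_PER_TICK
}
```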
QEMU Network Config
| Config | Use case |
|---|---|
| -netdev user,id=n0 -device virtio-net-pci,netdev=n0 | Default: NAT, guest reaches host |
| Add hostfwd=tcp::5555-:80 to netdev | Forward host port to guest |
Part 2: Userspace Networking Architecture (Phases B+C)
Blocked on: Stage 4 (Capability Syscalls), Stage 5 (Scheduling), Stage 6 (IPC + Capability Transfer).
Architecture
+--------------------------------------------------+
| Application Process |
| holds: TcpSocket cap, UdpSocket cap, ... |
| calls: connect(), send(), recv() via capnp |
+---------------------------+----------------------+
| IPC (capnp messages)
+---------------------------v----------------------+
| Network Stack Process (userspace) |
| smoltcp TCP/IP stack |
| holds: NIC cap (from driver), Timer cap |
| implements: TcpSocket, UdpSocket, Dns caps |
+---------------------------+----------------------+
| IPC (capnp messages)
+---------------------------v----------------------+
| NIC Driver Process (userspace) |
| virtio-net driver |
| holds: DeviceMmio cap, Interrupt cap |
| implements: Nic cap |
+---------------------------+----------------------+
| capability syscalls
+---------------------------v----------------------+
| Kernel |
| DeviceMmio cap: maps BAR into driver process |
| Interrupt cap: routes virtio IRQ to driver |
| Timer cap: provides monotonic clock |
+--------------------------------------------------+
Three separate processes, each with minimal authority:
- NIC driver – only has access to the specific virtio-net device registers and its interrupt line. Implements the Nic interface.
- Network stack – holds the Nic capability from the driver. Runs smoltcp. Implements higher-level socket interfaces.
- Application – holds socket capabilities from the network stack. Cannot touch the NIC or raw packets directly.
Prerequisites
| Prerequisite | Roadmap Stage | Why |
|---|---|---|
| Capability syscalls | Stage 4 (sync path done) | All resource access via cap invocations |
| Scheduling + preemption | Stage 5 (core done) | Network I/O requires blocking/waking |
| IPC + capability transfer | Stage 6 | Cross-process cap calls |
| Interrupt routing to userspace | New kernel primitive | NIC driver receives IRQs |
| MMIO mapping capability | New kernel primitive | NIC driver accesses device registers |
Phase B: Capability Interfaces
- Define the networking schema (Nic, TcpSocket, etc.) in schema/net.capnp
- Implement Nic and NetworkManager as kernel-internal CapObjects wrapping the Phase A code
- Verify capability-based invocation works end-to-end in kernel
Phase C: Userspace Decomposition
- Move NIC driver into a userspace process
- Move network stack into a separate userspace process
- Application process uses socket capabilities via IPC
- Full capability isolation achieved
Cap’n Proto Schema (draft — will evolve with IPC implementation)
interface Nic {
transmit @0 (frame :Data) -> ();
receive @1 () -> (frame :Data);
macAddress @2 () -> (addr :Data);
linkStatus @3 () -> (up :Bool);
}
interface DeviceMmio {
map @0 (bar :UInt8) -> (virtualAddr :UInt64, size :UInt64);
unmap @1 (virtualAddr :UInt64) -> ();
}
interface Interrupt {
wait @0 () -> ();
ack @1 () -> ();
}
interface Timer {
now @0 () -> (ns :UInt64);
sleep @1 (ns :UInt64) -> ();
}
interface TcpSocket {
connect @0 (addr :Data, port :UInt16) -> ();
send @1 (data :Data) -> (bytesSent :UInt32);
recv @2 (maxLen :UInt32) -> (data :Data);
close @3 () -> ();
}
interface NetworkManager {
createTcpListener @0 () -> (listener :TcpListener);
createUdpSocket @1 () -> (socket :UdpSocket);
getConfig @2 () -> (addr :Data, netmask :Data, gateway :Data);
}
Open Questions
- DMA memory management. Dedicated DmaAllocator capability vs extending FrameAllocator with allocDma?
- Blocking model. Kernel blocks the caller on the IPC channel vs return “would block” vs both?
- Buffer ownership. Copy into the IPC message vs shared memory vs capability lending?
References
Crates
- smoltcp — no_std TCP/IP stack
- virtio-drivers — no_std virtio drivers (rCore project)
Specs
- virtio 1.2 spec — Section 5.1 covers network device
- OSDev Wiki: PCI, Virtio
Prior Art
- rCore — virtio-drivers + smoltcp
- Redox smolnetd — microkernel userspace net stack
- Fuchsia Netstack3 — capability-oriented, userspace, Rust
- Hermit — unikernel with smoltcp + virtio-net
QEMU
Proposal: Error Handling for Capability Invocations
How capOS communicates errors from capability calls back to userspace processes.
This proposal defines a two-level error model: transport errors (the invocation mechanism itself failed) and application errors (the capability processed the request and returned a structured error). The design aligns with Cap’n Proto’s own exception model and the patterns used by seL4, Zircon, and other capability systems.
Note (2025-06): This proposal was written when cap_call was the synchronous invocation syscall. Since then, cap_call has been replaced by the shared-memory capability ring + cap_enter syscall. The two-level error model and CapException schema remain valid, but the delivery mechanism changes: transport errors and application errors are communicated through CQE result fields (status code + result buffer), not syscall return values. The “Syscall Return Convention” section below describes the original cap_call convention; a future revision should map these error codes to CQE status fields instead.
Current CQE Error Namespace
The capability ring uses signed 32-bit CapCqe.result values. Non-negative
values are opcode-specific success results; negative values are kernel transport
errors defined in capos-config/src/ring.rs:
| Code | Name | Meaning |
|---|---|---|
| -1 | CAP_ERR_INVALID_REQUEST | Malformed request metadata or an opcode value not reserved in the ABI. |
| -2 | CAP_ERR_INVALID_PARAMS_BUFFER | SQE parameter buffer is unmapped, out of range, or not readable. |
| -3 | CAP_ERR_INVALID_RESULT_BUFFER | SQE result buffer is unmapped, out of range, or not writable. |
| -4 | CAP_ERR_INVOKE_FAILED | Capability lookup or invocation failed before a successful result was produced. |
| -5 | CAP_ERR_UNSUPPORTED_OPCODE | Opcode is reserved in the ABI but not yet dispatched. Currently returned for CAP_OP_FINISH; CAP_OP_RELEASE has kernel dispatch and reports stale/non-owned caps as request/invoke failures. |
| -6 | CAP_ERR_TRANSFER_NOT_SUPPORTED | Transfer mode or sideband descriptor layout is recognized as unsupported by this kernel. |
| -7 | CAP_ERR_INVALID_TRANSFER_DESCRIPTOR | xfer_cap_count descriptor layout is malformed or contains reserved bits. |
| -8 | CAP_ERR_TRANSFER_ABORTED | Transaction-in-progress transfer failed and must not produce partial capability state. |
This is deliberately a small transport namespace. Interface-specific failures should be encoded in the result payload once the target capability successfully handles the request.
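A userspace client would split a CQE result along exactly this line: non-negative means success, negative means one of the transport codes above. A minimal decoding sketch (the enum and function are hypothetical helpers, not part of capos-config):

```rust
// Sketch of splitting CapCqe.result into the two-level model: non-negative
// values are opcode-specific success results, negative values are the
// kernel transport errors tabulated above.
#[derive(Debug, PartialEq)]
enum CqeStatus {
    Ok(i32),
    TransportError(&'static str),
}

fn decode_cqe_result(result: i32) -> CqeStatus {
    if result >= 0 {
        return CqeStatus::Ok(result);
    }
    CqeStatus::TransportError(match result {
        -1 => "CAP_ERR_INVALID_REQUEST",
        -2 => "CAP_ERR_INVALID_PARAMS_BUFFER",
        -3 => "CAP_ERR_INVALID_RESULT_BUFFER",
        -4 => "CAP_ERR_INVOKE_FAILED",
        -5 => "CAP_ERR_UNSUPPORTED_OPCODE",
        -6 => "CAP_ERR_TRANSFER_NOT_SUPPORTED",
        -7 => "CAP_ERR_INVALID_TRANSFER_DESCRIPTOR",
        -8 => "CAP_ERR_TRANSFER_ABORTED",
        _ => "unknown transport error",
    })
}
```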
Transfer-related transport mapping (3.6.0 ABI slice)
- CAP_ERR_TRANSFER_NOT_SUPPORTED is used for transfer-bearing SQEs that the kernel currently dispatches but does not yet process (xfer_cap_count != 0 on kernels where sideband transfer is off).
- CAP_ERR_INVALID_TRANSFER_DESCRIPTOR is used for structurally valid, dispatched transfer SQEs whose transfer metadata is malformed:
  - descriptor transfer_mode is not exactly CAP_TRANSFER_MODE_COPY or CAP_TRANSFER_MODE_MOVE;
  - any descriptor reserved bits are set;
  - any descriptor _reserved0 field is non-zero;
  - descriptor region placement (addr + len) is misaligned;
  - descriptor range overflows or cannot be safely bounded.
- CAP_ERR_TRANSFER_ABORTED is reserved for transaction failure after partial transfer side effects are prepared and must not be observed (all-or-nothing rollback boundary).
- CAP_ERR_INVALID_REQUEST remains for non-transfer transport malformation (unsupported opcodes today, unsupported SQE fields not part of the transfer path, and malformed result/payload buffer pairs).
Problem Statement
Currently, cap_call returns u64::MAX on any error and prints the details
to the kernel serial console. The userspace process receives no information
about what went wrong – it cannot distinguish “invalid capability ID” from
“method not implemented” from “out of memory inside the service.”
Every other capability system separates transport-level errors (bad handle, message validation failure) from application-level errors (the service processed the request and returned a meaningful error). capOS needs both.
Background: How Other Systems Do This
Cap’n Proto RPC Protocol
The Cap’n Proto RPC specification defines an Exception type in rpc.capnp:
struct Exception {
reason @0 :Text;
type @3 :Type;
enum Type {
failed @0; # deterministic failure, retrying won't help
overloaded @1; # temporary resource exhaustion, retry with backoff
disconnected @2; # connection to a required capability was lost
unimplemented @3; # method not supported by this server
}
trace @4 :Text;
}
These four types describe client response strategy, not error semantics.
The capnp Rust crate maps them to capnp::ErrorKind::{Failed, Overloaded, Disconnected, Unimplemented}.
Cap’n Proto’s official philosophy (from KJ library and Kenton Varda’s writings): exceptions are for infrastructure failures, not application semantics. Application-level errors should be modeled as unions in method return types.
Capability OS Error Models
| System | Transport errors | Application errors |
|---|---|---|
| seL4 | seL4_Error enum (11 values) from syscall return | In-band via IPC message payload (user-defined) |
| Zircon | zx_status_t (signed i32, ~30 values) from syscall | FIDL per-method error type (union in return) |
| EROS/Coyotos | Kernel-generated invocation exceptions | OPR0.ex flag + exception code in reply payload |
| Plan 9 (9P) | Connection loss (no in-band transport error) | Rerror message with UTF-8 error string |
| Genode | Ipc_error exception | Declared C++ exceptions via GENODE_RPC_THROW |
Common pattern: a small kernel error code set for transport failures, combined with service-specific typed errors for application failures.
POSIX errno: Why Not
POSIX errno is a global flat namespace of ~100 integers that conflates
transport errors (EBADF) with application errors (ENOENT). In a
capability system:
- EACCES/EPERM don’t apply – if you have the capability, you have permission; if you don’t, you can’t even name the resource.
- A global error namespace conflicts with typed interfaces where errors should be scoped to the interface.
- No room for structured information (which argument was invalid, how much memory was needed).
- Not composable across trust boundaries – a callee’s errno has no meaning in the caller’s address space without explicit serialization.
Design
Principle: Two Levels, One Wire Format
Level 1 – Transport errors are returned in the syscall return value.
These indicate that the capability invocation mechanism itself failed before
the target CapObject was reached. No result buffer is written.
Level 2 – Application errors are returned as capnp-serialized messages in the result buffer. The capability was found and dispatched; the implementation returned a structured error. The syscall return value distinguishes this from a successful result.
Both levels use Cap’n Proto serialization for the error payload (level 2 always, level 1 when there’s a result buffer available). This keeps one parsing path in userspace.
Syscall Return Convention
The cap_call syscall (number=2) currently returns:
- 0..N – success, N bytes written to the result buffer
- u64::MAX – error (undifferentiated)
New convention:
| Return value | Meaning |
|---|---|
| 0..=(u64::MAX - 256) | Success. Value = number of bytes written to result buffer. |
| u64::MAX | Transport error: invalid capability ID or stale generation. |
| u64::MAX - 1 | Transport error: invalid user buffer (bad pointer, unmapped, not writable). |
| u64::MAX - 2 | Transport error: params too large (exceeds MAX_CAP_CALL_PARAMS). |
| u64::MAX - 3 | Application error: the capability returned an error. A CapException message has been written to the result buffer. The message length is encoded in the low 32 bits of the value at result_ptr (the capnp message itself). |
| u64::MAX - 4 | Application error, but the result buffer was too small or NULL. The error detail is lost; the caller should retry with a larger buffer or treat it as an opaque failure. |
The transport error codes are a small closed set (like seL4’s 11 values). New transport errors can be added, but the set should remain small and stable.
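The reserved band can be written down as constants plus a single classifying predicate; a minimal sketch (the `ECAP_*` names follow the handler excerpt later on this page, but the exact identifiers are illustrative):

```rust
// Illustrative constants for the proposed return convention.
// Success values occupy 0..=(u64::MAX - 256); the top 255 values are reserved
// for transport and application error codes.
pub const ECAP_NOT_FOUND: u64 = u64::MAX; // invalid cap ID or stale generation
pub const ECAP_BAD_BUFFER: u64 = u64::MAX - 1; // bad pointer, unmapped, not writable
pub const ECAP_PARAMS_TOO_LARGE: u64 = u64::MAX - 2; // exceeds MAX_CAP_CALL_PARAMS
pub const ECAP_APPLICATION_ERROR: u64 = u64::MAX - 3; // CapException in result buffer
pub const ECAP_APPLICATION_ERROR_NO_BUFFER: u64 = u64::MAX - 4; // error detail lost

/// Any value above the success band is an error (transport or application).
pub fn is_cap_error(ret: u64) -> bool {
    ret > u64::MAX - 256
}
```

This is the same check Phase 1 of the migration path relies on: legacy callers that tested `== u64::MAX` move to the band test without learning the individual codes.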
CapException Schema
Add to schema/capos.capnp:
```capnp
enum ExceptionType {
  failed @0;
  overloaded @1;
  disconnected @2;
  unimplemented @3;
}

struct CapException {
  type @0 :ExceptionType;
  message @1 :Text;
}
```
This mirrors Cap’n Proto RPC’s Exception struct. The four types match
capnp::ErrorKind and describe client response strategy:
- failed – deterministic failure, retrying won’t help. Covers invalid arguments, invariant violations, deserialization errors, and any `capnp::ErrorKind` variant not in the other three categories.
- overloaded – temporary resource exhaustion (out of frames, table full). Client may retry with backoff.
- disconnected – the capability’s backing resource is gone (device removed, process exited). Client should re-acquire the capability.
- unimplemented – unknown method ID for this interface. Client should not retry.
The message field is a human-readable string for diagnostics/logging.
It must not contain security-sensitive information (internal pointers, kernel
addresses) since it crosses the kernel-user boundary.
Application-Level Errors in Interface Schemas
Following Cap’n Proto’s philosophy, expected error conditions that a caller should handle programmatically belong in the method return type, not in the exception mechanism.
Example – FrameAllocator can legitimately run out of memory:
```capnp
struct AllocResult {
  union {
    ok @0 :UInt64;  # physical address
    outOfMemory @1 :Void;
  }
}

interface FrameAllocator {
  allocFrame @0 () -> (result :AllocResult);
  freeFrame @1 (physAddr :UInt64) -> ();
  allocContiguous @2 (count :UInt32) -> (result :AllocResult);
}
```
The caller can pattern-match on the result union without parsing an exception. This is the Zircon/FIDL model: transport errors at the syscall layer, application errors as typed return values.
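The caller-side ergonomics can be modeled with a plain Rust enum standing in for the generated union reader (the capnp-generated API differs in detail; this `AllocResult` is a hand-written stand-in, and `alloc_or_fallback` is a hypothetical caller):

```rust
// Hand-written stand-in for the capnp-generated AllocResult union reader.
pub enum AllocResult {
    Ok(u64), // physical address
    OutOfMemory,
}

// Hypothetical caller: out-of-memory is an expected, typed outcome the
// caller matches on directly, not an exception parsed out of a
// CapException payload.
pub fn alloc_or_fallback(result: AllocResult) -> Option<u64> {
    match result {
        AllocResult::Ok(phys) => Some(phys),
        AllocResult::OutOfMemory => None, // caller decides: evict, retry smaller, etc.
    }
}
```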
When to use each:
| Situation | Mechanism |
|---|---|
| Bad cap ID, stale generation, bad buffer | Transport error (syscall return code) |
| Deserialization failure, unknown method | CapException with failed/unimplemented |
| Temporary resource exhaustion in dispatch | CapException with overloaded |
| Expected domain-specific error | Union in method return type |
| Bug in capability implementation | CapException with failed |
Kernel Implementation
CapObject trait change
The ring SQE does not carry a caller-supplied interface ID. The trait shape below keeps interface selection out of capability implementations because each capability entry owns one public interface:
```rust
pub trait CapObject: Send + Sync {
    fn interface_id(&self) -> u64;
    fn label(&self) -> &str;
    fn call(
        &self,
        method_id: u16,
        params: &[u8],
        result: &mut [u8],
        reply_scratch: &mut dyn ReplyScratch,
    ) -> capnp::Result<CapInvokeResult>;
}
```
Implementations serialize directly into the caller’s result buffer and return
a completion containing the number of bytes written, or Pending for async
endpoint calls. Dispatch uses the interface assigned to the target capability
entry; normal CALL SQEs do not need to repeat that interface ID. capnp::Error
carries ErrorKind with the four RPC exception types. The kernel’s dispatch
handler converts Err(capnp::Error) into a serialized CapException message
and writes it to the result buffer.
Syscall handler changes
In cap_call(), the error path changes from:
```rust
Err(e) => {
    kprintln!("cap_call: ... error: {}", e);
    u64::MAX
}
```
to:
```rust
Err(CapError::NotFound) => ECAP_NOT_FOUND,
Err(CapError::StaleGeneration) => ECAP_NOT_FOUND,
Err(CapError::InvokeError(e)) => {
    // Serialize CapException to result buffer
    let exception_bytes = serialize_cap_exception(&e);
    if result_ptr != 0 && result_capacity >= exception_bytes.len() {
        copy_to_user(result_ptr, &exception_bytes);
        ECAP_APPLICATION_ERROR
    } else {
        ECAP_APPLICATION_ERROR_NO_BUFFER
    }
}
```
The serialize_cap_exception function maps capnp::ErrorKind to
ExceptionType:
| `capnp::ErrorKind` | ExceptionType |
|---|---|
| `Failed` | failed |
| `Overloaded` | overloaded |
| `Disconnected` | disconnected |
| `Unimplemented` | unimplemented |
| All other variants (deserialization, validation) | failed |
This matches how capnp-rpc maps exceptions to the wire format.
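The mapping above is a total function, which is what makes the wire format stable; a sketch with locally defined stand-ins (the real `ErrorKind` lives in the `capnp` crate and has additional variants, all of which fall through to `Failed`):

```rust
// Local stand-ins: the real kinds live in the capnp crate, and
// ExceptionType is generated from schema/capos.capnp.
#[derive(Debug, PartialEq)]
pub enum ErrorKind {
    Failed,
    Overloaded,
    Disconnected,
    Unimplemented,
    Other, // placeholder for every remaining variant
}

#[derive(Debug, PartialEq)]
pub enum ExceptionType {
    Failed,
    Overloaded,
    Disconnected,
    Unimplemented,
}

pub fn exception_type(kind: ErrorKind) -> ExceptionType {
    match kind {
        ErrorKind::Overloaded => ExceptionType::Overloaded,
        ErrorKind::Disconnected => ExceptionType::Disconnected,
        ErrorKind::Unimplemented => ExceptionType::Unimplemented,
        // Failed plus every other variant (deserialization, validation)
        // collapses to Failed, matching the table above.
        _ => ExceptionType::Failed,
    }
}
```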
Userspace API
The init crate (and future userspace libraries) wraps cap_call in a
helper that interprets the return value:
```rust
pub enum CapCallResult {
    Ok(Vec<u8>),
    Exception(ExceptionType, String),
    TransportError(TransportError),
}

pub enum TransportError {
    InvalidCapability,
    InvalidBuffer,
    ParamsTooLarge,
}

pub fn cap_call(
    cap_id: u32,
    method_id: u16,
    params: &[u8],
    result_buf: &mut [u8],
) -> CapCallResult {
    let ret = sys_cap_call(cap_id, method_id, params, result_buf);
    match ret {
        ECAP_NOT_FOUND => CapCallResult::TransportError(TransportError::InvalidCapability),
        ECAP_BAD_BUFFER => CapCallResult::TransportError(TransportError::InvalidBuffer),
        ECAP_PARAMS_TOO_LARGE => CapCallResult::TransportError(TransportError::ParamsTooLarge),
        ECAP_APPLICATION_ERROR => {
            let (typ, msg) = deserialize_cap_exception(result_buf);
            CapCallResult::Exception(typ, msg)
        }
        ECAP_APPLICATION_ERROR_NO_BUFFER => {
            CapCallResult::Exception(ExceptionType::Failed, String::new())
        }
        n => CapCallResult::Ok(result_buf[..n as usize].to_vec()),
    }
}
```
Future: Batched Calls
When capOS adds batched capability invocations (async rings, pipelining), each request in the batch gets its own result status. The same two-level model applies per-request:
- Transport error for the batch envelope (invalid ring descriptor, bad capability table) fails the whole batch.
- Per-request transport errors (individual bad cap_id) fail that request.
- Application errors are per-request, written to each request’s result slot.
This matches how NFS compound operations and JSON-RPC batch requests work: a transport error on the batch vs per-operation results.
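The per-request split could take roughly this shape (entirely hypothetical – batching is future work and no such types exist in the tree):

```rust
// Hypothetical per-request completion status for a future batched ring.
pub enum RequestStatus {
    Ok { bytes_written: u32 },  // result lives in this request's slot
    TransportError(u32),        // e.g. a bad cap_id fails only this request
    ApplicationError,           // CapException written to this request's slot
}

pub struct BatchCompletion {
    pub results: Vec<RequestStatus>, // one entry per submitted request
}

impl BatchCompletion {
    /// True when every request in the batch completed successfully.
    pub fn all_ok(&self) -> bool {
        self.results.iter().all(|r| matches!(r, RequestStatus::Ok { .. }))
    }
}

// A batch-envelope transport error (invalid ring descriptor, bad capability
// table) would be reported before any BatchCompletion exists, failing the
// whole batch - the two levels never mix within one completion.
```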
What This Does NOT Cover
- Error logging/tracing infrastructure. How errors get collected, aggregated, or displayed is a separate concern. The kernel currently prints to serial; a future `ErrorLog` capability could capture structured error streams.
- Retry policy. The `ExceptionType` hints at retry strategy (overloaded -> retry, failed -> don’t), but the retry logic itself belongs in userspace libraries, not the kernel.
- Error propagation across capability chains. When capability A calls capability B, which calls capability C, and C fails – how does the error propagate back through A? This is a concern for the IPC and capability transfer stages (Stage 6+). The current proposal handles the single-hop case.
- Transactional semantics. Whether a failed operation has side effects (partial writes, allocated-but-not-returned frames) is per-capability semantics, not a kernel-level concern.
Migration Path
Phase 1: Transport error codes (minimal, no schema changes)
Change cap_call to return distinct error codes instead of u64::MAX for
all failures. Update the init crate to interpret them. No new schema types
needed – application errors still use u64::MAX - 3 but without a structured
payload (treated as opaque failure).
This is backward-compatible: existing userspace code that checks == u64::MAX
sees different values for different errors, but any >= u64::MAX - 255 check
catches all errors.
Phase 2: CapException serialization
Add ExceptionType and CapException to the schema. Implement
serialize_cap_exception in the kernel. Update init to deserialize and
display errors. Now userspace gets the exception type and message string.
Phase 3: Per-interface application errors
As interfaces mature, add typed error unions to method return types for
expected error conditions. FrameAllocator::allocFrame returns
AllocResult instead of bare UInt64. The exception mechanism remains for
unexpected failures.
Design Rationale
Why mirror capnp RPC’s Exception type instead of inventing our own?
Cap’n Proto already defines a well-thought-out exception taxonomy. The four
types (failed, overloaded, disconnected, unimplemented) map directly to
capnp::ErrorKind in Rust. Using the same vocabulary means capOS capabilities
can eventually participate in capnp RPC networks without translation. It also
means the Rust compiler enforces exhaustive matching on ErrorKind variants
that matter.
Why not put error codes in the syscall return value only (like seL4)?
seL4’s 11 error codes work because seL4 kernel objects are simple and
fixed-function. capOS capabilities are arbitrary typed interfaces – a file
system, a network stack, a GPU driver. The error vocabulary is open-ended.
Encoding all possible errors as syscall return values would either require an
ever-growing enum (fragile) or lose information (back to errno’s problems).
The capnp-serialized CapException in the result buffer gives unbounded
expressiveness without changing the syscall ABI.
Why not use capnp exceptions for everything (skip the transport error codes)?
Because transport errors happen before the capability is reached. There’s
no CapObject to serialize an exception. The kernel would have to synthesize
a capnp message on behalf of a non-existent capability, which is wasteful and
semantically wrong. A small integer return code is cheaper and more honest
about what happened.
Why not define a generic Result(Ok) wrapper in the schema?
Cap’n Proto generics only bind to pointer types (Text, Data, structs, lists,
interfaces), not to primitives (UInt32, Bool). A Result(UInt64) for
allocFrame wouldn’t work. Per-method result structs with unions are more
flexible and don’t hit this limitation. The cost is a bit more schema
boilerplate, which is acceptable given that capOS has a small number of
interfaces.
Why string-based messages (like Plan 9) instead of structured error fields?
String messages are adequate for diagnostics and logging. Structured error
data belongs in the typed return unions (Phase 3), where the schema enforces
what fields exist. Putting structured data in CapException would duplicate
the schema’s job and encourage using exceptions for flow control, which
Cap’n Proto explicitly warns against.
Security Review and Formal Verification Proposal
How to reason about the correctness and security of the capOS kernel and its
trust boundaries in a way that fits a research OS – pragmatic tooling now,
targeted verification where it pays off, no aspirational seL4-style
full-kernel proofs. The docs/research/sel4.md survey already concluded that
Isabelle/HOL-over-C verification does not transfer to Rust and that the
design constraints matter more than the proof artefact. This proposal
codifies that conclusion into a concrete tooling and process plan.
This proposal uses CWE for concrete vulnerability classes, CAPEC for attacker patterns, Rust language rules / unsafe-code guidance for low-level coding rules, Common Criteria protection-profile concepts for OS security functions, and capability-kernel practice (seL4/EROS-style invariants) for authority, IPC, object lifetime, and scheduler properties. Web-application checklists are not the baseline for OS design review.
Grounding sources:
- MITRE CWE for root-cause weakness labels: CWE-20 explicitly covers raw data, metadata, sizes, indexes, offsets, syntax, type, consistency, and domain rules; CWE also marks broad classes such as CWE-20 and CWE-400 as discouraged for final vulnerability mapping when a more precise child fits.
- MITRE CAPEC for attacker behavior, especially input manipulation (CAPEC-153), command injection (CAPEC-248), race exploitation (CAPEC-26 / CAPEC-29), and flooding/resource pressure (CAPEC-125).
- Rust Reference and Rust 2024 Edition Guide for unsafe-block and `unsafe_op_in_unsafe_fn` obligations.
- seL4 MCS and the existing capOS research notes for capability-authorized access to kernel objects and CPU time.
- Common Criteria General Purpose Operating System Protection Profile for OS access-control, security-function, trusted-channel/path, and user-data protection concepts. capOS is not trying to certify against it; the PP is a vocabulary check for what an OS security review should not omit.
1. Philosophy and Scope
capOS is explicitly a research OS whose design principle is “schema-first typed capabilities, minimal kernel, reuse the Rust ecosystem.” Three consequences shape this proposal:
- The schema is part of the TCB. A bug in the `.capnp` schema, or in the way generated code is patched for `no_std`, is exactly as dangerous as a bug in the kernel. The schema, the `capnpc` build pipeline, and the generated code all need review attention – not only hand-written kernel code.
- The kernel should stay small. “Everything else is a capability” means the TCB is naturally bounded. Verification effort scales with TCB size, so resisting kernel bloat is itself a security property.
- The interface is the permission. Access control lives in capnp method definitions and in userspace cap wrappers (a narrow cap is a different `CapObject`), not in kernel rights bitmasks. Review must confirm that the kernel never short-circuits this: no ambient authority, no method that bypasses `CapObject::call`, no syscall that exposes an object without a capability handle.
Non-goals:
- Full functional-correctness proof of the kernel à la seL4. Infeasible in Rust today, and the payoff is low for a research system whose surface area is still changing.
- Proving information-flow / confidentiality properties end-to-end.
- Certifying a specific configuration for external deployment.
2. Trust Boundaries and Threat Model
Enumerating the boundaries forces every future review to ask “which boundary does this change touch?” and picks out the code paths that matter.
Current boundaries
| Boundary | Who trusts whom | Code that enforces it |
|---|---|---|
| Ring 0 ↔ Ring 3 | kernel trusts nothing from user | kernel/src/mem/validate.rs, arch/x86_64/syscall.rs; exercised by init/ and demos/* |
| Kernel ↔ user pointer | kernel validates address + PTE perms | validate_user_buffer (single-threaded; TOCTOU-prone on SMP) |
| Manifest ↔ kernel | kernel parses capnp manifest at boot | capos-config::manifest, called from kmain |
| Build inputs ↔ TCB | kernel trusts schema/codegen/build artifacts | schema/capos.capnp, build.rs, Cargo.lock, Makefile |
| Host tools ↔ filesystem/process | tools must not let manifest/config input escape intended host boundaries | tools/mkmanifest, generators, CI scripts |
| ELF bytes ↔ kernel | kernel parses user ELF to map segments | capos-lib::elf |
| User ring ↔ kernel dispatch | kernel trusts no SQ state | kernel/src/cap/ring.rs |
| `CapObject::call` wire format | kernel trusts no params bytes | generated capnp decoders + impls |
| Process ↔ process IPC | kernel routes calls between mutually isolated address spaces and trusts neither side’s buffers | kernel/src/cap/endpoint.rs, kernel/src/cap/ring.rs, kernel/src/sched.rs |
| Device DMA ↔ physical memory (future) | kernel must constrain device memory access | PCI enumeration exists for the QEMU smoke path; virtio DMA submission is not implemented yet. See networking/cloud proposals. |
Attacker model
- Untrusted service binaries. Today’s services are checked into the repo, but the manifest pipeline is meant to load arbitrary binaries eventually. Assume every byte of a service’s SQEs, params buffers, result buffer pointers, and return addresses is attacker-controlled.
- Untrusted manifest. Once manifests are produced outside the repo (e.g. generated from CUE fragments, passed in as a Limine module), the manifest parser must reject every malformed input without panicking.
- Resource exhaustion. Once multiple mutually-untrusting services run, a service can attack by filling rings, endpoint queues, capability tables, frame pools, scratch arenas, logs, or CPU time. Boundedness and accounting are security properties, not performance polish.
- Build input drift. The schema/codegen path is already part of the TCB. External build inputs such as the bootloader checkout, Rust dependencies, capnp code generation, and generated-code patching must be reproducible enough that review can tell what changed.
- Host tooling input. Build tools and generators run with developer/CI filesystem access. Treat manifest/config-derived paths and command arguments as untrusted until bounded to the intended directory and execution context.
- Residual state and disclosure. Kernel logs, returned buffers, recycled frames, endpoint scratch space, and generated artifacts must not expose kernel pointers, stale bytes from another process, secrets, or build-system paths that increase attacker leverage.
- Hostile interrupts / preemption. The scheduler preempts at arbitrary points. Any kernel invariant that is only transiently true must be held under the right lock or with interrupts disabled.
- Out of scope (for now): physical attacks, speculative-execution side channels, malicious hardware, IOMMU bypass from DMA devices. These become in-scope once the driver stack lands; revisit the threat model then.
3. Tiered Approach
Four tiers, cheapest first. Each tier is independently useful, and later tiers assume earlier ones are in place.
Tier 1 – Hygiene and CI (cheap, high value)
These are the controls that make every other tier work. The initial GitHub
Actions baseline exists in .github/workflows/ci.yml; it runs formatting,
host tests, cargo build --features qemu, make capos-rt-check, and
make generated-code-check. The QEMU smoke job installs its own boot tools
and runs make plus make run, but remains non-blocking, so it is not yet a
required boot assertion.
- Continuous integration via GitHub Actions (or equivalent). Current baseline: `make fmt-check`, `cargo test-config`, `cargo test-ring-loom`, `cargo test-lib`, `cargo test-mkmanifest`, `cargo build --features qemu`, `make capos-rt-check`, and `make generated-code-check`. Remaining CI work: treat QEMU boot as a required CI gate once runtime flakiness is acceptable, then add the security policy jobs below.
- `cargo clippy --all-targets -- -D warnings` across workspace members, with a curated set of `clippy::pedantic`/`clippy::nursery` lints that pay off for kernel code (`clippy::undocumented_unsafe_blocks`, `clippy::missing_safety_doc`, `clippy::cast_possible_truncation`, etc.). Do NOT enable all of pedantic blindly – review each lint and either enable it or add a rationale comment.
- `cargo-deny` for license and advisory gating; `cargo-audit` for the RustSec advisory DB against `Cargo.lock`. Dependencies include `capnp`, `spin`, `x86_64`, `limine`, `linked_list_allocator` – all externally maintained.
- `cargo-geiger` report of unsafe surface area per crate, checked in as a snapshot and diffed in CI so growth is visible in PRs.
- Deny `unsafe_op_in_unsafe_fn` (already required by edition 2024; make sure it stays on) and `missing_docs` on public kernel items where it is not already the case.
- Dependency review discipline: every new dep needs a one-line rationale in the commit message and a check that it is `no_std`-capable, maintained, and does not pull in a surprise async runtime or heavy transitive graph.
- No-std dependency rubric: kernel/no_std additions require an explicit compatibility check that `core`/`alloc` paths do not regress to `std` through default feature drift, and class ownership is recorded against `docs/trusted-build-inputs.md`.
- Boot/build input pinning: pin external bootloader/tool downloads to an auditable revision or checksum. Branch names are not enough for TCB inputs. CI should fail when generated capnp bindings or no-std patching change outside an intentional schema/codegen update.
- Untrusted-path panic audit: `panic!`, `assert!`, `.unwrap()`, and `.expect()` are acceptable during bring-up, but every path reachable from manifest bytes, ELF bytes, SQEs, params buffers, result buffers, and future IPC messages needs either a fail-closed error or a documented halt policy.
- Hardware protection smoke tests: boot under QEMU with SMEP/SMAP-capable CPU flags and assert CR4.SMEP/CR4.SMAP once paging is initialized. Every explicit user-memory dereference must be wrapped in a short STAC/CLAC window once SMAP is enabled.
Tier 2 – Targeted dynamic analysis
Aimed at the host-testable pure-logic crates (capos-lib, capos-config)
where the Rust toolchain just works. No kernel changes required.
- Miri on the `cargo test-lib` and `cargo test-config` suites. Catches UB in pure-logic code: invalid pointer arithmetic, uninitialized reads, bad provenance, unsound `unsafe`. The FrameBitmap and CapTable tests in particular push against slot indexing, generation counters, and raw `&mut [u8]` handling – exactly what miri is good at.
- `proptest` (or `quickcheck`) on:
  - `capos-lib::elf::parse` – random bytes / random perturbations of a valid header must never panic and must refuse anything that isn’t a correctly formed user-half ELF64.
  - `capos-lib::frame_bitmap` – interleaved sequences of `alloc`, `alloc_contiguous`, `free`, `mark_used` preserve the invariant `free_count == popcount(bitmap == 0)` and never double-free.
  - `capos-lib::cap_table` – insert/remove/lookup sequences preserve “every returned id resolves to its insertion-time object, and stale ids are rejected.”
  - `capos-config::manifest` encode/decode round trip on arbitrary manifests.
- `cargo fuzz` harnesses (libFuzzer) on the same three parsers: `elf::parse`, `manifest::decode`, and the ring SQE decoder that will land as part of IPC. These run outside CI (they never terminate) but should have a seed corpus checked into `fuzz/corpus/` and be run for a fixed budget nightly or on-demand.
- Sanitizers on host tests: `RUSTFLAGS=-Zsanitizer=address` (and `thread`) on `cargo test-lib` under nightly. Requires nothing more than a CI job; cheap to add.
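The `frame_bitmap` property above can be sketched on the host with a toy model (a single-`u64` bitmap standing in for `capos-lib::frame_bitmap`, exercised with a fixed operation sequence where proptest would generate random interleavings):

```rust
// Toy stand-in for capos-lib::frame_bitmap: one u64, set bit = allocated.
struct ToyBitmap {
    bits: u64,
    free_count: u32,
}

impl ToyBitmap {
    fn new() -> Self {
        Self { bits: 0, free_count: 64 }
    }

    /// Allocate the lowest free frame, if any.
    fn alloc(&mut self) -> Option<u32> {
        let idx = (!self.bits).trailing_zeros();
        if idx >= 64 {
            return None; // full
        }
        self.bits |= 1 << idx;
        self.free_count -= 1;
        Some(idx)
    }

    /// Free a frame; rejects double-free instead of corrupting the count.
    fn free(&mut self, idx: u32) -> bool {
        if self.bits & (1 << idx) == 0 {
            return false;
        }
        self.bits &= !(1 << idx);
        self.free_count += 1;
        true
    }

    /// The invariant the proptest would assert after every operation:
    /// free_count == popcount of zero bits.
    fn invariant_holds(&self) -> bool {
        self.free_count == (!self.bits).count_ones()
    }
}
```

A proptest harness would replace the fixed sequence with `proptest`-generated `alloc`/`free` interleavings and call `invariant_holds` after each step.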
Tier 3 – Concurrency model checking
The capability ring is a lock-free single-producer / single-consumer protocol using volatile reads, release/acquire fences, and a shared head/tail pair. It is the most likely source of subtle memory-ordering bugs and is also the most isolated – a perfect fit for model checking.
- Loom on a host-buildable wrapper of the ring protocol. Extract the producer/consumer state machine from `capos-config::ring` into a form where atomics can be swapped for `loom::sync::atomic`, and write Loom tests that enumerate all interleavings of producer/consumer for small ring sizes (2–4 slots). Properties to check:
  - No CQE is lost.
  - No CQE is double-delivered.
  - The `sq_head`/`sq_tail` and `cq_head`/`cq_tail` pointers never observe a state that implies `tail - head > SQ_ENTRIES`.
  - The “corrupted producer state” fail-closed policy (REVIEW_FINDINGS.md “Userspace Ring Client”) holds under adversarial interleavings.
- Shuttle as a lighter alternative for regression-style tests once the specific bugs are known; cheaper per run, randomised rather than exhaustive. Good for long-running overnight jobs.
Loom coverage here is disproportionately valuable: it substitutes for the SMP-hardness work the project has explicitly deferred, and it exercises exactly the ordering that TOCTOU-style bugs hide in.
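The shape of the protocol under test can be sketched with std atomics (a minimal SPSC ring, not the capOS ring itself; under Loom these atomics become `loom::sync::atomic` so every interleaving is enumerated rather than only the ones that happen to occur on one run):

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

const SLOTS: usize = 4; // small power-of-two ring, like the Loom tests would use

// Minimal SPSC ring: tail is producer-owned, head is consumer-owned.
struct Ring {
    slots: [AtomicU64; SLOTS],
    head: AtomicUsize,
    tail: AtomicUsize,
}

/// Push 1..=100 through the ring on one thread, drain on another, and
/// return everything the consumer observed.
fn spsc_roundtrip() -> Vec<u64> {
    let ring = Arc::new(Ring {
        slots: std::array::from_fn(|_| AtomicU64::new(0)),
        head: AtomicUsize::new(0),
        tail: AtomicUsize::new(0),
    });
    let p = Arc::clone(&ring);
    let producer = thread::spawn(move || {
        for v in 1..=100u64 {
            loop {
                let tail = p.tail.load(Ordering::Relaxed);
                // Acquire on head: slot reuse must happen-after the consumer
                // finished reading it.
                if tail - p.head.load(Ordering::Acquire) < SLOTS {
                    p.slots[tail % SLOTS].store(v, Ordering::Relaxed);
                    p.tail.store(tail + 1, Ordering::Release); // publish
                    break;
                }
                std::hint::spin_loop();
            }
        }
    });
    let mut seen = Vec::with_capacity(100);
    while seen.len() < 100 {
        let head = ring.head.load(Ordering::Relaxed);
        // Acquire on tail: the slot contents must happen-after the publish.
        if head < ring.tail.load(Ordering::Acquire) {
            seen.push(ring.slots[head % SLOTS].load(Ordering::Relaxed));
            ring.head.store(head + 1, Ordering::Release);
        } else {
            std::hint::spin_loop();
        }
    }
    producer.join().unwrap();
    seen
}
```

“No CQE lost, none doubled” falls out as: the consumer sees exactly 1..=100, in order. A Loom harness would check the same assertion across all interleavings instead of one.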
Tier 4 – Bounded verification of specific invariants
Not a full-kernel proof. Targeted, property-specific, one-module-at-a-time.
- Kani (bounded model checking for Rust, via CBMC). Good fit for small, heap-free, arithmetic-heavy functions. Candidate modules:
  - `capos-lib::cap_table` – prove that for all `insert; remove; insert'` sequences under a `u8` generation counter, a stale CapId never resolves. Bound: table size ≤ 4, generation window ≤ 256.
  - `capos-lib::frame_bitmap` – prove that for all bitmap sizes up to N bytes, `alloc_frame` followed by `free_frame` of the same frame restores the original bitmap and `free_count`.
  - `capos-lib::elf::parse` bounds checks: prove that every index into the program header table is `< len`, given the validated `phentsize` and `phnum`.
- Verus (SMT-based Rust verifier, active development at MSR) for invariants that Kani can’t handle ergonomically, particularly those involving loops and ghost state. Worth tracking but don’t commit to it yet – the proof-engineering cost is real, and the tool is still young. Revisit once IPC lands and the kernel has stable public APIs.
- Creusot / Prusti are alternatives in the same space. Do not invest in more than one SMT-based verifier; pick whichever has the best story for `no_std + alloc` code when Tier 4 starts.
Deliberately out of scope: Isabelle/HOL, Coq proofs, Frama-C. They would require re-encoding Rust in a foreign semantic framework with no established Rust front-end mature enough for kernel code.
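The shape of the first Kani target can be sketched as plain Rust: where Kani would draw the starting generation from `kani::any()` and prove the property symbolically, this sketch enumerates all 256 starting generations explicitly (the single-slot model is a simplification of `capos-lib::cap_table`):

```rust
// Simplified single-slot model of the capability table: the generation
// counter is bumped on every remove so stale ids can be detected.
#[derive(Clone, Copy)]
struct Slot {
    generation: u8,
    live: bool,
}

#[derive(Clone, Copy, PartialEq, Debug)]
struct CapId {
    slot: usize,
    generation: u8,
}

fn insert(slot: &mut Slot) -> CapId {
    slot.live = true;
    CapId { slot: 0, generation: slot.generation }
}

fn remove(slot: &mut Slot) {
    slot.live = false;
    slot.generation = slot.generation.wrapping_add(1); // invalidate old ids
}

fn resolve(slot: &Slot, id: CapId) -> bool {
    slot.live && slot.generation == id.generation
}

/// The property Kani would prove: after insert -> remove -> insert, the
/// first (now stale) id never resolves, for every starting generation,
/// including at the u8 wraparound boundary.
fn stale_id_never_resolves() -> bool {
    (0..=u8::MAX).all(|g| {
        let mut s = Slot { generation: g, live: false };
        let stale = insert(&mut s);
        remove(&mut s);
        let _fresh = insert(&mut s);
        !resolve(&s, stale)
    })
}
```

The bound in the table (generation window ≤ 256) is where this model stops helping: after 256 remove cycles the `u8` counter wraps and a sufficiently old stale id would resolve again, which is exactly the limit the Kani proof makes explicit.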
4. Security Review Process
REVIEW.md is the rules document and REVIEW_FINDINGS.md is the open-issue log. REVIEW.md contains the common security checklist that applies across kernel, userspace, host tooling, generators, and CI. The per-boundary prompts below are an expansion of that common checklist for OS-specific code paths.
CWE/CAPEC tagging policy
Security findings should carry CWE metadata when the mapping is specific enough to help a reviewer or future audit. Do not force a CWE into every title.
- Prefer Base/Variant CWE IDs when the root cause is known: CWE-770 for unbounded allocation, CWE-88 for argument injection, CWE-367 for a concrete validation-to-use race, CWE-416 for a real use-after-free.
- Use Class IDs as temporary or umbrella labels: CWE-20 for “input was not validated enough” before the missing property is known; CWE-400 for general resource exhaustion only when the enabling mistake is not more precise.
- Use capability-kernel invariants instead of weak CWE mappings for design properties such as “no ambient authority”, “cap transfer happens exactly once”, “revocation cannot leave stale authority”, and “scheduling context donation cannot fabricate CPU authority”. Cite CWE-862/CWE-863 only when the issue is actually a missing or incorrect authorization check.
- Use CAPEC for the attacker pattern when useful: input manipulation, command injection, race exploitation, flooding, or path/file manipulation. CAPEC is not a substitute for the CWE root-cause tag.
Current checklist coverage:
| Area | Primary tags | Review intent |
|---|---|---|
| Structured input validation | CWE-20, CWE-1284–CWE-1289 when precise | Validate syntax, type, range, length, indexes, offsets, and cross-field consistency before privileged use |
| Filesystem paths | CWE-22, CWE-23, CWE-59 | Keep host-tool paths inside intended roots across absolute paths, traversal, symlinks, and file-type confusion |
| Commands/processes | CWE-78, CWE-88 | Avoid shell interpolation; constrain binaries and arguments |
| Numeric/buffer bounds | CWE-190, CWE-125, CWE-787 | Check arithmetic before pointer, slice, copy, ELF segment, and page-table use |
| Resource exhaustion | CWE-770 preferred; CWE-400 broad | Bound queues, allocations, retries, spin loops, frames, scratch arenas, cap slots, and CPU budget |
| Exceptional paths | CWE-703, CWE-754, CWE-755; CWE-248 only for uncaught exceptions | Fail closed on malformed or adversarial input; avoid trust-boundary panic/abort |
| Authorization/cap authority | CWE-862, CWE-863 plus capOS invariants | Verify capability ownership, generation, object identity, address-space ownership, and transfer policy |
| Concurrency/TOCTOU | CWE-362, CWE-367, CWE-667 | Preserve lock ordering, interrupt masking, page-table stability, and validation-to-use assumptions |
| Lifetime/reuse | CWE-416, CWE-664, CWE-672 | Prevent stale caps, stale kernel stacks, stale frames, and expired IPC state from being used |
| Disclosure/residual data | CWE-200, CWE-226 | Prevent logs, result buffers, frames, scratch arenas, and generated artifacts from leaking stale or sensitive data |
| Supply chain / generated TCB | capOS TCB invariant; use CWE only for concrete bug | Pin or review-visible drift for bootloader, dependencies, schema/codegen, generated code, and patching |
Per-boundary review checklist
- Syscall surface change (`arch/x86_64/syscall.rs`):
  - Every register-passed argument is treated as attacker-controlled.
  - No user pointer is dereferenced without `validate_user_buffer`.
  - Numeric conversions, copy lengths, and pointer arithmetic are checked before constructing slices or entering a `UserAccessGuard` scope.
  - Kernel stack pointer and TSS.RSP0 invariants are preserved.
  - The syscall count stays bounded; a new syscall has an SQE-opcode alternative considered and explicitly rejected with rationale.
- Ring dispatch change (`kernel/src/cap/ring.rs`):
  - SQ bounds check and per-dispatch SQE limit still enforced.
  - Corrupted SQ state fails closed (never re-processes the same bad state on the next tick).
  - No allocation in the interrupt-driven path beyond what is already documented in REVIEW_FINDINGS.md.
  - Result buffers and endpoint scratch buffers cannot leak stale bytes beyond the returned completion length.
- User buffer validation change (`kernel/src/mem/validate.rs`):
  - Address range check precedes PTE walk.
  - PTE flags checked: present, user, and write (if the buffer is written).
  - Single-CPU assumption explicit; TOCTOU note retained until SMP lands.
- ELF loader change (`capos-lib::elf`):
  - Every field bounded before use (phentsize, phnum, p_offset, p_filesz, p_memsz, p_vaddr).
  - Segments confined to the user half.
  - Overlap check preserved.
  - Integer arithmetic uses checked add/subtract before deriving mapped addresses, file slices, or zero-fill ranges.
- Manifest change (`capos-config::manifest`):
  - Every optional field is either present or the service is rejected.
  - Name / binary / cap source strings are length-bounded.
  - Unknown / unsupported numbers in CUE input fail closed with a path-specific error.
  - Capability grants are checked as an authority graph before any rejected graph can start a service.
- Schema change (`schema/capos.capnp`):
  - Backward-compatible with existing wire format, or migration documented.
  - Every new method has an explicit capability-granting story (who mints the cap that lets this method be called?).
  - Generated code `no_std` patching still applies.
- Host tool or generator change (`tools/*`, `build.rs`, CI scripts):
  - Manifest/config-derived paths cannot escape intended directories through absolute paths, traversal, symlinks, or file-type confusion.
  - External command execution uses explicit binaries and argument APIs, not shell interpolation of untrusted strings.
  - Generated outputs are review-visible and fail closed on malformed inputs.
  - Generated files and diagnostics do not disclose secrets, absolute paths, or stale build outputs beyond what the developer intentionally requested.
- Unsafe block added or expanded: Tier 1 clippy lints plus REVIEW.md §“Unsafe Usage” checklist already cover this; the review should cite the specific invariant being maintained in the commit message.
Threat-model refresh
On every stage completion (Stage 6 IPC, Stage 7 SMP, first driver landing, first time a manifest comes from outside the repo), re-run §2 of this document and update it. The list of trust boundaries grows over time; the proposal decays if it doesn’t grow with the code.
Periodic full audit
Once per stage, schedule a focused audit pass:
- Re-verify every boundary’s code is still enforced at its documented entry point (no new bypass path).
- Re-run all Tier 2/3 jobs with the latest toolchain (catches tool-upgrade regressions).
- Walk through open items in REVIEW_FINDINGS.md and confirm each is still correctly classified (still open, fixed, or explicitly accepted).
- Record the audit date + outcome at the top of REVIEW_FINDINGS.md, matching the existing “Last verification” convention.
5. Concrete Verification Targets
Ordered by value and feasibility. Each one is a specific, bounded piece of work a contributor can pick up without needing to redesign the kernel.
| # | Target | Tier | Property | Blocker |
|---|---|---|---|---|
| 1 | capos-lib::cap_table | 4 (Kani) | Stale CapId never resolves after slot reuse within the generation window | None |
| 2 | capos-lib::frame_bitmap | 4 (Kani) | alloc/free preserve free_count invariant; no double-alloc | None |
| 3 | capos-lib::elf::parse | 2 (proptest + fuzz) | No panic on arbitrary input; only well-formed user-half ELF64 accepted | None |
| 4 | capos-config::manifest | 2 (proptest + fuzz) | Decode/encode round-trip; malformed input rejected without panic | None |
| 5 | Ring SPSC protocol | 3 (Loom) | No lost/doubled CQEs; fail-closed on corruption under all interleavings | Extract protocol into Loom-testable wrapper |
| 6 | validate_user_buffer | 4 (Kani) | Every accepted buffer lies entirely in user half with correct PTE flags | Formalise PTE model |
| 7 | Ring dispatch path | 3 (Loom + proptest) | SQE poll is bounded per tick; no allocation on the dispatch path | Initial alloc-free synchronous path landed; async transfer/release paths still need coverage |
| 8 | IPC routing | 3 | Capabilities transferred exactly once; no duplication under direct-switch | Capability transfer |
| 9 | Direct-switch IPC handoff | 2 + 3 | Scheduler invariants preserved when a blocked receiver bypasses normal run-queue order | Loom-testable scheduler/ring model |
| 10 | SMEP/SMAP + user access windows | 1 + QEMU integration | Kernel cannot execute user pages; kernel user-memory touches happen only inside audited access windows | Wire existing x86_64 helper into init path |
| 11 | Manifest authority graph | 2 (property tests) | Every granted cap source resolves, every export is unique, and no service starts after a rejected graph | Manifest executor path |
| 12 | Resource accounting | 2 + 3 | Rings, endpoints, cap tables, scratch arenas, frames, and CPU budget fail closed under exhaustion | S.9 design complete; implementation hooks pending |
| 13 | Build/codegen TCB | 1 | Bootloader/deps/codegen inputs are pinned and generated output changes are review-visible | CI bootstrap |
| 14 | Device DMA boundary (future) | 1 + design review | No driver or device can DMA outside explicitly granted buffers | PCI/device work; IOMMU or bounce-buffer decision |
Targets 1–4 are feasible today and should be the first batch of work. Target 10 is the security gate before treating Stage 6 services as untrusted. Targets 11–12 should be designed before capability transfer lands, otherwise the first IPC implementation will bake in ambient resource authority. Target 14 gates user-mode or semi-trusted drivers.
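The Target-1 property can be illustrated with a small host-side model. This is a hedged sketch with illustrative names, not the actual capos-lib::cap_table code: a CapId packs a slot index plus a generation counter, and a stale id must stop resolving once its slot has been freed and reused.

```rust
// Simplified generational cap-table model (names are illustrative, not
// the kernel's real API). The property under test: a stale CapId never
// resolves after its slot is reused, because the generation was bumped.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct CapId { slot: u32, generation: u32 }

struct CapTable {
    // Each slot stores (current generation, occupied flag).
    slots: Vec<(u32, bool)>,
}

impl CapTable {
    fn new(n: usize) -> Self { CapTable { slots: vec![(0, false); n] } }

    fn insert(&mut self) -> Option<CapId> {
        let slot = self.slots.iter().position(|&(_, used)| !used)?;
        self.slots[slot].1 = true;
        Some(CapId { slot: slot as u32, generation: self.slots[slot].0 })
    }

    fn remove(&mut self, id: CapId) {
        let e = &mut self.slots[id.slot as usize];
        if e.1 && e.0 == id.generation {
            e.1 = false;
            e.0 = e.0.wrapping_add(1); // bump generation so old ids go stale
        }
    }

    /// Resolve succeeds only if the slot is occupied AND generations match.
    fn resolve(&self, id: CapId) -> bool {
        let (generation, used) = self.slots[id.slot as usize];
        used && generation == id.generation
    }
}

fn main() {
    let mut t = CapTable::new(4);
    let a = t.insert().unwrap();
    assert!(t.resolve(a));
    t.remove(a);
    let b = t.insert().unwrap(); // reuses slot 0 with a new generation
    assert_eq!(a.slot, b.slot);
    assert!(!t.resolve(a), "stale id must not resolve after slot reuse");
    assert!(t.resolve(b));
}
```

A Kani harness would make the slot index, generation, and operation sequence symbolic instead of the concrete sequence shown here, within the generation window the table states.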
Current status as of 2026-04-21:
- Targets 1–2 have initial Kani coverage in capos-lib.
- Target 3 has arbitrary-input proptest coverage and a cargo-fuzz target for ELF bytes. The current Kani harness still only proves the short-input early-reject path because fully symbolic ELF parsing reaches allocator and sort internals before there is a sharper proof obligation.
- Target 4 has cargo-fuzz coverage for manifest decoding/roundtrip and mkmanifest exported-JSON conversion.
- Target 5 has a feature-gated Loom model for the shared ring protocol.
- Target 13 has an initial CI baseline plus generated-code drift checking and dependency audit/deny gates; a required QEMU boot gate is still open.
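Target 12's fail-closed exhaustion behavior can be sketched as a small quota model. This is a hedged host-side illustration with invented field names; the concrete ledger design lives in docs/authority-accounting-transfer-design.md, not here.

```rust
// Illustrative per-resource quota: each class has a limit and a count,
// and acquisition fails closed at the limit rather than silently
// granting ambient resource authority. Names are stand-ins only.

#[derive(Debug, PartialEq)]
enum QuotaError { Exhausted(&'static str) }

struct Quota { used: u64, limit: u64 }

impl Quota {
    fn new(limit: u64) -> Self { Quota { used: 0, limit } }

    fn acquire(&mut self, n: u64, what: &'static str) -> Result<(), QuotaError> {
        // saturating_add keeps an adversarial huge `n` from wrapping
        // past the limit check.
        if self.used.saturating_add(n) > self.limit {
            return Err(QuotaError::Exhausted(what)); // fail closed
        }
        self.used += n;
        Ok(())
    }

    fn release(&mut self, n: u64) { self.used = self.used.saturating_sub(n); }
}

fn main() {
    let mut cap_slots = Quota::new(4);
    assert!(cap_slots.acquire(4, "cap slots").is_ok());
    // At the limit, the next acquisition is refused, never over-granted.
    assert_eq!(
        cap_slots.acquire(1, "cap slots"),
        Err(QuotaError::Exhausted("cap slots"))
    );
    cap_slots.release(1);
    assert!(cap_slots.acquire(1, "cap slots").is_ok());
}
```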
6. Phased Plan
This slots into WORKPLAN.md as a cross-cutting track rather than a phase – items are independent of Stage 6 IPC and can proceed in parallel.
- Track S.1 – CI bootstrap – landed 2026-04-21
- .github/workflows/ci.yml: fmt-check, test-config, test-ring-loom, test-lib, test-mkmanifest, cargo build --features qemu, make capos-rt-check, generated-code drift checking, and dependency policy checking.
- QEMU smoke installs build-essential, capnproto, qemu-system-x86, xorriso, and cue v0.16.0 before running make and make run; it remains optional/non-blocking until boot runtime is stable enough to make it a required gate.
- Clippy-with-deny and cargo-geiger remain future hardening jobs.
- Track S.2 – Miri + proptest on capos-lib – landed 2026-04-21
- Add proptest dev-dependency to capos-lib.
- Host properties for capos-lib::cap_table and capos-lib::frame_bitmap; ELF arbitrary-input coverage remains open under S.13.
- cargo test-lib runs the native host suite; cargo miri-lib runs the same crate under Miri.
- Track S.3 – Manifest + mkmanifest fuzzing – landed 2026-04-21
- fuzz/ crate with harnesses for manifest::decode and the tools/mkmanifest CUE → capnp pipeline. Seed corpus checked in.
- Track S.4 – Ring Loom harness – landed 2026-04-21
- Extract the SPSC protocol from capos-config::ring into a test-only wrapper where atomics are swappable.
- Loom tests covering corruption, overflow, and ordering.
- Doubles as regression coverage for Phase 1.5 in WORKPLAN.md.
- Track S.5 – Kani on capos-lib – initial harnesses landed 2026-04-21
- CapTable generation/index/stale-reference invariants.
- FrameBitmap alloc/free symmetry over a small symbolic bitmap model plus a concrete bounded contiguous-allocation proof.
- ELF parser short-input early-reject panic-freedom.
- The current bounds are intentionally conservative so make kani-lib remains a practical gate; broader symbolic ELF and contiguous-allocation proofs should wait for more specific invariants.
- Track S.6 – Security review docs stay aligned
- Keep REVIEW.md’s common security checklist aligned with §4’s boundary prompts as new boundaries land.
- Add a “threat model refresh” step to the stage-completion workflow in CLAUDE.md.
- Track S.7 – Stage-6-aware refresh
- Re-run §2 trust-boundary inventory after capability transfer/release semantics land.
- Plan Loom coverage for cross-process routing and direct-switch IPC.
- Track S.8 – Untrusted-service hardening gate
- Wire SMEP/SMAP enablement into x86_64 init after paging is live.
- Replace raw user-slice construction in syscall/ring paths with checked copy/access helpers that bracket the actual access with STAC/CLAC.
- Add QEMU hostile-userspace tests for bad pointers, kernel-half pointers, invalid caps, corrupted rings, and services without Console authority.
- Audit untrusted-input paths for panics before Stage 6 endpoints run mutually-untrusting processes.
- Track S.9 – Authority graph and resource accounting – landed 2026-04-21
- Concrete design is captured in docs/authority-accounting-transfer-design.md.
- Defines authority graph invariants, a per-process quota ledger (cap slots, endpoint queue, outstanding calls, scratch, frame grants, log volume, CPU budget), diagnostic aggregation, and exactly-once transfer/rollback semantics.
- Establishes acceptance criteria that gate WORKPLAN 3.6 capability transfer and 5.2 ProcessSpawner implementation.
- Track S.10 – Supply-chain and generated-code TCB
- Pin Limine and other external build inputs by revision/checksum rather than branch name.
- Make capnp generated-code changes review-visible in CI, including the no-std patching step.
- Consider cargo-vet only after cargo-deny/cargo-audit are in place; vetting too early is process theater.
- S.10.3 adds a concrete dependency policy: no_std additions are accepted only with class attribution, cargo deny + cargo audit, and explicit lockfile intent.
- S.10.3 enforcement is make dependency-policy-check, backed by deny.toml and pinned CI installs of cargo-deny 0.19.4 and cargo-audit 0.22.1.
- Track S.11 – Device/DMA isolation gate
- Before PCI/virtio/NVMe/user drivers land, choose the DMA isolation story: IOMMU-backed DMA domains or kernel-owned bounce buffers.
- Define DMAPool, DeviceMmio, and Interrupt capability invariants: bounded physical ranges, explicit interrupt ownership, device reset on revoke, and no raw physical-address grants to untrusted drivers.
- S.11.2 requires a concrete ownership-transition gate before userspace NIC/block drivers are allowed.
- Track S.12 – Kani harness bounds refresh
- Revisit Kani bounds and harness shape once capability transfer, resource accounting, or validate_user_buffer exposes concrete proof obligations.
- Prefer actionably narrow properties over arbitrary symbolic parser exploration that spends verifier time in allocator or sort internals.
- Track S.13 – ELF parser arbitrary-input coverage
- Add proptest coverage for capos-lib::elf::parse on arbitrary bytes and valid-header perturbations.
- Add a cargo fuzz target for ELF bytes once the corpus and runtime budget are defined.
Tracks S.1 through S.5 have initial coverage. S.6 is ongoing doc hygiene and should move with review-process changes. S.8 must land before Stage 6 runs mutually-untrusting services. S.9 design is complete and now gates concrete implementation work in 3.6/5.2. S.11 gates device-driver work. S.12 should not expand bounds for their own sake; it is a refresh point when new kernel invariants make better proof targets available. S.13 closes the remaining target-3 gap from the table above.
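The invariant behind Track S.8's checked access helpers and Target 6's validate_user_buffer can be sketched on the host. This is a minimal range-check model under an assumed canonical lower-half boundary; the real kernel helper must additionally verify PTE flags and bracket the actual copy with STAC/CLAC.

```rust
// Host-side model of the user-buffer check: an access is accepted only
// if the whole [addr, addr + len) range lies in the user half. The
// boundary constant and function name are assumptions for this sketch.

const USER_HALF_END: u64 = 0x0000_8000_0000_0000; // canonical lower half

fn validate_user_range(addr: u64, len: u64) -> bool {
    // checked_add makes the wrap-around case fail closed instead of
    // accepting a range that wraps past the top of the address space.
    match addr.checked_add(len) {
        Some(end) => end <= USER_HALF_END,
        None => false,
    }
}

fn main() {
    // Ordinary user buffer is accepted.
    assert!(validate_user_range(0x1000, 4096));
    // Range straddling the kernel boundary is rejected.
    assert!(!validate_user_range(USER_HALF_END - 8, 16));
    // Kernel-half pointer is rejected outright.
    assert!(!validate_user_range(0xffff_8000_0000_0000, 8));
    // Length that wraps the address space fails closed.
    assert!(!validate_user_range(0x1000, u64::MAX));
}
```

The QEMU hostile-userspace tests in S.8 probe exactly these cases from the other side: bad pointers and kernel-half pointers must come back as errors, never as kernel memory access.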
7. What This Proposal Does Not Promise
- No claim that capOS will be “secure” at the end. It will be harder to write a silently wrong change to the code paths the tooling covers, and it will be easier to find the ones that are still wrong.
- No proof obligation on every PR. Kani and Loom are expensive to run on every push; CI runs them on a reduced schedule (e.g. nightly, or on PRs that touch the covered crates).
- Userspace and host-tool bugs are in scope, but their impact is classified by boundary. A userspace bug should not compromise kernel isolation; a host-tool bug can still compromise the build TCB or developer/CI filesystem.
- No claim that confidentiality is handled beyond architectural isolation. Timing channels, cache side channels, device side channels, and covert channels through shared services remain explicit research topics, not current implementation goals.
8. Relation to Other Docs
- docs/research/sel4.md §1 and §6.1 already make the case that full verification is not the right goal. This proposal is the operational answer.
- REVIEW.md is the reviewer’s rulebook. This proposal explains the security and verification rationale behind its common checklist and per-boundary prompts.
- REVIEW_FINDINGS.md is the open-issue log. This proposal feeds it – every bug found by Tier 2/3/4 tooling lands there unless fixed in the same change.
- ROADMAP.md owns the stages; this proposal does not add stages, only a cross-cutting track that runs alongside them.
- WORKPLAN.md owns concrete ordering; Tracks S.1–S.11 above are the actionable slice mirrored there.
Proposal: Capability-Based Binaries, Language Support, and POSIX Compatibility
How userspace binaries receive, use, and compose capabilities — from the native Rust runtime through POSIX compatibility to running unmodified software.
Current State
The init binary (init/src/main.rs) is no_std Rust with a static heap
allocator and still reaches the raw bootstrap syscalls through shared demo
support. It validates its CapSet, invokes the Console capability over the
capability ring, and exits. The former ring and IPC smokes now live as
separate release-built binaries in the nested demos/ workspace. The kernel
creates multiple processes from the boot manifest and schedules them with
round-robin preemption. Init is not yet based on a reusable runtime crate —
demos/capos-demo-support is only a test-support shim, not capos-rt.
The kernel-side roadmap (Stages 4-6) provides the capability ring (SQ/CQ
shared memory + cap_enter syscall, implemented), scheduling, and IPC. This
proposal covers the userspace half: what binaries look like, how they’re built,
and how existing software runs on a system with no ambient authority.
Part 1: Native Userspace Runtime (capos-rt)
The Problem
Every userspace binary currently needs to:
- Define _start and a panic handler
- Set up an allocator
- Construct raw syscall wrappers
- Manually serialize/deserialize capnp messages
- Know the syscall ABI (register layout, method IDs)
This is fine for one proof-of-concept binary. It won’t scale to dozens of services.
Solution: A Userspace Runtime Crate
capos-rt is a no_std + alloc Rust crate that every native capOS binary
depends on. It provides:
1. Entry point and allocator setup.
// capos-rt provides the real _start that:
// - initializes the heap allocator (bump allocator over a fixed region,
// or grows via FrameAllocator cap if granted)
// - parses the initial capability set from a kernel-provided page
// - calls the user's main(CapSet)
// - calls sys_exit with the return value
#[capos_rt::main]
fn main(caps: CapSet) -> Result<(), Error> {
let console = caps.get::<Console>("console")?;
console.write_line("Hello from capOS")?;
Ok(())
}
2. Syscall layer. Raw syscall asm wrapped in safe Rust functions.
The entire syscall surface is 2 calls – new operations are SQE opcodes, not
new syscalls:
- sys_exit(code) – terminate process (syscall 1)
- sys_cap_enter(min_complete, timeout_ns) – flush pending SQEs, then wait until N completions are available or the timeout expires (syscall 2)
Capability invocations go through the per-process SQ/CQ ring. capos-rt
provides helpers for writing SQEs and reading CQEs:
/// Submit a CALL SQE to the capability ring and wait for the CQE.
pub fn cap_call(
    ring: &mut CapRing,
    cap_id: u32,
    method_id: u16,
    params: &[u8],
    result_buf: &mut [u8],
) -> Result<usize, CapError> {
    ring.push_call_sqe(cap_id, method_id, params);
    sys_cap_enter(1, u64::MAX);
    ring.pop_cqe(result_buf)
}
3. Cap’n Proto integration. Re-exports generated types from schema/capos.capnp
and provides typed wrappers:
// Generated from schema + thin wrapper in capos-rt
impl Console {
    pub fn write(&self, data: &[u8]) -> Result<(), CapError> {
        let mut msg = capnp::message::Builder::new_default();
        let mut req = msg.init_root::<console::write_params::Builder>();
        req.set_data(data);
        self.invoke(0, &msg) // method @0
    }

    pub fn write_line(&self, text: &str) -> Result<(), CapError> {
        let mut msg = capnp::message::Builder::new_default();
        let mut req = msg.init_root::<console::write_line_params::Builder>();
        req.set_text(text);
        self.invoke(1, &msg) // method @1
    }
}
4. CapSet – the initial capability environment.
At spawn time, the kernel writes the process’s initial capabilities into a
well-known page (or passes them via registers/stack – ABI TBD). capos-rt
parses this into a CapSet: a name-to-CapId map.
pub struct CapSet {
    caps: BTreeMap<String, CapEntry>,
}

struct CapEntry {
    id: u32,           // authority-bearing slot in the process CapTable
    interface_id: u64, // generated capnp TYPE_ID, carried for type checking
}

impl CapSet {
    /// Get a typed capability by name. Fails if not present or wrong type.
    pub fn get<T: Capability>(&self, name: &str) -> Result<T, CapError> { ... }

    /// List available capability names (for debugging/discovery).
    pub fn list(&self) -> impl Iterator<Item = (&str, u64)> { ... }
}
interface_id is not a handle. It is metadata carrying the generated Cap’n
Proto TYPE_ID for the interface expected by the typed client. The handle is
id (CapId). A typed client constructor must check that
entry.interface_id == T::TYPE_ID, then store the CapId. Normal CALL SQEs
do not need to repeat the interface ID because each capability table entry
exposes one public interface. The ring SQE keeps fixed-size reserved padding
for ABI stability, not a required interface field for the system transport.
This matters for the system transport because several capabilities can expose
the same interface while representing different authority: a serial console, a
log-buffer console, and a console proxy all have the Console TYPE_ID, but
different CapId values.
Crate Structure
capos-rt/
Cargo.toml # no_std + alloc, depends on capnp
build.rs # capnpc codegen from schema/capos.capnp
src/
lib.rs # re-exports, #[capos_rt::main] macro
syscall.rs # raw asm syscall wrappers
caps.rs # CapSet, CapEntry, Capability trait
alloc.rs # userspace heap allocator setup
generated.rs # include!(capnp generated code)
capos-rt is NOT a workspace member (same as init/ – needs different
code model and linker script). It’s a path dependency for userspace crates.
Init After capos-rt
// init/src/main.rs -- after capos-rt exists
use capos_rt::prelude::*;

#[capos_rt::main]
fn main(caps: CapSet) -> Result<(), Error> {
    let console = caps.get::<Console>("console")?;
    let spawner = caps.get::<ProcessSpawner>("spawner")?;
    let manifest = caps.get::<Manifest>("manifest")?;

    console.write_line("capOS init starting")?;

    let mut running_services = BTreeMap::new();
    for entry in manifest.services()? {
        let binary_name = entry.binary();
        let granted = resolve_caps(&entry, &running_services, &caps)?;
        let handle = spawner.spawn(entry.name(), binary_name, &granted)?;
        running_services.insert(entry.name(), handle);
    }

    supervisor_loop(&running_services, &spawner)
}
Part 2: Capability-Based Binary Model
Binary Format
ELF64, same as now. The kernel’s ELF loader (kernel/src/elf.rs) already
handles PT_LOAD segments. No changes to the binary format itself.
What changes is the ABI contract between kernel and binary:
| Aspect | Current (Stage 3) | After capos-rt |
|---|---|---|
| Entry point | _start(), no args | _start(cap_page: *const u8) or via well-known address |
| Syscall ABI | ad-hoc (rax=0 write, rax=1 exit) | SQ/CQ ring + sys_cap_enter + sys_exit |
| Capability access | None | CapSet parsed from kernel-provided page |
| Serialization | None | capnp messages |
| Allocator | None (no heap) | Bump allocator, optionally backed by FrameAllocator cap |
Initial Capability Passing
The kernel needs to communicate the initial cap set to the new process. Options:
Option A: Well-known page. Kernel maps a read-only page at a fixed virtual
address (e.g., 0x1000) containing a capnp-serialized InitialCaps message:
struct InitialCaps {
  entries @0 :List(InitialCapEntry);
}

struct InitialCapEntry {
  name @0 :Text;
  id @1 :UInt32;
  interfaceId @2 :UInt64;
}
Option B: Register convention. Pass pointer and length in rdi/rsi at
entry. Simpler, but the data still needs to live somewhere in user memory.
Option C: Stack. Push the cap descriptor onto the user stack before iretq.
Similar to how Linux passes auxv to _start.
Option A is cleanest – the page is always there, no calling-convention dependency, and it naturally extends to passing additional boot info later.
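The parse step capos-rt's _start would perform over the Option A page can be sketched on the host. The real page would be capnp-serialized; this sketch deliberately substitutes a trivial ad-hoc layout (count, then repeated name-length/name/id/interfaceId records) just to show the parse-then-index shape and its fail-closed behavior.

```rust
// Hypothetical layout, NOT the capnp wire format: one count byte, then
// per entry: name_len (u8), name bytes, CapId (u32 LE), interfaceId
// (u64 LE). Any truncation or bad UTF-8 rejects the whole page.

use std::collections::BTreeMap;

fn parse_initial_caps(page: &[u8]) -> Option<BTreeMap<String, (u32, u64)>> {
    let mut caps = BTreeMap::new();
    let mut off = 0usize;
    let count = *page.get(off)? as usize;
    off += 1;
    for _ in 0..count {
        let name_len = *page.get(off)? as usize;
        off += 1;
        let name = std::str::from_utf8(page.get(off..off + name_len)?).ok()?;
        off += name_len;
        let id = u32::from_le_bytes(page.get(off..off + 4)?.try_into().ok()?);
        off += 4;
        let iid = u64::from_le_bytes(page.get(off..off + 8)?.try_into().ok()?);
        off += 8;
        caps.insert(name.to_string(), (id, iid));
    }
    // Malformed pages return None above: fail closed, never a partial set.
    Some(caps)
}

fn main() {
    let mut page = vec![1u8]; // one entry
    page.push(7); // name_len
    page.extend_from_slice(b"console");
    page.extend_from_slice(&3u32.to_le_bytes());        // CapId
    page.extend_from_slice(&0xabcd_u64.to_le_bytes());  // interfaceId
    let caps = parse_initial_caps(&page).unwrap();
    assert_eq!(caps["console"], (3, 0xabcd));
    // A truncated page is rejected rather than partially parsed.
    assert!(parse_initial_caps(&page[..5]).is_none());
}
```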
Service Binary Lifecycle
1. Kernel loads ELF, creates address space, populates cap table
2. Kernel maps InitialCaps page at well-known address
3. Kernel enters userspace at _start
4. capos-rt _start:
a. Initialize heap allocator
b. Parse InitialCaps page into CapSet
c. Call user's main(CapSet)
5. User main:
a. Extract needed caps from CapSet
b. Do work (invoke caps, serve requests)
c. Optionally export caps to parent once ProcessHandle export lookup exists
6. On return from main (or sys_exit):
a. Kernel destroys process
b. All caps in process's cap table are dropped
c. Parent's ProcessHandle receives exit notification
Part 3: Language Support Roadmap
Tier 1: Rust (native, now)
Rust is the only language that matters until the runtime is stable. Reasons:
- no_std + alloc works today with the existing kernel
- capnp crate (v0.25) has no_std support with codegen
- Same language as the kernel, shared understanding of the memory model
- Ownership model maps naturally to capability lifecycle
All system services (drivers, network stack, store) will be Rust.
Tier 2: C (via libcapos.h, after Stage 6)
C is the second target because most existing driver code and system software is C, and the FFI boundary with Rust is trivial.
libcapos is a static library providing:
#include <capos.h>
// Ring-based capability invocation (synchronous wrapper around SQ/CQ ring)
int cap_call(cap_ring_t *ring, uint32_t cap_id, uint16_t method_id,
const void *params, size_t params_len,
void *result, size_t result_len);
// Typed wrappers (generated from .capnp schema)
int console_write(cap_t console, const void *data, size_t len);
int console_write_line(cap_t console, const char *text);
// CapSet access
cap_t capset_get(const char *name);
uint64_t capset_interface_id(const char *name);
// Syscalls (the entire syscall surface -- 2 calls total)
_Noreturn void sys_exit(int code); // terminate
uint32_t sys_cap_enter(uint32_t min_complete, // flush SQEs + wait
uint64_t timeout_ns);
Implementation: libcapos is Rust compiled to a static .a with a C ABI
(#[no_mangle] extern "C"). The capnp message construction happens in Rust
behind the C API. This avoids requiring a C capnp implementation.
C binaries link against libcapos.a and use the same linker script as Rust
userspace binaries. The entry point and allocator setup are in libcapos.
Tier 3: Regular Rust Runtime Support
After the native capos-rt service model is stable, the next language
priority is making capOS build and run ordinary Rust programs as far as the
capability model permits. The target is not an ambient POSIX clone; it is a
Rust runtime path where common crates can use allocation, time, threading
where available, and capability-backed I/O through capOS-native shims.
This has higher priority than C++ and should be evaluated before broad POSIX
compatibility work, because Rust is already the system language and can reuse
the existing capos-rt ownership and ring abstractions directly.
Tier 4: Go (GOOS=capos)
Go is the next high-priority runtime after regular Rust. It needs in-process threading, futex-like wait/wake, TLS/runtime metadata support, GC integration, and a network poller mapped to capOS capabilities. See docs/proposals/go-runtime-proposal.md for the dedicated plan.
Go has higher priority than C++ because it unlocks CUE and a large practical tooling/runtime ecosystem; C++ support should not displace the Go runtime track.
Tier 5: Any Language Targeting WASI (longer term)
See Part 5 below. Languages that compile to WASI (Rust, C, Go, etc.) can run on capOS through a WASI-to-capability translation layer.
Important distinction: WASI works differently for compiled vs. interpreted languages:
- Compiled languages (Rust, C) compile directly to .wasm — no interpreter in the loop. WASI is a clean, efficient execution path.
- Interpreted languages (Python, JS, Lua) still need their interpreter (CPython, QuickJS, etc.) — it’s just compiled to .wasm instead of native code. The stack becomes: script → interpreter.wasm → WASI runtime → kernel. You pay for a wasm sandbox layer on top of the interpreter you’d need anyway.
For interpreted languages, WASI sandboxing is valuable when running untrusted code (plugins, user-submitted scripts) where you don’t trust the interpreter itself. For trusted system scripts, native CPython/QuickJS via the POSIX layer (Part 4) is simpler and faster — the capability model already constrains what the process can do.
Tier 6: Managed Runtimes (much later)
Languages with their own runtimes (Java, .NET) would need their runtime ported to capOS. This is large effort and low priority. WASI is the pragmatic answer for these languages.
Go is a special case — see docs/proposals/go-runtime-proposal.md
for the custom GOOS=capos path (motivated by CUE support). Go via WASI
(GOOS=wasip1) is an alternative for CPU-bound use cases but lacks
goroutine parallelism and networking.
C++ Note: pg83/std
pg83/std (https://github.com/pg83/std) was reviewed as a possible easier
path to C++ on capOS. It is MIT licensed and centered on ObjPool, an
arena-owned object graph model with small containers and lightweight public
interfaces.
The useful subset for capOS is the low-level core: std/mem, std/lib,
std/str, std/map, std/alg, std/typ, and std/sys/atomic. The main
shim boundary is std/sys/crt.cpp, which currently provides allocation,
memory/string intrinsics, and monotonic time through hosted libc calls.
The full library is not a shortcut to C++ support. It assumes hosted/POSIX
facilities in large areas: malloc/free, clock_gettime, pthreads, poll,
epoll/kqueue, sockets, fd I/O, DNS, and optional TLS libraries. Its build also
expects a C++26-capable compiler. On the current development host, g++
13.3.0 rejected -std=c++26 and clang++ was unavailable.
Treat it as a later C++ experiment after libcapos and C/C++ startup exist:
port only the freestanding arena/container subset first, with exceptions and
RTTI disabled unless a concrete C++ ABI decision enables them. Regular Rust and
Go remain higher-priority runtime tracks.
Language-Specific Notes
Python
CPython is a C program. It can reach capOS via two paths:
- WASI: CPython compiled to python.wasm, runs inside Wasmtime/WAMR on capOS. Note: this is still CPython — WASI doesn’t eliminate the interpreter, it just compiles it to wasm. The stack is: script.py → python.wasm → WASI runtime (native) → kernel.
- POSIX layer: CPython compiled to native ELF via musl + libcapos-posix. Direct: script.py → cpython (native) → kernel.
WASI path — upstream status (as of March 2026):
- CPython on WASI is Tier 2 since Python 3.13 (PEP 816)
- Works for compute-only workloads (no I/O beyond stdout)
- No sockets/networking — blocked on WASI 0.3 (no release date)
- No threading — WASI SDK 26/27 have bugs, skipped by CPython
- WASI 0.2 skipped entirely — going straight to 0.3
- Python 3.14 targets WASI 0.1, SDK 24
POSIX path:
- Full CPython built against musl + libcapos-posix
- Networking works immediately (via TcpSocket/UdpSocket caps behind the POSIX socket shim), no dependency on WASI 0.3
- More integration work than WASI, but unblocked
MicroPython: Small C program (~100K source) designed for embedded use.
Builds against musl + libcapos-posix with minimal effort. No threading,
no mmap, minimal syscall surface. Good for early scripting needs before
full CPython is ported.
When to use which:
| Use case | Path | Why |
|---|---|---|
| Untrusted Python plugins | WASI | Wasm sandbox isolates interpreter bugs |
| System scripts, config tooling | POSIX (native CPython) | Simpler, faster, networking works |
| Early scripting before POSIX layer | WASI (compute-only) | Works today, no porting needed |
| Lightweight embedded scripting | MicroPython via POSIX | Tiny footprint, minimal deps |
Recommendation: Use POSIX path (native CPython) as the primary Python target once the POSIX layer exists. WASI path for sandboxed/untrusted execution. MicroPython for early experimentation. No custom Python runtime port needed — both paths reuse upstream CPython.
JavaScript / TypeScript
Same situation as Python — JS engines (V8, SpiderMonkey, QuickJS) are C/C++ programs that can be compiled to native via POSIX layer or to wasm via WASI. In both cases, the engine interprets JS; WASI just sandboxes the engine itself.
QuickJS is the MicroPython equivalent — tiny (~50K lines C), embeddable,
trivially builds against libcapos. Good candidate for embedded scripting
in capOS services without pulling in a full V8.
Lua
Tiny C implementation (~30K lines). Trivially builds against libcapos.
Good candidate for an embedded scripting language in capOS services.
Alternatively, runs via WASI with near-zero overhead.
Part 4: POSIX Compatibility Layer
Why POSIX at All?
capOS is not POSIX and doesn’t want to be. But:
- Existing software. Most useful software assumes POSIX. A DNS resolver, an HTTP server, a database – all speak open()/read()/write()/socket(). Without some compatibility layer, every piece of software must be rewritten.
- Developer familiarity. Programmers know POSIX. A compatibility layer lowers the barrier to writing capOS software, even if native caps are better.
- Gradual migration. Port software first with POSIX compat, then incrementally convert to native capabilities for tighter sandboxing.
The goal is NOT full POSIX compliance. It’s a pragmatic translation layer that maps POSIX concepts to capabilities, enabling existing software to run with minimal modification while preserving capability-based security.
Architecture: libcapos-posix
Application (C/Rust, uses POSIX APIs)
│
│ open(), read(), write(), socket(), ...
│
v
libcapos-posix (POSIX-to-capability translation)
│
│ Maps fds to caps, paths to namespace lookups
│
v
libcapos (native capability invocation)
│
│ SQ/CQ ring + cap_enter syscall
│
v
Kernel (capability dispatch)
libcapos-posix is a static library that provides POSIX-like function
signatures. It is NOT libc – it doesn’t provide malloc (that’s the
allocator in capos-rt/libcapos), locale support, or the thousand other
things in glibc. It’s the ~50 syscall wrappers that matter for I/O.
File Descriptor Table
POSIX programs think in file descriptors. capOS has capabilities. The
translation is a per-process fd-to-cap mapping table inside libcapos-posix:
struct FdTable {
    entries: BTreeMap<i32, FdEntry>,
    next_fd: i32,
}

enum FdEntry {
    /// Backed by a Console cap (stdout/stderr)
    Console { cap_id: u32 },
    /// Backed by a Namespace + hash (opened "file")
    StoreObject { namespace_cap: u32, hash: Vec<u8>, cursor: usize },
    /// Backed by a TcpSocket cap
    TcpSocket { cap_id: u32 },
    /// Backed by a UdpSocket cap
    UdpSocket { cap_id: u32 },
    /// Backed by a TcpListener cap
    Listener { cap_id: u32 },
    /// Pipe (IPC channel between two caps)
    Pipe { read_cap: u32, write_cap: u32 },
}
On process startup, libcapos-posix pre-populates:
- fd 0 (stdin): if a Console or StdinReader cap is in the CapSet
- fd 1 (stdout): mapped to Console cap
- fd 2 (stderr): mapped to Console cap (or a separate Log cap)
Path Resolution
POSIX open("/etc/config.toml", O_RDONLY) becomes:
1. libcapos-posix looks up the process’s Namespace cap (from CapSet, name "fs" or "root")
2. Strips leading / (there is no global root – the namespace IS the root)
3. Calls namespace.resolve("etc/config.toml") to get a store hash
4. Calls store.get(hash) to retrieve the object data
5. Creates an FdEntry::StoreObject with cursor at 0
6. Returns the fd number
Relative paths work the same way – there’s no cwd concept by default, but
libcapos-posix can maintain a synthetic cwd string and prepend it.
Path scoping is automatic. If the process was granted a Namespace scoped
to "myapp/", then open("/data.db") resolves to "myapp/data.db" in the
store. The process can’t escape its namespace – there’s no .. traversal
because namespaces are flat prefix scopes, not hierarchical directories.
Supported POSIX Functions
Grouped by what capability backs them:
Console cap -> stdio:
| POSIX | capOS translation |
|---|---|
| write(1, buf, len) | console.write(buf[..len]) |
| write(2, buf, len) | console.write(buf[..len]) (or log cap) |
| read(0, buf, len) | stdin.read(buf, len) if stdin cap exists |
Namespace + Store caps -> file I/O:
| POSIX | capOS translation |
|---|---|
| open(path, flags) | namespace.resolve(path) -> store.get(hash) -> fd |
| read(fd, buf, len) | memcpy from cached store object at cursor |
| write(fd, buf, len) | buffer writes, flush to store.put() on close |
| close(fd) | if modified: store.put(data) -> namespace.bind(path, hash) |
| lseek(fd, off, whence) | update cursor in FdEntry |
| stat(path, buf) | namespace.resolve(path) -> synthesize stat from object metadata |
| unlink(path) | namespace.unbind(path) (object remains in store if referenced elsewhere) |
| opendir/readdir | namespace.list() filtered by prefix |
| mkdir(path) | no-op or create empty namespace prefix (namespaces are implicit) |
TcpSocket/UdpSocket caps -> networking:
| POSIX | capOS translation |
|---|---|
| socket(AF_INET, SOCK_STREAM, 0) | net_mgr.create_tcp_socket() -> fd |
| connect(fd, addr, len) | tcp_socket.connect(addr) |
| bind(fd, addr, len) | tcp_listener.bind(addr) |
| listen(fd, backlog) | no-op (listener cap is already listening) |
| accept(fd) | tcp_listener.accept() -> new fd |
| send(fd, buf, len, 0) | tcp_socket.send(buf[..len]) |
| recv(fd, buf, len, 0) | tcp_socket.recv(buf, len) |
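The fd-to-capability dispatch behind the socket rows can be sketched as follows. The enum mirrors the FdEntry shape above in miniature; the actual capability invocation over the ring is stubbed out, and the error names are the usual POSIX errno labels used illustratively.

```rust
// Model of send(): look up the fd, require a TcpSocket-backed entry,
// and forward to the backing capability. Wrong fd kinds fail with
// ENOTSOCK; unknown fds with EBADF, matching POSIX expectations.

use std::collections::BTreeMap;

enum FdEntry {
    Console { cap_id: u32 },
    TcpSocket { cap_id: u32 },
}

struct FdTable { entries: BTreeMap<i32, FdEntry> }

fn posix_send(table: &FdTable, fd: i32, buf: &[u8]) -> Result<usize, &'static str> {
    match table.entries.get(&fd) {
        Some(FdEntry::TcpSocket { cap_id }) => {
            // Real code: invoke tcp_socket.send(buf[..len]) via the ring.
            let _ = cap_id;
            Ok(buf.len())
        }
        Some(_) => Err("ENOTSOCK"), // fd exists but is not a socket
        None => Err("EBADF"),       // fd was never allocated
    }
}

fn main() {
    let mut entries = BTreeMap::new();
    entries.insert(1, FdEntry::Console { cap_id: 2 });
    entries.insert(3, FdEntry::TcpSocket { cap_id: 5 });
    let table = FdTable { entries };
    assert_eq!(posix_send(&table, 3, b"ping"), Ok(4));
    assert_eq!(posix_send(&table, 1, b"ping"), Err("ENOTSOCK"));
    assert_eq!(posix_send(&table, 9, b"ping"), Err("EBADF"));
}
```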
Not supported (returns ENOSYS or EACCES):
| POSIX | Why not |
|---|---|
| fork() | No address space cloning. Use posix_spawn() (maps to ProcessSpawner) |
| exec() | No in-place replacement. Use posix_spawn() |
| kill(pid, sig) | No signals. Future lifecycle work may add ProcessHandle kill semantics |
| chmod/chown | No permission bits. Authority is structural |
| mmap(MAP_SHARED) | No shared memory yet (future: SharedMemory cap) |
| ioctl | No device files. Use typed capability methods |
| ptrace | No debugging interface yet |
| pipe() | Possible via IPC caps, but not in initial version |
| select/poll/epoll | Requires async cap invocation (Stage 5+). Initial version is blocking only |
Process Creation Compatibility
capOS process creation is spawn-style, not fork/exec-style. A new process is a
fresh ELF instance selected by ProcessSpawner, with an explicit initial
CapSet assembled from granted capabilities. The parent address space is not
cloned, and an existing process image is not replaced in place.
posix_spawn() is the compatibility primitive for subprocess creation. A
libcapos-posix implementation maps it to ProcessSpawner.spawn(), translates
file actions into fd-table setup and capability grants, and passes argv and
environment data through the process bootstrap channel once that ABI exists.
Programs that use the common fork() followed immediately by exec() pattern
should be patched to call posix_spawn() directly.
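The translation step can be sketched as follows. All names here (`FileAction`, `CapGrant`, `plan_spawn`) are hypothetical illustrations of the mapping described above, not the real API: posix_spawn file actions become explicit entries in the child’s grant list instead of inherited descriptors.

```rust
// Sketch: turn the parent's fd->cap mapping plus POSIX file actions into
// the explicit grant list a ProcessSpawner-style spawn would need.
#[derive(Debug, PartialEq)]
enum FileAction {
    Dup2 { src_fd: i32, dst_fd: i32 }, // e.g. redirect child stdout
}

#[derive(Debug, PartialEq)]
struct CapGrant {
    name: String, // name the child sees in its initial CapSet
    cap_id: u32,  // parent's capability index backing that fd
}

fn plan_spawn(fd_caps: &[(i32, u32)], actions: &[FileAction]) -> Vec<CapGrant> {
    let mut grants = Vec::new();
    for a in actions {
        let FileAction::Dup2 { src_fd, dst_fd } = a;
        // Only fds the parent actually holds a cap for can be granted;
        // there is no ambient inheritance to fall back on.
        if let Some(&(_, cap)) = fd_caps.iter().find(|(fd, _)| fd == src_fd) {
            grants.push(CapGrant { name: format!("fd{}", dst_fd), cap_id: cap });
        }
    }
    grants
}

fn main() {
    // Parent: fd 1 backed by cap 7 (Console), fd 5 backed by cap 12 (a File).
    let fd_caps = [(1, 7u32), (5, 12u32)];
    // File action: the child's stdout should be the parent's fd 5.
    let grants = plan_spawn(&fd_caps, &[FileAction::Dup2 { src_fd: 5, dst_fd: 1 }]);
    assert_eq!(grants, vec![CapGrant { name: "fd1".into(), cap_id: 12 }]);
}
```

The point of the sketch is the direction of data flow: authority is enumerated up front by the parent, never discovered by the child.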
Full fork() is intentionally not a native kernel primitive. Supporting it
would require copy-on-write address-space cloning, parent/child register return
semantics, fd-table duplication, a per-capability inheritance policy, safe
handling for outstanding SQEs/CQEs, and defined behavior for endpoint calls,
timers, waits, and process handles that are in flight at the fork point.
Threaded POSIX processes add another constraint: only the calling thread is
cloned, while locks and async-signal-safe state must remain coherent in the
child.
If a concrete port needs more than posix_spawn(), the next step should be a
narrow compatibility shim with vfork()/fork-for-exec semantics backed by
ProcessSpawner, not a general kernel clone operation. That shim would suspend
the parent, restrict child actions to exec-or-exit, and avoid pretending that
arbitrary address-space cloning exists.
Security Model
The POSIX layer does NOT weaken capability security. Every POSIX call translates to a capability invocation on caps the process was actually granted:
open("/etc/passwd")fails if the process’s namespace doesn’t contain"etc/passwd"– not because of permission bits, but because the name doesn’t resolvesocket(AF_INET, SOCK_STREAM, 0)fails if the process wasn’t granted aNetworkManagercapfork()fails unconditionally – there’s no way to synthesize it from caps
A POSIX binary on capOS is more constrained than on Linux, not less. The compatibility layer provides familiar function signatures, not familiar authority.
Building POSIX-Compatible Binaries
```
my-app/
  Cargo.toml    # depends on capos-posix (which depends on capos-rt)
  src/main.rs   # uses libc-style APIs
```
Or for C:
```c
#include <capos/posix.h>  // open, read, write, close, socket, ...
#include <capos/capos.h>  // cap_call, capset_get, ...

int main(void) {
    // Works -- stdout is mapped to the Console cap
    write(1, "hello\n", 6);

    // Works -- if the "data" namespace cap was granted
    int fd = open("/config.toml", O_RDONLY);
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);

    // Works -- if a NetworkManager cap was granted
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    // ...
    return 0;
}
```
The linker pulls in libcapos-posix.a -> libcapos.a -> startup code.
Same ELF output, same kernel loader.
musl as a Base (Optional, Later)
For broader C compatibility (printf, string functions, math), libcapos-posix
can be layered under musl libc. musl has a clean
syscall interface – all system calls go through a single `__syscall()` function.
Replacing that function with capability-based dispatch gives you full libc on
top of capOS capabilities:
```c
// musl's syscall entry point -- replaced with capability-based dispatch.
// Arguments arrive through the variadic list, not named parameters.
long __syscall(long n, ...) {
    va_list ap;
    va_start(ap, n);
    long a = va_arg(ap, long);
    long b = va_arg(ap, long);
    long c = va_arg(ap, long);
    va_end(ap);

    switch (n) {
    case SYS_write:  return capos_write((int)a, (const void *)b, (size_t)c);
    case SYS_open:   return capos_open((const char *)a, (int)b, (int)c);
    case SYS_socket: return capos_socket((int)a, (int)b, (int)c);
    // ...
    default: return -ENOSYS;
    }
}
```
This is the same approach Fuchsia uses with fdio + musl, and Redox OS uses
with relibc. It works and it gives you printf, fopen, getaddrinfo, and
most of the C standard library.
Priority: after native capos-rt and libcapos are stable. musl integration is a significant engineering effort and should only be done when there’s actual software to port.
Part 5: WASI as an Alternative to POSIX
Why WASI Fits capOS Better Than POSIX
WASI (WebAssembly System Interface) was designed from the start as a capability-based system interface. Its concepts map almost directly to capOS:
| WASI concept | capOS equivalent |
|---|---|
| fd (pre-opened directory) | Namespace cap |
| fd (socket) | TcpSocket/UdpSocket cap |
| `fd_write` on stdout | `Console.write()` |
| Pre-opened dirs at startup | CapSet at spawn |
| No ambient filesystem access | No ambient authority |
| `path_open` scoped to pre-opened dir | `namespace.resolve()` scoped to granted prefix |
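To make the mapping concrete, here is a host-side model of WASI-style pre-open resolution. The `path_open` function below is a simplified stand-in for the real WASI call, not an implementation of it:

```rust
/// WASI path_open is always relative to a pre-opened directory fd --
/// in capOS terms, exactly a Namespace cap granted at spawn.
fn path_open(preopens: &[(u32, &str)], dirfd: u32, path: &str) -> Option<String> {
    // Reject absolute paths and traversal: WASI has no ambient root.
    if path.starts_with('/') || path.split('/').any(|c| c == "..") {
        return None;
    }
    preopens
        .iter()
        .find(|(fd, _)| *fd == dirfd)
        .map(|(_, prefix)| format!("{}/{}", prefix, path))
}

fn main() {
    // fd 3 pre-opened on the "config" namespace at startup (CapSet at spawn).
    let preopens = [(3u32, "config")];
    assert_eq!(path_open(&preopens, 3, "app.toml"), Some("config/app.toml".to_string()));
    assert_eq!(path_open(&preopens, 3, "../secret"), None); // no escape
    assert_eq!(path_open(&preopens, 9, "app.toml"), None);  // no such preopen
}
```

This is why the adapter is thin: both sides already express file access as "name, relative to a granted scope".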
WASI programs already assume they get no ambient authority. A WASI binary compiled for capOS would need essentially zero translation for the security model – just a thin ABI adapter.
Architecture: Wasm Runtime as a capOS Service
```
WASI binary (.wasm)
        │
        │ WASI syscalls (fd_read, fd_write, path_open, ...)
        │
        v
wasm-runtime process (Wasmtime/wasm-micro-runtime, native capOS binary)
        │
        │ Translates WASI calls to capability invocations
        │ Each wasm instance gets its own CapSet
        │
        v
libcapos (native capability invocation)
        │
        v
Kernel
```
The wasm runtime is itself a native capOS process. It receives caps from its parent and partitions them among the wasm modules it hosts. This gives you:
- Language independence. Any language that compiles to WASI (Rust, C, C++, Go, Python, JS, …) runs on capOS
- Sandboxing for free. Wasm’s memory isolation + capOS capability scoping
- No porting effort for software that already targets WASI
- Density. Multiple wasm modules in one process, each with different caps
WASI vs Native Performance
Wasm adds overhead: bounds-checked memory, indirect calls, and only limited access to hardware features such as SIMD. For system services (drivers, network stack), native Rust is the right choice. For application-level code (business logic, CLI tools, web services), wasm overhead is acceptable and the portability is worth it.
WASI Implementation Phases
Phase 1: wasm-micro-runtime as a capOS service. WAMR
is a lightweight C wasm runtime designed for embedded/OS use. Build it as a
native capOS C binary (via libcapos). Support fd_write (Console),
proc_exit, and args_get – enough to run “hello world” wasm modules.
Phase 2: WASI filesystem via Namespace. Map WASI path_open/fd_read/
fd_write to Namespace + Store caps. Pre-opened directories become
Namespace caps.
Phase 3: WASI sockets. Map WASI socket APIs to TcpSocket/UdpSocket caps.
Phase 4: WASI component model. WASI preview 2 components can expose and consume typed interfaces. These map naturally to capOS capability interfaces – a wasm component that exports an HTTP handler becomes a capability that other processes can invoke.
Part 6: Putting It All Together – Porting Strategy
Spectrum of Integration
```
Most native                                                  Most compatible
     |                                                              |
     v                                                              v
Native Rust       C with libcapos    C with POSIX layer    WASI binary
(capos-rt)        (typed caps)       (libcapos-posix)      (wasm runtime)
- Best perf       - Good perf        - Familiar API        - Any language
- Full cap        - Full cap         - Auto sandboxing     - Auto sandboxing
  control           control            via cap scoping       via wasm + caps
- Most work       - Moderate work    - Least rewrite       - Zero rewrite
  to write          to write           for existing C        for WASI targets
```
Example: Porting a DNS Resolver
Native Rust: Rewrite using capos-rt. Receives UdpSocket cap, serves
DNS lookups as a DnsResolver capability. Other processes get a
DnsResolver cap instead of calling getaddrinfo(). Clean, typed, minimal
authority.
C with POSIX layer: Take an existing DNS resolver (e.g., musl’s
getaddrinfo implementation or a standalone resolver). Compile against
libcapos-posix. Give it a UdpSocket cap and a Namespace cap for
/etc/resolv.conf. It calls socket(), sendto(), recvfrom() – all
translated to cap invocations. Works with minimal changes, but can’t export
a typed DnsResolver cap (it speaks POSIX, not caps).
WASI: Compile a Rust DNS resolver to WASI. Run it in the wasm runtime. Same capability scoping, but through the wasm sandbox.
Recommended Approach for capOS
- System services: native Rust only. Drivers, network stack, store, init – these are the foundation and must use capabilities natively. No POSIX layer here.
- First applications: native Rust. While the ecosystem is young, applications should use `capos-rt` directly. This validates the cap model.
- C compatibility: when porting specific software. Don’t build the POSIX layer speculatively. Build it when there’s a specific C program to port (e.g., a DNS resolver, an HTTP server, a database). Let real porting needs drive which POSIX functions to implement.
- WASI: as the general-purpose application runtime. Once the native runtime is stable, the wasm runtime becomes the “run anything” answer. Lower priority than native Rust, but higher priority than full POSIX/musl compat, because WASI’s capability model is a natural fit.
Part 7: Schema Extensions
New schema types needed for the userspace runtime:
```capnp
# Extend schema/capos.capnp

struct InitialCaps {
  entries @0 :List(InitialCapEntry);
}

struct InitialCapEntry {
  name @0 :Text;
  id @1 :UInt32;
  interfaceId @2 :UInt64;
}

interface ProcessSpawner {
  spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}

struct CapGrant {
  name @0 :Text;
  capId @1 :UInt32;
  interfaceId @2 :UInt64;
}

interface ProcessHandle {
  wait @0 () -> (exitCode :Int64);
}
```
These definitions now live in schema/capos.capnp as the single source of
truth. spawn() returns the ProcessHandle through the ring result-cap list;
handleIndex identifies that transferred cap in the completion. The first
slice passes a boot-package binaryName instead of raw ELF bytes so spawn
requests stay inside the bounded ring parameter buffer; manifest-byte exposure
and bulk-buffer spawning remain later work. kill, post-spawn grants, and
exported-cap lookup are deferred until their lifecycle semantics are
implemented.
Implementation Phases
Phase 1: capos-rt (parallel with Stage 4)
- Create the `capos-rt/` crate (`no_std` + `alloc`, path dependency)
- Implement syscall wrappers (`sys_exit`, `sys_cap_enter`) and ring helpers
- Implement CapSet parsing from the well-known page
- Implement the typed Console wrapper (first cap used from userspace)
- Rewrite `init/` to use capos-rt
- Add the entry-point macro, panic handler, and allocator setup
Deliverable: init prints “Hello” via Console cap invocation through capos-rt, not raw asm.
Phase 2: Service binaries (after Stage 6)
- Add capnp codegen to capos-rt build.rs (shared with kernel)
- Implement typed wrappers for all schema-defined caps
- Build the first multi-process demo: init spawns server + client, client invokes server cap
- Establish the pattern for service binaries (Cargo.toml template, linker script, build integration)
Deliverable: two userspace processes communicate via typed capabilities.
Phase 3: libcapos for C (after Phase 2)
- Expose capos-rt functionality via an `extern "C"` API
- Write the `capos.h` header
- Build system support for C userspace binaries (linker script, startup)
- Port one small C program as validation
Deliverable: a C “hello world” using console_write_line().
Phase 4: POSIX compatibility (driven by need)
- Implement FdTable and path resolution
- Start with file I/O (open/read/write/close over Namespace + Store)
- Add socket wrappers when networking is userspace
- Optionally integrate musl for full libc
Deliverable: an existing C program (e.g., a simple HTTP server) runs on capOS with minimal source changes.
Phase 5: WASI runtime (after Phase 3)
- Build wasm-micro-runtime as a native capOS C binary
- Map WASI fd_write/proc_exit to caps
- Extend to filesystem and socket WASI APIs
- Run a “hello world” wasm module
Deliverable: hello.wasm runs on capOS.
Open Questions
- Allocator strategy. Should the userspace heap be a fixed-size region (simple, but limits memory), or should it grow by invoking a FrameAllocator cap (flexible, but every allocation might syscall)? Likely answer: fixed initial region + grow-on-demand via cap.
- Async I/O. The SQ/CQ ring is inherently asynchronous (submit SQEs, poll CQEs), but the initial `capos-rt` wrappers provide blocking convenience (submit one CALL SQE + `cap_enter(1, MAX)`). Real services need batched async patterns. Options:
  - Submit multiple SQEs, poll CQEs in an event loop (io_uring style)
  - Green threads in capos-rt, each blocking on its own `cap_enter`
  - Userspace executor (like tokio) driving the ring
- Cap passing in POSIX layer. POSIX has `SCM_RIGHTS` for passing fds over Unix sockets. Should the POSIX layer support something similar for passing caps? Or is this native-only?
- Dynamic linking. Currently all binaries are statically linked. Should capOS support shared libraries? Probably not initially – static linking is simpler and the binaries are small. Revisit if binary size becomes a concern.
- WASI component model integration. WASI preview 2 components have typed imports/exports that could map to capnp interfaces. Should the wasm runtime auto-generate capnp-to-WIT adapters from schemas? This would let wasm components participate natively in the capability graph.
- Build system. How are userspace binaries packed into the boot image? Currently the Makefile builds `init/` separately. With multiple service binaries, a more scalable approach is needed (a build manifest that lists all binaries, and a Makefile target that builds and packs them all).
Relationship to Other Proposals
- Service architecture proposal – defines what services exist and how they compose. This proposal defines how those service binaries are built, what runtime they use, and how non-Rust software fits in.
- Storage and naming proposal – the POSIX `open()`/`read()`/`write()` translation targets the Store and Namespace caps defined there.
- Networking proposal – the POSIX socket translation targets the TcpSocket/UdpSocket caps from the network stack.
Proposal: Native Shell, Agent Shell, and POSIX Shell
How interactive operation should work on capOS without reintroducing ambient authority through a Unix-like command line.
Problem
capOS deliberately avoids global paths, inherited file descriptors, ambient network access, and process-wide privilege bits. A conventional shell assumes all of those. If capOS copied a Unix shell model directly, the shell would either be mostly useless or become an ambiently privileged escape hatch around the capability model.
The system needs three related, but distinct, shell layers:
- Native shell: schema-aware capability REPL and scripting language.
- Agent shell: natural-language planning layer over the native shell.
- POSIX shell: compatibility personality for existing programs and scripts.
All three must be ordinary userspace processes. None of them should receive special kernel privilege. The kernel and trusted capability-serving processes remain the enforcement boundary.
The first boot-to-shell milestone is text-only: local console login/setup and, later in the same family, a browser-hosted terminal gateway. Graphical shells, desktop UI, compositors, and GUI app launchers are a later tier. See boot-to-shell-proposal.md.
Design Principles
- A shell starts with only the capabilities it was granted.
- Natural language is not authority.
- A shell command compiles to typed capability calls, not stringly-typed syscalls.
- Child processes receive explicit grants. There is no implicit inheritance of the shell’s full authority.
- Elevation is a capability request mediated by a trusted broker, not a flag inside the shell.
- Shell startup is a workload launch from a `UserSession`, service principal, or recovery profile. Session metadata informs policy and audit; it is not authority.
- Default interactive cap sets are broker-issued session bundles, not hard-coded shell privileges.
- POSIX behavior is an adapter over scoped `Directory`, `File`, socket factory, and process capabilities. It is not the native authority model.
User identity and policy sit above this shell model. A shell session may be
associated with a human, service, guest, anonymous, or pseudonymous principal,
but the session’s capabilities remain the authority. RBAC, ABAC, and mandatory
policy decide which scoped caps a broker may grant; they do not create a
kernel-side uid, role bit, or label check on ordinary capability calls. See
user-identity-and-policy-proposal.md.
Layering
```mermaid
flowchart TD
    Input[Login, guest, anonymous, or service request] --> SessionMgr[SessionManager]
    SessionMgr --> Session[UserSession metadata cap]
    Session --> Broker[AuthorityBroker / PolicyEngine]
    Broker --> Bundle[Scoped session cap bundle]
    Bundle --> Agent[Agent shell]
    Bundle --> Native[Native shell]
    Bundle --> Posix[POSIX shell]
    Agent --> Plan[Typed action plan]
    Plan --> Native
    Posix --> Compat[POSIX compatibility runtime]
    Native --> Ring[capos-rt capability transport]
    Compat --> Ring
    Ring --> Kernel[Kernel cap ring]
    Ring --> Services[Userspace services]
    Agent --> Approval[Approval client cap]
    Approval --> Broker
    Broker --> Services
    Broker --> Audit[AuditLog]
```
The native shell is the primitive interactive surface. The agent shell emits native-shell plans after inspecting available schemas, current caps, and the session-bound policy context exposed to it. The POSIX shell is a compatibility consumer of capOS capabilities, not the model other shells are built on.
A shell may display a principal name, profile, role set, label, or POSIX UID,
but those values are descriptive unless a trusted broker uses them to return a
specific capability. Losing a home, logs, launcher, or approval cap
cannot be repaired by presenting the same session ID back to the kernel.
Native Shell
The native shell is a typed capability graph operator. Its job is to inspect, invoke, pass, attenuate, release, and trace capabilities.
Example init or development session with explicit spawn authority:
```
capos:init> caps
  log    Console
  spawn  ProcessSpawner
  boot   BootPackage
  vm     VirtualMemory
capos:init> call @log.writeLine({ text: "hello" })
ok
capos:init> spawn "tls-smoke" with {
  log: @log
} -> $child
started pid 12
capos:init> wait $child
exit 0
```
Values
Native shell values should include:
- `@name`: a named capability in the current shell context.
- `$name`: a local value, result, promise, or process handle.
- structured values: text, bytes, integers, booleans, lists, and structs.
- result-cap values returned through the capOS transfer-result path.
- trace values representing CQE and call-history slices.
The shell should preserve interface metadata with every capability value. A method call is valid only if the target cap exposes the method’s schema.
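A sketch of that validation rule, with hypothetical `Value` and schema-registry shapes (the real shell would consult a SchemaRegistry service):

```rust
/// Shell values: capability values carry their interface metadata, so a
/// method call can be validated before anything reaches the ring.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Cap { name: String, interface_id: u64 }, // @name
    Local(i64),                              // $name (simplified)
    Text(String),
}

/// A call is valid only if the target cap's interface exposes the method.
fn validate_call(v: &Value, schema: &[(u64, &str)], method: &str) -> bool {
    match v {
        Value::Cap { interface_id, .. } => schema
            .iter()
            .any(|(id, m)| id == interface_id && *m == method),
        _ => false, // only capability values are callable
    }
}

fn main() {
    // Hypothetical schema registry: interface id -> method name.
    let schema = [(0x1111u64, "writeLine"), (0x2222u64, "spawn")];
    let log = Value::Cap { name: "log".into(), interface_id: 0x1111 };
    assert!(validate_call(&log, &schema, "writeLine"));
    assert!(!validate_call(&log, &schema, "spawn")); // wrong interface
    assert!(!validate_call(&Value::Text("x".into()), &schema, "writeLine"));
}
```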
Commands
Initial commands should be small and explicit:
```
caps
inspect @log
methods @spawn
call @log.writeLine({ text: "boot complete" })
spawn "ipc-server" with { log: @log, ep: @serverEp } -> $server
wait $server
release @temporary
trace $server
bind scratch = @store.sub("scratch")
derive readonly = @home.sub("config").readOnly()
```
`inspect` should show the interface ID, label, transferability, revocation
state when available, and callable methods. It should not imply that two caps
with the same interface ID are the same authority.
Syntax
The syntax should be structured rather than shell-token based. A CUE-like or Cap’n-Proto-literal-like shape fits capOS better than POSIX word splitting:
spawn "net-stack" with {
log: @log
nic: @virtioNic
timer: @timer
}
The shell can still provide abbreviations, but the executable representation
should be an ActionPlan object with typed fields.
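A minimal sketch of what "compiles to an `ActionPlan` object" could mean. The field names here are hypothetical illustration, not the proposal's schema:

```rust
// The shell front-end may accept terse syntax, but what executes is a
// typed plan: named steps with explicit, enumerable grants.
#[derive(Debug, PartialEq)]
struct SpawnStep {
    binary: String,
    grants: Vec<(String, String)>, // (name the child sees, shell cap reference)
}

#[derive(Debug, PartialEq)]
struct ActionPlan {
    steps: Vec<SpawnStep>,
}

fn compile_spawn(binary: &str, grants: &[(&str, &str)]) -> ActionPlan {
    ActionPlan {
        steps: vec![SpawnStep {
            binary: binary.to_string(),
            grants: grants
                .iter()
                .map(|(n, c)| (n.to_string(), c.to_string()))
                .collect(),
        }],
    }
}

fn main() {
    // The spawn command from the syntax example, as a typed plan.
    let plan = compile_spawn(
        "net-stack",
        &[("log", "@log"), ("nic", "@virtioNic"), ("timer", "@timer")],
    );
    assert_eq!(plan.steps[0].binary, "net-stack");
    assert_eq!(plan.steps[0].grants.len(), 3);
}
```

Because the executable form is a data structure, the same plan can be shown to the user, validated against schemas, logged, or submitted to an approval broker unchanged.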
Composition
Native composition should pass typed caps or structured values, not inherited byte streams by default:
```
pipe @camera.frames()
  |> spawn "resize" with { input: $, width: 640, height: 480 }
  |> spawn "jpeg-encode" with { input: $, quality: 85 }
  |> call @photos.write({ name: "frame.jpg", data: $ })
```
If a byte stream is desired, it should be explicit through a ByteStream,
File, or POSIX adapter capability. This keeps the “pipe” operator from
silently turning every interface into untyped bytes.
Namespaces
There is no global root. A native shell may have a current Directory or
Namespace capability, but that is just a default argument:
```
capos:user> ls @config
services
network
capos:user> cd @config.sub("services")
capos:@config/services> ls
logger
net-stack
```
The shell cannot traverse above a scoped directory or namespace unless it holds another capability that names that authority.
Session Context
A session-aware shell may hold a self or session cap for UserSession.info()
and audit context. That cap is metadata. It can identify the principal, auth
strength, expiry, quota profile, and audit identity, but it cannot widen the
shell’s CapSet or authorize kernel operations by itself.
The launcher or supervisor starts the shell with a CapSet returned by
AuthorityBroker(session, profile). For interactive work, that bundle should
usually include scoped terminal, home, logs, launcher, status, and approval
caps. For service accounts, guest sessions, anonymous workloads, and recovery
mode, the broker returns different bundles under explicit policy profiles.
Shell-launched children inherit only the caps named in the spawn plan. A child
may receive a UserSession or session badge for audit, per-client quotas, or
service-side selection, but object access still comes from the scoped object
caps passed to that child.
Agent Shell
The agent shell is a natural-language planner that emits typed native-shell plans. It should not directly own broad administrative authority.
Example:
```
capos:init> start the IPC demo, give the client only the server endpoint and console, then wait for both
```
The agent produces a plan:
```
Plan:
  1. Spawn "ipc-server" with:
     - log: Console
     - ep: Endpoint(owner)
  2. Spawn "ipc-client" with:
     - log: Console
     - server: Endpoint(client facet from server)
  3. Wait for both ProcessHandle caps.
Required authority:
  - ProcessSpawner
  - Console
  - Endpoint owner cap
  - BootPackage binary access
```
Only after validation does the plan execute. Validation checks the current cap set, schema method IDs, transferability, grant names, quota limits, and policy.
What the Agent Adds
The useful AI-specific behavior is not raw command execution. It is:
- intent decomposition into spawn, grant, wait, trace, and release steps.
- schema-aware parameter construction.
- least-authority grant selection.
- explanation of missing capabilities.
- diagnosis from structured errors, CQEs, logs, and process handles.
- conversion of vague requests into an explicit plan that can be audited.
- retry after typed failures without bypassing policy.
The agent should reason over capOS objects and schemas, not over an unbounded shell prompt.
Minimal Daily Cap Set
The daily-use agent shell should start with the user-identity proposal’s
session bundle, minted by AuthorityBroker for one UserSession and profile:
```
terminal   TerminalSession or Console
self       self/session introspection
status     read-only SystemStatus
logs       read-only LogReader scoped to this user/session
home       Directory or Namespace scoped to user data
launcher   restricted launcher for approved user applications
approval   ApprovalClient
```
It should not receive these by default:
```
ProcessSpawner(all)
BootPackage(all)
DeviceManager
StoreAdmin
FrameAllocator
VirtualMemory for other processes
raw networking caps
global service supervisor caps
```
The shell can ask for more authority, but it cannot mint that authority for itself.
Guest and anonymous profiles should receive narrower variants. A guest shell
may get terminal, tmp, and a restricted launcher, while an anonymous
workload normally receives short-lived purpose caps, strict quotas, and no
durable home namespace. An approval path exists only when the profile policy
explicitly grants one.
Approval and Authentication
Elevation belongs in a trusted broker service that is outside the model-controlled agent process.
Conceptual interfaces:
```capnp
interface ApprovalClient {
  request @0 (
    reason :Text,
    plan :ActionPlan,
    requestedCaps :List(CapRequest),
    durationMs :UInt64
  ) -> (grant :ApprovalGrant);
}

enum ApprovalState {
  pending @0;
  approved @1;
  denied @2;
  expired @3;
}

interface ApprovalGrant {
  state @0 () -> (state :ApprovalState, reason :Text);
  claim @1 () -> (caps :List(GrantedCap));
  cancel @2 () -> ();
}

interface AuthorityBroker {
  request @0 (
    session :UserSession,
    plan :ActionPlan,
    requestedCaps :List(CapRequest),
    durationMs :UInt64
  ) -> (grant :ApprovalGrant);
}
```
The agent shell holds only a session-bound ApprovalClient. It does not submit
arbitrary PrincipalInfo, role, UID, label values, or authentication proofs as
authority. The ApprovalClient forwards the bound UserSession and typed
request to AuthorityBroker. The broker or a consent service wrapping it holds
powerful caps, drives any trusted consent or step-up authentication path, and
mints attenuated temporary caps after policy and authentication checks.
The conceptual API intentionally has no authProof argument on the
agent-visible path. If a proof is needed, it is collected by SessionManager,
the broker, or a trusted approval UI and reflected back to the agent only as
pending, approved, denied, or expired.
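The agent-visible grant lifecycle can be modeled as a small state machine. The types below mirror the conceptual interfaces above but are an illustrative host-side sketch, not the service implementation:

```rust
/// The only states the agent ever observes; auth proofs stay broker-side.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ApprovalState { Pending, Approved, Denied, Expired }

struct Grant {
    state: ApprovalState,
    caps: Vec<u32>, // attenuated temporary cap ids, claimable once approved
}

impl Grant {
    /// Claiming succeeds only after the broker has flipped the grant to
    /// Approved; otherwise the current state is reflected back.
    fn claim(&mut self) -> Result<Vec<u32>, ApprovalState> {
        match self.state {
            ApprovalState::Approved => Ok(std::mem::take(&mut self.caps)),
            s => Err(s), // nothing claimable until the broker approves
        }
    }
}

fn main() {
    let mut g = Grant { state: ApprovalState::Pending, caps: vec![42] };
    assert_eq!(g.claim(), Err(ApprovalState::Pending));
    g.state = ApprovalState::Approved; // broker flips state after consent
    assert_eq!(g.claim(), Ok(vec![42]));
    assert_eq!(g.claim(), Ok(vec![])); // caps are handed over at most once
}
```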
Elevation Flow
User request:

```
restart the network stack
```
Agent plan:

```
Requested action:
  - stop service "net-stack"
  - spawn "net-stack"
  - grant: nic, timer, log
  - wait for health check
Missing authority:
  - ServiceSupervisor(net-stack)
Requested duration:
  - 60 seconds
```
Broker decision:
- Which `UserSession` and profile is this request bound to?
- Is that principal/profile allowed to restart `net-stack`?
- Is the requested binary allowed?
- Are the requested grants narrower than policy permits?
- Do mandatory confidentiality and integrity constraints allow the grant?
- Is there fresh user presence?
- Does this require step-up authentication?
If approved, the broker returns a narrow leased capability:
```
supervisor: ServiceSupervisor(service="net-stack", expires=60s)
```
It should not return broad ProcessSpawner, BootPackage, or DeviceManager
authority when a scoped supervisor cap can do the job.
Authentication
Authentication proof should be consumed by the SessionManager or broker
boundary, not exposed as a secret to the agent. Suitable mechanisms include:
- password or PIN for medium-risk local actions.
- hardware key or WebAuthn-style challenge for administrative actions.
- TPM-backed local presence for device or boot-policy operations.
- multi-party approval for destructive policy, storage, or recovery actions.
The model should never receive raw tokens, private keys, recovery codes, or full environment dumps.
Agent Hardening
The agent shell must treat files, logs, web pages, service output, and CQE payloads as untrusted data. They are not instructions.
Required behavior:
- show an executable typed plan before authority-changing actions.
- keep elevated caps leased, narrow, and short-lived.
- release temporary caps after the plan finishes or fails.
- audit every approval request, grant, cap transfer, and release.
- require exact targets for destructive actions.
- refuse broad phrases such as “give it everything” unless a trusted policy explicitly allows a named emergency mode.
- keep model memory separate from secrets and authentication proofs.
The enforcement rule is simple: the model may plan, explain, and request. Capabilities decide what can happen.
POSIX Shell
The POSIX shell is a compatibility layer for existing software and scripts. It should be useful, but it should not define native capOS administration.
Mapping
POSIX concepts map onto granted capabilities:
| POSIX concept | capOS backing |
|---|---|
| `/` | synthetic root built from granted Directory or FileServer caps |
| cwd | current scoped Directory cap |
| fd | local handle to File, ByteStream, pipe, terminal, or socket cap |
| pipe | ByteStream pair or userspace pipe service |
| `PATH` | search inside the synthetic root or a command registry cap |
| `exec` | ProcessSpawner or restricted launcher cap |
| sockets | socket factory caps such as TcpProvider or HttpEndpoint |
| uid, gid, user, group | synthetic POSIX profile derived from session metadata |
| `$HOME` | path alias backed by a granted home directory or namespace cap |
| `/etc/passwd`, `/etc/group` | profile service view, scoped to the compatibility environment |
| env vars | data only; never authority by themselves |
If a POSIX process has no network cap, connect() fails. If it has no
directory mounted at /etc, opening /etc/resolv.conf fails. If it has no
device cap, /dev is empty or synthetic.
A POSIX shell is launched with both a CapSet and compatibility profile metadata. The profile controls what legacy APIs report. The CapSet controls what the process can actually do.
Compatibility Limits
Exact Unix semantics should not be promised early.
- Prefer `posix_spawn` over full `fork` for the first implementation. `fork` with arbitrary shared process state can be emulated later if needed.
- `setuid` cannot grant caps. At most it asks a compatibility broker to replace the POSIX profile or launch a new process with a different broker-issued cap bundle.
- Mode bits and ownership metadata do not create authority. `chmod` can modify filesystem metadata exposed by a filesystem service, but it cannot grant caps outside that service’s policy.
- `/proc` is a debugging service view, not kernel ambient introspection.
- Device files exist only when a capability-backed adapter deliberately exposes them.
This is enough for many build tools and CLI programs without making POSIX the security model.
POSIX Session Caps
A normal POSIX shell session might receive:
```
terminal     TerminalSession
session      UserSession metadata
profile      POSIX profile view
root         Directory or FileServer synthetic root
launcher     restricted ProcessSpawner/command launcher
pipeFactory  ByteStream factory
clock        Timer
```
Optional caps:
```
tcp   scoped socket provider
home  writable user Directory
tmp   temporary Directory
proc  read-only process inspection tree
```
Administrative caps still require broker-mediated approval.
Recovery Shell
A recovery shell is a separate policy profile, not the normal agent shell with hidden extra privileges. It may receive a larger cap set, but only after strong local authentication and with full audit logging. Guest and anonymous profiles must not fall into recovery authority by omission.
Possible recovery bundle:
```
console
boot package read
system status read
service supervisor for critical services
read-only storage inspection
scoped repair caps
approval client
```
Destructive recovery operations should still go through exact-target approval. The recovery shell should be local-only unless a separate remote recovery policy explicitly grants network access.
Required Interfaces
This proposal implies several service interfaces beyond the current smoke-test surface:
- `UserSession` / `SessionManager`: principal/session metadata, audit context, and guest or anonymous profile creation (user identity proposal).
- `TerminalSession`: structured terminal I/O, window size, paste boundaries.
- `SchemaRegistry`: maps interface IDs to method names and parameter schemas.
- `CommandRegistry`: optional registry of native command capabilities.
- `SystemStatus`: read-only process and service status.
- `LogReader`: scoped log access.
- `ServiceSupervisor`: restart/status authority for one service or subtree.
- `AuthorityBroker` / `ApprovalClient`: session-bound base bundles, plan-specific leased grants, and policy/authentication mediation.
- `CredentialStore`, `ConsoleLogin`, and `WebShellGateway`: boot-to-shell authentication services for password-verifier setup, passkey registration, and text terminal launch (boot-to-shell proposal).
- `AuditLog`: append-only record of plans, approvals, grants, and releases.
- `POSIXProfile` / compatibility broker: synthetic UID/GID, names, `$HOME`, cwd, and profile replacement without treating POSIX metadata as authority.
- `ByteStream` / pipe factory: explicit byte-stream composition for POSIX and selected native pipelines.
These should be ordinary capabilities. A shell only sees the subset it has been granted.
Implementation Plan
1. Native serial shell
   - Built on capos-rt.
   - Lists initial CapSet entries.
   - Invokes typed Console methods.
   - Spawns and waits on boot-package binaries through ProcessSpawner.
   - Provides caps, inspect, call, spawn, wait, release, and trace.
2. Session-aware shell profile
   - Use the SessionManager -> UserSession metadata and AuthorityBroker(session, profile) -> cap bundle split.
   - Add self/session introspection without making identity metadata authoritative.
   - Start with guest, local-presence, and service-account profiles before durable account storage exists.
3. Structured native scripting
   - Add typed variables, result-cap binding, and plan serialization.
   - Add schema registry support for method names and argument validation.
   - Add explicit byte-stream adapters for commands that need text streams.
4. Approval broker
   - Define ActionPlan, CapRequest, ApprovalClient, and leased grant records.
   - Add local authentication and audit logging.
   - Make administrative native-shell operations request scoped caps through the broker instead of running from a permanently privileged shell.
5. Boot-to-shell integration
   - Add local console login/setup in front of the native shell.
   - Require a configured password verifier when one exists.
   - Enter setup mode when no console password verifier exists.
   - Treat guest as an explicit local profile and anonymous as a separate remote/programmatic profile, not as missing-password fallbacks.
   - Support passkey-only web terminal setup through local/bootstrap authority, not unauthenticated remote first use.
6. Agent shell
   - Natural-language frontend that emits native ActionPlan objects.
   - Starts with the broker-issued minimal daily session bundle.
   - Uses the approval broker for elevation.
   - Treats all external content as untrusted data.
7. POSIX shell
   - Implement after Directory/File, ByteStream, and restricted process launch exist.
   - Start with posix_spawn, fd table emulation, cwd, scoped root, pipes, and terminal I/O, plus synthetic POSIX profile metadata.
   - Add broader compatibility only as real workloads demand it.
Non-Goals
- No global root namespace.
- No shell-owned root/admin bit.
- No model-visible secrets.
- No default inheritance of all shell caps into children.
- No authorization from PrincipalInfo, UID/GID, role, or label values alone.
- No promise that POSIX scripts observe exact Unix behavior without a compatibility profile that grants the needed caps.
Open Questions
- Should the native shell syntax be CUE-derived, Cap’n-Proto-literal-derived, or a smaller custom grammar?
- How should schema reflection be packaged before a full runtime SchemaRegistry exists?
- What is the first minimal TerminalSession interface beyond Console?
- Should approval be synchronous only, or can long-running agent plans request staged approvals?
- How should audit logs be stored before persistent storage exists?
Proposal: Boot to Shell
How capOS should move from “boot runs smokes and halts” to an authenticated, text-only interactive shell without weakening the capability model.
Problem
The current boot path is still a systems bring-up path. It starts fixed services, proves kernel and userspace invariants, and exits cleanly. That is useful for validation, but it is not an operating environment.
The first interactive milestone should be deliberately modest:
- Boot QEMU or a local machine to a text console login/setup prompt.
- Start a native capability shell after local authentication or first-boot setup.
- Offer a browser-hosted text terminal later in the same milestone family, with WebAuthn/passkey authentication.
- Keep graphical shells, desktop UI, window systems, and app launchers as a later tier.
The risk is that “make it interactive” tends to smuggle ambient authority back
into the system. A login prompt must not become a kernel uid, a web terminal
must not become an unaudited remote root shell, and first-boot setup must not
be a first-remote-client-wins race.
Scope
In scope:
- Serial/local text console login and first-boot credential setup.
- Native text shell as the post-login workload.
- Minimal SessionManager, CredentialStore, AuthorityBroker, and AuditLog pieces needed to launch that shell with an explicit CapSet.
- Password verifier records stored with a memory-hard password hash.
- Passkey registration and authentication for a web text shell.
- A passkey-only account path that does not require creating a password first.
- Local recovery/setup policy for machines with no credential records.
Out of scope:
- Graphical shell, desktop session, compositor, GUI app launcher, clipboard, or remote desktop.
- POSIX /bin/login, PAM, sudo, su, or Unix uid/gid semantics.
- Password reset by policy fiat. Recovery is a separate authenticated setup or operator action.
- Making authentication proofs visible to the shell, agent, logs, or ordinary application processes.
Design Principles
- Authentication creates a UserSession; capabilities remain the authority.
- The shell is an ordinary process launched with a broker-issued CapSet.
- Console authentication and web authentication feed the same session model.
- Passwords are verified against versioned password-verifier records; raw passwords are never stored, logged, or passed to the shell.
- Passkeys store public credential material only; private keys stay in the authenticator.
- First-boot setup requires local setup authority or an explicitly configured bootstrap credential. Remote first-come setup is not acceptable.
- A missing credential store does not imply an unlocked system.
- Guest and anonymous sessions are explicit policy profiles, not fallbacks for missing credentials.
- Development images may have an explicit insecure profile, but that must be visible in the manifest and serial output.
Architecture
The boot-to-shell path is a userspace service graph started by init after
the manifest executor milestone is complete:
```mermaid
flowchart TD
    Kernel[kernel starts init only]
    Init[init manifest executor]
    Boot[BootPackage]
    Cred[CredentialStore]
    Session[SessionManager]
    Broker[AuthorityBroker]
    Audit[AuditLog]
    Console[ConsoleLogin]
    Web[WebShellGateway]
    Launcher[RestrictedShellLauncher]
    Shell[Native text shell]
    Kernel --> Init
    Init --> Boot
    Init --> Cred
    Init --> Session
    Init --> Broker
    Init --> Audit
    Init --> Console
    Init --> Web
    Console --> Session
    Web --> Session
    Session --> Broker
    Broker --> Launcher
    Launcher --> Shell
    Cred --> Session
    Audit --> Session
    Audit --> Broker
```
init owns broad boot authority long enough to start the authentication and
session services. It should not spawn the interactive shell directly with broad
boot caps. The broker returns a narrow shell bundle such as:
- terminal: TerminalSession or Console
- self: UserSession metadata
- status: read-only SystemStatus
- logs: scoped LogReader
- home: scoped Namespace or temporary Namespace
- launcher: RestrictedLauncher
- approval: ApprovalClient
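The "narrow bundle by default" rule is testable as policy. The sketch below is a host-side model, not kernel code: the capability names mirror the listing above, and the static `shell_bundle` policy stands in for an early manifest-backed broker. The property being asserted is that the default shell bundle never carries the broad boot authorities.

```rust
/// Host-side sketch of a static broker policy (names assumed from the text).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Cap {
    Terminal,
    SelfSession,
    StatusRead,
    LogsScoped,
    HomeScoped,
    RestrictedLauncher,
    ApprovalClient,
    // Broad authorities init holds but must not forward by default:
    BootPackage,
    ProcessSpawner,
    FrameAllocator,
}

/// Default shell bundle returned by the broker for an ordinary session.
fn shell_bundle() -> Vec<Cap> {
    vec![
        Cap::Terminal,
        Cap::SelfSession,
        Cap::StatusRead,
        Cap::LogsScoped,
        Cap::HomeScoped,
        Cap::RestrictedLauncher,
        Cap::ApprovalClient,
    ]
}

fn main() {
    let bundle = shell_bundle();
    // The narrowness invariant: no broad boot authority in the default bundle.
    for broad in [Cap::BootPackage, Cap::ProcessSpawner, Cap::FrameAllocator] {
        assert!(!bundle.contains(&broad));
    }
}
```

A real broker would key this on session/profile metadata; the point of the static version is that the invariant can be checked before dynamic policy exists.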
Early builds can omit storage-backed home and use a temporary namespace. They
still should not hand the shell broad BootPackage, ProcessSpawner,
FrameAllocator, raw device, or global service-supervisor authority by default.
Console Login
The local console path has three states.
Password Configured
If CredentialStore has an enabled console password verifier for the selected
principal or profile, ConsoleLogin prompts for the password before launching
the shell.
The verifier record should be versioned:
```
PasswordVerifier {
  algorithm: "argon2id"
  params: { memoryKiB, iterations, parallelism, outputLen }
  salt: random bytes
  hash: verifier bytes
  createdAtMs
  credentialId
  principalId
}
```
Argon2id is the default target because it is memory-hard and widely reviewed. The record must include parameters so stronger settings can be introduced without invalidating older records. A deployment may add a TPM- or secret-store-backed pepper later, but the design must not depend on a pepper being present.
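The versioning requirement implies a "needs rehash" check on every successful verify: a record hashed with weaker-than-current parameters still authenticates, but should be re-hashed under the new policy. The sketch below models only that check; the field names mirror the record above, the policy constants are illustrative, and the actual Argon2id verification is elided.

```rust
/// Sketch of the versioned-parameter upgrade check (field names and policy
/// values assumed; salt/hash bytes and real Argon2id verification elided).
#[derive(Clone, Copy, PartialEq, Eq)]
struct Argon2Params {
    memory_kib: u32,
    iterations: u32,
    parallelism: u32,
    output_len: u32,
}

struct PasswordVerifier {
    algorithm: &'static str,
    params: Argon2Params,
}

/// Current deployment target. Raising these later must not invalidate old
/// records; it only marks them for re-hash on next successful login.
const POLICY: Argon2Params = Argon2Params {
    memory_kib: 65536,
    iterations: 3,
    parallelism: 1,
    output_len: 32,
};

/// True when a record should be upgraded after a successful verification.
fn needs_rehash(v: &PasswordVerifier) -> bool {
    v.algorithm != "argon2id"
        || v.params.memory_kib < POLICY.memory_kib
        || v.params.iterations < POLICY.iterations
}

fn main() {
    let old = PasswordVerifier {
        algorithm: "argon2id",
        params: Argon2Params { memory_kib: 8192, iterations: 2, parallelism: 1, output_len: 32 },
    };
    let current = PasswordVerifier { algorithm: "argon2id", params: POLICY };
    assert!(needs_rehash(&old));     // weaker params: upgrade on next login
    assert!(!needs_rehash(&current)); // already at policy
}
```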
On failed attempts, ConsoleLogin records an audit event and applies bounded
backoff. The backoff state is not a security boundary by itself, because local
attackers may reboot; the password hash strength still matters.
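"Bounded backoff" can be made concrete with a doubling delay that saturates at a cap, so repeated failures slow an online guesser without permanently locking out a legitimate operator. The constants below are assumptions, not a specified policy; the counter would reset on successful login.

```rust
/// Sketch of bounded attempt backoff for ConsoleLogin.
/// Base delay and cap are illustrative assumptions, not specified policy.
fn backoff_ms(consecutive_failures: u32) -> u64 {
    const BASE_MS: u64 = 250;
    const CAP_MS: u64 = 30_000;
    if consecutive_failures == 0 {
        return 0;
    }
    // Double per consecutive failure, saturating well below shift overflow,
    // then clamp to the cap so the delay is bounded.
    BASE_MS
        .saturating_mul(1u64 << consecutive_failures.min(20).saturating_sub(1))
        .min(CAP_MS)
}

fn main() {
    assert_eq!(backoff_ms(0), 0);       // no failures, no delay
    assert_eq!(backoff_ms(1), 250);
    assert_eq!(backoff_ms(2), 500);
    assert_eq!(backoff_ms(10), 30_000); // clamped at the cap
}
```

Since a local attacker can reboot past this state, the delay is hygiene against online guessing, not the defense; as the text says, verifier strength is.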
No Console Password
If no console password verifier exists, the console does not launch an ordinary shell. It enters setup mode.
Setup mode can:
- create the first console password verifier,
- enroll a first passkey for the web text shell,
- create both credentials,
- choose an explicit local guest or development profile if the manifest permits it.
For normal images, the setup flow must create at least one usable credential or leave the machine without an ordinary interactive shell. This matches the operator expectation: no configured password means “setup required”, not “open console”.
Passkey-Only Deployment
Passkey-only should be possible without creating a password. It still needs a bootstrap authority path.
Acceptable first-passkey bootstrap paths:
- local console setup enrolls the first passkey and then never creates a password verifier,
- the manifest or cloud metadata includes a predeclared passkey public credential for an operator principal,
- the console prints a short-lived setup challenge that a web enrollment flow must redeem before registering the first passkey.
Unacceptable path:
- the first remote browser to reach the web endpoint becomes administrator because no password exists.
If a machine is passkey-only, the local console can still expose setup, recovery, guest, or diagnostic profiles according to policy. It should not silently become an unauthenticated administrator shell.
Guest and Anonymous Profiles
The user-identity proposal distinguishes authenticated, guest, anonymous, and pseudonymous sessions. Boot-to-shell should consume that model directly.
Authenticated password login creates a human or operator UserSession with
auth strength password. Authenticated passkey login normally creates a human,
operator, or pseudonymous UserSession with auth strength hardwareKey.
Neither proof is authority by itself; both feed the broker.
Guest is the only unauthenticated profile that belongs on the local interactive
console by default. It is a deliberate SessionManager.guest() path with a
local interactive affordance, weak or no authentication, short expiry, tight
quotas, no durable home unless policy grants one, and a bundle such as:
- terminal: TerminalSession
- self: guest UserSession metadata
- tmp: temporary Namespace
- launcher: RestrictedLauncher(allowed = ["help", "settings"])
- logs: scoped LogReader for this guest session
Guest should not receive ApprovalClient for administrative actions unless a
named policy grants it. If no console password exists, setup may offer a guest
session only when the manifest explicitly enables a guest profile. Otherwise
the operator must create a credential or leave the ordinary shell unavailable.
Anonymous is different. It is usually remote or programmatic, has a random ephemeral principal ID, receives a smaller cap bundle than guest, and has no elevation path except “authenticate” or “create account”. It is not the console fallback for missing credentials, and it should not be counted as “booted to shell” unless the product goal is an explicitly anonymous demo.
If the web gateway later supports anonymous access, it should be a purpose-scoped workload or very restricted text terminal with no durable home, strict quotas, short expiry, and audit keyed by network context plus ephemeral session ID. It must not share the passkey setup path, because passkey-only bootstrap is a credential-enrollment flow, not anonymous access.
An empty CapSet remains the “Unprivileged Stranger” case. It is useful for attack-surface demonstration, but it is not a session profile and not a shell login mode.
Web Text Shell and Passkeys
The web shell in this milestone is a browser-hosted terminal transport, not a graphical shell. It should display the same native text shell protocol through a terminal UI and should launch the same kind of session bundle as the local console path.
Required pieces:
- network stack and HTTP/WebSocket or equivalent streaming transport,
- TLS or a deployment mode acceptable to browsers for WebAuthn,
- stable relying-party ID and origin policy,
- random challenge generation,
- passkey credential storage,
- user-verification policy,
- audit and rate limiting.
Passkey credential records should store public material:
```
PasskeyCredential {
  credentialId
  principalId
  publicKey
  relyingPartyId
  userHandle
  signCount
  transports
  userVerificationRequired
  createdAtMs
}
```
The authentication flow is:
1. Browser requests a login challenge.
2. WebShellGateway asks SessionManager or CredentialStore for a bounded, random challenge tied to the relying-party ID and intended principal.
3. Browser calls the platform authenticator.
4. Gateway verifies the WebAuthn assertion, origin, challenge, credential ID, public-key signature, user-presence/user-verification flags, and sign-count behavior.
5. SessionManager mints a UserSession with auth strength hardwareKey.
6. AuthorityBroker returns the shell bundle for that session/profile.
7. RestrictedShellLauncher starts the native text shell connected to the web terminal stream.
Registration requires an existing authenticated session, local setup authority, or an explicit bootstrap path. Passwordless registration is allowed; unauthenticated remote registration is not.
Required Interfaces
These are ordinary capabilities, not kernel modes.
CredentialStore
Owns credential verifier records and challenge state.
Responsibilities:
- list whether setup is required without exposing hashes,
- create password verifier records from setup authority,
- verify password attempts without returning the password or verifier bytes,
- register passkey public credentials,
- issue and consume bounded WebAuthn challenges,
- rotate or disable credentials through an authenticated admin path.
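The "bounded challenge" responsibility has a small, testable core: a challenge is bound to a relying-party ID, expires at a deadline, and is removed on first consumption so replays fail. The sketch below is a host-side model with assumed names; a real store would use random challenge bytes rather than a counter, and the WebAuthn assertion check itself is out of scope here.

```rust
use std::collections::HashMap;

/// Sketch of single-use, expiring challenge state (API names assumed).
struct ChallengeStore {
    live: HashMap<u64, (String /* rp_id */, u64 /* expires_at_ms */)>,
    next_id: u64, // stand-in for random challenge bytes
}

impl ChallengeStore {
    fn new() -> Self {
        Self { live: HashMap::new(), next_id: 0 }
    }

    fn issue(&mut self, rp_id: &str, now_ms: u64, ttl_ms: u64) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.live.insert(id, (rp_id.to_string(), now_ms + ttl_ms));
        id
    }

    /// Succeeds at most once, only before expiry, only for the relying-party
    /// ID it was issued against. A failed match still burns the challenge.
    fn consume(&mut self, id: u64, rp_id: &str, now_ms: u64) -> bool {
        match self.live.remove(&id) {
            Some((rp, exp)) => rp == rp_id && now_ms < exp,
            None => false,
        }
    }
}

fn main() {
    let mut store = ChallengeStore::new();
    let id = store.issue("capos.local", 1_000, 60_000);
    assert!(!store.consume(id, "evil.example", 2_000)); // wrong RP: rejected and burned
    let id2 = store.issue("capos.local", 1_000, 60_000);
    assert!(store.consume(id2, "capos.local", 2_000));  // first use succeeds
    assert!(!store.consume(id2, "capos.local", 2_000)); // replay fails
    let id3 = store.issue("capos.local", 1_000, 1_000);
    assert!(!store.consume(id3, "capos.local", 5_000)); // expired
}
```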
SessionManager
Creates UserSession metadata after successful authentication, explicit local
guest policy, purpose-scoped anonymous policy, or setup policy. It should
record auth method, auth strength, freshness, expiry, profile, and audit
context. It should not hand out broad system caps directly. Boot-to-shell uses
authenticated sessions and optional local guest sessions for ordinary
interactive shells; anonymous sessions are narrower remote/programmatic
contexts unless a manifest explicitly defines an anonymous demo terminal.
AuthorityBroker
Maps a session/profile to a narrow CapSet. Early policy can be static and manifest-backed. The important constraint is that the broker returns capabilities, not roles or strings that downstream services treat as authority.
ConsoleLogin
Consumes TerminalSession, CredentialStore, SessionManager, broker access,
and a restricted shell launcher. It never receives broad boot-package or device
authority unless a recovery profile explicitly grants it.
WebShellGateway
Terminates the browser terminal session, handles passkey challenge/response, and connects the authenticated session to the shell process. It should not own general administrative caps. It should ask the broker for the same narrow shell bundle as any other session.
AuditLog
Records setup entry, credential creation, failed attempts, successful session creation, broker decisions, shell launch, credential disablement, and logout. Audit entries must not include passwords, password hashes, passkey private material, bearer tokens, or complete environment dumps.
Prerequisites
Boot-to-shell should not be selected before these pieces are credible:
- Default boot uses init-owned manifest execution; the kernel starts only init with fixed bootstrap authority.
- init can start long-lived services and not just short smoke binaries.
- ProcessSpawner can launch the shell and login services with exact grants.
- A terminal input path exists. The current Console is output-oriented; login needs line input, paste boundaries later, and cancellation behavior.
- The native text shell exists as a capos-rt binary with caps, inspect, call, spawn, wait, release, and basic error display.
- Secure randomness exists for salts, session IDs, WebAuthn challenges, and setup tokens.
- There is at least boot-config-backed credential storage. Durable credential storage can come later, but the first implementation must be honest about whether credentials survive reboot.
- Minimal SessionManager, AuthorityBroker, and AuditLog services exist.
- A restricted launcher or broker wrapper prevents the shell from receiving broad init authority.
- The web text shell requires networking, HTTP/WebSocket or equivalent, TLS/origin handling, and WebAuthn verification. It can lag local console boot-to-shell.
Milestone Definition
The “Boot to Shell” milestone is complete when:
- make run-shell or the default boot path reaches a text login/setup prompt.
- With a configured password verifier, the console refuses the shell on a bad password and launches it on the correct password.
- With no console password verifier, the console enters setup mode and requires creating a credential or selecting an explicitly configured local guest or development policy before launching a normal shell.
- Guest console sessions, when enabled, are created through SessionManager.guest() and receive only terminal/tmp/restricted-launcher style caps with no administrative approval path by default.
- Anonymous sessions are not used as the missing-password console fallback and are not accepted as proof that the ordinary boot-to-shell milestone works.
- The shell starts with a broker-issued CapSet and can prove at least one typed capability call plus one exact-grant child spawn.
- Audit output records setup/auth/session/broker/shell-launch events without leaking secrets.
- The web text shell can authenticate with a registered passkey and launch the same native text shell profile.
- A passkey-only account can be enrolled through local setup authority or an explicit bootstrap credential, with no password verifier present.
- Graphical shell work is not part of the acceptance criteria.
Implementation Plan
1. Text console substrate. Add TerminalSession or extend the console service enough for authenticated line input, echo control, paste/framing markers later, and cancellation.
2. Native shell binary. Land the shell proposal's minimal REPL over capos-rt: list CapSet entries, inspect metadata, call Console, spawn a boot-package binary, wait, release, and print typed errors.
3. Credential store prototype. Add boot-config-backed credential records and Argon2id verification. If Argon2id is too heavy for the first kernel/userspace environment, use a host-generated verifier in the manifest only as a temporary gate and keep the milestone open until in-system verification is real.
4. Console setup/login. Implement the configured-password path and the no-password setup path. The setup code should create credential records through CredentialStore, not write ad hoc config in the shell process.
5. Minimal session and broker. Create UserSession metadata and a static policy broker that returns a narrow shell bundle. Add a manifest-gated local guest bundle and keep anonymous bundles separate from ordinary shell login. Prove the shell cannot obtain broad boot authority by default.
6. Audit and failure policy. Add audit records and bounded attempt backoff. Verify logs do not contain raw passwords, verifier bytes, passkey private data, or challenge secrets.
7. Web text shell gateway. After networking and a terminal transport exist, add WebAuthn registration and authentication for the browser-hosted terminal. Support passkey-only enrollment through local setup or explicit bootstrap authority.
8. Durability and recovery. Move credential records from boot config or RAM into a storage-backed service once storage exists. Define recovery as a credential-admin operation, not an implicit bypass.
Security Notes
- Password hashing belongs in userspace auth services, not the kernel fast path.
- WebAuthn challenge state must be single-use and bounded by expiry.
- The web gateway must validate origin and relying-party ID; otherwise passkey authentication is meaningless.
- Setup tokens are credentials. They must be short-lived, single-use, audited, and hidden from ordinary process output.
- Credential records are sensitive even though they are not raw secrets; avoid printing them in debug logs.
- The shell and any agent running inside it must treat logs, terminal input, files, web pages, and service output as untrusted data.
Non-Goals
- No graphical shell in this milestone.
- No passwordless remote first-use takeover.
- No kernel uid, gid, root, or login mode.
- No default shell access to broad BootPackage, raw ProcessSpawner, DeviceManager, raw storage, or global supervisor caps.
- No authentication proof passed through command-line arguments, environment variables, shell variables, audit records, or agent prompts.
Open Questions
- Which Argon2id parameters fit the early userspace memory budget while still resisting offline guessing?
- Should the first credential store be manifest-backed, RAM-backed, or wait for the first storage service?
- How should local console setup prove physical presence on cloud VMs where serial console access may itself be remote?
- What is the first acceptable TLS/origin story for QEMU and local development WebAuthn testing?
- Should passkey-only machines keep a disabled console password slot for later recovery, or should recovery be entirely credential-admin/passkey based?
Proposal: Symmetric Multi-Processing (SMP)
How capOS goes from single-CPU execution to utilizing all available processors.
This document has three phases: a per-CPU foundation (prerequisite plumbing), AP startup (bringing secondary CPUs online), and SMP correctness (making shared state safe under concurrency).
Depends on: Stage 5 (Scheduling) – needs a working timer, context switch, and run queue on the BSP before adding more CPUs.
Can proceed in parallel with: Stage 6 (IPC and Capability Transfer).
Current State
Everything is single-CPU. Specific assumptions that SMP breaks:
| Component | File | Assumption |
|---|---|---|
| Syscall stack switching | kernel/src/arch/x86_64/syscall.rs | Global SYSCALL_KERNEL_RSP / SYSCALL_USER_RSP statics |
| GDT, TSS, kernel stacks | kernel/src/arch/x86_64/gdt.rs | One static GDT, one TSS, one kernel stack, one double-fault stack |
| IDT | kernel/src/arch/x86_64/idt.rs | Single static IDT (shareable – IDT can be the same across CPUs) |
| SYSCALL MSRs | kernel/src/arch/x86_64/syscall.rs | STAR/LSTAR/SFMASK/EFER set once on BSP only |
| Current process | kernel/src/sched.rs | SCHEDULER with BTreeMap<Pid, Process> + current: Option<Pid> — single global behind Mutex |
| Frame allocator | kernel/src/mem/frame.rs | Single global ALLOCATOR behind one spinlock |
| Heap allocator | kernel/src/mem/heap.rs | linked_list_allocator behind one spinlock |
The comment in syscall.rs:12 already anticipates the fix: “Will be
replaced by per-CPU data (swapgs) for SMP.”
Phase A: Per-CPU Foundation
Establish per-CPU data structures on the BSP. No APs are started yet – this phase makes the BSP’s own code SMP-ready so Phase B is a clean addition.
Per-CPU Data Region
Each CPU needs a private data area accessible via the GS segment base. On
x86_64, swapgs switches between user-mode GS (usually zero) and
kernel-mode GS (pointing to per-CPU data). The kernel sets KernelGSBase
MSR on each CPU during init.
```rust
/// Per-CPU data, one instance per processor.
/// Accessed via GS-relative addressing after swapgs.
#[repr(C)]
struct PerCpu {
    /// Self-pointer for accessing the struct from GS:0.
    self_ptr: *const PerCpu,
    /// Kernel stack pointer for syscall entry (replaces SYSCALL_KERNEL_RSP).
    kernel_rsp: u64,
    /// Saved user RSP during syscall (replaces SYSCALL_USER_RSP).
    user_rsp: u64,
    /// CPU index (0 = BSP).
    cpu_id: u32,
    /// LAPIC ID (from Limine SMP info or CPUID).
    lapic_id: u32,
    /// Pointer to the currently running process on this CPU.
    current_process: *mut Process,
}
```
The syscall entry stub changes from:
```asm
movq %rsp, SYSCALL_USER_RSP(%rip)
movq SYSCALL_KERNEL_RSP(%rip), %rsp
```

to:

```asm
swapgs
movq %rsp, %gs:16   ; PerCpu.user_rsp
movq %gs:8, %rsp    ; PerCpu.kernel_rsp
```

And symmetrically on return:

```asm
movq %gs:16, %rsp   ; restore user RSP
swapgs
sysretq
```
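The `%gs:8` and `%gs:16` displacements in these stubs are implicit claims about the `PerCpu` layout. A small host-side check (a sketch using a reduced field set, not current kernel code) can pin the stub offsets to the struct so a reordered field fails a test instead of corrupting a stack pointer:

```rust
use std::mem::offset_of;

/// Mirror of the PerCpu prefix with the layout the syscall stub assumes.
/// repr(C) keeps field order and natural alignment, so on x86_64 the
/// pointer at offset 0 is followed by the two u64 slots at 8 and 16.
#[repr(C)]
struct PerCpu {
    self_ptr: *const PerCpu, // %gs:0
    kernel_rsp: u64,         // %gs:8
    user_rsp: u64,           // %gs:16
    cpu_id: u32,
    lapic_id: u32,
}

fn main() {
    // These must match the displacements hard-coded in the entry stub.
    assert_eq!(offset_of!(PerCpu, self_ptr), 0);
    assert_eq!(offset_of!(PerCpu, kernel_rsp), 8);
    assert_eq!(offset_of!(PerCpu, user_rsp), 16);
}
```

In the kernel itself the same assertions can live in a `const` block next to the assembly so the check costs nothing at runtime.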
Per-CPU GDT, TSS, and Stacks
Each CPU needs its own:
- GDT – the TSS descriptor encodes a physical pointer to the CPU’s TSS, so each CPU needs a GDT with its own TSS entry. The segment layout (kernel CS/DS, user CS/DS) is identical across CPUs.
- TSS – privilege_stack_table[0] (kernel stack for interrupts from Ring 3) and IST entries (double-fault stack) must be per-CPU.
- Kernel stack – each CPU needs its own stack for syscall/interrupt handling. Current size: 16 KB (4 pages). Same size per CPU.
- Double-fault stack – each CPU needs its own IST stack. Current size: 20 KB (5 pages).
```rust
/// Allocate and initialize per-CPU structures for one CPU.
fn init_per_cpu(cpu_id: u32, lapic_id: u32) -> &'static PerCpu {
    // Allocate kernel stack (4 pages) and double-fault stack (5 pages)
    let kernel_stack = alloc_stack(4);
    let df_stack = alloc_stack(5);

    // Create TSS with per-CPU stacks
    let mut tss = TaskStateSegment::new();
    tss.privilege_stack_table[0] = kernel_stack.top();
    tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = df_stack.top();

    // Create GDT with this CPU's TSS
    let (gdt, selectors) = create_gdt(&tss);

    // Allocate and populate PerCpu struct
    let per_cpu = Box::leak(Box::new(PerCpu {
        self_ptr: core::ptr::null(), // filled below
        kernel_rsp: kernel_stack.top().as_u64(),
        user_rsp: 0,
        cpu_id,
        lapic_id,
        current_process: core::ptr::null_mut(),
    }));
    per_cpu.self_ptr = per_cpu as *const PerCpu;
    per_cpu
}
```
LAPIC Initialization
Stage 5 uses the 8254 PIT (100 Hz) and 8259A PIC (IRQ0 → vector 32) for preemption on the BSP. Phase A must migrate from PIT to LAPIC timer before bringing APs online, since the PIT is a single shared device that cannot provide per-CPU timer interrupts. Phase A sets up the full LAPIC, which is needed for:
- Per-CPU timer – replace PIT with LAPIC timer (required for SMP)
- IPI – inter-processor interrupts for TLB shootdown and AP startup
- Spurious interrupt vector – must be configured per-CPU
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| x2apic or manual MMIO | LAPIC/IOAPIC access | yes |
The x86_64 crate (already a dependency) provides MSR access. LAPIC
register access can use the existing HHDM for MMIO, or x2apic crate for
the MSR-based interface.
Migration Path
Phase A is a refactor of existing single-CPU code, not an addition:
1. Add the PerCpu struct, allocate one instance for the BSP
2. Set the BSP's KernelGSBase MSR, add swapgs to syscall entry/exit
3. Replace the SYSCALL_KERNEL_RSP / SYSCALL_USER_RSP globals with GS-relative accesses
4. Replace the scheduler's global SCHEDULER.current with PerCpu.current_process
5. Move GDT/TSS creation into init_per_cpu(), call it for the BSP
6. Migrate the BSP from PIT to LAPIC timer (PIT initialized in Stage 5)
After Phase A, the kernel still runs on one CPU but the per-CPU plumbing is
in place. Existing tests (make run) continue to pass.
Phase B: AP Startup
Bring Application Processors (APs) online. Each AP runs the same kernel code with its own per-CPU state.
Limine SMP Request
Limine provides an SMP response with per-CPU records. Each record contains
the LAPIC ID and a goto_address field – writing a function pointer there
starts the AP at that address.
```rust
use limine::request::SmpRequest;

#[used]
#[unsafe(link_section = ".requests")]
static SMP_REQUEST: SmpRequest = SmpRequest::new();

fn start_aps() {
    let smp = SMP_REQUEST.get_response().expect("no SMP response");
    for cpu in smp.cpus() {
        if cpu.lapic_id == smp.bsp_lapic_id {
            continue; // skip BSP
        }
        let per_cpu = init_per_cpu(cpu.id, cpu.lapic_id);
        // Limine starts the AP at ap_entry with the cpu info pointer
        cpu.goto_address.write(ap_entry);
    }
}
```
AP Entry
Each AP must:
- Load its per-CPU GDT and TSS
- Load the shared IDT
- Set the KernelGSBase MSR to its PerCpu pointer
- Configure SYSCALL MSRs (STAR, LSTAR, SFMASK, EFER.SCE)
- Initialize its LAPIC (enable, set timer, set spurious vector)
- Signal “ready” to BSP (atomic flag or counter)
- Enter the scheduler idle loop
```rust
/// AP entry point. Called by Limine with the SMP info pointer.
unsafe extern "C" fn ap_entry(info: &limine::smp::Cpu) -> ! {
    let per_cpu = /* retrieve PerCpu for this LAPIC ID */;

    // Load this CPU's GDT + TSS
    per_cpu.gdt.load();
    unsafe { load_tss(per_cpu.selectors.tss); }

    // Shared IDT (same across all CPUs)
    idt::load();

    // Set GS base for swapgs
    unsafe { wrmsr(IA32_KERNEL_GS_BASE, per_cpu as *const _ as u64); }

    // Configure syscall MSRs (same values as BSP)
    syscall::init_msrs();

    // Initialize local APIC
    lapic::init_local();

    // Signal ready
    AP_READY_COUNT.fetch_add(1, Ordering::Release);

    // Enter scheduler
    scheduler::idle_loop();
}
```
Scheduler Integration
Stage 5 establishes a single run queue. Phase B extends it:
- Per-CPU run queues – each CPU pulls work from its local queue. Avoids global lock contention on the scheduler hot path.
- Global overflow queue – when a CPU’s local queue is empty, it steals from the global queue (or from other CPUs’ queues).
- CPU affinity – optional, not needed initially. All processes are eligible to run on any CPU.
- Idle loop – when no work is available, hlt until the next timer interrupt or IPI.
The Process struct gains a cpu field indicating which CPU it’s currently
running on (or None if queued).
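The local-queue-first, steal-from-global policy can be sketched on the host. This is a model of the scheduling order only, with assumed names; the real version sits behind per-CPU locks and is driven by the timer interrupt, and PIDs stand in for full Process entries.

```rust
use std::collections::VecDeque;

/// Sketch of per-CPU run queues with a global overflow queue (names assumed).
struct RunQueues {
    local: Vec<VecDeque<u32>>, // one queue of PIDs per CPU
    global: VecDeque<u32>,     // overflow / steal source
}

impl RunQueues {
    fn new(cpus: usize) -> Self {
        Self { local: vec![VecDeque::new(); cpus], global: VecDeque::new() }
    }

    fn enqueue_local(&mut self, cpu: usize, pid: u32) {
        self.local[cpu].push_back(pid);
    }

    fn enqueue_global(&mut self, pid: u32) {
        self.global.push_back(pid);
    }

    /// Pick the next PID for `cpu`: local queue first, then steal from
    /// the global queue. None means the CPU enters the hlt idle loop.
    fn next(&mut self, cpu: usize) -> Option<u32> {
        self.local[cpu].pop_front().or_else(|| self.global.pop_front())
    }
}

fn main() {
    let mut rq = RunQueues::new(2);
    rq.enqueue_local(0, 10);
    rq.enqueue_global(20);
    assert_eq!(rq.next(0), Some(10)); // local work wins, no global lock touched
    assert_eq!(rq.next(1), Some(20)); // empty local queue steals from global
    assert_eq!(rq.next(1), None);     // nothing left: idle loop / hlt
}
```

Stealing from other CPUs' local queues (mentioned as an alternative above) adds cross-CPU locking and is easy to layer on later.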
Boot Sequence
```
BSP: kernel init (GDT, IDT, memory, heap, caps, scheduler)
BSP: init_per_cpu(0, bsp_lapic_id)
BSP: start_aps()
AP1: ap_entry() → init GDT/TSS/LAPIC → idle_loop()
AP2: ap_entry() → init GDT/TSS/LAPIC → idle_loop()
...
BSP: wait for all APs ready
BSP: load init process, schedule it
BSP: enter scheduler
```
Phase C: SMP Correctness
With multiple CPUs running, shared mutable state needs careful handling.
TLB Shootdown
When the kernel modifies page tables that other CPUs may have cached in their TLBs, it must send an IPI to those CPUs to invalidate the affected entries.
Scenarios requiring shootdown:
- Process exit – unmapping user pages. Only the CPU running the process has the mapping cached, but if the process migrated recently, stale TLB entries may exist on the old CPU.
- Shared kernel mappings – changes to the kernel half of page tables (e.g., heap growth, MMIO mappings) require all-CPU shootdown.
- Capability-granted shared memory – if future stages allow shared memory regions between processes, modifications require targeted shootdown.
Implementation: IPI vector + bitmap of target CPUs + invlpg on each
target. Linux uses a more sophisticated batching scheme, but a simple
broadcast IPI with single-page invlpg is sufficient initially.
```rust
/// Flush a TLB entry on all CPUs except the caller.
fn tlb_shootdown(addr: VirtAddr) {
    // Record the address to flush
    SHOOTDOWN_ADDR.store(addr.as_u64(), Ordering::Release);
    // Send IPI to all other CPUs
    lapic::send_ipi_all_excluding_self(TLB_SHOOTDOWN_VECTOR);
    // Wait for all CPUs to acknowledge
    wait_for_shootdown_ack();
}

/// IPI handler on the receiving CPU.
fn handle_tlb_shootdown_ipi() {
    let addr = VirtAddr::new(SHOOTDOWN_ADDR.load(Ordering::Acquire));
    x86_64::instructions::tlb::flush(addr);
    SHOOTDOWN_ACK.fetch_add(1, Ordering::Release);
}
```
Lock Audit
Existing spinlocks need review for SMP safety:
| Lock | Current Use | SMP Concern |
|---|---|---|
| SERIAL | COM1 output | Safe but high contention if many CPUs print. Acceptable for debug output. |
| ALLOCATOR | Frame bitmap | Hot path. Holding the lock during a full bitmap scan is O(n). Consider per-CPU free lists. |
| KERNEL_CAPS | Kernel cap table | Low contention (init only). Safe. |
| SCHEDULER.current | Single global running-process slot | Split into PerCpu.current_process in Phase A. |
Interrupt + spinlock deadlock: if CPU A holds a spinlock and takes an
interrupt whose handler tries to acquire the same lock, deadlock. This is
already noted in REVIEW.md. Fix: disable interrupts while holding locks
that interrupt handlers may need (frame allocator, serial). The spin crate
supports MutexIrq for this pattern, or use manual cli/sti wrappers.
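The cli/sti wrapper pattern can be sketched in plain Rust. This is a hedged userspace simulation: `INTERRUPTS_ENABLED` stands in for the CPU interrupt-enable flag, which the kernel would manipulate with real `cli`/`sti` instructions; the save-disable-restore guard shape is the point, not the mechanism.

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for RFLAGS.IF; the real kernel would execute cli/sti instead.
static INTERRUPTS_ENABLED: AtomicBool = AtomicBool::new(true);

/// Run a closure under the lock with interrupts disabled, restoring the
/// previous interrupt state afterward.
fn with_irq_disabled<T, R>(lock: &Mutex<T>, f: impl FnOnce(&mut T) -> R) -> R {
    // Save and clear the interrupt flag (cli).
    let were_enabled = INTERRUPTS_ENABLED.swap(false, Ordering::SeqCst);
    let result = {
        let mut guard = lock.lock().unwrap();
        f(&mut guard)
    }; // lock released here, before interrupts come back on
    // Restore the previous state (sti only if it was set before).
    if were_enabled {
        INTERRUPTS_ENABLED.store(true, Ordering::SeqCst);
    }
    result
}

fn main() {
    let frames = Mutex::new(vec![1u64, 2, 3]);
    let frame = with_irq_disabled(&frames, |v| v.pop());
    assert_eq!(frame, Some(3));
    // Interrupt state is restored after the critical section.
    assert!(INTERRUPTS_ENABLED.load(Ordering::SeqCst));
    println!("ok");
}
```

Restoring the saved state, rather than unconditionally re-enabling, keeps nested critical sections correct.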
Allocator Scaling
The frame allocator is behind a single spinlock with O(n) bitmap scan. Under SMP, this becomes a contention bottleneck.
Options (in order of complexity):
- Per-CPU free list cache – each CPU maintains a small cache of free frames (e.g., 64 frames). Refill from the global allocator when empty, return batch when full. Reduces lock acquisitions by ~64x.
- Region partitioning – divide physical memory into per-CPU regions. Each CPU owns a bitmap partition. Cross-CPU allocation falls back to a global lock. More complex, better NUMA behavior (future).
Option 1 is recommended for initial SMP. ~50-100 lines.
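Option 1 can be sketched as a small simulation. This is not the capOS allocator: `GlobalFrames` stands in for the bitmap allocator behind its single lock, and the batch size and type names are illustrative.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

const BATCH: usize = 64; // frames moved per global-lock acquisition

// Illustrative stand-in for the global frame allocator behind one lock.
struct GlobalFrames {
    free: Mutex<Vec<u64>>,
}

// Per-CPU cache: refilled in batches so the hot path avoids the global lock.
struct PerCpuCache {
    frames: VecDeque<u64>,
}

impl PerCpuCache {
    fn alloc(&mut self, global: &GlobalFrames) -> Option<u64> {
        if self.frames.is_empty() {
            // One global-lock acquisition refills up to BATCH frames.
            let mut free = global.free.lock().unwrap();
            let take = free.len().min(BATCH);
            let start = free.len() - take;
            self.frames.extend(free.drain(start..));
        }
        self.frames.pop_front()
    }

    fn free(&mut self, frame: u64, global: &GlobalFrames) {
        self.frames.push_back(frame);
        if self.frames.len() > 2 * BATCH {
            // Return a batch when the cache overfills, bounding hoarding.
            let mut free = global.free.lock().unwrap();
            free.extend(self.frames.drain(..BATCH));
        }
    }
}

fn main() {
    let global = GlobalFrames { free: Mutex::new((0u64..256).collect()) };
    let mut cpu0 = PerCpuCache { frames: VecDeque::new() };
    // 64 allocations cost a single global-lock acquisition.
    let frames: Vec<u64> = (0..64).filter_map(|_| cpu0.alloc(&global)).collect();
    assert_eq!(frames.len(), 64);
    for f in frames {
        cpu0.free(f, &global);
    }
    println!("ok");
}
```

The high-water return in `free` is what keeps one idle CPU from hoarding all frames.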
The heap allocator (linked_list_allocator) is also behind a single lock.
For a research OS this is acceptable initially – heap allocations in the
kernel should be infrequent compared to frame allocations.
Cap’n Proto Schema Additions
SMP introduces a kernel-internal CpuManager capability for inspecting and
controlling CPU state. This is not exposed to userspace initially but follows
the “everything is a capability” principle.
```capnp
interface CpuManager {
  # Number of online CPUs.
  cpuCount @0 () -> (count :UInt32);
  # Per-CPU info.
  cpuInfo @1 (cpuId :UInt32) -> (lapicId :UInt32, online :Bool);
}
```
This capability would be held by init (or a system monitor process) for diagnostics. It’s additive and can be deferred until the mechanism is useful.
Estimated Scope
| Phase | New/Changed Code | Depends On |
|---|---|---|
| Phase A: Per-CPU foundation | ~300-400 lines (PerCpu struct, swapgs migration, per-CPU GDT/TSS) | Stage 5 |
| Phase B: AP startup | ~200-300 lines (SmpRequest, ap_entry, scheduler extension) | Phase A |
| Phase C: SMP correctness | ~200-300 lines (TLB shootdown, allocator cache, lock audit) | Phase B |
| Total | ~700-1000 lines | |
Milestones
- M1: Per-CPU data on BSP – swapgs-based syscall entry, per-CPU GDT/TSS, global current-process state split. Single CPU still. `make run` passes.
- M2: APs running – secondary CPUs reach `idle_loop()`. BSP prints “N CPUs online”. `make run` still runs init on the BSP.
- M3: Multi-CPU scheduling – processes can run on any CPU. The existing boot-manifest service set still works, but the scheduler distributes work across CPUs once runnable processes are available (runtime spawning still depends on `ProcessSpawner`).
- M4: TLB shootdown – page table modifications are safe across CPUs. Process exit on one CPU does not leave stale mappings on others.
Open Questions
- LAPIC vs x2APIC. Modern hardware supports x2APIC (MSR-based, no MMIO). Should we require x2APIC, support both, or start with xAPIC? QEMU supports both. x2APIC is simpler (no MMIO mapping needed).
- Idle strategy. `hlt` is the simplest idle. `mwait` is more power-efficient and can wake on memory writes. Overkill for QEMU, but worth noting for future hardware targets.
- CPU hotplug. Limine starts all CPUs at boot. Runtime CPU online/offline is a future concern, not needed initially.
- NUMA awareness. Multi-socket systems have non-uniform memory access. Per-CPU frame allocator regions could be NUMA-aware. Deferred – QEMU emulates flat memory by default.
- Scheduler policy. Round-robin per-CPU queues with global overflow is the simplest starting point. Work stealing, priority scheduling, and CPU affinity are future refinements.
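The round-robin-with-overflow starting point can be sketched as data structures. This is a hedged model of the policy named above, not kernel code: the queue capacity and the `Scheduler` type are illustrative.

```rust
use std::collections::VecDeque;

type Pid = u64;

/// Round-robin per-CPU queues with a global overflow queue: each CPU
/// drains its own queue first and falls back to the shared one.
struct Scheduler {
    per_cpu: Vec<VecDeque<Pid>>,
    overflow: VecDeque<Pid>,
}

impl Scheduler {
    fn new(cpus: usize) -> Self {
        Scheduler { per_cpu: vec![VecDeque::new(); cpus], overflow: VecDeque::new() }
    }

    /// Enqueue on the preferred CPU, spilling to the overflow queue when
    /// that CPU's queue is full (the capacity of 8 is illustrative).
    fn enqueue(&mut self, cpu: usize, pid: Pid) {
        if self.per_cpu[cpu].len() < 8 {
            self.per_cpu[cpu].push_back(pid);
        } else {
            self.overflow.push_back(pid);
        }
    }

    /// Pick the next runnable process for `cpu`: local queue first,
    /// then the shared overflow queue.
    fn pick_next(&mut self, cpu: usize) -> Option<Pid> {
        self.per_cpu[cpu].pop_front().or_else(|| self.overflow.pop_front())
    }
}

fn main() {
    let mut s = Scheduler::new(2);
    s.enqueue(0, 10);
    for pid in 0..7 {
        s.enqueue(1, 100 + pid); // fill CPU 1's local queue
    }
    s.enqueue(1, 20); // still fits
    s.enqueue(1, 999); // CPU 1 queue full -> overflow
    assert_eq!(s.pick_next(0), Some(10));
    assert_eq!(s.pick_next(0), Some(999)); // idle CPU 0 drains overflow
    println!("ok");
}
```

The overflow queue gives a crude form of load balancing for free; work stealing would replace it with per-CPU queues that other CPUs can pop from.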
References
Specifications
- Intel SDM Vol. 3, Chapter 8 – Multiple-Processor Management (AP startup, APIC, IPI)
- Intel SDM Vol. 3, Chapter 10 – APIC (Local APIC, I/O APIC, x2APIC)
- OSDev Wiki: SMP
- OSDev Wiki: APIC
Limine
- Limine SMP Feature – `SmpRequest`/`SmpResponse` API, AP startup mechanism
Prior Art
- Redox SMP – per-CPU contexts, LAPIC timer, IPI-based TLB shootdown
- xv6-riscv SMP – minimal multi-core OS, clean per-CPU implementation
- Hermit SMP – Rust unikernel with SMP support via per-core data and APIC
- BlogOS – educational x86_64 Rust OS (single-CPU, but good APIC coverage)
Other Proposals
This page keeps the mdBook sidebar compact by grouping proposal documents that are not listed individually in the main Design Proposals section.
Active Support Proposals
| Proposal | Status | Purpose |
|---|---|---|
| mdBook Documentation Site | Partially implemented | Defines the documentation site structure, status vocabulary, and curation rules for architecture, proposal, security, and research pages. |
Future Runtime and Deployment
| Proposal | Status | Purpose |
|---|---|---|
| Go Runtime | Future design | Plans a custom GOOS=capos userspace port and runtime services for Go programs. |
| Cloud Metadata | Future design | Describes cloud bootstrap inputs and manifest deltas without importing cloud-init. |
| Cloud Deployment | Future design | Tracks hardware abstraction, cloud VM support, storage/network dependencies, and aarch64 deployment direction. |
| Browser/WASM | Future design | Explores a browser-hosted capOS model using WebAssembly and workers. |
Future Security, Policy, and Lifecycle
| Proposal | Status | Purpose |
|---|---|---|
| User Identity and Policy | Future design | Defines user/session identity and policy layers over capability grants. |
| System Monitoring | Future design | Defines scoped observability capabilities for logs, metrics, traces, health, status, crash records, and audit. |
| Formal MAC/MIC | Future design | Defines a formal access-control and integrity model for later proof work. |
| Live Upgrade | Future design | Designs service replacement while preserving handles, calls, and authority. |
| GPU Capability | Future design | Sketches isolated GPU device, memory, and compute authority. |
Rejected
| Proposal | Status | Purpose |
|---|---|---|
| Cap’n Proto SQE Envelope | Rejected | Records the rejected idea of encoding SQEs themselves as Cap’n Proto messages. |
| Sleep(INF) Process Termination | Rejected | Records the rejected idea of using infinite sleep as process termination. |
Proposal: mdBook Documentation Site
Turn the existing Markdown documentation into a navigable mdBook site that explains capOS as a working system, while keeping proposals and research as deep reference material.
The current docs are useful for agents and maintainers who already know what
they are looking for. They are weaker as a reader path: a new contributor has
to jump between README.md, ROADMAP.md, WORKPLAN.md, proposal files,
research reports, and source code before they can form an accurate model of
the system. The mdBook site should fix that by adding a concise, current
system manual above the existing archive.
Status: Partially implemented. The book structure exists and is deployed; this proposal remains the editorial contract for keeping the site navigable and honest about current versus future behavior.
Goals
- Make the first reading path obvious: what capOS is, how to build it, what works today, and where the important subsystems live.
- Separate implemented behavior from future design, rejected ideas, and research background.
- Preserve existing long-form proposal and research documents instead of rewriting them prematurely.
- Give architecture pages a repeatable structure so future edits do not turn into ad hoc status notes.
- Make validation visible: each architecture page should name the host tests, QEMU smokes, fuzz targets, Kani proofs, Loom models, or manual checks that support its claims.
- Keep the docs useful from a local clone, without requiring hosted services, databases, or custom frontend code.
Non-Goals
- Replacing `ROADMAP.md`, `WORKPLAN.md`, or `REVIEW_FINDINGS.md`. Those files remain operational planning documents.
- Turning proposals into user manuals by bulk editing every existing document. Long proposal files stay as references until a subsystem needs a targeted refresh.
- Building a marketing site, blog, changelog, or public product page.
- Adding MDX, React, Vue, custom components, or a JavaScript application layer.
- Automatically generating API reference documentation from Rust or Cap’n Proto. That can be evaluated later as a separate documentation track.
Audience
The site should serve three readers:
- New contributor: wants to build the ISO, boot QEMU, understand the current architecture, and find the right files to edit.
- Reviewer: wants to verify whether a change preserves the intended ownership, authority, lifecycle, and validation rules.
- Future agent: wants current project context without having to infer the system from stale proposals or source code alone.
The primary audience is maintainers and agents, not end users. This matters: accuracy, status labels, and code maps are more important than a polished external landing page.
Current State
The repository already has a substantial Markdown corpus:
- `README.md` explains the project and core commands.
- `ROADMAP.md` describes long-range stages and visible milestones.
- `WORKPLAN.md` tracks the selected milestone and active implementation order.
- `REVIEW_FINDINGS.md` tracks open remediation and verification history.
- `docs/capability-model.md` is a real architecture reference.
- `docs/proposals/` contains accepted, future, exploratory, and rejected design material.
- `docs/research.md` and `docs/research/` contain prior-art analysis.
- `docs/*-design.md` and inventory files capture targeted design/security decisions.
The weakness is not lack of content. The weakness is lack of a stable reader path and status model.
Site Shape
The mdBook site should be structured as a book, not as a mirror of the file tree.
```markdown
# Summary
[Introduction](index.md)
# Start Here
- [What capOS Is](overview.md)
- [Current Status](status.md)
- [Build, Boot, and Test](build-run-test.md)
- [Repository Map](repo-map.md)
# System Architecture
- [Boot Flow](architecture/boot-flow.md)
- [Process Model](architecture/process-model.md)
- [Capability Model](capability-model.md)
- [Capability Ring](architecture/capability-ring.md)
- [IPC and Endpoints](architecture/ipc-endpoints.md)
- [Userspace Runtime](architecture/userspace-runtime.md)
- [Manifest and Service Startup](architecture/manifest-startup.md)
- [Memory Management](architecture/memory.md)
- [Scheduling](architecture/scheduling.md)
# Security and Verification
- [Trust Boundaries](security/trust-boundaries.md)
- [Verification Workflow](security/verification-workflow.md)
- [Panic Surface Inventory](panic-surface-inventory.md)
- [Trusted Build Inputs](trusted-build-inputs.md)
- [DMA Isolation](dma-isolation-design.md)
- [Authority Accounting](authority-accounting-transfer-design.md)
# Design Proposals
- [Proposal Index](proposals/index.md)
- [Service Architecture](proposals/service-architecture-proposal.md)
- [Storage and Naming](proposals/storage-and-naming-proposal.md)
- [Networking](proposals/networking-proposal.md)
- [Error Handling](proposals/error-handling-proposal.md)
- [Userspace Binaries](proposals/userspace-binaries-proposal.md)
- [Shell](proposals/shell-proposal.md)
- [SMP](proposals/smp-proposal.md)
- [Other Proposals](proposals/other.md)
- [Security and Verification](proposals/security-and-verification-proposal.md)
- [mdBook Documentation Site](proposals/mdbook-docs-site-proposal.md)
- [Go Runtime](proposals/go-runtime-proposal.md)
- [User Identity and Policy](proposals/user-identity-and-policy-proposal.md)
- [Cloud Metadata](proposals/cloud-metadata-proposal.md)
- [Cloud Deployment](proposals/cloud-deployment-proposal.md)
- [Live Upgrade](proposals/live-upgrade-proposal.md)
- [GPU Capability](proposals/gpu-capability-proposal.md)
- [Formal MAC/MIC](proposals/formal-mac-mic-proposal.md)
- [Browser/WASM](proposals/browser-wasm-proposal.md)
- [Rejected: Cap'n Proto SQE Envelope](proposals/rejected-capnp-ring-sqe-proposal.md)
# Research
- [Research Index](research.md)
- [seL4](research/sel4.md)
- [Zircon](research/zircon.md)
- [Genode](research/genode.md)
- [Plan 9 and Inferno](research/plan9-inferno.md)
- [EROS, CapROS, Coyotos](research/eros-capros-coyotos.md)
- [LLVM Target](research/llvm-target.md)
- [Cap'n Proto Error Handling](research/capnp-error-handling.md)
- [OS Error Handling](research/os-error-handling.md)
- [IX-on-capOS Hosting](research/ix-on-capos-hosting.md)
- [Out-of-Kernel Scheduling](research/out-of-kernel-scheduling.md)
```
The exact page list may change during implementation, but the hierarchy should stay stable:
- Start Here: reader orientation and commands.
- System Architecture: current implementation, with code maps and invariants.
- Security and Verification: threat boundaries, validation workflow, and security inventories.
- Design Proposals: accepted/future/rejected design documents.
- Research: prior art and its consequences for capOS.
Page Standard
Every architecture page should use this shape:
```markdown
# Page Title
What problem this subsystem solves and why a reader should care.
**Status:** Implemented / Partially implemented / Proposal / Research note.
## Current Behavior
What exists in the repo today.
## Design
How it works, with concrete data flow.
## Invariants
Security, lifetime, ownership, ordering, or failure rules.
## Code Map
Important files and entry points.
## Validation
Relevant host tests, QEMU smokes, fuzz/Kani/Loom checks.
## Open Work
Concrete known gaps, linked to WORKPLAN or REVIEW_FINDINGS when relevant.
```
Architecture pages should normally stay between 100 and 300 lines. Longer background belongs in proposals or research reports.
Status Vocabulary
Use explicit inline status labels near the top of proposal, research, and architecture pages, placed after the lead paragraph, whenever the label helps distinguish current behavior from future or rejected design:
- Implemented: behavior exists in the mainline code and has validation.
- Partially implemented: some behavior exists, but the page also describes missing work.
- Accepted design: intended direction, not fully implemented.
- Future design: plausible direction, not selected for near-term work.
- Rejected: explicitly not the chosen direction.
- Research note: background used to inform design, not a direct plan.
Avoid status labels on orientation, index, command-reference, and workflow pages where the sidebar section or title already gives the page role. Avoid ambiguous language like “planned” without a stage, dependency, or status label. When a page mixes current and future behavior, split those sections.
Content Rules
- Start with operational facts, not motivation.
- Prefer concrete nouns: process, cap table, ring, endpoint, manifest, init, QEMU smoke.
- Name source files when a claim depends on implementation.
- State authority and ownership rules explicitly.
- State failure behavior explicitly.
- Link to proposals and research instead of duplicating long rationale.
- Keep `ROADMAP.md` and `WORKPLAN.md` as planning sources, not as content to paste into the book.
- Do not describe behavior as implemented unless validation exists or the code map makes the claim directly checkable.
- Do not bury current limitations at the bottom of a long proposal.
Proposal Index
`docs/proposals/index.md` should classify proposal files instead of listing them alphabetically. A useful first classification:
- Active or near-term:
- service architecture
- storage and naming
- error handling
- security and verification
- Future architecture:
- networking
- SMP
- userspace binaries
- shell
- user identity and policy
- cloud metadata
- cloud deployment
- live upgrade
- GPU capability
- formal MAC/MIC
- browser/WASM
- Rejected or superseded:
- rejected Cap’n Proto ring SQE envelope
Each proposal entry should have a one-sentence purpose and a status label.
Research Index
`docs/research.md` should remain the top-level research index, but it should gain a short “Design consequences for capOS” section near the top. Readers should not need to read every long report to learn which ideas were accepted.
Each long research report should eventually end with:
```markdown
## Used By
- Architecture or proposal page that relies on this research.
- Concrete design decision influenced by this report.
```
Diagrams
Use Mermaid only where it clarifies flow or authority:
- boot flow: firmware, Limine, kernel, manifest, init
- capability ring: SQE submission, `cap_enter`, CQE completion
- endpoint IPC: client CALL, server RECV, server RETURN
- manifest startup: boot package, init, ProcessSpawner, child caps
Avoid diagrams that duplicate file layout or become stale when a function is renamed. Every diagram should have nearby text that states the same key invariant in prose.
Migration Plan
Phase 1: Skeleton and Reader Path
- Add `book.toml` with `docs` as the source directory and output under `target/docs-site`.
- Add `docs/SUMMARY.md`.
- Add `docs/index.md`.
- Add `docs/overview.md`.
- Add `docs/status.md`.
- Add `docs/build-run-test.md`.
- Add `docs/repo-map.md`.
Acceptance criteria:
- `mdbook build` succeeds.
- The first section explains what capOS is, how to build it, how to boot it, and where to find the major code areas.
- Existing proposal and research files are reachable through the sidebar.
Phase 2: Current Architecture Pages
- Add the first architecture pages:
- boot flow
- process model
- capability ring
- IPC and endpoints
- userspace runtime
- manifest and service startup
- memory management
- scheduling
- Keep `docs/capability-model.md` as a first-class architecture page.
Acceptance criteria:
- Each architecture page has status, current behavior, invariants, code map, validation, and open work.
- Each page distinguishes implemented behavior from future design.
- At least boot flow, capability ring, IPC, and manifest startup include a concise Mermaid diagram.
Phase 3: Security and Verification Pages
- Add `docs/security/trust-boundaries.md`.
- Add `docs/security/verification-workflow.md`.
- Link existing inventories and designs from the security section.
- Make each security page name the relevant validation commands and review documents.
Acceptance criteria:
- A reviewer can find the hostile-input boundaries, trusted inputs, and verification workflow without reading all proposals.
- The security section links to `REVIEW.md`, `REVIEW_FINDINGS.md`, `docs/trusted-build-inputs.md`, and `docs/panic-surface-inventory.md`.
Phase 4: Proposal and Research Curation
- Add `docs/proposals/index.md`.
- Add `docs/proposals/other.md` only if the sidebar would otherwise become too noisy.
- Add status labels to proposal files as they are touched.
- Add “Used By” sections to research files incrementally.
Acceptance criteria:
- Proposal status is visible before a reader opens a long document.
- Rejected and future proposals are not confused with implemented behavior.
- Research pages point back to the architecture or proposal pages they influence.
Maintenance Rules
- When implementation changes a subsystem, update the corresponding architecture page in the same change when the page would otherwise become misleading.
- When a proposal is accepted, rejected, or partially implemented, update its status and the proposal index.
- When `WORKPLAN.md` changes the selected milestone, update `docs/status.md` only if the public current-system summary changes. Do not mirror every operational task into the docs site.
- When validation commands change, update `docs/build-run-test.md` and the affected architecture page.
Tooling Follow-Up
The content proposal assumes mdBook because it matches the repo’s Rust toolchain and plain Markdown corpus. A minimal tooling follow-up should add:
- `book.toml`
- `make docs`
- `make docs-serve`
- optional link checking after the first site build is stable
Do not add a frontend package manager, theme framework, or generated site assets unless the content structure proves insufficient.
Open Questions
- Should `ROADMAP.md`, `WORKPLAN.md`, and `REVIEW_FINDINGS.md` be included in the mdBook sidebar, or only linked from `status.md`?
- Should long proposal files keep their current filenames, or should accepted designs eventually move from `docs/proposals/` into `docs/architecture/`?
- Should `docs/status.md` be manually maintained, or generated from a smaller checked-in status data file later?
- Should Cap’n Proto schema documentation be generated into the book once the interface surface stabilizes?
Recommended First Commit
The first implementation commit should be deliberately small:
- Add mdBook config.
- Add `SUMMARY.md`.
- Add the Start Here pages.
- Link existing proposal and research files without rewriting them.
- Verify `mdbook build`.
That gives the project a usable docs site quickly, without blocking on a full architecture rewrite.
Proposal: Go Language Support via Custom GOOS
Running Go programs natively on capOS by implementing a GOOS=capos target
in the Go runtime.
Motivation
Go is the implementation language of CUE, the configuration language planned for system manifests. Beyond CUE, Go has a large ecosystem of systems software (container runtimes, network tools, observability agents) that would be valuable to run on capOS without rewriting.
The userspace-binaries proposal (Part 3) places Go in Tier 4 (“managed
runtimes, much later”) and suggests WASI as the pragmatic path. This proposal
explores the native alternative: a custom GOOS=capos that lets Go programs
run directly on capOS hardware, without a WASM interpreter in between.
Why Go is Hard
Go’s runtime is a userspace operating system. It manages its own:
- Goroutine scheduler — M:N threading (M OS threads, N goroutines), work-stealing, preemption via signals or cooperative yield points
- Garbage collector — concurrent, tri-color mark-sweep, requires write barriers, stop-the-world pauses, and memory management syscalls
- Stack management — segmented/copying stacks with guard pages, grow/shrink on demand
- Network poller — epoll/kqueue-based async I/O for `net.Conn`
- Memory allocator — mmap-based, spans, mcache/mcentral/mheap hierarchy
- Signal handling — goroutine preemption, crash reporting, profiling
Each of these assumes a specific OS interface. The Go runtime calls ~40 distinct syscalls on Linux. capOS currently has 2.
Syscall Surface Required
The Go runtime’s Linux syscall usage, grouped by subsystem:
Memory Management (critical, blocks everything)
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Heap allocation | mmap(MAP_ANON) | FrameAllocator cap + page table manipulation |
| Heap deallocation | munmap | Unmap + free frames |
| Stack guard pages | mmap(PROT_NONE) + mprotect | Map page with no permissions |
| GC needs contiguous arenas | mmap with hints | Allocate contiguous frames, map contiguously |
| Commit/decommit pages | madvise(DONTNEED) | Unmap or zero pages |
capOS needs: A sys_mmap-like capability or syscall that can:
- Map anonymous pages at arbitrary user addresses
- Set per-page permissions (R, W, X, none)
- Allocate contiguous virtual ranges
- Decommit without unmapping (for GC arena management)
This could be a VirtualMemory capability:
```capnp
interface VirtualMemory {
  # Map anonymous pages at hint address (0 = kernel chooses)
  map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
  # Unmap pages
  unmap @1 (addr :UInt64, size :UInt64) -> ();
  # Change permissions on mapped range
  protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
  # Decommit (release physical frames, keep virtual range reserved)
  decommit @3 (addr :UInt64, size :UInt64) -> ();
}
```
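The page-state transitions behind this interface can be modeled in a few lines. This is a hedged userspace model of the reserve/commit/decommit semantics the interface implies, not the kernel's page table code; the type names and page constant are illustrative.

```rust
use std::collections::BTreeMap;

const PAGE: u64 = 4096;

#[derive(Clone, Copy, PartialEq, Debug)]
enum PageState {
    Reserved,  // virtual range claimed, no physical frame
    Committed, // backed by a physical frame
}

/// Model of the VirtualMemory semantics: map commits pages, decommit
/// releases frames but keeps the range reserved, unmap releases everything.
struct AddressSpaceModel {
    pages: BTreeMap<u64, PageState>,
}

impl AddressSpaceModel {
    fn map(&mut self, addr: u64, size: u64) {
        for page in (addr..addr + size).step_by(PAGE as usize) {
            self.pages.insert(page, PageState::Committed);
        }
    }

    fn decommit(&mut self, addr: u64, size: u64) {
        for page in (addr..addr + size).step_by(PAGE as usize) {
            // Frames are released, but the range stays reserved for the GC.
            self.pages.insert(page, PageState::Reserved);
        }
    }

    fn unmap(&mut self, addr: u64, size: u64) {
        for page in (addr..addr + size).step_by(PAGE as usize) {
            self.pages.remove(&page);
        }
    }
}

fn main() {
    let mut vm = AddressSpaceModel { pages: BTreeMap::new() };
    vm.map(0x4000_0000, 4 * PAGE);
    vm.decommit(0x4000_0000, 2 * PAGE);
    assert_eq!(vm.pages[&0x4000_0000], PageState::Reserved);
    assert_eq!(vm.pages[&(0x4000_0000 + 2 * PAGE)], PageState::Committed);
    vm.unmap(0x4000_0000, 4 * PAGE);
    assert!(vm.pages.is_empty());
    println!("ok");
}
```

The Reserved-but-not-Committed state is the piece the current paging code lacks; it is what lets Go's GC hold huge virtual arenas without pinning physical frames.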
Threading (critical for goroutines)
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Create OS thread | clone(CLONE_THREAD) | Thread cap or in-process threading primitive |
| Thread-local storage | arch_prctl(SET_FS) | Per-thread FS base (kernel sets on context switch) |
| Block thread | futex(WAIT) | Futex cap or kernel-side futex |
| Wake thread | futex(WAKE) | Futex cap |
| Thread exit | exit(thread) | Thread exit syscall |
capOS needs: Threading support within a process. Options:
Option A: Kernel threads. The kernel manages threads (multiple execution contexts sharing one address space). Each thread has its own stack, register state, and FS base, but shares page tables and cap table with the process. This is what Linux does and what Go expects.
Option B: User-level threading. The process manages its own threads (like green threads). The kernel only sees one execution context per process. Go’s scheduler already does M:N threading, so it could work with a single OS thread per process — but the GC’s stop-the-world relies on being able to stop other OS threads, and the network poller blocks an OS thread.
Option A is simpler for Go compatibility. Option B is more capability-aligned (threads are a process-internal concern) but requires Go runtime modifications.
Synchronization
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Futex wait | futex(FUTEX_WAIT) | Futex authority cap, ABI selected by measurement |
| Futex wake | futex(FUTEX_WAKE) | Futex authority cap, ABI selected by measurement |
| Atomic compare-and-swap | CPU instructions | Already available (no kernel support needed) |
Futexes are a kernel primitive (block/wake on a userspace address). capOS should expose futex authority through a capability from the start. The ABI is still a measurement question: generic capnp/ring method if overhead is close to a compact path, otherwise a compact capability-authorized operation.
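Whatever ABI the measurement selects, the semantics the Go runtime needs are fixed. The sketch below is a hedged userspace model of futex wait/wake built on std `Mutex`/`Condvar`; the kernel would do the compare-and-sleep atomically rather than under a separate lock.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Model of futex semantics: wait(expected) sleeps only while the word
/// still holds `expected`; wake_all releases all sleepers.
struct Futex {
    word: AtomicU32,
    lock: Mutex<()>,
    cvar: Condvar,
}

impl Futex {
    fn wait(&self, expected: u32) {
        let mut guard = self.lock.lock().unwrap();
        // Re-check under the lock so a wake between the compare and the
        // sleep is never lost -- the kernel performs this check atomically.
        while self.word.load(Ordering::Acquire) == expected {
            guard = self.cvar.wait(guard).unwrap();
        }
    }

    fn wake_all(&self) {
        let _guard = self.lock.lock().unwrap();
        self.cvar.notify_all();
    }
}

fn main() {
    let futex = Arc::new(Futex {
        word: AtomicU32::new(0),
        lock: Mutex::new(()),
        cvar: Condvar::new(),
    });
    let waiter = {
        let futex = Arc::clone(&futex);
        thread::spawn(move || futex.wait(0)) // sleeps while word == 0
    };
    // Publish the new value, then wake -- the order matters.
    futex.word.store(1, Ordering::Release);
    futex.wake_all();
    waiter.join().unwrap();
    println!("ok");
}
```

The lost-wakeup re-check is the whole contract: a caller that observes a stale value must still wake promptly, which is why the word compare and the sleep cannot be two separate kernel calls.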
Time
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Monotonic clock | clock_gettime(MONOTONIC) | Timer cap .now() |
| Wall clock | clock_gettime(REALTIME) | Timer cap or RTC driver |
| Sleep | nanosleep or futex with timeout | Timer cap .sleep() or futex timeout |
| Timer events | timer_create / timerfd | Timer cap with callback or poll |
Timer cap already planned. Go needs monotonic time for goroutine scheduling
and wall time for time.Now().
I/O
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Network I/O | epoll_create, epoll_ctl, epoll_wait | Async cap invocation or poll cap |
| File I/O | read, write, open, close | Namespace + Store caps (via POSIX layer) |
| Stdout/stderr | write(1, ...), write(2, ...) | Console cap |
| Pipe (runtime internal) | pipe2 | IPC caps or in-process channel |
Go’s network poller (netpoll) is pluggable per-OS — each GOOS provides
its own implementation. For capOS, it would use async capability invocations
or a polling interface over socket caps.
Signals (for preemption)
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Goroutine preemption | tgkill + SIGURG | Thread preemption mechanism |
| Crash handling | sigaction(SIGSEGV) | Page fault notification |
| Profiling | sigaction(SIGPROF) + setitimer | Profiling cap (optional) |
Go 1.14+ uses asynchronous preemption: the runtime sends SIGURG to a
thread to interrupt a long-running goroutine. On capOS, alternatives:
- Cooperative preemption only. Go inserts yield points at function prologues and loop back-edges. This works but means tight loops without function calls won’t yield. Acceptable for initial support.
- Timer interrupt notification. The kernel notifies the process (via a cap invocation or a signal-like mechanism) when a time quantum expires. The notification handler in the Go runtime triggers goroutine preemption.
Implementation Strategy
Phase 1: Minimal GOOS (single-threaded, cooperative)
Fork the Go toolchain, add GOOS=capos GOARCH=amd64. Implement the minimum
runtime changes:
What to implement:
- `osinit()` — read Timer cap from CapSet for monotonic clock
- `sysAlloc`/`sysFree`/`sysReserve`/`sysMap` — translate to VirtualMemory cap
- `newosproc()` — stub (single OS thread, M:N scheduler still works with M=1)
- `futexsleep`/`futexwake` — spin-based fallback (no real futex yet)
- `nanotime`/`walltime` — Timer cap
- `write()` (for runtime debug output) — Console cap
- `exit`/`exitThread` — `sys_exit`
- `netpoll` — stub returning “nothing ready” (no async I/O)
What to stub/disable:
- Signals (no SIGURG preemption, cooperative only)
- Multi-threaded GC (single-thread STW is fine initially)
- CGo (no C interop)
- Profiling
- Core dumps
Deliverable: `GOOS=capos go build ./cmd/hello` produces an ELF that runs on capOS, prints “Hello, World!”, and exits.
Estimated effort: ~2000-3000 lines of Go runtime code (mostly in `runtime/os_capos.go`, `runtime/sys_capos_amd64.s`, and `runtime/mem_capos.go`). Reference: `runtime/os_js.go` (WASM target) is ~400 lines; `runtime/os_linux.go` is ~700 lines. capOS sits between these.
Phase 2: Kernel Threading + Futex
Add kernel support for:
- Multiple threads per process (shared address space, separate stacks)
- Futex authority capability and measured wait/wake ABI
- FS base per-thread (for goroutine-local storage)
Update Go runtime:
- `newosproc()` creates a real kernel thread
- `futexsleep`/`futexwake` use the selected futex capability ABI
- GC runs concurrently across threads
- Enable `GOMAXPROCS > 1`
Deliverable: Go programs use multiple CPU cores. GC is concurrent.
Phase 3: Network Poller
Implement `runtime/netpoll_capos.go`:
- Register socket caps with the poller
- Use an async notification mechanism (capability-based `poll()` or notification cap)
- `net.Dial()`, `net.Listen()`, and `http.Get()` work
This depends on the networking stack being available as capabilities.
Deliverable: Go HTTP client/server runs on capOS.
Phase 4: CUE on capOS
With Go working, CUE runs natively. This enables:
- Runtime manifest evaluation (not just build-time)
- Dynamic service reconfiguration via CUE expressions
- CUE-based policy enforcement in the capability layer
Kernel Prerequisites
| Prerequisite | Roadmap Stage | Why |
|---|---|---|
| Capability syscalls | Stage 4 (sync path done) | Go runtime invokes caps (VirtualMemory, Timer, Console) |
| Scheduling | Stage 5 (core done) | Go needs timer interrupts for goroutine preemption fallback |
| IPC + cap transfer | Stage 6 | Go programs are service processes that export/import caps |
| VirtualMemory capability | Stage 5 | mmap equivalent for Go’s memory allocator and GC |
| Thread support | Extends Stage 5 | Multiple execution contexts per process |
| Futex authority capability | Extends Stage 5 | Go runtime synchronization |
VirtualMemory Capability
This is the biggest new kernel primitive. Go’s allocator requires:
- Reserve large virtual ranges without committing physical memory (Go reserves 256 TB of virtual space on 64-bit systems)
- Commit pages within reserved ranges (back with physical frames)
- Decommit pages (release frames, keep virtual range reserved)
- Set permissions (RW for data, none for guard pages)
The existing page table code (kernel/src/mem/paging.rs) supports mapping
and unmapping individual pages. It needs to be extended with:
- Virtual range reservation (mark ranges as reserved in some bitmap/tree)
- Lazy commit (map as `PROT_NONE` initially, the page fault handler commits on demand — or explicit commit via cap call)
- Permission changes on existing mappings
Thread Support
Extending the process model (kernel/src/process.rs). See the
SMP proposal for the PerCpu struct layout (per-CPU
kernel stack, saved registers, FS base); Thread extends this for
multi-thread-per-process. See also the In-Process Threading section in
ROADMAP.md for the
roadmap-level view.
```rust
struct Process {
    pid: u64,
    address_space: AddressSpace, // shared by all threads
    caps: CapTable,              // shared by all threads
    threads: Vec<Thread>,
}

struct Thread {
    tid: u64,
    state: ThreadState,
    kernel_stack: VirtAddr,
    saved_regs: RegisterState, // rsp, rip, etc.
    fs_base: u64,              // for thread-local storage
}
```
The scheduler (Stage 5) schedules threads, not processes. Each thread gets its own kernel stack and register save area. Context switch saves/restores thread state. Page table switch only happens when switching between threads of different processes.
Alternative: Go via WASI
For comparison, the WASI path from the userspace-binaries proposal:
| Native GOOS | WASI | |
|---|---|---|
| Performance | Native speed | ~2-5x overhead (wasm interpreter/JIT) |
| Go compatibility | Full (after Phase 3) | Limited (WASI Go support is experimental) |
| Goroutines | Real M:N scheduling | Single-threaded (WASI has no threads yet) |
| Net I/O | Native async via poller | Blocking only (WASI sockets are sync) |
| Kernel work | VirtualMemory, threads, futex | None (wasm runtime handles it) |
| Go runtime fork | Yes (maintain a fork) | No (upstream GOOS=wasip1) |
| GC | Full concurrent GC | Conservative GC (wasm has no stack scanning) |
| Maintenance burden | High (track Go releases) | Low (upstream supported) |
WASI is easier but limited. Go on WASI (GOOS=wasip1) is officially
supported but experimental — no goroutine parallelism, no async I/O, limited
stdlib. For running CUE (which is CPU-bound evaluation, no I/O, single
goroutine), WASI might be sufficient.
Native GOOS is harder but complete. Full Go with goroutines, concurrent
GC, network I/O, and the entire stdlib. Required for Go network services
or anything using net/http.
Recommendation: Start with WASI for CUE evaluation (Phase 4 of the WASI proposal in userspace-binaries). If Go network services become a goal, invest in the native GOOS.
Relationship to Other Proposals
- Userspace binaries proposal — this extends Tier 4 (managed runtimes) with concrete Go implementation details. The POSIX layer (Part 4) is NOT sufficient for Go — Go doesn’t use libc on Linux, it makes raw syscalls. The GOOS approach bypasses POSIX entirely.
- Service architecture proposal — Go services participate in the capability graph like any other process. The Go net poller (Phase 3) uses TcpSocket/UdpSocket caps from the network stack.
- Storage and naming proposal — Go’s `os.Open()`/`os.Read()` map to Namespace + Store caps via the GOOS file I/O implementation. Go doesn’t use POSIX for this; it has its own `runtime/os_capos.go` with direct cap invocations.
- SMP proposal — Go’s `GOMAXPROCS` uses multiple OS threads (Phase 2). Requires per-CPU scheduling from Stage 5/7.
Open Questions
- Fork maintenance. A `GOOS=capos` fork must track upstream Go releases. How much drift is acceptable? Could the capOS-specific code eventually be upstreamed (like Fuchsia’s was)?
- CGo support. Go’s FFI to C (`cgo`) requires a C toolchain and dynamic linking. Should capOS support cgo, or is pure Go sufficient? CUE doesn’t use cgo, but some Go libraries do.
- GOROOT on capOS. Go programs expect `$GOROOT/lib` at runtime for some stdlib features. Where does this live on capOS? In the Store? Baked into the binary via static compilation?
- Go module proxy. `go get` needs HTTP access. On capOS, this would use a `Fetch` cap. But cross-compilation on the host is more practical than building Go on capOS itself.
- Debugging. Go’s `runtime/debug` and `pprof` expect signals and `/proc` access. What debugging capabilities should capOS expose?
- GC tuning. Go’s GC is tuned for Linux’s mmap semantics (decommit is cheap, virtual space is nearly free). capOS’s VirtualMemory cap needs to match these assumptions or the GC will need retuning.
Estimated Scope
| Phase | New kernel code | Go runtime changes | Dependencies |
|---|---|---|---|
| Phase 1: Minimal GOOS | ~200 (VirtualMemory cap) | ~2000-3000 | Stages 4-5 |
| Phase 2: Threading | ~500 (threads, futex) | ~500 | Stage 5, SMP |
| Phase 3: Net poller | ~100 (async notification) | ~300 | Networking, Stage 6 |
| Phase 4: CUE on capOS | 0 | 0 | Phase 1 (or WASI) |
| Total | ~800 | ~2800-3800 | |
Plus ongoing maintenance to track Go upstream releases.
Proposal: Capability-Native System Monitoring
How capOS should expose logs, metrics, health, traces, crash records, and
service status without introducing global /proc, ambient log access, or a
privileged monitoring daemon that bypasses the capability model.
Problem
The current system is observable mostly through serial output, QEMU exit status, smoke-test lines, CQE error codes, and a small measurement-only build feature. That is enough for early kernel work, but it is not enough for a system whose claims depend on service decomposition, explicit authority, restart policy, auditability, and later cloud operation.
Monitoring is also not harmless. A monitoring service can reveal capability
topology, service names, badges, timing, crash context, request payloads, and
security decisions. If capOS imports a Unix-style “read everything under
/proc” or “global syslog” model, monitoring becomes an ambient authority
escape hatch. If it imports a kernel-programmable tracing model too early, it
adds a large privileged execution surface before the basic service graph is
stable.
The design target is narrower: make operational state visible through typed, attenuable capabilities. A process should observe only the services, logs, and signals it was granted authority to inspect.
Current State
Implemented signal sources:
- Kernel diagnostics are printed through COM1 serial via `kprintln!`, timestamped with the PIT tick counter. Panic and fault paths use a mutex-free emergency serial writer.
- Userspace logging currently goes through the kernel `Console` capability, backed directly by serial and bounded per call.
- Runtime panics can use an emergency console path, then exit with a fixed code.
- Capability-ring CQEs carry structured transport results, including negative `CAP_ERR_*` values and serialized `CapException` payloads.
- The ring tracks `cq_overflow`, corrupted SQ/CQ recovery, and bounded SQE dispatch, but these facts are not exported as normal metrics.
- `ProcessSpawner` and `ProcessHandle.wait` expose basic child lifecycle observation, but restart policy, health checks, and exported-cap lifecycle are future work.
- `capos-lib::ResourceLedger` tracks cap slots, outstanding calls, scratch bytes, and frame grants, but only as local accounting state.
- The `measure` feature adds benchmark-only counters and TSC helpers for controlled `make run-measure` boots.
- `SystemConfig.logLevel` exists in the schema and is printed at boot, but there is no filtering, routing, or retention policy behind it.
That means the system has useful raw signals but lacks a capability-shaped monitoring architecture.
Design Principles
- Observation is authority. Reading logs, status, metrics, traces, crash records, or audit entries requires a capability.
- No global monitoring root. `SystemStatus(all)`, `LogReader(all)`, and `ServiceSupervisor(all)` are powerful caps. Normal sessions receive scoped wrappers.
- Kernel facts, userspace policy. The kernel may expose bounded facts about processes, rings, resources, and faults. Retention, filtering, aggregation, health semantics, restart policy, and user-facing views belong in userspace.
- Separate signal classes. Logs, metrics, lifecycle events, traces, health, crash records, and audit logs have different readers, retention rules, and security properties.
- Bounded by construction. Every producer path has a byte, entry, or time budget. Loss is explicit and summarized.
- Payload capture is exceptional. Default tracing records headers, interface IDs, method IDs, sizes, result codes, badges when authorized, and timing. Capturing method payloads needs a stronger cap because payloads may contain secrets.
- Serial remains emergency plumbing. Early boot, panic, and recovery still need direct serial output. Normal services should receive log caps rather than broad `Console`.
- Audit is not debug logging. Audit records security-relevant decisions and capability lifecycle events. It is append-only from producers and exposed through scoped readers.
- Pull by default, push when justified. Status and metrics are pull-shaped (reader polls a snapshot cap). Logs, lifecycle events, crash records, and audit entries are push-shaped (producer calls into a sink). Traces are pull with an explicit arm/drain lifecycle because capture is expensive. Each direction has its own cap surface; do not generalize one shape to cover all signals.
- Narrow kernel stats caps over one god-cap. The kernel exposes bounded facts through several small read-only caps (ring, scheduler, resource ledger, frames, endpoints, caps, crash) rather than one `KernelDiagnostics` that grants everything. Narrow caps let an init-owned status service be assembled by composition, and let a broker lease a subset to an operator without handing over the rest.
Signal Taxonomy
Logs
Human-oriented diagnostic records:
- severity, component, service name, pid, optional session/service badge, monotonic timestamp, message text;
- rate-limited at producer and log service boundaries;
- suitable for serial forwarding, ring-buffer retention, and later storage;
- not a source of truth for security decisions.
Metrics
Low-cardinality numeric state:
- per-process ring SQ/CQ occupancy, `cq_overflow`, invalid SQE counts, opcode counts, transport error counts;
- scheduler runnable/blocked counts, direct IPC handoffs, cap-enter timeouts, process exits;
- resource ledger usage: cap slots, outstanding calls, scratch bytes, frame grants, endpoint queue occupancy, VM mapped pages;
- heap/frame allocator pressure;
- later device, network, storage, and CPU-time counters.
Metric shape is fixed to three forms:
- Counter — monotonic `u64`, reset only by reboot. Cumulative semantics make aggregation composable.
- Gauge — `i64` that moves both ways. Used for queue depths, free-frame counts, mapped-page counts.
- Histogram — fixed bucket layout carried in the descriptor, `u64` per bucket. Used for ring-dispatch duration, context-switch latency, IPC RTT.
Richer shapes (top-k tables, exponential histograms) are emitted as opaque typed payloads through the producer-scoped envelope described under “Core Interfaces”; the generic reader treats them as data, and a schema-aware viewer decodes them. Metrics should be snapshots or monotonic counters, not unbounded label streams.
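The three fixed shapes can be modeled in a few lines. This is an illustrative host-side sketch, not the kernel's representation; bucket bounds and field layouts are assumptions.

```rust
struct Counter(u64); // monotonic, reset only by reboot
struct Gauge(i64);   // moves both ways

struct Histogram {
    bounds: Vec<u64>, // fixed bucket upper bounds, from the descriptor
    counts: Vec<u64>, // one u64 per bucket, plus an overflow bucket
}

impl Counter {
    fn inc(&mut self, n: u64) { self.0 += n; } // never decrements
}

impl Gauge {
    fn add(&mut self, n: i64) { self.0 += n; }
}

impl Histogram {
    fn new(bounds: Vec<u64>) -> Self {
        let n = bounds.len() + 1; // last bucket catches values above all bounds
        Histogram { bounds, counts: vec![0; n] }
    }

    /// Record an observation into the first bucket whose bound covers it.
    fn observe(&mut self, value: u64) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| value <= b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }
}

fn main() {
    let mut dispatches = Counter(0);
    dispatches.inc(3);
    assert_eq!(dispatches.0, 3);

    let mut cq_depth = Gauge(0);
    cq_depth.add(5);
    cq_depth.add(-2);
    assert_eq!(cq_depth.0, 3);

    // e.g. ring-dispatch duration in ticks, buckets <=10, <=100, <=1000, >1000
    let mut latency = Histogram::new(vec![10, 100, 1000]);
    latency.observe(7);
    latency.observe(250);
    latency.observe(9999);
    assert_eq!(latency.counts, vec![1, 0, 1, 1]);
}
```

Because the bucket layout is fixed and carried in the descriptor, two snapshots of the same histogram can be diffed or summed without schema negotiation.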
Events
Discrete lifecycle facts:
- process spawned, started, exited, waited, killed, or failed to load;
- service declared healthy, unhealthy, restarting, quiescing, or upgraded;
- endpoint queue overflow, cancellation, disconnected holder, transfer rollback;
- resource quota rejection;
- device reset, interrupt storm, link up/down, block I/O error once devices exist.
Events are useful for supervisors and status views. They may also feed logs.
Traces
Bounded high-detail capture for debugging:
- SQE/CQE records around one pid, service subtree, endpoint, cap id, or error class;
- optional capnp payload capture only with explicit authority;
- offline schema-aware viewer for reproducing and explaining a failure;
- short retention by default.
This is the Ring as Black Box milestone from WORKPLAN.md, not full replay.
Health
Declared service state:
- ready, starting, degraded, draining, failed, stopped;
- last successful health check and last failure reason;
- dependency health summaries;
- supervisor-owned restart intent and backoff state.
Health is not inferred only from process liveness. A process can be alive and unhealthy, or intentionally draining and still useful.
Crash Records
Panic, exception, and fatal userspace runtime records:
- boot stage, current pid if known, fault vector, RIP/CR2/error code where applicable, recent SQE context when safe, and last serial line cursor;
- bounded, redacted, and readable through a crash/debug capability;
- serial fallback remains mandatory when no reader exists.
Audit
Security and policy records:
- session creation, approval request, policy decision, cap grant, cap transfer, cap release/revocation, denial, declassification/relabel operation;
- no raw authentication proofs, private keys, bearer tokens, or full environment dumps;
- query access is scoped by session, service subtree, or operator role.
Proposed Architecture
flowchart TD
Kernel[Kernel primitives] --> KD[KernelDiagnostics cap]
Kernel --> Serial[Emergency serial]
Init[init / root supervisor] --> LogSvc[Log service]
Init --> MetricsSvc[Metrics service]
Init --> StatusSvc[Status service]
Init --> AuditSvc[Audit log]
Init --> TraceSvc[Trace capture service]
KD --> MetricsSvc
KD --> StatusSvc
KD --> TraceSvc
Services[Services and drivers] --> LogSink[Scoped LogSink caps]
Services --> Health[Health caps]
Services --> AuditWriter[Scoped AuditWriter caps]
LogSink --> LogSvc
Health --> StatusSvc
AuditWriter --> AuditSvc
Broker[AuthorityBroker] --> Readers[Scoped readers]
Readers --> Shell[Shell / agent / operator tools]
StatusSvc --> Readers
LogSvc --> Readers
MetricsSvc --> Readers
TraceSvc --> Readers
AuditSvc --> Readers
The important property is that there is no ambient monitoring namespace. The graph is assembled by init and supervisors. Readers are capabilities, not paths.
Core Interfaces
These are conceptual interfaces. They should not be added to
schema/capos.capnp until the current manifest-executor work is complete and a
specific implementation slice needs them.
enum Severity {
debug @0;
info @1;
warn @2;
error @3;
critical @4;
}
struct LogRecord {
tick @0 :UInt64;
severity @1 :Severity;
component @2 :Text;
pid @3 :UInt32;
badge @4 :UInt64;
message @5 :Text;
}
struct LogFilter {
minSeverity @0 :Severity;
componentPrefix @1 :Text;
pid @2 :UInt32;
includeDebug @3 :Bool;
}
interface LogSink {
write @0 (record :LogRecord) -> ();
}
interface LogReader {
read @0 (cursor :UInt64, maxRecords :UInt32, filter :LogFilter)
-> (records :List(LogRecord), nextCursor :UInt64, dropped :UInt64);
}
LogSink is what ordinary services receive. LogReader is what shells,
operators, supervisors, and diagnostic tools receive. A scoped reader can filter
to one service subtree or session before the caller ever sees the record.
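The cursor/`dropped` contract of `LogReader.read` can be sketched against a fixed-size retention buffer. A hypothetical std-Rust model (the real log service would hold `LogRecord` values, not strings):

```rust
use std::collections::VecDeque;

/// Fixed-size retention buffer with cursor/dropped read semantics.
struct LogBuffer {
    records: VecDeque<(u64, String)>, // (sequence number, message)
    next_seq: u64,
    capacity: usize,
}

impl LogBuffer {
    fn new(capacity: usize) -> Self {
        Self { records: VecDeque::new(), next_seq: 0, capacity }
    }

    /// Producer side: bounded by construction; old records are evicted.
    fn write(&mut self, message: &str) {
        if self.records.len() == self.capacity {
            self.records.pop_front(); // explicit loss, surfaced via `dropped`
        }
        self.records.push_back((self.next_seq, message.to_string()));
        self.next_seq += 1;
    }

    /// Reader side: resume from `cursor`; `dropped` reports records the
    /// reader can never see because retention already evicted them.
    fn read(&self, cursor: u64, max: usize) -> (Vec<String>, u64, u64) {
        let oldest = self.records.front().map(|r| r.0).unwrap_or(self.next_seq);
        let dropped = oldest.saturating_sub(cursor);
        let start = cursor.max(oldest);
        let out: Vec<String> = self
            .records
            .iter()
            .filter(|(seq, _)| *seq >= start)
            .take(max)
            .map(|(_, m)| m.clone())
            .collect();
        let next_cursor = start + out.len() as u64;
        (out, next_cursor, dropped)
    }
}

fn main() {
    let mut buf = LogBuffer::new(2);
    buf.write("a");
    buf.write("b");
    buf.write("c"); // evicts "a"
    let (records, next, dropped) = buf.read(0, 10);
    assert_eq!(records, vec!["b".to_string(), "c".to_string()]);
    assert_eq!(next, 3);
    assert_eq!(dropped, 1); // "a" was lost before the reader caught up
}
```

A slow reader thus observes loss explicitly rather than silently skipping records, which matches the "loss is explicit and summarized" principle.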
struct ProcessStatus {
pid @0 :UInt32;
serviceName @1 :Text;
state @2 :Text;
capSlotsUsed @3 :UInt32;
capSlotsMax @4 :UInt32;
outstandingCalls @5 :UInt32;
cqReady @6 :UInt32;
cqOverflow @7 :UInt64;
lastExitCode @8 :Int64;
}
struct ServiceStatus {
name @0 :Text;
health @1 :Text;
pid @2 :UInt32;
restartCount @3 :UInt32;
lastError @4 :Text;
}
interface SystemStatus {
listProcesses @0 () -> (processes :List(ProcessStatus));
listServices @1 () -> (services :List(ServiceStatus));
service @2 (name :Text) -> (status :ServiceStatus);
}
SystemStatus is read-only. A broad instance can see the system; wrappers can
expose one service, one supervision subtree, or one session.
enum MetricKind {
counter @0;
gauge @1;
histogram @2;
}
struct MetricSample {
# Well-known fixed-name slot for counters and gauges the aggregator
# understands without additional schema lookup. Use this for stable
# kernel counters to keep the hot path allocation-free.
name @0 :Text;
kind @1 :MetricKind;
value @2 :Int64;
tick @3 :UInt64;
# Producer-scoped typed envelope for richer samples (histograms,
# top-k tables, per-subsystem structs). Payload is a capnp message;
# the schema is identified by `schemaHash` (capnp node id) and keyed
# per producer. Opaque to the generic reader; a schema-aware viewer
# decodes it.
producerId @4 :UInt64;
schemaHash @5 :UInt64;
payload @6 :Data;
}
struct MetricFilter {
prefix @0 :Text;
service @1 :Text;
}
interface MetricsReader {
snapshot @0 (filter :MetricFilter, maxSamples :UInt32)
-> (samples :List(MetricSample), truncated :Bool);
}
Early metrics should be fixed-name counters and gauges in the name/value
slot. Avoid arbitrary labels until there is a concrete memory and cardinality
policy. The producer-scoped envelope exists so richer samples do not force the
generic reader to learn a string-key taxonomy — if a producer needs per-queue
or per-device detail, it ships a typed capnp struct keyed by schemaHash
rather than synthesizing name strings.
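A generic reader's handling of the two sample shapes can be sketched as follows. This is an illustrative model, not the generated schema types, and treating `schema_hash == 0` as "plain name/value slot" is an assumption of this sketch:

```rust
/// Simplified mirror of MetricSample's two shapes (illustrative only).
struct MetricSample {
    name: String,
    value: i64,
    schema_hash: u64, // assumed: 0 means plain name/value sample
    payload: Vec<u8>, // opaque capnp bytes when schema_hash != 0
}

enum Decoded {
    Plain { name: String, value: i64 },
    // The generic reader stops here; a schema-aware viewer decodes further.
    Opaque { schema_hash: u64, bytes: usize },
}

/// A generic reader understands only the fixed slot; envelope payloads
/// pass through as opaque data keyed by schema hash.
fn decode_generic(sample: &MetricSample) -> Decoded {
    if sample.schema_hash == 0 {
        Decoded::Plain { name: sample.name.clone(), value: sample.value }
    } else {
        Decoded::Opaque {
            schema_hash: sample.schema_hash,
            bytes: sample.payload.len(),
        }
    }
}

fn main() {
    let plain = MetricSample {
        name: "ring.dispatches".into(), value: 42,
        schema_hash: 0, payload: Vec::new(),
    };
    match decode_generic(&plain) {
        Decoded::Plain { name, value } => {
            assert_eq!(name, "ring.dispatches");
            assert_eq!(value, 42);
        }
        _ => unreachable!(),
    }

    let rich = MetricSample {
        name: String::new(), value: 0,
        schema_hash: 0xabcd, payload: vec![0; 8],
    };
    match decode_generic(&rich) {
        Decoded::Opaque { schema_hash, bytes } => {
            assert_eq!(schema_hash, 0xabcd);
            assert_eq!(bytes, 8);
        }
        _ => unreachable!(),
    }
}
```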
struct TraceSelector {
pid @0 :UInt32;
serviceName @1 :Text;
errorCode @2 :Int32;
includePayloadBytes @3 :Bool;
}
struct TraceRecord {
tick @0 :UInt64;
pid @1 :UInt32;
opcode @2 :UInt16;
capId @3 :UInt32;
methodId @4 :UInt16;
interfaceId @5 :UInt64;
result @6 :Int32;
flags @7 :UInt16;
payload @8 :Data;
}
interface TraceCapture {
arm @0 (selector :TraceSelector, maxRecords :UInt32, maxBytes :UInt32)
-> (captureId :UInt64);
drain @1 (captureId :UInt64, maxRecords :UInt32)
-> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
}
Payload capture should default off. A capture cap that can read payload bytes is closer to a debug privilege than a normal status cap.
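The arm/drain lifecycle with record and byte budgets can be modeled directly. An illustrative sketch (names and budget policy are assumptions, not the kernel's implementation):

```rust
/// One armed capture with budgets fixed at arm time.
struct Capture {
    records: Vec<Vec<u8>>,
    max_records: usize,
    max_bytes: usize,
    used_bytes: usize,
    dropped: u64,
    drained: usize,
}

impl Capture {
    /// `arm` fixes both budgets up front; capture is expensive, so it
    /// never grows past what the caller asked for.
    fn arm(max_records: usize, max_bytes: usize) -> Self {
        Self {
            records: Vec::new(),
            max_records,
            max_bytes,
            used_bytes: 0,
            dropped: 0,
            drained: 0,
        }
    }

    /// Producer path: stop capturing once either budget is exhausted;
    /// loss is counted, never silent.
    fn record(&mut self, bytes: Vec<u8>) {
        if self.records.len() >= self.max_records
            || self.used_bytes + bytes.len() > self.max_bytes
        {
            self.dropped += 1;
            return;
        }
        self.used_bytes += bytes.len();
        self.records.push(bytes);
    }

    /// Reader path: drain in bounded chunks; `complete` signals the end.
    fn drain(&mut self, max: usize) -> (Vec<Vec<u8>>, bool, u64) {
        let end = (self.drained + max).min(self.records.len());
        let out = self.records[self.drained..end].to_vec();
        self.drained = end;
        (out, self.drained == self.records.len(), self.dropped)
    }
}

fn main() {
    let mut cap = Capture::arm(2, 1024);
    cap.record(vec![0; 16]);
    cap.record(vec![1; 16]);
    cap.record(vec![2; 16]); // over the record budget: counted, not stored
    let (chunk, complete, dropped) = cap.drain(10);
    assert_eq!(chunk.len(), 2);
    assert!(complete);
    assert_eq!(dropped, 1);
}
```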
enum HealthState {
starting @0;
ready @1;
degraded @2;
draining @3;
failed @4;
stopped @5;
}
interface Health {
check @0 () -> (state :HealthState, reason :Text);
}
interface ServiceSupervisor {
status @0 () -> (status :ServiceStatus);
restart @1 () -> ();
}
ServiceSupervisor is authority-changing. Normal monitoring readers should not
receive it. A broker can mint a leased ServiceSupervisor(net-stack) for one
operator action.
Kernel Diagnostics Contract
The kernel should expose a small read-only diagnostics surface for facts only the kernel can know:
- process table snapshot: pid, state, service name if known, wait state, exit code, ring physical identity hidden or omitted;
- ring snapshot: SQ/CQ head/tail-derived occupancy, overflow count, corrupted head/tail recovery counts, opcode/error counters;
- resource snapshot: cap slot usage, outstanding calls, scratch reservation, frame grant pages, mapped VM pages, free frame count, heap pressure;
- scheduler snapshot: tick count, current pid, run queue length, blocked count, direct IPC handoff count, timeout wake count;
- crash record: last panic/fault metadata and early boot stage.
The kernel should not implement log routing, alerting, dashboards, retention policy, restart decisions, RBAC, ABAC, or text search. Those are userspace service responsibilities.
Implementation shape:
- Maintain fixed-size counters in existing kernel structures where the source event already occurs.
- Prefer snapshots computed from existing state over duplicate counters when the cost is bounded.
- Expose snapshots through a small set of narrow read-only capabilities, not one `KernelDiagnostics` god-cap. The initial decomposition:
  - `SchedStats` — tick count, current pid, run queue length, blocked count, direct IPC handoff count, `cap_enter` timeout/wake counts.
  - `FrameStats` — free/used frame counts, frame-grant pages, allocator pressure histogram.
  - `RingStats` — per-process SQ/CQ occupancy, `cq_overflow`, corrupted-head recovery counts, opcode counters, transport-error counters.
  - `CapTableStats` — per-process slot occupancy, generation-rollover counts, insertion/remove rates.
  - `EndpointStats` — per-endpoint waiter depth, RECV/RETURN match rate, abort/cancellation counts.
  - `CrashSnapshot` — last panic/fault metadata, early boot stage, recent SQE context when safe.
- Each narrow cap exposes `snapshot() -> (sample :MetricSample)` or a typed struct; none of them enumerates processes or reads cap tables beyond what the subsystem owns. A trusted status service composes the ones it needs; a broker leases a subset for operator sessions without the rest.
- `ProcessInspector` (pid-scoped process table, cap-table enumeration, VM map) is a distinct, stronger cap and lives with process-management authority, not with monitoring.
- Convert broad diagnostics into scoped userspace wrappers before handing them to shells or applications.
- Keep panic/fault serial writes independent of any diagnostics service.
Promotion from the measure feature: the benchmark counters in
kernel/src/measure.rs graduate to always-on in RingStats / SchedStats
when the per-event cost is provably a single relaxed atomic add. Cycle-counter
instrumentation (rdtsc/rdtscp) stays behind cfg(feature = "measure")
because it is serializing and benchmark-only. The promotion threshold keeps
normal dispatch builds free of instrumentation cost without forcing monitoring
into a second build configuration.
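The promotion threshold in practice: an always-on counter must cost one relaxed atomic add on the hot path. A sketch of what a promoted `RingStats` counter could look like (field names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Always-on ring counters (illustrative field set).
struct RingStats {
    dispatches: AtomicU64,
    cq_overflow: AtomicU64,
}

impl RingStats {
    const fn new() -> Self {
        Self {
            dispatches: AtomicU64::new(0),
            cq_overflow: AtomicU64::new(0),
        }
    }

    /// Hot path: a single relaxed add, no lock, no serialization.
    /// Cheap enough to leave in normal dispatch builds.
    #[inline]
    fn count_dispatch(&self) {
        self.dispatches.fetch_add(1, Ordering::Relaxed);
    }

    /// Snapshot path: relaxed loads are fine for monitoring; readers
    /// tolerate slightly stale values.
    fn snapshot(&self) -> (u64, u64) {
        (
            self.dispatches.load(Ordering::Relaxed),
            self.cq_overflow.load(Ordering::Relaxed),
        )
    }
}

// Anything serializing (e.g. rdtscp-based cycle timing) stays behind
// the feature gate instead of being promoted:
#[cfg(feature = "measure")]
fn time_dispatch() { /* benchmark-only cycle measurement */ }

fn main() {
    let stats = RingStats::new();
    for _ in 0..3 {
        stats.count_dispatch();
    }
    assert_eq!(stats.snapshot(), (3, 0));
}
```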
Logging Model
Early boot has only serial. After init starts the log service, ordinary services
should receive LogSink rather than raw Console unless they need emergency
console access.
Recommended path:
- Kernel serial remains for boot, panic, and fault records.
- Init starts a userspace log service and passes scoped `LogSink` caps to children.
- The log service forwards selected records to `Console` until persistent storage exists.
- `SystemConfig.logLevel` becomes an initial policy input for which records the log service forwards and retains.
- Session and operator tools receive scoped `LogReader` caps from a broker.
Services should not put secrets, raw capability payloads, full auth proofs, or arbitrary user input into logs without explicit redaction. Log records are data, not commands.
Metrics and Status
Status answers “what is alive and what state is it in.” Metrics answer “what is the numeric behavior over time.” Keeping them separate avoids a common failure mode where a human-readable status API grows into an unbounded metrics store.
Initial status fields should cover:
- pid, service name, binary name, process state, exit code;
- process handle wait state;
- supervisor health and restart policy once supervision exists;
- cap table occupancy and outstanding call count;
- ring CQ availability and overflow;
- endpoint queue occupancy where authorized.
Initial metrics should cover:
- ring dispatches, SQEs processed, per-op counts, transport error counts;
- cap-enter wait count, timeout count, wake count;
- scheduler context switches and direct IPC handoffs;
- frame free/used counts, frame grant pages, VM mapped pages;
- log records accepted, suppressed, dropped, and forwarded;
- trace records captured and dropped.
Avoid per-method, per-cap-id, per-badge, or per-user high-cardinality metrics by default. Those belong in short-lived traces or scoped logs.
Ring as Black Box
The first concrete monitoring milestone should be the existing WORKPLAN.md
Ring-as-Black-Box item:
- define a bounded capture format for SQE/CQE and endpoint transition records;
- export capture through a debug capability or QEMU-only debug path;
- build a host-side viewer that decodes records and capnp payloads when payload capture is authorized;
- add one failing-call smoke whose captured log can be inspected offline.
This buys immediate debugging value without committing to durable audit, network export, service restart policy, or replay semantics.
This is inspection, not record/replay. Replay requires stronger determinism, payload retention, timer/input modeling, and capability-state checkpoints.
Capture path cost. The capture cap (working name RingTap) is
feature-gated (cfg(feature = "debug_tap") analogous to measure). Every
armed tap imposes a serializing fan-out on dispatch; keeping it out of the
default kernel feature set prevents always-on cost. Arming a tap is itself
an auditable event — the tapped process and the audit log observe it —
and tap grants respect move-semantics so a tap cannot be silently cloned
past its intended holder. Payload-capturing taps require a separately
leased cap distinct from metadata-only capture because payloads may
contain secrets.
Health and Supervision
Health and restart policy should live with supervisors, not in a central kernel daemon.
Each supervisor owns:
- a narrowed `ProcessSpawner`;
- child `ProcessHandle` caps;
- the cap bundle needed to restart its subtree;
- optional `Health` caps exported by children;
- a `LogSink` and `AuditWriter` for its own decisions.
Status services aggregate supervisor-reported health. They should distinguish:
- no process exists;
- process exists but never reported ready;
- process is alive and ready;
- process is alive but degraded;
- process exited normally;
- process failed and supervisor is backing off;
- process was intentionally stopped or draining.
Restart authority should be a separate ServiceSupervisor cap. A read-only
SystemStatus cap must not be able to restart anything.
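The seven distinctions the status service must preserve can be written as one aggregation function. An illustrative sketch (the `Observation` fields are hypothetical stand-ins for supervisor-reported state):

```rust
#[derive(PartialEq, Debug)]
enum AggregatedHealth {
    Missing,        // no process exists
    Starting,       // alive, never reported ready
    Ready,          // alive and ready
    Degraded,       // alive but unhealthy
    ExitedNormally, // exited normally
    BackingOff,     // failed; supervisor delaying restart
    Stopped,        // intentionally stopped or draining
}

struct Observation {
    process_alive: bool,
    reported_ready: bool,
    reported_degraded: bool,
    exited_normally: bool,
    supervisor_backing_off: bool,
    intentionally_stopped: bool,
}

fn aggregate(o: &Observation) -> AggregatedHealth {
    use AggregatedHealth::*;
    if o.intentionally_stopped {
        Stopped
    } else if o.process_alive {
        if o.reported_degraded {
            Degraded
        } else if o.reported_ready {
            Ready
        } else {
            Starting // alive is not the same as healthy
        }
    } else if o.supervisor_backing_off {
        BackingOff
    } else if o.exited_normally {
        ExitedNormally
    } else {
        Missing
    }
}

fn main() {
    let alive_not_ready = Observation {
        process_alive: true,
        reported_ready: false,
        reported_degraded: false,
        exited_normally: false,
        supervisor_backing_off: false,
        intentionally_stopped: false,
    };
    // A live process is not automatically healthy.
    assert_eq!(aggregate(&alive_not_ready), AggregatedHealth::Starting);
}
```

Collapsing these into a boolean "up/down" is exactly the failure mode the section warns against.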
Audit Integration
Audit should share infrastructure with logging only at the storage or transport layer. Its semantics are different.
Audit producers:
- `AuthorityBroker` for policy decisions and leased grants;
- supervisors for restarts and service lifecycle actions;
- session manager for session creation and logout;
- kernel or status service for cap transfer/release/revocation summaries when those events become part of the exported authority graph;
- recovery tools for repair actions.
Audit readers are scoped:
- a user can read records for its own session;
- an operator can read a service subtree;
- a recovery or security role can read broader streams after policy approval.
Audit entries must avoid secrets and payload dumps. They should record object identity, service identity, policy decision summaries, and capability interface classes rather than raw data.
Security and Backpressure
Monitoring must not become the easiest denial-of-service path.
Required controls:
- Per-process log token buckets, matching the S.9 diagnostic aggregation design.
- Suppression summaries for repeated invalid submissions.
- Fixed-size ring buffers with explicit dropped counts.
- Maximum record size for logs, events, crash records, and traces.
- Bounded formatting outside interrupt context.
- No heap allocation in timer or panic paths.
- No unbounded metric label creation from user-controlled strings.
- Payload tracing disabled by default.
- Redaction rules at producer boundaries and at reader wrappers.
- Capability-scoped readers; no unauthenticated “debug all” endpoint.
When pressure forces dropping, preserve first-observation diagnostics and later summaries. Losing detailed logs is acceptable; corrupting scheduler progress or blocking the kernel on log I/O is not.
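The first control in the list, a per-process log token bucket with a suppression summary, can be sketched as follows (parameters and names are illustrative, not the S.9 design's exact values):

```rust
/// Per-process log admission control: tick-based refill, explicit
/// suppression counting instead of silent drops.
struct LogTokenBucket {
    tokens: u64,
    max_tokens: u64,
    refill_per_tick: u64,
    last_tick: u64,
    suppressed: u64, // records refused since the last summary
}

impl LogTokenBucket {
    fn new(max_tokens: u64, refill_per_tick: u64) -> Self {
        Self {
            tokens: max_tokens,
            max_tokens,
            refill_per_tick,
            last_tick: 0,
            suppressed: 0,
        }
    }

    /// Returns true if the record may pass. On refusal, counts the
    /// loss so a later "N records suppressed" summary can be emitted.
    fn admit(&mut self, now_tick: u64) -> bool {
        let elapsed = now_tick.saturating_sub(self.last_tick);
        self.tokens =
            (self.tokens + elapsed * self.refill_per_tick).min(self.max_tokens);
        self.last_tick = now_tick;
        if self.tokens > 0 {
            self.tokens -= 1;
            true
        } else {
            self.suppressed += 1;
            false
        }
    }

    /// Drain the suppression count for an explicit summary record.
    fn take_suppressed(&mut self) -> u64 {
        std::mem::replace(&mut self.suppressed, 0)
    }
}

fn main() {
    let mut bucket = LogTokenBucket::new(2, 1);
    assert!(bucket.admit(0));  // spends token 1
    assert!(bucket.admit(0));  // spends token 2
    assert!(!bucket.admit(0)); // flood: suppressed, producer not blocked
    assert_eq!(bucket.take_suppressed(), 1);
    assert!(bucket.admit(1));  // one tick refills one token
}
```

Note that refusal never blocks the producer; the loss becomes a summary record, matching "losing detailed logs is acceptable" below.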
Relationship to Existing Proposals
- Service Architecture: monitoring services are ordinary userspace services spawned by init or supervisors. Logging policy and service topology stay out of the pre-init kernel path.
- Shell: the native and agent shell should receive scoped `SystemStatus` and `LogReader` caps in daily profiles, not global supervisor authority.
- User Identity and Policy: `AuthorityBroker` mints scoped readers and leased supervisor caps based on session policy; `AuditLog` records the decisions.
- Error Handling: transport errors and `CapException` payloads are monitoring signals, but retry policy remains userspace.
- Authority Accounting: resource ledgers provide the first metrics substrate and define quota/backpressure boundaries.
- Security and Verification: hostile-input tests should cover log flood aggregation and bounded diagnostic paths.
- Live Upgrade: health, audit, and service status become prerequisites for credible upgrade orchestration.
Implementation Plan
- Document the model. Keep monitoring as a future architecture proposal and do not disturb the current manifest-executor milestone.
- Ring as Black Box. Implement bounded CQE/SQE capture, host-side decoding, and one failing-call smoke. This is the first useful monitoring artifact.
- Userspace log service. Add `LogSink` and `LogReader` schemas, start a log service from init, forward selected records to Console, and enforce `logLevel`, record size, and drop summaries.
- Narrow kernel stats caps and SystemStatus. Add the narrow read-only caps (`SchedStats`, `FrameStats`, `RingStats`, `CapTableStats`, `EndpointStats`, `CrashSnapshot`) as bounded snapshot surfaces. A userspace `SystemStatus` service composes the ones it needs and exposes scoped wrappers to shells and operator tools. Leave `ProcessInspector` out of this step; it belongs with process-management authority, not monitoring.
- Metrics snapshots. Add fixed counters and gauges for ring, scheduler, resource, log, and trace state. Keep labels static until a cardinality policy exists.
- Health and supervisor status. Add `Health` and read-only supervisor status once restart policy and exported service caps are concrete. Keep restart authority in separate `ServiceSupervisor` caps.
- Audit path. Add append-only audit records for broker decisions, cap grants, releases, revocations, restarts, and recovery actions. Start serial or memory backed; move to storage once the storage substrate exists.
- Crash records. Preserve bounded panic/fault metadata across the current boot where possible; later store records durably.
- Device, network, and storage metrics. Add driver metrics only after those drivers exist: interrupts, DMA/bounce usage, queue depth, RX/TX/drop/error counts, block latency, and reset events.
Non-Goals
- No global `/proc` or `/sys` equivalent with ambient read access.
- No kernel-resident dashboard, alert manager, text search, or policy engine.
- No programmable kernel tracing language in the first monitoring design.
- No promise of durable log retention before storage exists.
- No default payload tracing.
- No service restart authority bundled into ordinary read-only status caps.
- No network export path until networking and policy can constrain it.
Open Questions
- Should `KernelDiagnostics` expose snapshots only, or also a bounded event cursor?
- What is the minimum timestamp model before wall-clock time exists?
- Should log records carry local cap IDs, stable object IDs, or only interface and service metadata by default?
- How should schema-aware trace decoding find schemas before a full `SchemaRegistry` exists?
- Which crash fields are safe to expose to non-recovery sessions?
- What retention policy is acceptable before persistent storage?
- Should `MetricsReader` use typed structs for each subsystem instead of generic name/value samples?
- Where should remote monitoring export fit once network transparency exists: a dedicated exporter service, capnp-rpc forwarding, or storage replication?
Proposal: User Identity, Sessions, and Policy
How capOS should represent human users, service identities, guests, anonymous callers, and policy systems without reintroducing Unix-style ambient authority.
Problem
capOS has processes, address spaces, capability tables, object identities,
badges, quotas, and transfer rules. It deliberately does not have global
paths, ambient file descriptors, a privileged root bit, or Unix uid/gid
authorization in the kernel.
Interactive operation still needs a way to answer practical questions:
- Who is using this shell session?
- Which caps should a normal daily session receive?
- How does a service distinguish Alice, Bob, a service account, a guest, and an anonymous network caller?
- How do RBAC, ABAC, and mandatory policy fit a capability system?
- How does POSIX compatibility expose users without letting `uid` become authority?
The answer should keep the enforcement model simple: capabilities are the authority. Identity and policy decide which capabilities get minted, granted, attenuated, leased, revoked, and audited.
Design Principles
- `user` is not a kernel primitive. `uid`, `gid`, role, and label values do not authorize kernel operations.
- A process is authorized only by capabilities in its table.
- Authentication proves or selects a principal; it does not itself grant authority.
- A session is a live policy context that receives a cap bundle.
- A workload is a process or supervision subtree launched with explicit caps.
- POSIX user concepts are compatibility metadata over scoped caps.
- Guest and anonymous access are explicit policy profiles, not missing policy.
Concepts
Principal
A principal is a durable or deliberately ephemeral identity known to auth and policy services. It is useful for policy decisions, ownership metadata, audit records, and user-facing display. It is not a kernel subject.
Examples:
- human account
- operator account
- service account
- cloud instance or deployment identity
- guest profile
- anonymous caller
- pseudonymous key-bound identity
enum PrincipalKind {
human @0;
operator @1;
service @2;
guest @3;
anonymous @4;
pseudonymous @5;
}
struct PrincipalInfo {
id @0 :Data; # Stable opaque ID, or random ephemeral ID.
kind @1 :PrincipalKind;
displayName @2 :Text;
}
PrincipalInfo is intentionally descriptive. Possessing a serialized
PrincipalInfo value must not grant authority.
Session
A session is a live context derived from a principal plus authentication and policy state. Sessions carry freshness, expiry, auth strength, quota profile, audit identity, and default cap-bundle selection.
enum AuthStrength {
none @0;
localPresence @1;
password @2;
hardwareKey @3;
multiParty @4;
}
struct SessionInfo {
sessionId @0 :Data;
principal @1 :PrincipalInfo;
authStrength @2 :AuthStrength;
createdAtMs @3 :UInt64;
expiresAtMs @4 :UInt64;
profile @5 :Text;
}
interface UserSession {
info @0 () -> (info :SessionInfo);
defaultCaps @1 (profile :Text) -> (caps :List(GrantedCap));
auditContext @2 () -> (sessionId :Data, principalId :Data);
logout @3 () -> ();
}
interface SessionManager {
login @0 (method :Text, proof :Data) -> (session :UserSession);
guest @1 () -> (session :UserSession);
anonymous @2 (purpose :Text) -> (session :UserSession);
}
GrantedCap should be the same transport-level result-cap concept used by
ProcessSpawner, not a parallel authority encoding.
Workload
A workload is a process or supervision subtree started from a session, service, or supervisor. Workloads may carry session metadata for audit and policy, but they do not run “as” a user in the Unix sense. They run with a CapSet.
Common workload shapes:
- interactive native shell
- agent shell
- POSIX shell compatibility session
- user-facing application
- per-user service instance
- shared service handling many user sessions
- service account process
Capability
A capability remains the actual authority. A process can only use what is in its local capability table. Policy services can choose to mint, attenuate, lease, transfer, or revoke capabilities, but they do not create a second authorization channel.
Session Startup Flow
flowchart TD
Input[Login, guest, or anonymous request]
Auth[Authentication or guest policy]
Session[UserSession cap]
Broker[AuthorityBroker / PolicyEngine]
Bundle[Scoped cap bundle]
Shell[Native, agent, or POSIX shell]
Audit[AuditLog]
Input --> Auth
Auth --> Session
Session --> Broker
Broker --> Bundle
Bundle --> Shell
Broker --> Audit
Shell --> Audit
The shell proposal’s minimal daily cap set is a session bundle:
- `terminal`: `TerminalSession` or `Console`
- `self`: self/session introspection
- `status`: read-only `SystemStatus`
- `logs`: read-only `LogReader` scoped to this user/session
- `home`: `Directory` or `Namespace` scoped to user data
- `launcher`: restricted launcher for approved user applications
- `approval`: `ApprovalClient`
The shell still cannot mint additional authority. It can ask
ApprovalClient for a plan-specific grant, and a trusted broker can return
a narrow leased capability if policy and authentication allow it.
Multi-User Workloads
capOS should support two normal multi-user patterns.
Per-Session Subtree
The session owns a shell or supervisor subtree. Every child process receives an explicit CapSet assembled from the session bundle plus workflow-specific grants.
Example:
- Alice’s shell receives `home = Namespace("/users/alice")`.
- Bob’s shell receives `home = Namespace("/users/bob")`.
- The same editor binary launched from each shell receives different `home` and `terminal` caps.
- The editor cannot cross from Alice’s namespace into Bob’s unless a broker deliberately grants a sharing cap.
This is the right default for interactive applications and POSIX shells.
Shared Service With Per-Client Session Authority
A server process may handle many users in one address space. It should not infer authority from a caller’s self-reported user name. Instead, each client connection carries one or more of:
- a badge on the endpoint cap identifying the client/session relation,
- a `UserSession` or `SessionContext` cap,
- a scoped object cap such as `Directory`, `Namespace`, `LogWriter`, or `ApprovalClient`,
- a quota donation for server-side state.
The service uses these values to select scoped storage, enforce per-client limits, emit audit records, and return narrowed caps. This is the right shape for HTTP services, databases, log services, terminals, and shared daemons.
Service Accounts
Service identities are principals too. They are usually non-interactive and receive caps from init, a supervisor, or a deployment manifest rather than from a human login flow.
Service-account policy should be explicit:
- which binary or measured package may use the identity,
- which supervisor may spawn it,
- which caps are in its base bundle,
- which caps it may request from a broker,
- which audit stream records its activity.
Anonymous, Guest, and Pseudonymous Access
These are distinct profiles.
Empty Cap Set
An untrusted ELF with an empty CapSet is not a user session. It is the roadmap’s “Unprivileged Stranger”: code with no useful authority. It can terminate itself and interact with the capability transport, but it cannot reach a resource because it has no caps.
Anonymous
Anonymous means unauthenticated and usually remote or programmatic. It should receive a random ephemeral principal ID and a very small cap bundle.
Typical properties:
- no durable home namespace by default,
- strict CPU, memory, outstanding-call, and log quotas,
- short session expiry,
- no elevation path except “authenticate” or “create account”,
- audit records keyed by ephemeral session ID and network/service context.
Guest
Guest means an interactive local profile with weak or no authentication.
Typical properties:
- terminal/UI access,
- temporary namespace,
- optional ephemeral home reset on logout,
- restricted launcher,
- no administrative approval path unless policy grants one explicitly,
- clearer user-facing affordance than anonymous.
Pseudonymous
Pseudonymous means durable identity without necessarily naming a human. A public key, passkey, service token, or cloud identity can select the same principal across sessions. This can receive persistent storage and quotas while still remaining separate from a verified human account.
POSIX Compatibility
POSIX user concepts are compatibility metadata, not authority.
- `uid`, `gid`, user names, groups, `$HOME`, `/etc/passwd`, `chmod`, and `chown` live in `libcapos-posix`, a filesystem service, or a profile service.
- `open("/home/alice/file")` succeeds only if the process has a `Directory` or `Namespace` cap that resolves that synthetic path.
- `setuid` cannot grant new caps. At most it asks a compatibility broker to replace the process’s POSIX profile or launch a new process with a different cap bundle.
- POSIX ownership bits may influence one filesystem service’s policy, but they cannot authorize access to caps outside that service.
This lets existing programs inspect plausible user metadata without making Unix permission bits the capOS security model.
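The adapter pattern above can be sketched in a few lines. This is a hedged illustration, not capOS code: the `SessionProfile` type and `passwd_line` helper are hypothetical names for metadata a compatibility layer might synthesize from a session profile.

```python
# Hypothetical sketch: synthesize POSIX-compatible metadata from a session
# profile. None of these fields authorize anything; they are display data.
from dataclasses import dataclass

@dataclass
class SessionProfile:
    user_name: str
    uid: int
    gid: int
    home: str          # synthetic path, resolved only via a granted Namespace cap
    shell: str

def passwd_line(p: SessionProfile) -> str:
    """Render one /etc/passwd-style line for POSIX programs to inspect."""
    return f"{p.user_name}:x:{p.uid}:{p.gid}::{p.home}:{p.shell}"

alice = SessionProfile("alice", 1000, 1000, "/home/alice", "/bin/sh")
assert passwd_line(alice) == "alice:x:1000:1000::/home/alice:/bin/sh"
```

The point of the sketch is the direction of dependency: the passwd view is derived from the session profile, never consulted for authorization.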
Policy Models
RBAC, ABAC, and mandatory access control fit capOS as grant-time and
mint-time policy. They should mostly live in ordinary userspace services:
AuthorityBroker, PolicyEngine, SessionManager, RoleDirectory,
LabelAuthority, AuditLog, and service-specific attenuators.
The kernel should keep enforcing capability ownership, generation, transfer rules, revocation epochs, resource ledgers, and process isolation. It should not evaluate roles, attributes, or label lattices on every capability call.
RBAC
Role-based access control maps principals or sessions to named role sets. Roles select cap bundles and approval eligibility.
Examples:
- `developer` can receive a launcher for development tools and read-only service logs.
- `net-operator` can request a leased `ServiceSupervisor(net-stack)`.
- `storage-admin` can request repair caps for selected storage volumes.
Implementation shape:
interface RoleDirectory {
rolesFor @0 (principal :Data) -> (roles :List(Text));
}
interface AuthorityBroker {
request @0 (
session :UserSession,
plan :ActionPlan,
requestedCaps :List(CapRequest),
durationMs :UInt64
) -> (grant :ApprovalGrant);
}
Roles do not bypass capabilities. They only let a broker decide whether it may mint or return particular scoped caps.
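A minimal sketch of this grant-time role mapping, with illustrative role and bundle names (not capOS identifiers): roles compute eligibility, and the broker still has to mint the actual caps.

```python
# Illustrative RBAC-as-grant-time-policy: roles select which cap bundles a
# broker MAY mint. Eligibility is not authority.
ROLE_BUNDLES = {
    "developer":     ["launcher:dev-tools", "logs:read-only"],
    "net-operator":  ["lease:ServiceSupervisor(net-stack)"],
    "storage-admin": ["repair:volume"],
}

def eligible_caps(roles: list[str]) -> set[str]:
    """Union of bundles the broker may mint for these roles."""
    caps: set[str] = set()
    for role in roles:
        caps.update(ROLE_BUNDLES.get(role, []))
    return caps

assert "logs:read-only" in eligible_caps(["developer"])
assert eligible_caps(["guest"]) == set()  # unknown role grants nothing
```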
ABAC
Attribute-based access control evaluates a richer decision context:
- subject attributes: principal kind, roles, auth strength, session age, device posture, locality,
- action attributes: requested method, target service, destructive flag, requested duration,
- object attributes: service name, namespace prefix, data class, owner principal, sensitivity,
- environment attributes: time, boot mode, recovery mode, network location, cloud instance metadata, quorum state.
ABAC is useful for contextual narrowing:
- allow log read only for the caller’s session unless break-glass policy is active,
- issue `ServiceSupervisor(net-stack)` only with fresh hardware-key auth,
- grant `Namespace("/shared/project")` read-write only during a maintenance window,
- deny network caps to guest sessions.
ABAC decisions should return capabilities, wrappers, or denials. They should not create hidden ambient checks downstream.
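One of the narrowing rules above, written out as a decision function. This is a hedged sketch: the field names and the five-minute freshness window are assumptions, and a real broker would consume a `PolicyRequest` rather than bare arguments.

```python
# ABAC narrowing sketch: lease a net-stack supervisor only with fresh
# hardware-key authentication, and never for guest sessions.
def decide_net_supervisor(auth_strength: str, session_age_ms: int,
                          is_guest: bool) -> tuple[bool, str]:
    if is_guest:
        return (False, "guest sessions receive no network authority")
    if auth_strength != "hardwareKey":
        return (False, "requires hardware-key authentication")
    if session_age_ms > 5 * 60 * 1000:   # freshness window is an assumption
        return (False, "authentication not fresh enough")
    return (True, "lease ServiceSupervisor(net-stack)")

assert decide_net_supervisor("hardwareKey", 60_000, False)[0]
assert not decide_net_supervisor("password", 60_000, False)[0]
assert not decide_net_supervisor("hardwareKey", 600_000, False)[0]
```

Note that a positive decision is still only input to a broker; the leased cap itself is minted elsewhere.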
ABAC Policy Engine Choices
Do not invent a policy language first. The capOS-native interface should be small and capability-shaped, while the broker implementation can start with a mainstream engine behind that interface.
Recommended order:
1. Cedar for runtime authorization. Cedar’s request shape is already close to capOS: `principal`, `action`, `resource`, and `context`. It supports RBAC and ABAC in one policy set, has schema validation, and has a Rust implementation. That makes it the best fit for `AuthorityBroker` and `MacBroker` service prototypes.
2. OPA/Rego for host-side and deployment policy. OPA is widely used for cloud, Kubernetes, infrastructure-as-code, and admission-control style checks. It is useful for validating manifests, cloud metadata deltas, package/deployment policies, and CI rules. The Wasm compilation path is worth tracking for later capOS-side execution, but OPA should not be the first low-level runtime dependency.
3. Casbin for quick prototypes only. Casbin is useful for simple RBAC/ABAC experiments and has Rust bindings, but its model/matcher style is less attractive as a long-term capOS policy substrate than Cedar’s schema-validated authorization model.
4. XACML for interoperability and compliance, not native policy. XACML remains the classic enterprise ABAC standard. It is useful as a conceptual reference or import/export target, but it is too heavy and XML-centric to be the native capOS policy language.
The capOS service boundary should hide the selected engine:
interface PolicyEngine {
decide @0 (request :PolicyRequest) -> (decision :PolicyDecision);
}
struct PolicyRequest {
principal @0 :PrincipalInfo;
action @1 :Text;
resource @2 :ResourceRef;
context @3 :List(Attribute);
}
struct PolicyDecision {
allowed @0 :Bool;
reason @1 :Text;
leaseMs @2 :UInt64;
constraints @3 :List(Attribute);
}
PolicyDecision is still not authority. It is input to a broker that returns
actual caps, wrapper caps, leased caps, or denial.
References:
- Cedar policy language docs: https://docs.cedarpolicy.com/
- Amazon Verified Permissions concepts: https://docs.aws.amazon.com/verifiedpermissions/latest/userguide/terminology.html
- Open Policy Agent docs: https://www.openpolicyagent.org/docs
- Casbin supported models: https://www.casbin.org/docs/supported-models
- OASIS XACML technical committee: https://www.oasis-open.org/committees/xacml/
Mandatory Access Control
Mandatory access control is non-bypassable policy set by the system owner or deployment, not discretionary sharing by ordinary users. In capOS, MAC should be implemented as mandatory constraints on cap minting, attenuation, transfer, and service wrappers.
Examples:
- a `Secret` cap labeled `high` cannot be transferred to a workload labeled `low`,
- a `LogReader` for security logs cannot be granted to a guest session even if an application asks,
- a recovery shell can inspect storage read-only but cannot write without a separate exact-target repair cap,
- cloud user-data can add application services but cannot grant `FrameAllocator`, `DeviceManager`, or raw networking authority.
Implementation components:
enum Sensitivity {
public @0;
internal @1;
confidential @2;
secret @3;
}
struct SecurityLabel {
domain @0 :Text;
sensitivity @1 :Sensitivity;
compartments @2 :List(Text);
}
interface LabelAuthority {
labelOfPrincipal @0 (principal :Data) -> (label :SecurityLabel);
labelOfObject @1 (object :Data) -> (label :SecurityLabel);
canTransfer @2 (
from :SecurityLabel,
to :SecurityLabel,
capInterface :UInt64
) -> (allowed :Bool, reason :Text);
}
For ordinary services, MAC can be enforced by brokers and wrapper caps. For high-assurance boundaries, the remaining question is whether transfer labels need kernel-visible hold-edge metadata. That should be added only for a concrete mandatory policy that cannot be enforced by controlling all grant paths through trusted services.
GOST-Style MAC and MIC
Russian GOST framing is stricter than the generic “MAC means labels” summary. The relevant standards split at least two policies that capOS should keep separate:
- Mandatory access control for confidentiality. ГОСТ Р 59383-2021 describes mandatory access control as classification labels on resources and clearances for subjects. ГОСТ Р 59453.1-2021 goes further: a formal model that includes users, subjects, objects, containers, access levels, confidentiality levels, subject-control relations, and information flows. The safety goal is preventing unauthorized flow from an object at a higher or incomparable confidentiality level to a lower one.
- Mandatory integrity control for integrity. ГОСТ Р 59453.1-2021 treats this separately from confidentiality MAC. The integrity model constrains subject integrity levels, object/container integrity levels, subject-control relationships, and information flows so lower-integrity subjects cannot control or corrupt higher-integrity subjects.
For capOS, this should map to labels on sessions, objects, wrapper caps, and eventually hold edges:
struct ConfidentialityLabel {
level @0 :Text; # e.g. public, internal, secret.
compartments @1 :List(Text);
}
struct IntegrityLabel {
level @0 :Text; # ordered by deployment policy.
domains @1 :List(Text);
}
struct MandatoryLabel {
confidentiality @0 :ConfidentialityLabel;
integrity @1 :IntegrityLabel;
}
Capability methods need a declared flow class. capOS cannot rely on generic
`read` and `write` syscalls:
- read-like: `File.read`, `Secret.read`, `LogReader.read`;
- write-like: `File.write`, `Namespace.bind`, `ManifestUpdater.apply`;
- control-like: `ProcessSpawner.spawn`, `ServiceSupervisor.restart`;
- transfer-like: `CAP_OP_CALL`, `CAP_OP_RETURN`, and result-cap insertion when they carry caps or data across labeled domains.
Initial rules can be expressed as broker/wrapper checks:
read data-bearing cap:
subject.clearance dominates object.classification
write data-bearing cap:
target.classification dominates source.classification
# no write down
control process or supervisor:
controlling subject is same label, or is an explicitly trusted subject
integrity write/control:
writer.integrity >= target.integrity
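The broker-side checks above can be sketched as dominance tests over a label lattice. This is an illustration under assumptions: a totally ordered level set plus compartment subsets, with example level names, not the capOS `LabelAuthority` implementation.

```python
# Sketch of label dominance for the broker/wrapper rules above.
LEVELS = {"public": 0, "internal": 1, "secret": 2}

def dominates(a_level, a_comps, b_level, b_comps):
    """a dominates b: higher-or-equal level and a superset of compartments."""
    return LEVELS[a_level] >= LEVELS[b_level] and set(a_comps) >= set(b_comps)

def may_read(subject, obj):
    # read data-bearing cap: subject clearance dominates object classification
    return dominates(subject["conf"], subject["comps"], obj["conf"], obj["comps"])

def may_write(source, target):
    # write data-bearing cap: no write down; target dominates source
    return dominates(target["conf"], target["comps"], source["conf"], source["comps"])

def may_control(writer_integrity, target_integrity, order=("low", "high")):
    # integrity write/control: writer integrity >= target integrity
    return order.index(writer_integrity) >= order.index(target_integrity)

hi = {"conf": "secret", "comps": ["net"]}
lo = {"conf": "public", "comps": []}
assert may_read(hi, lo) and not may_read(lo, hi)
assert may_write(lo, hi) and not may_write(hi, lo)   # data flows up, not down
assert not may_control("low", "high")
```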
This is not enough for a GOST-style formal claim, because uncontrolled cap transfer can bypass the broker. A higher-assurance design needs:
- kernel object identity for every labeled object,
- label metadata on kernel objects or per-process hold edges,
- transfer-time checks for copy, move, result caps, and endpoint delivery,
- explicit trusted-subject/declassifier caps,
- an audit trail for every label-changing or declassifying operation,
- a formal state model covering users, subjects, objects, containers, access rights, accesses, and memory/time information flows.
The proposal therefore has two levels:
- Pragmatic capOS MAC/MIC: userspace brokers and wrapper caps enforce labels on grants and method calls.
- GOST-style MAC/MIC: a formal information-flow model plus kernel-visible labels/hold-edge constraints for transfers that cannot be forced through trusted wrappers. See formal-mac-mic-proposal.md for the dedicated abstract-automaton and proof track.
References:
- ГОСТ Р 59383-2021, access-control foundations: https://lepton.ru/GOST/Data/752/75200.pdf
- ГОСТ Р 59453.1-2021, formal access-control model: https://meganorm.ru/Data/750/75046.pdf
Composition Order
When policies compose, use this order:
1. Mandatory policy defines the maximum possible authority.
2. RBAC selects coarse eligibility and default bundles.
3. ABAC narrows the decision for context, freshness, object attributes, and requested duration.
4. The broker returns specific capabilities or denies the request.
5. Audit records the plan, decision, grant, use, release, and revocation.
The composition result is still a CapSet, leased cap, wrapper cap, or denial.
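The ordering can be sketched as set operations on requested cap names. All names here are illustrative; a real broker operates on typed cap requests, not strings.

```python
# Illustrative composition pipeline: mandatory ceiling, then RBAC selection,
# then ABAC narrowing, ending in a concrete cap set or nothing.
def compose(requested: set[str], mac_ceiling: set[str],
            rbac_bundle: set[str], abac_denied: set[str]) -> set[str]:
    granted = requested & mac_ceiling        # 1. mandatory policy is the maximum
    granted &= rbac_bundle                   # 2. RBAC selects coarse eligibility
    granted -= abac_denied                   # 3. ABAC narrows for context
    return granted                           # 4. broker mints/returns these caps

out = compose(
    requested={"home", "logs", "net-admin"},
    mac_ceiling={"home", "logs"},            # net-admin exceeds mandatory policy
    rbac_bundle={"home", "logs"},
    abac_denied={"logs"},                    # e.g. session not fresh enough
)
assert out == {"home"}
```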
Service Architecture
The policy stack should be decomposed into ordinary capOS services. Init or a trusted supervisor grants broad authority only to the small services that need to mint narrower caps.
SessionManager
Creates session metadata caps:
- `guest()` for local guest sessions,
- `anonymous(purpose)` for ephemeral unauthenticated callers,
- later `login(method, proof)` for authenticated users.
The first implementation can be boot-config backed. It does not need a
persistent account database. UserSession should describe the principal,
session ID, profile, auth strength, expiry, and audit context. It should not
be a general-purpose authority vending machine unless it was itself minted as
a narrow wrapper around a fixed cap bundle.
Safer first split:
SessionManager -> UserSession metadata cap
AuthorityBroker(session, profile) -> base cap bundle
Supervisor/Launcher -> spawn shell with that bundle
AuthorityBroker
The broker owns or receives powerful caps from init/supervisors and returns narrow caps after RBAC, ABAC, and mandatory checks.
Examples:
- broad `ProcessSpawner` -> `RestrictedLauncher(allowed = ["shell", "editor"])`,
- broad `NamespaceRoot` -> `Namespace("/users/alice")`,
- broad `ServiceSupervisor` -> `LeasedSupervisor("net-stack", expires = 60s)`,
- broad `BootPackage` -> `BinaryProvider(allowed = ["shell", "editor"])`.
The broker is the normal policy decision and cap minting point.
AuditLog
Append-only audit interface. Initially this can write to serial or a bounded log buffer; later it should be Store-backed.
Record at least:
- session creation,
- cap request,
- policy input summary,
- policy decision,
- cap grant,
- cap release or revocation,
- denial,
- declassification or relabel operation.
Audit entries must not contain raw auth proofs, private keys, bearer tokens, or broad environment dumps.
RoleDirectory
Role lookup should start static and boot-config backed:
guest -> guest-shell
alice -> developer
ops -> net-operator
net-stack -> service:network
This is enough for early RBAC bundles. Dynamic role assignment can wait for persistent storage and administrative tooling.
LabelAuthority
Owns the label lattice and dominance checks. In the pragmatic phase, it is a userspace dependency of brokers and wrappers. In a GOST-style phase, the same lattice needs a kernel-visible representation for transfer checks.
Wrapper Caps
Wrappers are the main mechanism. Prefer them over per-call ACL checks in a central service:
- `RestrictedLauncher` wraps `ProcessSpawner`.
- `ScopedNamespace` wraps a broader namespace/store.
- `ScopedLogReader` filters by session ID or service subtree.
- `LeasedSupervisor` wraps a broader supervisor with expiry and target binding.
- `ApplicationManifestUpdater` rejects kernel/device/service-manager grants.
- `LabelledEndpoint` enforces declared data-flow and control-flow constraints.
This keeps authority visible in the capability graph.
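The wrapper-cap pattern is short enough to sketch. This is a hedged illustration: `Spawner` stands in for a real `ProcessSpawner` capability, and the wrapper holds the broad cap while callers hold only the wrapper.

```python
# Attenuating wrapper sketch: RestrictedLauncher exposes an allow-listed
# subset of a broad spawner's authority.
class Spawner:
    def spawn(self, binary: str) -> str:
        return f"spawned {binary}"

class RestrictedLauncher:
    def __init__(self, spawner: Spawner, allowed: set[str]):
        self._spawner = spawner      # broad cap, held privately
        self._allowed = allowed

    def spawn(self, binary: str) -> str:
        if binary not in self._allowed:
            raise PermissionError(f"{binary} not in allow-list")
        return self._spawner.spawn(binary)

launcher = RestrictedLauncher(Spawner(), allowed={"shell", "editor"})
assert launcher.spawn("editor") == "spawned editor"
try:
    launcher.spawn("net-stack")
    assert False, "should have been denied"
except PermissionError:
    pass
```

Because the only path to the broad spawner runs through the wrapper, the attenuation is visible in the capability graph rather than hidden in a per-call ACL check.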
Bootstrap Sequence
Early boot can be static:
init
-> starts AuditLog
-> starts SessionManager
-> starts AuthorityBroker with broad caps
-> asks broker for a system, guest, or operator shell bundle
-> spawns shell through a restricted launcher
Before durable storage exists, policy config comes from BootPackage /
manifest config. Before real authentication exists, support guest,
anonymous, and localPresence only.
Revocation, Audit, and Quotas
User/session policy depends on the Stage 6 authority graph work:
- badge metadata lets shared services distinguish session/client relations,
- resource ledgers and session quotas prevent denial-of-service through session creation,
- `CAP_OP_RELEASE` and process-exit cleanup reclaim local hold edges,
- epoch revocation lets a broker invalidate leased or compromised caps,
- audit logs record the cap grant and release lifecycle.
Audit should record identity and policy metadata, but it should not contain secrets, raw authentication proofs, or broad environment dumps.
Implementation Plan
1. Document the model. Keep user identity out of the kernel architecture and link this proposal from the shell, service, storage, and roadmap docs.
2. Session-aware native shell profile. Treat the shell proposal’s minimal daily cap set as a session bundle. Add `self`/`session` introspection and scoped `logs`/`home` caps once the underlying services exist.
3. Authority broker and audit log. Add `ActionPlan`, `CapRequest`, `ApprovalClient`, leased grant records, and an append-only audit path. Start with RBAC-style profile bundles and explicit local authentication.
4. ABAC policy engine. Extend the broker decision with session freshness, auth strength, object attributes, requested duration, and environment state. Prefer Cedar for the runtime broker interface; use OPA/Rego for host-side manifest and deployment checks. Keep decisions visible in audit records.
5. Mandatory policy labels. Add pragmatic labels to policy-managed services and wrappers. Keep confidentiality and integrity separate. Defer kernel-visible labels until a specific MAC/MIC policy cannot be enforced by trusted grant paths.
6. Guest and anonymous demos. Show a guest shell with `terminal`, `tmp`, and restricted `launcher`, and show an anonymous workload with strict quotas and no persistent storage.
7. POSIX profile adapter. Provide synthetic `uid`/`gid`, `$HOME`, `/etc/passwd`, and cwd behavior from a session profile and granted namespace caps.
8. GOST-style formalization checkpoint. If capOS later claims GOST-style MAC/MIC, write the abstract state model before implementation: users, subjects, objects, containers, access rights, accesses, labels, control relations, and information flows. Then decide which labels must become kernel-visible.
Non-Goals
- No kernel `uid`/`gid`.
- No ambient `root`.
- No global login namespace in the kernel.
- No authorization from serialized identity structs.
- No model-visible authentication secrets.
- No POSIX permission bits as system-wide authority.
- No per-call role/attribute/label interpreter in the kernel fast path.
- No claim of GOST-style MAC/MIC until the formal model and transfer enforcement story exist.
Open Questions
- Which session interfaces are needed before persistent storage exists?
- Should `UserSession.defaultCaps()` return actual caps or a plan that must be executed by `ProcessSpawner`?
- Which audit store is acceptable before durable storage and replay exist?
- Which MAC policies, if any, justify kernel-visible hold-edge labels?
- How should remote capnp-rpc identities map into local principals?
- Should the first broker prototype embed Cedar directly, or use a simpler hand-written evaluator until the policy surface stabilizes?
Proposal: Cloud Instance Bootstrap
Picking up instance-specific configuration — SSH keys, hostname, network config, user-supplied payload — from cloud provider metadata sources, without porting the Canonical cloud-init stack.
Problem
A capOS ISO built once has to boot on any cloud VM and adapt to its environment: different instance IDs, different public IPs, different operator-supplied SSH keys, different user-data payloads. Without this, every instance needs a custom-baked ISO — and the content-addressed-boot story (“same hash boots identically on N machines”) devalues itself at the point where it would actually matter for operations.
The Linux convention is `cloud-init`: a Python daemon that reads
metadata from provider-specific sources and applies it by writing
files under `/etc`, invoking `systemctl`, creating users, and running
shell scripts. Porting it is a non-starter:
- Python, POSIX, systemd-dependent.
- Runs as root with ambient authority: parses untrusted user-data as shell scripts, mutates arbitrary system state.
- ~100k lines covering hundreds of rarely-used modules (chef, puppet, seed_random, phone_home).
- Assumes a package manager and init system that do not exist on capOS.
capOS needs the pattern — consume provider metadata, use it to bootstrap the instance — reshaped to the capability model.
Metadata Sources
All major clouds expose instance metadata through one or more of:
- HTTP IMDS.
169.254.169.254. AWS IMDSv2 requires aPUTtoken-exchange handshake; GCP and Azure accept directGET. Paths differ per provider. Needs a running network stack. - ConfigDrive. An ISO9660 filesystem attached as a block device,
containing
meta_data.json(or equivalent) and optional user-data file. OpenStack, older Azure. Needs a block driver and filesystem reader, no network. - SMBIOS / DMI. Vendor, product, serial-number, UUID fields populated by the hypervisor. Good for provider detection before networking comes up.
- NoCloud. Seed files baked into the image or on an attached FAT disk. Useful for development and bare-metal.
The bootstrap service should read from whichever source is present rather than hardcoding one. Provider detection via SMBIOS runs first (no dependencies), then the appropriate transport is initialized.
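The detection step can be sketched as string matching on SMBIOS fields. The manufacturer strings follow the examples given later in this proposal; real DMI contents vary by hypervisor, so treat the table as an assumption.

```python
# Hedged sketch of provider detection from SMBIOS manufacturer/product
# strings, run before any transport is initialized.
def detect_provider(manufacturer: str, product: str = "") -> str:
    m = manufacturer.lower()
    if "google" in m:
        return "gcp"
    if "amazon" in m or "ec2" in product.lower():
        return "aws"
    if "microsoft" in m:
        return "azure"
    if "qemu" in m or "qemu" in product.lower():
        return "nocloud"      # development default: seed file, no IMDS
    return "unknown"

assert detect_provider("Google") == "gcp"
assert detect_provider("Amazon EC2") == "aws"
assert detect_provider("Microsoft Corporation") == "azure"
assert detect_provider("QEMU") == "nocloud"
```

The result selects which `CloudMetadata` implementation the probe service grants; an `unknown` result would fall back to NoCloud or skip bootstrap entirely.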
CloudMetadata Capability
A single capnp interface; one or more implementations:
interface CloudMetadata {
# Instance identity
instanceId @0 () -> (id :Text);
instanceType @1 () -> (type :Text);
hostname @2 () -> (name :Text);
region @3 () -> (region :Text);
# Network configuration (primary interface addresses, gateway, DNS)
networkConfig @4 () -> (config :NetworkConfig);
# Authentication material
sshKeys @5 () -> (keys :List(Text));
# User-supplied payload. Opaque to the metadata provider.
userData @6 () -> (data :Data, contentType :Text);
# Vendor-supplied payload. Separate from userData so the
# bootstrap policy can trust them differently.
vendorData @7 () -> (data :Data, contentType :Text);
}
struct NetworkConfig {
interfaces @0 :List(Interface);
struct Interface {
macAddress @0 :Text;
ipv4 @1 :List(IpAddress);
ipv6 @2 :List(IpAddress);
gateway @3 :Text;
dnsServers @4 :List(Text);
mtu @5 :UInt16;
}
}
Implementations:
- `HttpMetadata` — fetches from `169.254.169.254`; one variant per provider because paths and auth handshakes differ (AWS IMDSv2 token, GCP `Metadata-Flavor: Google`, Azure API version).
- `ConfigDriveMetadata` — reads an ISO9660 seed disk.
- `NoCloudMetadata` — reads a seed blob from the initial manifest.
Detection lives in a small probe service that inspects SMBIOS
(System Manufacturer: Google, Amazon EC2, Microsoft Corporation,
…) and grants the cloud-bootstrap service the appropriate
CloudMetadata implementation as part of a manifest delta.
Bootstrap Service
A single service — cloud-bootstrap — runs once per boot:
cloud-bootstrap:
caps:
- metadata: CloudMetadata # from probe service
- manifest: ManifestUpdater # narrow authority to extend the graph
- network: NetworkConfigurator # apply interface addresses
- ssh_keys: KeyStore # target store for authorized keys
user_data_handlers:
- application/x-capos-manifest: ManifestDeltaHandler
# operator-installed handlers for other content types
Sequence:
1. Gather identity and declarative config (`instanceId`, `hostname`, `networkConfig`, `sshKeys`), apply through the narrow caps above.
2. `(data, ct) = metadata.userData()` — dispatch by content type. If no handler is registered, log and skip.
3. Exit.
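The dispatch step can be sketched as a lookup in a handler table that stands in for per-content-type handler caps. This is an illustration: handler behavior here is a placeholder, and the registered content type is the canonical one named later in this proposal.

```python
# Capability-mediated user-data dispatch sketch: only registered content
# types run; unknown types are logged and skipped, never executed.
def dispatch_user_data(handlers: dict, content_type: str, data: bytes) -> str:
    handler = handlers.get(content_type)
    if handler is None:
        return f"skipped: no handler for {content_type}"   # log and skip
    return handler(data)

handlers = {
    "application/x-capos-manifest": lambda d: f"applied delta ({len(d)} bytes)",
}
assert dispatch_user_data(handlers, "application/x-capos-manifest",
                          b"\x00" * 8) == "applied delta (8 bytes)"
assert dispatch_user_data(handlers, "text/x-shellscript",
                          b"#!/bin/sh").startswith("skipped")
```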
The service never holds ProcessSpawner directly. It holds
ManifestUpdater, a wrapper that accepts capnp-encoded
ManifestDelta messages and applies them through the existing init
spawn path. The decoder and apply path are shared with the build-time
pipeline (same capos-config crate, same spawn loop). The precise
shape of ManifestDelta is an open question — see “Open Questions”
below — but at minimum it covers hostname, network config, SSH keys,
and authorized application-level service additions:
struct ManifestDelta {
addServices @0 :List(ServiceEntry);
addBinaries @1 :List(NamedBlob);
setHostname @2 :Text;
setNetworkConfig @3 :NetworkConfig;
}
Relationship to the Build-Time Manifest Pipeline
The existing build-time pipeline (`system.cue` →
`tools/mkmanifest` → `manifest.bin` → Limine boot module →
`capos-config` decoder → init spawn loop) and the cloud-metadata
bootstrap path are not two parallel systems. They are the same
pipeline with different transports and different trust scopes.
| Stage | Build-time (baked ISO) | Runtime (cloud metadata) |
|---|---|---|
| Authoring | system.cue in the repo | user-data.cue on the operator’s host |
| Compile | mkmanifest (CUE → capnp) | same tool, same output |
| Transport | Limine boot module | HTTP IMDS / ConfigDrive / NoCloud disk |
| Wire format | capnp-encoded SystemManifest | capnp-encoded ManifestDelta |
| Decoder | capos-config | capos-config |
| Apply | init spawn loop | same spawn loop, invoked via ManifestUpdater |
Three practical consequences:
- CUE is a host-side authoring convenience, not an on-wire format. Neither kernel nor init evaluates CUE. An operator supplying user-data writes `user-data.cue`, runs `mkmanifest user-data.cue user-data.bin` on their host, and ships the capnp bytes (base64 into `--metadata user-data=@user-data.bin` for GCP/AWS, or as a file on a ConfigDrive ISO).
- NoCloud is a Limine boot module by another name. A NoCloud seed blob is the same bytes as a baked-in `manifest.bin`, attached via a disk or bundled into the ISO instead of handed over by the bootloader. The only difference is who hands the bytes to the parser.
- No new schema surface. `ManifestDelta` is defined alongside `SystemManifest` in `schema/capos.capnp`, and sharing the decoder means `ManifestUpdater`’s apply path is a thin merge-and-spawn on top of code that already boots the base system.
The trust model stays clean precisely because `ManifestDelta` is
not `SystemManifest`. The base manifest is inside the
content-addressed ISO hash (fully trusted, reproducible). The
runtime delta is applied by a narrowly-permitted service whose caps
define which fields of the delta can actually take effect — the
content-addressed-boot story is preserved because cloud metadata
augments the base graph, it cannot replace it.
User-Data Model
User-data on the wire is a capnp blob, not a shell script. Content
type `application/x-capos-manifest` identifies the canonical case:
the payload is a `ManifestDelta` message produced by `mkmanifest`
on the operator’s host and consumed directly by the bootstrap
service.
For cross-cloud-vendor compatibility, operators can install user-data dispatcher services for other content types (YAML, other capnp schemas, signed manifests, etc.). The bootstrap service holds a handler cap per content type; unknown types are logged and ignored, not executed.
Shell-script user-data — the Linux default — has nowhere to run on
capOS because there is no shell and no ambient-authority process to
execute it under. An operator who insists on this can install a
shell service and a handler that routes text/x-shellscript to it,
but that is a deliberate choice, not a default fallback.
Trust Model
The capability angle earns its keep here.
- The metadata endpoint is assumed as trustworthy as the hypervisor running the VM — the same assumption Linux cloud-init makes.
- The bootstrap service holds narrow caps (`ManifestUpdater`, `NetworkConfigurator`, `KeyStore`), not ambient root. A bug or a malicious metadata response can at most spawn services the `ManifestUpdater` accepts, set network config the `NetworkConfigurator` accepts, and drop keys into the `KeyStore`. It cannot reach for arbitrary system state.
- `vendorData` and `userData` are separated on the wire. A policy that trusts the cloud provider but not the operator (e.g., apply `vendorData` as-is, route `userData` through a signature check) is expressible by granting different handler caps to each.
- User-data content-type dispatch is capability-mediated: the bootstrap service cannot execute a content type it wasn’t given a handler for. There is no fallback “try to run it as shell.”
Phased Implementation
Most of the manifest-handling machinery already exists from the
build-time pipeline (capos-config, mkmanifest, init’s spawn
loop). The new work is transports, provider detection, and the
ManifestDelta merge semantics.
ManifestDeltaschema andManifestUpdatercap. Add the delta type toschema/capos.capnpalongsideSystemManifest, extendcapos-configwith a merge routine (SystemManifest + ManifestDelta → new services to spawn), and exposeManifestUpdateras a cap in init.NoCloudMetadataseeded from a test fixture is enough to demo the apply path end-to-end without any cloud dependency.- Provider detection via SMBIOS. Kernel-side primitive or capability that reads SMBIOS DMI tables and exposes manufacturer / product strings. No network required.
- ConfigDrive support. ISO9660 reader plus
ConfigDriveMetadata. Gives a working real-transport metadata source with no dependency on userspace networking. QEMU can attach one via-drive file=configdrive.iso,if=virtiofor local testing. - HttpMetadata per provider. Requires the userspace network stack (Stage 6+). GCP first (simplest auth), then AWS (IMDSv2 token flow), then Azure.
- Cross-provider Cloud Metadata demo. Same ISO hash boots under
QEMU, GCP, AWS, and Azure; the only difference is the SMBIOS
manufacturer string, which the probe service uses to pick the
right HttpMetadata variant. This is the Cloud Metadata observable milestone.
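The probe's dispatch on SMBIOS strings can be sketched as a pure lookup. The manufacturer/product values below are the commonly observed DMI strings for each provider; treat them as assumptions to verify per cloud, not a definitive table.

```rust
// Sketch: map SMBIOS DMI strings to a metadata-transport choice.
// The match arms are assumed DMI values, to be confirmed per provider.
#[derive(Debug, PartialEq)]
enum Provider {
    Aws,
    Gcp,
    Azure,
    Unknown,
}

fn detect(manufacturer: &str, product: &str) -> Provider {
    if manufacturer == "Amazon EC2" {
        Provider::Aws
    } else if product == "Google Compute Engine" {
        Provider::Gcp
    } else if manufacturer == "Microsoft Corporation" && product == "Virtual Machine" {
        Provider::Azure
    } else {
        // QEMU, bare metal, or an unrecognized cloud: fall back to
        // ConfigDrive / NoCloudMetadata rather than guessing an HTTP endpoint.
        Provider::Unknown
    }
}

fn main() {
    assert_eq!(detect("Amazon EC2", "m6i.large"), Provider::Aws);
    assert_eq!(detect("Google", "Google Compute Engine"), Provider::Gcp);
    assert_eq!(detect("Microsoft Corporation", "Virtual Machine"), Provider::Azure);
    assert_eq!(detect("QEMU", "Standard PC (Q35 + ICH9, 2009)"), Provider::Unknown);
}
```

The demo claim above reduces to this function: the ISO is identical across clouds, and only these input strings differ.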
Open Questions
Which fields of system.cue are runtime-modifiable?
system.cue today is a handful of service entries with kernel Console cap
grants encoded as structured source variants. That will grow. Plausible additions as capOS
matures: driver process definitions (virtio-net, virtio-blk, NVMe) with
device MMIO, interrupt, and frame allocator grants; scheduler tuning
(priority, budget, CPU pinning); filesystem driver services; memory-policy
hooks; ACPI/SMBIOS consumers.
Most of those are either fragile (kernel-adjacent; a bad value bricks
the instance), sensitive (granting kernel:frame_allocator to a
user-data-declared service is effectively root), or both. A
ManifestDelta with full SystemManifest equivalence hands every
such knob to whoever controls user-data.
The narrowing has to happen somewhere, but there are several places it could live:
- Different schema. ManifestDelta is not structurally a subset of SystemManifest — it omits driver entries, scheduler config, and kernel cap sources entirely. Schema-level guarantee; rigid but unambiguous.
- Shared schema, policy-narrowing cap. ManifestUpdater accepts a full delta but validates at apply time: kernel source variants are rejected unless explicitly allow-listed by the cap’s parameters; additions that touch driver-level service entries fail. Flexible, but the narrowing logic is code that has to be audited, not a schema that is self-documenting.
- Tiered deltas. PrivilegedDelta (drivers, scheduler) and ApplicationDelta (hostname, SSH keys, app services), minted by different caps. An operator supervisor holds PrivilegedManifestUpdater; cloud-bootstrap holds only ApplicationManifestUpdater. Compositional; matches the capability-model grain but doubles the schema surface.
- Tag-based field permissions. Fields in ServiceEntry carry a privilege tag; ManifestUpdater is parameterized with a permitted-tag set. One schema, orthogonal policy.
Picking one prematurely would either over-constrain the cloud path
(option 1 before we know what apps legitimately need) or
under-constrain it (option 2 without clarity on what to check
against). This proposal commits only to the shared pipeline
(decoder, spawn loop, authoring tool). The shape of the public
type(s) the cap accepts is deferred until system.cue has grown
enough that the privileged vs. application split is visible in
concrete form.
Related open question: whether kernel cap sources should be expressible in
system.cue at all, or whether the build-time manifest should also declare
them through a narrower mechanism so that the same discipline that protects
cloud user-data also protects the baked-in manifest from accidental
over-grants. If they remain expressible, they should be structured enum/union
variants, not free-form strings; the associated interface TYPE_ID is only a
schema compatibility check and does not identify the authority being granted.
Non-Goals
- cloud-init compatibility. No parsing of #cloud-config YAML, no #!/bin/bash execution, no include-url, no MIME multipart handling. Operators who need these install their own dispatcher services; the base system does not.
- Runtime package installation. The capOS equivalent of “install nginx on boot” is “include nginx in the manifest.” User-data can add services to the manifest; it cannot install packages (there is no package manager to install into).
- Re-running on every boot. cloud-init distinguishes per-boot, per-instance, and per-once modules. The capOS bootstrap service runs once per boot; the manifest it produces is cached under the instance ID, and subsequent boots read the cache and skip the metadata round-trip. A full mode matrix is future work.
- IPv6-only bring-up in the first iteration. Many clouds expose both; the schema supports both; the first implementations do whichever is easier per provider (typically IPv4).
- Automatic secret rotation. Metadata often exposes short-lived credentials (IAM role tokens on AWS, service-account tokens on GCP). Refresh logic belongs to the service that consumes the credential, not to cloud-bootstrap.
Related Work
- cloud-init (Canonical). The Linux reference. Huge scope, shell-script-centric, assumes root and POSIX. The capOS design intentionally takes the pattern and drops everything that depends on ambient authority.
- ignition (CoreOS/Flatcar). Runs once in initramfs, consumes a JSON spec, fails fast if the spec can’t be applied. Closer in spirit to the capOS design — small, single-pass, declarative. Worth studying for its rollback and error-handling approach.
- AWS IMDSv2. The token-exchange handshake is the one thing the HTTP client needs to handle that is not plain GETs. Designing the HttpMetadata interface without accounting for it up front leads to a rewrite later.
Proposal: Hardware Abstraction and Cloud Deployment
How capOS goes from “boots in QEMU” to “boots on a real cloud VM” (GCP, AWS, Azure). This covers the hardware abstraction infrastructure missing between the current QEMU-only kernel and real x86_64 hardware, plus the build system changes needed to produce deployable images.
Depends on: Kernel Networking Smoke Test (for PCI enumeration), Stage 5 (for LAPIC timer), Stage 7 / SMP proposal Phase A (for LAPIC init).
Complements: Networking proposal (extends virtio-net to cloud NICs), Storage proposal (extends virtio-blk to NVMe), SMP proposal (LAPIC infrastructure shared).
Current State
The kernel boots via Limine UEFI, outputs to COM1 serial, and uses QEMU-specific features (isa-debug-exit). No PCI, no ACPI, no interrupt controller beyond the legacy PIC (implicitly via Limine setup). The only build artifact is an ISO.
What Cloud VMs Provide
GCP (n2-standard), AWS (m6i/c7i), and Azure (Dv5) all expose:
| Resource | Cloud interface | capOS status |
|---|---|---|
| Boot firmware | UEFI (all three) | Limine UEFI works |
| Serial console | COM1 0x3F8 | Works (serial.rs) |
| Boot media | GPT disk image (raw/VMDK/VHD) | Missing (ISO only) |
| Storage | NVMe (EBS, PD, Managed Disk) | Missing |
| NIC | ENA (AWS), gVNIC (GCP), MANA (Azure) | Missing |
| Virtio NIC | GCP (fallback), some bare-metal | Missing (planned) |
| Timer | LAPIC, TSC, HPET | Missing |
| Interrupt delivery | I/O APIC, MSI/MSI-X | Missing |
| Device discovery | ACPI + PCI/PCIe | Missing |
| Display | None (headless) | N/A |
What Already Works
- UEFI boot – Limine ISO includes BOOTX64.EFI. The boot path itself is cloud-compatible.
- Serial output – all three clouds expose COM1. gcloud compute instances get-serial-port-output, aws ec2 get-console-output, and the Azure serial console all read from it.
- x86_64 long mode – cloud VMs are KVM-based x86_64. Architecture matches.
Phase 1: Bootable Disk Image
Goal: Produce a GPT disk image that cloud VMs can boot from, alongside the existing ISO for QEMU.
The Problem
Cloud VMs boot from disk images, not ISOs. Each cloud has a preferred format:
| Cloud | Image format | Import method |
|---|---|---|
| GCP | raw (tar.gz) | gcloud compute images create --source-uri=gs://... |
| AWS | raw, VMDK, VHD | aws ec2 import-image or register-image with EBS snapshot |
| Azure | VHD (fixed size) | az image create --source |
All require a GPT-partitioned disk with an EFI System Partition (ESP) containing the bootloader.
Disk Layout
GPT disk image (64 MB minimum)
Partition 1: EFI System Partition (FAT32, ~32 MB)
/EFI/BOOT/BOOTX64.EFI (Limine UEFI loader)
/limine.conf (bootloader config)
/boot/kernel (capOS kernel ELF)
/boot/init (init process ELF)
Partition 2: (reserved for future use -- persistent store backing)
Build Tooling
New Makefile target make image using standard tools:
IMAGE := capos.img
IMAGE_SIZE := 64 # MB

image: kernel init $(LIMINE_DIR)
	# Create raw disk image
	dd if=/dev/zero of=$(IMAGE) bs=1M count=$(IMAGE_SIZE)
	# Partition with GPT + ESP
	sgdisk -n 1:2048:+32M -t 1:ef00 $(IMAGE)
	# Format ESP as FAT32, copy files (mtools or loop mount + mkfs.fat)
	mformat -i $(IMAGE)@@1M -F -T 65536 ::
	mcopy -i $(IMAGE)@@1M $(LIMINE_DIR)/BOOTX64.EFI ::/EFI/BOOT/
	mcopy -i $(IMAGE)@@1M limine.conf ::/
	mcopy -i $(IMAGE)@@1M $(KERNEL) ::/boot/kernel
	mcopy -i $(IMAGE)@@1M $(INIT) ::/boot/init
	# Install Limine. bios-install is for hybrid BIOS/UEFI boot in local
	# QEMU testing; for cloud-only (UEFI-only) images this line can be omitted.
	$(LIMINE_DIR)/limine bios-install $(IMAGE)
New QEMU target to test disk boot locally:
run-disk: $(IMAGE)
	qemu-system-x86_64 -drive file=$(IMAGE),format=raw \
		-bios /usr/share/edk2/x64/OVMF.4m.fd \
		-display none $(QEMU_COMMON); \
	test $$? -eq 1
Cloud upload helpers (scripts, not Makefile targets):
# GCP
tar czf capos.tar.gz capos.img
gsutil cp capos.tar.gz gs://my-bucket/
gcloud compute images create capos --source-uri=gs://my-bucket/capos.tar.gz
# AWS
aws ec2 import-image --disk-containers \
"Format=raw,UserBucket={S3Bucket=my-bucket,S3Key=capos.img}"
Dependencies
- sgdisk (gdisk package) – GPT partitioning
- mtools (mformat, mcopy) – FAT32 manipulation without root/loop mount
Scope
~30 lines of Makefile + a helper script for cloud uploads. No kernel changes.
Phase 2: ACPI and Device Discovery
Goal: Parse ACPI tables to discover hardware topology, interrupt routing, and PCI root complexes. This replaces QEMU-specific hardcoded assumptions.
Why ACPI
On QEMU with default settings, you can hardcode PCI config space at
0xCF8/0xCFC and assume legacy interrupt routing. On real cloud hardware:
- PCI root complex addresses come from ACPI MCFG table (PCIe ECAM)
- Interrupt routing comes from ACPI MADT (I/O APIC entries) and _PRT
- CPU topology comes from ACPI MADT (LAPIC entries)
- Timer info comes from ACPI HPET/PMTIMER tables
Limine provides the RSDP (Root System Description Pointer) address via its protocol. From there, the kernel can walk RSDT/XSDT to find specific tables.
Required Tables
| Table | Purpose | Priority |
|---|---|---|
| MADT | LAPIC and I/O APIC addresses, CPU enumeration | High (Phase 2) |
| MCFG | PCIe Enhanced Configuration Access Mechanism base | High (Phase 2) |
| HPET | High Precision Event Timer address | Medium (fallback timer) |
| FADT | PM timer, shutdown/reset methods | Low (future) |
Implementation
// kernel/src/acpi.rs

/// Minimal ACPI table parser.
/// Walks RSDP -> XSDT -> individual tables.
/// Does NOT implement AML interpretation -- static tables only.
pub struct AcpiInfo {
    pub lapics: Vec<LapicEntry>,
    pub io_apics: Vec<IoApicEntry>,
    pub iso_overrides: Vec<InterruptSourceOverride>,
    pub mcfg_base: Option<u64>, // PCIe ECAM base address
    pub hpet_base: Option<u64>,
}

pub fn parse_acpi(rsdp_addr: u64, hhdm: u64) -> AcpiInfo { ... }
Use the acpi crate (no_std, well-maintained) for parsing rather than
hand-rolling. It handles RSDP, RSDT/XSDT, MADT, MCFG, and HPET.
Limine RSDP
use limine::request::RsdpRequest;

static RSDP: RsdpRequest = RsdpRequest::new();

// In kmain:
let rsdp_addr = RSDP.response().expect("no RSDP").address() as u64;
let acpi_info = acpi::parse_acpi(rsdp_addr, hhdm_offset);
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| acpi | ACPI table parsing (MADT, MCFG, etc.) | yes |
Scope
~200-300 lines of glue code wrapping the acpi crate. The crate does the
heavy lifting.
Phase 3: Interrupt Infrastructure
Goal: Set up I/O APIC for device interrupt routing and MSI/MSI-X for modern PCI devices. This replaces the implicit legacy PIC setup.
I/O APIC
The I/O APIC routes external device interrupts (keyboard, serial, PCI devices) to specific LAPIC entries (CPUs). Its address and configuration come from the ACPI MADT (Phase 2).
// kernel/src/ioapic.rs
pub struct IoApic {
    base: *mut u32, // MMIO registers via HHDM
}

impl IoApic {
    /// Route an IRQ to a specific LAPIC/vector.
    pub fn route_irq(&mut self, irq: u8, lapic_id: u8, vector: u8) { ... }

    /// Mask/unmask an IRQ line.
    pub fn set_mask(&mut self, irq: u8, masked: bool) { ... }
}
The I/O APIC must respect Interrupt Source Override entries from MADT (e.g., IRQ 0 might be remapped to GSI 2 on real hardware).
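The override lookup is a small pure function. A sketch, with illustrative field names rather than the MADT record layout:

```rust
// Sketch: resolve a legacy ISA IRQ to a Global System Interrupt,
// honoring MADT Interrupt Source Override entries. The identity
// mapping applies unless an override says otherwise.
struct InterruptSourceOverride {
    source_irq: u8, // legacy ISA IRQ number
    gsi: u32,       // Global System Interrupt it actually maps to
}

fn gsi_for_irq(overrides: &[InterruptSourceOverride], irq: u8) -> u32 {
    overrides
        .iter()
        .find(|o| o.source_irq == irq)
        .map(|o| o.gsi)
        .unwrap_or(irq as u32)
}

fn main() {
    // Typical x86 firmware: the PIT (IRQ 0) is rerouted to GSI 2.
    let overrides = [InterruptSourceOverride { source_irq: 0, gsi: 2 }];
    assert_eq!(gsi_for_irq(&overrides, 0), 2); // overridden
    assert_eq!(gsi_for_irq(&overrides, 4), 4); // identity (COM1 serial)
}
```

A driver that programs redirection entries by raw IRQ number instead of the resolved GSI will work in QEMU's default configuration and silently miss interrupts on firmware that remaps.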
MSI/MSI-X
Modern PCI/PCIe devices (NVMe, cloud NICs) use Message Signaled Interrupts instead of pin-based IRQs routed through the I/O APIC. An MSI/MSI-X interrupt is a memory write to a special physical address range that the platform delivers straight to the target LAPIC, bypassing the I/O APIC entirely.
This is critical for cloud deployment because:
- NVMe controllers require MSI or MSI-X (no legacy IRQ fallback on many controllers)
- Cloud NICs (ENA, gVNIC) use MSI-X exclusively
- MSI-X supports per-queue interrupts (one vector per virtqueue/submission queue), enabling better SMP scalability
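The message encoding itself is simple arithmetic. A minimal sketch of the x86_64 format from the Intel SDM, restricted to fixed delivery and edge triggering (the common device-driver case); real setup also involves writing these values into the device's MSI/MSI-X capability or table entry:

```rust
// Sketch of the x86_64 MSI message encoding: the device performs a
// memory write of `data` to `address`, and the LAPIC whose ID matches
// the destination field receives `vector`.

/// MSI address: 0xFEE0_0000 base with the destination LAPIC ID in
/// bits 19:12 (physical destination mode, no redirection hint).
fn msi_address(lapic_id: u8) -> u32 {
    0xFEE0_0000 | ((lapic_id as u32) << 12)
}

/// MSI data: vector in bits 7:0; delivery mode = fixed (000),
/// edge-triggered, so all other bits stay zero.
fn msi_data(vector: u8) -> u32 {
    vector as u32
}

fn main() {
    // Route a device's MSI to vector 0x40 on the BSP (LAPIC ID 0).
    assert_eq!(msi_address(0), 0xFEE0_0000);
    assert_eq!(msi_data(0x40), 0x40);
    // Same vector on LAPIC ID 2: only the destination-ID bits change.
    assert_eq!(msi_address(2), 0xFEE0_2000);
}
```

Per-queue MSI-X routing is then just this computation repeated per table entry with a different (vector, lapic_id) pair.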
// kernel/src/pci/msi.rs

/// Configure MSI for a PCI device.
pub fn enable_msi(device: &PciDevice, vector: u8, lapic_id: u8) { ... }

/// Configure MSI-X for a PCI device.
pub fn enable_msix(
    device: &PciDevice,
    table_bar: u8,
    entries: &[(u16, u8, u8)], // (index, vector, lapic_id)
) { ... }
MSI/MSI-X capability structures are found by walking the PCI capability list (already needed for PCI enumeration in the networking proposal).
Integration with SMP
LAPIC initialization is shared between this phase and the SMP proposal (Phase A). If SMP is implemented first, LAPIC is already available. If this phase comes first, it initializes the BSP’s LAPIC and the SMP proposal extends to APs.
Scope
~300-400 lines total:
- I/O APIC driver: ~150 lines
- MSI/MSI-X setup: ~100-150 lines
- Integration/routing logic: ~50-100 lines
Phase 4: PCI/PCIe Infrastructure
Goal: Standalone PCI bus enumeration and device management, usable by all device drivers (virtio-net, NVMe, cloud NICs).
The networking proposal includes PCI enumeration as a substep for finding virtio-net. This phase promotes it to a reusable kernel subsystem that all device drivers build on.
PCI Configuration Access
Two mechanisms, determined by ACPI:
- Legacy I/O ports (0xCF8/0xCFC) – works in QEMU, limited to 256 bytes of config space per function. Insufficient for PCIe extended capabilities.
- PCIe ECAM (Enhanced Configuration Access Mechanism) – memory-mapped config space, 4 KB per function. Base address from ACPI MCFG table. Required for MSI-X capability parsing and NVMe BAR discovery on real hardware.
Start with legacy I/O for QEMU, add ECAM when ACPI parsing (Phase 2) is available.
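Both access methods reduce to an address calculation defined by the PCI and PCIe specs, which makes the dual-mechanism support cheap. A sketch of the two computations:

```rust
// The two PCI config-access address calculations, per the PCI local bus
// and PCIe base specs.

/// Legacy mechanism: the dword written to I/O port 0xCF8 before
/// reading/writing data at 0xCFC. Offset is limited to 256 bytes
/// and must be dword-aligned in the address.
fn legacy_config_address(bus: u8, device: u8, function: u8, offset: u8) -> u32 {
    0x8000_0000 // enable bit
        | (bus as u32) << 16
        | (device as u32) << 11
        | (function as u32) << 8
        | (offset as u32 & 0xFC)
}

/// ECAM: each function gets 4 KB of memory-mapped config space at a
/// fixed offset from the MCFG-provided base, so the full 4 KB of PCIe
/// extended config space is addressable.
fn ecam_address(base: u64, bus: u8, device: u8, function: u8, offset: u16) -> u64 {
    base + ((bus as u64) << 20 | (device as u64) << 15 | (function as u64) << 12)
        + offset as u64
}

fn main() {
    // Vendor ID (offset 0) of bus 0, device 3, function 0:
    assert_eq!(legacy_config_address(0, 3, 0, 0x00), 0x8000_1800);
    // The same function under ECAM with an assumed base of 0xB000_0000:
    assert_eq!(ecam_address(0xB000_0000, 0, 3, 0, 0x00), 0xB001_8000);
}
```

The enumeration and capability-walking code can be written once against a read-config-dword closure and stay oblivious to which mechanism backs it.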
Device Enumeration
// kernel/src/pci/mod.rs
pub struct PciDevice {
    pub bus: u8,
    pub device: u8,
    pub function: u8,
    pub vendor_id: u16,
    pub device_id: u16,
    pub class: u8,
    pub subclass: u8,
    pub bars: [Option<Bar>; 6],
    pub interrupt_pin: u8,
    pub interrupt_line: u8,
}

pub enum Bar {
    Memory { base: u64, size: u64, prefetchable: bool },
    Io { base: u16, size: u16 },
}

/// Scan all PCI buses and return discovered devices.
pub fn enumerate() -> Vec<PciDevice> { ... }

/// Find a device by vendor/device ID.
pub fn find_device(vendor: u16, device: u16) -> Option<PciDevice> { ... }

/// Walk the PCI capability list for a device.
pub fn capabilities(device: &PciDevice) -> Vec<PciCapability> { ... }
BAR Mapping
Device drivers need MMIO access to BAR regions. The kernel maps BAR physical
addresses into virtual address space (via HHDM for kernel-mode drivers, or
via a DeviceMmio capability for userspace drivers as described in the
networking proposal).
PCI Device IDs for Cloud Hardware
| Device | Vendor:Device | Cloud |
|---|---|---|
| virtio-net | 1AF4:1000 (transitional) or 1AF4:1041 (modern) | QEMU, GCP fallback |
| virtio-blk | 1AF4:1001 (transitional) or 1AF4:1042 (modern) | QEMU |
| NVMe | 8086:various, 144D:various, etc. | All clouds (EBS, PD, Managed Disk) |
| AWS ENA | 1D0F:EC20 / 1D0F:EC21 | AWS |
| GCP gVNIC | 1AE0:0042 | GCP |
| Azure MANA | 1414:00BA | Azure |
Scope
~400-500 lines:
- Config space access (I/O + ECAM): ~100 lines
- Bus enumeration: ~150 lines
- BAR parsing and mapping: ~100 lines
- Capability list walking: ~50-100 lines
Phase 5: NVMe Driver
Goal: Basic NVMe block device driver, sufficient to read/write sectors. This is the storage equivalent of virtio-net for networking – the first real storage driver.
Why NVMe Over virtio-blk
The storage-and-naming proposal mentions virtio-blk for Phase 3 (persistent store). On cloud VMs, all three providers expose NVMe:
- AWS EBS – NVMe interface (even for gp3/io2 volumes)
- GCP Persistent Disk – NVMe or SCSI (NVMe is default for newer VMs)
- Azure Managed Disks – NVMe on newer VM series (Ev5, Dv5)
virtio-blk is QEMU-only. An NVMe driver unlocks persistent storage on all
cloud platforms. For QEMU testing, QEMU also emulates NVMe well:
-drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0.
NVMe Architecture
NVMe is a register-level standard with well-defined queue-pair semantics:
Application
|
v
Submission Queue (SQ) -- ring buffer of 64-byte command entries
|
| doorbell write (MMIO)
v
NVMe Controller (hardware)
|
| DMA completion
v
Completion Queue (CQ) -- ring buffer of 16-byte completion entries
|
| MSI-X interrupt
v
Driver processes completions
Minimum viable driver needs:
- Admin Queue Pair (for identify, create I/O queues)
- One I/O Queue Pair (for read/write commands)
- MSI-X for completion notification (or polling)
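The doorbell writes that drive this pipeline are at BAR0 offsets computed from the queue ID and the controller's doorbell stride (CAP.DSTRD). A sketch of the arithmetic from the NVMe base spec, assuming DSTRD = 0 (the common 4-byte stride):

```rust
// NVMe doorbell-offset arithmetic: after placing a command in a
// submission queue, the driver writes the new tail index to that
// queue's doorbell register; completion-queue head updates go to the
// adjacent doorbell. Offsets start at 0x1000 in BAR0.

fn sq_tail_doorbell(qid: u32, dstrd: u32) -> u32 {
    0x1000 + (2 * qid) * (4 << dstrd)
}

fn cq_head_doorbell(qid: u32, dstrd: u32) -> u32 {
    0x1000 + (2 * qid + 1) * (4 << dstrd)
}

fn main() {
    // With DSTRD = 0, doorbells are packed 4 bytes apart:
    assert_eq!(sq_tail_doorbell(0, 0), 0x1000); // admin SQ tail
    assert_eq!(cq_head_doorbell(0, 0), 0x1004); // admin CQ head
    assert_eq!(sq_tail_doorbell(1, 0), 0x1008); // I/O SQ 1 tail
    assert_eq!(cq_head_doorbell(1, 0), 0x100C); // I/O CQ 1 head
}
```

A driver that hardcodes the 4-byte stride instead of reading CAP.DSTRD will work in QEMU but can misbehave on controllers that space doorbells out for cacheline isolation.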
Implementation Sketch
// kernel/src/nvme.rs (or kernel/src/drivers/nvme.rs)
pub struct NvmeController {
    bar0: *mut u8, // MMIO registers
    admin_sq: SubmissionQueue,
    admin_cq: CompletionQueue,
    io_sq: SubmissionQueue,
    io_cq: CompletionQueue,
    namespace_id: u32,
    block_size: u32,
    block_count: u64,
}

impl NvmeController {
    pub fn init(pci_device: &PciDevice) -> Result<Self, NvmeError> { ... }
    pub fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), NvmeError> { ... }
    pub fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), NvmeError> { ... }
    pub fn identify(&self) -> NvmeIdentify { ... }
}
DMA Considerations
NVMe uses DMA for data transfer. The controller reads/writes directly from physical memory addresses provided in commands. Requirements:
- Buffers must be physically contiguous (or use PRP lists / SGLs for scatter-gather)
- Physical addresses must be provided (not virtual)
- Cache coherence is handled by hardware on x86_64 (DMA-coherent architecture)
The existing frame allocator can provide physically contiguous pages. For larger transfers, PRP (Physical Region Page) lists allow scatter-gather.
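The PRP rules reduce to a small decision: PRP1 covers the remainder of the first (possibly unaligned) page, one more page fits directly in PRP2, and anything larger needs a PRP list. A sketch of that logic, assuming 4 KB memory pages:

```rust
// Sketch of the NVMe PRP decision for a single physically contiguous
// buffer. Real commands carry PRP1/PRP2 as physical addresses; here we
// only classify which shape the command needs.
const PAGE: u64 = 4096;

#[derive(Debug)]
enum Prp {
    One,         // fits in the remainder of the first page (PRP1 only)
    Two,         // second page pointer goes directly in PRP2
    List(usize), // PRP2 points to a list with this many entries
}

fn prp_for(buf_phys: u64, len: u64) -> Prp {
    let first = PAGE - (buf_phys % PAGE); // bytes PRP1 can cover
    if len <= first {
        Prp::One
    } else {
        let remaining_pages = ((len - first) + PAGE - 1) / PAGE;
        if remaining_pages == 1 {
            Prp::Two
        } else {
            Prp::List(remaining_pages as usize)
        }
    }
}

fn main() {
    assert!(matches!(prp_for(0x1000, 512), Prp::One));        // one sector
    assert!(matches!(prp_for(0x1000, 8192), Prp::Two));       // two pages
    assert!(matches!(prp_for(0x1200, 4096), Prp::Two));       // unaligned spill
    assert!(matches!(prp_for(0x1000, 65536), Prp::List(15))); // 64 KB: PRP1 + 15 list entries
}
```

The unaligned case is why the classification takes the buffer's physical address, not just its length.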
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| (none) | NVMe register-level protocol is simple enough to implement directly | N/A |
The NVMe spec is cleaner than virtio and the register interface is straightforward. A minimal driver (admin + 1 I/O queue pair, read/write) is ~500-700 lines without external dependencies.
Integration with Storage Proposal
The storage proposal’s Phase 3 (Persistent Store) specifies virtio-blk as
the backing device. This can be generalized to a BlockDevice trait:
trait BlockDevice {
    fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), Error>;
    fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), Error>;
    fn block_size(&self) -> u32;
    fn block_count(&self) -> u64;
}
Both NVMe and virtio-blk implement this trait. The store service doesn’t care which backing driver it uses.
Scope
~500-700 lines for a minimal in-kernel NVMe driver (admin queue + 1 I/O queue pair, read/write, identify). Userspace decomposition follows the same pattern as the networking proposal (kernel driver first, then extract to userspace process with DeviceMmio + Interrupt caps).
Phase 6: Cloud NIC Strategy
Goal: Define the path to networking on cloud VMs, given that each cloud uses a different proprietary NIC.
The Landscape
| Cloud | Primary NIC | Virtio NIC available? | Open-source driver? |
|---|---|---|---|
| GCP | gVNIC (1AE0:0042) | Yes (fallback option) | Yes (Linux, ~3000 LoC) |
| AWS | ENA (1D0F:EC20) | No (Nitro only) | Yes (Linux, ~8000 LoC) |
| Azure | MANA (1414:00BA) | No (accelerated networking) | Yes (Linux, ~6000 LoC) |
Recommended Strategy
Short term: virtio-net on GCP
GCP allows selecting VIRTIO_NET as the NIC type when creating instances.
This is a first-class option, not a legacy fallback. Combined with the
virtio-net driver from the networking proposal, this gives cloud networking
with zero additional driver work.
gcloud compute instances create capos-test \
--image=capos \
--machine-type=e2-micro \
--network-interface=nic-type=VIRTIO_NET
Medium term: gVNIC driver
gVNIC is a simpler device than ENA or MANA. The Linux driver is ~3000 lines (vs ~8000 for ENA). It uses standard PCI BAR MMIO + MSI-X interrupts. A minimal gVNIC driver (init, link up, send/receive) would be ~800-1200 lines.
gVNIC is worth prioritizing because:
- GCP is the only cloud with a virtio-net fallback, making it the natural first target
- Graduating from virtio-net to gVNIC on the same cloud is a clean progression
- The gVNIC register interface is documented in the Linux driver source
Long term: ENA and MANA
ENA and MANA are more complex and less well-documented outside their Linux drivers. These should be deferred until the driver model is mature (userspace drivers with DeviceMmio caps, as described in the networking proposal Part 2).
At that point, the kernel only needs to provide PCI enumeration + BAR mapping + MSI-X routing. The actual NIC driver logic runs in a userspace process, making it feasible to port from the Linux driver source with appropriate licensing considerations.
Alternative: Paravirt Abstraction Layer
Instead of writing native drivers for each cloud NIC, an alternative is a thin paravirt layer:
Application -> NetworkManager cap -> Net Stack (smoltcp) -> NIC cap -> [driver]
Where [driver] is one of:
- virtio-net (QEMU, GCP fallback)
- gvnic (GCP)
- ena (AWS)
- mana (Azure)
All drivers implement the same Nic capability interface from the networking
proposal. The network stack and applications are driver-agnostic.
This is already the architecture described in the networking proposal. The
only addition is recognizing that multiple driver implementations will exist
behind the same Nic interface.
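In trait form this is the same shape as the BlockDevice abstraction in Phase 5. A sketch with illustrative method names (the networking proposal's actual Nic schema may differ):

```rust
// Sketch: one Nic interface, multiple backing drivers. The network
// stack holds a `dyn Nic` and never learns which driver backs it.
trait Nic {
    fn mac(&self) -> [u8; 6];
    fn send(&mut self, frame: &[u8]) -> Result<(), ()>;
}

struct VirtioNet { mac: [u8; 6] }
struct Gvnic { mac: [u8; 6] }

impl Nic for VirtioNet {
    fn mac(&self) -> [u8; 6] { self.mac }
    fn send(&mut self, _frame: &[u8]) -> Result<(), ()> { Ok(()) } // virtqueue push elided
}

impl Nic for Gvnic {
    fn mac(&self) -> [u8; 6] { self.mac }
    fn send(&mut self, _frame: &[u8]) -> Result<(), ()> { Ok(()) } // gVNIC tx ring elided
}

fn stack_send(nic: &mut dyn Nic, frame: &[u8]) -> Result<(), ()> {
    nic.send(frame)
}

fn main() {
    let mut v = VirtioNet { mac: [0x52, 0x54, 0, 0, 0, 1] };
    let mut g = Gvnic { mac: [0x42, 0x01, 0, 0, 0, 2] };
    assert!(stack_send(&mut v, &[0u8; 64]).is_ok());
    assert!(stack_send(&mut g, &[0u8; 64]).is_ok());
}
```

In the capability system the dispatch is a cap grant rather than a vtable, but the invariant is the same: smoltcp and everything above it compile once against the interface.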
Phase Summary and Dependencies
graph TD
P1[Phase 1: Disk Image Build] --> BOOT[Boots on Cloud VM]
P2[Phase 2: ACPI Parsing] --> P3[Phase 3: Interrupt Infrastructure]
P2 --> P4[Phase 4: PCI/PCIe]
P3 --> P5[Phase 5: NVMe Driver]
P4 --> P5
P4 --> NET[Networking Smoke Test<br>virtio-net driver]
P3 --> NET
P4 --> P6[Phase 6: Cloud NIC Drivers]
P3 --> P6
NET --> P6
S5[Stage 5: Scheduling] --> P3
SMP_A[SMP Phase A: LAPIC] --> P3
style P1 fill:#2d5,stroke:#333
style BOOT fill:#2d5,stroke:#333
| Phase | Depends on | Estimated scope | Enables |
|---|---|---|---|
| 1: Disk image | Nothing | ~30 lines Makefile | Cloud boot |
| 2: ACPI | Nothing (kernel code) | ~200-300 lines | Phases 3, 4 |
| 3: Interrupts | Phase 2, LAPIC (SMP/Stage 5) | ~300-400 lines | NVMe, cloud NICs |
| 4: PCI/PCIe | Phase 2 | ~400-500 lines | All device drivers |
| 5: NVMe | Phases 3, 4 | ~500-700 lines | Cloud storage |
| 6: Cloud NICs | Phases 3, 4, networking smoke test | ~800-1200 lines each | Cloud networking |
Minimum Path to “Boots on Cloud VM, Prints Hello”
Phase 1 only. Everything else (serial, UEFI) already works. This is a build system change, not a kernel change.
Minimum Path to “Useful on Cloud VM”
Phases 1-5 (disk image + ACPI + interrupts + PCI + NVMe) plus the existing roadmap items (Stages 4-6 for capability syscalls, scheduling, IPC). With GCP’s virtio-net fallback, networking can use the existing networking proposal without Phase 6.
QEMU Testing
All phases can be tested in QEMU before deploying to cloud:
| Phase | QEMU flags |
|---|---|
| Disk image | -drive file=capos.img,format=raw -bios OVMF.4m.fd |
| ACPI | Default QEMU provides ACPI tables (MADT, MCFG, etc.) |
| I/O APIC | Default QEMU emulates I/O APIC |
| PCI/PCIe | -device ... adds PCI devices; QEMU has PCIe root complex |
| NVMe | -drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0 |
| MSI-X | Supported by QEMU’s NVMe and virtio-net-pci emulation |
| Multi-CPU | -smp 4 (already works with Limine SMP) |
aarch64 and ARM Cloud Instances
This proposal focuses on x86_64 because that’s the current kernel target, but ARM-based cloud instances are significant and growing:
| Cloud | ARM offering | Instance types |
|---|---|---|
| AWS | Graviton2/3/4 | m7g, c7g, r7g, etc. |
| GCP | Tau T2A (Ampere Altra) | t2a-standard-* |
| Azure | Cobalt 100 (Arm Neoverse) | Dpsv6, Dplsv6 |
ARM cloud VMs have the same general requirements (UEFI boot, ACPI tables, PCI/PCIe, NVMe storage) but different specifics:
- Interrupt controller: GIC (Generic Interrupt Controller) instead of APIC. GICv3 is standard on cloud ARM instances.
- Boot: UEFI via Limine (already targets aarch64). Limine handles the architecture differences at boot time.
- Timer: ARM generic timer (CNTPCT_EL0) instead of LAPIC/PIT/TSC.
- Serial: PL011 UART instead of 16550 COM1. Different register interface.
- NIC: Same PCI devices (ENA, gVNIC, MANA) with the same register interfaces – PCI/PCIe is architecture-neutral.
- NVMe: Same NVMe register interface – PCIe is architecture-neutral.
The arch-neutral parts of this proposal (PCI enumeration, NVMe, disk image format, ACPI table parsing) apply equally to aarch64. The arch-specific parts (I/O APIC, MSI delivery address format, LAPIC) need aarch64 equivalents (GIC, ARM MSI translation).
The existing roadmap lists “aarch64 support” as a future item. For cloud deployment, aarch64 should be considered as soon as the x86_64 hardware abstraction is stable, since:
- Device drivers (NVMe, virtio-net, cloud NICs) are architecture-neutral – they talk to PCI config space and MMIO BARs, which are the same on both architectures
- The acpi crate handles both x86_64 and aarch64 ACPI tables
- Limine already targets aarch64
- AWS Graviton instances are often cheaper than x86_64 equivalents
The main aarch64 kernel work is: exception handling (EL0/EL1 instead of Ring 0/3), GIC driver (instead of APIC), ARM generic timer, PL011 serial, and the MMU setup (4-level page tables exist on both but with different register interfaces).
Open Questions
- ACPI scope. The acpi crate can parse static tables (MADT, MCFG, HPET, FADT). Full ACPI requires AML interpretation (for _PRT interrupt routing, dynamic device enumeration). Do we need AML, or are static tables sufficient for cloud VMs? Cloud VM firmware typically provides simple, static ACPI tables – AML interpretation is likely unnecessary initially.
- PCIe ECAM vs legacy. Should we support both config access methods, or require ECAM (which all cloud VMs and modern QEMU provide)? Supporting both adds ~50 lines but makes bare-metal testing on older hardware possible.
- NVMe queue depth. A single I/O queue pair with depth 32 is sufficient for initial use. Per-CPU queues (leveraging MSI-X per-queue interrupts) improve SMP throughput but add complexity. Defer per-CPU queues to after SMP is working.
- Driver model unification. Resolved: PCI enumeration is the standalone PCI/PCIe Infrastructure item in the roadmap. The networking smoke test and NVMe driver both consume this shared subsystem. The networking proposal’s Part 1 Step 1 has been updated to reference this phase.
- GCP vs AWS as first cloud target. GCP has virtio-net fallback, making it the easiest first target. AWS has the largest market share and EBS/NVMe is well-documented. Recommendation: GCP first (virtio-net path), then AWS (requires ENA or a workaround).
References
Specifications
- NVMe Base Specification 2.1 – register interface, queue semantics, command set
- PCI Express Base Specification – ECAM, MSI/MSI-X capability structures
- ACPI Specification 6.5 – MADT, MCFG, HPET table formats
- Intel SDM Vol. 3, Ch. 10 – APIC architecture (LAPIC, I/O APIC)
Crates
- acpi – no_std ACPI table parser
- virtio-drivers – no_std virtio (already in networking proposal)
Prior Art
- Redox PCI – microkernel PCI driver in Rust
- Hermit NVMe – unikernel NVMe driver
- rCore virtio – educational OS with virtio + PCI in Rust
- Linux gVNIC driver – reference for gVNIC register interface (~3000 LoC)
- Linux ENA driver – reference for ENA
Cloud Documentation
- GCP: Creating custom images
- AWS: Importing VM images
- Azure: Creating custom images
- GCP: Choosing a NIC type
Proposal: Live Upgrade
Replacing a running service with a new binary, without dropping outstanding capability references or losing in-flight work.
Problem
In a Linux-like system, “upgrading a service” is one of:
- Restart: stop the old process, start the new one. Clients holding file descriptors, sockets, or pipes to the old process receive ECONNRESET or EPIPE and must reconnect. Session state is lost unless clients serialize it themselves.
- Graceful restart (nginx -s reload, unicorn, systemd socket activation): new process starts alongside old, inherits the listening socket, old drains in-flight requests. Works only for request/response protocols where the session is the request. Does nothing for stateful sessions.
- Live patch (kpatch, ksplice): binary-level function replacement. Narrow, fragile, no schema for state layout changes.
None of these compose with a capability OS. A CapId held by a client
points at a specific process; if that process exits, the cap is dead.
There is no “the service” abstraction the kernel could re-bind — the
point of capabilities is that they identify a specific reference, not
a name that could be redirected after the fact.
But capOS has a kernel-side primitive the Linux model lacks: the kernel
already owns the authoritative table of every CapId and which process
serves it. Rewriting “cap X is served by process v1” → “cap X is served
by process v2” is a table update. The question is when it is safe, and
how v2 inherits enough state to answer the next call.
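The table-update claim can be made concrete with a toy model. The types here (a HashMap from CapId to serving process) are illustrative stand-ins for the kernel's actual capability table:

```rust
// Sketch: a retarget is a single pass over the cap table rewriting
// every entry served by v1 to point at v2. No client-held CapId
// changes value; only its server binding moves.
use std::collections::HashMap;

type CapId = u64;
type Pid = u32;

fn retarget(table: &mut HashMap<CapId, Pid>, old: Pid, new: Pid) -> usize {
    let mut moved = 0;
    for served_by in table.values_mut() {
        if *served_by == old {
            *served_by = new;
            moved += 1;
        }
    }
    moved
}

fn main() {
    // Caps 7 and 8 are served by v1 (pid 100); cap 9 by an unrelated service.
    let mut caps: HashMap<CapId, Pid> = HashMap::from([(7, 100), (8, 100), (9, 200)]);
    let moved = retarget(&mut caps, 100, 101); // v1 = 100, v2 = 101
    assert_eq!(moved, 2);
    assert_eq!(caps[&7], 101);
    assert_eq!(caps[&9], 200); // other services untouched
}
```

The hard part is everything around this loop, not the loop itself, which is what the rest of the proposal addresses.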
Three Cases
Live upgrade has three distinct cost profiles. The right design is to make each one explicit rather than pretend the hard case doesn’t exist.
Case 1: Stateless services
Each SQE is independent; the service holds no state that matters across calls. A request router, a pure codec, a logger that flushes to an external sink.
Upgrade is trivial: start v2, retarget every CapId from v1 to v2,
exit v1. Clients may observe a small latency spike; no DISCONNECTED
CQE fires. Only the kernel primitive is needed.
Case 2: State externalized into other caps
The service’s in-memory data is a cache or dispatch table; durable state
lives behind caps the service holds (Store, SessionMap, Namespace).
v1’s held caps are passed to v2 at spawn time (via the supervisor, per
the manifest), kernel retargets client caps, v1 exits.
Architecturally this is the idiomatic capOS pattern: services stay thin, state is factored into dedicated holders with their own caps. The Fetch/HttpEndpoint split in the service-architecture proposal already pushes in this direction. In that world, most services fall into this bucket by construction.
Case 3: Stateful services requiring migration
The service has in-memory state that matters: a JIT’s code cache, a codec’s ring buffer, a parser’s arena, session data not yet flushed. Upgrade requires v1 to hand its state to v2.
capOS’s contribution here is that the state wire format is already capnp — the same format the service uses for IPC. v1 serializes its state as a capnp message; v2 consumes it. There is no separate serialization layer to build and no opportunity for it to drift from the IPC format.
The contract extends the service’s capnp interface:
```capnp
interface Upgradable {
  # Called on v1 by the supervisor. Returns a snapshot of service
  # state and stops accepting new calls. Calls already in flight
  # complete before the snapshot returns.
  quiesce @0 () -> (state :Data);

  # Called on v2 after spawn. Loads state from the snapshot. After
  # this returns, v2 is ready to serve calls.
  resume @1 (state :Data) -> ();
}
```
The state schema is service-defined. Schema evolution follows capnp’s standard rules: adding fields is backward-compatible, renaming requires care, removing requires a major version bump.
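To make the contract concrete, here is a hypothetical Rust sketch of a service implementing the two calls. The `Upgradable` trait, the `SessionService` type, and the line-oriented snapshot encoding are illustrative stand-ins for the capnp-generated bindings and the service's real state schema:

```rust
use std::collections::HashMap;

// Stand-in for the capnp-encoded state snapshot (`state :Data`).
type Snapshot = Vec<u8>;

// Hypothetical Rust-side mirror of the Upgradable capnp interface.
trait Upgradable {
    fn quiesce(&mut self) -> Snapshot;      // stop accepting calls, return state
    fn resume(&mut self, state: Snapshot);  // load state, become ready
}

struct SessionService {
    accepting: bool,
    sessions: HashMap<u32, String>, // in-memory state that must survive the upgrade
}

impl Upgradable for SessionService {
    fn quiesce(&mut self) -> Snapshot {
        self.accepting = false; // in-flight calls drain before this returns
        // A real service serializes via its capnp state schema; a sorted,
        // line-oriented encoding keeps the sketch self-contained.
        let mut keys: Vec<u32> = self.sessions.keys().copied().collect();
        keys.sort();
        let mut out = String::new();
        for k in keys {
            out.push_str(&format!("{}={}\n", k, self.sessions[&k]));
        }
        out.into_bytes()
    }

    fn resume(&mut self, state: Snapshot) {
        for line in String::from_utf8(state).expect("snapshot is utf-8").lines() {
            if let Some((k, v)) = line.split_once('=') {
                self.sessions.insert(k.parse().expect("u32 key"), v.to_string());
            }
        }
        self.accepting = true; // ready to serve calls
    }
}
```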
Kernel Primitive: CapRetarget
The kernel exposes the retarget as a capability method, not a syscall:
```capnp
interface ProcessControl {
  # Atomically redirect every CapId currently served by `old` to
  # be served by `new`. Requires: `new` implements a schema
  # superset of `old` (schema-id compatibility), `new` is Ready,
  # `old` is Quiesced (graceful) or the caller has permission to
  # force.
  retargetCaps @0 (old :ProcessHandle, new :ProcessHandle,
                   mode :RetargetMode) -> ();
}

enum RetargetMode {
  graceful @0;  # old must be Quiesced; in-flight calls drain on old
  force @1;     # caps redirect immediately; in-flight calls fail
}
```
Only a process holding a ProcessControl cap to both processes can
perform this — typically the supervisor that spawned them. The kernel
never initiates upgrades.
Atomicity is per-CapId. From a client’s perspective, the retarget is a
single point in time: a CALL SQE submitted before retarget goes to v1;
a CALL SQE submitted after goes to v2. A CALL already dispatched to v1
either completes there (graceful) or returns a DISCONNECTED CQE
(force).
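A minimal sketch of the per-CapId routing swap, with `CapRouter` and `ProcessHandle` as hypothetical stand-ins for the kernel's dispatch table; a real kernel would also hold the dispatch lock so no SQE is routed mid-retarget:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ProcessHandle(u32);

type CapId = u32;

// Hypothetical slice of kernel state: which process serves each CapId.
struct CapRouter {
    serves: HashMap<CapId, ProcessHandle>,
}

impl CapRouter {
    // Redirect every CapId served by `old` to `new` in one pass. From a
    // client's view this is a single point in time: SQEs routed before
    // the swap go to `old`, SQEs routed after go to `new`.
    fn retarget(&mut self, old: ProcessHandle, new: ProcessHandle) {
        for target in self.serves.values_mut() {
            if *target == old {
                *target = new;
            }
        }
    }
}
```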
Supervisor-Level Upgrade Protocol
The primitives above compose into a protocol the supervisor runs:
1. `spawn` v2 from the new binary in the manifest.
2. Case 1 & 2: `v2.resume(EMPTY_STATE)`.
   Case 3: `state = v1.quiesce()`, then `v2.resume(state)`.
3. `kernel.retargetCaps(v1, v2, graceful)`.
4. Wait for v1 to drain (graceful mode).
5. `v1.exit()`.
If any step fails, the supervisor rolls back: kill v2, resume v1 (if quiesced), log the failure. Because the retarget hasn’t happened yet, clients never observe the aborted attempt.
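The protocol and its rollback path can be sketched as follows. `Process`, its methods, and the error type are hypothetical stand-ins for the `ProcessControl`/`Upgradable` capability calls:

```rust
#[derive(Debug, PartialEq)]
enum UpgradeError {
    QuiesceFailed,
    ResumeFailed,
}

// Stand-in for a supervised process reachable via ProcessControl/Upgradable.
struct Process {
    name: &'static str,
    healthy: bool,
}

impl Process {
    fn quiesce(&mut self) -> Result<Vec<u8>, UpgradeError> {
        if !self.healthy { return Err(UpgradeError::QuiesceFailed); }
        Ok(b"state".to_vec()) // placeholder capnp snapshot
    }
    fn resume(&mut self, _state: Vec<u8>) -> Result<(), UpgradeError> {
        if !self.healthy { return Err(UpgradeError::ResumeFailed); }
        Ok(())
    }
}

// Returns the name of the process serving clients after the attempt.
fn upgrade(v1: &mut Process, v2: &mut Process) -> &'static str {
    // Step 2 (Case 3): quiesce v1, feed its state to v2.
    let state = match v1.quiesce() {
        Ok(s) => s,
        Err(_) => return v1.name, // nothing changed; v1 keeps serving
    };
    if v2.resume(state).is_err() {
        // Roll back: kill v2, resume v1. The retarget never happened,
        // so clients never observe the aborted attempt.
        v1.resume(Vec::new()).ok();
        return v1.name;
    }
    // Step 3: kernel.retargetCaps(v1, v2, graceful) would happen here.
    v2.name
}
```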
In-Flight Calls
The subtle case is a client that has already posted a CALL SQE to v1 when the retarget happens. Two options:
- Graceful mode. v1 finishes the call, and the kernel routes the CQE back to the client on v1’s ring. v1 exits only after its ring is empty. This preserves call semantics; v1 and v2 coexist briefly.
- Force mode. The in-flight CALL returns DISCONNECTED. The client retries against v2. Appropriate when v1 is wedged and `quiesce` won’t return.
In graceful mode the client cannot distinguish “call landed on v1” from “call landed on v2” — which is the point. Capability identity survives the upgrade; process identity does not.
Relationship to Fault Containment
Live upgrade and fault containment (driver panics → supervisor respawns) share machinery. The difference is one step of the protocol:
- Fault containment: v1 has crashed; the kernel has already marked it dead and epoch-bumped its caps. The supervisor spawns v2 and issues a graceful retarget (no quiesce — v1 is gone; in-flight CALLs already delivered DISCONNECTED). Clients reconnect to v2.
- Live upgrade: v1 is healthy; the supervisor initiates `quiesce` → state transfer → retarget, and no CQE ever reports DISCONNECTED to any caller.
The epoch-based revocation work from Stage 6 is the foundation for both. CapRetarget is one additional primitive layered on top.
Security and Trust
Live upgrade does not expand the trust model. The supervisor already holds the authority to kill, restart, and reassign caps for services it spawned — upgrade is a refinement of that authority, not a new principal. Requirements:
- Only a holder of `ProcessControl` caps to both `old` and `new` can call `retargetCaps`. By construction this is the supervisor that spawned them.
- The new binary must be legitimately obtained — in practice, loaded from the same content-addressed store as everything else (ties to Content-Addressed Boot).
- Schema compatibility (`new` is a superset of `old`) is checked by the kernel before retarget. This prevents an upgrade from silently narrowing the interface clients depend on.
Non-Goals
- Code hot-patching. No binary-level function replacement. Upgrade is at the process boundary, not the symbol boundary.
- Kernel live replacement. Covered by Reboot-Proof / process persistence (reboot with state preserved, not live replacement). The kernel is a single trust domain; replacing it in place needs a different design.
- Automatic schema migration across incompatible changes. If v2’s state schema is not a capnp-evolution-compatible superset of v1’s, the service author writes the migration. The kernel does not.
- System-wide registry of upgradable services. The supervisor knows what it spawned; there is no ambient discovery.
Phased Implementation
- CapRetarget primitive. Kernel operation + `ProcessControl` cap. Useful immediately for stateless services (Case 1) and as the foundation of Fault Containment (respawn with a new process, point its caps to a fresh instance).
- Upgradable interface. Schema, contract documentation, and a Rust helper in `capos-rt` that services derive.
- Graceful drain. Quiesce + in-flight call completion + v1 exit synchronization.
- Stateful demo. A service maintaining session state, upgraded live with zero session loss. This is the Live Upgrade observable milestone.
Related Work
- Erlang/OTP `code_change/3` is the closest prior art: processes upgrade their behavior module in place, with a callback to migrate state. capOS differs only in that state transport goes through capnp rather than Erlang term format, and that the process boundary is an OS process rather than a BEAM process.
- Fuchsia component updates rebind component instances in the routing graph. Similar primitive in a different mechanism.
- nginx `-s reload` is graceful restart for request/response servers. The design here generalizes it by exposing the state migration point explicitly rather than relying on “the session is the request.”
Proposal: Capability-Oriented GPU/CUDA Integration
Purpose
Define a minimal, capability-safe path to integrate NVIDIA/CUDA-capable GPUs into the capOS architecture without expanding kernel trust.
The kernel keeps direct control of hardware arbitration and trust boundaries. GPU hardware interaction is performed by a dedicated userspace service that is invoked through capability calls.
Positioning Against Current Project State
capOS currently provides:
- Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz, round-robin, context switching).
- A global and per-process capability table with `CapObject` dispatch.
- Shared-memory capability ring (io_uring-inspired) with syscall-free SQE writes.
- `cap_enter` syscall for ordinary CALL dispatch and completion waits.
- No ACPI/PCI/interrupt infrastructure yet in-kernel.
That means GPU integration must be staged and should begin as a capability model exercise first, with real hardware I/O added after the underlying kernel subsystems exist.
Design Principles
- Keep policy in kernel, execution in userspace.
- Never expose raw PCI/MMIO/IRQ details to untrusted processes.
- Make GPU access explicit through narrow capabilities.
- Treat every stateful resource (session, buffer, queue, fence) as a capability.
- Require revocability and bounded lifetime for every GPU-facing object.
- Avoid a Linux-driver-in-kernel compatibility dependency.
Proposed Architecture
- capOS kernel (minimal): exposes only resource and mediation capabilities.
- gpu-device service (userspace): receives device-specific caps and exposes a stable GPU capability surface to clients.
- application: receives only `GpuSession`/`GpuBuffer`/`GpuFence` capabilities.
Kernel responsibilities
- Discover GPUs from PCI/ACPI layers.
- Map/register BAR windows and grant a scoped `DeviceMmio` capability.
- Set up interrupt routing and expose a scoped IRQ signaling capability.
- Enforce DMA trust boundaries for process memory offered to the driver.
- Enforce revocation when sessions are closed.
- Handle all faulting paths that would otherwise crash the kernel.
User-space GPU service responsibilities
- Open/initialize one GPU device from device-scoped caps.
- Allocate and track GPU contexts and queues.
- Implement command submission, buffer lifecycle, and synchronization.
- Translate capability calls into driver-specific operations.
- Expose only narrow, capability-typed handles to callers.
Capability Contract (schema additions)
Add to `schema/capos.capnp`:

- `GpuDeviceManager`
  - `listDevices() -> (devices :List(GpuDeviceInfo))`
  - `openDevice(capabilityIndex :UInt32) -> (session :GpuSession)`
- `GpuSession`
  - `createBuffer(bytes :UInt64, usage :Text) -> (buffer :GpuBuffer)`
  - `destroyBuffer(buffer :UInt32) -> ()`
  - `launchKernel(program :Text, grid :UInt32, block :UInt32, bufferList :List(UInt32), fence :GpuFence) -> ()`
  - `submitMemcpy(dst :UInt32, src :UInt32, bytes :UInt64) -> ()`
  - `submitFenceWait(fence :UInt32) -> ()`
- `GpuBuffer`
  - `mapReadWrite() -> (addr :UInt64, len :UInt64)`
  - `unmap() -> ()`
  - `size() -> (bytes :UInt64)`
  - `close() -> ()`
- `GpuFence`
  - `poll() -> (status :Text)`
  - `wait(timeoutNanos :UInt64) -> (ok :Bool)`
  - `close() -> ()`
Exact wire fields are intentionally flexible to keep this proposal at the interface level; method IDs and concrete argument packing should be finalized in the implementation PR.
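To give a feel for the Phase 2 mock backend, here is a hypothetical in-memory sketch of the `GpuSession` surface. Names mirror the sketched schema, but the IDs, error shape, and argument packing are placeholders, not the final wire contract:

```rust
use std::collections::HashMap;

// Hypothetical gpu-mock session: no hardware, just bookkeeping that lets
// a client exercise the create/launch/wait flow end to end.
#[derive(Default)]
struct GpuSession {
    next_id: u32,
    buffers: HashMap<u32, Vec<u8>>, // buffer id -> backing bytes
    fences: HashMap<u32, bool>,     // fence id -> signaled
}

impl GpuSession {
    fn create_buffer(&mut self, bytes: u64) -> u32 {
        self.next_id += 1;
        self.buffers.insert(self.next_id, vec![0; bytes as usize]);
        self.next_id
    }

    // Mock launchKernel: validates inputs, performs no real execution,
    // and signals the fence so the client's wait path completes.
    fn launch_kernel(&mut self, _program: &str, buffer_list: &[u32], fence: u32) -> Result<(), String> {
        for b in buffer_list {
            if !self.buffers.contains_key(b) {
                // Malformed operations return explicit errors, per the
                // acceptance criteria.
                return Err(format!("unknown buffer {b}"));
            }
        }
        self.fences.insert(fence, true);
        Ok(())
    }

    fn fence_wait(&self, fence: u32) -> bool {
        *self.fences.get(&fence).unwrap_or(&false)
    }
}
```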
Implementation Phases
Phase 0 (prerequisite): Stage 4 kernel capability syscalls
- Implement capability-call syscall ABI.
- Add the `cap_id`, `method_id`, `params_ptr`, `params_len` path.
- Add kernel/user copy/validation of capnp messages.
- Validate user process permissions before dispatch.
Phase 1: Device mediation foundations
- Add kernel caps: `DeviceManager` / `DeviceMmio` / `InterruptHandle` / `DmaBuffer`.
- Add enough PCI/ACPI discovery to identify NVIDIA-compatible functions.
- Add guarded BAR mapping and scoped grant to an init-privileged service.
- Add a minimal `GpuDeviceManager` service scaffold returning synthetic/empty device handles.
- Add manifest entries for a GPU service binary and launch dependencies.
Phase 2: Service-based mock backend
- Implement a `gpu-mock` userspace service with the same `Gpu*` interface.
- Support no-op buffers and synthetic fences.
- Prove end-to-end:
- init spawns driver
- process opens session
- buffer create/map/wait flows via capability calls
- Add regression checks in integration boot path output.
Phase 3: Real backend integration
- Add actual backend adapter for one concrete GPU driver API available in environment.
- Add:
- queue lifecycle
- fence lifecycle
- DMA registration/validation
- command execution path
- interrupt completion path to service and return through caps
- Keep backend replacement possible via trait-like abstraction in userspace service.
Phase 4: Security hardening
- Add per-session limits for mapped pages and in-flight submissions.
- Add bounded queue depth and timeout enforcement.
- Add explicit revocation propagation:
- session close => all child caps revoked.
- driver crash => all active caps fail closed.
- Add explicit audit hooks for submit/launch calls.
Security Model
The kernel does not grant a user process direct MMIO access.
Processes only receive `GpuSession`/`GpuBuffer`/`GpuFence` capabilities.
The service process receives `DeviceMmio`, `InterruptHandle`, and memory caps derived from its policy.
This ensures:
- No userland process can program BAR registers.
- No userland process can claim untrusted memory for DMA.
- No userland process can observe or reset another session’s state.
Dependencies and Alignment
This proposal depends on:
- Stage 4 capability syscalls.
- Kernel networking/PCI/interrupt groundwork from cloud deployment roadmap.
- Stage 6/7 for richer cross-process IPC and SMP behavior.
It complements:
- Device and service architecture proposals.
- Storage/service manifest execution flow.
- In-process threading work (future queue completion callbacks).
Minimal acceptance criteria
- `make run` boots and prints GPU service lifecycle messages.
- Init spawns the GPU service and grants only device-scoped caps.
- A sample userspace client can:
- create session
- allocate and map a GPU buffer
- submit a synthetic job
- wait on a fence and receive completion
- Attempts to submit unsupported/malformed operations return explicit capnp errors.
- Removing service/session capabilities invalidates descendants without kernel restart.
Risks
- Real NVIDIA closed stack integration may require vendor-specific adaptation.
- Buffer mapping semantics can become complex with paging and fragmentation.
- Interrupt-heavy completion paths require robust scheduling before user-visible completion guarantees.
Open Questions
- Is CUDA mandatory from the first integration, or is the initial surface command-focused (a `gpu-kernel` payload as opaque bytes), with CUDA runtime specifics added later?
- Should memory registration support pinned physical memory only at first?
- Which isolation level is needed for multi-tenant versus single-tenant first phase?
Proposal: Formal MAC/MIC Model and Proof Track
How capOS could move from pragmatic label checks to a formal mandatory access control and mandatory integrity control story suitable for a GOST-style claim.
Problem
Adding a label field to capabilities is not enough to claim formal
mandatory access control. ГОСТ Р 59453.1-2021 frames access control through a
formal model of an abstract automaton: the model describes states, subjects,
objects, containers, rights, accesses, information flows, safety conditions,
and proofs that unsafe accesses or flows cannot arise.
capOS should therefore separate two levels:
- Pragmatic label policy. Userspace brokers and wrapper capabilities enforce labels at trusted grant paths and selected method calls.
- Formal MAC/MIC. A documented abstract state machine, safety predicates, transition rules, proof obligations, and an implementation mapping. Only this second level can support a GOST-style claim.
This proposal defines the path to the second level. It is not a claim that capOS currently satisfies it.
Scope
The first formal target should be narrow:
Confidentiality:
No transition creates an unauthorized information flow from an object at a
higher or incomparable confidentiality label to an object at a lower label,
except through an explicit trusted declassifier transition.
Integrity:
No low-integrity or incomparable subject can control a higher-integrity
subject, and no low-integrity subject can write or transfer influence into a
higher-integrity object, except through an explicit trusted upgrader or
sanitizer transition.
The proof should cover capability authority creation and transfer before it covers every device, filesystem, or POSIX compatibility corner. For capOS, capability transfer is the dangerous boundary.
Terminology
The Russian GOST terms to keep straight:
- мандатное управление доступом: mandatory access control for confidentiality.
- мандатный контроль целостности: mandatory integrity control.
- целостность: integrity.
- уровень целостности: integrity level.
- уровень конфиденциальности: confidentiality level.
- субъект доступа: access subject.
- объект доступа: access object.
The standards separate confidentiality MAC from integrity control. capOS should not merge them into one vague label field.
Abstract State
The formal model should be intentionally smaller than the implementation. It models only the security-relevant state.
```
U       set of user accounts / principals
S       set of subjects: processes, sessions, services
O       set of objects: files, namespaces, endpoints, process handles, secrets
C       set of containers: namespaces, directories, stores, service subtrees
E       entities = O union C
K       kernel object identities
Cap     capability handles / hold edges
Hold    relation S -> E with metadata
Own     subject-control or ownership relation
Ctrl    subject-control relation
Flow    observed information-flow relation
Rights  abstract rights: read, write, execute, own, control, transfer
Access  realized accesses: read, write, call, return, spawn, supervise
```
Hold is central. In capOS, authority is represented by capability table
entries and transfer records, not by global paths. A formal model that does
not model capability hold edges will miss the main authority channel.
Suggested hold-edge metadata:
```
HoldEdge {
    subject
    entity
    interface_id
    badge
    transfer_mode
    origin
    confidentiality_label
    integrity_label
}
```
Label Lattices
Use deployment-defined partial orders, not hardcoded government categories.
Example confidentiality lattice:

```
public < internal < confidential < secret
compartments = {project-a, project-b, ops, crypto}
```

`dominates(a, b)` means:

```
level(a) >= level(b)
and compartments(a) includes compartments(b)
```

Integrity should be separate:

```
untrusted < user < service < trusted
domains = {boot, storage, network, auth}
```
The model must specify how labels compose across containers:
- contained entity confidentiality cannot exceed what the container policy permits unless the container explicitly supports mixed labels;
- contained entity integrity cannot exceed the container’s integrity policy;
- a subject-associated object such as a process ring, endpoint queue, or process handle needs labels derived from the subject it controls or exposes.
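The dominance relation above is small enough to state directly. A Rust sketch, assuming levels map to integers and compartments are plain string sets (the level and compartment names are the document's examples, not fixed policy):

```rust
use std::collections::BTreeSet;

// Deployment-defined label: a totally ordered level plus a compartment set.
struct Label {
    level: u8, // e.g. public=0 < internal=1 < confidential=2 < secret=3
    compartments: BTreeSet<&'static str>,
}

// dominates(a, b): level(a) >= level(b) and compartments(a) ⊇ compartments(b).
fn dominates(a: &Label, b: &Label) -> bool {
    a.level >= b.level && b.compartments.is_subset(&a.compartments)
}
```

Note that `dominates` is a partial order: two labels with disjoint compartment sets are incomparable, and neither dominates the other.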
Capability Method Flow Classes
capOS cannot rely on syscall names such as read and write. Each interface
method needs a flow class.
Initial categories:

```
ReadLike       data flows object -> subject
WriteLike      data flows subject -> object
Bidirectional  data flows both ways
ControlLike    subject controls another subject/object lifecycle
TransferLike   authority or a future data path is transferred
ObserveLike    metadata/log/status observation
Declassify     trusted downgrade of confidentiality
Sanitize       trusted upgrade of integrity after validation
NoFlow         lifecycle release or local bookkeeping only
```
Examples:

```
File.read                  ReadLike
File.write                 WriteLike
Namespace.bind             WriteLike + ControlLike
LogReader.read             ReadLike
ManifestUpdater.apply      WriteLike + ControlLike
ProcessSpawner.spawn       ControlLike + TransferLike
ProcessHandle.wait         ObserveLike
ServiceSupervisor.restart  ControlLike
Endpoint.call              depends on endpoint declaration
Endpoint.return            depends on endpoint declaration
CAP_OP_RELEASE             NoFlow
CAP_OP_CALL transfers      TransferLike
CAP_OP_RETURN transfers    TransferLike
```
The flow table is part of the trusted model. Adding a new capability method without classifying its flow should fail review.
Transitions
The abstract automaton should include at least these transitions:
```
create_session(principal, profile)
spawn(parent, child, grants)
copy_cap(sender, receiver, cap)
move_cap(sender, receiver, cap)
insert_result_cap(sender, receiver, cap)
call(subject, endpoint, payload)
return(server, client, result, result_caps)
read(subject, object)
write(subject, object)
bind(subject, namespace, name, object)
supervise(controller, target, operation)
release(subject, cap)
revoke(authority, object)
declassify(trusted_subject, source, target)
sanitize(trusted_subject, source, target)
relabel(trusted_subject, object, new_label)
```
Each transition needs preconditions and effects. Example:
```
copy_cap(sender, receiver, cap):
    pre:
        Hold(sender, cap.entity)
        cap.transfer_mode allows copy
        confidentiality_flow_allowed(cap.entity, receiver)
        integrity_flow_allowed(sender, cap.entity, receiver)
        receiver quota has free cap slot
    effect:
        Hold(receiver, cap.entity) is added
        Flow(cap.entity, receiver, transfer) is recorded when relevant
```
Move is not a shortcut. It has different authority effects but can still create an information/control flow into the receiver.
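For illustration, the copy_cap precondition can be collapsed into a single checker. All types here are hypothetical, and the three-place integrity check is simplified to a two-party sender/receiver comparison for the sketch:

```rust
#[derive(Clone, Copy, PartialEq)]
enum TransferMode {
    CopyAllowed,
    MoveOnly,
}

// Simplified labels: a single numeric level per axis stands in for the
// full lattice (levels plus compartments).
struct CapLabel {
    conf: u8,
    integ: u8,
}

struct Cap {
    transfer_mode: TransferMode,
    label: CapLabel,
}

struct Subject {
    clearance: u8,      // confidentiality clearance
    integrity: u8,      // integrity level
    free_cap_slots: u32,
    holds_cap: bool,    // stands in for Hold(sender, cap.entity)
}

fn copy_cap_allowed(sender: &Subject, receiver: &Subject, cap: &Cap) -> bool {
    sender.holds_cap                                    // Hold(sender, cap.entity)
        && cap.transfer_mode == TransferMode::CopyAllowed
        && receiver.clearance >= cap.label.conf         // confidentiality_flow_allowed:
                                                        // receiver may observe the entity
        && sender.integrity >= receiver.integrity       // integrity_flow_allowed, collapsed
                                                        // to two parties for the sketch
        && receiver.free_cap_slots > 0                  // quota has a free cap slot
}
```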
Safety Predicates
Confidentiality:

```
read_allowed(s, e):
    clearance(s) dominates classification(e)

write_allowed(s, e):
    classification(e) dominates current_confidentiality(s)

flow_allowed(src, dst):
    classification(dst) dominates classification(src)
```

No write down follows from `classification(dst) dominates classification(src)`.

Integrity:

```
integrity_write_allowed(s, e):
    integrity(s) >= integrity(e)

control_allowed(controller, target):
    integrity(controller) >= integrity(target)

integrity_flow_allowed(src, dst):
    integrity(src) >= integrity(dst)
```
The exact inequality direction must be validated against the chosen integrity semantics. The intent is that low-integrity subjects cannot modify or control high-integrity subjects or objects.
Subject control:

```
supervise_allowed(controller, target):
    confidentiality/control labels are compatible
    and integrity(controller) >= integrity(target)
    and Hold(controller, ServiceSupervisor(target)) exists
```

Authority graph:

```
all live authority is represented by Hold
every Hold edge has a live cap table slot or trusted kernel root
no transition creates Hold without passing transfer/spawn/broker preconditions
```
Proof Shape
The proof is an invariant proof over the abstract automaton:
```
Base:
    initial_state satisfies Safety

Step:
    for every transition T:
        if Safety(state) and Precondition(T, state),
        then Safety(apply(T, state))
```
The transition proof must explicitly cover:
- `spawn` grants,
- copy transfer,
- move transfer,
- result-cap insertion,
- endpoint call and return,
- namespace bind,
- supervisor operations,
- declassification,
- sanitization,
- relabel,
- revocation and release preserving consistency.
The proof must also state what it does not cover:
- physical side channels,
- timing channels not modeled by `Flow`,
- bugs below the abstraction boundary,
- device DMA until `DMAPool`/IOMMU boundaries are modeled,
- persistence/replay until persistent object identity is modeled.
Tooling Plan
Start with lightweight formal tools, then deepen only if the model stabilizes.
TLA+
Best first tool for capOS because capability transfer, spawn, endpoint delivery, and revocation are state transitions. Use TLA+ to model:
- sets of subjects, objects, labels, and hold edges,
- bounded transfer/spawn/call transitions,
- invariants for confidentiality, integrity, and hold-edge consistency.
TLC can find counterexamples early. Apalache is worth evaluating later for symbolic checking if TLC state explosion becomes painful.
Alloy
Useful for relational counterexample search:
- label lattice dominance,
- container hierarchy invariants,
- hold-edge graph consistency,
- “can a path of transfers create forbidden flow?” queries.
Alloy complements TLA+; it does not replace transition modeling.
Coq, Isabelle, or Lean
Only after the model stops moving. These tools are appropriate for a durable machine-checked proof artifact. They are expensive if the policy surface is still changing.
Kani / Prusti / Creusot
Use these for implementation-level Rust obligations after the abstract model exists:
- cap table generation/index invariants,
- transfer transaction rollback,
- label dominance helper correctness,
- quota reservation/release balance,
- wrapper cap narrowing properties.
They do not replace the abstract automaton proof.
Implementation Mapping
The proof track must produce implementation obligations that code review and tests can check.
Required implementation hooks:
- every kernel object that participates in policy has a stable `ObjectId`;
- every labeled object has a `MandatoryLabel`;
- every hold edge or capability entry records enough label metadata for transfer checks;
- every capability method has a flow class;
- every transfer path calls one shared label/flow checker;
- every spawn grant uses the same checker as transfer;
- every endpoint has declared flow policy;
- every declassifier/sanitizer is an explicit capability and audited;
- every relabel operation is explicit and audited;
- every wrapper cap preserves or narrows authority and labels;
- process exit and release remove hold edges without leaving ghost authority.
The current pragmatic userspace broker model is allowed as an earlier stage, but the implementation mapping must identify where it is bypassable. Any path that lets untrusted code transfer labeled authority without the broker must move into the kernel-visible checked path before a formal MAC/MIC claim.
Testing and Review Gates
Before implementing kernel-visible labels:
- write the TLA+ or Alloy model;
- include at least one counterexample-driven test showing a rejected unsafe transfer in the model;
- document every transition that is intentionally out of scope.
Before claiming pragmatic MAC/MIC:
- broker and wrapper caps enforce labels at grant paths;
- audit records every grant, denial, and relabel/declassify operation;
- QEMU demo shows a denied high-to-low transfer and a permitted trusted declassification.
Before claiming GOST-style MAC/MIC:
- abstract automaton is written;
- safety predicates are explicit;
- all modeled transitions preserve safety;
- implementation obligations are mapped to code paths;
- transfer/spawn/result-cap insertion cannot bypass label checks;
- limitations and non-modeled channels are documented.
Integration With Existing Plans
This proposal depends on:
- authority graph and resource accounting (authority-accounting-transfer-design.md);
- user/session policy services (user-identity-and-policy-proposal.md);
- capability transfer and result-cap insertion (capability-model.md);
- DMA isolation before user drivers become part of the labeled model (dma-isolation-design.md);
- security verification tooling (security-and-verification-proposal.md).
Non-Goals
- No certification claim.
- No claim that current capOS implements GOST-style MAC/MIC.
- No attempt to model all side channels in the first version.
- No kernel policy language interpreter.
- No POSIX `uid`/`gid` authorization.
- No label field without transition rules and proof obligations.
Open Questions
- What is the smallest useful label lattice for the first demo?
- Should labels live on objects, hold edges, or both?
- Should endpoint flow policy be static per endpoint, per method, or per transferred cap?
- How should declassifier and sanitizer capabilities be scoped and audited?
- Which channels must be modeled as memory flows versus time flows?
- Is TLA+ sufficient for the first formal artifact, or should the relational parts start in Alloy?
- Which parts of ГОСТ Р 59453.1-2021 should be treated as direct goals versus inspiration for a capOS-native formal model?
References
- ГОСТ Р 59383-2021, access-control foundations: https://lepton.ru/GOST/Data/752/75200.pdf
- ГОСТ Р 59453.1-2021, formal access-control model: https://meganorm.ru/Data/750/75046.pdf
Proposal: Running capOS in the Browser (WebAssembly, Worker-per-Process)
How capOS goes from “boots in QEMU” to “boots in a browser tab,” with each capOS process executing in its own Web Worker and the kernel acting as the scheduler/dispatcher across them.
The goal is a teaching and demo target, not a production runtime. It should
preserve the capability model — typed endpoints, ring-based IPC, no ambient
authority — while replacing the hardware substrate (page tables, IDT,
preemptive timer, privilege rings) with browser primitives (Worker
boundaries, SharedArrayBuffer, Atomics.wait/notify).
Depends on: Stage 5 (Scheduling), Stage 6 (IPC) — the capability ring is
the only kernel/user interface we want to port. Anything still sitting behind
the transitional write/exit syscalls must migrate to ring opcodes first.
Complements: userspace-binaries-proposal.md (language/runtime story),
service-architecture-proposal.md (process lifecycle). A browser port
stresses both: the runtime must build for wasm32-unknown-unknown, and
process spawn becomes “instantiate a Worker” rather than “map an ELF.”
Non-goals:
- Running the existing x86_64 kernel unmodified in the browser. That’s a separate question (QEMU-WASM / v86) and is a simulator, not a port.
- Emulating the MMU, IDT, or PIT in WASM. The whole point is to replace them with primitives the browser already gives us for free.
- Any persistence, networking, or storage beyond what a hosted demo needs.
Current State
capOS is x86_64-only. Arch-specific code lives under kernel/src/arch/x86_64/
and relies on:
| Mechanism | File | Browser equivalent |
|---|---|---|
| Page tables, W^X, user/kernel split | mem/paging.rs, arch/x86_64/smap.rs | Worker + linear-memory isolation (structural) |
| Preemptive timer (PIT @ 100 Hz) | arch/x86_64/pit.rs, idt.rs | setTimeout/MessageChannel + cooperative yield |
| Syscall entry (SYSCALL/SYSRET) | arch/x86_64/syscall.rs | Direct Atomics.notify on ring doorbell |
| Context switch | arch/x86_64/context.rs | None — each process is its own Worker, OS schedules |
| ELF loading | elf.rs, main.rs | WebAssembly.instantiate from module bytes |
| Frame allocator | mem/frame.rs | memory.grow inside each instance |
| Capability ring | capos-config/src/ring.rs, cap/ring.rs | Reused unchanged — shared via SharedArrayBuffer |
| CapTable, CapObject | capos-lib/src/cap_table.rs | Reused unchanged in kernel Worker |
The capability-ring layer is the only stable interface that survives the port
intact. Everything below cap/ring.rs is arch work; everything above is
schema-driven capnp dispatch that doesn’t care about the substrate.
Architecture
```mermaid
flowchart LR
  subgraph Tab[Browser Tab / Origin]
    direction LR
    Main[Main thread<br/>xterm.js, UI, loader]
    subgraph KW[Kernel Worker]
      Kernel[capOS kernel<br/>CapTable, scheduler,<br/>ring dispatch]
    end
    subgraph P1[Process Worker #1<br/>init]
      RT1[capos-rt] --> App1[init binary]
    end
    subgraph P2[Process Worker #2<br/>service<br/>spawned by init]
      RT2[capos-rt] --> App2[service binary]
    end
    SAB1[(SharedArrayBuffer<br/>ring #1)]
    SAB2[(SharedArrayBuffer<br/>ring #2)]
    Main <-->|postMessage| KW
    KW <-->|SAB + Atomics| SAB1
    KW <-->|SAB + Atomics| SAB2
    P1 <-->|SAB + Atomics| SAB1
    P2 <-->|SAB + Atomics| SAB2
    P1 -.spawn.-> KW
    KW -.new Worker.-> P2
  end
```
One Worker per capOS process. Each process is a WASM instance in its own
Worker, with its own linear memory. Cross-process access is structurally
impossible — postMessage and shared ring buffers are the only channels.
Kernel in a dedicated Worker. Not on the main thread: the main thread is
reserved for UI (terminal, loader, error display). The kernel Worker owns
the CapTable, holds the Arc<dyn CapObject> registry, dispatches SQEs,
and maintains one SharedArrayBuffer per process for that process’s
ring. It directly spawns init; all further processes are created via the
ProcessSpawner cap it serves.
Capability ring over SharedArrayBuffer. The existing
CapRingHeader/CapSqe/CapCqe layout in capos-config/src/ring.rs already
uses volatile access helpers for cross-agent visibility. Mapping it onto a
SharedArrayBuffer is a change of backing store, not of protocol. Both sides
see the same bytes; Atomics.load/Atomics.store replace the volatile reads
on the host side; on the Rust/WASM side the existing read_volatile/
write_volatile lower to plain atomic loads/stores under
wasm32-unknown-unknown with the atomics feature enabled.
cap_enter becomes Atomics.wait. The process Worker calls
Atomics.wait on a doorbell word in the SAB after publishing SQEs. The
kernel Worker (or its scheduler tick) calls Atomics.notify after producing
completions. That is exactly the io_uring-inspired “syscall-free submit,
blocking wait on completion” the ring was designed around — the browser
happens to give us the primitive for free.
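The doorbell handshake can be sketched natively. In the browser port the yield loop below is a single `Atomics.wait` and the store is followed by `Atomics.notify`; treat this as an illustration of the protocol, not the wasm implementation:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// Doorbell states: a single u32 word in the shared ring header.
const ASLEEP: u32 = 0;
const SIGNALED: u32 = 1;

/// Process side: park until the kernel rings the doorbell, then consume
/// the signal. In the browser port the loop body is `Atomics.wait`.
pub fn wait_doorbell(door: &AtomicU32) {
    while door
        .compare_exchange(SIGNALED, ASLEEP, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        thread::yield_now(); // stand-in for Atomics.wait
    }
}

/// Kernel side: publish CQEs to the ring first, then wake the waiter.
/// In the browser port the store is followed by `Atomics.notify`.
pub fn ring_doorbell(door: &AtomicU32) {
    door.store(SIGNALED, Ordering::Release);
}
```

The Release store and Acquire consume pair is what makes CQEs published before the doorbell visible to the woken process.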
No preemption inside a process. A Worker runs to completion on its event
loop turn; the kernel can’t interrupt it. This is fine: each process is
single-threaded in its own isolate, and the scheduler only needs to wake the
next process after Atomics.wait, not forcibly remove the running one.
This is closer to a cooperative capnp-rpc vat model than to the current
timer-preempted kernel, and matches what the capability ring already assumes.
Mapping capOS Concepts to WASM/Browser
Process isolation
The Worker boundary replaces the page table. Two capOS processes cannot
observe each other’s linear memory, cannot jump into each other’s code (code
is out-of-band in WASM — not addressable as data), and cannot share globals.
The SharedArrayBuffer containing the ring is the only intentional shared
region, and it is created by the kernel Worker and transferred to the process
Worker at spawn time.
No W^X enforcement is needed within a Worker because WASM has no writable
code region to begin with — WebAssembly.Module is validated and immutable.
The MMU’s job is done by the WASM type system and validator.
Address space / memory
Each Worker’s WASM instance has one linear memory. capos-rt’s fixed heap
initialization uses memory.grow instead of VirtualMemory::map. The
VirtualMemory capability still exists in the schema, but its
implementation in the browser port is a thin wrapper over memory.grow with
bookkeeping for “logical unmap” (zeroing + tracking a free list — WASM
doesn’t return pages to the host).
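The logical-unmap bookkeeping can be sketched in a few lines. The class and field names are hypothetical; only the core constraint is real: `memory.grow` returns the previous size in pages and never gives pages back, so "unmap" must zero and recycle.

```javascript
const PAGE = 65536; // wasm page size

// Sketch: map/unmap over memory.grow with a free list. Unmapped ranges
// are zeroed (so stale data never leaks to the next mapping) and kept
// for reuse, because wasm cannot return linear memory to the host.
class PageAllocator {
  constructor(memory) {
    this.memory = memory;
    this.free = [];                                      // [{ addr, pages }]
  }
  map(pages) {
    const i = this.free.findIndex(r => r.pages === pages);
    if (i >= 0) return this.free.splice(i, 1)[0].addr;   // reuse a freed range
    return this.memory.grow(pages) * PAGE;               // grow() returns old size
  }
  unmap(addr, pages) {
    new Uint8Array(this.memory.buffer, addr, pages * PAGE).fill(0);
    this.free.push({ addr, pages });                     // logical unmap only
  }
}

const alloc = new PageAllocator(new WebAssembly.Memory({ initial: 1 }));
const a = alloc.map(2);        // grows: range starts at 1 * PAGE
alloc.unmap(a, 2);
const b = alloc.map(2);        // reused without growing
console.log(a === b);          // true
```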
Protection flags (PROT_READ/PROT_WRITE/PROT_EXEC) become no-ops with a
documented caveat in the proposal: the browser port does not enforce
intra-process protection. Cross-process protection is structural and
stronger than the native build.
Syscalls
The three transitional syscalls (write, exit, cap_enter) collapse to:
- `write` — already slated for removal once init is cap-native. In the browser port, do not implement it at all. Force the port to drive the existing cap-native Console ring path, which forces the rest of the tree to be cap-native too. A forcing function, not a cost.
- `exit` — `postMessage({type: 'exit', code})` to the kernel Worker, which terminates the Worker via `worker.terminate()` and reaps the process entry.
- `cap_enter` — `Atomics.wait` on the ring doorbell after publishing SQEs, with a `waitAsync` variant for cooperative mode if we ever want to avoid blocking the Worker’s event loop.
Scheduler
Round-robin is gone; the browser scheduler is the OS scheduler. The kernel Worker’s “scheduler” is reduced to:
- A poll loop that drains each process’s SQ (the existing `cap/ring.rs::process_sqes` logic, called on every `notify` or on a `setTimeout(0)` tick).
- A completion-fanout step that pushes CQEs and `Atomics.notify`s the target Worker.
No context switch, no run queue, no per-process kernel stack. The code
deleted here is exactly the code that smp-proposal.md says needs per-CPU
structures — an orthogonal win: the browser port has no SMP problem because
each process is structurally on its own agent.
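The reduced kernel-Worker tick can be sketched with the ring operations stubbed as plain arrays. `makeProcess`, the process shape, and the `dispatch` callback are illustrative stand-ins, not the real ring code:

```javascript
// Each "process" here is just a submission queue, a completion queue,
// and a doorbell word standing in for its ring's SharedArrayBuffer.
function makeProcess() {
  return { sq: [], cq: [], doorbell: new Int32Array(new SharedArrayBuffer(4)) };
}

// One tick: drain every SQ, produce CQEs, wake Workers that received
// completions. No run queue, no context switch, no per-process stack.
function schedulerTick(processes, dispatch) {
  for (const proc of processes) {
    const sqes = proc.sq.splice(0);          // drain the submission queue
    for (const sqe of sqes) {
      proc.cq.push(dispatch(sqe));           // resolve cap, invoke, build CQE
    }
    if (sqes.length > 0) {
      Atomics.add(proc.doorbell, 0, 1);      // completion fan-out:
      Atomics.notify(proc.doorbell, 0);      // wake the process Worker
    }
  }
}

const p = makeProcess();
p.sq.push({ capId: 1, methodId: 0 });
schedulerTick([p], sqe => ({ userData: sqe.capId, status: 'ok' }));
console.log(p.cq.length);                    // 1
```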
Process spawning
The kernel Worker spawns exactly one process Worker directly — init —
with a fixed cap bundle: Console, ProcessSpawner, FrameAllocator,
VirtualMemory, BootPackage, and any host-backed caps (Fetch,
etc.) granted to it.
// Kernel Worker bootstrap
const initMod = await WebAssembly.compileStreaming(fetch('/init.wasm'));
const initRing = new SharedArrayBuffer(RING_SIZE);
const initWorker = new Worker('process-worker.js', {type: 'module'});
const initCapSet = buildInitCapBundle();
kernel.registerProcess(initWorker, initRing, initCapSet);
initWorker.postMessage(
    {type: 'boot', mod: initMod, ring: initRing, capSet: initCapSet,
     bootPackage: manifestBytes},
    [/* transfer */]);
All further processes come from init invoking ProcessSpawner.spawn.
ProcessSpawner is served by the kernel Worker; each invocation:
- Compiles the referenced binary bytes (`WebAssembly.compile` over the `NamedBlob` from `BootPackage`).
- Creates a new `Worker` and a `SharedArrayBuffer` for its ring.
- Builds the child’s `CapTable` from the `ProcessSpec` the caller passed, applying move/copy semantics to caps transferred from the caller’s table.
- Returns a `ProcessHandle` cap.
Init composes service caps in userspace: hold Fetch, attenuate to
per-origin HttpEndpoint, hand each child only the caps its
ProcessSpec names. Same shape as native after Stage 6.
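The move/copy step reduces to plain map manipulation. A sketch under assumptions: the function and descriptor field names (`mode`, `capId`) are illustrative, and the real `CapTable` lives in the kernel Worker rather than in caller-visible JS.

```javascript
// Build a child's cap table from transfer descriptors in a ProcessSpec.
// 'copy' leaves the caller's entry in place; 'move' removes it, so the
// caller loses the authority it handed over.
function buildChildTable(parentTable, transfers) {
  const child = new Map();
  for (const t of transfers) {
    const entry = parentTable.get(t.capId);
    if (entry === undefined) {
      throw new Error(`caller does not hold cap ${t.capId}`);
    }
    child.set(t.capId, entry);
    if (t.mode === 'move') parentTable.delete(t.capId);
  }
  return child;
}

const parent = new Map([[1, 'Console'], [2, 'Fetch']]);
const child = buildChildTable(parent, [
  { capId: 1, mode: 'copy' },   // both now hold Console
  { capId: 2, mode: 'move' },   // only the child holds Fetch
]);
console.log(child.size, parent.has(2));  // 2 false
```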
Host-backed capability services
Some capabilities in the browser port are implemented by talking to the
browser rather than to hardware. Fetch and HttpEndpoint — drafted in
service-architecture-proposal.md —
are the canonical example. On native capOS they run over a userspace
TCP/IP stack on virtio-net/ENA/gVNIC. In the browser port, the service
process is replaced by a thin implementation living in the kernel Worker
(or a dedicated “host bridge” Worker) that dispatches each capnp call
by calling fetch / new WebSocket and returning the response as a
CQE. The attenuation story is unchanged: Fetch can reach any URL,
HttpEndpoint is bound to one origin at mint time, derived from
Fetch by a policy process.
This is not a back door. The capability is granted through the manifest
exactly as on native. Processes without the cap cannot reach the host’s
network, cannot discover it, and cannot forge one. The only difference
from native is the implementation of the service behind the CapObject
trait — same schema, same TYPE_ID, same error model.
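The mint-time origin binding reduces to one check in the host bridge before any `fetch` is issued. A sketch: the entry shape and the convention that `origin: null` marks an unattenuated Fetch cap are assumptions of this example, not the drafted schema.

```javascript
// Decide whether a host-backed HTTP cap may reach a URL. A full Fetch
// cap (origin === null) may reach anything; an HttpEndpoint derived
// from it is pinned to one origin when it is minted.
function originAllows(capEntry, urlString) {
  const url = new URL(urlString);
  return capEntry.origin === null || url.origin === capEntry.origin;
}

const fetchCap = { origin: null };
const endpoint = { origin: 'https://example.com' };  // attenuated at mint

console.log(originAllows(fetchCap, 'https://anywhere.test/x'));    // true
console.log(originAllows(endpoint, 'https://example.com/api'));    // true
console.log(originAllows(endpoint, 'https://attacker.test/api'));  // false
```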
The same pattern applies to anything else the browser provides natively. Candidate future interfaces (no schema yet, mentioned so the port is considered when they are designed):
- `Clipboard` over `navigator.clipboard`
- `LocalStorage`/`KvStore` over IndexedDB (natural `Store` backend for the storage proposal in the browser)
- `Display`/`Canvas` over an `OffscreenCanvas` posted back to the main thread
- `RandomSource` over `crypto.getRandomValues` — trivial but needs a cap rather than a syscall
Other drafted network interfaces — TcpSocket, TcpListener,
UdpSocket, NetworkManager from
networking-proposal.md — do not have a clean
browser mapping. The browser exposes no raw-socket primitives, so these
caps cannot be served in the browser port at all. Applications that need
networking in the browser must go through Fetch/HttpEndpoint, and
the POSIX shim’s socket path must detect the absence of NetworkManager
and route connect("http://...") through Fetch instead (or fail
closed for other schemes). CloudMetadata from
cloud-metadata-proposal.md is simply not
granted in the browser; there is no cloud instance to describe.
Each host-backed cap is opt-in per-process via the manifest; each has a native counterpart that the schema is already the contract for. This is a substantial point in favor of the port: host-provided services slot into the existing capability model without widening it.
CapSet bootstrap
The read-only CapSet page at CAPSET_VADDR is replaced by a structured-clone
payload in the initial postMessage. capos-rt::capset::find still parses
the same CapSetHeader/CapSetEntry layout, just out of a Uint8Array
placed at a known offset in the process’s linear memory by the boot shim.
Binary Portability
Source-portable, not binary-portable. An ELF built for x86_64-unknown-capos
does not run; the same source rebuilt for wasm32-unknown-unknown (with the
atomics target feature) does, provided it stays inside the supported API
surface.
Rust binaries on capos-rt
Port cleanly:
- Any binary that uses only `capos-rt`’s public API — typed cap clients (`ConsoleClient`, future `FileClient`, etc.), ring submission/completion, `CapSet::find`, `exit`, `cap_enter`, `alloc::*`.
- Pure computation, `core`/`alloc` containers, serde/capnp message building.
Do not port:
- Anything that uses `core::arch::x86_64`, inline `asm!`, or `global_asm!`.
- Binaries with a custom `_start` or a linker script baking in `0x200000`. capos-rt owns the entry shape; the wasm entry is set by the host (`WebAssembly.instantiate` + an exported init), so the prologue differs.
- `#[thread_local]` relying on FS base until the wasm TLS story is decided (per-Worker globals, or the wasm threads proposal’s TLS).
- Code that assumes a fixed-size static heap region and reaches it with raw pointers. The wasm arch uses `memory.grow`; `alloc::*` hides this, `unsafe { &mut HEAP[..] }` does not.
- Anything that still calls the transitional `write` syscall shim — the browser build deliberately omits it.
Binaries mixing target features across the workspace produce silently broken atomics. A single rustflags set for the browser build is required.
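One way that single flag set could be pinned is a workspace-level `.cargo/config.toml`. This is a sketch, not the project's actual build config: the exact feature list is illustrative (the wasm threads feature set conventionally pairs `+atomics` with `+bulk-memory` and `+mutable-globals`), and the `build-std` stanza assumes a nightly toolchain.

```toml
# Hypothetical .cargo/config.toml for the browser build. One rustflags
# set applied to every crate avoids mixed-target-feature builds with
# silently broken atomics.
[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+atomics,+bulk-memory,+mutable-globals"]

[unstable]
# core/alloc must be rebuilt with the same target features (nightly only).
build-std = ["core", "alloc"]
```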
POSIX binaries (when the shim lands)
The POSIX compatibility layer described in
userspace-binaries-proposal.md Part 4
sits on top of capos-rt. If capos-rt builds for wasm, the shim builds for
wasm, and well-behaved POSIX code rebuilt for a wasm-targeted
libcapos (clang --target=wasm32-unknown-unknown + our libc) ports too.
Ports cleanly:
- Pure computation, string/number handling, data-structure libraries.
- `stdio` over Console / future File caps.
- `malloc`/`free`, C++ `new`/`delete`, static constructors.
- `select`/`poll`/`epoll` implemented over the ring (ring CQEs are exactly the event source these APIs want).
- `posix_spawn` over `ProcessSpawner` — spawning a new process becomes “instantiate a new Worker,” which is the native shape of the browser anyway.
- Networking via `Fetch`/`HttpEndpoint` (drafted in service-architecture-proposal.md) if the manifest grants the cap. The browser port serves these against the host’s `fetch`/WebSocket — not ambient authority, because only processes granted the cap can invoke it. Raw `AF_INET`/`AF_INET6` sockets via the `TcpSocket`/`NetworkManager` interfaces in networking-proposal.md are not available in the browser (no raw-socket primitive); POSIX networking code wants URLs in practice, and a libc shim can map `getaddrinfo` + `connect` + `write` over `Fetch`/`HttpEndpoint` for the HTTP(S) case, failing closed otherwise.
Does not port without new work, possibly ever:
- `fork`. Cannot clone a Worker’s linear memory into a new Worker and resume at the `fork` call site — there is no COW, no MMU, no way to duplicate an opaque WASM module’s mid-execution state. This is the same reason Emscripten/WASI don’t support `fork`. POSIX programs that fork-then-exec can be rewritten to `posix_spawn`; programs that fork-for-concurrency (Apache prefork, some Redis paths) cannot.
- Signals. No preemption inside a Worker means no asynchronous signal delivery. `SIGALRM`, `SIGINT`, `SIGSEGV` all need cooperative polling at best; `kill(pid, SIGKILL)` maps to `worker.terminate()` and nothing finer. `setjmp`/`longjmp` works within a function call tree; `siglongjmp` out of a signal handler does not exist.
- `mmap` of files with `MAP_SHARED`. WASM linear memory is not file-backed and cannot be. `MAP_PRIVATE | MAP_ANONYMOUS` works trivially (it’s just `memory.grow` + a free list). File-backed mappings require a userspace emulation that reads on fault and writes back on unmap — workable for small files, a lie for the memory-mapped-database case.
- Threads without the wasm threads proposal. pthreads over Workers sharing a memory is the only implementation strategy, and it requires the wasm `atomics`/`bulk-memory`/`shared-memory` feature set plus careful runtime support. Single-threaded POSIX code works now; multithreaded POSIX code needs the in-process-threading track from the native roadmap and its wasm counterpart.
- Address-arithmetic tricks. Wasm validates loads/stores against the linear-memory bounds. Code that relies on unmapped trap pages (guard pages, end-of-allocation sentinels) or on specific virtual addresses fails.
- `dlopen`. A wasm module is immutable after instantiation. Dynamic loading requires loading a second module and linking via exported tables — possible with the component model, nowhere near drop-in `dlopen`. Static linking is the pragmatic answer.
Rough guide: if a POSIX program compiles cleanly under WASI and uses only WASI-supported syscalls, it will almost certainly port to capOS-on-wasm with the shim, because the constraints overlap. If it needs features WASI doesn’t support (fork, signals, shared mmap), the capOS browser port will not magically fix that — the limitations come from the substrate, not from the POSIX shim’s completeness.
Build Path
Three new cargo targets, no workspace restructuring required:
- `capos-lib` on `wasm32-unknown-unknown`. Already `no_std + alloc`, no arch-specific code. Should build as-is; verify under `cargo check --target wasm32-unknown-unknown -p capos-lib`.
- `capos-config` on `wasm32-unknown-unknown`. Same — pure logic, the ring structs and volatile helpers are portable.
- `capos-rt` on `wasm32-unknown-unknown` with the `atomics` feature. The standalone userspace runtime currently hard-codes x86_64 syscall instructions. Introduce an `arch` module split:
  - `arch/x86_64.rs` (existing `syscall.rs` contents)
  - `arch/wasm.rs` (new — `Atomics.wait` via `core::arch::wasm32::memory_atomic_wait32`, `exit` via host import)

  Gate at the `syscall` boundary, not deeper; the ring client above it is arch-agnostic.
- Demos on `wasm32-unknown-unknown`. Same arch split applied via `capos-rt`. No per-demo changes expected if the split is clean.
The kernel does not build for wasm. Instead, a new crate
capos-kernel-wasm/ (peer to kernel/) reuses capos-lib’s CapTable and
capos-config’s ring structs and implements the dispatch loop against JS
host imports for Worker management. It is, deliberately, not the same kernel
binary. Trying to build kernel/ for wasm would pull in IDT/GDT/paging code
that has no meaning in the browser.
Phased Plan
Phase A: Port the pure crates
- Verify `capos-lib` and `capos-config` build clean on `wasm32-unknown-unknown`. CI job: `cargo check --target wasm32-unknown-unknown -p capos-lib -p capos-config`.
- Add a host-side `ring-tests-js` harness that exercises the same invariants as `tests/ring_loom.rs` but with a real JS producer and a Rust/wasm consumer, both sharing a `SharedArrayBuffer`. Proves the volatile access helpers are portable before anything else depends on them.
Phase B: capos-rt arch split
- Introduce `capos-rt/src/arch/{x86_64,wasm}.rs` behind a `#[cfg(target_arch)]`.
- Rewire `syscall`/`ring`/`client` to call through the arch module.
- Add a `make capos-rt-wasm-check` target. Existing `make capos-rt-check` stays for x86_64.
Phase C: Kernel Worker + init
- `capos-kernel-wasm/` with a Console capability that renders to xterm.js via `postMessage` back to the main thread.
- Kernel Worker spawns init. Init prints “hello” through Console and exits.
Phase D: ProcessSpawner + Endpoint
- `ProcessSpawner` served by the kernel Worker, granted to init.
- Init parses its `BootPackage` and spawns the `endpoint-roundtrip` and `ipc-server`/`ipc-client` demos via `ProcessSpawner.spawn`. These stress capability transfer across Workers: does a cap handed from A to B via the ring land correctly in B’s ring, and does B’s subsequent invocation route back to the right holder?
- This phase turns the port into a validation surface for the capability-transfer and badge-propagation invariants in `docs/authority-accounting-transfer-design.md`, and a second implementation of the Stage 6 spawn primitive.
Phase E: Integration with demos page
- Hosted page at a project URL; xterm.js terminal; selector for which demo manifest to boot.
- Serve `.wasm` artifacts as static assets.
Security Boundary Analysis
The browser port changes what is trusted and what is verified. Summary:
| Boundary | Native (x86_64) | Browser (WASM-Workers) |
|---|---|---|
| Process ↔ process | Page tables + rings | Worker agents + SAB (structural) |
| Process ↔ kernel | Syscall MSRs + SMEP/SMAP | postMessage + validated host imports |
| Code integrity | W^X + NX | WASM validator + immutable Module |
| Capability forgery | Kernel-owned CapTable | Kernel-Worker-owned CapTable |
| Capability transfer | Ring SQE validated in kernel | Ring SQE validated in kernel Worker — same code path |
The capability-forgery story is the same in both: an unforgeable 64-bit
CapId is assigned by the kernel and can only be resolved through the
kernel’s CapTable. A process Worker cannot synthesize a valid CapId
because it never sees the CapTable; it only sees SQEs it submits and CQEs
it receives. This property is what makes the port worth doing — the
capability model is preserved exactly.
What weakens: no SMAP/SMEP equivalent, but also no corresponding attack
surface (the “kernel” Worker has no pointer into process memory; it can only
copy bytes out of the shared ring). No DMA problem. No side-channel parity
with docs/dma-isolation-design.md — Spectre/meltdown in the browser is the
browser’s problem, mitigated by site isolation and COOP/COEP.
Required headers: Cross-Origin-Opener-Policy: same-origin and
Cross-Origin-Embedder-Policy: require-corp — SharedArrayBuffer is gated
on these. A hosted demo page must set them.
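For a Node-served demo page, attaching the headers in one place might look like the following sketch. The helper name and mock response object are illustrative; the header names and values are the standard ones `SharedArrayBuffer` requires.

```javascript
// Cross-origin isolation headers: without both of these on the page and
// its subresources, crossOriginIsolated is false and the page cannot
// construct a SharedArrayBuffer.
const COI_HEADERS = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

function withCoiHeaders(res) {
  for (const [name, value] of Object.entries(COI_HEADERS)) {
    res.setHeader(name, value);
  }
  return res;
}

// Usage with the stock http module (static-file serving elided):
//   http.createServer((req, res) => { withCoiHeaders(res); /* serve */ });

// Demonstrate against a mock response object:
const recorded = {};
withCoiHeaders({ setHeader: (n, v) => { recorded[n] = v; } });
console.log(recorded['Cross-Origin-Opener-Policy']);  // same-origin
```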
What This Port Buys Us
- Shareable demos. A URL that boots capOS in ~1s, with no QEMU, no local install. Valuable for documentation and recruiting.
- A second substrate for the capability model. If the cap-transfer protocol has a bug, reproducing it under Workers (single-threaded, deterministic scheduling) is much easier than under SMP x86_64. A second implementation of the dispatch surface is a correctness asset.
- Forcing function for `write` syscall removal. The browser port cannot support the transitional `write` path without importing host I/O as a back door, which is exactly the ambient authority we want to avoid. Shipping a browser demo at all requires finishing the migration to the Console capability over the ring.
- Teaching surface. Workers give a much clearer visual of “one process, one memory, one cap table” than a bare-metal kernel ever will. The isolation story renders in the DevTools panel.
What It Does Not Buy Us
- Not a validation surface for the x86_64 kernel. Page tables, IDT, context switch, SMP — none of that runs. Bugs in those subsystems will not appear in the browser build.
- Not a performance story. WASM + Workers + SAB is slower than native QEMU-KVM for the parts it does overlap on, and does not exercise the hardware features capOS eventually cares about (IOMMU, NVMe, virtio-net).
- Not a path to “capOS on Cloudflare Workers” or similar. Cloudflare’s runtime is a single isolate per request, no SAB, no threads — a different environment that would need its own proposal.
Open Questions
- Do we ship one `capos-kernel-wasm` crate, or does the kernel Worker run plain JS that imports a thin `capos-dispatch` wasm? JS-hosted kernel is simpler (no second wasm toolchain for the kernel side) but duplicates cap-dispatch logic. Preferred: Rust/wasm kernel Worker reusing `capos-lib` — dispatch code stays single-sourced.
- How do we surface kernel panics in the browser? Native capOS halts the CPU; the browser equivalent is posting an error to the main thread and tearing down all Workers. Should match the `panic = "abort"` contract — no recovery attempted.
- Do we implement `VirtualMemory` as a no-op or as a real allocator? No-op is faster to ship; a real allocator over `memory.grow` exercises more of the capability surface. Lean toward real, gated behind a `browser-shim` flag so the demo doesn’t silently diverge from the native semantics.
- Manifest format: keep capnp, or add JSON for hand-authored demo configs? Keep capnp. The manifest is already the contract; adding a parallel format is exactly the drift the project has been careful to avoid.
Relationship to Other Proposals
- `userspace-binaries-proposal.md` — the wasm32 runtime story lives there eventually. This proposal is narrower: just enough runtime to boot the existing demo set in a browser. If the userspace proposal lands a richer runtime first, this one adopts it.
- `smp-proposal.md` — structurally irrelevant to the browser port (each Worker is its own agent). The browser port does inform SMP testing, because the cap-transfer protocol under Workers is a cleaner model of “messages cross agents asynchronously” than single-CPU preempted kernels.
- `service-architecture-proposal.md` — process spawn in the browser becomes Worker instantiation. The lifecycle primitives (supervise, restart, retarget) map naturally. Live upgrade (live-upgrade-proposal.md) is even more natural under Workers than under in-kernel retargeting — swap the `WebAssembly.Module` behind a Worker while the ring stays live.
- `security-and-verification-proposal.md` — the browser port adds a CI job (wasm builds + JS-side ring tests) but does not change the verification story for the native kernel.
Rejected Proposal: Cap’n Proto SQE Envelope
Status: rejected.
Proposal
Replace the fixed C-layout CapSqe descriptor with a fixed-size padded
Cap’n Proto message. Each SQ slot would contain a serialized single-segment
Cap’n Proto struct with a union for call, recv, return, release, and
finish, then zero padding to the chosen SQE size.
For a 128-byte slot, the rough layout would be:
+0x00 u32 segment_count_minus_one
+0x04 u32 segment0_word_count
+0x08 word root pointer
+0x10 RingSqe data words, including union discriminant
+0x?? zero padding to 128 bytes
A compact schema would need to keep fields flat to avoid pointer-heavy nested payload structs:
struct RingSqe {
userData @0 :UInt64;
capId @1 :UInt32;
methodId @2 :UInt16;
flags @3 :UInt16;
addr @4 :UInt64;
len @5 :UInt32;
resultAddr @6 :UInt64;
resultLen @7 :UInt32;
callId @8 :UInt32;
union {
call @9 :Void;
recv @10 :Void;
return @11 :Void;
release @12 :Void;
finish @13 :Void;
}
}
Potential Benefits
A Cap’n Proto SQE envelope would make the ring operation shape schema-defined instead of Rust-struct-defined. That has some real advantages:
- The ABI documentation would live in `schema/capos.capnp` next to the capability interfaces.
- Future userspace runtimes in Rust, C, Go, or another language could use generated accessors instead of hand-mirroring a packed descriptor layout.
- The operation choice could be represented as a schema union, making it clear that fields meaningful for CALL are not meaningful for RECV or RETURN.
- Cap’n Proto defaulting gives a familiar path for adding optional fields while letting older readers ignore fields they do not understand.
- Ring dumps and traces could be decoded with generic Cap’n Proto tooling.
- A single “everything crossing this boundary is Cap’n Proto” rule is architecturally simpler to explain.
Those benefits are mostly about schema uniformity, generated bindings, and tooling. They do not remove the need for an operation discriminator; they move it from an explicit fixed descriptor field to a Cap’n Proto union tag.
Rationale For Rejection
The SQE is the fixed control-plane descriptor for a hostile kernel boundary. It should be cheap to classify and validate before any operation-specific payload parsing. A Cap’n Proto SQE envelope would still have a discriminator, but would move it into generated reader state and require Cap’n Proto message validation before the kernel even knows whether the entry is a CALL, RECV, or RETURN.
Cap’n Proto framing also consumes slot space: a single-segment message needs a segment table and root pointer before the struct data. A flat 64-byte envelope would be tight and brittle; a 128-byte envelope would spend much of the slot on framing and padding. Nested payload structs are worse because they add pointers inside the ring descriptor.
The accepted split is:
- fixed `#[repr(C)]` ring descriptors for SQ/CQ control state;
- Cap’n Proto for capability method params, results, and higher-level transport payloads where schema evolution is valuable;
- endpoint delivery metadata in a small fixed `EndpointMessageHeader` followed by opaque params bytes.
There is also a layering issue. The capability ring is part of the local Cap’n Proto transport implementation: it is the mechanism that moves capnp calls, returns, and eventually release/finish/promise bookkeeping between a process and the kernel. The SQE itself is therefore below ordinary Cap’n Proto message usage. Making the transport substrate depend on parsing Cap’n Proto messages to discover which transport operation to perform would couple the transport implementation to the protocol it is supposed to carry. Method params and results are proper Cap’n Proto messages; the ring descriptor is the framing/control structure that gets the transport to the point where those messages can be interpreted.
This keeps queue geometry simple, preserves bounded hostile-input handling, and avoids running a Cap’n Proto parser on the hot descriptor path.
Rejected Proposal: Sleep(INF) Process Termination
Status: rejected.
Concern
Unix-style zombies are a poor fit for capOS. A terminated child should not keep
its address space, cap table, endpoint state, or other authority alive merely
because a parent has not waited yet. The remaining observable state should be a
small, capability-scoped completion record, and only holders of the corresponding
ProcessHandle should be able to observe it.
The current ProcessHandle.wait() -> exitCode :Int64 shape is also too weak for
future lifecycle semantics. Raw numeric status cannot distinguish normal
application exit from abandon, kill, fault, startup failure, runtime panic, or
supervisor policy actions without inventing process-wide magic numbers.
Proposal
Introduce a system sleep operation and treat Sleep(INF) as a special terminal
operation. The argument for this spelling is that a process that never wants to
run again can enter an infinite sleep instead of becoming a zombie. The kernel
would recognize the infinite case and handle it specially:
- finite
Sleep(duration)blocks the process and wakes it later; Sleep(INF)never wakes, so the kernel tears down the process;- the process’s authority is released as if it had exited;
- parent-visible process completion is either omitted or reported as a special status.
A variant also removes the dedicated sys_exit syscall and makes
Sleep(INF) the only user-visible process termination primitive.
Candidate Semantics
Sleep(INF) as Exit(0)
The simplest version maps Sleep(INF) to normal successful exit.
This is rejected because it lies about intent. A program that completed successfully, a program that intentionally detached, and a program that chose to disappear without status are not the same lifecycle event. Supervisors would see the same status for all of them.
Sleep(INF) as Abandoned
A less lossy version gives Sleep(INF) a distinct terminal status:
struct ProcessStatus {
union {
exited @0 :ApplicationExit;
abandoned @1 :Void;
killed @2 :KillReason;
faulted @3 :FaultInfo;
startupFailed @4 :StartupFailure;
}
}
struct ApplicationExit {
code @0 :Int64;
}
ProcessHandle.wait() would return status :ProcessStatus instead of a bare
exitCode :Int64. Normal application termination returns exited(code), while
Sleep(INF) returns abandoned.
This fixes the type problem, but leaves the operation name wrong. Sleep normally means the process remains alive and keeps its authority until a wake condition. The infinite special case would instead release authority, reclaim memory, cancel endpoint state, complete process handles, and make the process impossible to wake. That is termination, not sleep.
Sleep(INF) as Detached No-Status Termination
Another version treats Sleep(INF) as detached termination and gives parents no
status. That avoids inventing an exit code, but it weakens supervision. Init and
future service supervisors need a definite terminal event to implement restart
policy, diagnostics, dependency failure reporting, and “wait for all children”
flows. A missing status is not a useful status.
Remove sys_exit Through a Typed Lifecycle Capability
Removing the dedicated sys_exit syscall is a separate, plausible future
direction. The cleaner version is not Sleep(INF), but an explicit lifecycle
operation:
interface ProcessSelf {
terminate @0 (status :ProcessStatus) -> ();
abandon @1 () -> ();
}
interface ProcessHandle {
wait @0 () -> (status :ProcessStatus);
}
The process would receive ProcessSelf only for itself. Calling terminate
would be non-returning in practice: the kernel would process the request,
release process authority, complete any ProcessHandle waiter with the typed
status, and not post an ordinary success completion back to the dying process.
The transport shape needs care. A generic Cap’n Proto call normally expects a completion CQE, but a self-termination operation cannot safely rely on the dying process to consume one. Viable implementations include:
- a dedicated ring operation such as `CAP_OP_EXIT` targeting a self-lifecycle cap;
- a `ProcessSelf.terminate` call whose method is explicitly non-returning and never posts a CQE to the caller;
- keeping `sys_exit` temporarily until ring-level non-returning operations have explicit ABI and runtime support.
This path removes the ambient exit syscall without overloading sleep. It also forces terminal status to become typed before kill, abandon, restart policy, or fault reporting are added.
Rationale For Rejection
Sleep(INF) solves the wrong abstraction problem. The zombie problem is not
that a process needs a forever-blocked state. The problem is retaining process
resources after terminal execution. capOS should solve that by separating
process lifetime from process-status observation:
- process termination immediately releases authority and reclaims process resources;
- a `ProcessHandle` is only observation authority, not ownership of the live process;
- if a handle exists, a small completion record may remain until it is waited or released;
- if no handle exists, terminal status can be discarded;
- no ambient parent process table is needed.
Under that model, a sleeping process remains alive and authoritative, while a
terminated process does not. Special-casing Sleep(INF) to perform teardown
would make the name actively misleading and would create a hidden terminal
operation with different semantics from finite sleep.
The accepted direction is therefore:
- keep explicit process termination semantics;
- replace raw `exitCode :Int64` with typed `ProcessStatus` before adding more lifecycle states;
- keep `exit(code)` as the current minimal ABI until a typed self-lifecycle capability or ring operation can replace it cleanly;
- add future `Timer.sleep(duration)` only for real sleep, where the process remains alive and may wake.
Sleep(INF) remains rejected as a termination primitive. The concern it raises
is valid, but the solution is typed terminal status plus status-record cleanup,
not infinite sleep.
Research: Capability-Based and Microkernel Operating Systems
Survey of existing systems to inform capOS design decisions across IPC, scheduling, capability model, persistence, VFS, and language support.
Design consequences for capOS
- Keep the flat generation-tagged capability table; seL4-style CNode hierarchy is not needed until delegation patterns demand it.
- Treat the typed Cap’n Proto interface as the permission boundary; avoid a parallel rights-bit system that would drift from schema semantics.
- Continue the ring transport plus direct-handoff IPC path, with shared memory reserved for bulk data once `SharedBuffer`/`MemoryObject` exists.
- Use badge metadata, move/copy transfer descriptors, and future epoch revocation to make authority delegation explicit and reviewable.
- Keep persistence explicit through Store/Namespace capabilities; do not adopt EROS-style transparent global checkpointing as a kernel baseline.
- Push POSIX compatibility and VFS behavior into libraries and services rather than adding a kernel global filesystem namespace.
- Add resource donation, scheduling-context donation, notification objects, and runtime/thread primitives only when the corresponding service or runtime path needs them.
Individual deep-dive reports:
- seL4 – formal verification, CNode/CSpace, IPC fastpath, MCS scheduling
- Fuchsia/Zircon – handles with rights, channels, VMARs/VMOs, ports, FIDL vs Cap’n Proto
- Plan 9 / Inferno – per-process namespaces, 9P protocol, file-based vs capability-based interfaces
- EROS / CapROS / Coyotos – persistent capabilities, single-level store, checkpoint/restart
- Genode – session routing, VFS plugins, POSIX compat, resource trading, Sculpt OS
- LLVM target customization – target triples, TLS models, Go runtime requirements
- Cap’n Proto error handling – protocol, schema, and Rust crate error behavior used by the capOS error model
- OS error handling – error patterns in capability systems and microkernels used by the capOS error model
- IX-on-capOS hosting – clean integration of IX package/build model via MicroPython control plane, native template rendering, Store/Namespace, and build services
- Out-of-kernel scheduling – whether scheduler policy can move to user space, and which dispatch/enforcement mechanisms must stay in kernel
Cross-Cutting Analysis
1. Capability Table Design
All surveyed systems store capabilities as process-local references to kernel objects. The key design variable is how capabilities are organized.
| System | Structure | Lookup | Delegation | Revocation |
|---|---|---|---|---|
| seL4 | Tree of CNodes (power-of-2 arrays with guard bits) | O(depth) | Subtree (grant CNode cap) | CDT (derivation tree), transitive |
| Zircon | Flat per-process handle table | O(1) | Transfer through channels (move) | Close handle; refcount; no propagation |
| EROS | 32-slot nodes forming trees | O(depth) | Node key passing | Forwarder keys (O(1) rescind) |
| Genode | Kernel-enforced capability references | O(1) | Parent-mediated session routing | Session close |
| capOS | Flat table with generation-tagged CapId, hold-edge metadata, and Arc<dyn CapObject> backing | O(1) | Manifest exports plus copy/move transfer descriptors through Endpoint IPC | Local release/process exit; epoch revocation not yet |
Recommendation for capOS: Keep the flat table. It is simpler than seL4’s CNode tree and sufficient for capOS’s use cases. Augment each entry with:
- Badge (from seL4) – u64 value delivered to the server on invocation, allowing a server to distinguish callers without separate capability objects.
- Generation counter (from Zircon) – upper bits of CapId detect stale references after a slot is reused. (Implemented.)
- Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.
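Taken together, the badge, generation tag, and epoch compose into a single O(1) validity check per invocation. A minimal Rust sketch of that check, with illustrative names (the `CapId` bit split, `Entry`, `Object`, `check`) that are not the actual capOS types:

```rust
// Hedged sketch of a flat-table entry combining the three recommended
// fields: badge (seL4), generation-tagged CapId (Zircon, implemented),
// and per-object revocation epoch (EROS). Names are illustrative.
const GEN_SHIFT: u32 = 24; // assume: upper 8 bits of a u32 CapId hold the generation

#[derive(Clone, Copy, Debug, PartialEq)]
struct CapId(u32);

impl CapId {
    fn new(slot: u32, generation: u32) -> Self {
        CapId((generation << GEN_SHIFT) | (slot & ((1u32 << GEN_SHIFT) - 1)))
    }
    fn slot(self) -> usize {
        (self.0 & ((1u32 << GEN_SHIFT) - 1)) as usize
    }
    fn generation(self) -> u32 {
        self.0 >> GEN_SHIFT
    }
}

struct Entry {
    generation: u32, // bumped each time this slot is reused
    badge: u64,      // delivered to the server on invocation
    epoch: u64,      // object-epoch snapshot taken when the cap was granted
}

struct Object {
    epoch: u64, // incrementing this revokes every outstanding reference
}

// O(1) validity check: stale-slot detection plus epoch revocation.
fn check(entry: &Entry, id: CapId, obj: &Object) -> Result<u64, &'static str> {
    if entry.generation != id.generation() {
        return Err("stale CapId: slot was reused");
    }
    if entry.epoch != obj.epoch {
        return Err("revoked: object epoch advanced");
    }
    Ok(entry.badge) // badge lets the server distinguish callers
}
```

Revocation is then `obj.epoch += 1`: no walk over outstanding capabilities, and every later `check` fails.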
Not adopted: per-entry rights bitmask. Zircon and seL4 use rights bitmasks
(READ/WRITE/EXECUTE) because their handle/syscall interfaces are untyped.
capOS uses Cap’n Proto typed interfaces where the schema defines what methods
exist. Method-level access control is the interface itself – to restrict what
a caller can do, grant a narrower capability (a wrapper CapObject that
exposes fewer methods). A parallel rights system would create an impedance
mismatch: generic flags (READ/WRITE) mapped arbitrarily onto typed methods.
Meta-rights for the capability reference itself (TRANSFER/DUPLICATE) may be
added when Stage 6 IPC needs them. See capability-model.md
for the full rationale.
2. IPC Design
IPC is the most performance-critical kernel mechanism. Every capability invocation across processes goes through it.
| System | Model | Latency (round-trip) | Bulk data | Async |
|---|---|---|---|---|
| seL4 | Synchronous endpoint, direct context switch | ~240 cycles (ARM), ~400 cycles (x86) | Shared memory (explicit) | Notification objects (bitmask signal/wait) |
| Zircon | Channels (async message queue, 64KiB + 64 handles) | ~3000-5000 cycles | VMOs (shared memory) | Ports (signal-based notification) |
| EROS | Synchronous domain call | ~2x L4 | Through address space nodes | None (synchronous only) |
| Plan 9 | 9P over pipes (kernel-mediated) | ~5000+ cycles | Large reads/writes (iounit) | None (blocking per-fid) |
| Genode | RPC objects with session routing | Varies by kernel (uses seL4/NOVA/Linux underneath) | Shared-memory dataspaces | Signal capabilities |
Recommendation for capOS: Continue the dual-path IPC design:
Fast synchronous path (seL4-inspired, for RPC):
- When process A calls a capability in process B and B is blocked waiting, perform a direct context switch (A -> kernel -> B, no unrelated scheduler pick). The current single-CPU direct handoff is implemented.
- Future fastpath work can transfer small messages (<64 bytes) through registers during the switch instead of copying through ring buffers.
Async submission/completion rings (io_uring-inspired, for batching):
- SQ/CQ in shared memory for batched capability invocations. This is the current transport for CALL/RECV/RETURN/RELEASE/NOP.
- Support SQE chaining for Cap’n Proto promise pipelining.
- Signal/notification delivery through CQ entries (from Zircon ports).
- User-queued CQ entries for userspace event loop integration.
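The opcode set above (CALL/RECV/RETURN/RELEASE/NOP) and SQE chaining can be sketched as a toy ring. The field layout, status codes, and chain-cancellation rule here are assumptions for illustration, not the implemented wire format:

```rust
// Toy model of the SQ/CQ rings. The real rings live in shared memory
// with atomic head/tail indices, elided here.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Opcode { Nop, Call, Recv, Return, Release }

#[derive(Clone, Copy)]
struct Sqe {
    opcode: Opcode,
    cap: u32,       // target CapId (0 plays "invalid" in this toy model)
    user_data: u64, // echoed back in the matching completion entry
    chained: bool,  // links to the next SQE: the hook for promise pipelining
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Cqe { user_data: u64, status: i32 }

struct Ring { sq: Vec<Sqe>, cq: Vec<Cqe> }

impl Ring {
    fn new() -> Self { Ring { sq: Vec::new(), cq: Vec::new() } }
    fn submit(&mut self, sqe: Sqe) { self.sq.push(sqe); }

    // Kernel side: drain submissions, post one completion per entry.
    // If a chained entry fails, the rest of its chain is canceled (-2),
    // mirroring io_uring-style link semantics.
    fn kernel_drain(&mut self) {
        let mut cancel = false;
        for sqe in self.sq.drain(..) {
            let status = if cancel {
                -2 // canceled: an earlier link in the chain failed
            } else if sqe.opcode != Opcode::Nop && sqe.cap == 0 {
                -1 // invalid capability
            } else {
                0
            };
            cancel = sqe.chained && status != 0;
            self.cq.push(Cqe { user_data: sqe.user_data, status });
        }
    }
}
```

Batching falls out of the shape: userspace pushes many SQEs and enters the kernel once; chaining gives pipelined calls a place to hang dependency order.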
Bulk data (Zircon/Genode-inspired):
- SharedBuffer capability for zero-copy data transfer between processes.
- Capnp messages for control plane; shared memory for data plane.
- Critical for file I/O, networking, and GPU rendering.
3. Memory Management Capabilities
Zircon’s VMO/VMAR model is the most mature capability-based memory design. The Go runtime proposal shows why these primitives are essential.
VirtualMemory capability (baseline implemented; still central for Go and advanced allocators):
interface VirtualMemory {
map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
unmap @1 (addr :UInt64, size :UInt64) -> ();
protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
MemoryObject capability (needed for IPC bulk data, shared libraries).
Zircon calls this concept a VMO (Virtual Memory Object); capOS uses the name
SharedBuffer – see docs/proposals/storage-and-naming-proposal.md for the canonical
interface definition.
interface MemoryObject {
read @0 (offset :UInt64, count :UInt64) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> ();
getSize @2 () -> (size :UInt64);
createChild @3 (offset :UInt64, size :UInt64, options :UInt32)
-> (child :MemoryObject);
}
4. Scheduling
| System | Model | Priority inversion solution | Temporal isolation |
|---|---|---|---|
| seL4 (MCS) | Scheduling Contexts (budget/period/priority) + Reply Objects | SC donation through IPC (caller’s SC transfers to callee) | Yes (budget enforcement per SC) |
| Zircon | Fair scheduler with profiles (deadline, capacity, period) | Kernel-managed priority inheritance | Profiles provide some isolation |
| Genode | Delegated to underlying kernel (seL4/NOVA/Linux) | Depends on kernel | Depends on kernel |
| Out-of-kernel policy | Kernel dispatch/enforcement + user-space policy service | Scheduling-context donation through IPC | Kernel-enforced budgets, user-chosen policy |
| User-space runtimes | M:N work stealing, fibers, async tasks over kernel threads | Requires futexes, runtime cooperation, and OS-visible blocking events | Usually runtime-local only |
Recommendation for capOS: Start with round-robin (already done). When implementing priority scheduling:
- Add scheduling context donation for synchronous IPC: when process A calls process B, B inherits A’s priority and budget. Prevents inversion through the capability graph.
- Support passive servers (from seL4 MCS): servers without their own scheduling context that only run when called, using the caller’s budget. Natural fit for capOS’s service architecture.
- Add temporal isolation (budget/period per scheduling context) for the cloud deployment scenario.
For moving scheduler policy out of the kernel, see Out-of-kernel scheduling. The key finding is a split between kernel dispatch/enforcement and user-space policy: dispatch, budget enforcement, and emergency fallback remain privileged, while admission control, budgets, priorities, CPU masks, and SQPOLL/core grants can be represented as policy managed by a scheduler service. Thread creation, thread handles, scheduling contexts, and futex authority should be capability-based from the start; the remaining research task is measurement: compare generic capnp/ring calls against compact capability-authorized futex-shaped operations before deciding the futex hot-path encoding.
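The donation mechanics in the table above can be sketched in a few lines. `SchedContext`, `Thread`, and the `ipc_call`/`ipc_reply` hooks are illustrative names, not the capOS scheduler API:

```rust
// Sketch of seL4 MCS-style scheduling-context donation on synchronous
// IPC: when A calls B, B runs on A's context (priority + budget) until it
// replies. A passive server is simply a thread with no context of its own.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SchedContext { priority: u8, budget_us: u32 }

struct Thread {
    own_sc: Option<SchedContext>,      // None => passive server
    borrowed_sc: Option<SchedContext>, // donated by a caller during IPC
}

impl Thread {
    // Effective context: a donation provides (or overrides) the SC.
    fn effective_sc(&self) -> Option<SchedContext> {
        self.borrowed_sc.or(self.own_sc)
    }
}

// The caller's SC moves to the callee for the duration of the call,
// so priority inversion cannot arise along the capability graph.
fn ipc_call(caller: &mut Thread, callee: &mut Thread) {
    callee.borrowed_sc = caller.effective_sc();
}

fn ipc_reply(callee: &mut Thread) {
    callee.borrowed_sc = None; // SC returns to the caller
}
```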
5. Persistence
| System | Model | Consistency | Application effort |
|---|---|---|---|
| EROS/CapROS | Transparent global checkpoint (single-level store) | Strong (global snapshot) | None (automatic) |
| Plan 9 | User-mode file servers with explicit writes | Per-file server | Full (explicit save/load) |
| Genode | Application-level (services manage own persistence) | Per-component | Full |
| capOS (planned) | Content-addressed Store + Namespace caps | Per-service | Full (explicit capnp serialize) |
Recommendation for capOS: Three phases, as informed by EROS:
- Explicit persistence (current plan) – services serialize state to the Store capability as capnp messages. Simple, gives services control.
- Opt-in Checkpoint capability – kernel captures process state (registers, memory, cap table) as capnp messages stored in the Store. Enables process migration and crash recovery for services that opt in.
- Coordinated checkpointing – a coordinator service orchestrates consistent snapshots across multiple services.
Persistent capability references (from EROS + Cap’n Proto):
struct PersistentCapRef {
interfaceId @0 :UInt64;
objectId @1 :UInt64;
permissions @2 :UInt32;
epoch @3 :UInt64;
}
Do NOT implement EROS-style transparent global persistence. The kernel complexity is enormous, debuggability is poor, and Cap’n Proto’s zero-copy serialization already provides near-equivalent benefits for explicit persistence.
6. Namespace and VFS
Plan 9’s per-process namespace is the closest analog to capOS’s per-process
capability table. The key insight: Plan 9’s bind/mount with union
semantics provides composability that capOS’s current Namespace design lacks.
Recommendation: Extend Namespace with union composition:
enum UnionMode { replace @0; before @1; after @2; }
interface Namespace {
resolve @0 (name :Text) -> (hash :Data);
bind @1 (name :Text, hash :Data) -> ();
list @2 () -> (names :List(Text));
sub @3 (prefix :Text) -> (ns :Namespace);
union @4 (other :Namespace, mode :UnionMode) -> (merged :Namespace);
}
VFS as a library (from Genode): libcapos-posix should be an in-process
library that translates POSIX calls to capability invocations. Each POSIX
process receives a declarative mount table (capnp struct) mapping paths to
capabilities. No VFS server needed.
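A hypothetical sketch of the translation step: the mount table is a list of (path prefix, capability) pairs, and `open()` reduces to a longest-prefix match plus an invocation on the winning capability. `MountTable` and `resolve` are illustrative names, not the libcapos-posix API:

```rust
// In-process path-to-capability translation. In capOS the table would be
// a capnp struct naming real CapIds delivered at process start; here the
// capability is just a u32 id.
struct MountTable {
    entries: Vec<(String, u32)>, // (path prefix, backing capability)
}

impl MountTable {
    // Longest-prefix match: returns the backing capability and the
    // remainder of the path to pass to that capability's open method.
    fn resolve<'a>(&self, path: &'a str) -> Option<(u32, &'a str)> {
        self.entries
            .iter()
            .filter(|(prefix, _)| path.starts_with(prefix.as_str()))
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(prefix, cap)| (*cap, &path[prefix.len()..]))
    }
}
```

Because the table is per-process data rather than a shared server, two processes can see entirely different filesystems with no coordination, which is the Genode property the text is after.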
FileServer capability (from Plan 9): For resources that are naturally
file-like (config trees, debug introspection, /proc-style interfaces),
provide a FileServer interface. Not universal (as in Plan 9) but available
where the file metaphor fits.
7. Resource Accounting
Genode’s session quota model addresses a gap in capOS: without resource accounting, a malicious client can exhaust a server’s memory by creating many sessions.
Recommendation: Session-creating capability methods should accept a resource donation parameter:
interface NetworkManager {
createTcpSocket @0 (bufferPages :UInt32) -> (socket :TcpSocket);
}
The client donates buffer memory as part of the session creation. The server allocates from donated resources, not its own.
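A minimal sketch of that donation flow, assuming a simple page-count quota; `Quota`, `Session`, and the server side of `create_tcp_socket` are illustrative:

```rust
// Genode-style session quota: the client transfers part of its own
// quota at session creation, and the server allocates session state
// only from the donated amount.
struct Quota { pages: u32 }

impl Quota {
    // Move `pages` out of this quota into a new per-session quota.
    fn donate(&mut self, pages: u32) -> Result<Quota, &'static str> {
        if pages > self.pages {
            return Err("insufficient quota");
        }
        self.pages -= pages;
        Ok(Quota { pages })
    }
}

struct Session { buffer_quota: Quota }

// Server-side session creation: allocation is bounded by the donation,
// so a client cannot exhaust the server's own memory by opening sessions.
fn create_tcp_socket(client: &mut Quota, buffer_pages: u32) -> Result<Session, &'static str> {
    let donated = client.donate(buffer_pages)?;
    Ok(Session { buffer_quota: donated })
}
```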
8. Language Support Roadmap
From the LLVM research, the recommended order:
| Step | What | Blocks |
|---|---|---|
| 1 | Custom target JSON (x86_64-unknown-capos) | Optional build hygiene; not required by current no_std runtime |
| 2 | VirtualMemory capability | Done for baseline map/unmap/protect; Go allocator glue remains |
| 3 | TLS support (PT_TLS parsing, FS base save/restore) | Done for static ELF processes; runtime-controlled FS base remains |
| 4 | Futex authority capability + measured ABI | Go threads, pthreads |
| 5 | Timer capability (monotonic clock) | Go scheduler |
| 6 | Go Phase 1: minimal GOOS=capos (single-threaded) | CUE on capOS |
| 7 | Kernel threading | Go GOMAXPROCS>1 |
| 8 | C toolchain + libcapos | C programs, musl |
| 9 | Go Phase 2: multi-threaded + concurrent GC | Go network services |
| 10 | Go Phase 3: network poller | net/http on capOS |
Key decisions:
- Keep x86_64-unknown-none for kernel, x86_64-unknown-capos for userspace.
- Use local-exec TLS model (static linking, no dynamic linker).
- Implement futex as capability-authorized from the start. Because it operates on memory addresses and must be fast, measure generic capnp/ring calls against a compact capability-authorized operation before fixing the ABI.
- Go can start with cooperative-only preemption (no signals).
Recommendations by Roadmap Stage
Stage 5: Scheduling
| Source | Recommendation | Priority |
|---|---|---|
| Zircon | Generation counter in CapId (stale reference detection) | Done |
| seL4 | Add notification objects (lightweight bitmask signal/wait) | Medium |
| LLVM | Custom target JSON for userspace (x86_64-unknown-capos) | Medium |
| LLVM | Runtime-controlled FS-base operation for Go/threading | Medium |
Stage 6: IPC and Capability Transfer
| Source | Recommendation | Priority |
|---|---|---|
| seL4 | Direct-switch IPC for synchronous cross-process calls | Done baseline |
| seL4 | Badge field on capability entries (server identifies callers) | Done |
| Zircon | Move semantics for capability transfer through IPC | Done |
| Zircon | MemoryObject capability (shared memory for bulk data) | High |
| EROS | Epoch-based revocation (O(1) revoke, O(1) check) | High |
| Zircon | Sideband capability-transfer descriptors and result-cap records | Done baseline |
| Genode | SharedBuffer capability for data-plane transfers | High |
| Plan 9 | Promise pipelining (SQE chaining in async rings) | Medium |
| Genode | Session quotas / resource donation on session creation | Medium |
| seL4 | Scheduling context donation through IPC | Medium |
| Plan 9 | Namespace union composition (before/after/replace) | Low |
Post-Stage 6 / Future
| Source | Recommendation | Priority |
|---|---|---|
| seL4 | MCS scheduling (passive servers, temporal isolation) | When needed |
| EROS | Opt-in Checkpoint capability for process persistence | When needed |
| Genode | Dynamic manifest reconfiguration at runtime | When needed |
| Plan 9 | exportfs-pattern capability proxy for network transparency | When needed |
| EROS | PersistentCapRef struct in capnp for storing capability graphs | When needed |
| seL4 | Rust-native formal verification (track Verus/Prusti) | Long-term |
Design Decisions Validated
Several capOS design choices are validated by this research:
- Cap’n Proto as the universal wire format. Superior to FIDL (random access, zero-copy, promise pipelining, persistence-ready). The right choice. See zircon.md Section 5.
- Flat capability table. Simpler than seL4’s CNode tree, sufficient for capOS. Only add complexity (CNode-like hierarchy) if delegation patterns demand it. See sel4.md Section 4.
- No ambient authority. Every surveyed capability OS confirms this is essential. EROS proved confinement. seL4 proved integrity. capOS has this by design.
- Explicit persistence over transparent. EROS’s single-level store is elegant but the kernel complexity is enormous. Cap’n Proto zero-copy gives most of the benefits. See eros-capros-coyotos.md Section 6.
- io_uring-inspired async rings. Better than Zircon’s port model for capOS (operation-based > notification-based). See zircon.md Section 4.
- VFS as library, not kernel feature. Genode’s approach, matched by capOS’s planned libcapos-posix. See genode.md Section 3.
- No fork(). Genode has operated without fork() for 15+ years, proving it unnecessary. See genode.md Section 4.
Design Gaps Identified
- No bulk data path. Copying capnp messages through the kernel works for control but not for file/network/GPU data. A SharedBuffer or MemoryObject capability is essential for Stage 6+.
- Resource accounting is fragmented. The authority-accounting design exists and current code has several local ledgers, but VirtualMemory, FrameAllocator, and process/resource quotas are not yet unified.
- No notification primitive. seL4 notifications (lightweight bitmask signal/wait) are needed for interrupt delivery and event notification without full capnp message overhead.
- No runtime-controlled FS-base/thread TLS API. Static ELF TLS and context-switch FS-base state exist, but Go and future user threads still need a way to set the FS base per thread.
References
See individual deep-dive reports for full reference lists. Key primary sources:
- Klein et al., “seL4: Formal Verification of an OS Kernel,” SOSP 2009
- Lyons et al., “Scheduling-context capabilities,” EuroSys 2018
- Shapiro et al., “EROS: A Fast Capability System,” SOSP 1999
- Shapiro & Weber, “Verifying the EROS Confinement Mechanism,” IEEE S&P 2000
- Pike et al., “The Use of Name Spaces in Plan 9,” OSR 1993
- Feske, “Genode Foundations” (genode.org/documentation)
- Fuchsia Zircon kernel documentation (fuchsia.dev)
seL4 Deep Dive: Lessons for capOS
Research notes on seL4’s design, covering formal verification, capability model, IPC, scheduling, and applicability to capOS.
Primary sources: “seL4: Formal Verification of an OS Kernel” (Klein et al., SOSP 2009), seL4 Reference Manual (v12.x / v13.x), “The seL4 Microkernel – An Introduction” (whitepaper, 2020), “Towards a Verified, General-Purpose Operating System Kernel” (Klein et al., 2008), “Principled Approach to Kernel Design for MCS” (Lyons et al., 2018), seL4 source code and API documentation.
1. Formal Verification Approach
What seL4 Proves
seL4 is the first general-purpose OS kernel with a machine-checked proof of functional correctness. The verification chain establishes:
- Functional correctness: The C implementation of the kernel refines (faithfully implements) an abstract specification written in Isabelle/HOL. Every possible execution of the C code corresponds to an allowed behavior in the abstract spec. This is not “absence of some bug class” – it is a complete behavioral equivalence between spec and code.
- Integrity (access control): The kernel enforces capability-based access control. A process cannot access a kernel object unless it holds a capability to it. This is proven as a consequence of functional correctness: the spec defines access rules, and the implementation provably follows them.
- Confidentiality (information flow): In the verified configuration, information cannot flow between security domains except through explicitly authorized channels. This proves noninterference at the kernel level.
- Binary correctness: The proof chain extends from the abstract spec through a Haskell executable model, then to the C implementation, and finally to the compiled ARM binary (via the verified CAmkES/CompCert chain or translation validation against GCC output). On ARM, the compiled binary is proven to behave as the C source specifies.
The Verification Chain
Abstract Specification (Isabelle/HOL)
|
| refinement proof
v
Executable Specification (Haskell)
|
| refinement proof
v
C Implementation (10,000 lines of C)
|
| translation validation / CompCert
v
ARM Binary
Each refinement step proves that the lower-level implementation is a correct realization of the higher-level spec. The Haskell model serves as an “executable spec” – it’s precise enough to run but abstract enough to reason about.
Properties Verified
- No null pointer dereferences – a consequence of functional correctness.
- No buffer overflows – all array accesses are proven in-bounds.
- No arithmetic overflow – all integer operations are proven safe.
- No use-after-free – memory management correctness is proven.
- No memory leaks (in the kernel) – all allocated memory is accounted for.
- No undefined behavior – the C code is proven to avoid all UB.
- Capability enforcement – objects are only accessible through valid capabilities, and capabilities cannot be forged.
- Authority confinement – proven that authority does not leak beyond what capabilities allow.
Practical Implications
What verification buys you:
- Eliminates all implementation bugs in the verified code. Not “most bugs” or “common bug classes” – literally all of them, for the verified configuration.
- The security properties (integrity, confidentiality) hold absolutely, not probabilistically.
- Makes the kernel trustworthy as a separation kernel / isolation boundary.
What verification does NOT cover:
- The specification itself could be wrong (it could specify the wrong behavior). Verification proves “code matches spec,” not “spec is correct.”
- Hardware must behave as modeled. The proof assumes a correct CPU, correct memory, no physical attacks. DMA from malicious devices can break isolation unless an IOMMU is used (and IOMMU management is proven correct).
- Only the verified configuration is covered. seL4 has unverified configurations (e.g., SMP, RISC-V, certain platform features). Using unverified features voids the proof.
- Performance-critical code paths (like the IPC fastpath) were initially outside the verification boundary, though significant progress has been made on verifying them.
- The bootloader and hardware initialization code are outside the proof boundary.
- Compiler correctness: on x86, the proof trusts GCC. On ARM, binary verification closes this gap.
Design Constraints Imposed by Verification
The requirement of formal verification has profoundly shaped seL4’s design:
- Small kernel: ~10,000 lines of C. Every line must be verified, so the kernel is as small as possible. Drivers, file systems, networking – everything lives in user space.
- No dynamic memory allocation in the kernel: The kernel does not have a general-purpose heap. All kernel memory is pre-allocated and managed through typed capabilities (Untyped memory). This eliminates an entire class of verification complexity (heap reasoning).
- No concurrency in the kernel: seL4 runs the kernel as a single-threaded “big lock” model (interrupts disabled in kernel mode). SMP is handled by running independent kernel instances on each core with explicit message passing between them (the “clustered multikernel” approach), or by using a big kernel lock (the current SMP approach, which is NOT covered by the verification proof).
- C implementation: Written in a restricted subset of C that is amenable to Isabelle/HOL reasoning. No function pointers (mostly), no complex pointer arithmetic, no compiler-specific extensions. This makes the code more rigid than typical C but provable.
- Fixed system call set: The kernel API is small and fixed. Adding a new syscall requires extending the proofs – a major effort.
- Platform-specific verification: The proof is per-platform. ARM was verified first; x86 verification came later with additional effort. Each new platform requires new proofs.
2. Capability Transfer Model
Core Concepts
seL4’s capability model descends from the EROS/KeyKOS tradition but with significant innovations driven by formal verification requirements.
Kernel Objects: Everything the kernel manages is an object: TCBs (thread control blocks), endpoints (IPC channels), CNodes (capability storage), page tables, frames, address spaces (VSpaces), untyped memory, and more. The kernel tracks the exact type and state of every object.
Capabilities: A capability is a reference to a kernel object combined with access rights. Capabilities are stored in kernel memory, never directly accessible to user space. User space refers to capabilities by position in its capability space.
CSpaces, CNodes, and CSlots
CSlot (Capability Slot): A single storage location that can hold one capability. A CSlot is either empty or contains a capability (object pointer + access rights + badge).
CNode (Capability Node): A kernel object that is a power-of-two-sized
array of CSlots. A CNode with 2^n slots has a “guard” and a “radix” of
n. CNodes are the building blocks of the capability addressing tree.
CSpace (Capability Space): The complete capability namespace of a thread. A CSpace is a tree of CNodes, rooted at the thread’s CSpace root (a CNode pointed to by the TCB). Capability lookup traverses this tree.
Thread's TCB
|
+-- CSpace Root (CNode, 2^8 = 256 slots)
|
+-- slot 0: cap to Endpoint A
+-- slot 1: cap to Frame X
+-- slot 2: cap to another CNode (2^4 = 16 slots)
| |
| +-- slot 0: cap to Endpoint B
| +-- slot 1: empty
| +-- ...
+-- slot 3: empty
+-- ...
Capability Addressing (CPtr and Depth)
A CPtr (Capability Pointer) is a word-sized integer used to name a capability within a thread’s CSpace. It is NOT a memory pointer – it is an index that the kernel resolves by walking the CNode tree.
Resolution works bit-by-bit from the most significant end:
- Start at the CSpace root CNode.
- The CNode’s guard is compared against the corresponding bits of the CPtr. If they don’t match, the lookup fails. Guards allow sparse addressing without allocating huge CNode arrays.
- The next radix bits of the CPtr are used as an index into the CNode array.
- If the slot contains a CNode capability, recurse: consume the next bits of the CPtr to walk deeper.
- If the slot contains any other capability, the lookup is complete.
- The depth parameter in the syscall tells the kernel how many bits of the CPtr to consume. This disambiguates between “stop at this CNode cap” and “descend into this CNode.”
Example: A CPtr of 0x4B with a two-level CSpace:
- Root CNode: guard = 0, radix = 4 (16 slots)
- Bits [7:4] = 0x4 -> index into root CNode slot 4
- Slot 4 contains a CNode cap: guard = 0, radix = 4 (16 slots)
- Bits [3:0] = 0xB -> index into second-level CNode slot 11
- Slot 11 contains an Endpoint cap -> lookup complete
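The resolution procedure above can be sketched directly; the `CNode`/`Slot` types here are toy stand-ins for the kernel structures, not seL4 code:

```rust
// Guard-then-radix CPtr resolution over a toy CNode tree.
enum Slot {
    Empty,
    Object(&'static str), // leaf capability, named for illustration
    Node(Box<CNode>),     // nested CNode capability
}

struct CNode {
    guard: u64,      // value the next guard_bits of the CPtr must equal
    guard_bits: u32,
    radix_bits: u32, // log2 of the slot count
    slots: Vec<Slot>,
}

// Resolve `cptr` against the tree rooted at `root`, consuming exactly
// `depth` bits: guard check, then radix index, repeated per level.
fn resolve(root: &CNode, cptr: u64, depth: u32) -> Result<&'static str, &'static str> {
    let mut node = root;
    let mut remaining = depth;
    loop {
        let level_bits = node.guard_bits + node.radix_bits;
        if level_bits > remaining {
            return Err("depth exhausted");
        }
        remaining -= level_bits;
        // Guard check on the most significant unconsumed bits.
        let guard = (cptr >> (remaining + node.radix_bits)) & ((1u64 << node.guard_bits) - 1);
        if guard != node.guard {
            return Err("guard mismatch");
        }
        // Next radix_bits bits index into the slot array.
        let index = ((cptr >> remaining) & ((1u64 << node.radix_bits) - 1)) as usize;
        match &node.slots[index] {
            Slot::Empty => return Err("empty slot"),
            Slot::Object(name) => {
                return if remaining == 0 { Ok(*name) } else { Err("depth not fully consumed") }
            }
            Slot::Node(next) => node = next.as_ref(),
        }
    }
}
```

With the 0x4B example above (two 4-bit levels, zero-width guards), `resolve(&root, 0x4B, 8)` walks root slot 4 into the second-level CNode and completes at slot 11 with all 8 bits consumed.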
Flat Table vs. Hierarchical CSpace
seL4’s hierarchical CSpace has significant implications:
Advantages of hierarchical:
- Sparse capability spaces without wasting memory. A process can have a huge CPtr range with only a few CNodes allocated.
- Subtree delegation: a parent can give a child a CNode cap that grants access to a subset of capabilities. The child can manage its own subtree without affecting the parent’s.
- Guards compress address bits, allowing efficient encoding of large capability identifiers.
Disadvantages of hierarchical:
- Lookup is slower than a flat array index – multiple memory indirections per resolution.
- More complex kernel code (and more complex verification).
- User space must explicitly manage CNode allocation and CSpace layout.
capOS comparison: capOS uses a flat Vec<Option<Arc<dyn CapObject>>>
indexed by CapId (u32). The shared Arc lets a single kernel capability
back multiple per-process slots, which is what makes cross-process IPC work
when another service resolves its CapRef via CapSource::Service. The flat
layout is simpler and faster for lookup (single array index), but cannot
support sparse addressing or subtree delegation.
For capOS’s research goals, the flat approach is adequate initially. If
capOS needs hierarchical delegation later (e.g., a supervisor delegating
a subset of caps to a child without copying), it could add a level of
indirection without adopting seL4’s full tree model.
Capability Operations
seL4 provides these operations on capabilities:
Copy: Duplicate a capability from one CSlot to another. Both the source and destination must be in the caller’s CSpace (or the caller must have CNode caps to the relevant CNodes). The new cap has the same authority as the original, minus any rights the caller chooses to strip.
Mint: Like Copy, but also sets a badge on the new capability. A badge is a word-sized integer embedded in the capability that is delivered to the receiver when the capability is used. Badges allow a server to distinguish which client is calling – each client gets a differently-badged cap to the same endpoint, and the server sees the badge on each incoming message.
Move: Transfer a capability from one CSlot to another. The source slot becomes empty. This is an atomic transfer of authority.
Mutate: Move + modify rights or badge in one operation.
Delete: Remove a capability from a CSlot, making it empty.
Revoke: Delete a capability AND all capabilities derived from it. This is the most powerful operation – it allows a parent to withdraw authority it granted to children, transitively.
Capability Derivation and the CDT
seL4 tracks a Capability Derivation Tree (CDT) – a tree recording which capability was derived from which. When capability A is copied or minted to produce capability B, B becomes a child of A in the CDT.
Revoke(A) deletes all descendants of A in the CDT but leaves A itself.
This gives the holder of A the power to revoke all authority derived from
their own authority.
The CDT is critical for clean revocation but adds significant kernel complexity. It requires maintaining a tree structure across all capability copies throughout the system.
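A toy model of the derivation tree and transitive revoke. A real CDT threads parent/child links through the CSlots themselves, so the flat map here is purely illustrative:

```rust
use std::collections::HashMap;

// Capability Derivation Tree: Copy/Mint record the new cap as a child of
// its source; Revoke(A) deletes A's descendants while leaving A itself.
struct Cdt {
    // cap id -> parent cap id (None for original, underived caps)
    parent: HashMap<u32, Option<u32>>,
}

impl Cdt {
    fn derive(&mut self, parent: u32, child: u32) {
        self.parent.insert(child, Some(parent));
    }

    fn is_descendant(&self, cap: u32, ancestor: u32) -> bool {
        let mut cur = self.parent.get(&cap).copied().flatten();
        while let Some(p) = cur {
            if p == ancestor {
                return true;
            }
            cur = self.parent.get(&p).copied().flatten();
        }
        false
    }

    // Remove every capability derived (transitively) from `cap`,
    // keeping `cap` itself.
    fn revoke(&mut self, cap: u32) {
        let doomed: Vec<u32> = self
            .parent
            .keys()
            .copied()
            .filter(|&c| self.is_descendant(c, cap))
            .collect();
        for c in doomed {
            self.parent.remove(&c);
        }
    }
}
```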
Untyped Memory and Retype
One of seL4’s most distinctive features is that the kernel never allocates
memory on its own. All physical memory is initially represented as
Untyped Memory capabilities. To create any kernel object (endpoint, CNode,
TCB, page frame, etc.), user space must invoke the Untyped_Retype operation
on an untyped cap, which carves out a portion of the untyped memory and
creates a new typed object.
This means:
- User space (specifically, the root task or a memory manager) controls all memory allocation.
- The kernel has zero internal allocation – all memory it uses comes from retyped untypeds.
- Memory exhaustion is impossible in the kernel – if a syscall needs memory, user space must have provided it in advance via retype.
- Revoke on an untyped cap destroys ALL objects created from it, reclaiming the memory. This is the mechanism for wholesale cleanup.
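A watermark sketch of the retype bookkeeping, under the assumption (following the seL4 manual's description) that objects are carved out at naturally aligned offsets and that revoking the untyped reclaims the whole region; object-type bookkeeping is elided:

```rust
// Untyped_Retype as a watermark allocator: the kernel itself never
// allocates, it only carves objects out of user-provided untypeds.
struct Untyped {
    size: usize,      // total bytes in the region
    watermark: usize, // bytes already consumed by earlier retypes
}

impl Untyped {
    // Carve one object of `obj_size` bytes (a power of two) out of the
    // region, naturally aligned, advancing the watermark.
    fn retype(&mut self, obj_size: usize) -> Result<usize, &'static str> {
        assert!(obj_size.is_power_of_two());
        // Align the watermark up to the object's natural alignment.
        let start = (self.watermark + obj_size - 1) & !(obj_size - 1);
        if start + obj_size > self.size {
            return Err("untyped exhausted");
        }
        self.watermark = start + obj_size;
        Ok(start) // offset of the new object within the region
    }

    // Revoking the untyped cap destroys every derived object and resets
    // the watermark, reclaiming the memory wholesale.
    fn revoke(&mut self) {
        self.watermark = 0;
    }
}
```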
3. IPC Fastpath
Overview
seL4’s IPC is synchronous and endpoint-based. An endpoint is a rendezvous point: the sender blocks until a receiver is ready, or vice versa. There is no buffering in the kernel (unlike Mach ports or Linux pipes).
The IPC fastpath is a highly optimized code path for the common case of a short synchronous call/reply between two threads. It is one of seL4’s signature performance features.
How the Fastpath Works
When thread A calls seL4_Call(endpoint_cap, msg):
- Capability lookup: Resolve the CPtr to find the endpoint cap. In the fastpath, this is optimized to handle the common case of a direct CSlot lookup (single-level CSpace, no guard traversal needed).
- Receiver check: Is there a thread waiting on this endpoint? If yes, the fastpath applies. If no (receiver isn’t ready), fall to the slowpath, which queues the sender.
- Direct context switch: Instead of the normal path (save sender registers -> return to scheduler -> pick receiver -> restore receiver registers), the fastpath performs a direct register transfer:
  - Save the sender’s register state into its TCB.
  - Copy the message registers (a small number, typically 4-8 words) from the sender’s physical registers directly into the receiver’s TCB (or leave them in registers if possible).
  - Load the receiver’s page table root (vspace) into CR3/TTBR.
  - Switch to the receiver’s kernel stack.
  - Restore the receiver’s register state.
  - Return to user mode as the receiver.

  This is a direct context switch – the kernel goes directly from the sender to the receiver without passing through the scheduler. The IPC operation IS the context switch.
- Reply cap: The sender’s reply cap is set up so the receiver can reply. In the classic (non-MCS) model, a one-shot reply capability is placed in the receiver’s TCB. The receiver calls seL4_Reply(msg) to send the response directly back.
Performance Characteristics
seL4 IPC is among the fastest measured:
- ARM (Cortex-A9): ~240 cycles for a Call+Reply round-trip (including two privilege transitions, a full context switch, and message transfer).
- x86-64: ~380-500 cycles for a Call+Reply round-trip depending on hardware generation.
- Message size: The fastpath handles small messages (fits in registers). Longer messages require copying from IPC buffer pages and take the slowpath.
For comparison:
- Linux pipe IPC: ~5,000-10,000 cycles for a round-trip.
- Mach IPC (macOS XNU): ~3,000-5,000 cycles.
- L4/Pistachio: ~700-1,000 cycles (seL4 improved on this).
Fastpath Constraints
The fastpath is only taken when ALL of these conditions hold:
- The operation is seL4_Call or seL4_ReplyRecv (the two most common IPC operations).
- The message fits in message registers (no extra caps, no long messages that require the IPC buffer).
- The capability lookup is “simple” – single-level CSpace, direct slot lookup, no guard bits to check.
- There IS a thread waiting at the endpoint (no need to block the sender).
- The receiver is at sufficient priority (in the non-MCS configuration, higher priority than any other runnable thread – or in MCS, the scheduling context can be donated).
- No capability transfer is happening in this message.
- Certain bookkeeping conditions are met (no pending operations on either thread, no debug traps, etc.).
When any condition fails, the kernel falls through to the slowpath, which handles the general case correctly but with more overhead (~5-10x slower than the fastpath).
Direct Switch Mechanics
The key insight is: when thread A calls thread B synchronously, A is going to block until B replies. There is no scheduling decision to make – the only correct action is to run B immediately. So the kernel skips the scheduler entirely:
Thread A (running) Kernel Thread B (blocked on recv)
| | |
| seL4_Call(ep, msg) ---> | |
| | [fastpath] |
| | Save A's regs |
| | Copy msg A -> B |
| | Switch page tables |
| | Restore B's regs |
| | ---------------------->|
| | | [running, processes msg]
| | |
| | <--- seL4_Reply(reply) |
| | [fastpath again] |
| | Save B's regs |
| | Copy reply B -> A |
| | Switch page tables |
| | Restore A's regs |
| <-----------------------| |
| [running, has reply] | |
The entire round-trip involves exactly two kernel entries and two context switches, with no scheduler invocation.
Implications
-
RPC is the natural IPC pattern: seL4’s IPC is optimized for the client-server call/reply pattern. Fire-and-forget or multicast patterns require different mechanisms (notifications, shared memory).
-
Notifications: For async signaling (like interrupts or events), seL4 provides notification objects – a lightweight word-sized bitmask that can be signaled and waited on without message transfer. These are separate from endpoints.
- Shared memory for bulk transfer: IPC messages are small (register-sized). For large data transfers, the standard pattern is: set up shared memory, then use IPC to synchronize. This is explicit – the kernel doesn’t transparently copy large buffers.
4. CNode/CSpace Architecture in Detail
CNode Structure
A CNode object is a contiguous array of CSlots in kernel memory. The size is always a power of two. The kernel metadata for a CNode includes:
- Radix bits: log2 of the number of slots (e.g., radix=8 means 256 slots).
- Guard value: a bit pattern that must match the CPtr during resolution.
- Guard bits: the number of bits in the guard.
The total bits consumed during resolution of one CNode level is:
guard_bits + radix_bits.
Multi-Level Resolution Example
Consider a two-level CSpace:
Root CNode: guard=0 (0 bits), radix=8 (256 slots)
Slot 5 -> CNode B: guard=0x3 (2 bits), radix=6 (64 slots)
Slot 42 -> Endpoint X
To reach Endpoint X with a 16-bit CPtr at depth 16:
- CPtr = 0b 00000101 11 101010
- Root CNode consumes 8 bits: 00000101 = 5 -> Slot 5 (CNode B cap)
- CNode B guard: next 2 bits = 11 -> matches guard 0x3 -> OK
- CNode B radix: next 6 bits = 101010 = 42 -> Slot 42 (Endpoint X)
- Total bits consumed: 8 + 2 + 6 = 16 = depth -> resolution complete
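The two-level walk above can be written out as a toy resolver. This is an illustrative sketch, not the seL4 sources: the `Slot`, `CNode`, and `resolve` names are hypothetical, but the guard/radix bit consumption follows the rules described in this section.

```rust
// Toy model of seL4-style CPtr resolution with guards (illustrative only).
#[derive(Clone)]
enum Slot {
    Empty,
    Object(&'static str), // leaf capability, named for the demo
    CNode(Box<CNode>),    // link to a child CNode
}

#[derive(Clone)]
struct CNode {
    guard: u64,      // value matched against the next `guard_bits` of the CPtr
    guard_bits: u32,
    radix_bits: u32, // log2 of the slot count
    slots: Vec<Slot>,
}

impl CNode {
    fn new(guard: u64, guard_bits: u32, radix_bits: u32) -> Self {
        CNode { guard, guard_bits, radix_bits, slots: vec![Slot::Empty; 1 << radix_bits] }
    }
}

// Extract `n` bits of `v` ending just below bit position `hi` (exclusive).
fn bits(v: u64, hi: u32, n: u32) -> u64 {
    if n == 0 { return 0; }
    (v >> (hi - n)) & ((1u64 << n) - 1)
}

/// Resolve `cptr` at `depth` bits, consuming guard + radix bits per level.
fn resolve(root: &CNode, cptr: u64, depth: u32) -> Option<&'static str> {
    let mut node = root;
    let mut rem = depth;
    loop {
        if node.guard_bits + node.radix_bits > rem {
            return None; // too few CPtr bits left for this level
        }
        if bits(cptr, rem, node.guard_bits) != node.guard {
            return None; // guard mismatch
        }
        rem -= node.guard_bits;
        let index = bits(cptr, rem, node.radix_bits) as usize;
        rem -= node.radix_bits;
        match &node.slots[index] {
            Slot::Empty => return None,
            Slot::Object(name) => return if rem == 0 { Some(*name) } else { None },
            Slot::CNode(child) => {
                if rem == 0 { return None; } // bits exhausted at an inner node
                node = child.as_ref();
            }
        }
    }
}

fn main() {
    // Root CNode: guard 0 (0 bits), radix 8. Slot 5 -> CNode B (guard 0b11, radix 6).
    let mut b = CNode::new(0x3, 2, 6);
    b.slots[42] = Slot::Object("Endpoint X");
    let mut root = CNode::new(0, 0, 8);
    root.slots[5] = Slot::CNode(Box::new(b));

    // CPtr 0b 00000101 11 101010 at depth 16: 8 + 2 + 6 bits -> Endpoint X.
    assert_eq!(resolve(&root, 0b0000_0101_1110_1010, 16), Some("Endpoint X"));
    // Flipping a guard bit makes resolution fail.
    assert_eq!(resolve(&root, 0b0000_0101_1010_1010, 16), None);
    println!("resolution ok");
}
```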
CSpace Layout Strategies
Flat: One large root CNode with radix=N, no sub-CNodes. Simple, fast lookup (one level). Wastes memory if the CPtr space is sparse.
Two-level: Small root CNode pointing to sub-CNodes. Common for processes that need moderate capability counts.
Deep: Many levels. Useful for delegation: a supervisor gives a child a cap to a sub-CNode, and the child manages its own CSpace subtree below that point.
Comparison with capOS’s Flat Table
| Aspect | seL4 CSpace | capOS CapTable |
|---|---|---|
| Structure | Tree of CNodes | Flat Vec<Option<Arc<dyn CapObject>>> |
| Lookup cost | O(depth) memory indirections | O(1) array index |
| Sparse support | Yes (guards + tree) | No (dense array, holes via free list) |
| Subtree delegation | Yes (grant CNode cap) | No |
| Memory overhead | CNode objects are power-of-2 | Exact-sized Vec |
| Complexity | High (bit-level CPtr resolution) | Low |
| Capability identity | Position in CSpace | CapId (u32 index) |
| Verification burden | Very high | N/A (Rust safety) |
5. MCS (Mixed-Criticality Systems) Scheduling
Background
The original seL4 scheduling model is a simple priority-preemptive scheduler with 256 priority levels and round-robin within each level. This model has a known flaw: priority inversion through IPC. When a high-priority thread calls a low-priority server, the reply might be delayed indefinitely by medium-priority threads preempting the server. The classic solution (priority inheritance) is complex to verify and doesn’t compose well.
The MCS extensions redesign scheduling to solve this and provide temporal isolation.
Key Concepts
Scheduling Context (SC): A new kernel object that represents the “right to execute on a CPU.” An SC holds:
- A budget (microseconds of CPU time per period)
- A period
- A priority
- Remaining budget in the current period
A thread must have a bound SC to be runnable. Without an SC, a thread cannot execute regardless of its priority.
Reply Object: In the MCS model, the one-shot reply capability from classic seL4 is replaced by an explicit Reply kernel object. When thread A calls thread B:
- A’s scheduling context is donated to B.
- A reply object is created to hold A’s return path.
- B now runs on A’s scheduling context (A’s priority and budget).
- When B replies, the SC returns to A.
This solves priority inversion: the server (B) inherits the caller’s priority and budget automatically.
Passive servers: A server thread can exist without its own SC. It only becomes runnable when a client donates an SC via the Call operation. When it replies, it becomes passive again. This is powerful:
- No CPU time is “reserved” for idle servers.
- The server executes on the client’s budget – the client pays for the work it requests.
- Multiple clients can call the same passive server; each brings its own SC.
Temporal Isolation
MCS SCs provide temporal fault isolation:
- Each SC has a fixed budget/period. A thread cannot exceed its budget in any period. When the budget expires, the thread is descheduled until the next period begins.
- This is enforced by hardware timer interrupts – the kernel programs the timer to fire when the current SC’s budget expires.
- A misbehaving (or compromised) component cannot starve other components because its SC bounds its CPU consumption.
- This works even across IPC: if client A calls server B with A’s SC, the combined execution of A+B is bounded by A’s budget.
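The budget/period accounting behind temporal isolation can be sketched in a few lines. This is a hedged toy model, not seL4 code: `SchedContext` and `charge` are hypothetical names, time is in microseconds, and a real kernel would program a hardware timer to fire when `budget_left` runs out rather than charging after the fact.

```rust
// Toy sketch of MCS-style budget/period accounting (illustrative only).
#[derive(Debug, PartialEq)]
enum RunState {
    Runnable,
    Throttled { until: u64 }, // descheduled until the next period begins
}

struct SchedContext {
    budget: u64,       // microseconds of CPU per period
    period: u64,       // period length in microseconds
    budget_left: u64,  // remaining budget in the current period
    period_start: u64,
}

impl SchedContext {
    fn new(budget: u64, period: u64, now: u64) -> Self {
        SchedContext { budget, period, budget_left: budget, period_start: now }
    }

    /// Charge `ran` microseconds of execution ending at time `now`.
    /// Returns Throttled when the budget for this period is exhausted.
    fn charge(&mut self, ran: u64, now: u64) -> RunState {
        // Refill at period boundaries.
        if now >= self.period_start + self.period {
            self.period_start = now - (now - self.period_start) % self.period;
            self.budget_left = self.budget;
        }
        self.budget_left = self.budget_left.saturating_sub(ran);
        if self.budget_left == 0 {
            RunState::Throttled { until: self.period_start + self.period }
        } else {
            RunState::Runnable
        }
    }
}

fn main() {
    // 2ms of budget every 10ms period.
    let mut sc = SchedContext::new(2_000, 10_000, 0);
    assert_eq!(sc.charge(1_500, 1_500), RunState::Runnable);
    // Exhausting the budget deschedules the thread until t = 10_000.
    assert_eq!(sc.charge(500, 2_000), RunState::Throttled { until: 10_000 });
    // After the refill at the period boundary, the thread runs again.
    assert_eq!(sc.charge(1_000, 11_000), RunState::Runnable);
    println!("budget left: {}", sc.budget_left);
}
```

Under this model, SC donation during IPC simply means the server's execution is charged against the caller's `SchedContext`, so the combined client+server time stays bounded by one budget.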
Comparison with capOS’s Scheduler
capOS currently has a round-robin scheduler (kernel/src/sched.rs) with no
priority levels and no temporal isolation:
struct Scheduler {
    processes: BTreeMap<Pid, Process>,
    run_queue: VecDeque<Pid>,
    current: Option<Pid>,
}
Timer preemption, cap_enter blocking waits, Endpoint IPC, and a baseline
direct IPC handoff are implemented. The MCS model is relevant for the next
scheduling step because the same priority inversion problem arises when a
high-priority client calls a low-priority server through a capability.
6. Relevance to capOS
6.1 Formal Verification
Applicability: Low in the near term. seL4’s verification is done in Isabelle/HOL over C code, which doesn’t transfer to Rust. However, the constraints that verification imposed are valuable design guidance:
- Minimal kernel: seL4’s ~10K lines of C demonstrate how little code a microkernel actually needs. capOS should resist adding kernel features and instead move them to user space.
- No kernel heap allocation on the critical path: seL4’s “untyped memory” approach where user space provides all memory is worth studying. capOS has removed the earlier allocation-heavy synchronous ring dispatch path, but it still uses owned kernel objects and preallocated scratch rather than a user-supplied untyped-memory model.
- No kernel concurrency: seL4 avoids kernel-level concurrency entirely (SMP uses separate kernel instances or a big lock). capOS currently uses spin::Mutex around the scheduler and capability tables. The seL4 approach suggests this is acceptable until/unless per-CPU kernel instances are needed.
Rust alternative: Rust’s type system provides memory safety guarantees that overlap with some of seL4’s verified properties (no buffer overflows, no use-after-free, no null dereference in safe code). This is not a substitute for functional correctness proofs, but it significantly raises the bar compared to unverified C. Ongoing research in Rust formal verification (e.g., Prusti, Creusot, Verus) may eventually enable seL4-style proofs over Rust kernels.
6.2 Capability Model
CNode tree vs. flat table: capOS’s flat CapTable is the right choice
for now. seL4’s CNode tree exists to support delegation (granting a subtree
of your CSpace to a child) and sparse addressing. capOS’s current model
gives each process its own independent flat table and now supports
manifest-provided caps plus explicit copy/move transfer descriptors through
Endpoint IPC. If capOS later needs fine-grained delegation (a parent granting
access to a subset of its caps without copying), it can add a level of
indirection:
Option A: Proxy capability objects that forward to the parent's table
Option B: A two-level table (small root array -> larger sub-arrays)
Option C: Shared capability objects with refcounting
Badge/Mint pattern: seL4’s badge mechanism is directly applicable to capOS. When multiple clients share a capability to the same server endpoint, the server needs to distinguish callers. In seL4, each client gets a differently-badged copy of the endpoint cap. The badge is delivered with each message.
capOS has implemented this by adding badge metadata to capability references and hold edges. Endpoint CALL delivery reports the invoked hold badge to the receiver, and copy/move transfer preserves badge metadata.
Current ring SQEs carry cap id and method id separately. The cap table stores badge and transfer-mode metadata alongside the object reference:
struct CapEntry {
    object: Arc<dyn CapObject>,
    badge: u64,
    transfer_mode: CapTransferMode,
}
Revocation (CDT): seL4’s Capability Derivation Tree is its most complex internal structure. For capOS, full CDT-style transitive revocation is probably overkill initially. The service-architecture proposal already identifies simpler alternatives:
- Generation counters: Each capability has a generation number. Bumping the generation invalidates all references without traversing a tree.
- Proxy caps: A proxy object that can be invalidated by its creator. Callers hold the proxy, not the real capability.
- Process-lifetime revocation: When a process dies, all caps it held are automatically invalidated (seL4 does this too, but the CDT allows more fine-grained revocation within a living process).
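The generation-counter alternative above can be sketched concretely. This is an illustrative model under assumed names (`CapId`, `CapTable`, `Entry` are hypothetical here, not the capOS sources): bumping the stored generation invalidates every outstanding CapId for that slot without walking any derivation tree.

```rust
// Sketch of generation-counter revocation (illustrative only).
// A CapId carries the generation it was issued under; lookups compare it
// against the slot's current generation.
#[derive(Clone, Copy, PartialEq, Debug)]
struct CapId {
    slot: u32,
    generation: u32,
}

struct Entry {
    object: &'static str, // stands in for Arc<dyn CapObject>
    generation: u32,
}

struct CapTable {
    entries: Vec<Option<Entry>>,
}

impl CapTable {
    fn new() -> Self {
        CapTable { entries: Vec::new() }
    }

    fn insert(&mut self, object: &'static str) -> CapId {
        let slot = self.entries.len() as u32;
        self.entries.push(Some(Entry { object, generation: 0 }));
        CapId { slot, generation: 0 }
    }

    /// Lookup fails if the stored generation no longer matches the CapId's.
    fn lookup(&self, id: CapId) -> Option<&'static str> {
        match self.entries.get(id.slot as usize)? {
            Some(e) if e.generation == id.generation => Some(e.object),
            _ => None,
        }
    }

    /// Revoke all outstanding CapIds for a slot by bumping its generation.
    /// No tree traversal: every stale reference fails on its next lookup.
    fn revoke(&mut self, id: CapId) {
        if let Some(Some(e)) = self.entries.get_mut(id.slot as usize) {
            e.generation = e.generation.wrapping_add(1);
        }
    }
}

fn main() {
    let mut table = CapTable::new();
    let id = table.insert("kernel:endpoint");
    assert_eq!(table.lookup(id), Some("kernel:endpoint"));
    table.revoke(id);
    // The stale CapId no longer resolves, even though the slot still exists.
    assert_eq!(table.lookup(id), None);
}
```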
Untyped memory: seL4’s “no kernel allocation” model via untyped memory
is elegant but probably too heavyweight for capOS’s current stage. The key
takeaway is the principle: user space should control resource allocation
as much as possible. capOS’s FrameAllocator capability already moves frame
allocation authority into the capability model.
6.3 IPC Design
This is the most directly actionable area for capOS’s Stage 6.
seL4’s model (synchronous rendezvous + direct switch) vs. capOS’s model (async rings + Cap’n Proto wire format):
| Aspect | seL4 | capOS |
|---|---|---|
| IPC primitive | Synchronous endpoint | Async submission/completion rings |
| Message format | Untyped words in registers | Cap’n Proto serialized messages |
| Bulk transfer | Shared memory (explicit) | TBD (copy in kernel or shared memory) |
| Message size | Small (register-sized, ~4-8 words) | Variable (up to 64KB currently) |
| Scheduling integration | Direct switch (caller -> callee) | Baseline direct IPC handoff implemented |
| Batching | No (one message per syscall) | Yes (io_uring-style ring) |
Key lessons from seL4’s IPC for capOS:
- Direct switch for synchronous RPC: Even with async rings, capOS needs a synchronous fast path. The baseline single-CPU direct IPC handoff is implemented for the case where process A calls an Endpoint and process B is blocked waiting in RECV. Future work is register payload transfer and measured fastpath tuning.
- Register-based message transfer for small messages: seL4 avoids copying message bytes through kernel buffers for small messages by transferring them through registers during the context switch. capOS currently moves serialized payloads through ring buffers and bounded kernel scratch. For cross-process IPC, minimizing copies is critical. Options:
- Small messages (<64 bytes) could be transferred in registers during direct switch.
- Large messages could use shared memory regions (mapped into both address spaces) with IPC used only for synchronization.
- The io_uring-style rings are already shared memory – the submission and completion ring buffers could potentially be mapped into both the caller’s and callee’s address spaces for zero-copy IPC.
- Separate mechanisms for sync and async: seL4 uses endpoints for synchronous IPC and notification objects for async signaling. capOS’s io_uring approach inherently supports batched async operations, but the common case of a simple RPC call-and-wait should have a fast synchronous path too. The two mechanisms complement each other.
- Notifications for interrupts and events: seL4’s notification objects (lightweight bitmask signal/wait) map well to capOS’s interrupt delivery model. When a hardware interrupt fires, the kernel signals a notification object, and the driver thread waiting on that notification wakes up. This is cleaner than delivering interrupts as full IPC messages.
The Cap’n Proto dimension: capOS’s use of Cap’n Proto wire format for capability messages is a significant divergence from seL4’s untyped word arrays. Tradeoffs:
- Pro: Type safety, schema evolution, language-neutral interfaces, built-in serialization/deserialization, native support for capability references in messages (Cap’n Proto has a “capability table” concept in its RPC protocol).
- Con: Serialization overhead. Even Cap’n Proto’s zero-copy format requires pointer validation and bounds checking that seL4’s raw register transfer does not. For very hot IPC paths, this overhead may be significant.
- Mitigation: For the hot path, capOS could define a “small message” format that bypasses full capnp serialization – just a few raw words, similar to seL4’s register message. Fall back to full capnp for larger or more complex messages.
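The small-message mitigation can be sketched as a dual-format payload. This is a hedged sketch under assumed names (`IpcPayload`, `encode`, and the 4-word threshold are hypothetical, not something capOS currently defines): small RPCs travel as a few raw words, like seL4 register messages, and anything larger pays the serialization cost.

```rust
// Hedged sketch of a dual-format IPC payload (hypothetical design).
const SMALL_WORDS: usize = 4;

enum IpcPayload {
    /// Hot path: fits in registers during a direct switch, no serialization.
    Small { words: [u64; SMALL_WORDS], len: u8 },
    /// Cold path: full serialized message bytes, validated on receive.
    Serialized(Vec<u8>),
}

/// Pick the fast path when the arguments fit in SMALL_WORDS registers;
/// otherwise fall back to the caller-supplied serializer.
fn encode(args: &[u64], fallback: impl FnOnce(&[u64]) -> Vec<u8>) -> IpcPayload {
    if args.len() <= SMALL_WORDS {
        let mut words = [0u64; SMALL_WORDS];
        words[..args.len()].copy_from_slice(args);
        IpcPayload::Small { words, len: args.len() as u8 }
    } else {
        // Too big for registers: pay the serialization cost.
        IpcPayload::Serialized(fallback(args))
    }
}

fn main() {
    let serialize = |args: &[u64]| args.iter().flat_map(|w| w.to_le_bytes()).collect();
    match encode(&[1, 2, 3], serialize) {
        IpcPayload::Small { len, .. } => assert_eq!(len, 3),
        IpcPayload::Serialized(_) => unreachable!("3 words fit the fast path"),
    }
    let serialize = |args: &[u64]| args.iter().flat_map(|w| w.to_le_bytes()).collect();
    match encode(&[0; 8], serialize) {
        IpcPayload::Serialized(bytes) => assert_eq!(bytes.len(), 64),
        IpcPayload::Small { .. } => unreachable!("8 words exceed the fast path"),
    }
    println!("both paths exercised");
}
```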
6.4 MCS Scheduling
Priority donation via IPC: Directly relevant when capOS implements cross-process capability calls. If process A (high priority) calls a capability in process B (low priority), B needs to run at A’s priority to avoid inversion. The seL4 MCS approach of “donating” the scheduling context with the IPC message is clean and composable.
For capOS, the io_uring model complicates this slightly: if submissions are batched, which submitter’s priority should the server inherit? Options:
- Inherit the highest priority among pending submissions.
- Each submission carries its own priority/scheduling context.
- Use the synchronous fast-path (with donation) for priority-sensitive calls, and the async ring for bulk/background operations.
Passive servers: The MCS concept of servers that only consume CPU when called (by borrowing the caller’s scheduling context) maps well to capOS’s capability-based services. A network stack server that only runs when a client sends a request, consuming the client’s CPU budget, is a natural fit for capOS’s service architecture.
Temporal isolation: Budget/period enforcement prevents denial-of-service between capability holders. Even if process A holds a capability to process B, A cannot cause B to consume unbounded CPU time – B’s execution on behalf of A is bounded by A’s scheduling context budget. This is worth considering for capOS’s roadmap, especially for the cloud deployment scenario where isolation is critical.
6.5 Specific Recommendations for capOS
Near-term (Stages 5-6):
- Badge field on cap holds: Done. Manifest CapRef badge metadata is carried into cap-table hold edges, delivered to Endpoint receivers, and preserved across copy/move transfer.
- Implement direct-switch IPC for synchronous calls: Baseline done for Endpoint receivers blocked in RECV. Remaining work is the measured fastpath shape, especially small-message register transfer.
- Keep the flat CapTable: seL4’s CNode tree complexity is justified by formal verification constraints and subtree delegation. capOS’s flat table is simpler and sufficient. Add proxy/wrapper capabilities for delegation rather than restructuring the table.
- Add notification objects: A lightweight signaling primitive (word-sized bitmask, signal/wait operations) for interrupt delivery and event notification. Much cheaper than sending a full capnp message for “wake up, there’s work to do.”
Medium-term (post-Stage 6):
- Scheduling context donation: When implementing priority scheduling, attach a scheduling context to IPC calls so servers inherit caller priority. This prevents priority inversion through the capability graph.
- Capability rights attenuation: Add a rights mask to capability references so a parent can grant a cap with reduced permissions (e.g., read-only access to a read-write capability). seL4’s rights bits are: Read, Write, Grant (can pass the cap to others), GrantReply (can pass reply cap only).
- Revocation via generation/epoch counters: Generation-tagged CapIds catch stale slot reuse today. Object-wide epoch revocation remains future work and is simpler than a derivation tree.
Long-term (research directions):
- Zero-copy IPC via shared memory: For bulk data transfer between processes, map shared memory regions (Cap’n Proto segments) into both address spaces. Use IPC only for synchronization and capability transfer. This combines seL4’s “shared memory + IPC sync” pattern with capOS’s Cap’n Proto wire format.
- Rust-native verification: Track developments in Verus, Prusti, and other Rust verification tools. capOS’s Rust implementation is better positioned for future formal verification than a C implementation would be, given the type system guarantees already present.
- Untyped memory model: Consider moving kernel object allocation entirely into capability-gated operations (like seL4’s Retype). User space provides memory for all kernel objects, ensuring the kernel never runs out of memory on its own. This is a significant architectural change but aligns with the “everything is a capability” principle.
Summary Table
| seL4 Feature | Maturity | capOS Equivalent | Recommended Action |
|---|---|---|---|
| Functional correctness proof | Production | None (Rust type safety) | Track Rust verification tools |
| CNode/CSpace tree | Production | Flat CapTable | Keep flat |
| Capability badge/mint | Production | Hold-edge badge | Done baseline |
| Revocation (CDT) | Production | Generation-tagged CapId; no epoch yet | Use epoch revocation instead of CDT |
| Untyped memory / Retype | Production | FrameAllocator cap | Consider for hardening phase |
| Synchronous IPC endpoints | Production | Endpoint CALL/RECV/RETURN | Done baseline |
| IPC fastpath (direct switch) | Production | Direct IPC handoff | Done baseline; tune register payload later |
| Notification objects | Production | None | Implement as lightweight signal primitive |
| MCS Scheduling Contexts | Production | Round-robin scheduler | Implement SC donation for IPC |
| Passive servers | Production | None | Natural fit with cap-based services |
| Temporal isolation | Production | None | Consider for cloud deployment |
References
- Klein, G., et al. “seL4: Formal Verification of an OS Kernel.” SOSP 2009.
- seL4 Reference Manual, versions 12.1.0 and 13.0.0.
- “The seL4 Microkernel – An Introduction.” seL4 Foundation Whitepaper, 2020.
- Lyons, A., et al. “Scheduling-context capabilities: A principled, light-weight operating-system mechanism for managing time.” EuroSys 2018.
- Heiser, G., & Elphinstone, K. “L4 Microkernels: The Lessons from 20 Years of Research and Deployment.” ACM Transactions on Computer Systems, 2016.
- seL4 source code: https://github.com/seL4/seL4
- seL4 API documentation: https://docs.sel4.systems/
Fuchsia Zircon Kernel: Research Report for capOS
Research into Zircon’s design for informing capOS capability model, IPC, virtual memory, async I/O, and interface definition decisions.
1. Handle-Based Capability Model
Overview
Zircon implements capabilities as handles. A handle is a process-local integer (similar to a Unix file descriptor) that references a kernel object and carries a bitmask of rights. The kernel maintains a per-process handle table that maps handle values to (kernel_object_pointer, rights) pairs. Processes can only interact with kernel objects through handles they hold.
There is no ambient authority in Zircon. A process cannot address kernel objects by name, path, or global ID – it must possess a handle. The initial set of handles is passed to a process at creation time by its parent (or by the component framework).
Handle Representation
Internally, a handle is:
- A process-local 32-bit integer (the “handle value”). The low two bits encode a generation counter to detect use-after-close.
- A reference to a kernel object (a refcounted Dispatcher in Zircon’s C++).
- A rights bitmask (zx_rights_t, a uint32_t).
The handle table is per-process, so handle value 0x1234 in process A and
0x1234 in process B refer to completely different objects (or nothing).
Rights
Rights are a bitmask that constrain what operations a handle can perform. Key rights include:
| Right | Meaning |
|---|---|
| ZX_RIGHT_DUPLICATE | Can be duplicated via zx_handle_duplicate() |
| ZX_RIGHT_TRANSFER | Can be sent through a channel |
| ZX_RIGHT_READ | Can read data (channel messages, VMO bytes) |
| ZX_RIGHT_WRITE | Can write data |
| ZX_RIGHT_EXECUTE | VMO can be mapped as executable |
| ZX_RIGHT_MAP | VMO can be mapped into a VMAR |
| ZX_RIGHT_GET_PROPERTY | Can query object properties |
| ZX_RIGHT_SET_PROPERTY | Can modify object properties |
| ZX_RIGHT_SIGNAL | Can set user signals on the object |
| ZX_RIGHT_WAIT | Can wait on the object’s signals |
| ZX_RIGHT_MANAGE_PROCESS | Can perform management ops on a process |
| ZX_RIGHT_MANAGE_THREAD | Can manage threads |
When a syscall is invoked on a handle, the kernel checks that the handle’s
rights include the rights required by that syscall. For example,
zx_channel_write() requires ZX_RIGHT_WRITE on the channel handle.
Rights can only be reduced, never amplified. zx_handle_duplicate() takes
a rights mask and the new handle gets original_rights & requested_rights.
Handle Lifecycle
Creation: Syscalls that create kernel objects return handles. For example,
zx_channel_create() returns two handles (one for each endpoint).
zx_vmo_create() returns a VMO handle. The initial rights are defined per
object type (e.g., a new channel endpoint gets
READ|WRITE|TRANSFER|DUPLICATE|SIGNAL|WAIT).
Duplication: zx_handle_duplicate(handle, rights) -> new_handle. Creates
a second handle to the same kernel object, possibly with reduced rights. The
original is untouched. Requires ZX_RIGHT_DUPLICATE on the source handle.
Transfer: Handles are transferred through channels. When a message is
written to a channel, handles listed in the message are moved from the
sender’s handle table to a transient state inside the channel message. When the
message is read, those handles are installed into the receiver’s handle table
with new handle values. The original handle values in the sender become invalid.
Transfer requires ZX_RIGHT_TRANSFER on each handle being sent.
Replacement: zx_handle_replace(handle, rights) -> new_handle. Atomically
invalidates the old handle and creates a new one with the specified rights
(must be a subset). This avoids a window where two handles exist simultaneously
(unlike duplicate-then-close). Useful for reducing rights before transferring.
Closing: zx_handle_close(handle). Removes the handle from the process’s
table and decrements the kernel object’s refcount. When the last handle to an
object is closed, the object is destroyed (with some exceptions like the
kernel itself keeping references).
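The duplicate/replace attenuation rules above can be modeled in a few lines. This is a toy model of the semantics, not the Zircon sources: the table, right constants, and method names are hypothetical stand-ins for the `zx_handle_*` behavior described in this section.

```rust
// Toy model of Zircon-style handle rights attenuation (illustrative only).
const RIGHT_READ: u32 = 1 << 0;
const RIGHT_WRITE: u32 = 1 << 1;
const RIGHT_DUPLICATE: u32 = 1 << 2;

#[derive(Clone)]
struct Handle {
    object: &'static str, // stands in for the refcounted kernel object
    rights: u32,
}

struct HandleTable {
    handles: Vec<Option<Handle>>,
}

impl HandleTable {
    fn new() -> Self {
        HandleTable { handles: Vec::new() }
    }

    fn install(&mut self, h: Handle) -> usize {
        self.handles.push(Some(h));
        self.handles.len() - 1
    }

    /// zx_handle_duplicate semantics: new rights = original & requested,
    /// and the source handle must itself carry RIGHT_DUPLICATE.
    fn duplicate(&mut self, hv: usize, requested: u32) -> Option<usize> {
        let src = self.handles.get(hv)?.clone()?;
        if src.rights & RIGHT_DUPLICATE == 0 {
            return None;
        }
        let dup = Handle { object: src.object, rights: src.rights & requested };
        Some(self.install(dup))
    }

    /// zx_handle_replace semantics: atomically invalidate the old handle and
    /// issue a new one with reduced rights (no two-handle window).
    fn replace(&mut self, hv: usize, requested: u32) -> Option<usize> {
        let old = self.handles.get_mut(hv)?.take()?;
        Some(self.install(Handle { object: old.object, rights: old.rights & requested }))
    }

    fn rights(&self, hv: usize) -> Option<u32> {
        self.handles.get(hv)?.as_ref().map(|h| h.rights)
    }
}

fn main() {
    let mut t = HandleTable::new();
    let h = t.install(Handle {
        object: "vmo",
        rights: RIGHT_READ | RIGHT_WRITE | RIGHT_DUPLICATE,
    });
    // Duplicate with reduced rights: the copy is read-only.
    let ro = t.duplicate(h, RIGHT_READ).unwrap();
    assert_eq!(t.rights(ro), Some(RIGHT_READ));
    // Replace invalidates the original handle value atomically.
    let new = t.replace(h, RIGHT_READ | RIGHT_WRITE).unwrap();
    assert_eq!(t.rights(h), None);
    assert_eq!(t.rights(new), Some(RIGHT_READ | RIGHT_WRITE));
}
```

Note how rights only ever shrink: both `duplicate` and `replace` mask with the requested rights, matching the "reduce, never amplify" rule.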
Comparison to capOS
capOS’s current CapTable maps CapId (u32) to an Arc<dyn CapObject>. The
shared Arc lets a single kernel capability (for example, a kernel:endpoint
owned by one service and referenced by another through CapSource::Service)
back multiple per-process CapTable slots for cross-process IPC. This is
conceptually similar to Zircon’s handle table, but with key differences:
| Aspect | Zircon | capOS (current) |
|---|---|---|
| Rights | Bitmask per handle | None (all-or-nothing) |
| Object types | Fixed kernel types (Channel, VMO, etc.) | Extensible via CapObject trait |
| Transfer | Move semantics through channels | Copy/move descriptors through Endpoint IPC |
| Duplication | Explicit with rights reduction | Copy transfer for transferable holds |
| Revocation | Close handle; object dies with last ref | Remove from table; no propagation |
| Interface | Fixed syscall per object type | Cap’n Proto method dispatch |
| Generation counter | Low bits of handle value | Upper bits of CapId |
Recommendations for capOS:
- Keep method authority in typed interfaces for now. Zircon’s rights bitmask is useful for an untyped syscall surface. capOS currently uses narrow Cap’n Proto interfaces plus hold-edge transfer metadata; generic READ/WRITE flags would duplicate schema-level authority unless a concrete cross-interface need appears.
- Handle generation counters. Implemented: capOS encodes a generation tag in the upper bits of CapId, with lower bits selecting the table slot. This catches stale CapId use after slot reuse.
- Move semantics for transfer. Implemented for Endpoint CALL/RETURN sideband descriptors. Copy transfer remains explicit and requires a transferable source hold.
- replace operation. An atomic replace (invalidate old, create new with reduced rights) is cleaner than duplicate-then-close for rights attenuation before transfer.
2. Channels
Overview
Zircon channels are the fundamental IPC primitive. A channel is a bidirectional, asynchronous message-passing pipe with two endpoints. Each endpoint is a separate kernel object referenced by a handle.
Creation and Structure
zx_channel_create(options, &handle0, &handle1) creates a channel and returns
handles to both endpoints. Each endpoint can be independently transferred to
different processes. When one endpoint is closed, the other becomes
“peer-closed” (signaled with ZX_CHANNEL_PEER_CLOSED).
Message Format
A channel message consists of:
- Data: Up to 65,536 bytes (64 KiB) of arbitrary byte payload.
- Handles: Up to 64 handles transferred with the message.
Messages are discrete and ordered (FIFO). There is no streaming or partial reads – you read a complete message or nothing.
Write and Read Syscalls
Write: zx_channel_write(handle, options, bytes, num_bytes, handles, num_handles)
- Copies bytes into the kernel message queue.
- Moves each handle in the handles array from the caller’s handle table into the message. If any handle is invalid or lacks ZX_RIGHT_TRANSFER, the entire write fails and no handles are moved.
- The write is non-blocking. If the peer has been closed, returns ZX_ERR_PEER_CLOSED.
Read: zx_channel_read(handle, options, bytes, handles, num_bytes, num_handles, actual_bytes, actual_handles)
- Dequeues the next message. Copies data into bytes and installs handles into the caller’s handle table, writing new handle values into the handles array.
- If the buffer is too small, returns ZX_ERR_BUFFER_TOO_SMALL and fills actual_bytes/actual_handles so the caller can retry with a larger buffer.
- Non-blocking by default.
zx_channel_call: A synchronous call primitive. Writes a message to the
channel, then blocks waiting for a reply with a matching transaction ID. This
is the primary mechanism for client-server RPC. The kernel optimizes this path
to avoid unnecessary scheduling: if the server thread is waiting to read, the
kernel can directly switch to it (similar to L4 IPC optimizations).
Handle Transfer Mechanics
When handles are sent through a channel:
- The kernel validates all handles (they exist and carry the TRANSFER right).
- Handles are atomically removed from the sender’s table.
- Handle objects are stored inside the kernel message structure.
- On read, handles are inserted into the receiver’s table with fresh handle values.
- If the channel is destroyed with unread messages containing handles, those handles are closed (objects’ refcounts decremented).
This is critical: handle transfer is move, not copy. The sender loses the
handle. To keep a copy, the sender must duplicate before sending.
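The move semantics can be made concrete with a toy channel queue. This is an illustrative sketch, not Zircon code: the `Channel`, `Message`, and handle-table types are hypothetical, but the key property holds: writing moves the handle out of the sender, and reading installs it in the receiver.

```rust
// Toy sketch of move-semantics handle transfer through a channel queue
// (illustrative only; names are hypothetical, not Zircon's implementation).
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
struct Handle {
    object: &'static str,
}

struct Message {
    bytes: Vec<u8>,
    handles: Vec<Handle>, // handles live inside the message while in transit
}

struct Channel {
    queue: VecDeque<Message>,
}

/// Write moves the handles into the queued message; the sender keeps nothing.
fn channel_write(ch: &mut Channel, bytes: Vec<u8>, handles: Vec<Handle>) {
    ch.queue.push_back(Message { bytes, handles });
}

/// Read installs the queued handles into the receiver's table.
fn channel_read(ch: &mut Channel, receiver: &mut Vec<Handle>) -> Option<Vec<u8>> {
    let msg = ch.queue.pop_front()?;
    receiver.extend(msg.handles); // a real kernel issues fresh handle values here
    Some(msg.bytes)
}

fn main() {
    let mut ch = Channel { queue: VecDeque::new() };
    let mut sender: Vec<Handle> = vec![Handle { object: "vmo" }];
    let mut receiver: Vec<Handle> = Vec::new();

    // Removing the handle from the sender's table models the move:
    // after the write, the sender no longer holds it.
    let sent = sender.remove(0);
    channel_write(&mut ch, b"hello".to_vec(), vec![sent]);
    assert!(sender.is_empty());

    let bytes = channel_read(&mut ch, &mut receiver).unwrap();
    assert_eq!(bytes, b"hello".to_vec());
    assert_eq!(receiver, vec![Handle { object: "vmo" }]);
}
```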
Signals
Each channel endpoint has associated signals:
- ZX_CHANNEL_READABLE – at least one message is queued.
- ZX_CHANNEL_PEER_CLOSED – the other endpoint was closed.
Processes can wait on these signals using zx_object_wait_one(),
zx_object_wait_many(), or by binding to a port (see Section 4).
FIDL Relationship
Channels carry raw bytes + handles. FIDL (Section 5) provides the structured protocol layer on top: it defines how bytes are laid out (message header with transaction ID, ordinal, flags; then the payload) and how handles in the message correspond to protocol-level concepts (client endpoints, server endpoints, VMOs, etc.).
Every FIDL protocol communication happens over a channel. A FIDL “client end” is a channel endpoint handle where the client sends requests and reads responses. A “server end” is the other endpoint where the server reads requests and sends responses.
Comparison to capOS
capOS currently uses shared submission/completion rings with Endpoint objects for cross-process CALL/RECV/RETURN routing. Same-process capabilities dispatch directly through the holder’s table; cross-process Endpoint calls queue to the server ring and can trigger a direct IPC handoff when the receiver is blocked.
| Aspect | Zircon Channels | capOS |
|---|---|---|
| Topology | Point-to-point, 2 endpoints | Endpoint-routed capability calls |
| Async | Non-blocking read/write + signal waits | Shared SQ/CQ rings |
| Handle/cap transfer | Embedded in messages | Sideband transfer descriptors |
| Message format | Raw bytes + handles | Cap’n Proto serialized |
| Size limits | 64 KiB data, 64 handles | 64 KiB params (current limit) |
| Buffering | Kernel-side message queue | Endpoint queues plus per-process rings |
Recommendations for capOS:
- Capability transfer alongside capnp messages. Zircon embeds handles as out-of-band data alongside message bytes. capOS has adopted the same separation with ring sideband transfer descriptors and result-cap records. That keeps the kernel from parsing arbitrary Cap’n Proto payload graphs.
- Two-endpoint channels vs. Endpoint calls. Zircon’s channels are general-purpose pipes. capOS uses a lighter Endpoint CALL/RECV/RETURN model where a capability invocation is routed to the serving process rather than requiring a channel object per connection.
- Message size limits. Zircon’s 64 KiB limit has been a pain point (large data must go through VMOs). capOS’s capnp messages naturally handle this because large data can be a separate VMO-like capability referenced in the message. Keep the per-message limit reasonable (64 KiB is a good default) and use capability references for bulk data.
3. VMARs and VMOs
Virtual Memory Objects (VMOs)
A VMO is a kernel object representing a contiguous region of virtual memory that can be mapped into address spaces. VMOs are the fundamental unit of memory in Zircon.
Types:
- Paged VMO: Backed by the page fault handler. Pages are allocated on demand. This is the default.
- Physical VMO: Backed by a specific contiguous range of physical memory. Used for device MMIO.
- Contiguous VMO: Like a paged VMO but guarantees physically contiguous pages. Used for DMA.
Key operations:
- zx_vmo_create(size, options) -> handle: Create a paged VMO.
- zx_vmo_read(handle, buffer, offset, length): Read bytes from a VMO.
- zx_vmo_write(handle, buffer, offset, length): Write bytes to a VMO.
- zx_vmo_get_size() / zx_vmo_set_size(): Query or resize.
- zx_vmo_op_range(): Operations like commit (force-allocate pages), decommit (release pages back to the system), and cache ops.
VMOs can be read/written directly via syscalls without mapping them. This is useful for small transfers but less efficient than mapping for large data.
Copy-on-Write (CoW) Cloning
zx_vmo_create_child(handle, options, offset, size) -> child_handle
Creates a child VMO that is a CoW clone of a range within the parent. Several clone types exist:
- Snapshot (ZX_VMO_CHILD_SNAPSHOT): Point-in-time snapshot. Both parent and child see CoW pages. Writes to either side trigger page copies. The child is fully independent after creation – closing the parent does not affect committed pages in the child.
- Slice (ZX_VMO_CHILD_SLICE): A window into the parent. No CoW – writes to the slice are visible through the parent and vice versa. The child cannot outlive the parent.
- Snapshot-at-least-on-write (ZX_VMO_CHILD_SNAPSHOT_AT_LEAST_ON_WRITE): Like snapshot, but allows the implementation to share unchanged pages between parent and child more aggressively (pages only diverge when written).
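The visibility difference between snapshot and slice children can be shown with a toy page model. This is an illustrative sketch of the semantics only: CoW is modeled as an eager copy rather than lazy page divergence, and the names are hypothetical, not Zircon's implementation.

```rust
// Toy page-level model of VMO snapshot vs. slice visibility (illustrative;
// CoW is modeled as an eager copy at clone time, not lazy page divergence).
use std::cell::RefCell;
use std::rc::Rc;

type Pages = Rc<RefCell<Vec<u8>>>;

/// A snapshot child takes the parent's pages at creation time; later writes
/// on either side are invisible to the other.
fn snapshot_child(parent: &Pages) -> Pages {
    Rc::new(RefCell::new(parent.borrow().clone()))
}

/// A slice child shares the parent's pages: writes are visible both ways.
fn slice_child(parent: &Pages) -> Pages {
    Rc::clone(parent)
}

fn main() {
    let parent: Pages = Rc::new(RefCell::new(vec![0u8; 4]));

    let snap = snapshot_child(&parent);
    parent.borrow_mut()[0] = 7;
    // The snapshot does not see post-snapshot writes to the parent.
    assert_eq!(snap.borrow()[0], 0);

    let slice = slice_child(&parent);
    slice.borrow_mut()[1] = 9;
    // The slice is a window: the parent sees writes made through it.
    assert_eq!(parent.borrow()[1], 9);

    println!("snapshot diverged, slice shared");
}
```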
CoW cloning is central to how Fuchsia implements fork()-like semantics for
memory (though Fuchsia doesn’t have fork()) and how it shares immutable data
(e.g., shared libraries are CoW-cloned VMOs).
Virtual Memory Address Regions (VMARs)
A VMAR represents a contiguous range of virtual address space within a process. VMARs form a tree rooted at the process’s root VMAR, which covers the entire user-accessible address space.
Hierarchy:
Root VMAR (entire user address space)
+-- Sub-VMAR A (e.g., 0x1000..0x10000)
| +-- Mapping of VMO X at offset 0x1000
| +-- Sub-VMAR B (0x5000..0x8000)
| +-- Mapping of VMO Y at offset 0x5000
+-- Sub-VMAR C (0x20000..0x30000)
+-- Mapping of VMO Z at offset 0x20000
Key operations:
- zx_vmar_map(vmar, options, offset, vmo, vmo_offset, len) -> addr: Map a VMO (or a range of it) into the VMAR, either at a specific offset or letting the kernel choose (ASLR).
- zx_vmar_unmap(vmar, addr, len): Remove a mapping.
- zx_vmar_protect(vmar, options, addr, len): Change permissions (read/write/execute) on a mapped range.
- zx_vmar_allocate(vmar, options, offset, size) -> child_vmar, addr: Create a sub-VMAR.
- zx_vmar_destroy(vmar): Recursively unmap everything, destroy all sub-VMARs, and prevent new mappings.
ASLR: Zircon implements address space layout randomization through VMARs.
When a mapping does not request a fixed offset (ZX_VM_SPECIFIC), the kernel
randomizes placement within the VMAR; ZX_VM_OFFSET_IS_UPPER_LIMIT can
additionally bound how high the chosen address may land.
Permissions: Mapping permissions (R/W/X) are constrained by the VMO
handle’s rights. A VMO handle without ZX_RIGHT_EXECUTE cannot be mapped
as executable, regardless of what the zx_vmar_map() call requests.
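The rights-masking rule is simple enough to state as a subset check. Here is a minimal Python sketch of the invariant; the bit names and `vmar_map` function are illustrative, not Zircon's actual signature:

```python
# Invariant: requested mapping permissions must be a subset of the
# rights held on the VMO handle.
READ, WRITE, EXECUTE = 1, 2, 4

def vmar_map(handle_rights: int, requested_perms: int) -> int:
    """Return the granted perms, or raise if the handle lacks a right."""
    if requested_perms & ~handle_rights:
        raise PermissionError("mapping requests rights the handle lacks")
    return requested_perms

# A handle without EXECUTE can never produce an executable mapping:
vmar_map(READ | WRITE, READ | WRITE)            # allowed
try:
    vmar_map(READ | WRITE, READ | EXECUTE)      # denied
except PermissionError:
    denied = True
```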
Why VMARs Matter
VMARs provide:
- Sandboxing within a process. A component can be given a sub-VMAR handle instead of the root VMAR, limiting where it can map memory.
- Hierarchical cleanup. Destroying a VMAR recursively unmaps everything beneath it.
- Controlled mapping. The parent decides the address space layout for child components by allocating sub-VMARs and passing only sub-VMAR handles.
Comparison to capOS
capOS currently has AddressSpace plus a VirtualMemory capability for
anonymous map/unmap/protect operations. There is no VMO-like shared memory
object yet; FrameAllocator still exposes raw physical frame grants.
| Aspect | Zircon | capOS (current) |
|---|---|---|
| Memory objects | VMO (paged, physical, contiguous) | Raw frames plus anonymous VirtualMemory mappings |
| CoW | VMO child clones (snapshot, slice) | Not implemented |
| Address space | VMAR tree | Flat AddressSpace plus VirtualMemory cap |
| Sharing | Map same VMO in multiple processes | Not implemented |
| Permissions | Per-mapping + per-handle rights | Per-page flags at mapping time |
Recommendations for capOS:
- VMO-equivalent capability. A “MemoryObject” capability that represents a range of memory (backed by demand paging or physical pages). This becomes the unit of sharing: pass a MemoryObject cap through IPC, and the receiver maps it into their address space. Define it in schema/capos.capnp.
- Sub-VMAR capabilities for sandboxing. When spawning a process, instead of granting access to the full address space, grant a sub-region capability. This limits where the process can map memory.
- CoW cloning is valuable but not urgent. The primary use cases (shared libraries, fork) may not apply to capOS’s early stages. Design the VMO interface to support cloning later.
- VMO read/write without mapping. Zircon allows reading/writing VMO contents via syscall without mapping. This is useful for small IPC data and avoids TLB pressure. Consider supporting this in capOS’s MemoryObject.
4. Async Model (Ports)
Overview
Zircon’s async I/O model is built around ports – kernel objects that
receive event packets. A port is similar to Linux’s epoll but with important
differences. It is the foundation for all async programming in Fuchsia.
Port Basics
A port is a kernel object with a queue of packets (zx_port_packet_t).
Packets arrive either from signal-based waits or from direct user queuing.
Key operations:
- zx_port_create(options) -> handle: Create a port.
- zx_port_wait(port, deadline) -> packet: Dequeue the next packet, blocking until one is available or the deadline expires.
- zx_port_queue(port, packet): Manually enqueue a user packet.
- zx_port_cancel(port, source, key): Cancel pending waits.
Signal-Based Async (Object Wait Async)
zx_object_wait_async(object, port, key, signals, options):
This is the primary mechanism. It tells the kernel: “when object has any of
these signals asserted, deliver a packet to port with this key.”
Two modes:
- One-shot (ZX_WAIT_ASYNC_ONCE): The wait fires once and is automatically removed. The user must re-register after handling.
- Edge-triggered (ZX_WAIT_ASYNC_EDGE): Fires each time a signal transitions from deasserted to asserted. Stays registered.
Packet Format
typedef struct zx_port_packet {
uint64_t key; // User-defined key (set during wait_async)
uint32_t type; // ZX_PKT_TYPE_SIGNAL_ONE, ZX_PKT_TYPE_USER, etc.
zx_status_t status; // Result status
union {
zx_packet_signal_t signal; // Which signals triggered
zx_packet_user_t user; // User-queued packet payload (32 bytes)
zx_packet_guest_bell_t guest_bell;
// ... other packet types
};
} zx_port_packet_t;
The signal variant includes trigger (which signals were waited on),
observed (current signal state), and a count (for edge-triggered, how many
transitions).
Async Dispatching (libasync)
Fuchsia’s userspace async library (libfidl, async-loop) provides a
higher-level event loop:
- async::Loop: An event loop that owns a port and dispatches events to registered handlers.
- async::Wait: Wraps zx_object_wait_async() with a callback. When the signal fires, the loop calls the handler.
- async::Task: Runs a closure on the loop’s dispatcher.
- FIDL bindings: The async FIDL bindings register channel-readable waits on the loop’s port. When a message arrives, the FIDL dispatcher decodes it and calls the appropriate protocol method handler.
The typical pattern:
loop = async::Loop()
loop.port -> zx_port_create()
// Register interest in channel readability
zx_object_wait_async(channel, loop.port, key, ZX_CHANNEL_READABLE)
// Event loop
while True:
packet = zx_port_wait(loop.port)
handler = lookup(packet.key)
handler(packet)
// Re-register if one-shot
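The one-shot re-registration step is easy to get wrong, so here is a runnable toy version of the pattern. This is a hedged Python sketch: the `Port` class simulates the kernel packet queue with a plain deque, and none of the names correspond to real zx_* calls:

```python
from collections import deque

class Port:
    """In-process stand-in for a Zircon port with one-shot waits."""
    def __init__(self):
        self.packets = deque()      # queued packet keys
        self.waits = {}             # key -> (source, callback)

    def wait_async(self, source, key, callback):
        # One-shot registration: removed when the packet is delivered.
        self.waits[key] = (source, callback)

    def signal(self, source):
        # Queue a packet for every one-shot wait armed on this source.
        for key, (src, _) in list(self.waits.items()):
            if src is source:
                self.packets.append(key)

    def wait(self):
        key = self.packets.popleft()
        _, callback = self.waits.pop(key)   # one-shot: deregister
        return key, callback

port = Port()
events = []
channel = object()                  # stand-in for a waitable object

def on_readable(key):
    events.append(key)
    # One-shot semantics: the handler must re-arm to keep listening.
    port.wait_async(channel, key, on_readable)

port.wait_async(channel, key=7, callback=on_readable)
port.signal(channel)                # "channel became readable"
key, cb = port.wait()
cb(key)                             # dispatch: handler runs and re-arms
```

After one dispatch cycle, the handler has fired once and the wait is armed again, mirroring the ZX_WAIT_ASYNC_ONCE pattern above.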
Comparison to Linux io_uring
| Aspect | Zircon Ports | Linux io_uring |
|---|---|---|
| Model | Event notification (signals) | Operation submission/completion |
| Submission | No SQ; operations are separate syscalls | SQ ring: batch operations |
| Completion | Port packet queue | CQ ring in shared memory |
| Kernel transitions | One per wait_async + one per port_wait | One per io_uring_enter (batched) |
| Memory sharing | No shared ring buffers | SQ/CQ are mmap’d shared memory |
| Zero-copy | Not for port packets | Registered buffers, fixed files |
| Batching | No inherent batching | Core design: submit N ops, one syscall |
| Chaining | Not supported | SQE linking (sequential/parallel) |
| Scope | Signal notification only | Full I/O operations (read, write, send, recv, fsync, …) |
Key differences:
- Ports are notification-based; io_uring is operation-based. A port tells you “something happened” (a signal was asserted), and you then make separate syscalls to act on it (read the channel, accept the socket, etc.). io_uring lets you submit the actual I/O operation; the kernel performs it asynchronously and returns the result in the completion ring.
- io_uring avoids syscalls for submission. The submission queue is shared memory – userspace writes SQEs and the kernel reads them without a syscall (in polling mode) or with a single io_uring_enter() for a batch of operations. Ports require a syscall per wait_async registration.
- io_uring supports chaining. SQE linking allows dependent operations (e.g., “read from file, then write to socket”) without returning to userspace between steps.
- Ports are simpler. The signal model is straightforward and composes well with Zircon’s object model. io_uring’s complexity (dozens of opcodes, registered buffers, fixed files, kernel-side polling) is much higher.
Performance Tradeoffs
Ports:
- Pro: Simple, well-integrated with kernel object model, easy to reason about.
- Con: Extra syscalls per operation (wait_async to register, port_wait to receive, then the actual operation syscall). At least 3 syscalls per async operation.
io_uring:
- Pro: Can batch many operations in a single syscall. Shared-memory rings avoid copies. Kernel-side polling can eliminate syscalls entirely.
- Con: Complex API surface, security attack surface (many kernel bugs have been in io_uring), complex state management.
Comparison to capOS’s Planned Async Rings
capOS plans io_uring-inspired capability rings: an SQ where userspace submits capnp-serialized capability invocations and a CQ where the kernel posts completions.
| Aspect | Zircon Ports | capOS Planned Rings |
|---|---|---|
| Submission | Separate syscalls | SQ in shared memory |
| Completion | Port packet queue (kernel-owned) | CQ in shared memory |
| Operation scope | Signal notification only | Full capability invocations |
| Batching | None | Natural (fill SQ, single syscall) |
| Wire format | Fixed packet struct | Cap’n Proto messages |
Recommendations for capOS:
- The io_uring model is better than ports for capOS’s use case. Since every operation in capOS is a capability invocation (not just a signal notification), putting the full operation in the submission ring eliminates the extra round-trip that ports require. This is the right choice.
- Keep a signal/notification mechanism too. Even with async rings, capOS needs a way to wait for events (e.g., “data available on this channel”, “process exited”). Consider a simple signal/wait mechanism alongside the capability rings – perhaps signal delivery goes through the CQ as a special completion type.
- Study io_uring’s SQE linking. Chaining dependent capability calls (e.g., “read from FileStore, then write to Console”) without returning to userspace is powerful. This maps naturally to Cap’n Proto promise pipelining: “call method A on cap X, then call method B on the result’s capability” – the kernel can chain these internally.
- Registered/fixed capabilities. io_uring has “fixed files” (a registered fd set for faster lookup). capOS could keep a “hot set” of capabilities pinned in the SQ context for faster dispatch, avoiding a per-call table lookup.
- Completion ordering. io_uring completions can arrive out of order. capOS’s CQ should also support out-of-order completion (each SQE carries a user_data tag echoed in the CQE) to enable true async pipelining.
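The user_data matching rule in the last recommendation can be sketched concretely. This is an illustrative Python model – `Sqe` and `Cqe` are invented stand-ins for ring entries, not capOS or io_uring structures:

```python
# Sketch: completions may arrive out of order; the user_data tag ties
# each CQE back to the SQE that produced it.
from dataclasses import dataclass

@dataclass
class Sqe:
    user_data: int
    op: str

@dataclass
class Cqe:
    user_data: int
    result: int

pending = {}
for sqe in [Sqe(1, "read"), Sqe(2, "write"), Sqe(3, "read")]:
    pending[sqe.user_data] = sqe        # submit, remember by tag

completed = []
# The "kernel" completes in a different order than submission:
for cqe in [Cqe(3, 64), Cqe(1, 128), Cqe(2, 0)]:
    sqe = pending.pop(cqe.user_data)    # match by tag, not by position
    completed.append((sqe.op, cqe.result))

assert not pending                      # every submission was matched
```

Because matching is by tag rather than queue position, submitters never have to serialize on completion order, which is what enables deep pipelining.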
5. FIDL (Fuchsia Interface Definition Language)
Overview
FIDL is Fuchsia’s IDL for defining protocols that communicate over channels. It serves a similar role to Cap’n Proto schemas in capOS: defining the contract between client and server.
FIDL vs. Cap’n Proto: Schema Language
FIDL example:
library fuchsia.example;
type Color = strict enum : uint32 {
RED = 1;
GREEN = 2;
BLUE = 3;
};
protocol Painter {
SetColor(struct { color Color; }) -> ();
DrawLine(struct { x0 float32; y0 float32; x1 float32; y1 float32; }) -> ();
-> OnPaintComplete(struct { num_pixels uint64; });
};
Equivalent Cap’n Proto:
enum Color { red @0; green @1; blue @2; }
interface Painter {
setColor @0 (color :Color) -> ();
drawLine @1 (x0 :Float32, y0 :Float32, x1 :Float32, y1 :Float32) -> ();
}
Key differences in the schema language:
| Feature | FIDL | Cap’n Proto |
|---|---|---|
| Unions | flexible union, strict union | Anonymous unions in structs |
| Enums | strict enum, flexible enum | enum (always strict) |
| Optionality | box<T>, nullable types | Default values, union with Void |
| Evolution | flexible keyword for forward compat | Field numbering, @N ordinals |
| Tables | table (like protobuf, sparse) | struct with default values |
| Events | -> EventName(...) server-sent | No built-in events |
| Error syntax | -> () error uint32 | Must encode in return struct |
| Capability types | client_end:P, server_end:P | interface P as field type |
FIDL’s table type is analogous to Cap’n Proto structs in terms of
evolvability (can add fields without breaking), but Cap’n Proto structs are
more compact on the wire (fixed-size inline section + pointers) while FIDL
tables use an envelope-based encoding.
Wire Format Comparison
FIDL wire format:
- Little-endian, 8-byte aligned.
- Messages have a 16-byte header: txid (4 bytes), flags (3 bytes), a magic byte (0x01), and an ordinal (8 bytes).
- Structs are laid out inline with natural alignment and explicit padding.
- Out-of-line data (strings, vectors, tables) uses offset-based indirection via “envelopes” (an inline 8-byte entry: 4 bytes num_bytes, 2 bytes num_handles, 2 bytes flags).
- Handles are out-of-band. The wire format contains ZX_HANDLE_PRESENT (0xFFFFFFFF) or ZX_HANDLE_ABSENT (0x00000000) markers where handles appear. The actual handles travel in the channel message’s handle array, consumed in order of appearance in the linearized message.
- Encoding is done into a contiguous byte buffer plus a separate handle array, matching the channel write API.
- No pointer arithmetic. FIDL v2 uses a “depth-first traversal order” encoding where out-of-line objects are laid out sequentially. Offsets are not stored; the decoder walks the type schema to find boundaries.
Cap’n Proto wire format:
- Little-endian, 8-byte aligned (word-based).
- Messages have a segment table header listing segment sizes.
- Structs have a fixed data section + pointer section. Pointers are relative offsets (self-relative, in words).
- Uses pointer-based random access: can read any field without parsing the entire message.
- Capabilities are indexed. Cap’n Proto’s RPC protocol assigns capability table indices to interface references in messages. The actual capability (file descriptor, handle, etc.) is transferred out-of-band.
- Supports multi-segment messages (FIDL is always single-segment).
- Zero-copy read: can read directly from the wire buffer without deserialization.
Key wire format differences:
| Property | FIDL | Cap’n Proto |
|---|---|---|
| Random access | No (sequential decode) | Yes (pointer-based) |
| Zero-copy read | Partial (decode-on-access for some types) | Full (read from buffer) |
| Segments | Single contiguous buffer | Multi-segment |
| Pointers | Implicit (traversal order) | Explicit (relative offsets) |
| Size overhead | Smaller (no pointer words) | Larger (pointer section) |
| Decode cost | Must validate sequentially | Can validate lazily |
| Handle/cap encoding | Presence markers + out-of-band array | Cap table indices + out-of-band |
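The handle-encoding row deserves a concrete illustration, since both systems split data from capabilities. Below is a hedged Python sketch of FIDL's presence-marker scheme; the `encode` helper and its field representation are invented for the example:

```python
# Sketch of FIDL's handle split: the byte stream carries only presence
# markers, while actual handles travel in a separate ordered array.
ZX_HANDLE_PRESENT = 0xFFFFFFFF
ZX_HANDLE_ABSENT = 0x00000000

def encode(fields):
    """fields: plain ints (data) or ('handle', h) tuples (capabilities)."""
    wire, handles = [], []
    for f in fields:
        if isinstance(f, tuple) and f[0] == "handle":
            wire.append(ZX_HANDLE_PRESENT)   # marker in the byte stream
            handles.append(f[1])             # real handle goes out-of-band
        else:
            wire.append(f)
    return wire, handles

wire, handles = encode([42, ("handle", 1001), 7, ("handle", 1002)])
# Receivers consume the handle array in order of marker appearance.
```

Cap'n Proto's scheme differs only in that the in-stream value is a capability-table index rather than a fixed presence marker, so a message can reference the same capability twice.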
FIDL Capability Transfer
FIDL has first-class syntax for capability transfer in protocols:
protocol FileSystem {
Open(resource struct {
path string:256;
flags uint32;
object server_end:File;
}) -> ();
};
protocol File {
Read(struct { count uint64; }) -> (struct { data vector<uint8>:MAX; });
GetBuffer(struct { flags uint32; }) -> (resource struct { buffer zx.Handle:VMO; });
};
- server_end:File – a channel endpoint where the server will serve the File protocol. The client creates a channel, keeps the client end, and sends the server end through this call.
- client_end:File – a channel endpoint for a client of the File protocol.
- zx.Handle:VMO – a handle to a specific kernel object type (VMO).
- The resource keyword marks types that contain handles (and thus cannot be copied, only moved).
The FIDL compiler tracks handle ownership: types containing handles are
“resource types” with move semantics. This is enforced at the language binding
level (e.g., in C++, resource types are move-only; in Rust, they implement
Drop but not Clone).
Comparison to capOS’s Cap’n Proto Usage
Cap’n Proto natively supports capability transfer through its interface
types:
interface FileSystem {
open @0 (path :Text, flags :UInt32) -> (file :File);
}
interface File {
read @0 (count :UInt64) -> (data :Data);
getBuffer @1 (flags :UInt32) -> (buffer :MemoryObject);
}
In standard Cap’n Proto RPC, file :File in the return type means “a
capability to a File interface.” The RPC system assigns a capability table
index, transfers it out-of-band, and the receiver gets a live reference to
invoke further methods.
Recommendations for capOS:
- Use out-of-band capability transfer beside Cap’n Proto payloads. Cap’n Proto RPC has capability descriptors indexed into a capability table, but capOS currently keeps kernel transfer semantics in ring sideband records so the kernel can treat Cap’n Proto payload bytes as opaque. Promise pipelining should build on that sideband result-cap namespace rather than requiring general payload traversal in the kernel.
- No need to switch to FIDL. Cap’n Proto’s wire format is superior for capOS’s use case:
  - Random access means runtimes and services can inspect specific fields without full deserialization. The kernel should keep using bounded sideband metadata for transport decisions.
  - Zero-copy read means less allocation in userspace protocol handling.
  - Multi-segment messages avoid large contiguous allocations.
  - Promise pipelining is native to Cap’n Proto RPC, aligning with capOS’s planned async ring chaining.
- FIDL’s resource keyword is worth imitating. Mark capnp types that contain capabilities differently from pure-data types. This could be done at the schema level (Cap’n Proto already distinguishes interface fields) or as a convention. It lets the kernel fast-path messages that contain no capabilities (no need to scan for capability descriptors).
- FIDL’s table type for evolution. Cap’n Proto structs already support adding fields, but FIDL tables are more explicitly designed for cross-version compatibility. For system interfaces that will evolve, consider using Cap’n Proto groups or designing structs with generous ordinal spacing.
6. Synthesis: Relevance to capOS
Handle Model vs. Typed Capability Dispatch
Zircon’s handle model is untyped at the handle level – a handle is just
(object_ref, rights). The type comes from the object. All operations go through
fixed syscalls (zx_channel_write, zx_vmo_read, etc.).
capOS’s model is typed at the capability level – each capability
implements a Cap’n Proto interface with method dispatch. Operations go through
ring SQEs such as CAP_OP_CALL, with Cap’n Proto params and results carried
in userspace buffers.
Both are valid. Zircon’s approach is lower overhead (no serialization for simple
operations like vmo_read), while capOS’s approach gives uniformity (every
operation has the same wire format, enabling persistence and network
transparency).
Hybrid recommendation: For performance-critical operations (memory mapping, signal waiting), consider adding “fast-path” syscalls that bypass capnp serialization, similar to how Zircon has dedicated syscalls per object type. The capnp path remains the general mechanism and the “canonical” interface.
Async Rings vs. Ports: The Right Call
capOS’s io_uring-inspired async rings are a better fit than Zircon’s port model for a capability OS:
- Ports require separate syscalls for registration, waiting, and the actual operation. Async rings batch everything.
- Cap’n Proto’s promise pipelining maps naturally to SQE chaining.
- The shared-memory ring design avoids kernel-side queuing overhead.
However, learn from ports:
- The signal model (each object has a signal set, watchers are notified) is clean and composable. Consider making “wait for signal” a CQ event type.
- zx_port_queue() (user-initiated packets) is useful for waking up event loops from user code. Support user-initiated CQ entries.
VMO/VMAR vs. capOS Memory Model
capOS should implement VMO-equivalent capabilities after the current Endpoint and transfer baseline:
- IPC already has shared rings, but bulk data still needs explicit shared memory objects.
- Capability transfer of memory regions (passing a MemoryObject cap through IPC) is the standard pattern for bulk data transfer.
- CoW cloning enables efficient process creation.
Proposed capability interfaces:
interface MemoryObject {
read @0 (offset :UInt64, count :UInt64) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> ();
getSize @2 () -> (size :UInt64);
setSize @3 (size :UInt64) -> ();
createChild @4 (offset :UInt64, size :UInt64, options :UInt32) -> (child :MemoryObject);
}
interface AddressRegion {
map @0 (offset :UInt64, vmo :MemoryObject, vmoOffset :UInt64, len :UInt64, flags :UInt32) -> (addr :UInt64);
unmap @1 (addr :UInt64, len :UInt64) -> ();
protect @2 (addr :UInt64, len :UInt64, flags :UInt32) -> ();
allocateSubRegion @3 (offset :UInt64, size :UInt64) -> (region :AddressRegion, addr :UInt64);
}
FIDL vs. Cap’n Proto: Stay with Cap’n Proto
Cap’n Proto is the right choice for capOS. The advantages over FIDL:
- Language-independent standard. FIDL is Fuchsia-only. Cap’n Proto has implementations in C++, Rust, Go, Python, Java, etc.
- Zero-copy random access. The kernel can inspect message fields without full deserialization.
- Promise pipelining. Native to capnp-rpc, enabling the async ring chaining that capOS plans.
- Persistence. Cap’n Proto messages are self-describing (with schema) and suitable for on-disk storage – important for capOS’s planned capability persistence.
The one thing FIDL does better: tight integration of handle/capability metadata
in the type system (the resource keyword, client_end/server_end syntax,
handle type constraints). capOS should ensure its capnp schemas clearly
distinguish capability-carrying types and that the kernel enforces capability
transfer semantics.
Concrete Action Items for capOS
Ordered by priority and dependency:
1. Keep typed-interface authority model. Do not add a Zircon-style generic rights bitmask until a concrete method-attenuation need beats narrow wrapper capabilities and transfer-mode metadata.
2. Handle generation counters. Done: upper bits of CapId detect stale references.
3. Design MemoryObject/SharedBuffer capability. Define and implement the shared-memory object that replaces raw-frame transfer for bulk IPC.
4. Design AddressRegion capability (Stage 5). Sub-VMAR-like sandboxing. The root VMAR handle is part of the initial capability set.
5. Capability transfer sideband. Baseline CALL/RETURN copy and move transfer is implemented; promise-pipelined result-cap mapping still needs a precise rule before pipeline dispatch lands.
6. Async rings with signal delivery. SQ/CQ capability rings are implemented for transport; notification objects and promise pipelining remain future work.
7. User-queued CQ entries (with async rings). Allow userspace to post wake-up events to its own CQ, enabling pure-userspace event loop integration.
Appendix: Key Zircon Syscall Reference
For reference, the most architecturally significant Zircon syscalls:
| Syscall | Purpose |
|---|---|
| zx_handle_close | Close a handle |
| zx_handle_duplicate | Duplicate with rights reduction |
| zx_handle_replace | Atomically replace with new rights |
| zx_channel_create | Create a channel pair |
| zx_channel_read | Read a message + handles from a channel |
| zx_channel_write | Write a message + handles to a channel |
| zx_channel_call | Synchronous write-then-read (RPC) |
| zx_port_create | Create an async port |
| zx_port_wait | Wait for the next packet |
| zx_port_queue | Enqueue a user packet |
| zx_object_wait_async | Register a signal wait on a port |
| zx_object_wait_one | Synchronous wait on one object |
| zx_vmo_create | Create a virtual memory object |
| zx_vmo_read / zx_vmo_write | Direct VMO access |
| zx_vmo_create_child | CoW clone |
| zx_vmar_map | Map a VMO into an address region |
| zx_vmar_unmap | Unmap |
| zx_vmar_allocate | Create a sub-VMAR |
| zx_process_create | Create a process (with root VMAR) |
| zx_process_start | Start process execution |
Genode OS Framework: Research Report for capOS
Research on Genode’s capability-based component framework, session routing, VFS architecture, and POSIX compatibility – with lessons for capOS.
1. Capability-Based Component Framework
Core Abstraction: RPC Objects
Genode’s fundamental abstraction is the RPC object. Every service in the system is implemented as an RPC object that can be invoked by clients holding a capability to it. The capability is an unforgeable reference – a kernel-protected token that names a specific RPC object and grants the holder the right to invoke its methods.
Genode supports multiple microkernels (NOVA, seL4, Fiasco.OC, a custom base-hw kernel). The capability model is consistent across all of them, though the kernel-level implementation details differ. The framework abstracts kernel capabilities into its own uniform model.
Key properties of Genode capabilities:
- Unforgeable. A capability can only be obtained by delegation from a holder or creation by the kernel. There is no mechanism to synthesize a capability from an integer or address.
- Typed. Each capability refers to an RPC object with a specific interface. The C++ type system enforces interface contracts at compile time.
- Delegatable. A capability holder can pass it to another component via RPC arguments, allowing authority to flow through the system graph.
- Revocable. Capabilities can be revoked (invalidated). When an RPC object is destroyed, all capabilities pointing to it become invalid.
Capability Types in Genode
Genode distinguishes several kinds of capabilities based on what they refer to:
- Session capabilities. The most common type. A session capability refers to a service session – an ongoing relationship between a client and a server. Example: a Log_session capability lets a client write log messages to a specific log session on a LOG server.
- Parent capability. Every component holds an implicit capability to its parent. This is the channel through which it requests resources and sessions. The parent capability is never explicitly passed – it’s built into the component framework.
- Dataspace capabilities. Represent shared-memory regions. A Ram_dataspace capability grants access to a specific region of physical memory. Dataspaces are the mechanism for bulk data transfer between components (the RPC path is for small messages and control).
- Signal capabilities. Used for asynchronous notifications. A signal source produces signals; holders of the signal capability can register handlers. Signals are Genode’s primary async notification mechanism – they don’t carry data, just wake up the receiver.
Sessions: The Service Contract
A session is the central concept of Genode’s inter-component communication. It represents an established relationship between a client component and a server component, with negotiated resource commitments.
Session lifecycle:
1. Request. A client asks its parent to create a session of a specific type (e.g., Gui::Session, File_system::Session, Nic::Session). The request includes a label string and optional session arguments.
2. Routing. The parent routes the session request according to its policy (see Section 2). The request may traverse multiple levels of the component tree.
3. Creation. The server creates a session object, allocates resources for it (e.g., a shared-memory buffer), and returns a session capability to the client.
4. Use. The client invokes RPC methods on the session capability. The server handles the calls. Both sides can use shared dataspaces for bulk data.
5. Close. Either side can close the session. Resources committed to the session are released back.
This model is fundamentally different from Unix IPC (anonymous pipes/sockets). Every session is:
- Typed – the interface is known at compile time.
- Named – sessions carry a label used for routing and policy.
- Resource-accounted – the client explicitly donates RAM to the server via a “session quota” to fund the server-side state for this session. This prevents denial-of-service through resource exhaustion.
Resource Trading
Genode’s resource model is unique and worth studying closely. Resources (primarily RAM) flow through the component tree:
- The kernel grants a fixed RAM budget to core (the root component).
- Core grants budgets to its children (typically just init).
- Init grants budgets to its children according to the deployment config.
- Each component can donate RAM to servers when opening sessions.
The session_quota mechanism works as follows: when a client opens a
session, it specifies how much RAM it donates. This RAM transfer goes
from the client’s budget to the server’s budget. The server uses this
donated RAM to allocate server-side state for the session. When the
session closes, the RAM flows back.
This creates a closed accounting system:
- No component can use more RAM than it was granted.
- Servers don’t need their own large budgets – clients fund their sessions.
- Resource exhaustion is contained: a misbehaving client can only exhaust its own budget, not the server’s.
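The closed-accounting property is worth making precise: quota only ever moves between budgets, so the system-wide total is invariant. A minimal Python sketch of the idea (the `Budget` class and function names are illustrative, not Genode API):

```python
# Closed-accounting sketch: session quota moves between budgets;
# the total amount of RAM in the system never changes.
class Budget:
    def __init__(self, ram):
        self.ram = ram

def open_session(client: Budget, server: Budget, quota: int) -> int:
    """Client donates `quota` RAM to fund server-side session state."""
    if client.ram < quota:
        raise MemoryError("client cannot fund the session")
    client.ram -= quota
    server.ram += quota
    return quota                        # remembered for close

def close_session(client: Budget, server: Budget, quota: int):
    # On close, the donated RAM flows back to the client.
    server.ram -= quota
    client.ram += quota

client, server = Budget(4), Budget(0)
q = open_session(client, server, 3)     # client funds the session
# (client.ram, server.ram) is now (1, 3); total is still 4
close_session(client, server, q)        # quota returns on close
```

A client that opens too many sessions exhausts only its own budget: `open_session` fails before the server's accounting is ever touched, which is exactly the denial-of-service containment described above.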
Capability Invocation vs. Delegation
Genode distinguishes two fundamental operations on capabilities:
Invocation: calling an RPC method on the capability. The caller sends a message to the RPC object named by the capability, the server processes it and returns a result. This is synchronous in Genode – the caller blocks until the server replies. (Asynchronous interaction uses signals and shared memory.)
Delegation: passing a capability as an argument in an RPC call. When a capability appears as a parameter or return value, the kernel transfers the capability reference to the receiving component. The receiver now holds an independent reference to the same RPC object. This is how authority propagates through the system.
Example: when a client opens a File_system::Session, the session
creation returns a session capability. If the file system server needs
to allocate memory, it calls back to the client’s RAM service using a
RAM capability that was delegated during session setup.
Capabilities in Genode RPC are transferred by the kernel during the IPC operation – the framework marshals them into a special “capability argument” slot in the IPC message, and the kernel copies the capability reference into the receiver’s capability space. This is transparent to application code: capabilities appear as typed C++ objects in the RPC interface.
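The invocation/delegation distinction can be shown with a pure-Python stand-in, where object identity plays the role of the unforgeable token. This is a sketch only – `LogSession` and `rpc_call` are invented names, and a real kernel would copy a protected reference rather than a Python object:

```python
# Sketch: delegation passes an unforgeable reference through a message;
# the receiver ends up with an independent reference to the same object.
class LogSession:
    def __init__(self):
        self.lines = []

    def write(self, msg):               # invocation: a method call
        self.lines.append(msg)

def rpc_call(receiver, cap):
    # Delegation: the "kernel" copies the capability reference into
    # the receiver's capability space (here: a plain Python list).
    receiver(cap)

session = LogSession()
held = []
rpc_call(held.append, session)          # delegate the session capability

held[0].write("hello")                  # receiver invokes the same object
```

After delegation, both holders reference the same RPC object: the receiver's `write` is visible through the original `session`, and there is no way to conjure such a reference from an integer.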
2. Session Routing
The Problem Session Routing Solves
In a traditional OS, services are found via well-known names in a global namespace (D-Bus addresses, socket paths, service names). This creates ambient authority – any process can connect to any service if it knows the name.
Genode has no global service namespace. A component can only obtain sessions through its parent. The parent decides which server to route each session request to. This means:
- Service visibility is controlled structurally.
- A component can only reach services its parent explicitly allows.
- Different children of the same parent can be routed to different servers for the same service type.
Parent-Child Relationship
Every Genode component (except core) has exactly one parent. The parent:
- Created the child (spawned it with an initial set of resources).
- Intercepts all session requests from the child.
- Routes requests according to its routing policy.
- Can deny requests entirely (the child gets an error).
This creates a tree structure where authority flows downward. A child cannot bypass its parent to reach a service the parent didn’t approve.
Init’s Routing Configuration
The init process (Genode’s init) reads an XML configuration that
specifies which services to start and how to route their session requests.
This is the core of system policy.
A minimal init config:
<config>
<parent-provides>
<service name="LOG"/>
<service name="ROM"/>
<service name="CPU"/>
<service name="RAM"/>
<service name="PD"/>
</parent-provides>
<start name="timer">
<resource name="RAM" quantum="1M"/>
<provides> <service name="Timer"/> </provides>
<route>
<service name="ROM"> <parent/> </service>
<service name="LOG"> <parent/> </service>
<service name="CPU"> <parent/> </service>
<service name="RAM"> <parent/> </service>
<service name="PD"> <parent/> </service>
</route>
</start>
<start name="test-log">
<resource name="RAM" quantum="1M"/>
<route>
<service name="Timer"> <child name="timer"/> </service>
<service name="LOG"> <parent/> </service>
<!-- remaining services routed to parent by default -->
<any-service> <parent/> </any-service>
</route>
</start>
</config>
Key routing directives:
- <parent/> – route to the parent (upward delegation).
- <child name="x"/> – route to a specific child (sibling routing).
- <any-child/> – route to any child that provides the service.
- <any-service> – catch-all for unspecified service types.
Label-Based Routing
Labels are strings attached to session requests. They carry context about who is requesting and what they want, enabling fine-grained routing decisions.
When a client requests a session, it attaches a label. As the request traverses the routing tree, each intermediate component (typically init) can prepend its own label. By the time the request reaches the server, the label encodes the full path through the component tree.
Example: a component named my-app inside an init subsystem named
apps requests a File_system session with label "data". The
composed label arriving at the file system server is:
"apps -> my-app -> data".
The server can use this label for:
- Access control. Grant different permissions based on who is asking.
- Isolation. Store data in different directories per client.
- Logging. Identify which component generated a message.
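The label-prepending behavior can be sketched in a few lines of Python (illustrative only; Genode composes labels inside its session infrastructure):

```python
# Illustrative sketch of Genode-style label composition: each component on
# the routing path prepends its own name, so the final label encodes the
# full path through the component tree.
def compose_label(path_components, client_label):
    """path_components: names of intermediate components, outermost first."""
    return " -> ".join(path_components + [client_label])

# The example from the text: my-app inside the "apps" subsystem
# requests a File_system session with label "data".
label = compose_label(["apps", "my-app"], "data")
assert label == "apps -> my-app -> data"
```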
Label-based routing in init config:
<start name="fs">
<provides> <service name="File_system"/> </provides>
<route> ... </route>
</start>
<start name="app-a">
<route>
<service name="File_system" label="data">
<child name="fs"/>
</service>
<service name="File_system" label="config">
<child name="config-fs"/>
</service>
</route>
</start>
Here, app-a’s file system requests are split: requests labeled "data"
go to one server, requests labeled "config" go to another. The
application code is unchanged – the routing is entirely a deployment
decision.
Routing as Policy
The critical insight is that routing IS access control. There is no separate permission system. If a component’s route config doesn’t include a path to a network service, that component has no network access – period. It cannot discover the network service because it has no way to name it.
This replaces:
- Firewall rules (routing controls which network services are reachable)
- File permissions (routing controls which file system sessions are available)
- Process isolation policies (routing controls everything)
The routing configuration is equivalent to a whitelist of allowed service connections for each component. Adding or removing access means editing the init config, not modifying the component’s code or the server’s access control lists.
Dynamic Routing and Sculpt
In the static case (Genode’s test scenarios), routing is defined once in init’s config. In Sculpt OS (Section 6), the routing configuration can be modified at runtime, allowing users to install applications and connect them to services dynamically.
3. VFS on Top of Capabilities
The VFS Layer
Genode’s VFS (Virtual File System) is a library-level abstraction, not a kernel feature. It provides a path-based file-like interface implemented as a plugin architecture within a component’s address space.
The VFS exists because many existing applications (and libc) expect file-like access patterns. Rather than forcing all code to use Genode’s native session/capability model, the VFS provides a translation layer.
Architecture:
Application code
|
| POSIX: open(), read(), write()
v
libc (Genode's port of FreeBSD libc)
|
| VFS API: vfs_open(), vfs_read(), vfs_write()
v
VFS library (in-process)
|
| Plugin dispatch based on mount point
v
VFS plugins (in-process)
|
+--> ram_fs plugin (in-memory file system)
+--> <fs> plugin (delegates to File_system session)
+--> <terminal> plugin (delegates to Terminal session)
+--> <log> plugin (delegates to LOG session)
+--> <nic> plugin (delegates to Nic session, for socket layer)
+--> <block> plugin (delegates to Block session)
+--> <dir> plugin (combines subtrees)
+--> <tar> plugin (read-only tar archive)
+--> <import> plugin (populate from ROM)
+--> <pipe> plugin (in-process pipe pair)
+--> <rtc> plugin (system clock)
+--> <zero> plugin (/dev/zero equivalent)
+--> <null> plugin (/dev/null equivalent)
...
VFS Plugin Architecture
Each VFS plugin is a dynamically loadable library (or statically linked module) that implements a file-system-like interface. Plugins handle:
- open/close – create/destroy file handles
- read/write – data transfer
- stat – metadata queries
- readdir – directory enumeration
- ioctl – device-specific control (limited)
Plugins are composed by the VFS configuration, which is XML embedded in the component’s config:
<config>
<vfs>
<dir name="dev">
<log/>
<null/>
<zero/>
<terminal name="stdin" label="input"/>
<inline name="rtc">2024-01-01 00:00</inline>
</dir>
<dir name="tmp"> <ram/> </dir>
<dir name="data"> <fs label="persistent"/> </dir>
<dir name="socket"> <lxip dhcp="yes"/> </dir>
</vfs>
<libc stdout="/dev/log" stderr="/dev/log" stdin="/dev/stdin"
rtc="/dev/rtc" socket="/socket"/>
</config>
This config creates a virtual filesystem tree:
- /dev/log – writes go to the LOG session
- /dev/null, /dev/zero – standard synthetic files
- /dev/stdin – reads from a Terminal session
- /tmp/ – in-memory filesystem (RAM-backed)
- /data/ – delegates to a File_system session labeled “persistent”
- /socket/ – network sockets via the lxip (ported Linux TCP/IP) stack, in-process
The <fs> plugin is the bridge from VFS to Genode’s capability world.
When the application does open("/data/foo.txt"), the <fs> plugin
translates this into a File_system::Session RPC call to the external
file system server that the component’s routing connects to.
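The mount-point dispatch the VFS performs before any plugin or session is involved can be sketched as follows (a hypothetical model, not the Genode VFS API; longest-prefix matching is an assumption made for illustration):

```python
# Hypothetical sketch of VFS plugin dispatch: match a path against mount
# points (longest prefix wins) and hand the remainder to the owning plugin.
def dispatch(mounts, path):
    """mounts: {mount_point: plugin_name}. Returns (plugin, subpath)."""
    candidates = [m for m in mounts
                  if path == m or path.startswith(m.rstrip("/") + "/")]
    if not candidates:
        raise FileNotFoundError(path)
    best = max(candidates, key=len)          # most specific mount wins
    return mounts[best], path[len(best):].lstrip("/")

# Mount table mirroring the example config above (names illustrative).
mounts = {"/dev": "dev_plugins", "/tmp": "ram_fs", "/data": "fs_session"}
assert dispatch(mounts, "/data/foo.txt") == ("fs_session", "foo.txt")
assert dispatch(mounts, "/tmp/x") == ("ram_fs", "x")
```

In the real system, the plugin selected here (e.g. the <fs> plugin for /data) then issues the corresponding session RPC.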
File System Components
Genode has several file system server components:
- ram_fs – in-memory file system server. Multiple components can share files through it by routing their File_system sessions to it.
- vfs_server (previously vfs) – a file system server backed by the VFS plugin architecture itself. This enables recursive composition: a VFS server can mount another VFS server.
- fatfs – FAT file system driver over a Block session.
- ext2_fs – ext2-family file systems via a ported NetBSD implementation (rump kernel).
- store_fs / recall_fs – content-hash-based storage (experimental in some Genode releases).
The file system server is a regular Genode component. It receives a Block session (from a block device driver), provides File_system sessions, and the routing determines who can access what:
block_driver -> provides Block session
|
v
fatfs -> consumes Block session, provides File_system session
|
v
application -> consumes File_system session via VFS <fs> plugin
Libc Integration
Genode ports a substantial subset of FreeBSD’s libc. The integration point is the VFS: libc’s file operations are implemented by calling the VFS layer, which dispatches to plugins, which invoke Genode sessions as needed.
The libc port modifies FreeBSD libc minimally. Most changes are in the “backend” layer that replaces kernel syscalls with VFS calls:
- open() -> vfs_open() -> VFS plugin dispatch
- read() -> vfs_read() -> VFS plugin
- socket() -> via VFS socket plugin (<lxip> or <lwip>)
- mmap() -> supported for anonymous mappings and file-backed read-only
- fork() -> NOT supported (no fork() in Genode)
- exec() -> NOT supported (no in-place process replacement)
- pthreads -> supported via Genode’s Thread API
- select()/poll() -> supported via VFS notification mechanism
- signal() -> partial support (SIGCHLD, basic signal delivery)
The key architectural decision: libc talks to the VFS library (in-process), the VFS talks to Genode sessions (cross-process RPC). Application code never directly touches Genode capabilities – the VFS mediates everything.
4. POSIX Compatibility
The Noux Approach (Historical)
Genode’s early POSIX approach was Noux, a process runtime that emulated Unix-like process semantics (fork, exec, pipe) on top of Genode. Noux ran as a single Genode component containing multiple “Noux processes” that shared an address space but had separate VFS views.
Noux supported:
- fork() via copy-on-write within the Noux address space
- exec() via in-place program replacement
- pipe() for inter-process communication
- A shared file system namespace
Noux was eventually deprecated because:
- It conflated multiple processes in one address space, undermining Genode’s isolation model.
- Fork emulation was fragile and slow.
- The libc-based VFS approach (Section 3) achieved better compatibility with less complexity.
Current Approach: libc + VFS
The current POSIX compatibility strategy:
- FreeBSD libc port. Provides standard C library functions. Modified to use Genode’s VFS instead of kernel syscalls.
- VFS plugins as POSIX backends. Each POSIX I/O pattern maps to a VFS plugin:
  - File I/O -> <fs> plugin -> File_system session
  - Sockets -> <lxip> or <lwip> plugin -> Nic session (in-process TCP/IP stack)
  - Terminal I/O -> <terminal> plugin -> Terminal session
  - Device access -> custom VFS plugins
- No fork(). The most significant POSIX omission. Programs that require fork() must be modified to use posix_spawn() or Genode’s native child-spawning mechanism. In practice, many programs use fork() only for daemon patterns or subprocess creation, and can be adapted.
- No exec(). Related to no fork(): there is no in-place process replacement. New processes are created as new Genode components.
- Signals. Basic support – enough for SIGCHLD notification and simple signal handling. Complex signal semantics (real-time signals, signal-driven I/O) are not supported.
- pthreads. Fully supported via Genode’s native threading.
- mmap. Anonymous mappings and read-only file-backed mappings work. MAP_SHARED with write semantics is limited.
What Works in Practice
Genode has successfully ported:
- Qt5/Qt6 – the full widget toolkit, including QtWebEngine (Chromium). This is the basis of Sculpt’s GUI.
- VirtualBox – full x86 virtualization (runs Windows, Linux guests).
- Mesa/Gallium – GPU-accelerated 3D graphics.
- curl, wget, fetchmail – network utilities.
- GCC toolchain – compiler, assembler, linker running on Genode.
- bash – with limitations (no job control via signals, no fork-heavy patterns). Works for simple scripting.
- vim, nano – terminal editors.
- OpenSSL/LibreSSL – cryptographic libraries.
- Various system utilities – ls, cp, rm, etc. via Coreutils port.
Applications that don’t port well:
- Anything deeply dependent on fork+exec patterns (e.g., traditional Unix shells for complex scripting).
- Programs relying on procfs, sysfs, or Linux-specific interfaces.
- Daemons using inotify or Linux-specific async I/O.
- Programs that assume global file system namespace visibility.
Practical Porting Effort
For most POSIX applications, porting involves:
- Build the application using Genode’s ports system (downloads upstream source, applies patches, builds with Genode’s toolchain).
- Write a VFS configuration that provides the file-like resources the application expects.
- Write a routing configuration that connects the application to required services.
- Patch fork() calls if present (usually replacing with posix_spawn() or restructuring to avoid subprocess creation).
The VFS configuration is where the “impedance mismatch” between POSIX
expectations and Genode capabilities is resolved. The application thinks
it’s accessing /etc/resolv.conf – the VFS plugin infrastructure
translates this to capability-mediated access.
5. Component Architecture
Core, Init, and User Components
Core (or base-hw/base-nova/etc.): the lowest-level component,
running directly on the microkernel. Core provides the fundamental
services: RAM allocation, CPU time, protection domains (PD sessions),
ROM access (boot modules), IRQ delivery, and I/O memory access. Core is
the only component with direct hardware access. Everything else goes
through core.
Init: the first user-level component, child of core. Init reads its XML configuration and manages the component tree. Init’s responsibilities:
- Parse <start> entries and spawn components.
- Route session requests between components according to <route> rules.
- Manage component lifecycle (restart policies, resource reclamation).
- Propagate configuration changes (dynamic reconfiguration in Sculpt).
User components: all other components. They can be:
- Servers that provide sessions (drivers, file systems, network stacks).
- Clients that consume sessions (applications).
- Both simultaneously (a network stack consumes NIC sessions and provides socket-level sessions).
- Sub-inits – components that run their own init-like management for a subtree of components.
Resource Trading in Practice
Resources in Genode flow through the tree. A concrete example:
- Core has 256 MB RAM total.
- Core grants 250 MB to init, keeps 6 MB for kernel structures.
- Init grants 10 MB to the timer driver, 50 MB to the GUI subsystem, 20 MB to the network subsystem, 5 MB to a log server.
- When the GUI subsystem starts a framebuffer driver, it donates 8 MB from its 50 MB budget to the driver as a session quota.
- The framebuffer driver uses this donated RAM for the frame buffer allocation.
If the GUI subsystem wants more RAM for a new application, it can reclaim RAM by closing sessions (getting donated RAM back) or requesting more from its parent (init).
The accounting is strict: at any point, the sum of all RAM budgets across all components equals the total system RAM. There is no over-commit. This prevents the “OOM killer” problem – each component knows exactly how much RAM it can use.
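A toy model of this strict accounting (illustrative Python, not capOS or Genode code) makes the invariant concrete: donations move budget between components, over-commit is refused, and the system total never changes:

```python
# Toy model of Genode's strict RAM accounting. Budgets are per-component;
# a donation moves budget from one component to another; the sum of all
# budgets is invariant, so there is never an over-commit to resolve.
class Ledger:
    def __init__(self, budgets):
        self.budgets = dict(budgets)

    def donate(self, src, dst, amount):
        if self.budgets[src] < amount:
            raise RuntimeError("over-commit refused")  # no OOM killer needed
        self.budgets[src] -= amount
        self.budgets[dst] = self.budgets.get(dst, 0) + amount

    def total(self):
        return sum(self.budgets.values())

# The example from the text, in MB (illustrative figures).
led = Ledger({"core": 6, "init": 250})
led.donate("init", "gui", 50)
led.donate("gui", "fb_drv", 8)   # session quota donated to the driver
assert led.budgets["gui"] == 42
assert led.total() == 256        # invariant: budgets always sum to total RAM
```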
Practical Component Patterns
Driver components follow a common pattern:
- Receive: Platform session (for I/O port/memory access), IRQ session
- Provide: A device-specific session (NIC, Block, GPU, Audio, etc.)
- Stateless: all per-client state funded by session quota
Multiplexer components:
- Receive: one instance of a service
- Provide: multiple instances to clients
- Example: NIC router receives one NIC session, provides multiple sessions with packet routing between clients
Proxy components:
- Forward one session type, possibly filtering or transforming
- Example: nic_bridge, nitpicker (GUI multiplexer), VFS server
Subsystem inits:
- A component running its own init for a group of related components
- Isolates the subtree: crash of the subsystem doesn’t affect siblings
- Example: Sculpt’s drivers subsystem, network subsystem
6. Sculpt OS
What Sculpt Demonstrates
Sculpt OS is Genode’s demonstration desktop operating system. It turns the component framework into a usable system where:
- Users install and run applications at runtime.
- Each application runs in its own isolated component with explicitly configured capabilities.
- A GUI lets users connect applications to services (routing).
- The entire system is reconfigurable without reboot.
Architecture
Sculpt’s component tree:
core
|
init
|
+--> drivers subsystem (sub-init)
| +--> platform_drv (PCI, IOMMU)
| +--> fb_drv (framebuffer)
| +--> usb_drv (USB host controller)
| +--> wifi_drv (wireless)
| +--> ahci_drv (SATA)
| +--> nvme_drv (NVMe)
| +--> ...
|
+--> runtime subsystem (sub-init, user-managed)
| +--> (user-installed applications)
|
+--> leitzentrale (management GUI)
| +--> system shell
| +--> config editor
|
+--> nitpicker (GUI multiplexer)
+--> nic_router (network multiplexer)
+--> ram_fs (shared file system)
+--> ...
User Experience of Capabilities
In Sculpt, installing an application means:
- Download the package (a Genode component archive).
- Edit a “deploy” configuration that specifies which services the application can access (routing rules).
- The runtime subsystem spawns the component with the specified routing.
A text editor gets: File_system session (to read/write files), GUI
session (for display), Terminal session (optionally). It does NOT get:
network access, block device access, or access to other applications’
file systems.
A web browser gets: GUI session, Nic session (for network), GPU
session (for rendering), File_system session (for downloads). Each
service connection is an explicit choice.
The deploy config is the security policy. A user can see exactly what authority each application has, and can change it by editing the config.
Lessons from Sculpt
- Capabilities need a management UI. Raw capability graphs are incomprehensible to users. Sculpt provides a GUI that presents service connections in an understandable way (though it’s still oriented toward power users).
- Routing is the killer feature. Being able to route the same session type to different servers for different clients is extremely powerful. One application’s “file system” is local storage; another’s is a network share – same code, different routing.
- Sub-inits provide failure isolation. The drivers subsystem can crash and restart without affecting applications. Sculpt’s robustness comes from this hierarchical isolation.
- Dynamic reconfiguration is essential. A static boot config (like capOS’s current manifest) is fine for servers and embedded systems, but a general-purpose OS needs to add/remove/reconfigure components at runtime.
- Package management is a routing problem. Installing an application in Sculpt is not “copy binary to disk” – it’s “add a component to the runtime subsystem with specific routing rules.” The binary is almost secondary to the routing.
- POSIX compat through VFS works. Sculpt runs real desktop applications (Qt-based apps, VirtualBox, web browser) using the VFS-mediated POSIX layer. The capability model doesn’t prevent running complex existing software – it just requires explicit service configuration.
7. Relevance to capOS
VFS Capability Design
Genode’s approach: The VFS is an in-process library with a plugin architecture. It mediates between libc/POSIX and Genode sessions. The VFS configuration is per-component XML.
Lessons for capOS:
- Don’t put the VFS in the kernel. Genode’s VFS is entirely userspace, which is correct for a capability OS. capOS should do the same – the VFS is a library linked into processes that need POSIX compatibility, not a kernel subsystem.
- Plugin model maps well to Cap’n Proto. Each Genode VFS plugin bridges to a specific session type. In capOS, each VFS “backend” would bridge to a specific capability interface:

| Genode VFS plugin | capOS VFS backend |
|---|---|
| <fs> -> File_system session | FsBackend -> Namespace + Store caps |
| <terminal> -> Terminal session | TerminalBackend -> Console cap |
| <lxip> -> Nic session | NetBackend -> TcpSocket/UdpSocket caps |
| <log> -> LOG session | LogBackend -> Console cap |
| <ram> -> in-process RAM | RamBackend -> in-process (no cap needed) |

- VFS config should be declarative. Rather than hardcoding mount points, capOS processes using libcapos-posix should receive a VFS mount table as part of their initial capability set. This could be a Cap’n Proto struct:

  struct VfsMountTable {
    mounts @0 :List(VfsMount);
  }

  struct VfsMount {
    path @0 :Text;        # mount point, e.g. "/data"
    union {
      namespace @1 :Void; # use the Namespace cap named in capName
      console @2 :Void;   # use a Console cap
      ram @3 :Void;       # in-memory filesystem
      socket @4 :Void;    # socket interface
    }
    capName @5 :Text;     # name of the cap in CapSet backing this mount
  }

  This separates the VFS topology (a deployment decision) from the application code (which just calls open()).
- Genode’s <fs> plugin is the key analog. capOS’s Namespace capability is equivalent to Genode’s File_system session. The libcapos-posix path resolution layer (open() -> namespace.resolve()) is exactly Genode’s <fs> VFS plugin. The existing capOS design in docs/proposals/userspace-binaries-proposal.md is already on the right track.
- Consider streaming for large files. Genode uses shared-memory dataspaces for bulk data transfer in file system sessions. capOS’s current Store interface returns Data (a capnp blob), which means the entire object is copied per get() call. For large files, a streaming interface (with a shared-memory buffer and cursor) would be more efficient. This is capOS’s Open Question #4.
Session Routing Patterns
Genode’s approach: XML-configured routing in init, label-based dispatch, parent mediates all session requests.
Lessons for capOS:
- The manifest IS the routing config. capOS’s SystemManifest with structured CapRef source entries such as { service = { service = "net-stack", export = "nic" } } is functionally equivalent to Genode’s init routing config. The capOS design already handles the static case well.
- Label-based routing is valuable. Genode’s ability to route different requests from the same client to different servers (based on labels) maps directly to capOS’s capability naming. capOS already does this implicitly – a process can receive separate Namespace caps for “config” and “data”. The key insight is that this should be a deployment-time decision, not an application-time decision.
- Consider dynamic routing. capOS’s current manifest is static (baked into the ISO). For a more flexible system, init should support runtime reconfiguration:
  - Reload the manifest from a Store cap.
  - Add/remove services without reboot.
  - Re-route sessions when services restart.

  Genode achieves this via init’s config ROM, which can be updated at runtime. capOS could achieve it by having init watch a Namespace cap for manifest updates.
- Parent-mediated routing has costs. In Genode, every session request traverses the component tree. This adds latency and complexity. capOS’s direct capability passing (a process holds a cap directly, not through its parent) avoids this overhead. The tradeoff: capOS has less runtime control over routing (once a cap is passed, the parent can’t intercept invocations on it).

  This is a deliberate design choice. capOS favors direct caps (lower overhead, simpler) over proxied caps (more control). Genode’s session routing is powerful but adds a layer of indirection that may not be worth it for capOS’s use case.
- Service export needs a protocol. Genode’s session model has server components explicitly announce what services they provide. capOS’s ProcessHandle.exported() mechanism serves the same purpose. The manifest’s exports field pre-declares what a service will export, which helps init plan the dependency graph before spawning anything.
POSIX Compatibility Without Compromising Capabilities
Genode’s approach: libc port + VFS + per-component VFS config. No global namespace. No fork(). Applications see a curated file tree, not the real system.
Lessons for capOS:
- The VFS is a capability adapter, not a capability. The VFS library runs inside the process that needs POSIX compatibility. It doesn’t weaken the capability model because it can only access capabilities the process was granted. This matches capOS’s libcapos-posix design exactly.
- musl over FreeBSD libc. Genode uses FreeBSD libc because of its clean backend interface. capOS plans to use musl, which has an even cleaner __syscall() interface. This is a good choice. Genode’s experience shows that the libc implementation matters less than the VFS/backend layer quality.
- No fork() is fine. Genode has operated without fork() for over 15 years and runs complex software (Qt, VirtualBox, Chromium). The applications that truly need fork() are rare and usually need only posix_spawn() semantics. capOS should not attempt to implement fork() – focus on posix_spawn() backed by a ProcessSpawner cap.
- Sockets via in-process TCP/IP stack. Genode’s <lwip> and <lxip> VFS plugins run a TCP/IP stack (lwIP or a ported Linux stack, respectively) inside the application process, using the NIC session for raw packet I/O. This avoids the overhead of routing every socket call through a separate network stack component.

  capOS could offer a similar choice:
  - Out-of-process: socket calls go to the network stack component via TcpSocket/UdpSocket caps (safer, more isolated, more overhead).
  - In-process: an lwIP/smoltcp library runs inside the application, consuming a raw Nic cap (less isolation, less overhead, more authority).

  For most applications, out-of-process sockets via caps are fine. For high-performance networking (database, web server), an in-process stack over a raw NIC cap may be needed.
- select/poll/epoll need async caps. Genode implements select/poll via VFS notifications (signals on file readiness). capOS needs the async capability rings (io_uring-inspired) from Stage 4 before select/poll can work. This is a natural fit: each polled fd maps to a pending capability invocation in the completion ring.
Component Patterns for Cap’n Proto Interfaces
Genode’s patterns and their capOS/Cap’n Proto equivalents:
- Session creation = factory method on a capability.

  Genode: client requests a Nic::Session from its parent, which routes to a NIC driver server.

  capOS: client holds a NetworkManager cap and calls create_tcp_socket() to get a TcpSocket cap. The factory pattern is the same, but capOS does it via direct cap invocation instead of parent-mediated session requests.

  Cap’n Proto naturally supports this via interfaces that return interfaces:

  interface NetworkManager {
    createTcpSocket @0 () -> (socket :TcpSocket);
    createUdpSocket @1 () -> (socket :UdpSocket);
    createTcpListener @2 (addr :IpAddress, port :UInt16) -> (listener :TcpListener);
  }

- Resource quotas in session creation.

  Genode: session requests include a RAM quota donated from client to server.

  capOS should consider this pattern. Currently, capOS processes receive a FrameAllocator cap for memory. If a server needs to allocate memory per-client, the client should fund it. A Cap’n Proto schema could encode this:

  interface FileSystem {
    open @0 (path :Text, bufferPages :UInt32) -> (file :File);
    # bufferPages: number of pages the client donates for
    # server-side buffering. Server allocates from a shared
    # FrameAllocator or the client passes frames explicitly.
  }

  This prevents the denial-of-service problem where a client opens many sessions, exhausting the server’s memory.
- Multiplexer components.

  Genode: nic_router takes one NIC session, provides many. nitpicker takes one framebuffer, provides many GUI sessions.

  capOS equivalent: a process that consumes a Nic cap and provides multiple TcpSocket/UdpSocket caps. This is already what the network stack component does in capOS’s service architecture proposal. Cap’n Proto’s interface model makes this natural – the multiplexer implements one interface (NetworkManager) using another (Nic).
- Attenuation = capability narrowing.

  Genode: servers can return restricted capabilities (e.g., a read-only file handle from a read-write file system session).

  capOS: already planned via Fetch -> HttpEndpoint narrowing, Store -> read-only Store, Namespace -> scoped Namespace. The pattern is sound. Cap’n Proto interfaces make the attenuation explicit in the schema.
- Dataspace pattern for bulk data.

  Genode uses shared-memory dataspaces for efficient bulk transfer (file contents, network packets, framebuffers). The RPC path carries only small control messages and capability references.

  capOS currently moves Cap’n Proto control messages through capability rings and bounded kernel scratch, with no zero-copy bulk-data object yet. For bulk data, capOS should add a SharedBuffer capability:

  interface SharedBuffer {
    # Map a shared memory region into caller's address space
    map @0 () -> (addr :UInt64, size :UInt64);
    # Notify that data has been written to the buffer
    signal @1 (offset :UInt64, length :UInt64) -> ();
  }

  File system and network operations would use SharedBuffer for data transfer and capability invocations for control, matching Genode’s split between RPC and dataspaces.
- Sub-init pattern for failure domains.

  Genode: a sub-init manages a subtree of components. If the subtree crashes, only the sub-init restarts it.

  capOS: a supervisor process (not necessarily init) holds a ProcessSpawner cap and manages a group of services. This is already described in the service architecture proposal’s supervision tree. The key addition from Genode: make sub-supervisors a first-class pattern with their own manifest fragments, not just ad-hoc supervision loops.
Summary of Key Takeaways for capOS
| Area | Genode approach | capOS adaptation |
|---|---|---|
| Capability model | Kernel-enforced caps to RPC objects | Kernel-enforced caps to Cap’n Proto objects (aligned) |
| Service discovery | Parent-mediated session routing | Manifest-driven cap passing at spawn (simpler, less dynamic) |
| VFS | In-process library with plugin architecture | libcapos-posix with mount table from CapSet (same pattern) |
| POSIX | FreeBSD libc + VFS backends | musl + libcapos-posix backends (same architecture) |
| fork() | Not supported | Not supported (use posix_spawn -> ProcessSpawner) |
| Bulk data | Shared-memory dataspaces | SharedBuffer design exists; implementation pending |
| Resource accounting | Session quotas (RAM donated per session) | Authority-accounting design exists; unified ledgers pending |
| Routing labels | String labels on session requests, routed by init | Cap naming in manifest serves same purpose |
| Dynamic reconfig | Init config ROM updated at runtime | Manifest reload via Store cap (future) |
| Failure isolation | Sub-inits as failure domains | Supervisor processes (same concept, different mechanism) |
| Async notification | Signal capabilities | Async cap rings / io_uring model (more general) |
Top Recommendations
- Add session quotas / resource trading. This is the most important Genode pattern capOS hasn’t adopted yet. Without it, a malicious client can exhaust a server’s memory by opening many capability sessions. Design resource donation into the Cap’n Proto schema for session-creating interfaces.
- Design a SharedBuffer capability. Copying capnp messages through the kernel works for control messages but not for bulk data. A shared-memory mechanism (like Genode’s dataspaces) is essential for file I/O, networking, and GPU rendering.
- Keep VFS as a library, not a service. Genode’s in-process VFS is the right pattern. capOS’s libcapos-posix should work the same way – a library that translates POSIX calls to capability invocations within the process. No VFS server component is needed (though a file system server implementing the Namespace/Store interface is separate).
- Add a declarative VFS mount table to process init. Each POSIX-compat process should receive a mount table (as a capnp struct) that maps paths to capabilities. This separates deployment policy from application code, matching Genode’s per-component VFS config.
- Plan for dynamic reconfiguration. The static manifest is fine for now, but Sculpt shows that a usable capability OS needs runtime service management. Design init so it can accept manifest updates through a cap, not just from the boot image.
- Don’t over-engineer routing. Genode’s parent-mediated session routing is powerful but complex. capOS’s direct capability passing is simpler and sufficient for most use cases. Add proxy/mediator patterns only when specific needs arise (e.g., capability revocation, load balancing).
References
- Genode Foundations book (genode.org/documentation/genode-foundations/) – the authoritative source for architecture, session model, routing, VFS, and component composition.
- Norman Feske, “Genode Operating System Framework” (2008-2025) – release notes and design documentation at genode.org.
- Sculpt OS documentation at genode.org/download/sculpt – practical deployment of the capability model.
- Genode source repository: github.com/genodelabs/genode – reference implementations of VFS plugins, file system servers, libc port.
Research: Plan 9 from Bell Labs and Inferno OS
Lessons for a capability-based OS using Cap’n Proto wire format.
Table of Contents
- Per-Process Namespaces
- The 9P Protocol
- File-Based vs Capability-Based Interfaces
- 9P as IPC
- Inferno OS
- Relevance to capOS
1. Per-Process Namespaces
Overview
Plan 9’s most significant architectural contribution is per-process namespaces.
Every process has its own view of the file hierarchy – not a shared global
filesystem tree as in Unix. A process’s namespace is a mapping from path names
to file servers (channels to 9P-speaking services). Two processes running on
the same machine can see completely different contents at /dev, /net,
/proc, or any other path.
Namespaces are inherited by child processes (fork copies the namespace) but can be modified independently afterward. This provides a form of resource isolation that is orthogonal to traditional access control: a process simply cannot name resources that aren’t in its namespace.
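The inheritance-then-divergence behavior can be modeled in a few lines (an illustrative sketch, not Plan 9 code; the namespace is reduced to a plain path-to-server mapping, and the server names are hypothetical):

```python
# Illustrative model: a namespace is copied on fork, after which parent
# and child modify their copies independently.
import copy

parent_ns = {"/net": "netstack-A", "/dev": "console"}

child_ns = copy.deepcopy(parent_ns)       # fork copies the namespace
child_ns["/net"] = "netstack-B"           # child rebinds /net for itself

assert parent_ns["/net"] == "netstack-A"  # parent's view is unchanged
assert child_ns["/net"] == "netstack-B"   # the two views have diverged
```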
The Three Namespace Operations
Plan 9 provides three system calls for namespace manipulation:
bind(name, old, flags) – Takes an existing file or directory name
already visible in the namespace and makes it also accessible at path old.
This is purely a namespace-level alias – no new file server is involved. The
name argument must resolve to something already in the namespace.
Example: bind("#c", "/dev", MREPL) makes the console device (#c is a
kernel device designator) appear at /dev. The # prefix addresses kernel
devices directly before they have been bound into the namespace.
mount(fd, old, flags, aname) – Like bind, but the source is a file
descriptor connected to a 9P server rather than an existing namespace path.
The kernel speaks 9P over fd to serve requests for paths under old. The
aname parameter selects which file tree the server should export (a single
server can serve multiple trees).
Example: mount(fd, "/net", MREPL, "") where fd is a connection to the
network stack’s file server, makes the TCP/IP interface appear at /net.
unmount(name, old) – Removes a previous bind or mount from the
namespace.
Flags and Union Directories
The flags argument to bind and mount controls how the new binding
interacts with existing content at the mount point:
- MREPL (replace) – The new binding completely replaces whatever was at the mount point. Only the new server’s files are visible.
- MBEFORE (before) – The new binding is placed before the existing content. When looking up a name, the new binding is searched first. If not found there, the old content is searched.
- MAFTER (after) – The new binding is placed after the existing content. The old content is searched first.
- MCREATE – Combined with MBEFORE or MAFTER, controls which component of the union receives create operations.
Union directories are the result of stacking multiple bindings at one mount point. When a directory has multiple bindings, a directory listing returns the union of all names from all components. A lookup walks the bindings in order and returns the first match.
This is how Plan 9 constructs /bin: multiple directories (for different
architectures, local overrides, etc.) are union-mounted at /bin. The
shell finds commands by simple path lookup – no $PATH variable needed.
bind /rc/bin /bin # shell built-ins (MAFTER)
bind /386/bin /bin # architecture binaries (MAFTER)
bind $home/bin/386 /bin # personal overrides (MBEFORE)
A lookup for /bin/ls searches the personal directory first, then the
architecture directory, then the shell builtins – all via a single path.
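The ordered search above can be sketched as a lookup over the union's components. This is a toy model, not Plan 9 source; the Dir alias and union_lookup helper are hypothetical:

```rust
// Toy model of a Plan 9 union directory: an ordered list of components,
// searched front to back. MBEFORE binds prepend, MAFTER binds append.
use std::collections::HashMap;

/// One component of the union: a name -> target table standing in
/// for a bound directory or file server.
type Dir = HashMap<String, String>;

/// Walk the components in order and return the first match, the way
/// Plan 9 resolves a lookup like /bin/ls.
fn union_lookup<'a>(layers: &'a [Dir], name: &str) -> Option<&'a String> {
    layers.iter().find_map(|dir| dir.get(name))
}
```

With a personal directory union-mounted MBEFORE the architecture directory, a lookup for ls finds the personal override first, while names absent from every component fail outright.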
Namespace Inheritance and Isolation
The rfork system call controls what the child inherits:
- RFNAMEG – Child gets a copy of the parent’s namespace. Subsequent modifications by either side are independent.
- RFCNAMEG – Child starts with a clean (empty) namespace.
- Without either flag, parent and child share the namespace (modifications by one affect the other).
This gives fine-grained control: a shell can construct a restricted namespace for a sandboxed command, or a server can create an isolated namespace for each client connection.
Namespace Construction at Boot
Plan 9’s boot process constructs the initial namespace step by step:
- The kernel provides “kernel devices” accessed via # designators: #c (console), #e (environment), #p (proc), #I (IP stack), etc.
- The boot script binds these into conventional paths: bind "#c" /dev, bind "#p" /proc, etc.
- Network connections mount remote file servers: the CPU server’s file system, the user’s home directory, etc.
- Per-user profile scripts further customize the namespace.
The result is that the “standard” file hierarchy is a convention, not a kernel requirement. Any process can rearrange it.
Namespace as Security Boundary
Plan 9 namespaces provide a form of capability-like access control:
- A process cannot access resources outside its namespace
- A parent can restrict a child’s namespace before exec
- There is no way to “escape” a namespace – there is no .. that crosses a mount boundary unexpectedly, and # designators can be restricted
However, this is not a formal capability system:
- The namespace contains string paths, which are ambient authority within the namespace
- Any process can open("/dev/cons") if /dev/cons is in its namespace – there is no per-open-call authorization
- The isolation depends on correct namespace construction, not structural properties
2. The 9P Protocol
Overview
9P (and its updated version 9P2000) is the protocol spoken between clients and file servers. Every resource in Plan 9 is accessed through 9P – local kernel devices, remote file systems, user-space services, and network resources all speak the same protocol.
9P is a request-response protocol with fixed message types. It is connection-oriented: a client establishes a session, authenticates, walks paths to obtain file handles (fids), and then reads/writes through those handles.
Message Types (9P2000)
9P2000 defines the following message pairs (T = request from client, R = response from server):
Session management:
- Tversion/Rversion – Negotiate protocol version and maximum message size. Must be the first message. The client proposes a version string (e.g., "9P2000") and a msize (maximum message size in bytes). The server responds with the agreed version and msize.
- Tauth/Rauth – Establish an authentication fid. The client provides a user name and an aname (the file tree to access). The server returns an afid that the client reads/writes to complete an authentication exchange.
- Tattach/Rattach – Attach to a file tree. The client provides the afid from authentication, a user name, and the aname. The server returns a qid (unique file identifier) for the root of the tree. This fid becomes the client’s handle for the root directory.
Navigation:
- Twalk/Rwalk – Walk a path from an existing fid. The client provides a starting fid and a sequence of name components (up to 16 per walk). The server returns a new fid pointing to the result and the qids of each intermediate step. Walk is how you traverse directories – there is no open-by-path operation.
File operations:
- Topen/Ropen – Open an existing file (by fid, obtained via walk). The client specifies a mode (read, write, read-write, exec, truncate). The server returns the qid and an iounit (maximum I/O size for atomic operations).
- Tcreate/Rcreate – Create a new file in a directory fid. The client specifies name, permissions, and mode.
- Tread/Rread – Read count bytes at offset from an open fid. The server returns the data.
- Twrite/Rwrite – Write count bytes at offset to an open fid. The server returns the number of bytes actually written.
- Tclunk/Rclunk – Release a fid. The server frees associated state. Equivalent to close().
- Tremove/Rremove – Remove the file referenced by a fid and clunk the fid.
- Tstat/Rstat – Get file metadata (name, size, permissions, access times, qid, etc.).
- Twstat/Rwstat – Modify file metadata.
Error handling:
- Rerror – Any T-message can receive an Rerror instead of its normal response. Contains a text error string (9P2000) or an error number (9P2000.u).
Message Format
Every 9P message starts with a 4-byte length (little-endian, including the length field itself), a 1-byte type, and a 2-byte tag. The tag is chosen by the client and echoed in the response, enabling multiplexed operations over a single connection.
[4 bytes: size][1 byte: type][2 bytes: tag][... type-specific fields ...]
Field types are simple: 1/2/4/8-byte integers (little-endian), counted strings (2-byte length prefix + UTF-8), and counted data blobs (4-byte length prefix + raw bytes).
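As a concrete sketch of this layout (assuming the little-endian encoding described above; the Header type and parse_header helper are hypothetical, not from a real 9P library):

```rust
// Decode the fixed 9P header: 4-byte size (little-endian, and the size
// includes the length field itself), 1-byte type, 2-byte tag.
#[derive(Debug, PartialEq)]
struct Header {
    size: u32, // total message length in bytes
    mtype: u8, // message type (e.g. 100 = Tversion in 9P2000)
    tag: u16,  // client-chosen tag, echoed in the response
}

fn parse_header(buf: &[u8]) -> Option<Header> {
    if buf.len() < 7 {
        return None; // the header alone is 7 bytes
    }
    let size = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    // A well-formed message is at least as long as its own header.
    if (size as usize) < 7 {
        return None;
    }
    Some(Header {
        size,
        mtype: buf[4],
        tag: u16::from_le_bytes([buf[5], buf[6]]),
    })
}
```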
Qids and File Identity
A qid is a server-assigned 13-byte file identifier:
[1 byte: type][4 bytes: version][8 bytes: path]
- type – Bits indicating directory, append-only, exclusive-use, authentication file, etc.
- version – Incremented when the file is modified. The client can detect changes by comparing versions.
- path – A unique identifier for the file within the server. Typically a hash or inode number.
Qids allow clients to detect file identity (same path through different walks = same qid) and staleness (version changed = re-read needed).
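A sketch of decoding and comparing qids under the layout above (parse_qid, same_file, and stale are hypothetical helpers for illustration):

```rust
use std::convert::TryInto;

// The 13-byte qid: 1-byte type, 4-byte version, 8-byte path
// (little-endian, as elsewhere in 9P).
#[derive(Debug, PartialEq)]
struct Qid {
    qtype: u8,    // QTDIR, append-only, exclusive-use, ... bit flags
    version: u32, // incremented when the file is modified
    path: u64,    // unique identifier within the server
}

fn parse_qid(buf: &[u8; 13]) -> Qid {
    Qid {
        qtype: buf[0],
        version: u32::from_le_bytes(buf[1..5].try_into().unwrap()),
        path: u64::from_le_bytes(buf[5..13].try_into().unwrap()),
    }
}

/// Same path => same file, regardless of which walk produced the qid.
fn same_file(a: &Qid, b: &Qid) -> bool {
    a.path == b.path
}

/// Same file but a newer version => cached contents need a re-read.
fn stale(cached: &Qid, fresh: &Qid) -> bool {
    same_file(cached, fresh) && cached.version != fresh.version
}
```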
Authentication
9P2000 authentication is pluggable. The protocol provides the Tauth/Rauth
mechanism to establish an authentication fid, but the actual authentication
exchange happens by reading and writing this fid – the protocol itself is
agnostic to the authentication method.
Plan 9’s standard mechanism is p9sk1, a shared-secret protocol using an authentication server. The flow:
- Client sends Tauth to get an afid
- Client and server exchange challenge-response messages by reading/writing the afid, mediated by the authentication server
- Once authentication succeeds, the client uses the afid in Tattach
The key insight: authentication is just another read/write conversation over a special fid. New authentication methods can be implemented without changing the protocol.
Concurrency
9P supports concurrent operations through tags. A client can send multiple T-messages without waiting for responses. Each has a unique tag, and the server can respond out of order. The client matches responses to requests by tag.
A special tag value NOTAG (0xFFFF) is used for Tversion, which must
complete before any other messages.
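The client-side tag bookkeeping can be sketched like this (a toy TagTable, not taken from any 9P implementation; it allocates tags for outstanding T-messages and matches out-of-order R-messages back to them):

```rust
use std::collections::HashMap;

// Client-side tag table: every in-flight T-message holds a unique tag;
// the server echoes the tag, letting responses arrive in any order.
struct TagTable {
    next: u16,
    pending: HashMap<u16, String>, // tag -> description of the request
}

impl TagTable {
    fn new() -> Self {
        TagTable { next: 0, pending: HashMap::new() }
    }

    /// Allocate a tag for a new T-message. NOTAG (0xFFFF) is reserved
    /// for Tversion, so it is never handed out here.
    fn send(&mut self, what: &str) -> u16 {
        while self.pending.contains_key(&self.next) || self.next == 0xFFFF {
            self.next = self.next.wrapping_add(1);
        }
        let tag = self.next;
        self.pending.insert(tag, what.to_string());
        self.next = self.next.wrapping_add(1);
        tag
    }

    /// Match an R-message to its outstanding request by tag.
    fn receive(&mut self, tag: u16) -> Option<String> {
        self.pending.remove(&tag)
    }
}
```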
The OEXCL open mode provides exclusive access to a file – only one client
can open it at a time. This is used for locking (e.g., the #l lock device
in some Plan 9 variants).
Fids are per-connection, not global. Different clients on different connections have independent fid spaces. A server maintains per-connection state.
Maximum Message Size
The msize negotiated in Tversion bounds all subsequent messages. A
typical default is 8192 or 65536 bytes. The iounit returned by Topen
tells the client the maximum useful count for read/write on that fid,
which may be less than msize minus the message header overhead.
This bounding is important for resource management – a server can limit memory consumption per connection.
3. File-Based vs Capability-Based Interfaces
Plan 9: Everything is a File
Plan 9 takes Unix’s “everything is a file” philosophy further than Unix itself ever did:
- Network stack – TCP connections are managed by reading/writing files in /net/tcp: clone (allocate a connection), ctl (write commands like connect 10.0.0.1!80), data (read/write payload), status (read connection state).
- Window system – The rio window manager exports a file system: each window has a cons, mouse, winname, etc. A program draws by writing to /dev/draw/*.
- Process control – /proc/<pid>/ contains ctl (write kill to signal), status (read state), mem (read/write process memory), text (read executable), note (signals).
- Hardware devices – Kernel devices export file interfaces directly. The audio device is files, the graphics framebuffer is files, etc.
The interface contract is: open a file, read/write bytes, stat for metadata.
The semantics of those bytes are defined by the file server – there is no
ioctl().
Strengths of the file model:
- Universal tools work everywhere: cat /net/tcp/0/status, echo kill > /proc/1234/ctl
- Shell scripts can compose services trivially
- Network transparency is automatic: mount a remote file server, same tools work
- The interface is self-documenting: ls shows available operations
- Simple tools like cat, echo, grep become universal adapters
Weaknesses of the file model:
- Type erasure. Everything is bytes. The protocol cannot express structured data without conventions layered on top (text formats, fixed layouts, etc.). A read() returns raw bytes – the client must know the expected format.
- Limited operation set. The only verbs are open, read, write, stat, create, remove. Complex operations must be encoded as write-command / read-response sequences (e.g., echo "connect 10.0.0.1!80" > /net/tcp/0/ctl). Error handling is ad-hoc.
- No schema or type checking. Nothing prevents writing garbage to a ctl file. Errors are detected at runtime, often with cryptic messages.
- No structured errors. 9P errors are text strings. No error codes, no machine-parseable error metadata.
- Byte-stream orientation. 9P read/write are offset-based byte operations. This fits files naturally but is awkward for RPC-style request/response interactions. File servers work around this with conventions (write a command, read the response from offset 0).
- No pipelining of operations. You cannot say “open this file, then read it, and if that succeeds, write to this other file” atomically. Each step is a separate round-trip (though 9P’s tag multiplexing helps amortize latency).
Capability Systems: Everything is a Typed Interface
In a capability system like capOS, resources are accessed through typed interface references:
interface Console {
write @0 (data :Data) -> ();
writeLine @1 (text :Text) -> ();
}
interface NetworkManager {
createTcpSocket @0 (addr :Text, port :UInt16) -> (socket :TcpSocket);
}
interface TcpSocket {
read @0 (count :UInt32) -> (data :Data);
write @1 (data :Data) -> (written :UInt32);
close @2 () -> ();
}
Strengths of the capability model:
- Type safety. The interface contract is machine-checked. You cannot call write on a NetworkManager – the type system prevents it.
- Rich operations. Interfaces can define arbitrary methods with typed parameters and return values. No need to encode everything as byte read/writes.
- Structured errors. Return types can include error variants. Capabilities can define error enums in the schema.
- Schema evolution. Cap’n Proto supports backwards-compatible schema changes (adding fields, adding methods). Both old and new clients/servers interoperate.
- No ambient authority. A process has precisely the capabilities it was granted. No path-based discovery, no /proc to enumerate.
- Attenuation. A broad capability can be narrowed to a restricted version (e.g., Fetch -> HttpEndpoint). The restriction is structural, not a permission check.
Weaknesses of the capability model:
- No universal tools. cat and echo do not work on capabilities. Each interface needs its own client tool or library. Debugging requires interface-aware tools.
- Harder composition. Shell pipes compose byte streams trivially. Capability composition requires typed adapters or a capability-aware shell.
- Discovery problem. ls shows files. What shows capabilities? A management-only CapabilityManager.list() call, but that requires holding the manager cap and a tool that can render the result.
- Steeper learning curve. A new developer can ls /net to understand the network stack. Understanding a capability interface requires reading the schema definition.
- Verbosity. Opening a TCP connection in Plan 9 is four file operations (clone, ctl, data, status). In a capability system, it is one typed method call. But defining the interface in the schema is more upfront work than just exporting files.
Synthesis
The file model and the capability model are not opposed – they are different points on a trade-off curve between universality and type safety. Plan 9 chose maximal universality (everything reduces to bytes + paths). Capability systems choose maximal type safety (everything has a schema).
The interesting question is whether a capability system can recover the ergonomic benefits of the file model while maintaining type safety. This is addressed in section 6.
4. 9P as IPC
File Servers as Services
In Plan 9, a “service” is simply a process that speaks 9P. When a client mounts a file server’s connection at some path, all file operations on that path become 9P messages to the server. This is the universal IPC mechanism – there are no Unix-domain sockets, no D-Bus, no shared memory primitives for service communication. Everything goes through 9P.
Examples of services-as-file-servers:
- exportfs – Re-exports a subtree of the current namespace over a network connection, letting remote clients mount it.
- ramfs – A RAM-backed file server. Mount it and you have a tmpfs.
- ftpfs – Mounts a remote FTP server as a local directory. Programs read/write files; the file server translates to FTP protocol.
- mailfs – Presents a mail spool as a directory of messages. Each message is a directory with header, body, rawbody, etc.
- plumber – The inter-application message router exports a file interface: write a message to /mnt/plumb/send, and it arrives in the target application’s plumb port.
- acme – The Acme editor exports its entire UI as a file system: windows, buffers, tags, event streams. External programs can control Acme by reading/writing these files.
The srv Device and Connection Passing
The kernel #s (srv) device provides a namespace for posting file
descriptors. A server process creates a pipe, starts serving 9P on one end,
and posts the other end as /srv/myservice. Other processes open
/srv/myservice to get a connection to the server, then mount it into
their namespace.
# Server side:
pipe = pipe()
post(pipe[0], "/srv/myfs")
serve_9p(pipe[1])
# Client side:
fd = open("/srv/myfs", ORDWR)
mount(fd, "/mnt/myfs", MREPL, "")
# Now /mnt/myfs/* are served by the server process
This decouples service registration from namespace mounting. Multiple clients can mount the same service at different paths in their own namespaces.
Performance and Overhead
9P’s overhead compared to direct function calls or shared memory:
- Serialization – Every operation is a 9P message: header parsing, field encoding/decoding. Messages are simple binary (not XML/JSON), so this is fast but nonzero.
- Copying – Data passes through the kernel (pipe or network): user buffer -> kernel pipe buffer -> server process buffer (and back for responses). This is at least two copies per direction.
- Context switches – Each request/response is a write (client) + read (server) + write (server) + read (client) = four context switches for a round-trip.
- No zero-copy – 9P does not support shared memory or page remapping. Large data transfers pay the full copy cost.
For metadata-heavy operations (stat, walk, open/close), the overhead is dominated by context switches, not data copying. Plan 9 is designed for networks where latency matters – the protocol’s simplicity and multiplexability help here.
For bulk data, the overhead is significant. Plan 9 compensates somewhat with
the iounit mechanism (encouraging large reads/writes to amortize per-call
costs) and the fact that most I/O is streaming (sequential reads/writes, not
random access).
In practice, Plan 9 systems are not optimized for raw throughput on local IPC. The design prioritizes simplicity and network transparency over local performance. The assumption is that the network is the bottleneck, so local protocol overhead is acceptable.
Network Transparency
9P’s power lies in its network transparency. The same protocol runs over:
- Pipes – Local IPC between processes on the same machine.
- TCP connections – Remote file access across the network.
- Serial lines – Early Plan 9 terminals connected to CPU servers.
- TLS/SSL – Encrypted connections (added later).
A CPU server is accessed by mounting its file system over the network. The
Plan 9 cpu command:
- Connects to a remote CPU server over TCP
- Authenticates
- Exports the local namespace (via exportfs) to the remote side
- The remote side mounts the local namespace, overlaying it with its own kernel devices
- A shell runs on the remote CPU, but with access to local files
The result: you work on the remote machine but your files, windows, and devices are local. This is more powerful than SSH because the integration is at the namespace level, not the terminal level.
Factoid: In the Plan 9 computing model, terminals were intentionally underpowered. The expensive hardware was the CPU server. Users mounted the CPU server’s filesystem and ran programs there, with the terminal providing I/O devices (keyboard, mouse, display) exported as files back to the CPU server.
5. Inferno OS
What Inferno Adds Beyond Plan 9
Inferno (also from Bell Labs, originally by the same team) took the Plan 9 architecture and adapted it for portable, networked computing. It can run as a native OS on bare hardware, as a hosted application on other OSes (Linux, Windows, macOS), or as a virtual machine.
Key additions and differences:
- Dis virtual machine – All user-space code runs on a register-based VM, not native machine code.
- Limbo language – A type-safe, garbage-collected, concurrent language (drawing on C, CSP, Newsqueak, and Alef). All applications are written in Limbo.
- Styx protocol – Inferno’s name for its 9P variant (functionally identical to 9P2000 with minor encoding differences in early versions, later fully aligned with 9P2000).
- Portable execution – The same Limbo bytecode runs on any platform where the Dis VM is available. No recompilation needed.
- Built-in cryptography – TLS, certificate-based authentication, and signed modules are integrated into the system, not bolted on.
The Dis Virtual Machine
Dis is a register-based virtual machine (unlike the JVM, which is stack-based). Key characteristics:
- Memory model – Dis uses a module-based memory model. Each loaded module has its own data segment (frame). Instructions reference memory operands by offset within the current module’s frame, the current function’s frame, or a literal (mp, fp, or immediate addressing).
- Instruction set – CISC-inspired, with three-address instructions: add src1, src2, dst. Opcodes cover arithmetic, comparison, branching, string operations, channel operations, and system calls. Around 80-90 opcodes.
- Type descriptors – Each allocated block has a type descriptor that identifies which words are pointers. This enables exact garbage collection (no conservative scanning).
- Garbage collection – Reference counting with cycle detection. Deterministic deallocation for acyclic structures (important for resource management), with periodic cycle collection.
- Module loading – Dis modules are loaded on demand. A module declares its type signature (exported functions and their types), and the loader verifies type compatibility at link time.
- JIT compilation – On supported architectures (x86, ARM, MIPS, SPARC, PowerPC), Dis bytecode is compiled to native code at load time. This removes the interpretation overhead for hot code.
- Concurrency – Dis natively supports concurrent threads of execution within a module. Threads communicate via typed channels (from CSP/Limbo).
The Limbo Language
Limbo is Inferno’s application language. Its design reflects the system’s values:
- Type-safe – No pointer arithmetic, no unchecked casts, no buffer overflows. The type system is enforced at compile time and verified at module load time.
- Garbage collected – Programmers do not manage memory. Reference counting provides deterministic resource cleanup.
- Concurrent – First-class chan types (typed channels) and spawn for creating threads. This is CSP-style concurrency, predating (and influencing) Go’s goroutines and channels.
- Module system – Modules declare interfaces (like header files with type signatures). A module imports another module’s interface, and the runtime verifies type compatibility at load time.
- ADTs – Algebraic data types with pick (tagged unions). Pattern matching over variants.
- Tuples – First-class tuple types for returning multiple values.
- No inheritance – Limbo has ADTs and modules, not objects and classes.
Example – a simple file server in Limbo:
implement Echo;
include "sys.m";
include "draw.m";
include "styx.m";
sys: Sys;
Echo: module {
init: fn(nil: ref Draw->Context, argv: list of string);
};
init(nil: ref Draw->Context, argv: list of string)
{
sys = load Sys Sys->PATH;
# ... set up Styx server, handle read/write on echo file
}
Limbo and the Namespace Model
Limbo programs interact with the namespace through the Sys module’s file
operations (open, read, write, mount, bind, etc.) – the same
operations as in Plan 9. The namespace model is identical:
- Each process group has its own namespace
- bind and mount manipulate the namespace
- File servers (Styx servers) provide services
- Union directories compose multiple servers
The difference is that Limbo’s type safety extends to the file descriptors
and channels used to communicate. A Sys->FD is a reference type, not a
raw integer. You cannot fabricate a file descriptor from nothing.
Limbo’s channel type (chan of T) provides typed communication between
concurrent threads within a process. Channels are a local IPC mechanism
complementary to Styx, which handles inter-process and inter-machine
communication.
Styx (Inferno’s 9P)
Styx is Inferno’s name for the 9P2000 protocol. In the current version of Inferno, Styx and 9P2000 are wire-compatible – the same byte format, the same message types, the same semantics. The renaming reflects Inferno’s origin as a commercial product from Vita Nuova (and before that, Lucent Technologies) with its own branding.
The Inferno kernel includes a Styx library (Styx and Styxservers
modules) that makes implementing file servers straightforward in Limbo.
The Styxservers module provides a framework: you implement a navigator
(for walk/stat) and a file handler (for read/write), and the framework
handles the protocol boilerplate.
include "styx.m";
include "styxservers.m";
styx: Styx;
styxservers: Styxservers;
Srv: adt {
# ... file tree definition
};
# The framework calls navigator.walk(), navigator.stat() for metadata
# and file.read(), file.write() for data operations.
Inferno also provides the 9srvfs utility for mounting external 9P servers
and the mount command for attaching Styx servers to the namespace – the
same patterns as Plan 9.
Security Model
Inferno’s security model builds on namespaces with additional mechanisms:
- Signed modules – Dis modules can be cryptographically signed. The loader can verify signatures before executing code.
- Certificate-based authentication – Inferno uses a certificate infrastructure (not Kerberos like Plan 9) for authenticating connections.
- Namespace restriction – The wm/sh shell and other supervisory programs can construct restricted namespaces for untrusted code.
- Type safety as security – Since Limbo prevents pointer forgery and buffer overflows, type safety is a security boundary. A Limbo program cannot escape its type system to forge file descriptors or access arbitrary memory.
6. Relevance to capOS
6.1 Namespace Composition via Capabilities
Plan 9 lesson: Per-process namespaces are a powerful isolation and composition mechanism. A process’s “view of the world” is constructed by its parent through bind/mount operations. The child cannot escape this view.
capOS parallel: Per-process capability tables serve an analogous role. A process’s “view of the world” is its set of granted capabilities. The child cannot discover or access capabilities outside its table.
What capOS could adopt:
The existing Namespace interface in the storage proposal
(docs/proposals/storage-and-naming-proposal.md) already captures some of this –
resolve, bind, list, and sub provide name-to-capability mappings.
But Plan 9’s namespace model suggests a more dynamic composition pattern:
interface Namespace {
# Resolve a name to a capability reference
resolve @0 (name :Text) -> (capId :UInt32, interfaceId :UInt64);
# Bind a capability at a name in this namespace
bind @1 (name :Text, capId :UInt32) -> ();
# Create a union: multiple capabilities behind one name
union @2 (name :Text, capId :UInt32, position :UnionPosition) -> ();
# List available names
list @3 () -> (entries :List(NamespaceEntry));
# Get a restricted sub-namespace
sub @4 (prefix :Text) -> (ns :Namespace);
}
enum UnionPosition {
before @0; # searched first (like Plan 9 MBEFORE)
after @1; # searched last (like Plan 9 MAFTER)
replace @2; # replaces existing (like Plan 9 MREPL)
}
struct NamespaceEntry {
name @0 :Text;
interfaceId @1 :UInt64;
label @2 :Text;
}
The key insight from Plan 9 is union composition – multiple capabilities can be bound at the same name, searched in order. This is useful for overlay patterns: a local cache capability layered before a remote store capability, or a per-user config namespace layered before a system-wide default.
Differences from Plan 9:
Plan 9 namespaces map names to file servers. capOS namespaces map names to typed capabilities. The advantage: capOS can verify at bind time that the capability matches the expected interface. Plan 9 cannot – you mount a file server and discover at runtime whether it exports the files you expect.
6.2 Cap’n Proto RPC vs 9P
Protocol comparison:
| Aspect | 9P2000 | Cap’n Proto RPC |
|---|---|---|
| Message format | Fixed binary fields, counted strings/data | Capnp wire format (pointer-based, zero-copy decode) |
| Operations | Fixed set (walk, open, read, write, stat, …) | Arbitrary per-interface (schema-defined methods) |
| Typing | Untyped bytes | Strongly typed (schema-checked) |
| Multiplexing | Tag-based (16-bit tags) | Question ID-based (32-bit) |
| Pipelining | Not supported (each op is independent) | Promise pipelining (call method on not-yet-returned result) |
| Authentication | Pluggable via auth fid | Application-level (not protocol-specified) |
| Capabilities | No (file fids are unforgeable handles, but no transfer/attenuation) | Native capability passing and attenuation |
| Maximum message | Negotiated msize | No inherent limit (segmented messages) |
| Schema evolution | N/A (fixed protocol) | Forward/backward compatible schema changes |
| Network transparency | Native design goal | Native design goal |
Key differences for capOS:
- Promise pipelining – This is capnp RPC’s strongest advantage over 9P. In 9P, opening a TCP connection requires: walk to /net/tcp -> walk to clone -> open clone -> read (get connection number) -> walk to ctl -> open ctl -> write “connect …” -> walk to data -> open data. Eight round-trips minimum. With capnp pipelining: net.createTcpSocket("10.0.0.1", 80) returns a promise, and you can immediately call .write(data) on the promise – the runtime chains the calls without waiting for the first to complete. One logical round-trip.
- Typed interfaces – 9P’s strength is that cat works on any file. Capnp’s strength is that the compiler catches console.allocFrame() at compile time. capOS should not try to make everything a “file” – typed interfaces are the right abstraction for a capability system. But a FileServer capability interface could provide Plan 9-like flexibility where needed (see below).
- Capability passing – 9P has no way to pass a fid through a file server to a third party. (The srv device is a workaround, not a protocol feature.) Capnp RPC natively supports passing capability references in messages. This is fundamental to capOS’s model.
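The round-trip arithmetic behind pipelining can be illustrated with a toy model (this is not Cap'n Proto's actual API; the Connection type and its call/flush methods are invented for illustration): calls made against a still-unresolved promise ride along in the same network flush as the call that produced the promise, so a dependent chain costs one round-trip.

```rust
// Toy model of promise pipelining: dependent calls are queued locally
// and shipped together; one flush = one round-trip on the wire.
struct Connection {
    round_trips: u32,
    queued: Vec<String>,
}

impl Connection {
    fn new() -> Self {
        Connection { round_trips: 0, queued: Vec::new() }
    }

    /// Queue a call. A call on an unresolved promise does not wait for
    /// the earlier result -- it just joins the pending batch.
    fn call(&mut self, method: &str) {
        self.queued.push(method.to_string());
    }

    /// Send everything queued in one round-trip and return the total
    /// round-trip count so far.
    fn flush(&mut self) -> u32 {
        if !self.queued.is_empty() {
            self.round_trips += 1;
            self.queued.clear();
        }
        self.round_trips
    }
}
```

Contrast with 9P, where each walk/open/read/write in the connection-setup sequence is its own flush, giving the eight round-trips counted above.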
6.3 File Server Pattern as a Capability
Plan 9’s file server pattern is useful and should not be discarded just
because capOS is capability-based. Instead, define a generic FileServer
capability interface:
interface FileServer {
walk @0 (names :List(Text)) -> (fid :FileFid);
list @1 (fid :FileFid) -> (entries :List(DirEntry));
}
interface FileFid {
open @0 (mode :OpenMode) -> (iounit :UInt32);
read @1 (offset :UInt64, count :UInt32) -> (data :Data);
write @2 (offset :UInt64, data :Data) -> (written :UInt32);
stat @3 () -> (info :FileInfo);
close @4 () -> ();
}
A FileServer capability enables:
- /proc-like introspection – A debugging service exports process state as a file tree. Tools read files to inspect state.
- Config storage – A configuration namespace can be exposed as files for tools that work with text.
- POSIX compatibility – The POSIX shim layer maps open()/read()/write() to FileServer capability calls.
- Shell scripting – A capability-aware shell could mount FileServer caps and use cat/echo-style tools on them.
The point: FileServer is one capability interface among many. It is not
the universal abstraction (as in Plan 9), but it is available where the
file metaphor is natural.
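How the POSIX shim mapping might look, sketched with hypothetical Rust bindings for the FileFid interface above (the trait, PosixFile, and MemFid here are illustrative, not capOS code): the shim keeps a per-descriptor cursor and translates sequential POSIX-style reads into offset-based FileFid.read calls.

```rust
// Stand-in for the FileFid capability's read method: offset-based,
// like Tread in 9P or FileFid.read in the schema above.
trait FileFid {
    fn read(&self, offset: u64, count: u32) -> Vec<u8>;
}

/// One open POSIX-style descriptor: a fid capability plus a cursor.
struct PosixFile<F: FileFid> {
    fid: F,
    offset: u64,
}

impl<F: FileFid> PosixFile<F> {
    fn new(fid: F) -> Self {
        PosixFile { fid, offset: 0 }
    }

    /// POSIX read(): advance the cursor by what was actually returned,
    /// so repeated calls stream through the file even though the
    /// underlying capability is stateless and offset-based.
    fn read(&mut self, count: u32) -> Vec<u8> {
        let data = self.fid.read(self.offset, count);
        self.offset += data.len() as u64;
        data
    }
}

/// In-memory stand-in for a FileFid capability, for testing the shim.
struct MemFid(Vec<u8>);

impl FileFid for MemFid {
    fn read(&self, offset: u64, count: u32) -> Vec<u8> {
        let start = (offset as usize).min(self.0.len());
        let end = (start + count as usize).min(self.0.len());
        self.0[start..end].to_vec()
    }
}
```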
6.4 IPC Lessons
Plan 9 lesson: 9P works as universal IPC because the protocol is simple and the kernel handles the plumbing (mount, pipe, network). The cost is per-message overhead (copies, context switches).
capOS implications:
- Minimize copies. 9P’s two-copies-per-direction (user -> kernel pipe buffer -> server) is acceptable for networks but expensive for local IPC. capOS should investigate shared-memory regions for bulk data transfer between co-located processes, with capnp messages as the control plane. The roadmap’s io_uring-inspired submission/completion rings already point in this direction.
- Direct context switch. The L4/seL4 IPC fast-path (direct switch from caller to callee without choosing an unrelated runnable process) now exists as a baseline for blocked Endpoint receivers. Plan 9 does not do this – every 9P round-trip goes through the kernel’s pipe/network layer. capOS can tune this further because capability calls have a known target process.
- Batching. Plan 9 mitigates round-trip costs through large reads/writes (the iounit mechanism). Capnp’s promise pipelining is the typed equivalent – batch multiple logical operations into a dependency chain that executes without intermediate round-trips.
6.5 Inferno Lessons
Dis VM / type safety: Inferno’s bet on a managed runtime (Dis + Limbo) gives it type safety as a security boundary. capOS, being written in Rust for kernel code and targeting native binaries, does not have this luxury for arbitrary user-space code. However:
- WASI support (on the roadmap) provides a sandboxed execution environment with type-checked interfaces, similar in spirit to Dis.
- Cap’n Proto schemas provide interface-level type safety even for native code. The schema is the contract, enforced at message boundaries.
Channel-based concurrency: Limbo’s chan of T type is a local IPC
mechanism within a process. capOS does not currently have this (it relies
on kernel-mediated capability calls for all IPC). For in-process threading
(on the roadmap), typed channels between threads could be useful –
implemented as a library on top of shared memory + futex, without kernel
involvement.
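As a sketch of what such a library-level typed channel could look like: the bounded channel below uses Rust's `Mutex` and `Condvar` as stand-ins for the shared-memory buffer and futex wait/wake a real capOS implementation would use. All names here are illustrative, not part of any existing capOS API.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

/// A bounded, typed in-process channel. Mutex + Condvar stand in for the
/// shared-memory + futex primitives; no kernel capability call is involved.
pub struct Chan<T> {
    inner: Arc<(Mutex<VecDeque<T>>, Condvar)>,
    cap: usize,
}

impl<T> Clone for Chan<T> {
    fn clone(&self) -> Self {
        Chan { inner: Arc::clone(&self.inner), cap: self.cap }
    }
}

impl<T> Chan<T> {
    pub fn new(cap: usize) -> Self {
        Chan { inner: Arc::new((Mutex::new(VecDeque::new()), Condvar::new())), cap }
    }

    /// Blocking send: waits while the buffer is full.
    pub fn send(&self, v: T) {
        let (lock, cv) = &*self.inner;
        let mut q = lock.lock().unwrap();
        while q.len() == self.cap {
            q = cv.wait(q).unwrap();
        }
        q.push_back(v);
        cv.notify_all();
    }

    /// Blocking receive: waits while the buffer is empty.
    pub fn recv(&self) -> T {
        let (lock, cv) = &*self.inner;
        let mut q = lock.lock().unwrap();
        loop {
            if let Some(v) = q.pop_front() {
                cv.notify_all();
                return v;
            }
            q = cv.wait(q).unwrap();
        }
    }
}
```

The type parameter is the point: unlike a byte-stream pipe, the channel carries `T` values whose shape is checked at compile time, the local analogue of Limbo's `chan of T`.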
Portable execution: Inferno’s ability to run the same bytecode everywhere is appealing but orthogonal to capOS’s goals. The WASI runtime item on the roadmap serves this purpose for capOS.
6.6 Concrete Recommendations
Based on this research, the following items are most relevant to capOS development:
- Add a Namespace capability with union semantics. Extend the existing Namespace design (from the storage proposal) with Plan 9-style union composition (before/after/replace). This enables overlay patterns for configuration, caching, and modularity.
- Implement a FileServer capability interface. Not as the universal abstraction, but as one interface for resources that are naturally file-like (config trees, debug introspection, POSIX compatibility). A FileServer cap is just another capability – no special kernel support needed.
- Prioritize promise pipelining. This is capnp’s killer feature over 9P and the biggest performance advantage for IPC-heavy workloads. Multiple logical operations collapse into one network/IPC round-trip. Async rings are in place; the remaining work is the Stage 6 pipeline dependency/result-cap mapping rule.
- Plan 9-style namespace construction in init. The boot manifest already describes which capabilities each service receives. Consider adding namespace-level composition to the manifest: “this service sees capability X as data/primary and capability Y as data/cache, with cache searched first” – union directory semantics expressed in capability terms.
- Study 9P’s exportfs pattern for network transparency. Plan 9’s exportfs re-exports a namespace subtree over the network. The capOS equivalent would be a proxy service that takes a set of local capabilities and makes them available as capnp RPC endpoints on the network. This is the “network transparency” roadmap item – 9P’s design proves it is achievable, and capnp’s richer type system makes it more robust.
- Do not replicate 9P’s weaknesses. The untyped byte-stream interface, the lack of structured errors, and the fixed operation set are 9P’s costs for universality. capOS pays none of these costs with Cap’n Proto. The temptation to “make everything a file for simplicity” should be resisted – typed capabilities are strictly more powerful, and the FileServer interface provides the file metaphor where needed without compromising the rest of the system.
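The union-semantics recommendation can be made concrete with a small sketch. The following is an illustrative model, not a proposed capOS API: a namespace is an ordered stack of name-to-capability layers, and lookup searches front to back so earlier layers shadow later ones, exactly as in a Plan 9 union directory. Type and method names (`UnionNs`, `bind`, `Order`) are invented for this example; the `u64` values stand in for capability slot indices.

```rust
use std::collections::HashMap;

/// Where to place a newly bound layer, mirroring Plan 9's bind flags.
pub enum Order { Before, After, Replace }

/// A union namespace: an ordered stack of name -> capability-slot layers.
pub struct UnionNs {
    layers: Vec<HashMap<String, u64>>, // u64 stands in for a capability slot
}

impl UnionNs {
    pub fn new() -> Self {
        UnionNs { layers: Vec::new() }
    }

    pub fn bind(&mut self, layer: HashMap<String, u64>, order: Order) {
        match order {
            Order::Before => self.layers.insert(0, layer),
            Order::After => self.layers.push(layer),
            Order::Replace => {
                self.layers.clear();
                self.layers.push(layer);
            }
        }
    }

    /// First match wins, as in a Plan 9 union directory.
    pub fn lookup(&self, name: &str) -> Option<u64> {
        self.layers.iter().find_map(|l| l.get(name).copied())
    }
}
```

The “cache searched first” manifest example from the list above is then just `bind(cache, Order::Before)` over an existing `data/primary` layer.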
Summary
| Plan 9 / Inferno Concept | capOS Equivalent | Gap / Action |
|---|---|---|
| Per-process namespace (bind/mount) | Per-process capability table | Add Namespace cap with union semantics |
| 9P protocol (file operations) | Cap’n Proto RPC (typed method calls) | capnp is strictly superior for typed IPC; FileServer cap provides file semantics where needed |
| Union directories | No current equivalent | Add union composition to Namespace interface |
| File servers as services | Capability-implementing processes | Already the model; manifest-driven service graph is close to Plan 9’s boot namespace construction |
| Network transparency via 9P | Network transparency via capnp RPC | Same goal, capnp adds promise pipelining and typed interfaces |
| exportfs (namespace re-export) | Capability proxy service | Not yet designed; high-value future work |
| Styx/9P as universal IPC | Capnp messages as universal IPC | Already the model; prioritize fast-path and pipelining |
| Dis VM (portable, type-safe execution) | WASI runtime (roadmap) | Same goal, different mechanism |
| Limbo channels (typed local IPC) | Not yet present | Consider for in-process threading |
| Authentication via auth fid | Not yet designed | Cap’n Proto RPC has no built-in auth; needs design |
References
- Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, Phil Winterbottom. “Plan 9 from Bell Labs.” Computing Systems, Vol. 8, No. 3, Summer 1995, pp. 221-254.
- Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey, Phil Winterbottom. “The Use of Name Spaces in Plan 9.” Operating Systems Review, Vol. 27, No. 2, April 1993, pp. 72-76.
- Plan 9 Manual: intro(1), bind(1), mount(1), intro(5) (the 9P manual section).
- Russ Cox, Eric Grosse, Rob Pike, Dave Presotto, Sean Quinlan. “Security in Plan 9.” USENIX Security 2002.
- Sean Dorward, Rob Pike, Dave Presotto, Dennis Ritchie, Howard Trickey, Phil Winterbottom. “The Inferno Operating System.” Bell Labs Technical Journal, Vol. 2, No. 1, Winter 1997.
- Phil Winterbottom, Rob Pike. “The Design of the Inferno Virtual Machine.” Bell Labs, 1997.
- Vita Nuova. “The Dis Virtual Machine Specification.” 2003.
- Vita Nuova. “The Limbo Programming Language.” 2003.
- Sape Mullender (editor). “The 9P2000 Protocol.” Plan 9 manual, section 5 (intro(5)).
- Kenichi Okada. “9P Resource Sharing Protocol.” IETF Internet-Draft, 2010.
Research: EROS, CapROS, and Coyotos
Deep analysis of persistent capability operating systems and their relevance to capOS.
1. EROS (Extremely Reliable Operating System)
1.1 Overview
EROS was designed and implemented by Jonathan Shapiro and collaborators at the University of Pennsylvania, starting in the late 1990s. It is a pure capability system descended from KeyKOS (developed at Key Logic in the 1980s). EROS’s defining feature is orthogonal persistence: the entire system state – processes, memory, capabilities – is transparently persistent. There is no distinction between “in memory” and “on disk.”
Key papers:
- Shapiro, J. S., Smith, J. M., & Farber, D. J. “EROS: A Fast Capability System” (SOSP 1999)
- Shapiro, J. S. “EROS: A Capability System” (PhD dissertation, 1999)
- Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism” (IEEE S&P 2000)
1.2 The Single-Level Store
In a conventional OS, memory and storage are separate address spaces with different APIs (read/write vs mmap/file I/O). The programmer is responsible for explicitly loading data from disk into memory, modifying it, and writing it back. This creates an impedance mismatch that is the source of enormous complexity (serialization, caching, crash consistency, etc.).
EROS eliminates this distinction with a single-level store:
- All objects (processes, memory pages, capability nodes) exist in a unified persistent object space.
- There is no “file system” and no “load/save.” Objects simply exist.
- The system periodically checkpoints the entire state to disk. Between checkpoints, modified pages are held in memory. After a crash, the system restores to the last consistent checkpoint.
- From the application’s perspective, memory IS storage. There is no API for persistence – it happens automatically.
The single-level store in EROS operates on two primitive object types:
- Pages – 4KB data pages (the equivalent of both memory pages and file blocks).
- Nodes – 32-slot capability containers (the equivalent of both process state and directory entries).
Every page and node has a persistent identity (an Object ID, or OID). The kernel maintains an in-memory object cache and demand-pages objects from disk as needed. Modified objects are written back during checkpoints.
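The OID-indexed object cache and demand paging described above can be sketched as follows. This is a model of the mechanism, not EROS code: a `HashMap` plays the role of the disk-resident object store, and a dirty bit per cached object drives checkpoint write-back.

```rust
use std::collections::HashMap;

pub type Oid = u64;

/// An in-memory object cache over a persistent store, in the style of
/// EROS's OID-indexed single-level store. `store` stands in for disk.
pub struct ObjectCache {
    store: HashMap<Oid, Vec<u8>>,          // "disk": canonical copies by OID
    cache: HashMap<Oid, (Vec<u8>, bool)>,  // in-memory copy + dirty bit
}

impl ObjectCache {
    pub fn new(store: HashMap<Oid, Vec<u8>>) -> Self {
        ObjectCache { store, cache: HashMap::new() }
    }

    /// Demand-page an object in on first access.
    pub fn read(&mut self, oid: Oid) -> Option<&Vec<u8>> {
        if !self.cache.contains_key(&oid) {
            let bytes = self.store.get(&oid)?.clone(); // "disk read"
            self.cache.insert(oid, (bytes, false));
        }
        self.cache.get(&oid).map(|(b, _)| b)
    }

    /// Mutate the cached copy and mark it dirty for the next checkpoint.
    pub fn write(&mut self, oid: Oid, bytes: Vec<u8>) {
        self.cache.insert(oid, (bytes, true));
    }

    /// Checkpoint: write dirty objects back, clear dirty bits, return count.
    pub fn checkpoint(&mut self) -> usize {
        let mut written = 0;
        for (oid, (bytes, dirty)) in self.cache.iter_mut() {
            if *dirty {
                self.store.insert(*oid, bytes.clone());
                *dirty = false;
                written += 1;
            }
        }
        written
    }
}
```

Note how `checkpoint` touches only dirty objects, matching the EROS property that checkpoint cost scales with the mutation rate rather than total store size.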
1.3 Checkpoint/Restart
EROS uses a consistent checkpoint mechanism inspired by KeyKOS:
How it works:
- The kernel periodically initiates a checkpoint (KeyKOS used a 5-minute interval; EROS used a configurable interval, typically seconds to minutes).
- All processes are momentarily frozen.
- The kernel snapshots the current state:
- All dirty pages are marked for write-back.
- All node state (capability tables, process descriptors) is serialized.
- A consistent snapshot of the entire system is captured.
- Processes resume immediately – they continue modifying their own copies of pages (copy-on-write semantics ensure the checkpoint image is stable while new modifications accumulate).
- The snapshot is written to disk asynchronously while processes continue running.
- Once the write completes, the checkpoint is atomically committed (a checkpoint header on disk is updated).
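The copy-on-write trick in steps 4–5 can be shown in miniature. In this sketch (a model, not the EROS implementation), pages are held behind `Rc` handles: the freeze phase clones only the handles, which is cheap, and a post-freeze write installs a fresh handle, so the snapshot's view never changes while the asynchronous write-back proceeds.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Copy-on-write checkpoint sketch: the freeze phase copies Rc pointers,
/// not page contents, so the snapshot stays stable under later writes.
pub struct CowMemory {
    pages: HashMap<u64, Rc<Vec<u8>>>,
}

impl CowMemory {
    pub fn new() -> Self {
        CowMemory { pages: HashMap::new() }
    }

    /// Writing installs a fresh Rc; any live snapshot keeps the old page.
    pub fn write(&mut self, page: u64, data: Vec<u8>) {
        self.pages.insert(page, Rc::new(data));
    }

    pub fn read(&self, page: u64) -> Option<Rc<Vec<u8>>> {
        self.pages.get(&page).cloned()
    }

    /// The "freeze" phase: O(pages) pointer copies, no data copies.
    pub fn snapshot(&self) -> HashMap<u64, Rc<Vec<u8>>> {
        self.pages.clone()
    }
}
```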
What state is captured:
- All memory pages (dirty pages since last checkpoint).
- All nodes (capability slots, process registers, scheduling state).
- The kernel’s object table (mapping OIDs to disk locations).
- The capability graph (which process holds which capabilities).
Recovery after crash:
- On boot, the kernel reads the last committed checkpoint header.
- The system resumes from that exact state. All processes continue as if nothing happened (they may have lost a few seconds of work since the last checkpoint).
- No fsck, no journal replay, no application-level recovery logic.
Performance characteristics:
- Checkpoint cost is proportional to the number of dirty pages since the last checkpoint, not total system size.
- Copy-on-write minimizes pause time – processes are frozen only long enough to mark pages, not to write them.
- EROS achieved checkpoint times of a few milliseconds for the freeze phase, with asynchronous write-back taking longer depending on dirty set size.
- The 1999 SOSP paper reported IPC performance within 2x of L4 (the fastest microkernel at the time) despite the persistence overhead.
1.4 Capabilities: Keys, Nodes, and Domains
EROS (following KeyKOS) uses a specific capability model with three fundamental concepts:
Keys (capabilities):
A key is an unforgeable reference to an object. Keys are the ONLY way to access anything in the system. There are several types:
- Page keys – reference a persistent page. Can be read-only or read-write.
- Node keys – reference a node (a 32-slot capability container). Can be read-only.
- Process keys (called “domain keys” in KeyKOS) – reference a process, allowing control operations (start, stop, set registers).
- Number keys – encode a 96-bit value directly in the key (no indirection). Used for passing constants through the capability mechanism.
- Device keys – reference hardware device registers.
- Forwarder keys – indirection keys used for revocation (see below).
- Void keys – null/invalid keys, used as placeholders.
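The key taxonomy above maps naturally onto a Rust sum type. This is an illustrative rendering, not the actual EROS layout (field names are invented, and EROS packs keys into fixed-size slots rather than enum variants); it does show why "weaken but never strengthen" falls out of the representation.

```rust
/// A sketch of EROS/KeyKOS key types as a Rust enum. Field names are
/// illustrative, not the actual EROS key layout.
#[derive(Clone, Debug, PartialEq)]
pub enum Key {
    Page { oid: u64, writable: bool },
    Node { oid: u64 },
    Process { oid: u64 },
    /// Number keys carry a value directly, with no target object.
    Number(u128), // EROS packs 96 bits; u128 is the nearest Rust fit
    Device { register_base: u64 },
    Forwarder { oid: u64 },
    Void,
}

impl Key {
    /// Downgrade: a writable page key can be weakened, never strengthened.
    pub fn read_only(self) -> Key {
        match self {
            Key::Page { oid, .. } => Key::Page { oid, writable: false },
            other => other,
        }
    }
}
```

Because holders can only call weakening operations like `read_only`, there is no path from a read-only key back to a read-write one, the amplification-freedom property the text returns to in the Coyotos section.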
Nodes:
A node is a persistent container of exactly 32 key slots (in KeyKOS; EROS varied this slightly). Nodes serve multiple purposes:
- Address space description: A tree of nodes with page keys at the leaves defines a process’s virtual address space. The kernel walks this tree to resolve virtual addresses to physical pages (analogous to page tables, but persistent and capability-based).
- Capability storage: A process’s “capability table” is a node tree.
- General-purpose data structure: Any capability-based data structure (directories, lists, etc.) is built from nodes.
Domains (processes):
A domain is EROS’s equivalent of a process. It consists of:
- A domain root node with specific slots for:
- Slot 0-15: general-purpose key registers (the process’s capability table)
- Address space key (points to the root of the address space node tree)
- Schedule key (determines CPU time allocation)
- Brand key (identity for authentication)
- Other control keys
- The domain’s register state (general-purpose registers, IP, SP, flags)
- A state (running, waiting, available)
The entire domain state is captured during checkpoint because it’s all stored in persistent nodes and pages.
1.5 The Keeper Mechanism
Each domain has a keeper key – a capability to another domain that acts as its fault handler. When a domain faults (page fault, capability fault, exception), the kernel invokes the keeper:
- The faulting domain is suspended.
- The kernel sends a message to the keeper describing the fault.
- The keeper can inspect and modify the faulting domain’s state (via the domain key), fix the fault (e.g., map a page, supply a capability), and restart it.
This is EROS’s equivalent of signal handlers or exception ports, but capability-mediated and fully general. Keepers enable:
- Demand paging (the space bank keeper maps pages on fault)
- Capability interposition (a keeper can wrap/restrict capabilities)
- Process supervision (restart on crash)
1.6 Capability Revocation
Capability revocation – the ability to invalidate all copies of a capability – is one of the hardest problems in capability systems. EROS solves it with forwarder keys (called “sensory keys” in some descriptions):
How forwarders work:
- Instead of giving a client a direct key to a resource, the server creates a forwarder node.
- The forwarder contains a key to the real resource in one of its slots.
- The client receives a key to the forwarder, not the resource.
- When the client invokes the forwarder key, the kernel transparently redirects to the real resource.
- To revoke: the server rescinds the forwarder (sets a bit on the forwarder node). All outstanding forwarder keys become void keys. Invocations fail immediately.
Properties:
- Revocation is O(1) – flip a bit on the forwarder node. No need to scan all processes for copies.
- Revocation is transitive – if the revoked key was used to derive other keys (via further forwarders), those are also invalidated.
- The client cannot distinguish a forwarder key from a direct key (the kernel handles the indirection transparently).
- Revocation is immediate and irrevocable.
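A minimal model of the forwarder mechanism makes the O(1) property concrete. This is a sketch under the stated semantics, not EROS code: clients hold handles to the forwarder, never to the target, so flipping one flag kills every outstanding copy at once.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Forwarder sketch: clients reference the forwarder, not the target.
pub struct Forwarder {
    target_oid: u64,
    rescinded: Cell<bool>,
}

/// The key handed to clients; cloning models capability copies.
#[derive(Clone)]
pub struct ClientKey(Rc<Forwarder>);

impl Forwarder {
    /// Server-side: create a forwarder and the key to hand out.
    pub fn issue(target_oid: u64) -> (Rc<Forwarder>, ClientKey) {
        let f = Rc::new(Forwarder { target_oid, rescinded: Cell::new(false) });
        (Rc::clone(&f), ClientKey(Rc::clone(&f)))
    }
}

impl ClientKey {
    /// Invoke through the forwarder; fails once the server rescinds it.
    pub fn invoke(&self) -> Result<u64, &'static str> {
        if self.0.rescinded.get() {
            Err("void key") // behaves like an EROS void key
        } else {
            Ok(self.0.target_oid)
        }
    }
}

/// O(1) revocation: one flag flip invalidates every copy of the key.
pub fn rescind(f: &Forwarder) {
    f.rescinded.set(true);
}
```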
Space banks and revocation:
EROS uses space banks (inspired by KeyKOS) to manage resource allocation. A space bank is a capability that allocates pages and nodes. When a space bank is destroyed, ALL objects allocated from it are reclaimed. This provides bulk revocation of an entire subsystem.
1.7 Confinement
EROS provides a formally verified confinement mechanism. A confined subsystem cannot leak information to the outside world except through channels explicitly provided to it. Shapiro and Weber (IEEE S&P 2000) proved that EROS can construct a confined subsystem using:
- A constructor that creates the confined process.
- The confined process receives ONLY the capabilities explicitly granted to it. It has no ambient authority, no access to timers (to prevent timing channels), and no access to storage (to prevent storage channels).
- The constructor verifies that no covert channels exist in the granted capability set.
This is relevant to capOS’s capability model: the same structural properties that make EROS confinement possible (no ambient authority, capabilities as the only access mechanism) are present in capOS’s design.
2. CapROS
2.1 Relationship to EROS
CapROS (Capability-based Reliable Operating System) is the direct successor to EROS. It was started by Charles Landau (who also worked on KeyKOS) and continues development based on the EROS codebase. CapROS is essentially “EROS in production” – the same architecture with engineering improvements.
2.2 Improvements Over EROS
Practical engineering focus:
- EROS was a research system; CapROS aims to be deployable.
- CapROS added support for modern hardware (PCI, USB, networking).
- Improved build system and development toolchain.
Persistence improvements:
- CapROS refined the checkpoint mechanism for better performance with modern disk characteristics (SSDs change the cost model significantly – random writes are cheap, so the checkpoint layout can be optimized differently than for spinning disks).
- Added support for larger persistent object spaces.
- Improved crash recovery speed.
Device driver model:
- CapROS runs device drivers as user-space processes (like EROS), each receiving only the device capabilities they need.
- A device driver receives: device register keys (MMIO access), interrupt keys (to receive interrupts), and DMA buffer keys.
- The driver CANNOT access other devices, other processes’ memory, or arbitrary I/O ports. It is confined to its specific device.
- This is directly analogous to capOS’s planned device capability model (see the networking and cloud deployment proposals).
Linux compatibility layer:
- CapROS includes a partial Linux kernel compatibility layer that allows some Linux device drivers to be compiled and run as CapROS user-space drivers. This pragmatically addresses the “driver availability” problem without compromising the capability model.
2.3 Current Status
CapROS development continued into the 2010s but has been relatively quiet. The codebase exists and runs on real x86 hardware. It is not widely deployed and remains primarily a research/demonstration system. The key contribution is demonstrating that the EROS/KeyKOS persistent capability model is viable on modern hardware and can support real device drivers and applications.
2.4 Device Drivers and Hardware Access
CapROS’s device driver isolation is worth examining in detail because capOS faces the same design decisions:
Device capability model:
```text
Kernel
│
├── DeviceManager capability
│   │
│   ├── grants DeviceMMIO(base, size) to driver
│   ├── grants InterruptCap(irq_number) to driver
│   └── grants DMAPool(phys_range) to driver
│
└── Driver process
    │
    ├── uses DeviceMMIO to read/write registers
    ├── uses InterruptCap to wait for interrupts
    ├── uses DMAPool to allocate DMA-safe buffers
    └── exports higher-level capability (e.g., NIC, Block)
```
The driver has no way to access memory outside its granted ranges. A buggy NIC driver cannot corrupt disk I/O or access other processes’ pages.
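The confinement property reduces to a bounds check on every register access. The sketch below is illustrative (the `DeviceMmio` name and `check` method are invented for this example, matching the `DeviceMMIO(base, size)` grant in the diagram, not any actual CapROS or capOS API):

```rust
/// A granted MMIO window: the driver may touch only [base, base + size).
pub struct DeviceMmio {
    base: u64,
    size: u64,
}

impl DeviceMmio {
    pub fn new(base: u64, size: u64) -> Self {
        DeviceMmio { base, size }
    }

    /// Every register access is validated against the grant, so a buggy
    /// driver cannot reach another device's registers.
    pub fn check(&self, addr: u64, len: u64) -> Result<(), &'static str> {
        let end = addr.checked_add(len).ok_or("address overflow")?;
        if addr >= self.base && end <= self.base + self.size {
            Ok(())
        } else {
            Err("access outside granted MMIO window")
        }
    }
}
```

In a real system the check is enforced by the MMU mapping (the window is simply the only device memory mapped into the driver), but the authority boundary is the same.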
3. Coyotos
3.1 Design Philosophy
Coyotos was Jonathan Shapiro’s next-generation project after EROS, started around 2004. Where EROS was an implementation of the KeyKOS model in C, Coyotos aimed to be a formally verifiable capability OS from the ground up.
Key differences from EROS:
- Verification-oriented design: Every kernel mechanism was designed to be amenable to formal verification. If a feature couldn’t be verified, it was redesigned or removed.
- BitC language: A new programming language (BitC) was designed specifically for writing verified systems software.
- Simplified object model: Coyotos reduced the number of primitive object types compared to EROS, making the verification target smaller.
- No inline assembly in the verified core: The verified kernel core was to be written entirely in BitC, with a thin hardware abstraction layer underneath.
3.2 BitC Language
BitC was an ambitious attempt to create a language suitable for both systems programming and formal verification:
Design goals:
- Type safety: Sound type system that prevents memory errors at compile time.
- Low-level control: Direct memory layout control, no garbage collector, suitable for kernel code.
- Formal reasoning: Type system designed so that proofs about programs could be mechanically checked.
- Mutability control: Explicit distinction between mutable and immutable references (predating Rust’s borrow checker by several years).
Relationship to capability verification:
The key insight was that if the kernel is written in a language with a sound type system, and capabilities are represented as typed references in that language, then many capability safety properties (no forgery, no amplification) follow from type safety rather than requiring separate proofs.
Specifically:
- Capabilities are opaque typed references – the type system prevents construction of capabilities from raw integers.
- The lack of arbitrary pointer arithmetic prevents capability forgery.
- Type-based access control means a read-only capability reference cannot be cast to a read-write one.
Outcome:
BitC was never completed. The language design proved extremely difficult – combining low-level systems programming with formal verification requirements created unsolvable tensions in the type system. Shapiro eventually acknowledged that the BitC approach was overambitious and shelved the project. (Rust, which appeared later, solved many of the same problems with a different approach – borrowing and lifetimes rather than full dependent types.)
3.3 Formal Verification Approach
Coyotos aimed to verify several key properties:
- Capability safety: No process can forge, modify, or amplify a capability. This was to be proved as a consequence of BitC’s type safety.
- Confinement: A confined subsystem cannot leak information except through authorized channels. EROS proved this informally; Coyotos aimed for machine-checked proofs.
- Authority propagation: Formal model of how authority flows through the capability graph, allowing static analysis of security policies.
- Memory safety: The kernel never accesses memory it shouldn’t, never double-frees, never uses after free. Type safety + linear types in BitC were intended to guarantee this.
The verification approach influenced later work on seL4, which successfully achieved formal verification of a capability microkernel (though in C with Isabelle/HOL proofs, not in a verification-oriented language).
3.4 Coyotos Memory Model
Coyotos simplified the EROS memory model while retaining persistence:
Objects:
- Pages: 4KB data pages (same as EROS).
- CapPages: Pages that hold capabilities instead of data. This replaced EROS’s fixed-size nodes with variable-size capability containers.
- GPTs (Guarded Page Tables): A unified abstraction for address space construction. Instead of EROS’s separate node trees for address spaces, Coyotos uses GPTs that combine guard bits (for sparse address space construction, similar to Patricia trees) with page table semantics.
- Processes: Similar to EROS domains but with a cleaner structure.
- Endpoints: IPC communication endpoints (similar to L4 endpoints, replacing EROS’s direct domain-to-domain calls).
GPTs (Guarded Page Tables):
This was Coyotos’s most innovative memory model contribution. A GPT node has:
- A guard value and guard length (for address space compression).
- Multiple capability slots pointing to sub-GPTs or pages.
- Hardware-independent address space description that the kernel translates to actual page tables on TLB miss.
The guard mechanism allows sparse address spaces without allocating intermediate page table levels. For example, a process that uses only two memory regions at addresses 0x1000 and 0x7FFF_F000 needs only a few GPT nodes, not a full 4-level page table tree.
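A simplified guarded lookup shows how the guard bits compress sparse address spaces. This is a model of the idea, not the Coyotos data structure: each node consumes `guard_len` bits that must match its guard value, then `slot_bits` bits as a slot index; a mismatch means the address is unmapped. It assumes address widths below 64 bits.

```rust
/// A simplified guarded-page-table node: the guard must match the top
/// `guard_len` bits of the remaining address before slots are indexed.
pub enum Gpt {
    Page(u64), // leaf: physical frame number
    Node {
        guard: u64,
        guard_len: u32, // bits consumed by the guard
        slot_bits: u32, // bits consumed by the slot index
        slots: Vec<Option<Box<Gpt>>>,
    },
}

/// Translate `addr` (treated as `addr_bits` wide, addr_bits < 64) to a
/// frame number, or None if unmapped.
pub fn translate(gpt: &Gpt, addr: u64, addr_bits: u32) -> Option<u64> {
    match gpt {
        Gpt::Page(frame) => Some(*frame),
        Gpt::Node { guard, guard_len, slot_bits, slots } => {
            // The top guard_len bits must equal the guard value.
            let after_guard = addr_bits.checked_sub(*guard_len)?;
            if (addr >> after_guard) != *guard {
                return None; // guard mismatch: address not mapped
            }
            let rest = addr & ((1u64 << after_guard) - 1);
            let after_slot = after_guard.checked_sub(*slot_bits)?;
            let index = (rest >> after_slot) as usize;
            let child = slots.get(index)?.as_ref()?;
            translate(child, rest & ((1u64 << after_slot) - 1), after_slot)
        }
    }
}
```

A single node with a 16-bit guard stands in for four levels of conventional page table when the intervening bits are constant, which is exactly the sparse-address-space saving described above.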
Persistence:
Coyotos retained EROS’s checkpoint-based persistence but with a cleaner separation between the persistent object store and the in-memory cache. The simpler object model (fewer object types) made the checkpoint logic easier to verify.
3.5 Current Status
Coyotos was never completed. The BitC language proved too difficult, and Shapiro moved on to other work. However, Coyotos’s design documents and specifications remain valuable as a carefully reasoned evolution of the EROS model. The key ideas (GPTs, endpoint-based IPC, verification-oriented design) influenced other systems work.
4. Single-Level Store: Deep Dive
4.1 The Core Concept
The single-level store unifies two traditionally separate abstractions:
| Traditional OS | Single-Level Store |
|---|---|
| Virtual memory (RAM, volatile) | Unified persistent object space |
| File system (disk, persistent) | Same unified space |
| mmap (bridge between the two) | No bridge needed |
| Serialization (convert objects to bytes for storage) | Objects are always in storable form |
| Crash recovery (fsck, journal replay) | Checkpoint restore |
In a single-level store, the programmer never thinks about persistence. Objects are created, modified, and eventually garbage collected. The system ensures they survive power failure without any explicit save operation.
4.2 Implementation in EROS
EROS’s single-level store works as follows:
Object storage on disk:
- The disk is divided into two regions: the object store and the checkpoint log.
- The object store holds the canonical copy of all objects (pages and nodes), indexed by OID.
- The checkpoint log holds the most recently checkpointed versions of modified objects.
Object lifecycle:
- An object is created (allocated from a space bank). It receives a unique OID.
- The object exists in the in-memory object cache. It may be modified arbitrarily.
- During checkpoint, if the object is dirty, its current state is written to the checkpoint log.
- After the checkpoint commits, the logged version may be migrated to the object store (or left in the log until the next checkpoint).
- If the object is evicted from memory (memory pressure), it can be demand-paged back from disk.
Demand paging:
When a process accesses a virtual address that isn’t currently in physical memory:
- Page fault occurs.
- The kernel looks up the OID for that virtual page (by walking the address space capability tree).
- If the object is on disk, the kernel reads it into the object cache.
- The page is mapped into the process’s address space.
- The process continues, unaware that I/O occurred.
This is similar to demand paging in a conventional OS, but with a critical difference: the “backing store” is the persistent object store, not a swap partition. There is no separate swap space.
4.3 Performance Implications
Advantages:
- No serialization overhead for persistence. Objects are stored in their in-memory format.
- No double-buffering. A conventional OS may have a page in both the page cache and a file buffer; EROS has one copy.
- Checkpoint cost is proportional to mutation rate, not data size.
- Recovery is instantaneous – resume from last checkpoint, no log replay.
Disadvantages:
- Checkpoint pause: Even with copy-on-write, there is a brief pause to snapshot the system state. KeyKOS/EROS measured this at milliseconds, but it can grow with the number of dirty pages.
- Write amplification: Every modified page must be written to the checkpoint log, even if only one byte changed. This is worse than a log-structured filesystem that can coalesce small writes.
- Memory pressure: The object cache competes with application working sets. Under heavy memory pressure, the system may thrash between paging objects in and checkpointing them out.
- Large object stores: The OID-to-disk-location mapping must be kept in memory (or itself paged, adding complexity). For very large stores, this overhead grows.
- No partial persistence: You can’t choose to make some objects transient and others persistent. Everything is persistent. This wastes disk bandwidth on objects that don’t need persistence (temporary buffers, caches, etc.).
4.4 Relationship to Persistent Memory (PMEM/Optane)
Intel Optane (3D XPoint, now discontinued but conceptually important) and other persistent memory technologies provide byte-addressable storage that survives power loss. This is remarkably close to what EROS simulates in software:
| EROS Single-Level Store | PMEM Hardware |
|---|---|
| Software checkpoint to disk | Hardware persistence on every write |
| Object cache in DRAM | Data in persistent memory |
| Demand paging from disk | Direct load/store to persistent media |
| Crash = lose since last checkpoint | Crash = lose in-flight stores (cache lines) |
PMEM makes the single-level store cheaper:
- No checkpoint writes needed for objects stored in PMEM – they’re already persistent.
- No demand paging from disk – PMEM is directly addressable.
- Consistency requires cache line flush + fence (much cheaper than disk I/O).
But PMEM doesn’t eliminate the need for the store abstraction:
- PMEM capacity is limited (compared to SSDs/HDDs). The object store may still need to tier between PMEM and block storage.
- PMEM has higher latency than DRAM. The object cache still has value as a fast-path.
- Crash consistency with PMEM requires careful ordering of writes (cache line flushes). The checkpoint model actually simplifies this – you don’t need per-object crash consistency, just per-checkpoint consistency.
Relevance to capOS:
Even without PMEM hardware, understanding the single-level store model informs how capOS can design its persistence layer. The key insight is that separating “in-memory format” from “on-disk format” creates unnecessary complexity. Cap’n Proto’s zero-copy serialization already blurs this line – a capnp message in memory has the same byte layout as on disk.
5. Persistent Capabilities
5.1 How Persistent Capabilities Survive Restarts
In EROS/KeyKOS, capabilities survive restarts because they are part of the checkpointed state:
- A capability is stored as a key in a node slot.
- The key contains: (object type, OID, permissions, other metadata).
- During checkpoint, all nodes (including their key slots) are written to disk.
- On restart, nodes are restored. Keys reference objects by OID. Since objects are also restored, the key resolves to the same object.
The critical property: capabilities are named by the persistent identity of their target, not by a volatile address. A key says “page #47293” not “memory address 0x12345.” Since page #47293 is persistent, the key is meaningful across restarts.
5.2 Consistency Model
EROS guarantees checkpoint consistency: the entire system is restored to the state at the last committed checkpoint. This means:
- If process A sent a message to process B, and both the send and receive completed before the checkpoint, both see the message after restart.
- If the send completed but the receive didn’t (checkpoint happened between them), both are rolled back to before the send. The message is lost, but the system is consistent.
- There is no scenario where A thinks it sent a message but B never received it (or vice versa). The checkpoint captures a consistent global snapshot.
This is analogous to database transaction atomicity but applied to the entire system state.
5.3 Volatile State and Capabilities
Some capabilities reference inherently volatile state. EROS handles this through the object re-creation pattern:
Hardware devices:
- Device keys reference hardware registers that don’t survive reboot.
- On restart, the kernel re-initializes device state and re-creates device keys.
- Processes that held device keys get valid keys again (pointing to the re-initialized device), but the device state itself is reset.
- The process’s device driver is responsible for re-initializing the device to the desired state (this is application logic, not kernel logic).
Network connections:
- EROS doesn’t have a native networking stack in the kernel, so this is handled at the application level.
- A network service process re-establishes connections on restart.
- Clients that held capabilities to network endpoints would invoke them, and the network service would transparently reconnect.
- The capability abstraction hides the reconnection – the client’s code doesn’t change.
General pattern:
When a capability references state that can’t survive restart:
- The capability itself persists (it’s in a node slot, checkpointed).
- On restart, invoking the capability may trigger re-initialization.
- The keeper mechanism handles this: the target object’s keeper detects the stale state and re-initializes before completing the call.
- The client is unaware of the restart (or sees a transient error if re-initialization fails).
5.4 The Space Bank Model
Persistent capabilities create a garbage collection problem: when is it safe to reclaim a persistent object? EROS solves this with space banks:
- A space bank is a capability that allocates objects (pages and nodes).
- Every object is allocated from exactly one space bank.
- Space banks can be hierarchical (a bank allocates from a parent bank).
- Destroying a space bank reclaims ALL objects allocated from it.
This provides:
- Bulk deallocation: Terminate a subsystem by destroying its bank.
- Resource accounting: Each bank tracks how much space it has consumed.
- Revocation: Destroying a bank revokes all capabilities to objects allocated from it (the objects cease to exist).
The space bank model avoids the need for a global garbage collector scanning the capability graph. Instead, resource lifetimes are explicitly managed through the bank hierarchy.
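The bulk-reclamation property can be sketched in a few lines of hosted Rust (an illustration only, not EROS or capOS code): every object belongs to exactly one bank, banks nest, and destroying a bank reclaims everything beneath it.

```rust
/// EROS-style space bank: a bank owns objects directly and may own child
/// banks. Destroying a bank reclaims its objects AND, recursively, every
/// object in its child banks -- no graph-wide garbage collector needed.
struct Bank {
    objects: Vec<u64>,   // object IDs allocated directly from this bank
    children: Vec<Bank>, // sub-banks allocated from this bank
}

impl Bank {
    /// The bank's account balance: objects reachable through this bank.
    fn total_objects(&self) -> usize {
        self.objects.len()
            + self.children.iter().map(Bank::total_objects).sum::<usize>()
    }

    /// Bulk deallocation: consuming the bank drops the whole subtree.
    /// Returns how many objects were reclaimed.
    fn destroy(self) -> usize {
        self.total_objects()
        // `self` is dropped here; all objects and child banks cease to exist
    }
}
```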
6. Relevance to capOS
6.1 Cap’n Proto as Persistent Capability Format
EROS stores capabilities as (type, OID, permissions) tuples in fixed-size node slots. capOS can do something analogous but more naturally, because Cap’n Proto already provides a serialization format for structured data:
A persistent capability in capOS could be a capnp struct:
struct PersistentCapRef {
  interfaceId @0 :UInt64;   # which capability interface
  objectId @1 :UInt64;      # persistent object identity
  permissions @2 :UInt32;   # bitmask of allowed methods
  epoch @3 :UInt64;         # revocation epoch (see below)
}
Why this works well with Cap’n Proto:
- Zero-copy persistence: A capnp message in memory has the same byte layout as on disk. No serialization/deserialization step for persistence. This is the closest a modern system can get to EROS’s single-level store without hardware support.
- Schema evolution: Cap’n Proto’s backwards-compatible schema evolution means persistent capability formats can evolve without breaking existing stored capabilities.
- Cross-machine references: The same `PersistentCapRef` can reference a local or remote object. The `objectId` can include a machine/node identifier for distributed capabilities.
- Type safety: The `interfaceId` field provides runtime type checking that EROS’s keys lacked (EROS keys are untyped references; the type is determined by the target object, not the key).
Difference from EROS:
EROS capabilities are kernel objects – the kernel knows about every key and
mediates every invocation. In capOS, PersistentCapRef could be a
user-space construct – a serialized reference that is resolved by the
kernel (or a userspace capability manager) when invoked. This is a
deliberate trade-off: less kernel complexity, more flexibility, but the
kernel must validate references on use rather than at creation time.
6.2 Checkpoint/Restart Patterns for capOS
EROS’s checkpoint model provides several patterns capOS could adopt:
Pattern 1: Application-Level Checkpointing (Recommended as Phase 1)
This is what capOS’s storage proposal already describes: services serialize their own state to the Store capability. This is simpler than EROS’s transparent persistence but requires application cooperation.
Service state → capnp serialize → Store.put(data) → persistent hash
On restart: Store.get(hash) → capnp deserialize → restore state
Advantages over EROS transparent persistence:
- No kernel complexity for checkpointing.
- Services control what is persistent and what is transient.
- No “checkpoint pause” – services choose when to persist.
- Natural fit with Cap’n Proto (state is already capnp).
Disadvantages:
- Every service must implement save/restore logic.
- No automatic consistency across services (each saves independently).
- Programmer error can lead to inconsistent state after restart.
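The Pattern 1 round trip can be sketched in hosted Rust, with an in-memory `HashMap` standing in for the content-addressed Store capability and plain byte encoding standing in for Cap'n Proto. The `Store`, `ServiceState`, `checkpoint`, and `restore` names are illustrative, not capOS APIs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// In-memory stand-in for the content-addressed Store capability.
struct Store {
    blobs: HashMap<u64, Vec<u8>>,
}

impl Store {
    fn new() -> Self { Store { blobs: HashMap::new() } }

    /// put(data) -> persistent hash.
    fn put(&mut self, data: Vec<u8>) -> u64 {
        let mut h = DefaultHasher::new();
        data.hash(&mut h);
        let key = h.finish();
        self.blobs.insert(key, data);
        key
    }

    /// get(hash) -> data, if present.
    fn get(&self, key: u64) -> Option<&[u8]> {
        self.blobs.get(&key).map(|v| v.as_slice())
    }
}

/// A service's state; serialize/deserialize stand in for capnp encode/decode.
#[derive(Debug, PartialEq)]
struct ServiceState {
    counter: u64,
}

impl ServiceState {
    fn serialize(&self) -> Vec<u8> { self.counter.to_le_bytes().to_vec() }
    fn deserialize(bytes: &[u8]) -> Self {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&bytes[..8]);
        ServiceState { counter: u64::from_le_bytes(buf) }
    }
}

/// Service state -> serialize -> Store.put(data) -> persistent hash.
fn checkpoint(store: &mut Store, state: &ServiceState) -> u64 {
    store.put(state.serialize())
}

/// On restart: Store.get(hash) -> deserialize -> restored state.
fn restore(store: &Store, hash: u64) -> Option<ServiceState> {
    store.get(hash).map(ServiceState::deserialize)
}
```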
Pattern 2: Kernel-Assisted Checkpointing (Phase 2)
Add a Checkpoint capability that captures process state:
interface Checkpoint {
  # Save the calling process's state (registers, memory, cap table)
  save @0 () -> (handle :Data);
  # Restore a previously saved state
  restore @1 (handle :Data) -> ();
}
This is analogous to CRIU (Checkpoint/Restore in Userspace) on Linux but capability-mediated:
- The kernel captures the process’s address space, register state, and capability table.
- State is serialized as capnp messages and stored via the Store capability.
- Restore creates a new process from the saved state.
Advantages:
- Transparent to the application (no save/restore logic needed).
- Can capture the full capability graph of a process.
- Enables process migration between machines.
Disadvantages:
- Kernel complexity for state capture.
- Must handle capabilities that reference volatile state (open network connections, device handles).
- Memory overhead for copy-on-write snapshots.
Pattern 3: Consistent Multi-Process Checkpointing (Phase 3)
EROS’s global checkpoint extended to capOS:
- A `CheckpointCoordinator` service initiates a distributed snapshot.
- All participating services freeze, checkpoint their state, then resume.
- The coordinator records a consistent cut across all services.
- Recovery restores all services to the same consistent point.
This requires:
- A coordination protocol (similar to distributed database commit).
- Services must participate in the protocol (register with the coordinator, respond to freeze/checkpoint/resume signals).
- The coordinator must handle failures during the checkpoint itself.
This is the most complex option but provides the strongest consistency guarantees. It’s appropriate for capOS’s later stages when multi-service reliability matters.
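The freeze/checkpoint/resume protocol can be sketched as a single-process simulation. The `Coordinator`, `Service`, and `Phase` types below are illustrative only; a real implementation would exchange these phase transitions as capability invocations between processes.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Phase {
    Running,
    Frozen,
    Checkpointed,
}

struct Service {
    name: &'static str,
    phase: Phase,
    counter: u64,             // stand-in for the service's mutable state
    saved_counter: Option<u64>, // last checkpointed value
}

struct Coordinator {
    services: Vec<Service>,
}

impl Coordinator {
    /// Drive all registered services through freeze -> checkpoint -> resume.
    /// Returns the consistent cut: each service's state at the snapshot.
    fn global_checkpoint(&mut self) -> Vec<(&'static str, u64)> {
        // Phase 1: freeze everyone before anyone checkpoints, so no service
        // can observe another service's post-snapshot state.
        for s in &mut self.services {
            s.phase = Phase::Frozen;
        }
        // Phase 2: each frozen service saves its state.
        let mut cut = Vec::new();
        for s in &mut self.services {
            assert_eq!(s.phase, Phase::Frozen);
            s.saved_counter = Some(s.counter);
            s.phase = Phase::Checkpointed;
            cut.push((s.name, s.counter));
        }
        // Phase 3: resume normal operation.
        for s in &mut self.services {
            s.phase = Phase::Running;
        }
        cut
    }
}
```

Failure handling during the checkpoint itself (the hard part the text mentions) is deliberately omitted here.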
6.3 Capability-Native Filesystem Design
EROS’s model and capOS’s Store proposal can be synthesized into a capability-native filesystem design:
Hybrid approach: Content-Addressed Store + Capability Metadata
capOS’s current Store proposal uses content-addressed storage (hash-based). This is good for immutable data but awkward for capability references (a capability’s target may change without the capability itself changing).
A better model, informed by EROS:
Persistent Object = (ObjectId, Version, CapnpData, CapSlots[])
Where:
- `ObjectId` is a persistent identity (like EROS’s OID).
- `Version` is a monotonic counter (for optimistic concurrency).
- `CapnpData` is the object’s data payload as a capnp message.
- `CapSlots[]` is a list of capability references embedded in the object (like EROS’s node slots).
This separates data from capability references, which is important because:
- Data can be content-addressed (deduplicated by hash).
- Capability references must be identity-addressed (two identical-looking references to different objects are different).
- Revocation operates on capability references, not data.
The Namespace as Directory
capOS’s Namespace capability is the capability-native equivalent of a
directory:
| Unix | EROS | capOS |
|---|---|---|
| Directory (inode + dentries) | Node with keys in slots | Namespace capability |
| Path traversal | Node tree walk | Namespace.resolve() chain |
| Permission bits | Key type + slot permissions | Capability attenuation |
| Hard links | Multiple keys to same object | Multiple refs to same hash |
| Symbolic links | Forwarder keys | Redirect capabilities |
Journaling and Crash Consistency
EROS avoids journaling by using checkpoint-based consistency. capOS’s Store service needs its own consistency story:
Option A: Checkpoint-based (EROS-style)
- Store service maintains an in-memory cache of recent modifications.
- Periodically flushes a consistent snapshot to disk.
- On crash, recovers to last flush point.
- Simple but may lose recent writes.
Option B: Log-structured (modern)
- All writes go to an append-only log.
- A background compaction process builds indexed snapshots from the log.
- On crash, replay the log from the last snapshot.
- More complex but no data loss window.
Option C: Hybrid
- Capability metadata (the namespace bindings) uses a write-ahead log for crash consistency.
- Object data (capnp blobs in the content-addressed store) uses checkpoint-based consistency (losing a few blobs is tolerable; losing a namespace binding is not).
Option C is recommended for capOS: it provides strong consistency for the critical metadata while keeping the data path simple.
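The metadata half of Option C — log every binding mutation before applying it, replay the log on recovery — can be sketched in hosted Rust. The `Namespace` and `LogRecord` types are illustrative, with a `Vec` standing in for the on-disk append-only log.

```rust
use std::collections::HashMap;

/// One write-ahead-log record: bind a name to an object hash, or unbind it.
#[derive(Clone)]
enum LogRecord {
    Bind { name: String, hash: u64 },
    Unbind { name: String },
}

/// Namespace bindings with a write-ahead log: every mutation is appended to
/// the log BEFORE the in-memory map is updated, so a crash at any point can
/// be recovered by replaying the log.
struct Namespace {
    log: Vec<LogRecord>,
    bindings: HashMap<String, u64>,
}

impl Namespace {
    fn new() -> Self {
        Namespace { log: Vec::new(), bindings: HashMap::new() }
    }

    fn bind(&mut self, name: &str, hash: u64) {
        self.log.push(LogRecord::Bind { name: name.into(), hash }); // log first
        self.bindings.insert(name.into(), hash);                    // then apply
    }

    fn unbind(&mut self, name: &str) {
        self.log.push(LogRecord::Unbind { name: name.into() });
        self.bindings.remove(name);
    }

    /// Crash recovery: rebuild the bindings purely from the log.
    fn replay(log: &[LogRecord]) -> HashMap<String, u64> {
        let mut m = HashMap::new();
        for rec in log {
            match rec {
                LogRecord::Bind { name, hash } => { m.insert(name.clone(), *hash); }
                LogRecord::Unbind { name } => { m.remove(name); }
            }
        }
        m
    }
}
```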
6.4 Transparent vs Explicit Persistence: Tradeoffs
| Aspect | EROS Transparent | capOS Explicit | Hybrid |
|---|---|---|---|
| Application complexity | None (automatic) | High (must implement save/restore) | Medium (opt-in transparency) |
| Kernel complexity | Very high (checkpoint, COW, object store) | Low (just IPC and memory) | Medium (checkpoint capability) |
| Consistency | Strong (global checkpoint) | Weak (per-service) | Medium (coordinator) |
| Control | None (everything persists) | Full (choose what to save) | Selective |
| Performance | Checkpoint pauses | No pauses, explicit I/O cost | Configurable |
| Volatile state | Keeper mechanism handles | Service handles reconnection | Annotated capabilities |
| Debuggability | Hard (system is a black box) | Easy (state is explicit capnp) | Medium |
| Cap’n Proto fit | Neutral | Excellent (state = capnp) | Good |
Recommendation for capOS:
Start with explicit persistence (Phase 1 in the storage proposal) because:
- It’s dramatically simpler to implement.
- Cap’n Proto makes serialization nearly free anyway.
- It gives services control over what is persistent.
- It aligns with capOS’s existing Store/Namespace design.
- The kernel stays simple.
Then add opt-in kernel-assisted checkpointing (like the Checkpoint capability described above) for services that want transparent persistence. This gives the benefits of EROS’s model without forcing it on everything.
Never implement EROS’s fully transparent global persistence – the kernel complexity is enormous, the debugging experience is poor, and modern systems (with fast SSDs and capnp zero-copy serialization) don’t need it. The explicit model with good tooling is strictly better for a research OS.
6.5 Capability Revocation in capOS
EROS’s forwarder key model translates directly to capOS:
Epoch-based revocation:
Each capability includes a revocation epoch. The kernel (or capability manager) maintains a per-object epoch counter. When a capability is invoked:
- Check that the capability’s epoch matches the object’s current epoch.
- If it doesn’t match, the capability has been revoked – return an error.
- To revoke all capabilities to an object, increment the object’s epoch.
This is O(1) revocation (increment a counter) with O(1) check per invocation (compare two integers). It’s simpler than EROS’s forwarder mechanism and fits naturally into a capnp-serialized capability reference:
struct CapRef {
  objectId @0 :UInt64;
  epoch @1 :UInt64;        # revocation epoch
  permissions @2 :UInt32;  # method bitmask
  interfaceId @3 :UInt64;  # type of the capability
}
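A hosted-Rust sketch of the epoch check and the O(1) revoke-all (illustrative types only; in the real design this check would sit in the kernel or capability manager, on the invocation path):

```rust
use std::collections::HashMap;

/// Serialized capability reference, mirroring the CapRef schema above.
#[derive(Clone, Copy)]
struct CapRef {
    object_id: u64,
    epoch: u64,
    permissions: u32,
}

/// Per-object epoch counters held by the kernel or capability manager.
struct EpochTable {
    epochs: HashMap<u64, u64>,
}

impl EpochTable {
    /// O(1) validity check: the capability is live iff its epoch matches
    /// the object's current epoch.
    fn check(&self, cap: &CapRef) -> Result<(), &'static str> {
        match self.epochs.get(&cap.object_id) {
            Some(e) if *e == cap.epoch => Ok(()),
            Some(_) => Err("revoked"),
            None => Err("no such object"),
        }
    }

    /// O(1) revocation: bump the object's epoch. Every outstanding CapRef
    /// still carrying the old epoch now fails the check.
    fn revoke_all(&mut self, object_id: u64) {
        if let Some(e) = self.epochs.get_mut(&object_id) {
            *e += 1;
        }
    }
}
```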
Space bank analog:
capOS can implement EROS’s space bank pattern using the Store:
- Each “bank” is a Namespace prefix in the Store.
- Objects allocated by a service are stored under its namespace.
- Destroying the service’s namespace revokes access to all its objects.
- Resource accounting is done by the Store service (track bytes per namespace).
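The namespace-as-bank idea can be sketched with a `BTreeMap` standing in for the Store's key space (illustrative only, not the Store service's actual API): usage is summed per prefix, and deleting the prefix is the bulk-revoke.

```rust
use std::collections::BTreeMap;

/// Store-side resource accounting: objects live under namespace prefixes,
/// and the per-namespace byte count is the space-bank-style account.
struct Store {
    objects: BTreeMap<String, Vec<u8>>,
}

impl Store {
    fn put(&mut self, path: &str, data: Vec<u8>) {
        self.objects.insert(path.into(), data);
    }

    /// Bytes consumed under one namespace prefix (the bank's balance).
    fn usage(&self, prefix: &str) -> usize {
        self.objects
            .iter()
            .filter(|(k, _)| k.starts_with(prefix))
            .map(|(_, v)| v.len())
            .sum()
    }

    /// Destroying the namespace reclaims every object under it, revoking
    /// access in bulk -- the analog of destroying a space bank.
    fn destroy_namespace(&mut self, prefix: &str) {
        self.objects.retain(|k, _| !k.starts_with(prefix));
    }
}
```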
6.6 Summary of Recommendations
| EROS/CapROS/Coyotos Concept | capOS Recommendation |
|---|---|
| Single-level store | Don’t implement (too complex for research OS). Use Cap’n Proto zero-copy as a lightweight equivalent. |
| Checkpoint/restart | Phase 1: application-level (explicit capnp save/restore). Phase 2: Checkpoint capability for opt-in transparent persistence. |
| Persistent capabilities | Use capnp PersistentCapRef struct with objectId + epoch. Store capability graph in the Store service. |
| Capability revocation | Epoch-based revocation (increment counter, check on invocation). Simpler than EROS forwarders, same O(1) cost. |
| Space banks | Map to Store namespaces. Destroying a namespace reclaims all objects. |
| Keeper/fault handler | Map to capOS’s supervisor mechanism (service-architecture proposal). Supervisor receives fault notifications and can restart/repair. |
| GPTs (Coyotos) | Not needed – capOS uses hardware page tables directly. The sparse address-space idea remains relevant for future SharedBuffer/AddressRegion work beyond the current VirtualMemory cap. |
| Confinement | capOS already has the structural prerequisites (no ambient authority). Formal confinement proofs are a future research direction. |
| Device isolation | Already planned in capOS (device capabilities with MMIO/interrupt/DMA grants). CapROS validates this approach works in practice. |
Key References
- Shapiro, J. S., Smith, J. M., Farber, D. J. “EROS: A Fast Capability System.” Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), 1999.
- Shapiro, J. S. “EROS: A Capability System.” PhD dissertation, University of Pennsylvania, 1999.
- Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism.” IEEE Symposium on Security and Privacy, 2000.
- Hardy, N. “The Confused Deputy.” ACM SIGOPS Operating Systems Review, 1988. (Motivates capability-based access control.)
- Hardy, N. “KeyKOS Architecture.” Operating Systems Review, 1985.
- Landau, C. R. “The Checkpoint Mechanism in KeyKOS.” Proceedings of the Second International Workshop on Object Orientation in Operating Systems, 1992.
- Shapiro, J. S. et al. “Coyotos Microkernel Specification.” Technical report, Johns Hopkins University, 2004-2008.
- Shapiro, J. S. et al. “BitC Language Specification.” Technical report, Johns Hopkins University, 2004-2008.
- Dennis, J. B. & Van Horn, E. C. “Programming Semantics for Multiprogrammed Computations.” Communications of the ACM, 1966. (Original capability concept.)
- Levy, H. M. “Capability-Based Computer Systems.” Digital Press, 1984. (Comprehensive survey of capability systems including CAP, Hydra, iAPX 432, StarOS.)
LLVM Target Customization for capOS
Deep research report on creating custom LLVM/Rust/Go targets for a capability-based OS.
Status as of 2026-04-22: capOS still builds kernel and userspace with
x86_64-unknown-none plus linker-script/build flags. A checked-in
x86_64-unknown-capos custom target does not exist yet. Since this report was
first written, PT_TLS parsing, userspace TLS block setup, FS-base
save/restore, the VirtualMemory capability, and a #[thread_local] QEMU
smoke have landed. Thread creation, a user-controlled FS-base syscall, futexes,
a timer capability, and a Go port remain future work.
Table of Contents
- Custom OS Target Triple
- Calling Conventions
- Relocations
- TLS (Thread-Local Storage) Models
- Rust Target Specification
- Go Runtime Requirements
- Relevance to capOS
1. Custom OS Target Triple
Target Triple Format
LLVM target triples follow the format <arch>-<vendor>-<os> or
<arch>-<vendor>-<os>-<env>:
- arch: `x86_64`, `aarch64`, `riscv64gc`, etc.
- vendor: `unknown`, `apple`, `pc`, etc. (often `unknown` for custom OSes)
- os: `linux`, `none`, `redox`, `hermit`, `fuchsia`, etc.
- env (optional): `gnu`, `musl`, `eabi`, etc.
For capOS, the eventual userspace target triple should be
x86_64-unknown-capos. The kernel should keep using a freestanding target
(x86_64-unknown-none) unless a kernel-specific target file becomes useful
for build hygiene.
What LLVM Needs
LLVM’s target description consists of:
- Target machine: Architecture (instruction set, register file, calling conventions). x86_64 already exists in LLVM.
- Object format: ELF, COFF, Mach-O. capOS uses ELF.
- Relocation model: static, PIC, PIE, dynamic-no-pic.
- Code model: small, kernel, medium, large.
- OS-specific ABI details: Stack alignment, calling convention defaults, TLS model, exception handling mechanism.
LLVM does NOT need kernel-level knowledge of your OS. It needs to know how to generate correct object code for the target environment. The OS name in the triple primarily affects:
- Default calling convention selection
- Default relocation model
- TLS model selection
- Object file format and flags
- C library assumptions (relevant for C compilation, less for Rust no_std)
Creating a New OS in LLVM (Upstream Path)
To add capos as a recognized OS in LLVM itself:
- Add the OS to `llvm/include/llvm/TargetParser/Triple.h` (the `OSType` enum)
- Add string parsing in `llvm/lib/TargetParser/Triple.cpp`
- Define ABI defaults in the relevant target (`llvm/lib/Target/X86/`)
- Update Clang’s driver for the new OS (`clang/lib/Driver/ToolChains/`, `clang/lib/Basic/Targets/`)
This is significant upstream work and not necessary initially. The pragmatic path is using Rust’s custom target JSON mechanism (see Section 5).
What Other OSes Do
| OS | LLVM status | Approach |
|---|---|---|
| Redox | Upstream in Rust; no dedicated LLVM OS enum in current LLVM | Full triple x86_64-unknown-redox, Tier 2 in Rust |
| Hermit | Upstream in LLVM and Rust | x86_64-unknown-hermit, Tier 3, unikernel |
| Fuchsia | Upstream in LLVM and Rust | x86_64-unknown-fuchsia, Tier 2 |
| Theseus | Custom target JSON | Uses x86_64-unknown-theseus JSON spec, not upstream |
| Blog OS (phil-opp) | Custom target JSON | Uses JSON target spec, targets x86_64-unknown-none base |
| seL4/Robigalia | Custom target JSON | Modified from x86_64-unknown-none |
Recommendation for capOS: keep the kernel on x86_64-unknown-none.
Introduce a userspace-only custom target JSON when cfg(target_os = "capos")
or toolchain packaging becomes valuable. Do not upstream a capos OS triple
until the userspace ABI is stable.
2. Calling Conventions
LLVM Calling Conventions
LLVM supports numerous calling conventions. The ones relevant to capOS:
| CC | LLVM ID | Description | Relevance |
|---|---|---|---|
| C | 0 | Default C calling convention (System V AMD64 ABI on x86_64) | Primary for interop |
| Fast | 8 | Optimized for internal use, passes in registers | Rust internal use |
| Cold | 9 | Rarely-called functions, callee-save heavy | Error paths |
| GHC | 10 | Glasgow Haskell Compiler, everything in registers | Not relevant |
| HiPE | 11 | Erlang HiPE, similar to GHC | Not relevant |
| WebKit JS | 12 | JavaScript JIT | Not relevant |
| AnyReg | 13 | Dynamic register allocation | JIT compilers |
| PreserveMost | 14 | Caller saves almost nothing | Interrupt handlers |
| PreserveAll | 15 | Caller saves nothing | Context switches |
| Swift | 16 | Swift self/error registers | Not relevant |
| CXX_FAST_TLS | 17 | C++ TLS access optimization | TLS wrappers |
| X86_StdCall | 64 | Windows stdcall | Not relevant |
| X86_FastCall | 65 | Windows fastcall | Not relevant |
| X86_RegCall | 95 | Register-based calling | Performance-critical code |
| X86_INTR | 83 | x86 interrupt handler | IDT handlers |
| Win64 | 79 | Windows x64 calling convention | Not relevant |
System V AMD64 ABI (The Default for capOS)
On x86_64, the System V AMD64 ABI (CC 0, “C”) is the standard:
- Integer args: RDI, RSI, RDX, RCX, R8, R9
- Float args: XMM0-XMM7
- Return: RAX (integer), XMM0 (float)
- Caller-saved: RAX, RCX, RDX, RSI, RDI, R8-R11, XMM0-XMM15
- Callee-saved: RBX, RBP, R12-R15
- Stack alignment: 16-byte at call site
- Red zone: 128 bytes below RSP (unavailable in kernel mode)
capOS already uses this convention – the syscall handler in
kernel/src/arch/x86_64/syscall.rs maps syscall registers to System V
registers before calling syscall_handler.
Customizing for a New OS Target
For a custom OS, calling convention customization is usually minimal:
- Kernel code: Disable the red zone (capOS already does this via `x86_64-unknown-none`, which sets `"disable-redzone": true`). The red zone is unsafe in interrupt/syscall contexts.
- Userspace code: Standard System V ABI is fine. The red zone is safe in userspace.
- Syscall convention: This is an OS design choice, not an LLVM CC. capOS uses RAX=syscall number, RDI-R9=args (matching System V for easy dispatch). Linux uses a slightly different register mapping (R10 instead of RCX for arg4, because SYSCALL clobbers RCX).
- Interrupt handlers: Use `X86_INTR` (CC 83) or manual save/restore. capOS currently uses manual asm stubs.
Cross-Language Interop Implications
| Languages | Convention | Notes |
|---|---|---|
| Rust <-> Rust | Rust ABI (unstable) | Internal to a crate, not stable across crates |
| Rust <-> C | extern "C" (System V) | Stable, well-defined. Used for libcapos API |
| Rust <-> Go | Complex (see Section 6) | Go has its own internal ABI (ABIInternal) |
| C <-> Go | extern "C" via cgo | Go’s cgo bridge, heavy overhead |
| Any <-> Kernel | Syscall convention | Register-based, OS-defined, not a CC |
Key point: The System V AMD64 ABI is the lingua franca. All languages
can produce extern "C" functions. capOS should standardize on System V
for all cross-language boundaries and capability invocations.
Go’s internal ABI (ABIInternal, using R14 as the g register) is different
from System V. Go functions called from outside Go must go through a
trampoline. This is handled by the Go runtime, not something capOS needs
to solve at the LLVM level.
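The boundary itself is easy to state in Rust: an `extern "C"` function has the System V ABI on x86_64 regardless of the caller's language. A small self-contained illustration (the `capos_add` name is hypothetical, not part of libcapos):

```rust
/// A function exported with the C ABI (System V AMD64 on x86_64): callable
/// from C, from Go via cgo, or from any language that can emit a C call.
#[no_mangle]
pub extern "C" fn capos_add(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}

/// A C-ABI function pointer type: both sides of a cross-language boundary
/// agree on this signature, not on either language's internal ABI.
type CFn = extern "C" fn(u64, u64) -> u64;

fn call_through_c_abi(f: CFn, a: u64, b: u64) -> u64 {
    f(a, b)
}
```

The same rule would apply at capability-invocation boundaries: define the entry point as `extern "C"` and every supported language can reach it without knowing Rust's unstable internal ABI.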
3. Relocations
LLVM Relocation Models
| Model | Flag | Description |
|---|---|---|
| static | -relocation-model=static | All addresses resolved at link time. No GOT/PLT. |
| pic | -relocation-model=pic | Position-independent code. Uses GOT for globals, PLT for calls. |
| dynamic-no-pic | -relocation-model=dynamic-no-pic | Like static but with dynamic linking support (macOS legacy). |
| ropi | -relocation-model=ropi | Read-only position-independent (ARM embedded). |
| rwpi | -relocation-model=rwpi | Read-write position-independent (ARM embedded). |
| ropi-rwpi | -relocation-model=ropi-rwpi | Both ROPI and RWPI (ARM embedded). |
Code Models (x86_64)
| Model | Flag | Address Range | Use Case |
|---|---|---|---|
| small | -code-model=small | 0 to 2GB | Userspace default |
| kernel | -code-model=kernel | Top 2GB (negative 32-bit) | Higher-half kernel |
| medium | -code-model=medium | Code in low 2GB, data anywhere | Large data sets |
| large | -code-model=large | No assumptions | Maximum flexibility, worst performance |
What capOS Currently Uses
From .cargo/config.toml:
[target.x86_64-unknown-none]
rustflags = ["-C", "link-arg=-Tkernel/linker-x86_64.ld", "-C", "code-model=kernel", "-C", "relocation-model=static"]
- Kernel: `code-model=kernel` + `relocation-model=static`. Correct for a higher-half kernel at `0xffffffff80000000`. All kernel symbols are in the top 2GB of virtual address space, so 32-bit sign-extended addressing works.
- Init/demos/capos-rt userspace: The standalone userspace crates also target `x86_64-unknown-none`, pass `-Crelocation-model=static`, and select their linker scripts through per-crate `build.rs` files. The binaries are loaded at `0x200000`. The pinned local toolchain (rustc 1.97.0-nightly, LLVM 22.1.2) prints `x86_64-unknown-none` with `llvm-target = "x86_64-unknown-none-elf"`, `code-model = "kernel"`, soft-float ABI, inline stack probes, and static PIE-capable defaults. A future `x86_64-unknown-capos` userspace target should set `code-model = "small"` explicitly instead of inheriting the freestanding kernel-oriented default.
Kernel vs. Userspace Requirements
Kernel:
- Static relocations, kernel code model.
- No PIC overhead needed – the kernel is loaded at a known address.
- The linker script places everything in the higher half.
- This is the correct and standard approach (Linux kernel does the same).
Userspace (current – static binaries):
- Static relocations. A future custom userspace target should choose the small code model explicitly.
- Simple, no runtime relocator needed.
- Binary is loaded at a fixed address (`0x200000`).
- Works well for single-binary-per-address-space.
Userspace (future – if shared libraries or ASLR desired):
- PIE (Position-Independent Executable) = PIC + static linking.
- Requires a dynamic loader or kernel-side relocator.
- Enables ASLR (Address Space Layout Randomization) for security.
- Adds GOT indirection overhead (typically < 5% performance impact).
Position-Independent Code in a Capability Context
PIC/PIE is relevant to capOS for several reasons:
- ASLR: PIE enables loading binaries at random addresses, making ROP attacks harder. Even in a capability system, defense-in-depth matters.
- Shared libraries: If capOS ever supports shared objects (e.g., a shared `libcapos.so`), PIC is required for the shared library.
- WASI/Wasm: Not relevant – Wasm has its own memory model.
- Multiple instances: With static linking, two instances of the same binary can share read-only pages (text, rodata) if loaded at the same address. PIC/PIE allows sharing even at different addresses (copy-on-write for the GOT).
Recommendation for capOS: Keep static relocation for now. Consider PIE for userspace when implementing ASLR (after threading and IPC are stable). The kernel should remain static forever.
4. TLS (Thread-Local Storage) Models
LLVM TLS Models
LLVM supports four TLS models, in order from most dynamic to most constrained:
| Model | Description | Runtime Requirement | Performance |
|---|---|---|---|
| general-dynamic | Any module, any time | Full __tls_get_addr via dynamic linker | Slowest (function call per access) |
| local-dynamic | Same module, any time | __tls_get_addr for module base, then offset | Slow (one call per module per thread) |
| initial-exec | Only modules loaded at startup | GOT slot populated by dynamic linker | Fast (one memory load) |
| local-exec | Main executable only | Direct FS/GS offset, known at link time | Fastest (single instruction) |
How TLS Works on x86_64
On x86_64, TLS is accessed via the FS segment register:
- The OS sets the FS base address for each thread (via `MSR_FS_BASE` or `arch_prctl(ARCH_SET_FS)`).
- TLS variables are accessed as offsets from the FS base:
  - local-exec: `mov %fs:OFFSET, %rax` (offset known at link time)
  - initial-exec: `mov var@gottpoff(%rip), %rax; mov %fs:(%rax), %rdx` (offset loaded from a GOT slot the dynamic linker filled in at startup)
  - general-dynamic: `call __tls_get_addr` (returns a pointer into the thread’s TLS block)
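For a feel of the semantics (not the codegen), std's `thread_local!` macro on a hosted target gives the same per-thread-copy behavior that a capOS `#[thread_local]` static would get from the local-exec model; this is an illustration on a hosted OS, not capOS code.

```rust
use std::cell::Cell;
use std::thread;

// Hosted stand-in for a `#[thread_local]` static. Under local-exec this
// would compile to a single %fs-relative access; here the thread_local!
// macro provides the same each-thread-gets-its-own-copy semantics.
thread_local! {
    static COUNTER: Cell<u64> = Cell::new(0);
}

fn bump() -> u64 {
    COUNTER.with(|c| {
        c.set(c.get() + 1);
        c.get()
    })
}

/// Bump twice on the calling thread, once on a freshly spawned thread.
/// Each thread sees its own independently initialized copy.
fn demo() -> (u64, u64) {
    bump();
    bump(); // calling thread's copy is now 2
    let other = thread::spawn(bump).join().unwrap(); // fresh copy: 1
    (COUNTER.with(|c| c.get()), other)
}
```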
Which Model for capOS?
Kernel:
- The kernel does not use compiler TLS. Current TLS support is for loaded userspace ELF images only.
- For SMP: per-CPU data via the GS segment register (the standard approach). Set `MSR_GS_BASE` on each CPU to point to a `PerCpu` struct; `swapgs` on kernel entry switches between the user and kernel GS base.
- Kernel TLS model: Not applicable (per-CPU data is accessed via GS, not the compiler’s TLS mechanism).
Userspace (static binaries, no dynamic linker):
- local-exec is the only correct choice. There’s no dynamic linker to resolve TLS relocations, so general-dynamic and initial-exec won’t work.
- Implemented for the current single-threaded process model: the ELF parser records `PT_TLS`, the loader maps a Variant II TLS block plus TCB self pointer, and the scheduler saves/restores the FS base on context switch.
- Still missing for future threading and Go: a syscall or capability-authorized operation equivalent to `arch_prctl(ARCH_SET_FS)` so a runtime can set each OS thread’s FS base itself.
Userspace (with dynamic linker, future):
- initial-exec for the main executable and preloaded libraries.
- general-dynamic for `dlopen()`-loaded libraries.
- Requires implementing `__tls_get_addr` in the dynamic linker.
TLS Initialization Sequence
For a statically-linked userspace binary with local-exec TLS:
1. Kernel creates thread
2. Kernel allocates TLS block (size from ELF TLS program header)
3. Kernel copies .tdata (initialized TLS) into TLS block
4. Kernel zeros .tbss (uninitialized TLS) in TLS block
5. Kernel sets FS base = TLS block address (writes MSR_FS_BASE)
6. Thread starts executing; %fs:OFFSET accesses TLS directly
The ELF file contains two TLS sections:
- `.tdata` (PT_TLS segment, initialized thread-local data)
- `.tbss` (zero-initialized thread-local data, like `.bss` but per-thread)
The PT_TLS program header tells the loader:
- Virtual address and file offset of `.tdata`
- `p_memsz` = total TLS size (including `.tbss`)
- `p_filesz` = size of `.tdata` only
- `p_align` = required alignment
FS/GS Base Register Usage Plan
| Register | Used By | Purpose |
|---|---|---|
| FS | Userspace threads | Thread-local storage (set per-thread by kernel) |
| GS | Kernel (via swapgs) | Per-CPU data (set per-CPU during boot) |
This is the standard Linux convention and what Go expects (Go uses
arch_prctl(ARCH_SET_FS) to set the FS base for each OS thread).
What capOS Has and Still Needs
- Implemented: parse `PT_TLS` in `capos-lib/src/elf.rs`.
- Implemented: allocate/map a TLS block during process image load in `kernel/src/spawn.rs`.
- Implemented: copy `.tdata`, zero `.tbss`, and write the TCB self pointer for the current Variant II static TLS layout.
- Implemented: save/restore FS base through `kernel/src/sched.rs` and `kernel/src/arch/x86_64/tls.rs`.
- Still needed: an `arch_prctl(ARCH_SET_FS)` equivalent for Go `settls()` and future multi-threaded userspace.
5. Rust Target Specification
How Custom Targets Work
Rust supports custom targets via JSON specification files. The workflow:
- Create a `<target-name>.json` file
- Pass it to rustc: `--target path/to/x86_64-unknown-capos.json`
- Use with cargo via `-Zbuild-std` to build core/alloc/std from source
Target lookup priority:
- Built-in target names
- File path (if the target string contains `/` or `.json`)
- `RUST_TARGET_PATH` environment variable directories
The Rust target JSON schema is explicitly unstable. Generate examples from the
pinned compiler with rustc -Z unstable-options --print target-spec-json and
validate against that same compiler’s target-spec-json-schema before checking
in a target file.
Viewing Existing Specs
# Print the JSON spec for a built-in target:
rustc +nightly -Z unstable-options --target=x86_64-unknown-none --print target-spec-json
# Print the JSON schema for all available fields:
rustc +nightly -Z unstable-options --print target-spec-json-schema
Example: x86_64-unknown-capos Kernel Target
Based on the current x86_64-unknown-none target, with capOS-specific
adjustments. This is a sketch; regenerate from the pinned rustc schema before
using it.
{
"llvm-target": "x86_64-unknown-none-elf",
"metadata": {
"description": "capOS kernel (x86_64)",
"tier": 3,
"host_tools": false,
"std": false
},
"data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
"arch": "x86_64",
"cpu": "x86-64",
"target-endian": "little",
"target-pointer-width": 64,
"target-c-int-width": 32,
"os": "none",
"env": "",
"vendor": "unknown",
"linker-flavor": "gnu-lld",
"linker": "rust-lld",
"pre-link-args": {
"gnu-lld": ["-Tkernel/linker-x86_64.ld"]
},
"features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float",
"disable-redzone": true,
"panic-strategy": "abort",
"code-model": "kernel",
"relocation-model": "static",
"rustc-abi": "softfloat",
"executables": true,
"exe-suffix": "",
"has-thread-local": false,
"position-independent-executables": false,
"static-position-independent-executables": false,
"plt-by-default": false,
"max-atomic-width": 64,
"stack-probes": { "kind": "inline" }
}
Example: x86_64-unknown-capos Userspace Target
{
"llvm-target": "x86_64-unknown-none-elf",
"metadata": {
"description": "capOS userspace (x86_64)",
"tier": 3,
"host_tools": false,
"std": false
},
"data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
"arch": "x86_64",
"cpu": "x86-64",
"target-endian": "little",
"target-pointer-width": 64,
"target-c-int-width": 32,
"os": "capos",
"env": "",
"vendor": "unknown",
"linker-flavor": "gnu-lld",
"linker": "rust-lld",
"pre-link-args": {
"gnu-lld": ["-Tinit/linker.ld"]
},
"features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float",
"disable-redzone": false,
"panic-strategy": "abort",
"code-model": "small",
"relocation-model": "static",
"rustc-abi": "softfloat",
"executables": true,
"exe-suffix": "",
"has-thread-local": true,
"position-independent-executables": false,
"static-position-independent-executables": false,
"max-atomic-width": 64,
"plt-by-default": false,
"stack-probes": { "kind": "inline" },
"tls-model": "local-exec"
}
Key JSON Fields
| Field | Purpose | Typical Values |
|---|---|---|
| `llvm-target` | LLVM triple for code generation | `x86_64-unknown-none-elf` (reuse existing backend) |
| `os` | OS name (affects `cfg(target_os = "...")`) | `"none"`, `"capos"`, `"linux"` |
| `arch` | Architecture name | `"x86_64"`, `"aarch64"` |
| `data-layout` | LLVM data layout string | Copy from same-arch target |
| `linker-flavor` | Which linker to use | `"gnu-lld"`, `"gcc"`, `"msvc"` |
| `linker` | Linker binary | `"rust-lld"`, `"ld.lld"` |
| `features` | CPU features to enable/disable | Disable SIMD/FPU until context switching saves that state |
| `disable-redzone` | Disable System V red zone | `true` for kernel, `false` for userspace |
| `code-model` | LLVM code model | `"kernel"`, `"small"` |
| `relocation-model` | LLVM relocation model | `"static"`, `"pic"` |
| `panic-strategy` | How to handle panics | `"abort"`, `"unwind"` |
| `has-thread-local` | Enable `#[thread_local]` | `true` for userspace now that PT_TLS/FS base works |
| `tls-model` | Default TLS model | `"local-exec"` for static binaries |
| `max-atomic-width` | Largest atomic type (bits) | `64` for x86_64 |
| `pre-link-args` | Arguments passed to the linker before user args | Linker script path |
| `position-independent-executables` | Generate PIE by default | `false` for now |
| `exe-suffix` | Executable file extension | `""` for ELF |
| `stack-probes` | Stack overflow detection mechanism | `{"kind": "inline"}` in the current freestanding x86_64 spec |
no_std vs std Support Path
Current state: capOS uses no_std + alloc. This works with any
target, including x86_64-unknown-none.
Path to std support (what Redox, Hermit, and Fuchsia did):
- Phase 1: Custom target with os: "capos" (current report). Use -Zbuild-std=core,alloc to build core and alloc. No std.
- Phase 2: Add capOS to Rust’s std library. This requires:
  - Adding mod capos under library/std/src/sys/ with OS-specific implementations of: filesystem, networking, threads, time, stdio, process spawning, etc.
  - Each of these maps to capOS capabilities.
  - Using cfg(target_os = "capos") throughout std.
  - Building with -Zbuild-std=std.
- Phase 3: Upstream the target (optional). Submit the target spec and std implementations to the Rust project. Requires sustained maintenance.
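For illustration, Phase 1 already buys compile-time OS gating without feature flags. A minimal sketch, assuming the custom target sets os = "capos"; platform_name is a made-up example, not a capOS API:

```rust
// Hypothetical example of cfg(target_os = "capos") gating enabled by the
// custom target JSON. On a host or x86_64-unknown-none build, only the
// fallback arm compiles.
#[cfg(target_os = "capos")]
pub fn platform_name() -> &'static str {
    "capos" // built only with x86_64-unknown-capos
}

#[cfg(not(target_os = "capos"))]
pub fn platform_name() -> &'static str {
    "host" // fallback for every other target
}
```

The same mechanism lets shared crates pick a capOS-specific backend (e.g. console output via a capability) at compile time.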
What Redox did: Redox implemented a full POSIX-like userspace (relibc)
and added std support by implementing the sys module in terms of relibc
syscalls. This made Redox a Tier 2 target with pre-built std artifacts.
What Hermit did: Hermit is a unikernel, so std is implemented directly in terms of Hermit’s kernel-level APIs. Tier 3, community maintained.
What Fuchsia did: Fuchsia implemented std using Fuchsia’s native
zircon syscalls (handles, channels, VMOs – similar in spirit to
capabilities). Tier 2.
Recommendation for capOS: Stay on no_std + alloc with the custom
target JSON. std support is a large effort that should wait until the
syscall surface is stable and threading works. When the time comes, Fuchsia’s
approach (std over native capability syscalls) is the best model, since
Fuchsia’s handle-based API is conceptually close to capOS’s capabilities.
Other OS Projects Reference
| OS | Target | Tier | std | Approach |
|---|---|---|---|---|
| Redox | x86_64-unknown-redox | 2 | Yes | relibc (custom libc) over Redox syscalls |
| Hermit | x86_64-unknown-hermit | 3 | Yes | std directly over kernel API |
| Fuchsia | x86_64-unknown-fuchsia | 2 | Yes | std over zircon handles (capability-like) |
| Theseus | x86_64-unknown-theseus | N/A | No | Custom JSON, no_std, research OS |
| Blog OS | Custom JSON | N/A | No | Based on x86_64-unknown-none |
| MOROS | Custom JSON | N/A | No | Simple hobby OS |
6. Go Runtime Requirements
Go’s Runtime Architecture
Go’s runtime is essentially a userspace operating system. It manages goroutine scheduling, garbage collection, memory allocation, and I/O multiplexing. The runtime interfaces with the actual OS through a narrow set of functions that each GOOS must implement.
Minimum OS Interface for a Go Port
Based on analysis of runtime/os_linux.go, runtime/os_plan9.go, and
runtime/os_js.go, here is the minimum interface:
Tier 1: Absolute Minimum (single-threaded, like GOOS=js)
These functions are needed for “Hello, World!”:
func osinit() // OS initialization
func write1(fd uintptr, p unsafe.Pointer, n int32) int32 // stdout/stderr output
func exit(code int32) // process termination
func usleep(usec uint32) // sleep (can be no-op initially)
func readRandom(r []byte) int // random data (for maps, etc.)
func goenvs() // environment variables
func mpreinit(mp *m) // pre-init new M on parent thread
func minit() // init new M on its own thread
func unminit() // undo minit
func mdestroy(mp *m) // destroy M resources
Plus memory management (in runtime/mem_*.go):
func sysAllocOS(n uintptr) unsafe.Pointer // allocate memory (mmap)
func sysFreeOS(v unsafe.Pointer, n uintptr) // free memory (munmap)
func sysReserveOS(v unsafe.Pointer, n uintptr) unsafe.Pointer // reserve VA range
func sysMapOS(v unsafe.Pointer, n uintptr) // commit reserved pages
func sysUsedOS(v unsafe.Pointer, n uintptr) // mark as used
func sysUnusedOS(v unsafe.Pointer, n uintptr) // mark as unused (madvise)
func sysFaultOS(v unsafe.Pointer, n uintptr) // remove access
func sysHugePageOS(v unsafe.Pointer, n uintptr) // hint: use huge pages
Tier 2: Multi-threaded (real goroutines)
func newosproc(mp *m) // create OS thread (clone)
func exitThread(wait *atomic.Uint32) // exit current thread
func futexsleep(addr *uint32, val uint32, ns int64) // futex wait
func futexwakeup(addr *uint32, cnt uint32) // futex wake
func settls() // set FS base for TLS
func nanotime1() int64 // monotonic nanosecond clock
func walltime() (sec int64, nsec int32) // wall clock time
func osyield() // sched_yield
Tier 3: Full Runtime (signals, profiling, network poller)
func sigaction(sig uint32, new *sigactiont, old *sigactiont)
func signalM(mp *m, sig int) // send signal to thread
func setitimer(mode int32, new *itimerval, old *itimerval)
func netpollopen(fd uintptr, pd *pollDesc) uintptr
func netpoll(delta int64) (gList, int32)
func netpollBreak()
Linux Syscalls Used by Go Runtime (Complete List)
From runtime/sys_linux_amd64.s:
| Syscall | # | Go Wrapper | capOS Equivalent |
|---|---|---|---|
| read | 0 | runtime.read | Store cap |
| write | 1 | runtime.write1 | Console cap |
| close | 3 | runtime.closefd | Cap drop |
| mmap | 9 | runtime.sysMmap | VirtualMemory cap |
| munmap | 11 | runtime.sysMunmap | VirtualMemory.unmap |
| brk | 12 | runtime.sbrk0 | VirtualMemory cap |
| rt_sigaction | 13 | runtime.rt_sigaction | Signal cap (future) |
| rt_sigprocmask | 14 | runtime.rtsigprocmask | Signal cap (future) |
| sched_yield | 24 | runtime.osyield | sys_yield |
| mincore | 27 | runtime.mincore | VirtualMemory.query |
| madvise | 28 | runtime.madvise | Future VirtualMemory decommit/query semantics, or unmap/remap policy |
| nanosleep | 35 | runtime.usleep | Timer cap |
| setitimer | 38 | runtime.setitimer | Timer cap |
| getpid | 39 | runtime.getpid | Process info |
| clone | 56 | runtime.clone | Thread cap |
| exit | 60 | runtime.exit | sys_exit |
| sigaltstack | 131 | runtime.sigaltstack | Not needed initially |
| arch_prctl | 158 | runtime.settls | sys_arch_prctl (set FS base) |
| gettid | 186 | runtime.gettid | Thread info |
| futex | 202 | runtime.futex | sys_futex |
| sched_getaffinity | 204 | runtime.sched_getaffinity | CPU info |
| timer_create | 222 | runtime.timer_create | Timer cap |
| timer_settime | 223 | runtime.timer_settime | Timer cap |
| timer_delete | 226 | runtime.timer_delete | Timer cap |
| clock_gettime | 228 | runtime.nanotime1 | Timer cap |
| exit_group | 231 | runtime.exit | sys_exit |
| tgkill | 234 | runtime.tgkill | Thread signal (future) |
| openat | 257 | runtime.open | Namespace cap |
| pipe2 | 293 | runtime.pipe2 | IPC cap |
Go’s TLS Model
Go uses arch_prctl(ARCH_SET_FS, addr) to set the FS segment base for
each OS thread. The convention:
- FS base points to the thread’s m.tls array
- The goroutine pointer g is stored at -8(FS) (ELF TLS convention)
- In Go’s ABIInternal, R14 caches the g register for performance
- On signal entry or thread start, g is loaded from TLS into R14
Go does NOT use the compiler’s TLS mechanisms (no __thread or
thread_local!). It manages TLS entirely in its own runtime via the FS
register.
For capOS, this means the kernel needs:
- An arch_prctl(ARCH_SET_FS)-equivalent syscall
- The kernel must save/restore FS base on context switch
- Each thread’s FS base must be independently settable
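The bookkeeping these three requirements imply can be sketched host-testably by mocking the MSR write; ThreadState, Cpu, and the method names are illustrative, and a real kernel would issue wrmsr (or wrfsbase) in privileged code instead of recording the value:

```rust
/// x86_64 MSR number for the FS segment base.
const IA32_FS_BASE: u32 = 0xC000_0100;

/// Per-thread saved FS base (sketch; the real kernel keeps this in its TCB).
struct ThreadState {
    fs_base: u64,
}

/// Stand-in for the hardware MSR so the logic is testable on a host.
struct Cpu {
    current_fs_base: u64,
}

impl Cpu {
    fn write_msr(&mut self, msr: u32, value: u64) {
        // Real kernel: unsafe { wrmsr(msr, value) }; here we just record it.
        assert_eq!(msr, IA32_FS_BASE);
        self.current_fs_base = value;
    }

    /// arch_prctl(ARCH_SET_FS)-equivalent: persist the base and apply it now.
    fn set_fs(&mut self, t: &mut ThreadState, base: u64) {
        t.fs_base = base; // remembered for future context switches
        self.write_msr(IA32_FS_BASE, base);
    }

    /// Context switch: restore the incoming thread's FS base.
    fn switch_to(&mut self, next: &ThreadState) {
        self.write_msr(IA32_FS_BASE, next.fs_base);
    }
}
```

Because each thread carries its own saved base, Go's per-M settls() and Rust's per-process static TLS both reduce to the same save/restore path.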
Adding GOOS=capos to Go
Files that need to be created/modified in a Go fork:
src/runtime/
os_capos.go // osinit, newosproc, futexsleep, etc.
os_capos_amd64.go // arch-specific OS functions
sys_capos_amd64.s // syscall wrappers in assembly
mem_capos.go // sysAlloc/sysFree/etc. over VirtualMemory cap
signal_capos.go // signal stubs (no real signals initially)
stubs_capos.go // misc stubs
netpoll_capos.go // network poller (stub initially)
defs_capos.go // OS-level constants
vdso_capos.go // VDSO stubs (no VDSO)
src/syscall/
syscall_capos.go // Go's syscall package
zsyscall_capos_amd64.go
src/internal/platform/
(modifications to supported.go, zosarch.go)
src/cmd/dist/
(modifications to add capOS to known OS list)
Estimated: ~2000-3000 lines for Phase 1 (single-threaded).
Feasibility Assessment
| Feature | Difficulty | Blocked On |
|---|---|---|
| Hello World (write + exit) | Easy | Console capability plus exit syscall |
| Memory allocator (mmap) | Medium | VirtualMemory capability exists; Go glue and any missing query/decommit semantics remain |
| Single-threaded goroutines (M=1) | Medium | VirtualMemory cap + timer |
| Multi-threaded (real threads) | Hard | Kernel thread support, futex, runtime-controlled FS base |
| Network poller | Hard | Async cap invocation, networking stack |
| Signal-based preemption | Hard | Signal delivery mechanism |
| Full stdlib | Very Hard | POSIX layer or native cap wrappers |
7. Relevance to capOS
Practical Scope of Work
Phase 1: Custom Target JSON (Low effort, high value)
What: Create a userspace x86_64-unknown-capos.json target spec. Keep
the kernel on x86_64-unknown-none unless a kernel JSON proves useful.
Why: Replaces the current approach of using x86_64-unknown-none with
rustflags overrides. Makes the build cleaner, enables cfg(target_os = "capos")
for conditional compilation, and is the foundation for everything else.
Effort: 1-2 hours for an initial file, plus recurring maintenance because Rust target JSON fields are not stable.
Blockers: None. Not required for the current no_std runtime path.
Phase 2: TLS Support (mostly landed, required for Go)
What: Parse PT_TLS from ELF, allocate per-thread TLS blocks, set FS base
on context switch, add arch_prctl-equivalent syscall.
Why: Required for Go runtime (Go’s settls() sets FS base), for Rust
#[thread_local] in userspace, and for C’s __thread.
Current state: PT_TLS parsing, static TLS mapping, FS-base context-switch
state, and a Rust #[thread_local] smoke are implemented. Remaining work is
the runtime-controlled FS-base operation and the thread model that makes it
per-thread rather than per-process.
Blockers: Thread support for the multi-threaded case.
Phase 3: VirtualMemory Capability (implemented baseline, required for Go)
What: Implement the VirtualMemory capability interface. The current schema has map, unmap, and protect; Go may need decommit/query semantics later.
Why: Go’s memory allocator (sysAlloc, sysReserve, sysMap, etc.)
needs mmap-like functionality. This is the single biggest kernel-side
requirement for Go.
Current state: VirtualMemoryCap implements map/unmap/protect over the
existing page-table code with ownership tracking and quota checks. Go-specific
work still has to map runtime sysAlloc/sysReserve/sysMap expectations
onto that interface.
Blockers: None for the baseline capability; timer/futex/threading still block useful Go.
Phase 4: Futex Operation (Low-medium effort, required for Go threading)
What: Implement futex(WAIT) and futex(WAKE) as a fast
capability-authorized kernel operation.
Why: Go’s runtime synchronization (lock_futex.go) is built on futexes.
The entire goroutine scheduler depends on futex-based sleeping.
Effort: ~100-200 lines for the first private-futex path. A wait queue keyed by address-space + userspace address is enough initially.
Blockers: Futex wait-queue design and, for full Go threading, the thread scheduler.
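The first-cut design named above — a wait queue keyed by address space plus userspace address — can be sketched as follows. All names are illustrative, and a real kernel must read the user word under the table lock rather than take a caller-supplied snapshot as this host-testable version does:

```rust
use std::collections::{HashMap, VecDeque};

type Asid = u64; // address-space id
type Tid = u64;  // thread id

/// Sketch of a private-futex wait table keyed by (address space, user vaddr).
#[derive(Default)]
struct FutexTable {
    queues: HashMap<(Asid, u64), VecDeque<Tid>>,
}

impl FutexTable {
    /// FUTEX_WAIT: enqueue `tid` if the word still holds `expected`.
    /// Returns true if enqueued (caller must then deschedule the thread);
    /// false signals the EAGAIN-style value-changed race.
    fn wait(&mut self, asid: Asid, addr: u64, current: u32, expected: u32, tid: Tid) -> bool {
        if current != expected {
            return false;
        }
        self.queues.entry((asid, addr)).or_default().push_back(tid);
        true
    }

    /// FUTEX_WAKE: pop up to `count` waiters in FIFO order; returns them.
    fn wake(&mut self, asid: Asid, addr: u64, count: usize) -> Vec<Tid> {
        let key = (asid, addr);
        let mut woken = Vec::new();
        let emptied = match self.queues.get_mut(&key) {
            Some(q) => {
                while woken.len() < count {
                    match q.pop_front() {
                        Some(t) => woken.push(t),
                        None => break,
                    }
                }
                q.is_empty()
            }
            None => false,
        };
        if emptied {
            self.queues.remove(&key); // drop empty buckets to bound table growth
        }
        woken
    }
}
```

The FIFO queue matches the fairness Go's runtime expects; requeue and timeout support can be layered on later without changing the key structure.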
Phase 5: Kernel Threading (High effort, required for Go GOMAXPROCS>1)
What: Multiple threads per process sharing address space and cap table.
Why: Go’s newosproc() creates OS threads via clone(). Without real
threads, Go is limited to GOMAXPROCS=1.
Effort: ~500-800 lines. Major scheduler extension.
Blockers: Scheduler, per-CPU data, SMP support.
Biggest Blockers for Go
In priority order after the 2026-04-22 TLS and VirtualMemory work:
1. Timer / monotonic clock – Go’s scheduler needs nanotime() for goroutine scheduling decisions. Without a timer, Go cannot preempt goroutines or manage timeouts.
2. Runtime-controlled FS base – Go calls arch_prctl(ARCH_SET_FS) on every new thread. capOS can load static ELF TLS today, but Go still needs a way to set the runtime’s own TLS base.
3. Futex – Go’s M:N scheduler depends on futex for sleeping/waking OS threads. Without futex, Go falls back to spin-waiting (wasteful) or simply cannot block.
4. Thread creation – Required for GOMAXPROCS > 1. Phase 1 Go can work single-threaded.
5. Go runtime port glue – map sysAlloc/write1/exit/random/env/time to capOS runtime and capabilities.
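The monotonic-clock blocker largely reduces to a tick-to-nanosecond conversion over whatever counter the kernel exposes. A sketch assuming an invariant TSC whose frequency was discovered at boot (tsc_to_ns is an illustrative name, not a capOS API); the split computation avoids u64 overflow for realistic frequencies:

```rust
/// Convert a raw TSC reading to nanoseconds given the TSC frequency in Hz.
/// Splitting into whole seconds plus remainder keeps the intermediate
/// multiplication in range for tsc_hz up to ~18 GHz.
fn tsc_to_ns(tsc: u64, tsc_hz: u64) -> u64 {
    let secs = tsc / tsc_hz;
    let rem = tsc % tsc_hz; // rem < tsc_hz, so rem * 1e9 fits in u64
    secs * 1_000_000_000 + rem * 1_000_000_000 / tsc_hz
}
```

Go's nanotime1() only requires that the result be monotonic and reasonably cheap, so a calibrated TSC read plus this conversion is a plausible first implementation before a full Timer capability exists.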
Biggest Blockers for C
C is much simpler than Go:
- Linker and toolchain setup – Need a cross-compilation toolchain targeting capOS (Clang with the custom target, or GCC cross-compiler).
- libcapos.a with C headers – a Rust library exposing an extern "C" API.
- musl integration (optional) – For full libc, replace musl’s __syscall() with capability invocations.
Recommended Implementation Order
1. Custom userspace target JSON [optional build hygiene]
|
2. VirtualMemory capability [done: baseline map/unmap/protect]
|
3. TLS support (PT_TLS, FS base) [done for static ELF processes]
|
4. Futex authority cap + measured ABI [extends scheduler]
|
5. Timer capability (monotonic clock) [extends PIT/HPET driver]
|
6. Go Phase 1: minimal GOOS=capos [single-threaded, M=1]
|
7. Kernel threading [major scheduler work]
|
8. Go Phase 2: multi-threaded [GOMAXPROCS>1, concurrent GC]
|
9. C toolchain + libcapos [parallel with Go work]
|
10. Go Phase 3: network poller [depends on networking stack]
Steps 1-5 are kernel prerequisites. Step 6 is the Go fork. Steps 7-10 are incremental improvements that can proceed in parallel.
Key Architectural Decisions for capOS
- Keep x86_64-unknown-none for the kernel, x86_64-unknown-capos for userspace. The kernel does not benefit from a custom OS target (it’s freestanding). Userspace benefits from cfg(target_os = "capos").
- Use the local-exec TLS model for static binaries. No dynamic linker means no general-dynamic or initial-exec TLS. local-exec is zero-overhead.
- Implement FS base save/restore early. Both Go and Rust #[thread_local] need it. It’s a small addition to the context-switch code.
- VirtualMemory cap stays on the Go critical path. The baseline exists; the Go port still needs exact runtime allocator semantics and any missing query/decommit behavior.
- Futex is the synchronization primitive. Both Go and any future pthreads implementation need futex. Keep authority capability-based, but measure whether the hot path should use a compact transport operation rather than generic Cap’n Proto method dispatch.
- Signals can be deferred. Go can start with cooperative-only preemption (no SIGURG). Signal delivery is complex and can come much later.
Cap’n Proto Error Handling: Research Notes
Research on how Cap’n Proto handles errors at the protocol, schema, and Rust crate levels. Used as input for the capOS error handling proposal.
1. Protocol-Level Exception Model (rpc.capnp)
The Cap’n Proto RPC protocol defines an Exception struct used in three
positions: Message.abort, Return.exception, and Resolve.exception.
struct Exception {
reason @0 :Text;
type @3 :Type;
enum Type {
failed @0; # deterministic bug/invalid input; retrying won't help
overloaded @1; # temporary lack of resources; retry with backoff
disconnected @2; # connection to necessary capability was lost
unimplemented @3; # server doesn't implement the method
}
obsoleteIsCallersFault @1 :Bool;
obsoleteDurability @2 :UInt16;
trace @4 :Text; # stack trace from the remote server
}
The four exception types describe client response strategy, not error semantics:
| Type | Client response |
|---|---|
| failed | Log and propagate. Don’t retry. |
| overloaded | Retry with exponential backoff. |
| disconnected | Re-establish connection, retry. |
| unimplemented | Fall back to alternative methods. |
2. Rust capnp Crate (v0.25.x)
Core error types
pub type Result<T> = ::core::result::Result<T, Error>;

#[derive(Debug, Clone)]
pub struct Error {
    pub kind: ErrorKind,
    pub extra: String, // human-readable description (requires `alloc`)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[non_exhaustive]
pub enum ErrorKind {
    // Four RPC-mapped kinds (match Exception.Type)
    Failed,
    Overloaded,
    Disconnected,
    Unimplemented,
    // Wire format validation errors (~40 more variants)
    BufferNotLargeEnough,
    EmptyBuffer,
    MessageContainsOutOfBoundsPointer,
    MessageIsTooDeeplyNested,
    ReadLimitExceeded,
    TextContainsNonUtf8Data(core::str::Utf8Error),
    // ... etc
}
Constructor functions: Error::failed(s), Error::overloaded(s),
Error::disconnected(s), Error::unimplemented(s).
The NotInSchema(u16) type handles unknown enum values or union
discriminants.
std::io::Error mapping
When the std feature is enabled, From<std::io::Error> maps:
- TimedOut -> Overloaded
- BrokenPipe / ConnectionRefused / ConnectionReset / ConnectionAborted / NotConnected -> Disconnected
- UnexpectedEof -> PrematureEndOfFile
- Everything else -> Failed
3. capnp-rpc Rust Crate Error Mapping
Bidirectional conversion between wire Exception and capnp::Error:
Sending (Error -> Exception):
fn from_error(error: &Error, mut builder: exception::Builder) {
    let typ = match error.kind {
        ErrorKind::Failed => exception::Type::Failed,
        ErrorKind::Overloaded => exception::Type::Overloaded,
        ErrorKind::Disconnected => exception::Type::Disconnected,
        ErrorKind::Unimplemented => exception::Type::Unimplemented,
        _ => exception::Type::Failed, // all validation errors -> Failed
    };
    builder.set_type(typ);
    builder.set_reason(&error.extra);
}
Receiving (Exception -> Error):
Maps exception::Type back to ErrorKind, preserving the reason string.
Server traits return Promise<(), capnp::Error>. Client gets
Promise<Response<Results>, capnp::Error>.
4. Cap’n Proto Error Handling Philosophy
From KJ library documentation and Kenton Varda:
“KJ exceptions are meant to express unrecoverable problems or logistical problems orthogonal to the API semantics; they are NOT intended to be used as part of your API semantics.”
“In the Cap’n Proto world, ‘checked exceptions’ (where an interface explicitly defines the exceptions it throws) do NOT make sense.”
Exceptions: infrastructure failures (network down, bug, overload). Application errors: should be modeled in the schema return types.
5. Schema Design Patterns for Application Errors
Generic Result pattern
struct Error {
code @0 :UInt16;
message @1 :Text;
}
struct Result(Ok) {
union {
ok @0 :Ok;
err @1 :Error;
}
}
interface MyService {
doThing @0 (input :Text) -> (result :Result(Text));
}
Constraint: generic type parameters bind only to pointer types (Text,
Data, structs, lists, interfaces), not primitives (UInt32, Bool). So
Result(UInt64) doesn’t work – need a wrapper struct.
Per-method result unions
interface FileSystem {
open @0 (path :Text) -> (result :OpenResult);
}
struct OpenResult {
union {
file @0 :File;
notFound @1 :Void;
permissionDenied @2 :Void;
error @3 :Text;
}
}
Unions must be embedded in structs (no free-standing unions). This allows adding new fields later without breaking compatibility.
6. How Other Cap’n Proto Systems Handle Errors
Sandstorm
Uses the exception mechanism for infrastructure errors. Capabilities report
errors through disconnection. The grain.capnp schema does not define
explicit error types. util.capnp documents errors as “It will throw an
exception if any error occurs.”
Cloudflare Workers (workerd)
Uses Cap’n Proto for internal RPC. JavaScript Error.message and
Error.name are preserved across RPC; stack traces and custom properties
are stripped. Does not model errors in capnp schema – relies on exception
propagation.
OCapN (Open Capability Network)
Adopted the same four-kind exception model for cross-system compatibility. Diagnostic information is non-normative. Security concern: exception objects may leak sensitive information (stack traces, paths) at CapTP boundaries.
Kenton Varda expressed reservations about unimplemented (ambiguity about
whether the direct method or callees failed) and disconnected (requires
catching at specific stack frames for meaningful retry).
7. Relevance to capOS
capOS uses the capnp crate but not capnp-rpc. Manual dispatch goes through
CapObject::call() with caller-provided params/result buffers. Current error
handling:
- capnp::Error::failed() for semantic errors
- capnp::Error::unimplemented() for unknown methods
- ? for deserialization errors (these naturally produce capnp::Error)
- Transport errors become CQE status codes.
- Kernel-produced CapException values are serialized into result buffers for capability-level failures and decoded by capos-rt.
The capnp::Error type carries the information needed for CapException:
kind maps to ExceptionType, and extra maps to message.
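That mapping can be written down directly. In this sketch, ExceptionType stands in for the capOS proposal's type and the abbreviated ErrorKind enum stands in for the real capnp one; the collapse of all validation kinds into Failed mirrors what capnp-rpc does on the wire:

```rust
/// Stand-in for the capOS CapException type field (illustrative, not the
/// real schema-generated enum).
#[derive(Debug, PartialEq)]
enum ExceptionType {
    Failed,
    Overloaded,
    Disconnected,
    Unimplemented,
}

/// Abbreviated stand-in for capnp::ErrorKind.
#[allow(dead_code)]
enum ErrorKind {
    Failed,
    Overloaded,
    Disconnected,
    Unimplemented,
    ReadLimitExceeded,
    EmptyBuffer,
    // ... ~40 validation variants in the real crate
}

/// Map an ErrorKind onto the four wire-level exception types. Everything
/// that is not explicitly retryable/fallback-able collapses to Failed.
fn to_exception_type(kind: &ErrorKind) -> ExceptionType {
    match kind {
        ErrorKind::Overloaded => ExceptionType::Overloaded,
        ErrorKind::Disconnected => ExceptionType::Disconnected,
        ErrorKind::Unimplemented => ExceptionType::Unimplemented,
        _ => ExceptionType::Failed, // includes all wire-validation kinds
    }
}
```

The extra string would travel alongside as the CapException message, so no information from capnp::Error is lost at the boundary.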
Sources
- Cap’n Proto RPC Protocol: https://capnproto.org/rpc.html
- Cap’n Proto C++ RPC: https://capnproto.org/cxxrpc.html
- Cap’n Proto Schema Language: https://capnproto.org/language.html
- Cap’n Proto FAQ: https://capnproto.org/faq.html
- KJ exception.h: https://github.com/capnproto/capnproto/blob/master/c%2B%2B/src/kj/exception.h
- rpc.capnp schema: https://github.com/capnproto/capnproto/blob/master/c%2B%2B/src/capnp/rpc.capnp
- OCapN error handling discussion: https://github.com/ocapn/ocapn/issues/10
- Cap’n Proto usage patterns: https://github.com/capnproto/capnproto/discussions/1849
- capnp-rpc Rust crate: https://crates.io/crates/capnp-rpc
- Cloudflare Workers RPC errors: https://developers.cloudflare.com/workers/runtime-apis/rpc/error-handling/
- Sandstorm util.capnp: https://docs.rs/crate/sandstorm/0.0.5/source/schema/util.capnp
OS Error Handling in Capability Systems: Research Notes
Research on error handling patterns in capability-based and microkernel operating systems. Used as input for the capOS error handling proposal.
1. seL4
Error Codes
seL4 defines 11 kernel error codes in errors.h:
typedef enum {
seL4_NoError = 0,
seL4_InvalidArgument = 1,
seL4_InvalidCapability = 2,
seL4_IllegalOperation = 3,
seL4_RangeError = 4,
seL4_AlignmentError = 5,
seL4_FailedLookup = 6,
seL4_TruncatedMessage = 7,
seL4_DeleteFirst = 8,
seL4_RevokeFirst = 9,
seL4_NotEnoughMemory = 10,
} seL4_Error;
Error Return Mechanism
- Capability invocations (kernel object operations) return seL4_Error directly.
- IPC messages use seL4_MessageInfo_t with label, length, extraCaps, capsUnwrapped. The label is copied unmodified – the kernel doesn’t interpret it.
- MR0 (Message Register 0) carries return codes for kernel object invocations via seL4_Call.
Error Propagation
Fault handler mechanism: each TCB has a fault endpoint capability. On fault (capability fault, VM fault, etc.):
- Kernel blocks the faulting thread.
- Kernel sends an IPC to the fault endpoint with fault-type-specific fields.
- Fault handler (separate process) receives, fixes, and replies.
- Kernel resumes the faulting thread.
Design Choices
- seL4_NBSend on an invalid capability: silently fails (prevents covert channels).
- seL4_Send/seL4_Call on an invalid capability: returns seL4_FailedLookup.
- No application-level error convention – user servers choose their own protocol.
- Partial capability transfer: if some caps in a multi-cap transfer fail, already-transferred caps succeed; extraCaps reflects the successful count.
Sources
- seL4 errors.h: https://github.com/seL4/seL4/blob/master/libsel4/include/sel4/errors.h
- seL4 IPC tutorial: https://docs.sel4.systems/Tutorials/ipc.html
- seL4 fault handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
- seL4 API reference: https://docs.sel4.systems/projects/sel4/api-doc.html
2. Fuchsia / Zircon
zx_status_t
Signed 32-bit integer. Negative = error, ZX_OK (0) = success.
Categories:
| Category | Examples |
|---|---|
| General | ZX_ERR_INTERNAL, ZX_ERR_NOT_SUPPORTED, ZX_ERR_NO_RESOURCES, ZX_ERR_NO_MEMORY |
| Parameter | ZX_ERR_INVALID_ARGS, ZX_ERR_WRONG_TYPE, ZX_ERR_BAD_HANDLE, ZX_ERR_BUFFER_TOO_SMALL |
| State | ZX_ERR_BAD_STATE, ZX_ERR_NOT_FOUND, ZX_ERR_TIMED_OUT, ZX_ERR_ALREADY_EXISTS, ZX_ERR_PEER_CLOSED |
| Permission | ZX_ERR_ACCESS_DENIED |
| I/O | ZX_ERR_IO, ZX_ERR_IO_REFUSED, ZX_ERR_IO_DATA_INTEGRITY, ZX_ERR_IO_DATA_LOSS |
FIDL Error Handling (Three Layers)
Layer 1: Transport errors. Channel broke. Currently all transport-level
FIDL errors close the channel. Client observes ZX_ERR_PEER_CLOSED.
Layer 2: Epitaphs (RFC-0053). Server sends a special final message
before closing a channel, explaining why. Wire format: ordinal 0xFFFFFFFF,
error status in the reserved uint32 of the FIDL message header. After
sending, server closes the channel.
Layer 3: Application errors (RFC-0060). Methods declare error types:
Method() -> (string result) error int32;
Serialized as:
union MethodReturn {
MethodResult result;
int32 err;
};
Error types constrained to int32, uint32, or an enum thereof. Deliberately
no standard error enum – each service defines its own error domain.
Rationale: standard error enums “try to capture more detail than we think is
appropriate.”
C++ binding: zx::result<T> (specialization of fit::result<zx_status_t, T>).
Sources
- Zircon errors: https://fuchsia.dev/fuchsia-src/concepts/kernel/errors
- RFC-0060 error handling: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0060_error_handling
- RFC-0053 epitaphs: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0053_epitaphs
3. EROS / KeyKOS / Coyotos
KeyKOS Invocation Message Format
KC (Key, Order_code)
STRUCTFROM(arg_structure)
KEYSFROM(arg_key_slots)
STRUCTTO(reply_structure)
KEYSTO(reply_key_slots)
RCTO(return_code_variable)
- Order code: small integer selecting the operation (method selector).
- Return code: integer returned by the invoked object via
RCTO. - Data string: bulk data parameter (up to ~4KB).
- Keys: up to 4 capability parameters in each direction.
Invocation Primitives
- CALL: send + block for reply. Kernel synthesizes a resume key (capability to resume caller) as 4th key parameter to callee.
- RETURN: reply using a resume key + go back to waiting.
- FORK: send and continue (fire-and-forget).
Keeper Error Handling
Every domain has a domain keeper slot. On hardware trap (illegal instruction, divide-by-zero, protection fault):
- Kernel invokes the keeper as if the domain had issued a CALL.
- Keeper receives fault information in the message.
- Keeper can fix and resume (via resume key) or terminate.
- A non-zero return code from a key invocation triggers the keeper mechanism.
Coyotos (EROS Successor) – Formalized Error Model
Cleanly separates invocation-level vs application-level exceptions:
Invocation-level (before the target processes the message):
MalformedSyscall, InvalidAddress, AccessViolation,
DataAccessTypeError, CapAccessTypeError, MalformedSpace,
MisalignedReference
Application-level: signaled via OPR0.ex flag bit in the reply control
word. If set, remaining parameter words contain a 64-bit exception code
plus optional info.
Sources
- KeyKOS architecture: https://dl.acm.org/doi/pdf/10.1145/858336.858337
- Coyotos spec: https://hydra-www.ietfng.org/capbib/cache/shapiro:coyotosspec.html
- EROS (SOSP 1999): https://sites.cs.ucsb.edu/~chris/teaching/cs290/doc/eros-sosp99.pdf
4. Plan 9 / 9P
9P2000 Rerror Format
size[4] Rerror tag[2] ename[s]
- ename[s]: variable-length UTF-8 string describing the error.
- No Terror message – only servers send errors.
- String-based, not numeric. Conventional strings (“permission denied”, “file not found”) but no fixed taxonomy.
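For concreteness, the Rerror layout decodes in a few lines. This is an illustrative parser (parse_rerror is not taken from any 9P implementation), assuming the standard little-endian 9P framing and the conventional Rerror message type value 107:

```rust
/// Decode a 9P2000 Rerror message: size[4] type[1] tag[2] ename[s],
/// where s is a 2-byte length followed by UTF-8 bytes.
/// Returns (tag, ename) on success, None on any framing error.
fn parse_rerror(buf: &[u8]) -> Option<(u16, String)> {
    const RERROR: u8 = 107; // 9P message type for Rerror

    // size[4] counts the whole message, including the size field itself.
    let size = u32::from_le_bytes(buf.get(0..4)?.try_into().ok()?) as usize;
    if buf.len() < size || buf.get(4) != Some(&RERROR) {
        return None;
    }
    let tag = u16::from_le_bytes(buf.get(5..7)?.try_into().ok()?);
    let len = u16::from_le_bytes(buf.get(7..9)?.try_into().ok()?) as usize;
    let ename = core::str::from_utf8(buf.get(9..9 + len)?).ok()?;
    Some((tag, ename.to_string()))
}
```

Note how little structure there is to validate: once framing checks pass, the error payload is just a string, which is exactly the design trade-off discussed below.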
9P2000.u Extension (Unix compatibility)
size[4] Rerror tag[2] ename[s] errno[4]
Adds a 4-byte Unix errno as a hint. Clients should prefer the string.
ERRUNDEF sentinel when Unix errno doesn’t apply.
Design Rationale
Avoids “errno fragmentation” where different Unix variants assign different numbers to the same condition. The string is authoritative; the number is an optimization for Unix-compatibility clients.
Sources
- 9P2000 RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.html
- 9P2000.u RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.u.html
5. Genode
RPC Exception Propagation
GENODE_RPC_THROW(func_type, ret_type, func_name,
GENODE_TYPE_LIST(Exception1, Exception2, ...),
arg_type...)
Only the exception type crosses the boundary – exception objects (fields,
messages) are not transferred. Server encodes a numeric Rpc_exception_code,
client reconstructs a default-constructed exception of the matching type.
Undeclared exceptions: undefined behavior (server crash or hung RPC).
Infrastructure-Level Errors
- RPC_INVALID_OPCODE: dispatched operation code doesn’t match.
- Rpc_exception_code: integral type, computed as RPC_EXCEPTION_BASE - index_in_exception_list.
- Ipc_error: kernel IPC failure (server unreachable).
- Server death: capabilities become invalid; subsequent invocations produce Ipc_error.
Sources
- Genode RPC: https://genode.org/documentation/genode-foundations/20.05/functional_specification/Remote_procedure_calls.html
- Genode IPC: https://genode.org/documentation/genode-foundations/23.05/architecture/Inter-component_communication.html
6. Cross-System Comparison: Transport vs Application Errors
Every capability/microkernel IPC system separates two failure modes:
-
Transport errors – the invocation mechanism failed before the target processed the request (bad handle, insufficient rights, target dead, malformed message, timeout).
-
Application errors – the service processed the request and returned a meaningful error (not found, resource exhausted, invalid operation).
| System | Transport errors | Application errors |
|---|---|---|
| seL4 | seL4_Error (11 values) from syscall | IPC message payload (user-defined) |
| Zircon | zx_status_t (~30 values) from syscall | FIDL per-method error type |
| EROS/Coyotos | Invocation exceptions (kernel) | OPR0.ex flag + code in reply |
| Plan 9 | Connection loss | Rerror with string |
| Genode | Ipc_error + RPC_INVALID_OPCODE | C++ exceptions via GENODE_RPC_THROW |
| Cap’n Proto RPC | disconnected/unimplemented | failed/overloaded or schema types |
Common pattern: small kernel error code set for transport + typed service-specific errors for application.
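That common pattern maps naturally onto nested results in Rust. A sketch with illustrative names (TransportError, OpenError, and Invoke are not capOS types): the outer Result is the small fixed transport layer, the inner one is the per-interface application error.

```rust
/// Transport layer: the invocation mechanism failed before the target
/// processed the request. Small, kernel-defined, interface-agnostic.
#[allow(dead_code)]
#[derive(Debug)]
enum TransportError {
    BadHandle,
    TargetGone,
    Malformed,
}

/// Application layer: one service's typed error domain (per-interface).
#[allow(dead_code)]
#[derive(Debug)]
enum OpenError {
    NotFound,
    PermissionDenied,
}

/// The two layers never share a namespace.
type Invoke<T, E> = Result<Result<T, E>, TransportError>;

/// Illustrative dispatch: the caller's recovery strategy depends only on
/// which layer produced the error.
fn classify(r: &Invoke<u64, OpenError>) -> &'static str {
    match r {
        Err(_) => "transport failure: reconnect or drop the capability",
        Ok(Err(_)) => "service answered with a typed application error",
        Ok(Ok(_)) => "success",
    }
}
```

The type system enforces the distinction that POSIX errno collapses (see the next section): a caller cannot confuse a dead server with a missing file.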
7. POSIX errno: Strengths and Weaknesses for Capability Systems
Strengths
- Simple (single integer, zero overhead on success).
- Universal (every Unix developer knows it).
- Low overhead (no allocation on error path).
Weaknesses for Capability Systems
- Ambient authority assumption: EACCES/EPERM assume ACL-style access control. In capability systems, having the capability IS the permission.
- Global flat namespace: all errors share one integer space. Capability systems have typed interfaces; errors should be scoped per-interface.
- No structured information: just an integer, no “which argument” or “how much memory needed.”
- Thread-local state: clobbered by intermediate calls, breaks down with async IPC or promise pipelining.
- No transport/application distinction: EBADF (transport) and ENOENT (application) in the same space.
- Not composable across trust boundaries: the callee’s errno is meaningless in the caller’s address space without explicit serialization.
No capability system uses a POSIX-style global errno namespace.
IX-on-capOS Hosting Research
Research note on using IX as a package corpus and content-addressed build model for a more mature capOS system. It explains what IX provides, why it is useful for capOS, and how to extract the most value from it without importing CPython/POSIX assumptions as an architectural dependency.
What IX Is
IX is a source-based package/build system. It describes packages as templates, expands those templates into build descriptors and shell scripts, fetches and verifies source inputs, executes dependency-ordered builds, stores outputs in a content-addressed store, and publishes usable package environments through realm mappings.
For capOS, IX should be treated as three separable assets:
- a package corpus with thousands of package definitions and accumulated build knowledge;
- a content-addressed build/store model that already fits reproducible artifact management;
- a compact Python control plane that can be adapted once authority-bearing operations move behind capOS services.
IX should not be treated as a requirement to reproduce Unix inside capOS. Its current implementation uses CPython, Jinja2, subprocesses, shell tools, filesystem paths, symlinks, hardlinks, signals, and process groups because it runs on Unix-like hosts today. Those are implementation assumptions, not the part worth preserving unchanged.
Why IX Is Useful for capOS
capOS needs a credible path from isolated demos to a useful userspace closure. IX is useful because it supplies a package/build corpus and model that can exercise the exact system boundaries capOS needs to grow:
- process spawning with explicit argv, env, cwd, stdio, and exit status;
- fetch, archive extraction, and content verification as auditable services;
- Store and Namespace capabilities instead of ambient global filesystem authority;
- build sandboxing with explicit input, scratch, output, network, and resource policies;
- static-tool bootstrapping before a full dynamic POSIX environment exists;
- differential testing against the existing host IX implementation.
The main value is leverage. IX can give capOS real package metadata, real build scripts, and real toolchain pressure without making CPython or a broad POSIX personality the first required userspace milestone.
Best Way to Get the Most from IX
The optimal strategy is to preserve IX’s package corpus and build semantics while replacing the Unix-shaped execution boundary with capability-native services.
The high-value path is:
- Run upstream IX on the host first to build and validate early capOS artifacts.
- Use CPython/Jinja2 on the host as a reference oracle, not as the in-system foundation.
- Render IX templates through a Rust ix-template component that implements the subset IX actually uses.
- Run the adapted IX planner/control plane on native MicroPython once capOS has enough runtime support.
- Move fetch, extract, build, Store commit, Namespace publish, and process lifecycle into typed capOS services.
This gets most of IX’s value: package knowledge, reproducible build structure, and a practical self-hosting path. It avoids the lowest-value part: spending early capOS effort on a large CPython/POSIX compatibility layer just to preserve upstream implementation details.
Position
CPython is not an architectural prerequisite for IX-on-capOS.
It is a compatibility shortcut for running upstream IX with minimal changes. For a clean capOS-native integration, the better design is:
- keep IX’s package corpus and content-addressed build model;
- adapt IX’s Python control-plane code instead of preserving every CPython and POSIX assumption;
- run the adapted control plane on a native MicroPython port;
- move build execution, fetching, archive extraction, store mutation, and sandboxing into typed capOS services;
- render IX templates through a Rust template service or tightly scoped IX template engine, not full Jinja2 on MicroPython;
- keep CPython on the host as a differential test oracle and bootstrap tool, not as a required foundation layer for capOS.
MicroPython is a credible sweet spot only with that boundary. It is not a credible sweet spot if the requirement is “make upstream Jinja2, subprocess, fcntl, process groups, and Unix filesystem behavior all work inside MicroPython.”
Sources Inspected
- Upstream IX repository: https://github.com/pg83/ix
- IX package guide: PKGS.md
- IX core: core/
- IX templates: pkgs/die/
- Bundled IX template deps: deps/jinja-3.1.6/, deps/markupsafe-3.0.3/
- MicroPython library docs: https://docs.micropython.org/en/latest/library/index.html
- MicroPython CPython-difference docs: https://docs.micropython.org/en/latest/genrst/
- MicroPython porting docs: https://docs.micropython.org/en/latest/develop/index.html
- Jinja docs: https://jinja.palletsprojects.com/en/latest/intro/
- MiniJinja docs: https://docs.rs/minijinja/latest/minijinja/
Upstream IX Shape
IX is a source-based, content-addressed package/build system. Package
definitions are Jinja templates under pkgs/, mostly named ix.sh, and the
template hierarchy under pkgs/die/ expands those package descriptions into
JSON descriptors and shell build scripts.
The inspected clone has:
- 3788 package ix.sh files;
- 66 files under pkgs/die;
- a template chain centered on base.json, ix.json, script.json, sh0.sh, sh1.sh, sh2.sh, sh.sh, base.sh, std/ix.sh, and language/build-system templates for C, Rust, Go, Python, CMake, Meson, Ninja, WAF, GN, Kconfig, and shell-only generated packages.
The IX template surface is broad but not arbitrary Jinja. In the package tree surveyed, the Jinja tags used were:
| Tag | Count |
|---|---|
| block | 14358 |
| endblock | 14360 |
| extends | 3808 |
| if / endif | 451 / 451 |
| include | 344 |
| else | 123 |
| set / endset | 52 / 52 |
| for / endfor | 49 / 49 |
| elif | 23 |
No macro, import, from, with, filter, raw, or call tags were
found in the inspected tree. That matters: IX’s template needs are probably a
finite subset around inheritance, blocks, self.block(), super(), includes,
conditionals, loops, assignments, expressions, and custom filters.
IX’s own Jinja wrapper is small. core/j2.py defines:
- custom loader with //root handling;
- include inlining;
- filters such as b64e, b64d, jd, jl, group_by, basename, dirname, ser, des, lines, eval, defined, field, pad, add, preproc, parse_urls, parse_list, list_to_json, and fjoin.
That makes the template layer replaceable. The risk is not “Jinja is impossible.” The risk is “full upstream Jinja2 drags in a CPython-shaped runtime just to implement a template subset IX mostly uses in a disciplined way.”
Current IX Runtime Surface
The IX Python core uses ordinary host-scripting features:
- os, os.path, json, hashlib, base64, random, string, functools, itertools, platform, getpass;
- shutil.which, shutil.rmtree, shutil.move;
- subprocess.run, check_call, check_output;
- os.execvpe, os.kill, os.setpgrp, signal.signal;
- fcntl.fcntl to reset stdout flags;
- asyncio for graph scheduling;
- multiprocessing.cpu_count;
- contextvars fallback support for asyncio.to_thread;
- tarfile, zipfile;
- ssl, urllib3, usually only to suppress certificate warnings while fetchers are shell-driven;
- os.symlink, os.link, os.rename, os.makedirs, open, and file tests.
core/execute.py is the important boundary. It schedules a DAG, prepares
output directories, calls shell commands with environment variables and stdin,
checks output touch files, and kills the process group on failure.
core/cmd_misc.py and core/shell_cmd.py cover fetch, extraction, hash
checking, archive unpacking, and hardlinking fetched inputs.
core/realm.py maps build outputs into realm names using symlinks and metadata
under /ix/realm.
core/ops.py selects an execution mode. Today the modes are local, system,
fake, and molot. A capOS executor mode is the correct integration point.
CPython Path
CPython is the obvious route for upstream compatibility:
- upstream Jinja2 is designed for modern Python and uses normal CPython-style standard library facilities;
- IX’s current Python code assumes subprocess, asyncio, fcntl, shutil, archive modules, and process semantics;
- CPython plus libcapos-posix would let a large fraction of that code run with limited changes.
That does not make CPython the right product dependency for IX-on-capOS. CPython pulls in a large libc/POSIX surface and encourages preserving Unix process and filesystem assumptions that capOS should make explicit through capabilities.
CPython should be used in two places:
- Host-side bootstrap and reference evaluation.
- Optional compatibility mode once libcapos-posix is mature.
It should not be the required path for a clean IX-capOS integration.
If CPython is needed later, capOS has two routes:
- Native CPython through musl plus libcapos-posix.
- CPython compiled to WASI and run through a native WASI runtime.
The native POSIX route is the only route that makes sense for IX-style build
workloads. It needs fd tables, path lookup, read/write/close/lseek, directory
iteration, rename/unlink/mkdir, time, memory mapping, posix_spawn, pipes,
exit status, and eventually sockets. That is the same compatibility work
needed for shell tools and build systems, so it should arrive as part of the
general userspace-compatibility track, not as an IX-specific dependency.
The WASI route is useful for sandboxed or compute-heavy Python, but it is a poor fit for IX package builds because IX fundamentally drives external tools, filesystem trees, fetchers, and process lifecycles. WASI CPython can be useful as a script sandbox, not as the main IX appliance runtime.
MicroPython Path
MicroPython is attractive because capOS needs an embeddable system scripting runtime before it needs a full desktop Python environment.
The upstream docs frame MicroPython as a Python implementation with a smaller,
configurable library set. The latest library docs list micro versions of
modules relevant to IX, including asyncio, gzip, hashlib, json, os,
platform, random, re, select, socket, ssl, struct, sys, time,
zlib, and _thread, while warning that most standard modules are subsets
and that port builds may include only part of the documented surface.
That is a good fit for capOS. It means a capOS port can expose a deliberately chosen OS surface instead of pretending to be Linux.
MicroPython should host:
- package graph traversal;
- package metadata parsing;
- target/config normalization;
- dependency expansion;
- high-level policy;
- command graph generation;
- calls into capOS-native services.
MicroPython should not own:
- generic subprocess emulation;
- shell execution internals;
- process groups or Unix signals;
- TLS/network fetching;
- archive formats beyond small helper cases;
- hardlink/symlink implementation;
- content store mutation;
- build sandboxing;
- parallel job scheduling when it needs kernel-visible resource control.
Those belong in capOS services.
Native MicroPython Port Shape
A capOS MicroPython port should be a new MicroPython platform port, not the Unix port with a large compatibility shim underneath.
The port should provide:
- VM startup through capos-rt;
- heap allocation from a fixed initial heap first, then VirtualMemory when growth is available;
- stdin/stdout/stderr backed by granted stream or Console capabilities;
- module import from a read-only Namespace plus frozen modules;
- a small VFS adapter over Store/Namespace for scripts and package metadata;
- native C/Rust extension modules for capOS capabilities;
- deterministic error mapping from capability exceptions to Python exceptions.
The initial built-in surface should be deliberately small:
- sys with argv/path/modules;
- os path and file operations backed by a granted namespace;
- time backed by a clock capability;
- hashlib, json, binascii/base64, random, struct;
- optional asyncio if the planner keeps Python-level concurrency;
- no general-purpose subprocess until the service boundary proves it is necessary.
For IX, the MicroPython port should ship frozen planner modules and native
bindings to ix-template, BuildCoordinator, Store, Namespace, Fetcher,
and Archive. That keeps the trusted scripting surface small and avoids
import-time dependency drift.
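The deterministic error mapping mentioned above could take roughly this shape. The numeric codes and exception names here are invented for illustration; they are not a defined capOS ABI:

```python
# Hypothetical numeric error codes returned by a native capos module,
# mapped deterministically onto a small typed exception hierarchy.
class CapError(Exception):
    """Base class for capability-invocation failures."""

class CapRevoked(CapError):
    pass

class CapDenied(CapError):
    pass

class CapBudgetExhausted(CapError):
    pass

_ERRMAP = {1: CapRevoked, 2: CapDenied, 3: CapBudgetExhausted}

def raise_for_status(code: int, detail: str = "") -> None:
    """Translate a capability error code into a typed Python exception."""
    if code == 0:
        return
    raise _ERRMAP.get(code, CapError)(detail or f"cap error {code}")
```

Keeping the mapping in one table makes the planner's error behavior auditable and stable across port revisions.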
Jinja2 and MicroPython
Full Jinja2 compatibility on MicroPython remains unproven and is probably not the optimal target.
Current Jinja docs say Jinja supports Python 3.10 and newer, depends on
MarkupSafe, and compiles templates to optimized Python code. The bundled IX
Jinja tree imports modules such as typing, weakref, importlib,
contextlib, inspect, ast, types, collections, itertools, io, and
MarkupSafe. Some of these can be ported or stubbed, but that is a CPython
compatibility project, not a small MicroPython extension.
The better path is to treat IX’s template language as an input format and render it with a capOS-native component.
Recommended template strategy:
- Build an ix-template Rust component using MiniJinja or a smaller IX-specific template subset.
- Register IX’s custom filters from core/j2.py.
- Implement IX’s loader semantics: //package-root paths, relative includes, and cached sources.
- Reject unsupported Jinja constructs with deterministic errors.
- Keep CPython/Jinja2 as a host-side oracle for differential testing until the capOS renderer matches the package corpus.
MiniJinja is a practical candidate because it is Rust-native, based on Jinja2
syntax/behavior, supports custom filters and dynamic objects, and has feature
flags for trimming unused template features. IX needs multi-template support
because it uses extends, include, and block.
If MiniJinja compatibility is insufficient, the fallback is not CPython by
default. The fallback is an IX-template subset evaluator that implements the
constructs actually used by pkgs/.
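The "reject unsupported constructs with deterministic errors" step can be prototyped in a few lines of host-side Python. A sketch, assuming the tag subset reported by the survey earlier in this page:

```python
import re

# Statement tags the inspected pkgs/ tree actually uses.
SUPPORTED_TAGS = {"block", "endblock", "extends", "if", "endif", "elif",
                  "else", "include", "set", "endset", "for", "endfor"}
TAG_RE = re.compile(r"{%-?\s*(\w+)")

class UnsupportedTag(ValueError):
    def __init__(self, tag: str, offset: int):
        super().__init__(f"unsupported Jinja tag {tag!r} at byte {offset}")
        self.tag = tag
        self.offset = offset

def check_template(source: str) -> None:
    """Deterministically reject templates outside the supported subset
    before any rendering happens."""
    for m in TAG_RE.finditer(source):
        if m.group(1) not in SUPPORTED_TAGS:
            raise UnsupportedTag(m.group(1), m.start())
```

Failing with the tag name and byte offset, rather than at some later render step, keeps renderer divergence diagnosable across the whole package corpus.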
Optimal Architecture
The clean design is an IX-capOS build appliance, not a Unix personality layer that happens to run IX.
flowchart TD
CLI[ix CLI or build request] --> Planner[ix planner on MicroPython]
Planner --> Template[ix-template renderer]
Planner --> Graph[normalized build graph]
Template --> Graph
Graph --> Coordinator[capOS BuildCoordinator service]
Coordinator --> Fetcher[Fetcher service]
Coordinator --> Extractor[Archive service]
Coordinator --> Store[Store service]
Coordinator --> Sandbox[BuildSandbox service]
Fetcher --> Store
Extractor --> Store
Sandbox --> Proc[ProcessSpawner]
Sandbox --> Scratch[writable scratch namespace]
Sandbox --> Inputs[read-only input namespaces]
Proc --> Tools[sh, make, cc, cargo, go, coreutils]
Sandbox --> Output[write-once output namespace]
Output --> Store
Store --> Realm[Namespace snapshot / realm publish]
The planner remains small and scriptable. The authority-bearing work happens in services:
- BuildCoordinator: owns graph execution and job state.
- Store: content-addressed objects and output commits.
- Namespace: names, realms, snapshots, and package environments.
- Fetcher: network-capable source acquisition with explicit TLS and cache policy.
- Archive: deterministic extraction and path-safety checks.
- BuildSandbox: constructs per-build capability sets.
- ProcessSpawner: starts shell/tools with controlled argv, env, cwd, stdio, and granted capabilities.
- Toolchain packages: statically linked tools built externally first, then eventually by IX itself.
The adapted IX planner should call service APIs instead of shelling out for operations that are native capOS concepts.
Control-Plane Boundary
MicroPython should see a narrow, high-level API. It should not synthesize Unix from first principles.
Example shape:
import ixcapos
import ixtemplate
pkg = ixcapos.load_package("bin/minised")
desc = ixtemplate.render_package(pkg.name, pkg.context)
graph = ixcapos.plan(desc, target="x86_64-unknown-capos")
result = ixcapos.build(graph)
ixcapos.publish_realm("dev", result.outputs)
The Python layer can still look like IX. The implementation behind it should be capability-native.
Service API Sketch
The exact schema should follow the project schema style, but this is the shape of the boundary:
interface BuildCoordinator {
plan @0 (package :Text, target :Text, options :BuildOptions)
-> (graph :BuildGraph);
build @1 (graph :BuildGraph) -> (result :BuildResult);
publish @2 (realm :Text, outputs :List(OutputRef))
-> (namespace :Namespace);
}
interface BuildSandbox {
run @0 (command :Command, inputs :List(Namespace),
scratch :Namespace, output :Namespace, policy :SandboxPolicy)
-> (status :ExitStatus, log :BlobRef);
}
interface Fetcher {
fetch @0 (url :Text, sha256 :Data, policy :FetchPolicy)
-> (blob :BlobRef);
}
interface Archive {
extract @0 (archive :BlobRef, policy :ExtractPolicy)
-> (tree :Namespace);
}
Important policy fields:
- network allowed or denied;
- wall-clock and CPU budgets;
- maximum output bytes;
- allowed executable namespaces;
- allowed output path policy;
- whether timestamps are normalized;
- whether symlinks are preserved, rejected, or translated;
- whether hardlinks become store references or copied files.
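Sketched as Python dataclasses for illustration only; every field name here is hypothetical, and the real schema would be Cap'n Proto structs in the project schema style:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractPolicy:
    normalize_timestamps: bool = True        # reproducibility default
    preserve_symlinks: bool = False          # reject or translate instead
    hardlinks_as_store_refs: bool = True     # never raw hardlinks

@dataclass(frozen=True)
class SandboxPolicy:
    allow_network: bool = False              # deny by default
    wall_clock_budget_s: int = 3600
    cpu_budget_s: int = 3600
    max_output_bytes: int = 1 << 32
    executable_namespaces: tuple = ()        # which tool trees may run
```

The deny-by-default values mirror the capability posture: a build gets network, time, and output budget only when policy grants them explicitly.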
Store and Realm Mapping
IX’s /ix/store maps well to capOS Store.
IX’s realms should not be literal symlink trees in capOS. They should be named Namespace snapshots:
| IX concept | capOS mapping |
|---|---|
| /ix/store/<uid>-name | Store object/tree with stable content hash and metadata |
| build output dir | write-once output namespace |
| build temp dir | scratch namespace with cleanup policy |
| realm | named Namespace snapshot |
| symlink from realm to output | Namespace binding or bind manifest |
| hardlinked source cache | Store reference or copy-on-write blob binding |
| touch output sentinel | build-result metadata, optionally synthetic file for compatibility |
This preserves IX’s reproducibility model without importing global Unix authority.
Process and Filesystem Requirements
A mature capOS needs these primitives before IX builds can run natively:
- ProcessSpawner and ProcessHandle;
- argv/env/cwd/stdin/stdout/stderr passing;
- exit status;
- pipes or stream capabilities;
- fd-table support in the POSIX layer for ported tools;
- read-only input namespaces;
- writable scratch namespaces;
- write-once output namespaces;
- directory listing, create, rename, unlink, and metadata;
- symlink translation or explicit rejection policy;
- hardlink translation or store-reference fallback;
- monotonic time;
- resource limits;
- cancellation.
For package builds, the tool surface is larger than IX’s Python surface:
- sh;
- find, sed, grep, awk, sort, xargs, install, cp, mv, rm, ln, chmod, touch, cat;
- tar, gzip, xz, zstd, zip, unzip;
- make, cmake, ninja, meson, pkg-config;
- C compiler/linker/archive tools;
- cargo and Rust toolchains;
- Go toolchain;
- Python only for packages that build with Python.
IX’s static-linking bias helps because the early tool closure can be imported as statically linked binaries.
What to Patch Out of IX
For a clean capOS fit, patch or replace these upstream assumptions:
| Upstream assumption | capOS replacement |
|---|---|
| subprocess.run everywhere | BuildSandbox.run() or ProcessSpawner |
| process groups and SIGKILL | ProcessHandle.killTree() or sandbox cancellation |
| fcntl stdout flag reset | remove or make no-op |
| chrt, nice | scheduler/resource policy on sandbox |
| sudo, su, chown | no permission-bit authority; use capability grants |
| unshare, tmpfs, jail | BuildSandbox with explicit caps |
| /ix/store global path | Store capability plus namespace mount view |
| /ix/realm symlink tree | Namespace snapshot/publish |
| hardlinks for fetched files | Store refs or copy fallback |
| curl/wget subprocess fetch | Fetcher service |
| Python tarfile/zipfile | Archive service |
| asyncio executor | BuildCoordinator scheduler |
This is more invasive than a “light patch”, but it is cleaner. The IX package corpus and target/build knowledge are preserved; Unix process plumbing is not.
MicroPython Port Scope
The MicroPython port should be sized around IX planner needs plus general system scripting:
Native modules:
- capos: bootstrap capabilities, typed capability calls, errors.
- ixcapos: package graph and build-service client bindings.
- ixtemplate: template render calls if the renderer is an embedded Rust/C component.
- ixstore: Store and Namespace helpers.
Python/micro-library requirements:
- json;
- hashlib;
- base64 or binascii;
- os.path subset;
- random;
- time;
- small shutil subset for path operations if old IX code remains;
- small asyncio only if planner concurrency remains in Python.
Avoid implementing:
- general subprocess;
- general fcntl;
- full signal;
- full multiprocessing;
- full tarfile;
- full zipfile;
- full ssl/urllib3;
- full Jinja2.
Those are symptoms of preserving the wrong boundary.
CPython Still Has a Role
CPython remains useful even if it is not a capOS prerequisite:
- run upstream IX on the development host;
- compare rendered descriptors from CPython/Jinja2 against ix-template;
- generate fixtures for the capOS renderer;
- bootstrap the first static tool closure;
- serve as a later optional POSIX compatibility demo.
Differential testing should be explicit:
flowchart LR
Pkg[IX package] --> Cpy[Host CPython + Jinja2]
Pkg --> Cap[capOS ix-template]
Cpy --> A[descriptor A]
Cap --> B[descriptor B]
A --> Diff[normalized diff]
B --> Diff
Diff --> Corpus[compatibility corpus]
This makes CPython a test oracle, not a trusted runtime dependency inside capOS.
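The normalized-diff step in the flow above might look like this on the host; the descriptor shape is illustrative:

```python
import json

def normalize(descriptor: dict) -> str:
    """Canonical JSON so only semantic differences survive the diff."""
    return json.dumps(descriptor, sort_keys=True, separators=(",", ":"))

def diff_descriptors(oracle: dict, candidate: dict) -> list:
    """Return the paths where the CPython oracle and the capOS renderer
    disagree, walking nested dicts."""
    diffs = []
    def walk(x, y, path):
        if isinstance(x, dict) and isinstance(y, dict):
            for k in sorted(set(x) | set(y)):
                walk(x.get(k), y.get(k), f"{path}/{k}")
        elif x != y:
            diffs.append(f"{path}: {x!r} != {y!r}")
    walk(oracle, candidate, "")
    return diffs
```

Each divergence either becomes a fixture in the compatibility corpus or a bug report against one of the two renderers.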
Staged Plan
Stage A: Host IX builds capOS artifacts
Run IX on Linux host first. Add a capos target and recipes for static capOS
ELFs. This validates package metadata, target triples, linker flags, and static
closure assumptions before capOS hosts any of it.
Outputs:
- x86_64-unknown-capos target model in IX;
- recipes for libcapos, capos-rt, shell/coreutils candidates, MicroPython, and archive/fetch helpers;
- static artifacts imported into the boot image or Store.
Stage B: Template compatibility harness
Build ix-template on the host. Render a package corpus through CPython/Jinja2
and through ix-template. Normalize JSON/script output and record divergences.
Outputs:
- supported IX template subset;
- custom filter implementation;
- fixture corpus;
- list of unsupported packages or constructs.
Stage C: Native MicroPython port
Port MicroPython to capOS as a normal native userspace program using
capos-rt and a small libc/POSIX subset only where needed.
Outputs:
- REPL or script runner;
- frozen IX planner modules;
- native capos, ixcapos, and ixtemplate modules;
- no promise of full CPython compatibility.
Stage D: BuildCoordinator and sandboxed execution
Implement capOS-native build services and run simple package builds using externally supplied static tools.
Outputs:
- build graph execution;
- per-build scratch/output namespaces;
- deterministic logs and output commits;
- cancellation and resource policies.
Stage E: IX package corpus migration
Patch IX templates for capOS target semantics. Start with simple C/static packages, then Rust, then Go.
Outputs:
- C/static package subset;
- regular Rust package support once regular Rust runtime/toolchain work is ready;
- Go package support when GOOS=capos or imported Go toolchain support is credible;
- WASI packages as a separate target family where useful.
Stage F: Self-hosting
Run the IX-capOS appliance inside capOS to rebuild a meaningful part of its own userspace closure.
Outputs:
- build the MicroPython IX planner inside capOS;
- build core shell/coreutils/archive tools inside capOS;
- build libcapos and selected static service binaries;
- eventually build Rust and Go runtime/toolchain pieces.
Why This Is Better Than “CPython First”
The CPython-first route optimizes for running upstream IX quickly. The MicroPython-plus-services route optimizes for capOS’s actual design:
- capability authority stays typed and explicit;
- build isolation is native instead of Linux namespace emulation;
- Store/Namespace are first-class rather than hidden behind /ix;
- fetch/archive/build operations are auditable services;
- the scripting runtime remains small;
- the system does not need full CPython before it can have a package manager;
- CPython can still be added later through the POSIX layer without blocking IX-capOS.
The tradeoff is that IX-capOS becomes a real port/fork at the control-plane boundary. That is acceptable for a clean capability-native fit.
Risks
Template compatibility is the main technical risk. IX uses a restricted-looking
Jinja subset, but exact self.block(), super(), whitespace, expression, and
undefined-value behavior must match closely enough for package hashes to remain
stable. This needs corpus testing, not confidence.
Build-script compatibility is the largest scope risk. Even if IX planning is native, the package corpus still executes conventional build systems. capOS must provide enough shell, coreutils, archive, compiler, and filesystem behavior for those tools.
Toolchain bootstrapping is a long dependency chain. The first useful IX-capOS system will import statically linked tools from a host. Native self-hosting is late-stage work.
Store semantics need care around directories, symlinks, hardlinks, mtimes, and executable bits. These details affect build reproducibility and package compatibility.
MicroPython must not grow into a bad CPython clone. If many missing modules are implemented only to satisfy upstream IX assumptions, the design boundary has failed.
Recommendation
Adopt IX as a package corpus and build model, not as a CPython/POSIX program to preserve unchanged.
The optimal capOS-native solution is:
- Host-side upstream IX remains available for bootstrap and oracle tests.
- ix-template in Rust renders the actual IX template subset.
- Native MicroPython runs the adapted IX planner/control plane.
- capOS services execute all authority-bearing operations: fetch, extract, build sandbox, Store commit, Namespace publish, and process lifecycle.
- CPython is deferred to general POSIX compatibility and optional tooling.
This makes MicroPython the sweet spot for the in-system IX control plane while avoiding the trap of turning MicroPython into CPython.
Research: Out-of-Kernel Scheduling
Survey of whether capOS can move CPU scheduler implementation out of the kernel, which parts are normally kept privileged, and which policy has been moved to user-space services or loadable policy modules in prior systems.
Scope
“User-space scheduler” is an overloaded term. The question here is narrower than language/runtime scheduling: can the OS CPU scheduler itself be moved out of the kernel?
This report separates the relevant models:
| Model | Schedules | Kernel sees | Examples |
|---|---|---|---|
| User-controlled kernel scheduling | Kernel threads / scheduling contexts | Privileged mechanism plus user policy inputs | L4 user-level scheduling, seL4 MCS, ARINC partition schedulers on seL4 |
| Dynamic in-kernel policy | Kernel threads | Policy loaded from user space but executed in kernel | Linux sched_ext, Ekiben, Bossa |
| Whole-machine core arbitration | Cores granted to applications/runtimes | Kernel threads pinned, parked, or revoked | Arachne, Shenango, Caladan |
| In-process M:N runtime | Goroutines, virtual threads, fibers, async tasks | A smaller set of OS threads | Go, Java virtual threads, Erlang, Tokio |
| User-level thread package | User-level threads or tasklets | One or more kernel execution contexts | Capriccio, Argobots |
| Kernel-assisted two-level runtime scheduling | User threads plus kernel events | Virtual processors / activations | Scheduler activations, Windows UMS |
The common boundary in prior systems is: the kernel allocates protected execution resources, handles blocking and preemption, and enforces isolation. User space supplies domain policy: which goroutine, actor, task, request, or coroutine runs next.
Feasibility Assessment
Moving the entire scheduler out of the kernel is not viable in a protected, preemptive system if “scheduler” means the code that runs on timer interrupts, chooses an immediately runnable kernel thread, saves/restores CPU state, changes page tables, updates per-CPU state, and enforces CPU-time isolation. That mechanism is part of the CPU protection boundary.
Moving scheduler policy out of the kernel is viable. A capOS-like kernel can act as a small CPU driver that enforces runnable-state invariants, capability-authorized scheduling contexts, budgets, priorities, CPU affinity, timeout faults, and IPC donation. A privileged user-space scheduler service can own admission control, budgets, priorities, placement, CPU partitioning, and service-specific policy.
The design point supported by the surveyed systems is not “no scheduler in kernel.” It is “minimal kernel dispatch and enforcement, user-space policy.”
Executive Conclusions
- The next-thread dispatch path is normally kept in kernel mode. It runs when the current user process may be untrusted, blocked, faulting, or out of budget.
- User space can own policy if the kernel exposes scheduling contexts as capability-controlled CPU-time objects. Thread creation and thread handles should follow the same capability-first model.
- Consulting a user-space scheduler server on every timer tick adds context switches to the hottest path and creates a bootstrap problem when the scheduler server itself is not runnable.
- seL4 MCS is the most directly comparable model: scheduling contexts are explicit objects, budgets are enforced by the kernel, and passive servers can run on caller-donated scheduling contexts.
- L4 user-level scheduling experiments show that user-directed scheduling is possible, with reported overhead from 0 to 10 percent compared with a pure in-kernel scheduler for their workload. That is plausible for policy changes, not for every dispatch decision.
- seL4 user-mode partition schedulers show the downside: a prototype partitioned scheduler measured substantial overhead because each scheduling event crosses the user/kernel boundary.
- sched_ext and Ekiben are useful evidence for pluggable scheduler policy, but they still execute policy in or near the kernel. They do not prove that the dispatch mechanism can be a normal user process.
- Whole-machine core arbiters such as Arachne, Shenango, and Caladan support a different split: the kernel still schedules threads, while a privileged control plane grants, revokes, and places cores at coarser granularity.
- Direct-switch IPC and scheduling-context donation reduce the priority inversion and dispatch-overhead risks that appear when capability servers are scheduled only by per-process priorities.
- Pure M:1 user-level threads are insufficient for capOS as the only threading story. They are fast, but one blocking syscall, page fault wait, or long CPU loop can stall unrelated user threads unless every blocking operation is converted to async form.
- M:N runtimes need a small OS contract: capability-created kernel threads, TLS/FS-base state, capability-authorized futex-style wait/wake, monotonic timers, async I/O/event notification, and a way to detect or avoid kernel blocking.
- Scheduler activations solved the right conceptual problem but exposed a complicated upcall contract. A capability OS can get most of the benefit with simpler primitives: async capability rings, notification objects, futexes, and explicit thread objects.
- Work-stealing with per-worker local queues is the dominant general-purpose runtime design. It gives locality and scale, but it needs explicit fairness guards and I/O polling integration.
- SQPOLL-style polling is a scheduling decision. It trades a core for lower submission latency and depends on SMP plus explicit CPU ownership.
- A generic language scheduler in the kernel is a separate design from out-of-kernel CPU policy. Go, Rust async, actor runtimes, and POSIX layers need kernel mechanisms that let them implement their own policy.
Privileged Mechanisms
The following responsibilities are mechanism, not policy. Moving them to a normal user process either breaks protection or puts a user/kernel round trip on the critical path:
- Save and restore CPU register context.
- Switch page tables / address spaces.
- Update per-CPU current-thread state, kernel stack, TSS/RSP0, and syscall stack state.
- Handle timer interrupts and IPIs.
- Maintain a safe runnable/blocked/exited state machine.
- Enforce CPU budgets and preempt a thread that exceeds its budget.
- Choose an emergency runnable thread when the policy owner is dead, blocked, or malicious.
- Run idle and halt safely when no runnable work exists.
- Integrate scheduling with blocking syscalls, page faults, futex waits, and IPC wakeups.
- Preserve invariants under SMP races.
These are exactly the parts currently concentrated in
kernel/src/sched.rs
and the x86 context-switch path. They can be simplified and made more generic,
but they remain required somewhere privileged.
Policy Surface
The following are policy examples that can be owned by a privileged user-space service once scheduling contexts exist:
- Admission control: which process/thread is allowed to consume CPU time.
- Priority assignment and dynamic priority changes.
- Budget/period selection for temporal isolation.
- CPU affinity and CPU partitioning decisions.
- Core grants for SQPOLL, device polling, network stacks, and latency-sensitive services.
- Overload handling policy.
- Per-service or per-tenant fair-share policy.
- Instrumentation-driven tuning.
- Runtime-specific hints, such as “latency-sensitive”, “batch”, “driver”, or “poller”.
This split gives a capOS-like system policy freedom while preserving a small, auditable kernel CPU mechanism.
Viable Architectures
1. Minimal Kernel Scheduler Plus User Policy Service
This is one capOS-compatible design point.
The kernel implements:
- Thread states and per-CPU run queues.
- Priority/budget-aware dispatch.
- Scheduling-context objects.
- Timer-driven budget accounting.
- Timeout faults or notifications.
- Capability-checked operations to bind/unbind scheduling contexts to threads.
- Emergency fallback policy.
A user-space sched service implements:
- System policy loaded from the boot manifest.
- Resource partitioning between services.
- Priority/budget updates.
- CPU pinning and SQPOLL grants.
- Diagnostics and policy reload.
The policy service is invoked on configuration changes and timeout faults, not on every context switch.
2. seL4-MCS-Style Scheduling Contexts
seL4 MCS makes CPU time a first-class kernel object. A thread needs a scheduling context to run. A scheduling context carries budget, period, and priority. The kernel enforces the budget with a sporadic-server model. Passive servers can block without their own scheduling context; callers donate their scheduling context through synchronous IPC, and the context returns on reply.
This maps directly to capOS:
```
SchedContext {
  budget_ns
  period_ns
  priority
  cpu_mask
  remaining_budget
  timeout_endpoint
}
```
Kernel responsibilities:
- Enforce budget and period.
- Dispatch runnable threads with eligible scheduling contexts.
- Donate and return contexts across direct-switch IPC.
- Notify user space on timeout or depletion.
User-space responsibilities:
- Create and distribute scheduling-context capabilities.
- Decide budgets and priorities.
- Build passive service topologies.
- React to timeout faults.
This moves scheduling policy out without moving the hot dispatch mechanism out.
3. Hierarchical User-Level Scheduler
L4 research evaluated exporting scheduling to user level through a hierarchical user-level scheduling architecture. The reported application overhead was 0 to 10 percent compared with a pure in-kernel scheduler in their evaluation, and the design enabled user-directed scheduling.
This is possible, but the cost model is sensitive:
- Every policy decision that requires a scheduler-server round trip is expensive.
- The scheduler server needs guaranteed CPU time, or the system can deadlock.
- Faults and interrupts still need kernel fallback.
- SMP multiplies races around run queues, CPU ownership, and migration.
This architecture is viable for coarse-grained partition scheduling, VM scheduling, or policy control. As a first general dispatch path, it has higher latency and bootstrap risk than an in-kernel dispatcher.
4. Dynamic In-Kernel Policy
Linux sched_ext lets user space load BPF scheduler programs, but the policy runs inside the kernel scheduler framework. The kernel preserves integrity by falling back to the fair scheduler if the BPF scheduler errors or stalls runnable tasks. Ekiben similarly targets high-velocity Linux scheduler development with safe Rust policies, live upgrade, and userspace debugging.
This model is a later-stage option for dynamic scheduler experiments, but it is not “scheduler in user space.” It also adds verifier/runtime complexity.
5. Core Arbiter / Resource Manager
Arachne, Shenango, and Caladan move high-level core allocation decisions out of the ordinary kernel scheduler path. Applications or runtimes know which cores they own, while an arbiter grants and revokes cores based on load or interference.
This model is useful for capOS after SMP:
- grant cores to NIC drivers, network stacks, or SQPOLL workers;
- revoke poller cores under CPU pressure;
- isolate latency-sensitive services from batch work;
- expose CPU ownership through capabilities.
It does not remove the kernel dispatcher. It changes the granularity of policy from “which thread next” to “which service owns this CPU budget.”
Classic Problem: Kernel Threads vs User Threads
The scheduler activations paper is still the cleanest statement of the core problem: kernel threads have integration with blocking and preemption, while user-level threads have cheaper context switching and better policy control. The failure mode of user-level threads layered naively on kernel threads is that kernel events are hidden from the runtime. A kernel thread can block in the kernel while runnable user threads exist, and the kernel can preempt a kernel thread without telling the runtime which user thread was stopped.
Scheduler activations address this by giving each address space a “virtual multiprocessor.” The kernel allocates processors to address spaces and vectors events to the user scheduler when processors are added, preempted, blocked, or unblocked. The activation is both an execution context and a notification vehicle.
The lesson for capOS is not to copy the full activation API. The durable idea is the split:
- Kernel owns physical CPU allocation, protection, preemption, and blocking.
- Runtime owns which application-level work item runs on a granted execution context.
- Kernel-visible blocking must create a runtime-visible event, or it must be avoided by making the operation async.
For capOS, async capability rings already avoid many blocking syscalls. The remaining hard cases are futex waits, page faults that require I/O, synchronous IPC, and preemption of long-running runtime tasks.
Runtime Schedulers in Practice
Go
Go uses an M:N scheduler with three central concepts:
- G: goroutine.
- M: worker thread.
- P: processor token required to execute Go code.
The Go runtime distributes runnable goroutines over worker threads, keeps per-P queues for scalability, uses global queues and netpoller integration for fairness and I/O, and parks/unparks OS threads conservatively to avoid wasting CPU. Its own source comments call out why centralized state and direct handoff were rejected: centralization hurts scalability, while eager handoff hurts locality and causes thread churn.
Preemption is mixed. Go has synchronous safe points and asynchronous preemption using OS mechanisms such as signals. The runtime can only safely stop a goroutine at points where stack and register state can be scanned.
Implications for capOS:
- Initial `GOOS=capos` can run with `GOMAXPROCS=1` and cooperative preemption, but useful Go requires kernel threads, futexes, FS-base/TLS, a monotonic timer, and an async network poller.
- A signal clone is not strictly required if capOS provides a runtime-visible timer/preemption notification and the Go port accepts cooperative-first behavior.
- The kernel must schedule threads, not processes, before Go can use multiple cores.
Java Virtual Threads
JDK virtual threads use M:N scheduling: many virtual threads are mounted on a
smaller number of platform threads. The default scheduler is a FIFO-mode
work-stealing ForkJoinPool; the platform thread currently carrying a virtual
thread is called its carrier.
The design is intentionally not pure cooperative scheduling from the application’s perspective: most JDK blocking operations unmount the virtual thread, freeing the carrier. But some operations pin the virtual thread to the carrier, notably native calls and some synchronized regions. The JEP also notes that the scheduler does not currently implement CPU time-sharing for virtual threads.
Implications for capOS:
- “Blocking” compatibility requires library/runtime cooperation, not just a scheduler. The runtime needs blocking operations to yield carriers.
- Native calls and pinned regions remain a general M:N hazard. capOS cannot make that disappear in the kernel.
Tokio and Rust Async Executors
Tokio represents the async executor model rather than stackful green threads.
Tasks run until they return Poll::Pending, so fairness depends on cooperative
yield points and wakeups. Tokio’s multi-thread scheduler uses one global queue,
per-worker local queues, work stealing, an event interval for I/O/timer checks,
and a LIFO slot optimization for locality.
Implications for capOS:
- A `capos-rt` async executor can integrate capability-ring completions, notification objects, and timers as wake sources.
- A cooperative budget is mandatory. A future that never awaits can monopolize a worker until kernel preemption takes the whole OS thread away.
- A single global CQ per process can become an executor bottleneck if many worker threads consume completions. Per-thread or sharded wake queues are likely needed after SMP.
Erlang/BEAM
BEAM schedulers run lightweight Erlang processes on scheduler threads. The runtime exposes scheduler count and binding controls, and Erlang processes are preempted by reductions rather than OS timer slices. This shows a different point in the design space: the language VM owns fairness because it controls execution of bytecode.
Implications for capOS:
- Managed runtimes can implement stronger fairness than native async libraries because they control instruction dispatch or compiler-inserted safe points.
- Native Rust/C userspace cannot rely on that unless the compiler/runtime inserts yield or safe-point checks.
Capriccio and Argobots
Capriccio showed that a user-level thread package can scale to very high concurrency by combining cooperative user-level threads, asynchronous I/O, O(1) thread operations, linked stacks, and resource-aware scheduling. The important lesson is that the thread abstraction can survive high concurrency when the runtime controls stacks and blocking.
Argobots generalizes lightweight execution units into user-level threads and tasklets over execution streams. It is designed as a substrate for higher-level systems such as OpenMP and MPI, with customizable schedulers. This is directly relevant to capOS because it argues for low-level runtime mechanisms, not one global scheduling policy.
Lithe
Lithe targets composition of parallel libraries. Its thesis is that a universal task abstraction or one global scheduler does not compose well when multiple parallel libraries are nested. Instead, physical hardware threads are shared through an explicit resource interface, while each library keeps its own task representation and scheduling policy.
Implications for capOS:
- Avoid oversubscription by making CPU grants visible to user space.
- A future `CpuSet` or scheduling-context capability could let runtimes know how much parallelism they are actually allowed to use.
- Nested runtimes benefit from the ability to donate or yield execution resources without going through a process-global policy singleton.
Kernel Interfaces That Matter
Futexes
Futexes are the standard split-lock design: user space does the uncontended fast path with atomics, and the kernel only participates to sleep or wake threads. Linux also has priority-inheritance futex operations for cases where the kernel must manage lock-owner priority propagation.
For capOS:
- Implement futex as a capability-authorized primitive. Do not assume generic Cap’n Proto method encoding is acceptable for the hot path; measure it against a compact operation before fixing the ABI.
- Key futex wait queues by `(address_space, user_virtual_address)` for private futexes. Shared-memory futexes eventually need a memory-object identity plus offset.
- Support timeout against monotonic time first. Requeue and PI futexes can wait.
Restartable Sequences
Linux rseq lets user space maintain per-CPU data without heavyweight atomics and lets a thread cheaply read its current CPU/node. The current kernel docs also describe scheduler time-slice extensions for short critical sections.
For capOS:
- rseq-style current-CPU access becomes useful after SMP and per-CPU run queues.
- It is not a first threading prerequisite. Futex, TLS, and kernel threads come first.
- If added, expose a small per-thread ABI page with `cpu_id`, `node_id`, and an abort-on-migration critical-section protocol.
io_uring SQPOLL
SQPOLL moves submission from syscall-driven to polling-driven. A kernel thread polls the submission queue and submits work as soon as userspace publishes SQEs. This reduces submission latency and syscall overhead for sustained I/O, but it burns CPU and needs careful affinity.
capOS already has an io_uring-inspired capability ring, so the analogy is direct:
- Current tick-driven ring processing is correct for a toy system but couples invocation latency to timer frequency.
- A kernel-side SQ polling thread interacts badly with single-CPU systems. On a single CPU it competes with the application it is supposed to accelerate.
- Make SQPOLL a scheduling/capability decision: the process donates or is granted a CPU budget for the poller.
- Completion handling remains a separate problem. A runtime still needs to poll CQs or block on notifications.
sched_ext
Linux sched_ext is not a normal user-level thread scheduler. It is a scheduler class whose behavior is defined by BPF programs loaded from user space. The kernel docs emphasize that sched_ext can be enabled and disabled dynamically, can group CPUs freely, and falls back to the default scheduler if the BPF scheduler misbehaves. The docs also warn that the scheduler API has no stability guarantee.
For capOS:
- The relevant idea is safe, dynamically replaceable policy with kernel integrity fallback.
- Copying the BPF ABI is not required. capOS can get a smaller version through privileged scheduler-policy capabilities later.
- Keep early scheduling policy in kernel Rust until the invariants are clear.
Whole-Machine User-Space/Core Schedulers
Arachne
Arachne is a user-level thread system for very short-lived threads. It is core-aware: applications know which cores they own and control placement of work on those cores. A central arbiter reallocates cores among applications. The published results report strong memcached and RAMCloud improvements, and the implementation requires no Linux kernel modifications.
Takeaway: user-level scheduling gets much better when the runtime has explicit core ownership. Blindly creating more kernel threads and hoping the OS scheduler does the right thing is a weaker contract.
Shenango
Shenango targets datacenter services with microsecond-scale tail-latency goals. It uses kernel-bypass networking and an IOKernel on a dedicated core to steer packets and reallocate cores across applications every 5 microseconds. The key policy is rapid core reallocation based on whether queued work is waiting long enough to imply congestion.
Takeaway: a dedicated scheduling/control core can be worthwhile when latency SLOs are tighter than normal kernel scheduling reaction times. It is expensive and only justified for sustained latency-sensitive workloads.
Caladan
Caladan extends the idea from load to interference. It uses a centralized scheduler core and kernel module to monitor and react to memory hierarchy and hyperthread interference at microsecond scale. Its main claim is that static partitioning of cores, caches, and memory bandwidth is neither necessary nor sufficient for rapidly changing workloads.
Takeaway: CPU scheduling is not only “which runnable thread next.” On modern machines it is also placement relative to caches, sibling SMT threads, memory bandwidth, and bursty workload phase changes.
Design Axes
| Axis | Options | Practical conclusion |
|---|---|---|
| Stack model | Stackless tasks, segmented/growing stacks, fixed stacks | Rust async uses stackless futures; Go/Java need runtime-managed stacks; POSIX threads need fixed or growable user stacks |
| Preemption | Cooperative, safe-point, signal/upcall, timer-forced OS preemption | Kernel preemption alone protects the system; runtime fairness needs safe points or cooperative budgets |
| Blocking | Convert all operations to async, add carriers, kernel upcalls | Async caps reduce blocking; Go/POSIX still need kernel threads and futexes |
| Queueing | Global queue, per-worker queues, work stealing, priority queues | Per-worker queues plus stealing are the default; add global fairness escape hatches |
| CPU ownership | Invisible OS scheduling, affinity hints, explicit CPU grants | Explicit grants matter for high-performance runtimes and SQPOLL |
| Cross-process calls | Queue through scheduler, direct switch, scheduling donation | Direct switch and scheduling-context donation reduce sync IPC overhead and inversion |
| Isolation | Best-effort fairness, priorities, budget/period contexts | Cloud-oriented capOS eventually needs budget/period scheduling contexts |
capOS Design Options
Option: Minimal Kernel Mechanism Plus User Policy
This option keeps dispatch and enforcement in the kernel, replaces the current round-robin process scheduler with a minimal kernel CPU mechanism, and moves policy to user space through scheduling-context capabilities.
The kernel side covers:
- dispatching the next runnable thread on each CPU;
- enforcing budget/period/priority invariants;
- handling interrupts, blocking, wakeups, and exits;
- direct-switch IPC and scheduling-context donation;
- an emergency fallback policy.
The user-space scheduler service covers:
- policy configuration from the manifest;
- per-service budgets, periods, priorities, and CPU masks;
- admission control for new processes and threads;
- SQPOLL/core grants;
- response to timeout faults and overload telemetry.
This gives a capOS-like system the exokernel/microkernel benefit of policy freedom without putting a user-space server on the context-switch fast path.
Possible Implementation Sequence
- Thread scheduler in kernel. Convert from process scheduling to thread scheduling, with per-thread kernel stack, saved registers, FS base, and shared process address space/cap table.
- Scheduling contexts. Add kernel objects that carry budget, period, priority, CPU mask, and timeout endpoint. Initially assign one default context per thread.
- ThreadSpawner and ThreadHandle capabilities. Expose thread creation and lifecycle through capabilities from the start. Bootstrap grants `init` the initial authority; `init` or a scheduler service delegates it under quota.
- Scheduling-context donation for IPC. Baseline direct-switch IPC handoff exists for blocked Endpoint receivers; add budget/priority donation and return once scheduling contexts exist.
- User-space policy service. Let init or a `sched` service create and update scheduling contexts via capabilities.
- SMP core ownership. After per-CPU run queues and TLB shootdown exist, allow the scheduler service to manage CPU masks and SQPOLL/poller grants.
- Optional dynamic policy. Much later, consider sched_ext-like policy modules if Rust/verifier infrastructure exists. This is not a prerequisite.
Minimal Kernel API Sketch
```capnp
interface SchedulerControl {
  createContext @0 (budgetNs :UInt64, periodNs :UInt64, priority :UInt16)
      -> (context :SchedulingContext);
  setCpuMask @1 (context :SchedulingContext, mask :Data) -> ();
  bind @2 (thread :ThreadHandle, context :SchedulingContext) -> ();
  unbind @3 (thread :ThreadHandle) -> ();
  setTimeoutEndpoint @4 (context :SchedulingContext, endpoint :Endpoint) -> ();
  stats @5 (context :SchedulingContext) -> (consumedNs :UInt64, throttled :Bool);
}

interface SchedulingContext {
  yieldTo @0 (thread :ThreadHandle) -> ();
  consumed @1 () -> (consumedNs :UInt64);
}

interface ThreadSpawner {
  create @0 (
    entry :UInt64,
    stackTop :UInt64,
    arg :UInt64,
    context :SchedulingContext,
    flags :UInt64
  ) -> (thread :ThreadHandle);
}

interface ThreadHandle {
  join @0 (timeoutNs :UInt64) -> (status :Int64);
  exitCode @1 () -> (exited :Bool, status :Int64);
  bind @2 (context :SchedulingContext) -> ();
}
```
The hot path does not invoke these methods; they are control-plane operations.
Dependency: In-Process Threading
Kernel threads inside a process are a dependency for sophisticated user-level thread support:
- `Thread` object with saved registers, per-thread kernel stack, user stack pointer, FS base, state, and parent process reference.
- Scheduler runs threads, not processes.
- Process owns address space and cap table; threads share both.
- Process context switch saves/restores FS base today; thread scheduling must make that state per-thread.
- Thread creation is exposed first as a `ThreadSpawner` capability; bootstrap grants initial authority to `init`, and later policy delegates it through the capability graph.
- Thread exit reclaims the thread stack and wakes joiners if join exists.
This directly unblocks Go phase 2, POSIX pthread compatibility, native
thread-local storage, and any multi-worker Rust async executor.
Dependency: Futex and Timer
A minimal capability-authorized futex primitive has this shape:
```
futex_wait(futex_space, uaddr, expected, timeout_ns) -> Result
futex_wake(futex_space, uaddr, max_count) -> usize
```
Required semantics:
- `wait` checks that `*uaddr == expected` while holding the futex wait-lock equivalent, then blocks the current thread.
- `wake` makes up to `max_count` waiters runnable.
- Timeouts use monotonic ticks or a timer wheel/min-heap.
- Return values must distinguish woken, timed out, interrupted, and value mismatch.
The authority should be capability-based from the start, for example through a `FutexSpace`, `WaitSet`, or memory-object-derived capability. The encoding is still a measurement question. Generic capnp capability calls may be acceptable if measured overhead is close to a compact operation; otherwise futex should use a dedicated compact capability-authorized operation, because the primitive sits on the runtime parking path.
Measure this before fixing the ABI:
- `CAP_OP_NOP`: ring validation plus CQE post, with no cap lookup or capnp.
- Empty and small `NullCap` calls through normal cap lookup, method dispatch, capnp param decode, and capnp result encode.
- Futex-shaped compact operation carrying `cap_id`, `uaddr`, `expected`, and `timeout`/`max_count`, initially returning without blocking.
- Later, real blocking paths: failed wait, wake with no waiters, wait-to-block, wake-to-runnable, and wake-to-resume.
The useful decision is not “capability or syscall”; it is “generic capnp method or compact capability-authorized scheduler primitive.” Authority remains in the capability model either way.
Near Term: Runtime Event Integration
For capos-rt, design the executor around kernel completion sources:
- Capability-ring CQ entries wake tasks waiting on cap invocations.
- Notification objects wake tasks waiting on interrupts, timers, or service events.
- Futex wakes resume parked worker threads.
- Timers can be integrated as wakeups instead of periodic polling.
The executor policy can start simple:
- One worker per kernel thread.
- Local FIFO queue per worker.
- One global injection queue.
- Work stealing when local and global queues are empty.
- Cooperative operation budget, then requeue.
Stage 6: IPC Scheduling
For synchronous IPC, direct switch has been introduced before priority scheduling:
- If client A calls server B and B is blocked in receive, switch A -> B directly without picking an unrelated runnable thread. This is implemented for the current single-CPU Endpoint path.
- Mark A blocked on reply.
- Future fastpath work can transfer a small message inline; use shared buffers for large data.
Scheduling-context donation then adds the budget/priority transfer:
- The server runs the request using the caller’s scheduling context.
- The caller’s budget covers client + server work.
- Passive servers can exist without independent CPU budget and only run when a caller donates one.
This avoids priority inversion through the capability graph and matches the service architecture better than per-process priorities alone.
Stage 7: SMP and Core Ownership
Once per-CPU scheduler queues exist, these become policy surfaces:
- CPU affinity depends on correct migration and TLB shootdown.
- A `CpuSet` or `SchedulingContext` capability can describe allowed CPUs, budget, period, and priority.
- Cheap current-CPU exposure depends on a stable per-thread ABI page.
- SQPOLL can be gated on available CPU budget to avoid unlimited poller creation.
Risks and Failure Modes
- M:1 green threads do not provide Go or POSIX compatibility by themselves.
- A normal user-space process choosing the next thread on every timer tick puts a context-switch round trip on the hot path.
- Recovery from scheduler-service failure cannot depend solely on the scheduler service being runnable.
- A Go-like G/M/P scheduler in the kernel couples language runtime policy to the kernel.
- Generic Cap’n Proto capability calls may be too heavy for every synchronization primitive. Measure generic calls against compact capability-authorized operations before fixing the futex ABI.
- sched_ext-like dynamic policy loading depends on mature scheduler invariants and verifier/runtime machinery.
- SQPOLL on a single-core system can compete with the application it is meant to accelerate.
Open Questions
- Does capOS need scheduler-activation-style upcalls? Async caps and notification objects cover many of the same cases with less machinery.
- How can runtime preemption work without Unix signals? Options are cooperative-only, timer notification to a runtime handler, or a kernel forced safe-point ABI. Cooperative-only is one first-support option for Go.
- How are shared-memory futex keys represented? Private futexes can key on address space and virtual address. Shared futexes need memory-object identity and offset.
- Should futex wait/wake use generic capnp capability methods or a compact capability-authorized operation? The answer should come from the measurement plan above, not from assumption.
- How much policy belongs in the boot manifest versus a long-running `sched` service? Static embedded systems can use manifest policy. Cloud or developer systems need runtime policy updates.
- What is the emergency fallback if the scheduler service exits? Options are a tiny kernel round-robin fallback for privileged recovery threads, a pinned immortal scheduler thread, or panic. The first is the only robust development choice.
Source Notes
- Anderson et al., “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism” (SOSP 1991): https://polaris.imag.fr/vincent.danjean/papers/anderson.pdf
- “Towards Effective User-Controlled Scheduling for Microkernel-Based Systems” (L4 user-level scheduling): https://os.itec.kit.edu/21_738.php
- Asberg and Nolte, “Towards a User-Mode Approach to Partitioned Scheduling in the seL4 Microkernel”: https://www.es.mdh.se/pdf_publications/2641.pdf
- Kang et al., “A User-Mode Scheduling Mechanism for ARINC653 Partitioning in seL4”: https://link.springer.com/chapter/10.1007/978-981-10-3770-2_10
- L4Re overview: https://l4re.org/doc/l4re_intro.html
- Liedtke, “On micro-kernel construction”: https://elf.cs.pub.ro/soa/res/lectures/papers/lietdke-1.pdf
- seL4 MCS tutorial: https://docs.sel4.systems/Tutorials/mcs.html
- seL4 design principles: https://microkerneldude.org/2020/03/11/sel4-design-principles/
- Linux kernel sched_ext documentation: https://www.kernel.org/doc/html/next/scheduler/sched-ext.html
- Arun et al., “Agile Development of Linux Schedulers with Ekiben”: https://arxiv.org/abs/2306.15076
- Williams, “An Implementation of Scheduler Activations on the NetBSD Operating System” (USENIX 2002): https://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa.html
- Microsoft, “User-Mode Scheduling”: https://learn.microsoft.com/en-us/windows/win32/procthread/user-mode-scheduling
- Go runtime scheduler source: https://go.dev/src/runtime/proc.go
- Go preemption source: https://go.dev/src/runtime/preempt.go
- OpenJDK JEP 444, “Virtual Threads”: https://openjdk.org/jeps/444
- Tokio runtime scheduling documentation: https://docs.rs/tokio/latest/tokio/runtime/
- von Behren et al., “Capriccio: Scalable Threads for Internet Services” (SOSP 2003): https://web.stanford.edu/class/archive/cs/cs240/cs240.1046/readings/capriccio-sosp-2003.pdf
- Argobots paper page: https://www.anl.gov/argonne-scientific-publications/pub/137165
- Argobots project: https://www.argobots.org/
- Pan et al., “Lithe: Enabling Efficient Composition of Parallel Libraries” (HotPar 2009): https://www.usenix.org/legacy/event/hotpar09/tech/full_papers/pan/pan_html/
- Linux `futex(2)` manual: https://man7.org/linux/man-pages/man2/futex.2.html
- Linux kernel restartable sequences documentation: https://docs.kernel.org/userspace-api/rseq.html
- `io_uring_sqpoll(7)` manual: https://manpages.debian.org/testing/liburing-dev/io_uring_sqpoll.7.en.html
- Qin et al., "Arachne: Core-Aware Thread Management" (OSDI 2018): https://www.usenix.org/conference/osdi18/presentation/qin
- Ousterhout et al., “Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads” (NSDI 2019): https://www.usenix.org/conference/nsdi19/presentation/ousterhout
- Fried et al., “Caladan: Mitigating Interference at Microsecond Timescales” (OSDI 2020): https://www.usenix.org/conference/osdi20/presentation/fried