Proposal: Storage, Naming, and Persistence
What replaces the filesystem in a capability OS where Cap’n Proto is the universal wire format.
The Problem with Filesystems
In Unix, the filesystem is the universal namespace. Everything is a path:
/dev/sda, /etc/config, /proc/self/fd/3, /run/dbus/system_bus_socket.
Paths are ambient authority — any process can open /etc/passwd if the
permission bits allow. The filesystem conflates naming, access control,
persistence, and device abstraction into one mechanism.
capOS has capabilities instead of paths. Access control is structural (you can only use what you were granted), not advisory (permission bits checked at open time). This means:
- No global namespace needed — each process sees only its granted caps
- No path-based access control — the cap IS the access
- No distinction between “file”, “device”, “socket” — everything is a typed capability interface
A traditional VFS would reintroduce ambient authority through the back door. Instead, capOS needs a storage and naming model native to capabilities and Cap’n Proto.
Core Insight: Cap’n Proto Everywhere
Cap’n Proto is already used in capOS for:
- Interface definitions — .capnp schemas define capability contracts
- IPC messages — capability invocations are capnp messages
- Serialization — capnp wire format crosses process boundaries
If we extend this to storage, then:
- Stored objects are capnp messages
- Configuration is capnp structs
- Binary images are capnp-wrapped blobs
- The boot manifest is a capnp message describing the initial capability graph
No format conversion anywhere. The same tools (schema compiler, serializer, validator) work for IPC, storage, config, and network transfer.
Architecture
Three Layers
Target architecture after the manifest executor and process-spawner work:
Boot Image (read-only, baked into ISO)
│
│ capnp-encoded manifest + binaries
│
v
Kernel (creates initial caps from manifest)
│
│ grants caps to init
│
v
Init (builds live capability graph)
│
├──> Filesystem services (FAT, ext4 — wrap BlockDevice as Directory/File)
│
├──> Store service (capability-native content-addressed storage)
│ backed by: virtio-blk, RAM, or network
│
└──> All other services (receive Directory, Store, or Namespace caps)
Layer 1: Boot Image
The boot image (ISO/disk) contains a capnp-encoded system manifest loaded as a Limine module alongside the kernel. The manifest describes:
struct SystemManifest {
# Binaries available at boot, keyed by name
binaries @0 :List(NamedBlob);
# Initial service graph — what to spawn and with what caps
services @1 :List(ServiceEntry);
# Static configuration values as an evaluated CUE-style tree
config @2 :CueValue;
}
struct NamedBlob {
name @0 :Text;
data @1 :Data;
}
struct ServiceEntry {
name @0 :Text;
binary @1 :Text; # references a NamedBlob by name
caps @2 :List(CapRef); # what caps this service receives
restart @3 :RestartPolicy;
exports @4 :List(Text); # cap names this service is expected to export
}
struct CapRef {
name @0 :Text; # local name in the child's cap table
expectedInterfaceId @1 :UInt64; # generated .capnp TYPE_ID for validation
union {
unset @2 :Void; # invalid; keeps omitted sources fail-closed
kernel @3 :KernelCapSource;
service @4 :ServiceCapSource;
}
}
enum KernelCapSource {
console @0;
endpoint @1;
frameAllocator @2;
virtualMemory @3;
}
struct ServiceCapSource {
service @0 :Text;
export @1 :Text;
}
enum RestartPolicy {
never @0;
onFailure @1;
always @2;
}
struct CueValue {
union {
null @0 :Void;
boolean @1 :Bool;
intValue @2 :Int64;
uintValue @3 :UInt64;
text @4 :Text;
bytes @5 :Data;
list @6 :List(CueValue);
fields @7 :List(CueField);
}
}
struct CueField {
name @0 :Text;
value @1 :CueValue;
}
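A rough Rust mirror of `CueValue` shows why the tree stays decodable without Cap'n Proto reflection: it is a closed recursive enum that a config walker can pattern-match directly. This shape is illustrative (using `std` types for brevity, where the real no_std decoder would read the capnp wire form); the `field` helper is an assumption, not generated code.

```rust
/// Illustrative in-memory mirror of the CueValue schema above.
#[derive(Debug, PartialEq)]
enum CueValue {
    Null,
    Boolean(bool),
    IntValue(i64),
    UintValue(u64),
    Text(String),
    Bytes(Vec<u8>),
    List(Vec<CueValue>),
    // CueField { name, value } pairs, flattened into tuples here.
    Fields(Vec<(String, CueValue)>),
}

impl CueValue {
    /// Look up a struct field by name, as init's config walker would.
    fn field(&self, name: &str) -> Option<&CueValue> {
        match self {
            CueValue::Fields(fields) => {
                fields.iter().find(|(n, _)| n == name).map(|(_, v)| v)
            }
            _ => None, // only Fields nodes have named children
        }
    }
}
```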
Capability source identity is already structured in the bootstrap manifest (the CapRef union above), so source selection does not depend on parsing authority strings.
KernelCapSource / ServiceCapSource select the authority to grant. The
expectedInterfaceId field carries the generated Cap’n Proto interface
TYPE_ID and only checks that the granted object speaks the expected schema.
It cannot replace source identity: many different objects may expose the same
interface while representing different authority.
The build system (Makefile) generates this manifest from a human-authored
description and packs it into the ISO as manifest.bin. Current code embeds
every SystemManifest.binaries entry into that manifest as NamedBlob data,
including the release-built init and smoke-demo ELFs. Exposing the manifest to
init as a read-only BootPackage capability (rather than letting the kernel
parse and act on the service graph) is the selected follow-on milestone.
Using a CueValue tree instead of AnyPointer keeps the manifest directly
decodable in no_std userspace without depending on Cap’n Proto reflection.
Transitional Schema Note
ServiceEntry, CapSource::Service, and ServiceEntry.exports are
transitional. ProcessSpawner and copy/move cap transfer are implemented
(2026-04-22), but the default make run boot path still has the kernel
spawn every declared service and wire cross-service caps. Once init owns
generic manifest execution, the manifest loses the service graph entirely:
struct SystemManifest {
# Binaries available at boot, keyed by name
binaries @0 :List(NamedBlob);
# Init's config blob (replaces the service graph)
initConfig @1 :CueValue;
# Kernel boot parameters (memory limits, feature flags)
kernelParams @2 :CueValue;
}
ServiceEntry / CapRef disappear from the schema and become plain CUE
fields inside initConfig. Init reads them at runtime and calls
ProcessSpawner directly. validate_manifest_graph,
validate_bootstrap_cap_sources, and create_all_service_caps all retire
once that happens. See docs/proposals/service-architecture-proposal.md —
“Legacy Manifest Fields After Stage 6” for the deprecation plan.
Layer 2: Kernel Bootstrap
Target design for the kernel’s boot role:
- Parse the system manifest (read-only capnp message from Limine module).
- Hash the embedded binaries for optional measured-boot attestation.
- Create kernel-provided capabilities: Console, Timer, DeviceManager, ProcessSpawner, FrameAllocator, VirtualMemory (per-process), and a read-only BootPackage cap exposing SystemManifest.binaries and initConfig.
- Spawn init — exactly one userspace process — with that cap bundle.
Current code has not reached this split for the default boot: the kernel
still parses the manifest and creates one process per ServiceEntry.
The transition path exists in system-spawn.cue: it sets
config.initExecutesManifest, the kernel validates the full manifest but
boots only init, and init spawns endpoint, IPC, VirtualMemory, and
FrameAllocator cleanup demo children through ProcessSpawner. Retiring the
legacy kernel resolver for default make run is the selected follow-on
milestone tracked in WORKPLAN.md.
Layer 3: Init and the Live Capability Graph
Target init reads initConfig from the BootPackage cap and executes it:
fn main(caps: CapSet) {
let spawner = caps.get::<ProcessSpawner>("spawner");
let boot = caps.get::<BootPackage>("boot");
let config = boot.init_config()?; // CueValue
// Walk service entries from the config and spawn in dependency order
for entry in config.field("services")?.iter()? {
let binary = boot.binary(entry.field("binary")?.as_str()?)?;
let granted = resolve_caps(entry.field("caps")?, &running_services, &caps);
let handle = spawner.spawn(binary, granted, entry.field("restart")?.into())?;
running_services.insert(entry.field("name")?.as_str()?.into(), handle);
}
supervisor_loop(&running_services);
}
In this target model, init is a generic manifest executor rather than a
hardcoded service graph. The system topology is defined in the boot
package’s initConfig, not in init’s source code. Changing what services
run means rebuilding the boot image with a different config blob, not
recompiling init. Manifest graph resolution stops being a kernel concern.
The current transition still uses SystemManifest.services as the service
graph instead of initConfig; init reads the BootPackage manifest, validates a
metadata-only ManifestBootstrapPlan, resolves kernel and service cap sources,
records exported caps, spawns children in manifest order, and waits for their
ProcessHandles.
Two Storage Models
capOS supports two complementary storage models, both exposed as typed capabilities:
Filesystem Capabilities (Directory, File)
For accessing traditional block-based filesystems (FAT, ext4, ISO9660) and
for POSIX compatibility. A filesystem service wraps a BlockDevice and
exports Directory/File capabilities.
BlockDevice (raw sectors)
│
└──> Filesystem service (FAT, ext4, ...)
│
├──> Directory caps (namespace over files)
└──> File caps (read/write byte streams)
This model maps naturally to USB flash drives, NVMe partitions, and
network-mounted filesystems. The open() and sub() operations return new
capabilities via IPC cap transfer (see “IPC and Capability Transfer” below).
Capability-Native Store (Store, Namespace)
For capOS-native data: configuration, service state, content-addressed object
storage. A store service wraps a BlockDevice and exports Store/Namespace
capabilities.
BlockDevice (raw sectors)
│
└──> Store service
│
├──> Store cap (content-addressed put/get)
└──> Namespace caps (mutable name→hash mappings)
Content-addressing provides automatic deduplication, verifiable integrity, and immutable references. Namespaces add mutable bindings on top.
Bridging the Two Models
The models are composable. An adapter service can bridge between them:
- FsStore adapter: exposes a Directory tree as a content-addressed Store (hash each file’s contents, directory listings become capnp-encoded objects)
- StoreFS adapter: exposes Store/Namespace as a Directory tree (each name maps to a File whose contents are the stored object)
- Import/export: a utility service reads files from a Directory and stores them in a Store, or materializes Store objects as files in a Directory
In both cases the adapter is a userspace service holding caps to both subsystems. No kernel mechanism needed — just capability composition.
File I/O Interfaces
Directory, File, Store, and Namespace caps may be scoped to a user session, guest profile, anonymous request, or service identity, but the cap remains the authority. POSIX ownership metadata is compatibility data inside these services, not a system-wide authorization channel. See user-identity-and-policy-proposal.md.
BlockDevice
Raw sector access, served by device drivers (virtio-blk, NVMe, USB mass
storage). The driver receives hardware capabilities (MMIO, IRQ,
FrameAllocator for DMA) and exports a BlockDevice cap.
interface BlockDevice {
readBlocks @0 (startLba :UInt64, count :UInt32) -> (data :Data);
writeBlocks @1 (startLba :UInt64, count :UInt32, data :Data) -> ();
info @2 () -> (blockSize :UInt32, blockCount :UInt64, readOnly :Bool);
flush @3 () -> ();
}
For bulk transfers, readBlocks/writeBlocks accept a SharedBuffer
capability instead of inline Data (see “Shared Memory for Bulk Data”
below). The inline-Data variants work for metadata reads and small
operations; the SharedBuffer variants avoid copies for large I/O.
File
Byte-stream access to a single file. Served by filesystem services. Created
dynamically when a client calls Directory.open() — the filesystem service
creates a File CapObject for the opened file and transfers it to the
caller via IPC cap transfer.
interface File {
read @0 (offset :UInt64, length :UInt32) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> (written :UInt32);
stat @2 () -> (size :UInt64, created :UInt64, modified :UInt64);
truncate @3 (length :UInt64) -> ();
sync @4 () -> ();
close @5 () -> ();
}
close releases the server-side state for this file (open cluster chain
cache, dirty buffers). The kernel-side CapTable entry is removed by the system
transport via CAP_OP_RELEASE when the local holder releases it; generated
capos-rt handle drop still needs RAII integration before ordinary userspace
handles submit that opcode automatically. CapabilityManager is
management-only (list(), later grant()); it does not expose a drop()
method because ordinary handle lifetime belongs to the transport, not to an
application call on the same table that dispatches it.
Attenuation: a read-only File wraps the original and rejects write,
truncate, sync calls. An append-only File rejects write at offsets
other than the current size.
Directory
Namespace over files on a filesystem. Served by filesystem services.
open() and sub() return new capabilities via IPC cap transfer.
interface Directory {
open @0 (name :Text, flags :UInt32) -> (file :File);
list @1 () -> (entries :List(DirEntry));
mkdir @2 (name :Text) -> (dir :Directory);
remove @3 (name :Text) -> ();
sub @4 (name :Text) -> (dir :Directory);
}
struct DirEntry {
name @0 :Text;
size @1 :UInt64;
isDir @2 :Bool;
}
sub() returns a Directory scoped to a subdirectory — the analog of chroot.
The caller cannot traverse upward or see the parent directory. open() with
create flags creates a new file if it doesn’t exist.
The flags field in open() is a bitmask: CREATE = 1, TRUNCATE = 2,
APPEND = 4. No READ/WRITE flags — those are determined by the
Directory cap’s attenuation (a read-only Directory returns read-only Files).
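A minimal sketch of that bitmask, using the constant values given in the text; the helper function names are assumptions for illustration.

```rust
// Flag values from the Directory.open() description above.
const CREATE: u32 = 1;
const TRUNCATE: u32 = 2;
const APPEND: u32 = 4;

/// True if `flags` asks for the file to be created when absent.
fn wants_create(flags: u32) -> bool {
    flags & CREATE != 0
}

/// Reject flag bits the interface does not define (fail closed),
/// matching the proposal's general fail-closed posture.
fn validate_flags(flags: u32) -> Result<u32, &'static str> {
    if flags & !(CREATE | TRUNCATE | APPEND) != 0 {
        Err("unknown open() flag bits")
    } else {
        Ok(flags)
    }
}
```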
Syscall Trace: Reading a File from a FAT USB Drive
Four userspace processes: App, FAT service, USB mass storage, xHCI driver.
With promise pipelining (one submission):
Cap’n Proto promise pipelining lets the App chain dependent calls without waiting for intermediate results. The App submits a single pipelined request: “open this file, then read from the result”:
# Single pipelined submission (SQEs with PIPELINE flag):
# call 0: dir.open("report.pdf") → promise P0
# call 1: P0.file.read(offset=0, len=4096) → depends on P0
cap_submit([
{cap=2, method=OPEN, params={"report.pdf", flags=0}},
{cap=PIPELINE(0, field=file), method=READ, params={offset:0, length:4096}},
])
→ kernel routes call 0 to FAT service via Endpoint
→ FAT service reads directory entry from BlockDevice
→ FAT service creates FileCapObject, replies with File cap
→ kernel sees pipelined call 1 targeting the File cap from call 0
→ kernel dispatches call 1 to the same FAT service (or direct-invokes
the new File CapObject if it's a local endpoint)
→ FAT service maps offset → cluster chain → LBA
→ FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
→ USB mass storage → xHCI → hardware → back up
← completion: {data: [4096 bytes]}, File cap installed as cap_id=5
One app-to-kernel transition. The kernel resolves the pipeline dependency internally — the App never sees the intermediate File cap until the whole chain completes (though the cap is installed and usable afterward).
This is a core Cap’n Proto feature: by expressing “call method on the
not-yet-resolved result of another call,” the client avoids a round-trip
for each link in the chain. For deeper chains (e.g., dir.sub("a").sub("b").open("file").read(0, 4096)), the savings compound — one submission instead
of four sequential syscalls.
Without pipelining (two sequential ring submissions):
Without promise pipelining, the App submits two separate CALL SQEs via the ring, blocking on each completion before submitting the next:
# 1. Open file (App holds Directory cap, cap_id=2)
# App writes CALL SQE: {cap=2, method=OPEN, params={"report.pdf", flags=0}}
cap_enter(min_complete=1, timeout=MAX)
→ kernel routes CALL to FAT service via Endpoint
→ FAT service reads directory entry from BlockDevice
→ FAT service creates FileCapObject for this file
→ FAT service posts RETURN SQE with [FileCapObject] in xfer_caps
→ kernel installs File cap in App's table → cap_id=5
← App reads CQE: result={file: cap_index=0}, new_caps=[5]
# 2. Read 4096 bytes from offset 0
# App writes CALL SQE: {cap=5, method=READ, params={offset:0, length:4096}}
cap_enter(min_complete=1, timeout=MAX)
→ kernel routes CALL to FAT service
→ FAT service maps offset → cluster chain → LBA
→ FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
→ kernel routes to USB mass storage
→ mass storage submits CALL SQE: {cap=usb_cap, method=BULK_TRANSFER, params={scsi_cmd}}
→ kernel routes to xHCI driver
→ xHCI programs TRBs, waits for interrupt
← returns raw sector data
← returns sector data
← FAT service extracts file bytes, posts RETURN SQE with {data: [4096 bytes]}
This works but costs two round-trips where pipelining needs one. The synchronous path is useful for simple cases and bootstrapping; pipelining is the intended steady-state model.
In both cases, the intermediate IPC hops (FAT → USB mass storage → xHCI) are invisible to the App.
Capability-Native Store
The Store Capability
Once the system is running, persistent storage is provided by a userspace service — the store. It’s backed by a block device (virtio-blk), and exposes a content-addressed object store where objects are capnp messages.
interface Store {
# Store a capnp message, returns its content hash
put @0 (data :Data) -> (hash :Data);
# Retrieve by hash
get @1 (hash :Data) -> (data :Data);
# Check existence
has @2 (hash :Data) -> (exists :Bool);
# Delete (if caller has authority — see note below)
delete @3 (hash :Data) -> ();
}
Note on delete: In a content-addressed store, deleting a hash can break
references from other namespaces pointing to the same object. delete on the
base Store interface is dangerously broad — a StoreAdmin interface
(separate from Store) may be more appropriate, with delete restricted to a
GC service that can verify no live references exist. Open Question #3 (GC)
should be resolved before implementing delete. The attenuation table below
lists Store (full) as “Read, write, delete any object” — in practice, most
callers should receive a Store attenuated to put/get/has only.
Content-addressed means:
- Deduplication is automatic (same content = same hash)
- Integrity is verifiable (hash the data, compare)
- References between objects are just hashes embedded in capnp messages
- No mutable paths — “updating a file” means storing a new version and updating the reference
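The put/get/has semantics above can be sketched in memory. A 64-bit FNV-1a stands in for the real content hash (the proposal does not fix a hash function; a production store would use a cryptographic hash), and the `Store` struct here is an illustrative model, not the actual store service.

```rust
use std::collections::HashMap;

/// FNV-1a over the object bytes. Deterministic, so identical content
/// always maps to the same key — that collision IS the deduplication.
fn content_hash(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

#[derive(Default)]
struct Store {
    objects: HashMap<u64, Vec<u8>>,
}

impl Store {
    /// put: store bytes, return their content hash.
    fn put(&mut self, data: &[u8]) -> u64 {
        let hash = content_hash(data);
        // Same content → same hash → stored once (deduplication).
        self.objects.entry(hash).or_insert_with(|| data.to_vec());
        hash
    }
    /// get: retrieve by hash, re-checking integrity on the way out.
    fn get(&self, hash: u64) -> Option<&[u8]> {
        self.objects.get(&hash).map(|v| {
            debug_assert_eq!(content_hash(v), hash); // integrity check
            v.as_slice()
        })
    }
    fn has(&self, hash: u64) -> bool {
        self.objects.contains_key(&hash)
    }
}
```

"Updating a file" in this model is `put` of the new version followed by rebinding a name to the new hash in a Namespace; the old object is untouched.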
Mutable References: Namespaces
A Namespace capability provides mutable name-to-hash mappings on top of
the immutable store:
interface Namespace {
# Resolve a name to a store hash
resolve @0 (name :Text) -> (hash :Data);
# Bind a name to a hash (if caller has write authority)
bind @1 (name :Text, hash :Data) -> ();
# List names (if caller has list authority)
list @2 () -> (names :List(Text));
# Get a sub-namespace (attenuated — restricted to a prefix)
sub @3 (prefix :Text) -> (ns :Namespace);
}
A Namespace cap scoped to "config/" can only see and modify names under
that prefix. This is the analog of a chroot — but structural, not a kernel
hack. The sub() method returns a new Namespace cap via IPC cap transfer.
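The prefix scoping of sub() can be sketched as follows, assuming an in-process binding table (`Rc<RefCell<...>>` stands in for the store service's internal state; all names here are illustrative).

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;
use std::rc::Rc;

// Shared name→hash table, as the store service would hold internally.
type Bindings = Rc<RefCell<BTreeMap<String, u64>>>;

struct Namespace {
    bindings: Bindings,
    prefix: String, // every operation is silently confined under this
}

impl Namespace {
    fn root(bindings: Bindings) -> Self {
        Namespace { bindings, prefix: String::new() }
    }
    /// Attenuate: the returned cap can only see names under `prefix`.
    fn sub(&self, prefix: &str) -> Namespace {
        Namespace {
            bindings: self.bindings.clone(),
            prefix: format!("{}{}", self.prefix, prefix),
        }
    }
    fn bind(&self, name: &str, hash: u64) {
        self.bindings
            .borrow_mut()
            .insert(format!("{}{}", self.prefix, name), hash);
    }
    fn resolve(&self, name: &str) -> Option<u64> {
        self.bindings
            .borrow()
            .get(&format!("{}{}", self.prefix, name))
            .copied()
    }
    fn list(&self) -> Vec<String> {
        self.bindings
            .borrow()
            .keys()
            .filter_map(|k| k.strip_prefix(self.prefix.as_str()).map(str::to_string))
            .collect()
    }
}
```

Note that the confinement is structural: the sub-namespace holds no reference to its parent's prefix, so there is no "traverse upward" operation to forbid.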
Future: union composition. The research survey recommends
extending Namespace with Plan 9-inspired union semantics — a union(other, mode) method that merges two namespaces with before/after/replace ordering.
This adds composability without a global mount table. See
research.md §6.
IPC and Capability Transfer
Several storage operations return new capabilities: Directory.open()
returns a File, Directory.sub() returns a Directory, Namespace.sub()
returns a Namespace. This requires dynamic capability management — the kernel
must install new capabilities in a process’s CapTable at runtime as part of
IPC.
The Capability Ring
All kernel-userspace interaction goes through a shared-memory ring pair (submission queue + completion queue), inspired by io_uring. SQE opcodes map to capnp-rpc Level 1 message types. The ring is allocated per-process at spawn time and mapped into the process’s address space.
Syscall surface: 2 syscalls. New capabilities, operations, and transfer mechanisms are expressed as new SQE opcodes instead of expanding the syscall ABI.
| # | Syscall | Purpose |
|---|---|---|
| 1 | exit(code) | Terminate process |
| 2 | cap_enter(min_complete, timeout_ns) | Process pending SQEs, then wait until enough CQEs exist or the timeout expires |
Writing SQEs is syscall-free, but ordinary capability CALLs make progress
through cap_enter. Timer polling handles non-CALL ring work and only CALL
targets that explicitly opt into interrupt-context dispatch. cap_enter
flushes pending SQEs and can block the process until min_complete
completions are available or a finite timeout expires. An indefinite wait uses
timeout_ns = u64::MAX; timeout_ns = 0 keeps the call non-blocking. A future
SQPOLL-style worker can reintroduce a zero-syscall CALL-completion hot path
without running arbitrary capability methods from timer interrupt context.
The ring structs and synchronous CALL dispatch are implemented and working.
See capos-config/src/ring.rs for the shared ring structs and
kernel/src/cap/ring.rs for kernel-side processing.
Ring Layout
One 4 KiB page per process, mapped into both kernel (HHDM) and user space:
┌─────────────────────────┐ offset 0
│ Ring Header │ SQ/CQ head, tail, mask, flags
├─────────────────────────┤ offset 128
│ SQE Array (16 × 64B) │ submission queue entries
├─────────────────────────┤ offset 1152
│ CQE Array (32 × 32B) │ completion queue entries
└─────────────────────────┘
SQ: userspace owns tail (producer), kernel owns head (consumer)
CQ: kernel owns tail (producer), userspace owns head (consumer)
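The head/tail split above follows the usual single-producer/single-consumer ring discipline. A sketch of the index math, with illustrative field names (the real structs live in capos-config/src/ring.rs and would use atomics for cross-address-space visibility):

```rust
const SQ_ENTRIES: u32 = 16; // power of two, so mask = entries - 1

struct SubmissionQueue {
    head: u32, // consumer (kernel) advances this
    tail: u32, // producer (userspace) advances this
    mask: u32, // SQ_ENTRIES - 1
    entries: [u64; SQ_ENTRIES as usize], // stand-in for 64-byte SQEs
}

impl SubmissionQueue {
    fn new() -> Self {
        SubmissionQueue {
            head: 0,
            tail: 0,
            mask: SQ_ENTRIES - 1,
            entries: [0; SQ_ENTRIES as usize],
        }
    }
    // Indices run free and wrap; they are masked only on array access,
    // which lets `tail - head` distinguish full from empty.
    fn is_full(&self) -> bool {
        self.tail.wrapping_sub(self.head) == SQ_ENTRIES
    }
    /// Userspace side: publish one SQE (no syscall involved).
    fn push(&mut self, sqe: u64) -> Result<(), &'static str> {
        if self.is_full() {
            return Err("SQ full");
        }
        self.entries[(self.tail & self.mask) as usize] = sqe;
        self.tail = self.tail.wrapping_add(1);
        Ok(())
    }
    /// Kernel side: consume one SQE during cap_enter processing.
    fn pop(&mut self) -> Option<u64> {
        if self.head == self.tail {
            return None;
        }
        let sqe = self.entries[(self.head & self.mask) as usize];
        self.head = self.head.wrapping_add(1);
        Some(sqe)
    }
}
```

The CQ works identically with the roles reversed: the kernel produces at tail, userspace consumes at head.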
SQE Opcodes
Six opcodes handle everything — client calls, server dispatch, capability transfer, pipelining, timeouts, and lifecycle:
| Opcode | capnp-rpc analog | Purpose |
|---|---|---|
| CALL | Call | Invoke method on a capability |
| RETURN | Return | Respond to incoming call (server side) |
| RECV | (implicit) | Wait for incoming calls on Endpoint |
| RELEASE | Release | Drop a capability reference |
| FINISH | Finish | Release pipeline answer state |
| TIMEOUT | — | Post a CQE after N nanoseconds (io_uring-inspired) |
TIMEOUT is an alternative to the timeout_ns argument on cap_enter:
it works with zero-syscall polling (kernel fires the CQE on a timer tick)
and composes with LINK/DRAIN for deadline-based chains.
SQE flags: PIPELINE (cap_id is a promise reference), LINK (chain to
next SQE), MULTISHOT (keep generating CQEs), DRAIN (barrier).
Promise Pipelining
A CALL SQE can target either a concrete CapId or a PromisedAnswer
reference (via the PIPELINE flag + pipeline_dep/pipeline_field fields).
The kernel resolves the dependency chain internally:
SQE[0]: CALL dir.open("report.pdf") → user_data=100
SQE[1]: CALL [PIPELINE: dep=100, field=0].read(0, 4096) → user_data=101
One cap_enter call. The kernel dispatches SQE[0], extracts the File cap
from the result, and dispatches SQE[1] against it — all without returning
to userspace between steps.
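A toy resolver for that dependency chain might look like the following. `invoke` stands in for real capability dispatch (it returns the cap id carried in the call's result, e.g. the File cap that open() minted); all names are assumptions, not the kernel's actual code.

```rust
use std::collections::HashMap;

enum Target {
    Cap(u32),              // concrete CapId
    Pipeline { dep: u64 }, // PIPELINE flag: dep = user_data of earlier SQE
}

struct Sqe {
    user_data: u64,
    target: Target,
    method: &'static str,
}

/// Dispatch a batch in order, resolving Pipeline targets from the
/// answers of earlier entries — never returning "to userspace".
fn process_batch(
    sqes: &[Sqe],
    invoke: impl Fn(u32, &'static str) -> u32,
) -> Result<Vec<(u64, u32)>, &'static str> {
    let mut answers: HashMap<u64, u32> = HashMap::new(); // user_data → result cap
    let mut cqes = Vec::new();
    for sqe in sqes {
        let cap = match &sqe.target {
            Target::Cap(id) => *id,
            Target::Pipeline { dep } => {
                // The dependency must already have been dispatched.
                *answers.get(dep).ok_or("pipeline dep not yet answered")?
            }
        };
        let result_cap = invoke(cap, sqe.method);
        answers.insert(sqe.user_data, result_cap);
        cqes.push((sqe.user_data, result_cap));
    }
    Ok(cqes)
}
```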
The Endpoint Kernel Object
For cross-process IPC, an Endpoint connects client-side proxy caps to a server’s receive loop:
Client's CapTable Server's CapTable
┌─────────────────┐ ┌──────────────────┐
│ cap 2: Proxy │ │ cap 0: Endpoint │
│ → endpoint ────────── Endpoint ◄──── RECV SQE ──│ │
│ badge: 42 │ (kernel obj) │ │
└─────────────────┘ └──────────────────┘
The server posts a RECV SQE (with MULTISHOT flag). Incoming calls appear
as CQEs with badge, interface_id, method_id, and a kernel-assigned call_id.
The server responds by posting a RETURN SQE referencing the call_id.
interface_id is the transported schema ID for the interface being invoked.
It should equal the generated TYPE_ID for that capnp interface. cap_id is
the authority-bearing table handle; interface_id is only the protocol tag.
The target capability entry owns one public interface; method_id selects a
method inside that interface, while cap_id identifies the object being
invoked. If the same backing state needs another interface, the transport
should mint a separate capability entry for that interface rather than letting
one handle accept multiple unrelated interface_id values.
Direct-Switch IPC
When a client’s CALL targets a cap served by a blocked server (waiting on RECV), the kernel marks that server as the direct IPC handoff target so the next context-switch path runs the callee before unrelated round-robin work. The current implementation still uses the ordinary saved-context restore path; small-message register transfer remains a future fastpath after measurement. See research.md §2.
Capability Transfer via Ring
Capabilities travel as sideband arrays (CapTransferDescriptor) alongside capnp
message bytes:
- CALL params: the params buffer contains the capnp message bytes followed by xfer_cap_count transfer descriptors packed at addr + len, which must be aligned to CAP_TRANSFER_DESCRIPTOR_ALIGNMENT.
- RETURN results: server result buffers carry the capnp reply bytes and may carry return transfer descriptors at addr + len; the kernel inserts destination capability records in the caller's result buffer after the normal result bytes. The count is reported in the CQE cap_count field, and those records are written as CapTransferResult { cap_id, interface_id } values at result_addr + result. The requested result buffer (result_len) must be large enough for both the normal reply bytes and all appended cap_count records.
xfer_cap_count > 0 with malformed descriptor metadata (bad mode bits, reserved
bits, _reserved0, or misalignment) fails closed as
CAP_ERR_INVALID_TRANSFER_DESCRIPTOR. Kernels that have not yet enabled transfer
handling should return CAP_ERR_TRANSFER_NOT_SUPPORTED for transfer-bearing SQEs.
The capnp wire format’s WirePointerKind::Other encodes capability indices
in messages. The sideband arrays map these indices to actual CapIds. The
kernel does not parse capnp messages — it transfers a list of caps alongside
the opaque message bytes.
Dynamic Capability Management
Every open(), sub(), or resolve() creates and transfers a new
capability at runtime. The kernel’s CapTable insert() and remove() are
the primitives. Capabilities flow through RETURN SQE sideband arrays (and
through the manifest at boot). No separate cap_grant mechanism needed —
authority flow follows the ring’s IPC graph.
The CapTable generation counter handles stale references: when a File cap is
closed (slot freed, generation bumps), any cached CapId returns
StaleGeneration instead of accidentally hitting a new occupant.
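Generation-checked slots can be sketched like this (struct and method names are illustrative; only `StaleGeneration` and the insert/remove primitives are named in the text):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct CapId {
    slot: usize,
    generation: u32, // must match the slot's current generation
}

struct Slot {
    generation: u32,
    object: Option<&'static str>, // stand-in for a real CapObject ref
}

struct CapTable {
    slots: Vec<Slot>,
}

impl CapTable {
    fn new(n: usize) -> Self {
        CapTable {
            slots: (0..n).map(|_| Slot { generation: 0, object: None }).collect(),
        }
    }
    fn insert(&mut self, object: &'static str) -> CapId {
        let slot = self
            .slots
            .iter()
            .position(|s| s.object.is_none())
            .expect("table full");
        self.slots[slot].object = Some(object);
        CapId { slot, generation: self.slots[slot].generation }
    }
    fn remove(&mut self, id: CapId) {
        let s = &mut self.slots[id.slot];
        if s.generation == id.generation {
            s.object = None;
            s.generation += 1; // invalidates every outstanding CapId for this slot
        }
    }
    fn get(&self, id: CapId) -> Result<&'static str, &'static str> {
        let s = &self.slots[id.slot];
        match s.object {
            Some(obj) if s.generation == id.generation => Ok(obj),
            Some(_) => Err("StaleGeneration"),
            None => Err("EmptySlot"),
        }
    }
}
```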
Shared Memory for Bulk Data
Copying file data through capnp Data fields works for metadata and small
reads, but is impractical for anything above a few KB. A 1 MB read through
a capability CALL copies data four times: device → driver heap → capnp
message → kernel buffer → client buffer.
SharedBuffer Capability
A SharedBuffer (also called MemoryObject, listed in ROADMAP.md Stage 6)
is a kernel object backed by physical pages that can be mapped into multiple
address spaces simultaneously. Zero copies between processes.
interface SharedBuffer {
# Map into caller's address space (returns virtual address and size)
map @0 () -> (addr :UInt64, size :UInt64);
# Unmap from caller's address space
unmap @1 () -> ();
# Size of the buffer
size @2 () -> (bytes :UInt64);
}
The kernel creates SharedBuffer objects on request (via a kernel-provided
BufferAllocator capability). The pages are reference-counted — the buffer
persists as long as any process holds a cap to it.
File I/O with SharedBuffer
File and BlockDevice interfaces support both inline-Data and SharedBuffer modes:
# Small read (< ~4 KB): inline in capnp message
file.read(offset=0, length=256) → {data: [256 bytes]}
# Large read: caller provides SharedBuffer, server fills it
let buf = buf_alloc.create(1048576); # 1 MB SharedBuffer
file.readBuf(offset=0, buf, length=1048576) → {bytesRead: 1048576}
# Data is now in buf's mapped pages — no copy through kernel
Extended File interface with SharedBuffer support:
interface File {
read @0 (offset :UInt64, length :UInt32) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> (written :UInt32);
readBuf @2 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (bytesRead :UInt32);
writeBuf @3 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (written :UInt32);
stat @4 () -> (size :UInt64, created :UInt64, modified :UInt64);
truncate @5 (length :UInt64) -> ();
sync @6 () -> ();
close @7 () -> ();
}
The readBuf/writeBuf methods accept a SharedBuffer cap (transferred
via IPC). The server maps the buffer, performs DMA or memory copies into it,
then returns. The caller reads directly from the mapped pages.
For BlockDevice, the same pattern applies — the driver maps the SharedBuffer, programs DMA descriptors pointing to its physical pages, and the device writes directly into the shared memory.
When to Use Each Mode
| Scenario | Mechanism | Why |
|---|---|---|
| Reading a 64-byte config value | File.read() inline Data | Copy overhead negligible |
| Reading a 10 MB binary | File.readBuf() SharedBuffer | Avoids 4× copy overhead |
| FAT directory entry (32 bytes) | BlockDevice.readBlocks() inline | Small metadata read |
| Streaming video frames | File.readBuf() + ring of SharedBuffers | Continuous zero-copy |
| Network packet buffers | SharedBuffer ring between NIC driver and net stack | DMA-capable pages |
Attenuation
Storage services mint restricted capabilities using wrapper CapObjects:
| Capability | Authority |
|---|---|
| Directory (full) | Open, list, mkdir, remove, sub |
| Directory (read-only) | Open (returns read-only Files), list, sub only |
| File (full) | Read, write, truncate, sync |
| File (read-only) | Read and stat only |
| File (append-only) | Read, stat, write at end only |
| Store (full) | Read, write, delete any object |
| Store (read-only) | Get and has only |
| Namespace (full) | Resolve, bind, list under prefix |
| Namespace (read-only) | Resolve and list only |
| Blob (single object) | Read one specific hash |
| SharedBuffer (read-only) | Map as read-only (page table: R, no W) |
An application that only needs to read its config gets a read-only
Directory scoped to its config path. It can’t write, can’t see other
apps’ directories, can’t access the raw BlockDevice.
Naming Without Paths
Traditional OS: process opens /var/lib/myapp/data.db — a global path.
capOS: process receives a Directory or Namespace cap at spawn time,
opens "data.db" within it. The process has no idea where on disk this
lives. It can’t traverse upward. There is no global root.
# Traditional: global path namespace
/
├── etc/
│ └── myapp/
│ └── config.toml
├── var/
│ └── lib/
│ └── myapp/
│ └── data.db
└── sbin/
└── myapp
# capOS: per-process capability set (no global namespace)
Process "myapp" sees:
"config" → Directory(read-only, scoped to myapp's config files)
"data" → Directory(read-write, scoped to myapp's data files)
"state" → Namespace(read-write, scoped to myapp's store objects)
"log" → Console cap
"api" → HttpEndpoint cap
The process doesn’t know or care about the backing storage layout. It just uses the capabilities it was granted.
Configuration
Build-Time Config (Boot Manifest)
The system manifest is authored at build time. The human-writable source
could be any format — TOML, CUE, or even a Makefile target that generates
the capnp binary. What matters is that it compiles to a SystemManifest
capnp message baked into the ISO.
Example source (TOML, compiled to capnp by a build tool):
[services.virtio-net]
binary = "virtio-net"
restart = "always"
caps = [
{ name = "device_mmio", source = { kernel = "device_mmio" } },
{ name = "interrupt", source = { kernel = "interrupt" } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["nic"]
[services.net-stack]
binary = "net-stack"
restart = "always"
caps = [
{ name = "nic", source = { service = { service = "virtio-net", export = "nic" } } },
{ name = "timer", source = { kernel = "timer" } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["net"]
[services.fat-fs]
binary = "fat-fs"
restart = "always"
caps = [
{ name = "blk", source = { service = { service = "usb-storage", export = "block-device" } } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["root-dir"]
[services.my-app]
binary = "my-app"
restart = "on-failure"
caps = [
{ name = "api", source = { service = { service = "http-service", export = "api" } } },
{ name = "docs", source = { service = { service = "fat-fs", export = "root-dir" } } },
{ name = "data", source = { service = { service = "store", export = "namespace" } } },
{ name = "log", source = { kernel = "console" } },
]
A build tool validates this against the capnp schemas (does virtio-net
actually export "nic"? does http-service support endpoint() minting?)
and produces the binary manifest.
Runtime Config (via Store)
Once the store service is running, configuration can be stored there and updated without rebuilding the ISO. The store is just another capability — a config-management service could watch for changes and signal services to reload.
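One way such a config manager could detect changes, sketched under the assumption that config objects are content-addressed (function names here are invented):

```rust
// DefaultHasher stands in for a real content hash like BLAKE3.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn content_hash(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

// A config-management loop would compare the hash a service last loaded
// against the hash currently in the store, and signal a reload on change.
fn needs_reload(last_seen: u64, current: &[u8]) -> bool {
    content_hash(current) != last_seen
}

fn main() {
    let v1 = b"port = 80";
    let v2 = b"port = 8080";
    let seen = content_hash(v1);
    assert!(!needs_reload(seen, v1));
    assert!(needs_reload(seen, v2));
}
```

Content addressing makes change detection trivial: a new config is by definition a new hash, so there is no modification-time bookkeeping.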
Connection to Network Transparency
If capabilities are the only abstraction, and capnp is the only wire format, then the transport is irrelevant:
- Local IPC: capnp message copied between address spaces by kernel
- Local store: capnp message written to block device
- Remote IPC: capnp message sent over TCP to another machine
- Remote store: capnp message fetched from a remote store service
A capability reference doesn’t encode where the backing service lives. The kernel (or a proxy) handles routing. This means:
- A `Directory` cap could be backed by local FAT or a remote 9P server
- A `Namespace` cap could be backed by local storage or a remote store
- A `Fetch` cap could route through a local HTTP service or a remote proxy
- A `ProcessSpawner` cap could spawn locally or on a remote machine
The system manifest could describe services that run on different machines, and the capability graph spans the network. This is the “network transparency” item in the roadmap — it falls out naturally from the model.
Persistence of the Capability Graph
The live capability graph (which process holds which caps) is ephemeral — it exists in kernel memory and is lost on reboot. The system manifest describes the intended graph, and init rebuilds it on each boot.
For true persistence (resume after reboot without re-initializing):
- Each service serializes its state to the store before shutdown
- On next boot, the manifest includes “restore from store hash X” hints
- Services read their saved state from the store and resume
This is application-level persistence, not kernel-level. The kernel doesn’t snapshot the capability graph — services are responsible for their own state. This avoids the complexity of EROS-style transparent persistence while still allowing stateful services.
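The save/restore flow above can be sketched with a toy content-addressed store. This is illustrative: the real Store service would persist to a BlockDevice and speak capnp, and `DefaultHasher` stands in for a real content hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Default)]
struct Store {
    objects: HashMap<u64, Vec<u8>>,
}

impl Store {
    // put() returns the content hash; this is what the manifest's
    // "restore from store hash X" hint would carry across the reboot.
    fn put(&mut self, bytes: Vec<u8>) -> u64 {
        let mut h = DefaultHasher::new();
        bytes.hash(&mut h);
        let hash = h.finish();
        self.objects.insert(hash, bytes);
        hash
    }
    fn get(&self, hash: u64) -> Option<&Vec<u8>> {
        self.objects.get(&hash)
    }
}

fn main() {
    let mut store = Store::default();

    // Before shutdown: the service serializes its state, records the hash.
    let state = b"counter=42".to_vec();
    let hint = store.put(state.clone());

    // Next boot: the manifest carries the hint; the service restores from it.
    let restored = store.get(hint).expect("restore hint must resolve");
    assert_eq!(restored, &state);
}
```

Because the store is content-addressed, a stale or tampered hint simply fails to resolve instead of restoring the wrong state.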
Phases
Phase 1: Boot Manifest (parallel with Stage 4)
- Define `SystemManifest` schema in `schema/`
- Build tool (`tools/mkmanifest`) that compiles `system.cue` into a capnp-encoded manifest and packs it into the ISO as a Limine module
- Kernel parses the manifest and creates one process per `ServiceEntry`
- Kernel passes the manifest to init as bytes or a `Manifest` capability without interpreting the child service graph
- Init becomes a generic manifest executor instead of a demo parser for the `system-spawn.cue` transition path
- No persistent storage yet — boot image is the only data source
Phase 2: File I/O Interfaces in Schema (parallel with Stage 6)
Depends on: IPC (Stage 6) for cross-process cap transfer.
- Add `BlockDevice`, `File`, `Directory`, `DirEntry`, `SharedBuffer` to `schema/capos.capnp`
- Implement kernel Endpoint and RECV/RETURN SQE opcodes
- Capability transfer in IPC replies (RETURN SQE xfer_caps installs caps in caller’s table)
- Demo: two-process file server (in-memory File/Directory service + client)
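The two-process demo could be shaped roughly like this, sketched in a single process with plain Rust enums standing in for capnp-encoded messages and cap transfer (all names are illustrative):

```rust
use std::collections::HashMap;

enum Request {
    Open(String),
    Read { file: u32 },
}

enum Reply {
    Handle(u32), // in the real system this would be a transferred File cap
    Data(Vec<u8>),
    Error(&'static str),
}

struct FileService {
    files: HashMap<String, Vec<u8>>,
    open: Vec<Vec<u8>>, // handle index -> contents
}

impl FileService {
    // In the demo this runs behind an Endpoint: RECV a Request,
    // RETURN a Reply (with xfer_caps for the Handle case).
    fn handle(&mut self, req: Request) -> Reply {
        match req {
            Request::Open(name) => match self.files.get(&name) {
                Some(data) => {
                    self.open.push(data.clone());
                    Reply::Handle(self.open.len() as u32 - 1)
                }
                None => Reply::Error("no such entry"),
            },
            Request::Read { file } => match self.open.get(file as usize) {
                Some(data) => Reply::Data(data.clone()),
                None => Reply::Error("bad handle"),
            },
        }
    }
}

fn main() {
    let mut svc = FileService {
        files: HashMap::from([("hello.txt".to_string(), b"hi".to_vec())]),
        open: Vec::new(),
    };
    let file = match svc.handle(Request::Open("hello.txt".into())) {
        Reply::Handle(h) => h,
        _ => panic!("open failed"),
    };
    match svc.handle(Request::Read { file }) {
        Reply::Data(d) => assert_eq!(d, b"hi".to_vec()),
        _ => panic!("read failed"),
    }
}
```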
Phase 3: RAM-backed Store (after Phase 2)
Depends on: IPC (Stage 6) for cross-process store access.
- Implement `Store` and `Namespace` as a userspace service
- Backed by RAM (no disk driver yet, data lost on reboot)
- Services can store and retrieve capnp objects at runtime
- Demonstrates the naming model without requiring a block device driver
- `Namespace.sub()` returns new caps via IPC cap transfer
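The scoping behavior of `sub()` can be sketched as a prefix filter over a flat name table (illustrative only: the real Namespace would return a fresh cap via IPC cap transfer rather than a cloned struct):

```rust
use std::collections::HashMap;

// Names map to content hashes; sub() hands out a view scoped to one
// subtree. Names are literal keys, so there is no ".." to normalize away.
#[derive(Clone)]
struct Namespace {
    prefix: String,
    entries: HashMap<String, u64>, // name -> store hash
}

impl Namespace {
    fn resolve(&self, name: &str) -> Option<u64> {
        self.entries.get(&format!("{}{}", self.prefix, name)).copied()
    }
    // Returns a namespace that can only see names under `dir/`.
    fn sub(&self, dir: &str) -> Namespace {
        Namespace {
            prefix: format!("{}{}/", self.prefix, dir),
            entries: self.entries.clone(),
        }
    }
}

fn main() {
    let mut entries = HashMap::new();
    entries.insert("myapp/state".to_string(), 0xAB);
    entries.insert("other/secret".to_string(), 0xCD);
    let root = Namespace { prefix: String::new(), entries };

    let scoped = root.sub("myapp");
    assert_eq!(scoped.resolve("state"), Some(0xAB));
    // The scoped cap cannot reach outside its subtree.
    assert_eq!(scoped.resolve("../other/secret"), None);
    assert_eq!(scoped.resolve("other/secret"), None);
}
```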
Phase 4: BlockDevice Drivers and Filesystem (after virtio infrastructure)
- virtio-blk driver (userspace, reuses virtqueue infrastructure from networking smoke test)
- `BlockDevice` trait implementation
- FAT filesystem service: wraps BlockDevice, exports Directory/File caps
- SharedBuffer integration for bulk reads (depends on Stage 6 MemoryObject)
- Store service uses BlockDevice for persistence
- System state survives reboot via store + manifest restore hints
Phase 5: Network Store (after networking)
- Store service can replicate to or fetch from a remote store
- Capability references transparently span machines
- Directory cap backed by a remote filesystem (9P-style)
Relationship to Other Proposals
- Networking proposal — the NIC driver and net stack are services described in the manifest, not hardcoded. The store could be backed by network storage once networking works. A remote Directory cap (9P over capnp) reuses the same File/Directory interfaces.
- Service architecture proposal — the manifest replaces code-as-config for init. ProcessSpawner, supervision, and cap export work as described there, but driven by manifest data instead of compiled Rust code. IPC Endpoints are the mechanism for service export.
- Capability model — IPC cap transfer (Endpoint + RETURN SQE) is the mechanism that makes `open()` and `resolve()` work. SharedBuffer is the bulk data path that makes file I/O practical. Both are tracked in ROADMAP.md Stage 6.
Open Questions
- Manifest validation. How much can the build tool verify statically? Cap export names depend on runtime behavior of services. Should services declare their exports in their own metadata (like a package manifest)?
- Schema evolution. When a service’s capnp interface changes, stored objects referencing the old schema need migration. Cap’n Proto has backwards-compatible schema evolution, but breaking changes need a story.
- Garbage collection. Content-addressed store accumulates unreferenced objects. Who GCs? A separate service with `Store` read + delete authority? Reference counting in the namespace layer?
- Large objects. Storing multi-megabyte binaries as single capnp `Data` fields is wasteful (capnp allocates contiguously). SharedBuffer partially addresses this for I/O, but the Store’s `put`/`get` interface still takes `Data`. Options: chunked storage (a Merkle tree of chunk hashes), a streaming `Blob` interface, or SharedBuffer-aware Store methods.
- Trust model for the manifest. The boot manifest has full authority to define the system. Who signs it? How do you prevent a tampered ISO from granting excessive caps? Secure boot integration?
- File locking and concurrent access. Multiple processes opening the same file through the same filesystem service need coordination. Options: mandatory locking in the filesystem service (rejects conflicting opens), advisory locking via a separate `Lock` capability, or single-writer enforcement at the Directory level (open with an exclusive flag).
- RETURN+RECV atomicity. When a server posts a RETURN SQE followed by a RECV SQE, there must be no window where a client call can arrive but the server isn’t listening. SQE LINK chaining (RETURN → RECV) should provide this atomicity — the kernel processes both SQEs as a unit.
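The chunked-storage option from the large-objects question can be sketched as a one-level hash tree: the blob is split into fixed-size chunks, each stored under its own hash, and the root object is the list of chunk hashes, itself content-addressed. Illustrative only; `DefaultHasher` stands in for a real content hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn hash(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn put_chunked(store: &mut HashMap<u64, Vec<u8>>, blob: &[u8], chunk: usize) -> u64 {
    // Store each chunk under its own content hash.
    let hashes: Vec<u64> = blob
        .chunks(chunk)
        .map(|c| {
            let h = hash(c);
            store.insert(h, c.to_vec());
            h
        })
        .collect();
    // Root node: the concatenated child hashes, itself content-addressed.
    let root_bytes: Vec<u8> = hashes.iter().flat_map(|h| h.to_le_bytes()).collect();
    let root = hash(&root_bytes);
    store.insert(root, root_bytes);
    root
}

fn get_chunked(store: &HashMap<u64, Vec<u8>>, root: u64) -> Option<Vec<u8>> {
    let root_bytes = store.get(&root)?;
    let mut out = Vec::new();
    for h in root_bytes.chunks(8) {
        let h = u64::from_le_bytes(h.try_into().ok()?);
        out.extend_from_slice(store.get(&h)?);
    }
    Some(out)
}

fn main() {
    let mut store = HashMap::new();
    let blob: Vec<u8> = (0u8..=255).cycle().take(10_000).collect();
    let root = put_chunked(&mut store, &blob, 4096);
    assert_eq!(get_chunked(&store, root), Some(blob));
}
```

A side benefit: identical chunks across objects deduplicate for free, and a streaming `Blob` interface could fetch chunks lazily instead of materializing the whole object as one capnp `Data` field.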