Capability Ring
The capability ring is the userspace-to-kernel transport for capability invocation. It avoids one syscall per operation while preserving a typed Cap’n Proto method boundary and explicit completion reporting.
Status: Implemented. The shared-memory ring, cap_enter, CALL/RECV/RETURN/RELEASE/NOP dispatch,
structured transport errors, bounded scratch buffers, and Loom ring model are
implemented. FINISH, promise pipelining, multishot, link, drain, and SQPOLL
remain future work.
Current Behavior
Each non-idle process gets one 4 KiB ring page mapped at RING_VADDR. The page
contains a volatile header, a 16-entry submission queue, and a 32-entry
completion queue. Userspace writes CapSqe records, advances sq_tail, and
uses cap_enter(min_complete, timeout_ns) to make ordinary calls progress.
sequenceDiagram
participant U as Userspace runtime
participant R as Ring page
participant K as Kernel ring dispatcher
participant C as Capability object
U->>R: write CapSqe and advance sq_tail
U->>K: cap_enter(min_complete, timeout_ns)
K->>R: read sq_head..sq_tail
K->>K: validate SQE fields and user buffers
K->>C: call method or endpoint operation
C-->>K: completion, pending, or error
K->>R: write CapCqe and advance cq_tail
K-->>U: return available CQE count
U->>R: read matching CapCqe
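The submission half of this flow can be modelled in a few lines. The following is a minimal single-process sketch, not the project's ring code: the `Sqe` field names, the free-running (unmasked) head and tail counters, and the memory orderings are assumptions made for illustration. Only the 16-entry depth and the 64-byte record size come from the text.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Queue depth from the text; a power of two so that masking works.
pub const SQ_ENTRIES: u32 = 16;
const SQ_MASK: u32 = SQ_ENTRIES - 1;

/// Hypothetical stand-in for the 64-byte `CapSqe`; field names are assumed.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct Sqe {
    pub opcode: u8,
    pub pad: [u8; 7],
    pub user_data: u64,
    pub rest: [u8; 48],
}

// The only layout fact taken from the text: an SQE is exactly 64 bytes.
const _: () = assert!(std::mem::size_of::<Sqe>() == 64);

impl Sqe {
    pub fn nop(user_data: u64) -> Self {
        Sqe { opcode: 0, pad: [0; 7], user_data, rest: [0; 48] }
    }
}

/// Single-process model of the submission queue (not real shared memory).
pub struct SubmissionQueue {
    sq_head: AtomicU32, // advanced by the consumer (the kernel side)
    sq_tail: AtomicU32, // advanced by the producer (userspace)
    entries: [Sqe; SQ_ENTRIES as usize],
}

impl SubmissionQueue {
    pub fn new() -> Self {
        SubmissionQueue {
            sq_head: AtomicU32::new(0),
            sq_tail: AtomicU32::new(0),
            entries: [Sqe::nop(0); SQ_ENTRIES as usize],
        }
    }

    /// Producer side: write the record, then publish it by advancing the tail.
    pub fn push(&mut self, sqe: Sqe) -> Result<(), Sqe> {
        let tail = self.sq_tail.load(Ordering::Relaxed);
        let head = self.sq_head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) >= SQ_ENTRIES {
            return Err(sqe); // queue full: hand the record back
        }
        self.entries[(tail & SQ_MASK) as usize] = sqe;
        // Release makes the record visible before the new tail value.
        self.sq_tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    /// Consumer side: read the range sq_head..sq_tail one entry at a time.
    pub fn pop(&mut self) -> Option<Sqe> {
        let head = self.sq_head.load(Ordering::Relaxed);
        let tail = self.sq_tail.load(Ordering::Acquire);
        if head == tail {
            return None; // nothing published
        }
        let sqe = self.entries[(head & SQ_MASK) as usize];
        self.sq_head.store(head.wrapping_add(1), Ordering::Release);
        Some(sqe)
    }
}
```

In this model a seventeenth `push` without an intervening `pop` returns the record as `Err`, which mirrors the bounded queue depth. Publishing several records before a single drain is what makes submission batchable.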
Timer polling also processes each current process’s ring before preemption, but
only non-CALL operations and CALL targets that explicitly allow interrupt
dispatch may run there. Ordinary CALLs wait for cap_enter.
Why ordinary CALL waits for cap_enter: Submitting a CALL SQE is only a shared-memory write. The kernel still needs a safe execution point to drain the ring and run capability code. Timer polling runs in interrupt context, so it must not execute arbitrary capability methods that may allocate, block on locks, mutate page tables, spawn processes, parse Cap’n Proto messages, or perform IPC side effects. cap_enter is the normal process-context drain point: it processes pending SQEs, posts CQEs, and then either returns the available completion count or blocks until enough completions arrive. The design keeps SQE publication syscall-free and batchable, keeps the syscall ABI limited to exit and cap_enter, and avoids turning the timer interrupt into a general capability executor. A future SQPOLL-style path can remove the explicit syscall from the hot path only by running dispatch in a worker context, not from arbitrary timer interrupt execution.
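The context rule described above reduces to a small predicate. This is an illustrative sketch only: the `Op` and `Context` enums and the `target_allows_interrupt` parameter are hypothetical names, not the kernel's types.

```rust
/// Implemented ring operations named in the text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Op {
    Call,
    Recv,
    Return,
    Release,
    Nop,
}

/// Where the dispatcher is running (hypothetical enum for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Context {
    /// Timer polling before preemption: interrupt context.
    TimerInterrupt,
    /// The cap_enter syscall: ordinary process context.
    CapEnter,
}

/// Returns true when `op` may be dispatched in `ctx`.
/// `target_allows_interrupt` models a CALL target's explicit opt-in.
pub fn may_dispatch(op: Op, ctx: Context, target_allows_interrupt: bool) -> bool {
    match (ctx, op) {
        // Process context may run every implemented operation.
        (Context::CapEnter, _) => true,
        // In interrupt context a CALL runs only if its target opted in.
        (Context::TimerInterrupt, Op::Call) => target_allows_interrupt,
        // Non-CALL operations may run from timer polling.
        (Context::TimerInterrupt, _) => true,
    }
}
```

Under this sketch an ordinary CALL submitted between timer ticks stays in the queue until the process calls cap_enter, while a NOP or RELEASE can complete from the timer path.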
Design
CapSqe is a fixed 64-byte ABI record. CAP_OP_CALL names a local cap-table
slot and method ID plus parameter/result buffers. CAP_OP_RECV and
CAP_OP_RETURN implement endpoint IPC. CAP_OP_RELEASE removes a local
cap-table slot through the transport. CAP_OP_NOP measures the fixed ring
path. CAP_OP_FINISH is ABI-reserved and currently returns
CAP_ERR_UNSUPPORTED_OPCODE.
The kernel copies user params into preallocated per-process scratch, dispatches
capability methods, serializes results directly into caller-provided result
buffers, and posts CapCqe. A successful method returns non-negative bytes
written. Transport failures are negative CAP_ERR_* codes. Application
exceptions are serialized CapException payloads with
CAP_ERR_APPLICATION_EXCEPTION.
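A runtime might classify a completion's result value along these lines. This is a sketch under assumptions: the result is taken to be an `i32`, and the numeric value given to `CAP_ERR_APPLICATION_EXCEPTION` is a placeholder, since the real constants live in the shared ring ABI.

```rust
/// Placeholder value for illustration; not the ABI's actual constant.
pub const CAP_ERR_APPLICATION_EXCEPTION: i32 = -2;

/// Three outcomes a caller has to distinguish.
#[derive(Debug, PartialEq, Eq)]
pub enum Completion {
    /// The method succeeded and wrote this many bytes to the result buffer.
    Success { bytes_written: usize },
    /// The method raised an exception; a serialized CapException payload
    /// accompanies this code and needs separate decoding.
    ApplicationException,
    /// Any other negative CAP_ERR_* code is a transport failure.
    TransportError(i32),
}

/// Classify the signed result of a completion (assumed to be an i32).
pub fn classify(result: i32) -> Completion {
    if result >= 0 {
        Completion::Success { bytes_written: result as usize }
    } else if result == CAP_ERR_APPLICATION_EXCEPTION {
        Completion::ApplicationException
    } else {
        Completion::TransportError(result)
    }
}
```

Keeping application exceptions distinct from transport errors lets a caller tell "the method ran and failed" apart from "the call never reached the method".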
Transfer-bearing CALL and RETURN SQEs pack CapTransferDescriptor records
after the params/result payload. Successful result-cap transfers append
CapTransferResult records after normal result bytes.
Promise-pipelined CALLs remain rejected by current kernels. When the
pipelining flag is set, pipeline_dep names a process-local promised-answer identifier, and
pipeline_field selects a zero-based CapTransferResult record from that
answer’s completion. It is not a Cap’n Proto schema field number or payload
path. The kernel resolves dependencies only through the sideband result-cap
records it already owns; normal result bytes stay opaque to the transport.
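Because pipelining is not implemented yet, the following only illustrates the intended meaning of pipeline_field as an ordinal. The `TransferResult` struct, its `cap_slot` field, and the `u16` width of the selector are all hypothetical.

```rust
/// Hypothetical shape of one sideband result-cap record.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TransferResult {
    pub cap_slot: u32,
}

/// Select the result cap that a pipelined CALL depends on.
/// `pipeline_field` is a zero-based index into the sideband records,
/// not a Cap'n Proto field number, so no message parsing is involved.
pub fn select_result_cap(
    records: &[TransferResult],
    pipeline_field: u16,
) -> Option<TransferResult> {
    records.get(usize::from(pipeline_field)).copied()
}
```

An out-of-range ordinal yields `None` here; how a kernel would report that case to the caller is not specified in the text.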
Future behavior should use the reserved SQE fields for system transport features, not ad hoc per-interface extensions.
Invariants
- SQ and CQ sizes are powers of two and fixed by the ABI.
- Unknown opcodes fail closed; FINISH is reserved, not silently accepted.
- Reserved fields must be zero for currently implemented opcodes.
- cap_enter rejects min_complete > CQ_ENTRIES.
- User buffers must be in user address space with page permissions matching read/write intent before copy.
- Timer dispatch must not run capabilities that allocate, block on locks, or mutate page tables unless the cap explicitly opts in.
- Per-dispatch SQE processing is bounded by SQ_ENTRIES.
- Transfer descriptors must be aligned, valid, and bounded by MAX_TRANSFER_DESCRIPTORS.
- Promise-pipelined dependency resolution must use sideband CapTransferResult ordinals, never general Cap’n Proto result traversal in the kernel.
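Three of these invariants (fail-closed opcodes, zeroed reserved fields, and the min_complete bound) can be sketched as plain validation functions. The opcode numbering, the `Reject` enum, and the byte-slice view of the reserved fields are assumptions for illustration; only CQ_ENTRIES = 32 comes from the text.

```rust
/// Completion queue depth from the text.
pub const CQ_ENTRIES: u32 = 32;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Opcode {
    Nop,
    Call,
    Recv,
    Return,
    Release,
}

/// Hypothetical rejection reasons; the real ABI uses CAP_ERR_* codes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Reject {
    UnsupportedOpcode,
    ReservedFieldNonZero,
    MinCompleteTooLarge,
}

/// Decode a raw opcode byte. The numbering here is assumed.
pub fn decode_opcode(raw: u8) -> Result<Opcode, Reject> {
    match raw {
        0 => Ok(Opcode::Nop),
        1 => Ok(Opcode::Call),
        2 => Ok(Opcode::Recv),
        3 => Ok(Opcode::Return),
        4 => Ok(Opcode::Release),
        // FINISH and every unknown value fail closed.
        _ => Err(Reject::UnsupportedOpcode),
    }
}

/// Validate the opcode first, then require all reserved bytes to be zero.
pub fn validate_sqe(raw_opcode: u8, reserved: &[u8]) -> Result<Opcode, Reject> {
    let op = decode_opcode(raw_opcode)?;
    if reserved.iter().any(|&b| b != 0) {
        return Err(Reject::ReservedFieldNonZero);
    }
    Ok(op)
}

/// cap_enter cannot wait for more completions than the CQ can hold.
pub fn validate_cap_enter(min_complete: u32) -> Result<(), Reject> {
    if min_complete > CQ_ENTRIES {
        Err(Reject::MinCompleteTooLarge)
    } else {
        Ok(())
    }
}
```

Rejecting non-zero reserved bytes today is what keeps those fields available for later transport features without breaking existing callers.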
Code Map
- capos-config/src/ring.rs - shared ring ABI, opcodes, errors, SQE/CQE structs, endpoint message headers, transfer records.
- kernel/src/cap/ring.rs - kernel dispatcher, SQE validation, CQE posting, cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.
- kernel/src/arch/x86_64/syscall.rs - cap_enter syscall.
- kernel/src/sched.rs - timer polling, cap-enter blocking, direct IPC wake.
- kernel/src/process.rs - ring page allocation and mapping.
- capos-rt/src/ring.rs - runtime ring client, pending calls, transfer packing, result-cap parsing.
- capos-rt/src/entry.rs - single-owner runtime ring client token and release queue flushing.
- capos-config/tests/ring_loom.rs - bounded producer/consumer model.
Validation
- cargo test-ring-loom validates SQ/CQ producer-consumer behavior, capacity, FIFO, CQ overflow/drop behavior, and corrupted SQ recovery.
- make run exercises Console CALLs, reserved opcode rejection, ring corruption recovery, NOP, fairness, transfers, and endpoint IPC.
- make run-measure exercises measurement-only counters. The old NullCap baseline needs a follow-up init-owned grant path before it can be treated as covered again.
- cargo test-config covers shared ring layout and helper invariants.
- make capos-rt-check checks userspace runtime ring code under the bare-metal target.
Open Work
- Implement CAP_OP_FINISH as part of the system Cap’n Proto transport.
- Implement promise pipelining using the reserved pipeline_dep answer ID and pipeline_field result-cap ordinal mapping.
- Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
- Add SQPOLL after SMP gives the kernel a spare execution context.