Capability Ring

The capability ring is the userspace-to-kernel transport for capability invocation. It avoids one syscall per operation while preserving a typed Cap’n Proto method boundary and explicit completion reporting.

Status: Implemented. The shared-memory ring, cap_enter, CALL/RECV/RETURN/RELEASE/NOP dispatch, structured transport errors, bounded scratch buffers, and Loom ring model are implemented. FINISH, promise pipelining, multishot, link, drain, and SQPOLL remain future work.

Current Behavior

Each non-idle process gets one 4 KiB ring page mapped at RING_VADDR. The page contains a volatile header, a 16-entry submission queue, and a 32-entry completion queue. Userspace writes CapSqe records, advances sq_tail, and uses cap_enter(min_complete, timeout_ns) to make ordinary calls progress.

sequenceDiagram
    participant U as Userspace runtime
    participant R as Ring page
    participant K as Kernel ring dispatcher
    participant C as Capability object
    U->>R: write CapSqe and advance sq_tail
    U->>K: cap_enter(min_complete, timeout_ns)
    K->>R: read sq_head..sq_tail
    K->>K: validate SQE fields and user buffers
    K->>C: call method or endpoint operation
    C-->>K: completion, pending, or error
    K->>R: write CapCqe and advance cq_tail
    K-->>U: return available CQE count
    U->>R: read matching CapCqe
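The submission half of this flow can be modeled in a few lines. The following is a minimal in-process sketch, not the capOS implementation: it assumes free-running `u32` head and tail counters that are masked on access, and the `Sqe` and `SubmissionQueue` types and their fields are hypothetical stand-ins for the real ring page structures.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Queue depth stated in the text; a power of two, so masking works.
pub const SQ_ENTRIES: u32 = 16;

/// Hypothetical stand-in for the 64-byte CapSqe record.
#[derive(Clone, Copy, Default)]
pub struct Sqe {
    pub opcode: u8,
    pub user_data: u64,
}

/// Simplified model of the submission queue. In the real ring the producer
/// (userspace) and consumer (kernel) share a page; here both sides live in
/// one struct so the index arithmetic can be exercised directly.
pub struct SubmissionQueue {
    head: AtomicU32, // advanced by the consumer
    tail: AtomicU32, // advanced by the producer
    slots: [Sqe; SQ_ENTRIES as usize],
}

impl SubmissionQueue {
    pub fn new() -> Self {
        Self {
            head: AtomicU32::new(0),
            tail: AtomicU32::new(0),
            slots: [Sqe::default(); SQ_ENTRIES as usize],
        }
    }

    /// Producer side: fill the slot first, then publish it by advancing tail.
    /// Returns the entry back to the caller if the queue is full.
    pub fn push(&mut self, sqe: Sqe) -> Result<(), Sqe> {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == SQ_ENTRIES {
            return Err(sqe);
        }
        self.slots[(tail & (SQ_ENTRIES - 1)) as usize] = sqe;
        // Release ordering makes the slot contents visible before the new tail.
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    /// Consumer side: read the oldest entry in head..tail, then advance head.
    pub fn pop(&mut self) -> Option<Sqe> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let sqe = self.slots[(head & (SQ_ENTRIES - 1)) as usize];
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(sqe)
    }
}
```

Because publication is only a store to `tail`, several entries can be queued back to back before a single `cap_enter` call drains them.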

Timer polling also processes each current process’s ring before preemption, but only non-CALL operations and CALL targets that explicitly allow interrupt dispatch may run there. Ordinary CALLs wait for cap_enter.

Why ordinary CALL waits for cap_enter: Submitting a CALL SQE is only a shared-memory write. The kernel still needs a safe execution point to drain the ring and run capability code. Timer polling runs in interrupt context, so it must not execute arbitrary capability methods that may allocate, block on locks, mutate page tables, spawn processes, parse Cap’n Proto messages, or perform IPC side effects. cap_enter is the normal process-context drain point: it processes pending SQEs, posts CQEs, and then either returns the available completion count or blocks until enough completions arrive. This design keeps SQE publication syscall-free and batchable, keeps the syscall ABI limited to exit and cap_enter, and avoids turning the timer interrupt into a general capability executor. A future SQPOLL-style path could remove the explicit syscall from the hot path, but only by running dispatch in a worker context, never in the timer interrupt.
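The split between the two drain points can be expressed as a small predicate. This is a sketch of the rule as described above, not the kernel's dispatcher; the `Opcode` and `DispatchContext` enums, the `may_dispatch` function, and the `target_allows_interrupt` flag are hypothetical names introduced for illustration.

```rust
/// Opcodes the text lists as implemented.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Opcode {
    Call,
    Recv,
    Return,
    Release,
    Nop,
}

/// Where the ring is being drained from (hypothetical naming).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DispatchContext {
    /// Timer polling, running in interrupt context.
    TimerInterrupt,
    /// The cap_enter syscall, running in process context.
    CapEnter,
}

/// Returns true if an SQE with this opcode may be processed in `ctx`.
/// `target_allows_interrupt` models a CALL target that has explicitly
/// opted in to interrupt dispatch; it is ignored for other opcodes.
pub fn may_dispatch(op: Opcode, ctx: DispatchContext, target_allows_interrupt: bool) -> bool {
    match (ctx, op) {
        // cap_enter is the general drain point: everything may run here.
        (DispatchContext::CapEnter, _) => true,
        // In the timer path, a CALL runs only if its target opted in.
        (DispatchContext::TimerInterrupt, Opcode::Call) => target_allows_interrupt,
        // Non-CALL operations may run from timer polling.
        (DispatchContext::TimerInterrupt, _) => true,
    }
}
```

An ordinary CALL that is skipped in the timer path stays in the submission queue until the process calls `cap_enter`.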

Design

CapSqe is a fixed 64-byte ABI record. CAP_OP_CALL names a local cap-table slot and method ID plus parameter/result buffers. CAP_OP_RECV and CAP_OP_RETURN implement endpoint IPC. CAP_OP_RELEASE removes a local cap-table slot through the transport. CAP_OP_NOP measures the fixed ring path. CAP_OP_FINISH is ABI-reserved and currently returns CAP_ERR_UNSUPPORTED_OPCODE.
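To make the fixed-size record concrete, here is one possible 64-byte layout with a compile-time size check. Only the 64-byte size, the opcode names, `pipeline_dep`, `pipeline_field`, and `CAP_ERR_UNSUPPORTED_OPCODE` come from this document; the field order, field widths, opcode numbering, and error value are assumptions made for the sketch, and the authoritative definition lives in capos-config/src/ring.rs.

```rust
use core::mem::size_of;

// Opcode numbering is hypothetical; only the names come from the text.
pub const CAP_OP_NOP: u8 = 0;
pub const CAP_OP_CALL: u8 = 1;
pub const CAP_OP_RECV: u8 = 2;
pub const CAP_OP_RETURN: u8 = 3;
pub const CAP_OP_RELEASE: u8 = 4;
pub const CAP_OP_FINISH: u8 = 5;

// Hypothetical numeric value for the error named in the text.
pub const CAP_ERR_UNSUPPORTED_OPCODE: i32 = -1;

/// Illustrative 64-byte submission entry. Field names and order are assumed.
#[repr(C)]
#[derive(Clone, Copy, Default)]
pub struct CapSqe {
    pub opcode: u8,          // offset 0
    pub flags: u8,           // offset 1
    pub method_id: u16,      // offset 2
    pub cap_slot: u32,       // offset 4: local cap-table slot
    pub params_addr: u64,    // offset 8: user parameter buffer
    pub result_addr: u64,    // offset 16: user result buffer
    pub params_len: u32,     // offset 24
    pub result_len: u32,     // offset 28
    pub user_data: u64,      // offset 32: echoed in the completion
    pub pipeline_dep: u32,   // offset 40: reserved for promise pipelining
    pub pipeline_field: u32, // offset 44: reserved for promise pipelining
    pub reserved: [u64; 2],  // offset 48: must be zero today
}

// Fails to compile if the layout drifts away from the 64-byte ABI size.
const _: () = assert!(size_of::<CapSqe>() == 64);

/// Accepts only implemented opcodes. FINISH is reserved and rejected; the
/// sketch assumes unknown opcodes are rejected with the same error code.
pub fn check_opcode(opcode: u8) -> Result<u8, i32> {
    match opcode {
        CAP_OP_NOP | CAP_OP_CALL | CAP_OP_RECV | CAP_OP_RETURN | CAP_OP_RELEASE => Ok(opcode),
        _ => Err(CAP_ERR_UNSUPPORTED_OPCODE),
    }
}
```

A compile-time size assertion of this kind is a common way to keep a shared kernel/userspace record from changing size silently when a field is added.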

The kernel copies user params into preallocated per-process scratch, dispatches capability methods, serializes results directly into caller-provided result buffers, and posts a CapCqe. A successful method returns a non-negative count of bytes written. Transport failures are negative CAP_ERR_* codes. Application exceptions are reported as CAP_ERR_APPLICATION_EXCEPTION with a serialized CapException payload.
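A runtime reading a completion has to separate these three outcomes. The sketch below shows one way to do that, assuming the CQE carries a signed 32-bit result; the `Completion` enum, the `classify_result` function, and the numeric value chosen for `CAP_ERR_APPLICATION_EXCEPTION` are hypothetical.

```rust
// Hypothetical numeric value; only the name comes from the text.
pub const CAP_ERR_APPLICATION_EXCEPTION: i32 = -100;

/// The three outcomes a completion result can encode.
#[derive(Debug, PartialEq, Eq)]
pub enum Completion {
    /// The method succeeded and wrote this many result bytes.
    Success { bytes_written: usize },
    /// The callee raised an exception; a serialized CapException payload
    /// accompanies the completion and still needs to be parsed.
    ApplicationException,
    /// The transport itself failed with a negative CAP_ERR_* code.
    TransportError { code: i32 },
}

/// Classifies a raw signed completion result (assumed to be an i32).
pub fn classify_result(result: i32) -> Completion {
    if result >= 0 {
        Completion::Success { bytes_written: result as usize }
    } else if result == CAP_ERR_APPLICATION_EXCEPTION {
        Completion::ApplicationException
    } else {
        Completion::TransportError { code: result }
    }
}
```

Keeping application exceptions distinct from transport errors lets a caller tell "the method ran and refused" apart from "the request never reached the method".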

Transfer-bearing CALL and RETURN SQEs pack CapTransferDescriptor records after the params/result payload. Successful result-cap transfers append CapTransferResult records after normal result bytes.
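The framing described above amounts to an offset calculation. The sketch below computes where descriptors would start and how long the whole frame is, under assumptions this document does not confirm: a 16-byte descriptor, 8-byte descriptor alignment with padding inserted after the payload, and a limit of 8 descriptors. `TransferFrame` and `frame_layout` are hypothetical names.

```rust
// All three values are assumptions for illustration; only the name
// MAX_TRANSFER_DESCRIPTORS appears in the text.
pub const TRANSFER_DESCRIPTOR_SIZE: usize = 16;
pub const TRANSFER_DESCRIPTOR_ALIGN: usize = 8;
pub const MAX_TRANSFER_DESCRIPTORS: usize = 8;

/// Byte layout of a payload followed by packed transfer descriptors.
#[derive(Debug, PartialEq, Eq)]
pub struct TransferFrame {
    /// Offset of the first descriptor, rounded up to the alignment.
    pub descriptors_offset: usize,
    /// Total bytes used by payload, padding, and descriptors.
    pub total_len: usize,
}

/// Returns the frame layout, or None if the descriptor count exceeds the
/// limit, the arithmetic overflows, or the frame does not fit the buffer.
pub fn frame_layout(
    payload_len: usize,
    descriptor_count: usize,
    buffer_len: usize,
) -> Option<TransferFrame> {
    if descriptor_count > MAX_TRANSFER_DESCRIPTORS {
        return None;
    }
    // Round the payload length up to the descriptor alignment.
    let mask = TRANSFER_DESCRIPTOR_ALIGN - 1;
    let descriptors_offset = payload_len.checked_add(mask)? & !mask;
    let descriptors_len = descriptor_count.checked_mul(TRANSFER_DESCRIPTOR_SIZE)?;
    let total_len = descriptors_offset.checked_add(descriptors_len)?;
    if total_len > buffer_len {
        return None;
    }
    Some(TransferFrame { descriptors_offset, total_len })
}
```

Using checked arithmetic throughout means a hostile or corrupted length cannot wrap around and pass the bounds check.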

Future behavior should use the reserved SQE fields for system transport features, not ad hoc per-interface extensions.

Invariants

  • SQ and CQ sizes are powers of two and fixed by the ABI.
  • Unknown opcodes fail closed; FINISH is reserved, not silently accepted.
  • Reserved fields must be zero for currently implemented opcodes.
  • cap_enter rejects min_complete > CQ_ENTRIES.
  • User buffers must be in user address space with page permissions matching read/write intent before copy.
  • Timer dispatch must not run capabilities that allocate, block on locks, or mutate page tables unless the cap explicitly opts in.
  • Per-dispatch SQE processing is bounded by SQ_ENTRIES.
  • Transfer descriptors must be aligned, valid, and bounded by MAX_TRANSFER_DESCRIPTORS.
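Several of these invariants reduce to cheap checks at the start of dispatch. The following sketch illustrates three of them using the queue sizes given earlier; the `Reject` enum and the function names are hypothetical, and clamping the pending count is one assumed way to keep processing bounded, not necessarily how the kernel recovers from a corrupted ring.

```rust
/// Queue depths stated in the text.
pub const SQ_ENTRIES: u32 = 16;
pub const CQ_ENTRIES: u32 = 32;

/// Hypothetical rejection reasons for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    MinCompleteTooLarge,
    ReservedNotZero,
}

/// cap_enter cannot wait for more completions than the CQ can hold.
pub fn check_min_complete(min_complete: u32) -> Result<(), Reject> {
    if min_complete > CQ_ENTRIES {
        Err(Reject::MinCompleteTooLarge)
    } else {
        Ok(())
    }
}

/// Reserved fields must be zero so they can be given meaning later.
pub fn check_reserved(reserved: &[u64]) -> Result<(), Reject> {
    if reserved.iter().all(|&word| word == 0) {
        Ok(())
    } else {
        Err(Reject::ReservedNotZero)
    }
}

/// Number of SQEs one dispatch pass may process. The pending count is
/// computed with wrapping arithmetic and clamped to SQ_ENTRIES, so a
/// corrupted tail cannot cause unbounded work.
pub fn drain_budget(sq_head: u32, sq_tail: u32) -> u32 {
    sq_tail.wrapping_sub(sq_head).min(SQ_ENTRIES)
}
```

Rejecting non-zero reserved fields today is what allows future transport features to reuse them without ambiguity about what older submitters meant.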

Code Map

  • capos-config/src/ring.rs - shared ring ABI, opcodes, errors, SQE/CQE structs, endpoint message headers, transfer records.
  • kernel/src/cap/ring.rs - kernel dispatcher, SQE validation, CQE posting, cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.
  • kernel/src/arch/x86_64/syscall.rs - cap_enter syscall.
  • kernel/src/sched.rs - timer polling, cap-enter blocking, direct IPC wake.
  • kernel/src/process.rs - ring page allocation and mapping.
  • capos-rt/src/ring.rs - runtime ring client, pending calls, transfer packing, result-cap parsing.
  • capos-rt/src/entry.rs - single-owner runtime ring client token and release queue flushing.
  • capos-config/tests/ring_loom.rs - bounded producer/consumer model.

Validation

  • cargo test-ring-loom validates SQ/CQ producer-consumer behavior, capacity, FIFO, CQ overflow/drop behavior, and corrupted SQ recovery.
  • make run exercises Console CALLs, reserved opcode rejection, ring corruption recovery, NOP, fairness, transfers, and endpoint IPC.
  • make run-measure exercises measurement-only NOP and NullCap baselines.
  • cargo test-config covers shared ring layout and helper invariants.
  • make capos-rt-check checks userspace runtime ring code under the bare-metal target.

Open Work

  • Implement CAP_OP_FINISH as part of the system Cap’n Proto transport.
  • Add promise pipelining using pipeline_dep and pipeline_field.
  • Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
  • Add SQPOLL after SMP gives the kernel a spare execution context.