S.9 Design: Authority Graph and Resource Accounting for Transfer

This document defines the concrete S.9 design gate for:

WORKPLAN 3.6 capability transfer (xfer_cap_count, copy/move, rollback)
WORKPLAN 5.2 ProcessSpawner prerequisites (spawn quotas and result-cap insertion)

S.9 is complete when this design contract is concrete enough to guide implementation. The invariants and acceptance criteria below are implementation gates for later work in 3.6/5.2/S.8/S.12, not requirements for declaring the S.9 design artifact complete.

1. Authority Graph Model

Authority is modeled as a directed multigraph:

Nodes:
- Process(Pid)
- Object(ObjectId) (kernel object identity, independent of per-process CapId)
Edges:
- Hold(Pid -> ObjectId) with metadata:
  - cap_id (table-local handle)
  - interface_id
  - badge
  - transfer_mode (copy, move, non_transferable)
  - origin (kernel, spawn_grant, ipc_transfer, result_cap)

Security invariant A1: all authority is represented by Hold edges; no operation can create object authority outside this graph.

Security invariant A2: each process mutates only its own CapTable edges except through explicit transfer/spawn transactions validated by the kernel.

Security invariant A3: for every live Hold edge there is exactly one cap_id slot in one process table referencing the object generation.

2. Per-Process Resource Ledger and Quotas

Each process owns a kernel-maintained ResourceLedger with hard limits. Enforcement is fail-closed at reservation time (before side effects).

ResourceLedger {
  cap_slots_used / cap_slots_max
  endpoint_queue_used / endpoint_queue_max
  outstanding_calls_used / outstanding_calls_max
  scratch_bytes_used / scratch_bytes_max
  frame_grant_pages_used / frame_grant_pages_max
  log_bytes_window_used / log_bytes_per_window (token bucket)
  cpu_time_us_window_used / cpu_budget_us_per_window (token bucket)
}

Initial quota profile for Stage 6/5.2 bring-up (tunable by kernel config):

cap_slots_max: 256
endpoint_queue_max: 128 messages
outstanding_calls_max: 64
scratch_bytes_max: 256 KiB
frame_grant_pages_max: 4096 pages (16 MiB at 4 KiB pages)
log_bytes_per_window: 64 KiB/sec with 256 KiB burst
cpu_budget_us_per_window: 10,000 us per 100,000 us window

Security invariant Q1: no counter may exceed its max.

Security invariant Q2: every resource reservation has a matched release on all success, error, timeout, process-exit, and rollback paths.

Security invariant Q3: quota checks for transfer/spawn happen before mutating sender or receiver capability state.

3. Diagnostic Rate Limiting and Aggregation

Repeated invalid ring/cap submissions are aggregated per process and error key.

Key: (pid, error_code, opcode, cap_id_bucket)
Buckets:
- cap_id_bucket = exact cap id for stale/invalid cap failures
- cap_id_bucket = 0 for structural ring errors
Per-key token bucket: allow first N=4 emissions/sec, then suppress.
Suppressed counts are flushed once per second as one summary line:
- pid=X invalid submissions suppressed=Y last_err=...

Security invariant D1: invalid submission floods cannot consume unbounded serial bandwidth or scheduler time in log formatting.

Security invariant D2: suppression never hides first-observation diagnostics for a new (pid,error,opcode,cap bucket) key.

4. Transfer and Rollback Semantics

Transfers (xfer_cap_count > 0) use a kernel transfer transaction (TransferTxn) scoped to a single SQE dispatch. The current ring ABI does not provide kernel-owned SQE sequence numbers or a durable transaction table, so userspace replay of a copy-transfer SQE is repeatable: each replay is treated as a new copy grant. Move-transfer replay fails closed after the source slot is removed or reserved by the first successful dispatch.

Future exactly-once replay suppression requires transaction identity scoped to (sender_pid, call_id, sqe_seq) and a monotonic transfer epoch. Until that exists, exactly-once claims apply only within one dispatch attempt, not across malicious rewrites of shared SQ ring indexes.

Phases:

Prepare:
- validate SQE transport fields and xfer_cap_count
- validate sender ownership/generation/transferability for each exported cap
- reserve receiver quota (cap_slots, outstanding_calls, scratch if needed)
- pin sender entries in txn state (no sender table mutation yet)
Commit:
- insert destination edges exactly once
- for copy: increment object refcount/export ref
- for move: remove sender slot only after destination insertion succeeds
- publish completion/result
Finalize:
- release transient reservations
- mark txn terminal (committed or aborted)

On any error before Commit, rollback is full:

receiver inserts are not visible
sender slots/refcounts unchanged
reservations released
CQE returns transfer failure (CAP_ERR_TRANSFER_ABORTED / subtype)

On error during Commit, kernel executes compensating rollback to preserve exactly-once visibility: either all inserts are visible with matching sender state transition, or none are visible.

Security invariant T1: each transfer descriptor is applied at most once within a single SQE dispatch attempt.

Security invariant T2: move transfer is atomic from observer perspective; no state exists where both sender and receiver lose authority due to partial apply.

Security invariant T3: copy-transfer SQE replay is explicitly repeatable until kernel-owned transaction identity exists. Move-transfer replay fails closed after source removal or source reservation.

Security invariant T4: CAP_OP_RELEASE removes one local hold edge only from the caller table and decrements remote export refs exactly once.

5. Integration with 3.6 Capability Transfer

3.6 implementation must consume this design directly:

CALL and RETURN validate all currently-reserved transfer fields fail-closed when unsupported.
xfer_cap_count path is wired through TransferTxn (no ad hoc direct inserts).
Badge propagation is explicit in transfer descriptors and copied into destination edge metadata.
CAP_OP_RELEASE uses the same authority ledger and refcount bookkeeping.

3.6 acceptance criteria:

Copy transfer produces one new receiver edge and retains sender edge.
Move transfer produces one new receiver edge and deletes sender edge atomically.
Any transfer failure leaves sender and receiver CapTables unchanged.
Copy replay is an explicit repeatable-grant policy until a kernel-owned transaction identity is added; move replay fails closed after source removal or reservation.
CAP_OP_RELEASE on stale/non-owned cap fails closed without mutating other process tables.

6. Integration with 5.2 ProcessSpawner Prerequisites

5.2 must use the same accounting and transfer machinery:

spawn() preflights child quotas (cap_slots, outstanding_calls, scratch, frame_grant_pages, endpoint queue baseline) before mapping child memory or scheduling.
Parent-provided CapGrant entries are inserted via the same transfer transaction semantics (copy for initial grants in 5.2.2).
Returned ProcessHandle is inserted through the standard result-cap insertion path and accounted as a normal cap slot.
Child setup rollback must unwind:
- address space mappings
- ring page
- CapSet page
- kernel stack
- allocated frames
- provisional capability edges/reservations

5.2 acceptance criteria:

Spawn failure at any step leaves no child-visible process and no leaked ledger usage.
Successful spawn accounts all child bootstrap resources within quotas.
Parent and child cap-table accounting remains balanced under repeated spawn/exit cycles.
ProcessHandle.wait and exit cleanup release outstanding-call/scratch/frame usage deterministically.

7. Implementation Notes for Verification Tracks

This design unblocks:

S.8 hostile-input tests for quota and invalid-transfer failures.
S.12 Kani bounds refresh for ledger and transfer invariants.
Target 12 in docs/proposals/security-and-verification-proposal.md with explicit allocator hooks and fail-closed exhaustion behavior.

Keyboard shortcuts

capOS Documentation