# Completion Rings And Threaded Runtimes

This note grounds the capOS ring/threading roadmap in existing completion I/O
and futex designs. The question is not whether a shared CQ can be made to work
with many waiting threads; it can. The question is which ownership model keeps
the kernel ABI stable once capOS runs multiple process threads on multiple CPUs.

## Sources Checked

- Linux `io_uring_enter(2)` documents the aggregate wait shape: with
  `IORING_ENTER_GETEVENTS`, the syscall waits until `min_complete` completion
  events are available.
- Linux `io_uring_setup(2)` documents SQPOLL, CQ sizing, and
  single-issuer-oriented task-run modes.
- Linux `io_uring_register(2)` documents registered wait regions.
- Jens Axboe's `io_uring` paper explains the core ring design as a pair of
  shared rings with single producer/single consumer ownership on each side and
  `user_data` copied from request to completion for matching.
- Linux `futex(2)` and `futex(7)` document futexes as a kernel-assisted
  blocking path for synchronization objects whose uncontended state lives in
  user memory.
- Microsoft I/O completion ports document the port model: threads wait on a
  completion port and dequeue completion packets, rather than each thread
  waiting directly on one specific operation's storage slot.

## Consequences For capOS

The current process-wide capOS ring matches the early `io_uring` shape: one SQ,
one CQ, and `user_data` for completion matching. That shape is efficient when
userspace serializes submission and completion consumption through one runtime
owner. It becomes the wrong primitive for full SMP if multiple kernel-scheduled
threads in the same process concurrently enter the kernel, because the ring
turns into a multi-producer/multi-consumer coordination problem.

Waiting for a raw CQ slot is not a good abstraction. CQ slots are circular
buffer storage and are reused. Stable wait identities are request `user_data`,
kernel answer ids, completion packets, or a completion queue/lane chosen at
submission time.
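The slot-reuse problem can be shown with a toy ring. The sketch below is illustrative only: `Cqe`, `CompletionRing`, and `slot_reuse_demo` are hypothetical names, not capOS types, and the real CQE layout may differ. It posts three completions through a two-slot ring and records which slot each one landed in.

```rust
/// Hypothetical completion entry; the real capOS CQE layout may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Cqe {
    user_data: u64,
}

/// Toy single-producer/single-consumer completion ring (power-of-two size).
struct CompletionRing {
    slots: Vec<Option<Cqe>>,
    head: u32, // next entry the consumer reads
    tail: u32, // next entry the producer writes
}

impl CompletionRing {
    fn new(capacity: u32) -> Self {
        assert!(capacity.is_power_of_two());
        Self { slots: vec![None; capacity as usize], head: 0, tail: 0 }
    }

    fn mask(&self) -> u32 {
        self.slots.len() as u32 - 1
    }

    /// Producer side: returns the slot index used, or None when the ring is full.
    fn post(&mut self, cqe: Cqe) -> Option<usize> {
        if self.tail.wrapping_sub(self.head) == self.slots.len() as u32 {
            return None;
        }
        let idx = (self.tail & self.mask()) as usize;
        self.slots[idx] = Some(cqe);
        self.tail = self.tail.wrapping_add(1);
        Some(idx)
    }

    /// Consumer side: pops the oldest completion, freeing its slot for reuse.
    fn pop(&mut self) -> Option<Cqe> {
        if self.head == self.tail {
            return None;
        }
        let idx = (self.head & self.mask()) as usize;
        let cqe = self.slots[idx].take();
        self.head = self.head.wrapping_add(1);
        cqe
    }
}

/// Posts three completions through a 2-slot ring and returns
/// (slot index, user_data) for each one.
fn slot_reuse_demo() -> Vec<(usize, u64)> {
    let mut ring = CompletionRing::new(2);
    let mut seen = Vec::new();
    for user_data in [100u64, 200, 300] {
        let idx = ring.post(Cqe { user_data }).expect("ring has space");
        let cqe = ring.pop().expect("completion was just posted");
        seen.push((idx, cqe.user_data));
    }
    seen
}
```

The first and third completions share slot 0 while carrying different `user_data` values, so a waiter keyed on the slot index would observe the wrong operation's answer.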

The clean full-SMP target is per-thread completion ownership. Each thread gets
its own capability ring endpoint: a complete SQ/CQ pair, even if multiple
endpoints are packed into one larger mapping. The existing
`cap_enter(min_complete, timeout_ns)` semantics can then remain aggregate:
`min_complete` counts completions available on the current thread's CQ. Runtime
code still matches individual operations by `user_data`, but two sibling
threads no longer race to consume the same process CQ.
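A minimal model of that ownership rule, assuming hypothetical names throughout (`ThreadId`, `ThreadCq`, `Process`, `wait_satisfied`, and `take_by_user_data` are not capOS APIs): each thread has its own completion queue, the aggregate wake condition counts only the current thread's entries, and individual operations are still matched by `user_data`.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical thread identifier.
type ThreadId = u32;

/// Hypothetical completion entry.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Cqe {
    user_data: u64,
    result: i32,
}

/// One completion queue per thread (the SQ half is omitted for brevity).
#[derive(Default)]
struct ThreadCq {
    ready: VecDeque<Cqe>,
}

#[derive(Default)]
struct Process {
    rings: HashMap<ThreadId, ThreadCq>,
}

impl Process {
    /// Kernel side: a completion is posted to the ring of the owning thread.
    fn post(&mut self, owner: ThreadId, cqe: Cqe) {
        self.rings.entry(owner).or_default().ready.push_back(cqe);
    }

    /// Models the aggregate wake condition: `min_complete` counts only
    /// completions on the current thread's CQ.
    fn wait_satisfied(&self, current: ThreadId, min_complete: usize) -> bool {
        self.rings.get(&current).map_or(0, |cq| cq.ready.len()) >= min_complete
    }

    /// Runtime side: match one operation by `user_data` on the caller's own CQ.
    fn take_by_user_data(&mut self, current: ThreadId, user_data: u64) -> Option<Cqe> {
        let cq = self.rings.get_mut(&current)?;
        let pos = cq.ready.iter().position(|c| c.user_data == user_data)?;
        cq.ready.remove(pos)
    }
}
```

In this model, two completions posted for thread 1 satisfy `wait_satisfied(1, 2)` but never `wait_satisfied(2, 1)`, and thread 2 cannot consume thread 1's entries.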

The Windows IOCP model is a useful counterpoint: a shared completion port works
when the abstraction is explicitly a packet queue consumed by a worker pool.
That is a runtime/service scheduling model, not the same thing as multiple
threads blocking on one raw process CQ while each expects a private answer.
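For contrast, the packet-queue shape can be approximated in userspace with a shared channel. This is a rough analogy built on Rust's standard library, not the Windows API and not a capOS interface; `Packet` and `run_pool` are hypothetical names. Any worker may dequeue any packet, and no worker expects a specific answer.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical completion packet: a key naming the finished operation.
#[derive(Debug)]
struct Packet {
    key: u64,
}

/// Runs `workers` threads against one shared packet queue and returns the
/// keys that were handled, sorted so the result is deterministic.
fn run_pool(workers: usize, packets: Vec<Packet>) -> Vec<u64> {
    let (tx, rx) = mpsc::channel::<Packet>();
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut handled = Vec::new();
                loop {
                    // The lock guard is dropped at the end of this statement.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(packet) => handled.push(packet.key),
                        Err(_) => break, // queue closed and drained
                    }
                }
                handled
            })
        })
        .collect();

    for packet in packets {
        tx.send(packet).unwrap();
    }
    drop(tx); // closing the queue lets the workers exit

    let mut keys: Vec<u64> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    keys.sort_unstable();
    keys
}
```

Which worker handles which packet is unspecified here, which is acceptable for a worker pool and is exactly what a thread waiting for its own private answer cannot tolerate.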

### Current Implementation State

The kernel dispatches six SQE opcodes today: CALL, RECV, RETURN, RELEASE,
CANCEL, and NOP. PARK and UNPARK, the capability-authorized futex-style
thread-park operations, are also dispatched. FINISH is reserved for the
future system capnp transport and completes with an unsupported-opcode
error. Only CALL is gated to syscall context (via
`call_requires_syscall_dispatch`); every other dispatched opcode, PARK and
UNPARK included, is processed in both syscall and timer-interrupt contexts.
PARK_BENCH is measurement-only and is dispatched only when the kernel is
built with the `measure` feature.
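The gating rules above can be restated as a small decision function. This is a sketch under stated assumptions, not the kernel's dispatch code: `requires_syscall_context` is a hypothetical stand-in whose signature is invented for illustration, and the note does not say what happens to a CALL seen from timer context or to PARK_BENCH in a non-`measure` build, so both outcomes below are guesses.

```rust
/// SQE opcodes named in this note (the numeric encoding is not modeled).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Opcode {
    Call,
    Recv,
    Return,
    Release,
    Cancel,
    Nop,
    Finish,
    Park,
    Unpark,
    ParkBench,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Context {
    Syscall,
    TimerInterrupt,
}

#[derive(Debug, PartialEq, Eq)]
enum Disposition {
    Dispatch,
    /// Assumption: a gated opcode seen outside syscall context is left in
    /// place for a later syscall entry.
    LeftForSyscall,
    /// Completes with an unsupported-opcode error.
    Unsupported,
}

/// Hypothetical stand-in for the gating predicate: only CALL is gated.
fn requires_syscall_context(op: Opcode) -> bool {
    matches!(op, Opcode::Call)
}

fn classify(op: Opcode, ctx: Context, measure_build: bool) -> Disposition {
    match op {
        // FINISH is reserved and reports an unsupported-opcode error.
        Opcode::Finish => Disposition::Unsupported,
        // Assumption: without the `measure` feature, PARK_BENCH is rejected.
        Opcode::ParkBench if !measure_build => Disposition::Unsupported,
        _ if requires_syscall_context(op) && ctx == Context::TimerInterrupt => {
            Disposition::LeftForSyscall
        }
        _ => Disposition::Dispatch,
    }
}
```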

Per-process resource limits are enforced via `ResourceProfile`, a quota
struct carried on each `Process` and resolved at spawn time. Three fields
directly bound the ring's resource use. `ring_scratch_limit_bytes` caps the
input and output buffer capacity of the per-process ring scratch allocator
(narrowing the kernel-side ceilings `MAX_PARAMS` and `MAX_RESULT`).
`in_flight_call_limit` and `endpoint_queue_limit` cap the per-`Endpoint`
in-flight CALL count and the depth of the queued (parked) CALL queue,
respectively, each clamped by a kernel structural maximum of 32.
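A sketch of how such limits might resolve, assuming that "narrowing" and "clamped" both mean taking the minimum of the requested value and the kernel ceiling. The struct below is a hypothetical subset of `ResourceProfile`; `KernelCeilings`, `EffectiveLimits`, and `resolve` are invented for illustration, and the ceiling values are passed in as placeholders because this note does not state the real `MAX_PARAMS` and `MAX_RESULT`.

```rust
/// Structural maximum stated in the text for both per-endpoint limits.
const ENDPOINT_STRUCTURAL_MAX: u32 = 32;

/// Hypothetical subset of the quota struct; the real one may carry more fields.
#[derive(Clone, Copy, Debug)]
struct ResourceProfile {
    ring_scratch_limit_bytes: usize,
    in_flight_call_limit: u32,
    endpoint_queue_limit: u32,
}

/// Placeholder kernel ceilings; the real values are not given in this note.
#[derive(Clone, Copy, Debug)]
struct KernelCeilings {
    max_params: usize,
    max_result: usize,
}

#[derive(Debug, PartialEq, Eq)]
struct EffectiveLimits {
    params_bytes: usize,
    result_bytes: usize,
    in_flight_calls: u32,
    endpoint_queue_depth: u32,
}

/// Assumption: every limit resolves to min(requested, kernel ceiling).
fn resolve(profile: ResourceProfile, ceilings: KernelCeilings) -> EffectiveLimits {
    EffectiveLimits {
        params_bytes: profile.ring_scratch_limit_bytes.min(ceilings.max_params),
        result_bytes: profile.ring_scratch_limit_bytes.min(ceilings.max_result),
        in_flight_calls: profile.in_flight_call_limit.min(ENDPOINT_STRUCTURAL_MAX),
        endpoint_queue_depth: profile.endpoint_queue_limit.min(ENDPOINT_STRUCTURAL_MAX),
    }
}
```

Under these assumptions a profile can only tighten a limit: a request for 64 in-flight calls resolves to 32, while a request for 8 queued calls stays at 8.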

SQPOLL on the per-process ring has landed: a process can hold a
`kernelSqpoll` lease whose bound ring transitions into SQPOLL mode, with
the kernel acting as sole SQ consumer for that ring. This is the SQPOLL
foundation for the full-SMP per-thread ring target described below, not the
target itself. Generic full-nohz for explicitly budgeted compute leases and
SQPOLL nohz for explicitly leased caller-thread rings have landed; broader
userspace-poller/device-queue issuance remains future work.

## Recommended Direction

1. Keep the current process ring as the bootstrap and compatibility surface.
2. Add runtime reactor/demux support as an interim path for multithreaded
   runtimes that still use one process ring.
3. Make the full-SMP ABI a per-thread ring model:
   - each `Thread` owns one ring endpoint with a complete SQ/CQ pair;
   - `cap_enter` operates on the current thread's ring;
   - SQPOLL, when enabled, is the sole kernel SQ consumer for that ring;
   - result-cap transfers still mutate the process cap table;
   - endpoint, timer, process-wait, thread-join, and futex completions post to
     the waiting `ThreadRef`'s ring.
4. Consider shared completion ports only as a userspace runtime/service
   abstraction above per-thread rings, not as the kernel's first full-SMP ring
   ABI.
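The interim reactor/demux path in step 2 could look roughly like the following userspace sketch. It assumes a single runtime owner drains the process CQ and forwards each result to whichever task registered the matching `user_data`; `Reactor`, `register`, and `drain` are hypothetical names, and standard-library channels stand in for whatever wake mechanism a real runtime would use.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Receiver, Sender};

/// Hypothetical completion entry read from the shared process CQ.
#[derive(Clone, Copy, Debug)]
struct Cqe {
    user_data: u64,
    result: i32,
}

/// Single owner of the process ring: the only code that consumes the CQ.
#[derive(Default)]
struct Reactor {
    waiters: HashMap<u64, Sender<i32>>,
    next_id: u64,
}

impl Reactor {
    /// Allocates a fresh `user_data` value and a channel for its result.
    fn register(&mut self) -> (u64, Receiver<i32>) {
        let id = self.next_id;
        self.next_id += 1;
        let (tx, rx) = mpsc::channel();
        self.waiters.insert(id, tx);
        (id, rx)
    }

    /// Routes drained completions to their waiters by `user_data`.
    /// Returns how many completions had no registered waiter.
    fn drain(&mut self, completions: impl IntoIterator<Item = Cqe>) -> usize {
        let mut unmatched = 0;
        for cqe in completions {
            match self.waiters.remove(&cqe.user_data) {
                // A send error only means the waiter went away; ignore it.
                Some(tx) => {
                    let _ = tx.send(cqe.result);
                }
                None => unmatched += 1,
            }
        }
        unmatched
    }
}
```

Because only the reactor touches the CQ, sibling threads never race on consumption; they block on their own receiver instead. The cost is an extra hop through the reactor, which the per-thread ring model in step 3 avoids.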

## References

- Linux `io_uring_enter(2)`:
  <https://man.archlinux.org/man/io_uring_enter.2.en>
- Linux `io_uring_setup(2)`:
  <https://man7.org/linux/man-pages/man2/io_uring_setup.2.html>
- Linux `io_uring_register(2)` registered wait regions:
  <https://www.man7.org/linux/man-pages/man2/io_uring_register.2.html>
- Jens Axboe, "Efficient IO with io_uring":
  <https://www.kernel.dk/io_uring.pdf>
- Linux `futex(2)`:
  <https://man7.org/linux/man-pages/man2/futex.2.html>
- Linux `futex(7)`:
  <https://man7.org/linux/man-pages/man7/futex.7.html>
- Microsoft I/O completion ports:
  <https://learn.microsoft.com/en-us/windows/win32/fileio/i-o-completion-ports>