# In-Process Threading Contract

This page records the implemented contract for kernel-managed threads inside
one process. The park authority contract is frozen separately in
[Park Authority](park.md). These pages are the handoff from the initial
single-thread runtime checkpoint to same-process SMP work.

The current slice has per-thread completion rings for spawned child threads,
per-CPU WFQ run queues with bounded stealing, a caller-thread-bound
`SchedulingPolicyCap`, and a `SchedulingContext` cap that records identity,
bind/revoke, dispatcher budget charging/replenishment, bounded endpoint
donation/return, and fixed depletion/deadline notification cells.

Same-process sibling scheduling has formally accepted 1-to-2 evidence on
`capos-bench` from 2026-05-02 21:38 UTC against `main` commit `374f8556`:
capOS work `1.883x` / total `1.787x`, both clearing the configured `1.6x`
gates, with a matching Linux pthread baseline of `1.988x`/`1.987x` on the
same physical-core pin set. The 2026-05-02 1-to-4 row was the diagnostic that
justified Phase D's fair-share enqueue policy: capOS sat at
`1.566x`/`1.538x` while Linux scaled to `3.963x`/`3.858x`. Phase D now runs
per-CPU WFQ queues with bounded stealing, and its 2026-05-10 1-to-4
diagnostic row (`3.088x`/`2.700x`) was manually accepted while the
harness-enforced gate remains 1-to-2 work/total speedup. See
`docs/benchmarks.md` for the full evidence table, including historical
pre-collapse rows.

Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry,
housekeeping/deferred-work placement, the clockevent/deadline substrate, and
bounded SQPOLL ring mode, including the non-periodic SQPOLL producer-wake
progress path. The first automatic nohz activation increment is closed via
`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md`, and
SQPOLL-driven auto-nohz activation is closed via
`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md`. Generic
full-nohz for ordinary budgeted compute leases and timeout-based auto-revoke
have landed; policy-service AutoNoHz issuance remains future work.


## Scope

The threading milestone changes the scheduler's unit of execution from process
to thread while keeping the process as the authority, address-space, and
resource-accounting boundary. Same-process sibling scheduling on multiple CPUs
is functional for per-thread-ring processes. The accepted 1-to-2 performance
claim is now the formal `capos-bench` 5-run pair recorded on 2026-05-02
21:38 UTC against `main` commit `374f8556`: capOS work `1.883x` and total
`1.787x` clear the configured `1.6x` gates; the matching Linux pthread
baseline on the same physical-core pin set (`0,1,2,3`) records
`1.988x`/`1.987x`, validating the workload shape. The 2026-05-02 1-to-4 row
was the diagnostic that justified Phase D: capOS sat at `1.566x`/`1.538x`
while Linux scaled to `3.963x`/`3.858x`. Phase D now runs per-CPU WFQ queues
with bounded stealing and its 2026-05-10 1-to-4 row (`3.088x`/`2.700x`) was
manually accepted from recorded diagnostics; the harness-enforced gate remains
1-to-2 work/total speedup. Historical pre-collapse rows and the post-collapse
3-run diagnostic remain in `docs/benchmarks.md` for reference. Phase E adds
the `SchedulingContext` cap (identity, caller-thread bind, revoke, budget
charging/replenishment, bounded synchronous endpoint donation/return, and
fixed depletion/deadline notification cells with drain observer results),
and Phase F has landed the bounded SQPOLL ring mode plus the
clockevent/deadline substrate. Policy-service AutoNoHz issuance, realtime
admission, and privileged userspace scheduler-policy services remain later
work.

This contract covers:

- process-owned versus thread-owned state;
- the initial thread creation ABI;
- per-thread FS-base/TLS rules;
- thread exit and join semantics;
- the per-thread ring blocking and completion-routing contract;
- the caller-thread-bound `SchedulingPolicyCap` and `SchedulingContext`
  surfaces that mutate per-thread WFQ weight/latency-class and per-thread
  scheduling-context binding;
- the handoff to the 7.1.1 park authority design.

## Ownership Split

The process remains the security boundary. All threads in one process share
the same address space and capability table, so a thread has the same
authority as its sibling threads.

| Process-owned state | Thread-owned state |
| --- | --- |
| Process id and process generation | Thread id and thread generation |
| User address space and CR3 | Saved CPU context and user register state |
| Capability table and resource ledger | Kernel stack and syscall stack top |
| Initial compatibility ring and ring arena ownership | Per-thread ring endpoint, scratch, and FS base |
| Read-only CapSet page | Scheduling/blocking state |
| ProcessHandle exit state | ThreadHandle join/exit state |
| Endpoint owner state and process-wide cleanup hooks | WFQ weight, latency class, virtual runtime, and `virtual_finish_ns` enqueue tag |
| Process-wide resource ledgers (thread records, kernel stacks, cap-table slots) | `SchedulingContext` binding (identity/generation, remaining budget, replenish/deadline timestamps, donation/return slot, notification recorder) |

The implementation migrated incrementally. The 7.2.0 slice made each process
contain a single initial `Thread`, with saved context, kernel stack, FS base,
and blocking state stored on that thread. Later slices changed scheduler-owned
queues, current execution, direct IPC handoff, and wake records to
generation-checked `ThreadRef` values, added creation and lifecycle caps, and
then assigned per-thread rings to spawned children.

## Scheduler Contract

`Scheduler` stores runnable execution contexts as thread
references, not process ids. A thread reference is `(pid, process_generation,
tid, thread_generation)`. The process generation keeps handles from naming a
reused process; the thread generation keeps handles from naming a reused
thread slot inside a live process.

This identity applies to `Scheduler.current`, run queues, direct IPC targets,
Timer sleep waiters, process/terminal waiters, endpoint caller/receiver wake
records, and deferred cancellation state.

Runnable ownership is split across per-CPU run queues
(`SCHEDULER_CPUS = 4`). Each queue is ordered ascending by
`virtual_finish_ns`, which is recomputed per enqueue from
`virtual_runtime_ns`, the thread's WFQ `weight` (clamped to
`[MIN_WEIGHT, MAX_WEIGHT]` in `capos-abi::scheduler`), and a per-class
slice scaled by `LatencyClass` (`Interactive` divides the slice,
`Batch` multiplies it, `Normal`/`IpcServer` pass it through). Default
placement targets the current CPU; a bounded steal path balances when a
CPU's local queue is empty, recomputes the WFQ tag at the destination,
and records placement-spread / steal migrations under the `measure`
feature. Each per-CPU queue is reserved at thread-create time to the live
runnable-capable thread count so timer-tick, unblock, direct-IPC fallback,
and steal-requeue paths never allocate.
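
The enqueue ordering described above can be sketched as follows. This is a minimal illustration rather than the kernel code: the constant values, the `BASE_SLICE_NS` and `CLASS_FACTOR` names, the `Runnable` struct, and the exact tag formula (virtual runtime plus a weight-scaled class slice) are assumptions chosen for readability.

```rust
// Illustrative sketch only. The constants and the tag formula below are
// assumptions for this example, not values read from the capOS kernel.
const REFERENCE_WEIGHT: u64 = 1024;
const MIN_WEIGHT: u64 = 1;
const MAX_WEIGHT: u64 = 65_535;
const BASE_SLICE_NS: u64 = 4_000_000; // hypothetical base slice
const CLASS_FACTOR: u64 = 2; // hypothetical divide/multiply factor

#[derive(Clone, Copy, Debug, PartialEq)]
enum LatencyClass {
    Normal,
    Interactive,
    IpcServer,
    Batch,
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Runnable {
    tid: u32,
    weight: u64,
    class: LatencyClass,
    virtual_runtime_ns: u64,
    virtual_finish_ns: u64,
}

// Interactive divides the slice, Batch multiplies it, others pass through.
fn class_slice_ns(class: LatencyClass) -> u64 {
    match class {
        LatencyClass::Interactive => BASE_SLICE_NS / CLASS_FACTOR,
        LatencyClass::Batch => BASE_SLICE_NS * CLASS_FACTOR,
        LatencyClass::Normal | LatencyClass::IpcServer => BASE_SLICE_NS,
    }
}

// Recompute the enqueue tag from virtual runtime, clamped weight, and slice.
fn finish_tag(t: &Runnable) -> u64 {
    let weight = t.weight.clamp(MIN_WEIGHT, MAX_WEIGHT) as u128;
    let scaled = class_slice_ns(t.class) as u128 * REFERENCE_WEIGHT as u128 / weight;
    t.virtual_runtime_ns.saturating_add(scaled as u64)
}

// Insert in ascending `virtual_finish_ns` order; equal tags keep FIFO order.
// With capacity reserved up front, `insert` does not reallocate.
fn enqueue(queue: &mut Vec<Runnable>, mut t: Runnable) {
    t.virtual_finish_ns = finish_tag(&t);
    let pos = queue.partition_point(|q| q.virtual_finish_ns <= t.virtual_finish_ns);
    queue.insert(pos, t);
}
```

Under these assumed constants, a heavier weight or the `Interactive` class produces a smaller tag and therefore an earlier queue position. A steal path would call the same `enqueue` on the destination queue so the tag is recomputed there.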

The run queue, `current`, direct IPC target, and blocked waiter scans are
thread-oriented. Address-space switches happen only when the next runnable
thread belongs to a different process. TSS.RSP0, the syscall kernel stack, and
FS base are updated on every thread switch because those are thread-local
machine resources. Per-thread `runtime_ns` advances 1:1 with elapsed CPU
time; `virtual_runtime_ns` advances by
`elapsed_ns * REFERENCE_WEIGHT / weight` so weight changes the cumulative
WFQ share rather than just an enqueue tie-breaker.
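
As a worked sketch of that accounting rule, the snippet below charges the same elapsed time to two threads with different weights. The `REFERENCE_WEIGHT` value of 1024 and the `ThreadClock` type are assumptions for illustration; only the formula `elapsed_ns * REFERENCE_WEIGHT / weight` comes from this page.

```rust
// Assumed value for illustration; the real constant may differ.
const REFERENCE_WEIGHT: u64 = 1024;

// Hypothetical per-thread accounting record.
struct ThreadClock {
    weight: u64,
    runtime_ns: u64,
    virtual_runtime_ns: u64,
}

impl ThreadClock {
    // Charge `elapsed_ns` of CPU time to this thread.
    fn charge(&mut self, elapsed_ns: u64) {
        // Real runtime advances 1:1 with elapsed CPU time.
        self.runtime_ns = self.runtime_ns.saturating_add(elapsed_ns);
        // Virtual runtime advances inversely to weight. The multiply is done
        // in u128 so it cannot overflow before the divide.
        let weight = self.weight.max(1) as u128;
        let scaled = elapsed_ns as u128 * REFERENCE_WEIGHT as u128 / weight;
        let scaled = scaled.min(u64::MAX as u128) as u64;
        self.virtual_runtime_ns = self.virtual_runtime_ns.saturating_add(scaled);
    }
}
```

With these assumptions, a thread at twice the reference weight accrues virtual runtime at half the rate, so over time it is selected roughly twice as often as a reference-weight sibling.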

`SchedulingContext` bindings layer dispatcher budget on top of WFQ. A
thread may carry at most one `SchedulingContextThreadBinding`. While
bound, the dispatcher charges elapsed time against the binding's
`remaining_budget_ns`, replenishes from `period_ns` at the next replenish
boundary, records `deadline_or_timeout` and `budget_depleted`
notifications in the per-context fixed cells, and routes synchronous
endpoint donation/return for passive receiver threads (`donated_holder`
in the notification snapshot tracks whether the holder is the donor or
the receiver). Stale-generation or revoked caps fail closed before
mutating scheduler state. Realtime-island admission, CPU placement
enforcement, and overrun-fault policy remain deferred.
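
The charge/replenish cycle can be pictured with the sketch below. It is a simplified model under stated assumptions: the `Binding` struct, its `budget_ns` and `next_replenish_ns` fields, the `ChargeOutcome` enum, and the choice to replenish before charging are all hypothetical; only `remaining_budget_ns`, `period_ns`, and the budget-depleted notification are named on this page. Donation/return and deadline handling are not modeled.

```rust
#[derive(Debug, PartialEq)]
enum ChargeOutcome {
    Charged,
    BudgetDepleted,
}

// Hypothetical, simplified view of one thread's scheduling-context binding.
struct Binding {
    budget_ns: u64,             // full budget restored at each boundary (assumed)
    period_ns: u64,
    remaining_budget_ns: u64,
    next_replenish_ns: u64,     // assumed bookkeeping field
    budget_depleted_cell: bool, // stands in for the fixed notification cell
}

impl Binding {
    fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        assert!(period_ns > 0, "period must be nonzero");
        Binding {
            budget_ns,
            period_ns,
            remaining_budget_ns: budget_ns,
            next_replenish_ns: now_ns.saturating_add(period_ns),
            budget_depleted_cell: false,
        }
    }

    // Charge `elapsed_ns` observed at time `now_ns`.
    fn charge(&mut self, now_ns: u64, elapsed_ns: u64) -> ChargeOutcome {
        // Replenish if at least one boundary has passed; skip missed periods.
        if now_ns >= self.next_replenish_ns {
            let missed = (now_ns - self.next_replenish_ns) / self.period_ns + 1;
            self.next_replenish_ns = self
                .next_replenish_ns
                .saturating_add(missed.saturating_mul(self.period_ns));
            self.remaining_budget_ns = self.budget_ns;
        }
        self.remaining_budget_ns = self.remaining_budget_ns.saturating_sub(elapsed_ns);
        if self.remaining_budget_ns == 0 {
            // Record in place, without allocating.
            self.budget_depleted_cell = true;
            ChargeOutcome::BudgetDepleted
        } else {
            ChargeOutcome::Charged
        }
    }
}
```

The depleted flag is a single in-place write, which mirrors the page's requirement that notification cells are fixed rather than queued.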

The idle path is a per-CPU CPL0 (kernel-mode) idle thread; the former
special user-mode idle process has been removed. Each CPU's idle thread is a
kernel-owned execution context: it runs on the kernel PML4 with a dedicated
idle kernel stack and cannot block, exit, or hold ordinary caps. A lightweight
synthetic idle `Process` record is retained per CPU only so the idle
`ThreadRef` resolves through scheduler bookkeeping; it maps no user code,
stack, or cap ring. See the "Idle paths" section of
`docs/architecture/scheduling.md`.

Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry,
housekeeping/deferred-work placement, the clockevent/deadline substrate,
and a bounded SQPOLL ring-mode worker (`MAX_SQPOLL_WORKERS = 16`,
`request_sqpoll_start_for_thread` / `finalize_pending_sqpoll_start_for_thread`
with stale-owner rollback). Tick suppression now exists behind explicit
`CpuIsolationLease` admission, including ordinary budgeted compute leases that
target a live `SchedulingContext`; policy-service AutoNoHz issuance and generic
SQPOLL nohz for arbitrary rings remain future work.

## Thread Creation ABI

Thread creation is exposed through a process-local `ThreadSpawner` capability.
It creates threads only in the caller's current process. It does not grant
authority to another process and is non-transferable across IPC in the initial
implementation.

The initial control-plane shape is:

```capnp
interface ThreadSpawner {
    create @0 (
        entry :UInt64,
        stackTop :UInt64,
        arg :UInt64,
        fsBase :UInt64,
        flags :UInt64
    ) -> (handleIndex :UInt16);
}

interface ThreadHandle {
    join @0 () -> (exitCode :Int64);
    exitCode @1 () -> (exited :Bool, exitCode :Int64);
}

interface ThreadControl {
    getFsBase @0 () -> (fsBase :UInt64);
    setFsBase @1 (fsBase :UInt64) -> ();
    exitThread @2 (code :Int64) -> ();
}
```

Any 7.2 schema adjustment must update this page in the same branch before
implementation review. The stable semantics are that creation is in-process,
the returned handle is an observed result cap, `ThreadHandle` observes one
thread rather than the whole process, and current-thread exit is available
through both `ThreadControl.exitThread` and the raw `exit(code)` syscall.

The new thread starts in Ring 3 at `entry` with:

- `RDI = arg`;
- `RSI = tid`;
- `RDX = pid`;
- `RCX` = the new thread's own ring address;
- `R8 = CAPSET_VADDR`, or zero if the process has no CapSet.

The runtime supplies the user stack and TLS block. The kernel validates that
`entry`, `stackTop`, and `fsBase` are user-canonical, that `stackTop` is
16-byte aligned at entry, and that reserved `flags` bits are zero. Page
presence and stack-growth policy remain process address-space questions;
before a page-fault subsystem exists, an invalid thread stack can fault the
process.
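
The argument checks listed above could look roughly like this. The sketch assumes an x86_64 lower-half user range ending at `0x0000_8000_0000_0000` and assumes no flag bits are defined yet; `USER_CANONICAL_END`, `KNOWN_FLAGS`, and the `CreateArgError` variants are hypothetical names, and the real kernel reports all of these cases as `Failed`.

```rust
// Assumption: user-canonical means the x86_64 lower canonical half.
const USER_CANONICAL_END: u64 = 0x0000_8000_0000_0000;
// Assumption: no flag bits are defined, so every bit is reserved.
const KNOWN_FLAGS: u64 = 0;

// Hypothetical error detail; the ABI surfaces these as `Failed`.
#[derive(Debug, PartialEq)]
enum CreateArgError {
    EntryNotUserCanonical,
    StackNotUserCanonical,
    StackMisaligned,
    FsBaseNotUserCanonical,
    ReservedFlags,
}

fn is_user_canonical(addr: u64) -> bool {
    addr < USER_CANONICAL_END
}

// Validate the ABI fields only; page presence is not checked here.
fn validate_create_args(
    entry: u64,
    stack_top: u64,
    fs_base: u64,
    flags: u64,
) -> Result<(), CreateArgError> {
    if !is_user_canonical(entry) {
        return Err(CreateArgError::EntryNotUserCanonical);
    }
    if !is_user_canonical(stack_top) {
        return Err(CreateArgError::StackNotUserCanonical);
    }
    if stack_top % 16 != 0 {
        return Err(CreateArgError::StackMisaligned);
    }
    if !is_user_canonical(fs_base) {
        return Err(CreateArgError::FsBaseNotUserCanonical);
    }
    if flags & !KNOWN_FLAGS != 0 {
        return Err(CreateArgError::ReservedFlags);
    }
    Ok(())
}
```

Note that passing validation says nothing about whether the stack pages are mapped, which matches the statement that page presence remains an address-space question.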

## Resource Accounting

Thread creation allocates kernel memory and is quota-backed by process-owned
ledger state, not per-capability helper counters. The 7.2.0 checkpoint charges
the initial thread during process creation; `ThreadSpawner.create` extends the
same ledgers to additional threads. The ledger of record is:

- `PROCESS_THREAD_LIMIT`, the maximum live or retained thread records in one
  process, initially 16;
- `PROCESS_THREAD_KERNEL_STACK_PAGES`, initially matching the current
  per-thread kernel stack allocation size of 32 pages;
- `thread_records_used` / `thread_records_max`;
- `thread_kernel_stack_pages_used` / `thread_kernel_stack_pages_max`.

The initial process thread charges one thread record and one kernel-stack
allocation during process creation. `ThreadSpawner.create` reserves a thread
record and kernel-stack page budget before allocating the stack or publishing a
`ThreadHandle`; every later failure rolls both reservations back before
returning. Cap-slot reservation for the result handle remains charged to the
existing process cap-table ledger.

Creation failures are controlled application exceptions. Thread count,
kernel-stack budget, handle cap-slot exhaustion, and kernel stack allocation
failure return `Overloaded` with a specific message and no partially runnable
thread. Invalid entry, stack, FS base, or flags return `Failed`.
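
The reserve-then-rollback discipline can be sketched as below. The ledger field names and the two limits come from this page; the `ThreadLedger` and `CreateError` types, the error messages, the stack-page maximum (`PROCESS_THREAD_LIMIT` times the per-thread page count), and the `allocate_stack` callback are assumptions made for the example.

```rust
const PROCESS_THREAD_LIMIT: u32 = 16;
const PROCESS_THREAD_KERNEL_STACK_PAGES: u32 = 32;

// Hypothetical error type; messages are illustrative.
#[derive(Debug, PartialEq)]
enum CreateError {
    Overloaded(&'static str),
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct ThreadLedger {
    thread_records_used: u32,
    thread_records_max: u32,
    thread_kernel_stack_pages_used: u32,
    thread_kernel_stack_pages_max: u32,
}

impl ThreadLedger {
    // Process creation charges the initial thread up front.
    fn for_new_process() -> Self {
        ThreadLedger {
            thread_records_used: 1,
            thread_records_max: PROCESS_THREAD_LIMIT,
            thread_kernel_stack_pages_used: PROCESS_THREAD_KERNEL_STACK_PAGES,
            // Assumed maximum: one full stack per permitted thread.
            thread_kernel_stack_pages_max: PROCESS_THREAD_LIMIT
                * PROCESS_THREAD_KERNEL_STACK_PAGES,
        }
    }

    // Reserve both charges or neither.
    fn reserve_thread(&mut self) -> Result<(), CreateError> {
        if self.thread_records_used >= self.thread_records_max {
            return Err(CreateError::Overloaded("thread record limit reached"));
        }
        let pages = self.thread_kernel_stack_pages_used + PROCESS_THREAD_KERNEL_STACK_PAGES;
        if pages > self.thread_kernel_stack_pages_max {
            return Err(CreateError::Overloaded("kernel-stack page budget exhausted"));
        }
        self.thread_records_used += 1;
        self.thread_kernel_stack_pages_used = pages;
        Ok(())
    }

    fn rollback_thread(&mut self) {
        self.thread_records_used -= 1;
        self.thread_kernel_stack_pages_used -= PROCESS_THREAD_KERNEL_STACK_PAGES;
    }
}

// Reserve first, then attempt the allocation; undo the charge on failure.
fn create_thread(
    ledger: &mut ThreadLedger,
    allocate_stack: impl FnOnce() -> bool,
) -> Result<(), CreateError> {
    ledger.reserve_thread()?;
    if !allocate_stack() {
        ledger.rollback_thread();
        return Err(CreateError::Overloaded("kernel stack allocation failed"));
    }
    Ok(())
}
```

Reserving before allocating means a failed creation leaves the ledger exactly as it was, so no partially charged thread can be observed.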

Thread exit releases the kernel stack only after the scheduler is running on a
different kernel stack. The thread record remains charged while a live
`ThreadHandle`, pending join waiter, or unjoined exit status can still observe
it. Once the handle is released without a pending join, or once a one-shot join
has consumed the status and no wait record pins it, the retained record charge
is released. Process exit releases all thread records and stack charges once.

The off-stack property is enforced by an `OffStackToken` witness on every stack
frame release path: the deferred per-thread drain calls
`Process::release_thread_kernel_stack`, whole-process teardown calls
`Process::release_all_thread_kernel_stacks`, and pre-publication rollback calls
`Process::rollback_created_thread`. The token constructor is private to the
scheduler module. Implicit `Thread::Drop` is deliberately not a release path;
if a `Thread` value reaches its destructor with a nonzero stack, it fails
closed by leaving the frames allocated instead of freeing a stack without an
off-stack witness.
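
The witness pattern can be illustrated with module privacy alone. In this sketch the `scheduler` module, the simplified `Thread` type, and the `drain_exited_thread` helper are hypothetical stand-ins; the real release paths live on `Process` as listed above, and the fail-closed `Drop` behavior is not modeled here.

```rust
mod scheduler {
    // The private field means code outside this module cannot construct
    // a token, so it cannot call the release path on its own.
    pub struct OffStackToken {
        _private: (),
    }

    // Simplified stand-in for a thread that owns kernel stack pages.
    pub struct Thread {
        kernel_stack_pages: u32,
    }

    impl Thread {
        pub fn new(kernel_stack_pages: u32) -> Self {
            Thread { kernel_stack_pages }
        }

        pub fn kernel_stack_pages(&self) -> u32 {
            self.kernel_stack_pages
        }

        // Releasing the stack requires proof that the caller is off-stack.
        // Returns the number of pages released.
        pub fn release_kernel_stack(&mut self, _token: &OffStackToken) -> u32 {
            std::mem::take(&mut self.kernel_stack_pages)
        }
    }

    // Only scheduler code can mint the witness.
    fn off_stack_token() -> OffStackToken {
        OffStackToken { _private: () }
    }

    // Models the deferred drain: it runs after the scheduler has switched
    // to a different kernel stack, so minting the token is justified here.
    pub fn drain_exited_thread(thread: &mut Thread) -> u32 {
        let token = off_stack_token();
        thread.release_kernel_stack(&token)
    }
}
```

Because the token type has no public constructor, a caller outside `scheduler` that tries to free a stack directly fails to compile rather than failing at run time.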

## FS Base And TLS

FS base is thread-owned. The existing `ThreadControl.getFsBase` and
`ThreadControl.setFsBase` operations keep their names, but after threading they
refer to the current thread, not the whole process. `setFsBase` continues to
reject non-user-canonical values and writes the CPU FS-base MSR immediately
when called by the running thread. Both methods route through
context-aware dispatch (`CapCallContext::caller_thread`) so the
operation always targets the caller, never a different thread; calling
`ThreadControl` from a non-live caller returns
`ProcessFsBaseError::CallerNotLive`.

The initial process thread uses the PT_TLS block installed by ELF loading.
Additional threads receive an FS base from `ThreadSpawner.create`; the runtime
is responsible for allocating and initializing each thread's TLS/TCB data.
There is no process-global FS base. Current-thread FS-base operations are useful
for the single-thread runtime checkpoint, but they must not be treated as the
final threading ABI for language runtimes. True multi-threaded Go or
C/POSIX-like runtime support requires each `ThreadRef` to own a distinct TLS
block and FS base.

Context switching must save the outgoing thread's FS base and restore the next
thread's FS base even when both threads belong to the same process and no CR3
switch is needed.
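
A toy model of that switch rule follows. The `Cpu` and `Thread` structs are hypothetical and stand in for real MSR and CR3 accesses; the point is only that the FS-base swap is unconditional while the address-space switch is not.

```rust
// Hypothetical minimal thread record for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Thread {
    pid: u32,
    fs_base: u64,
}

// Stand-in for per-CPU machine state.
struct Cpu {
    cr3_owner_pid: u32,
    fs_base_msr: u64,
    cr3_switches: u32,
}

fn switch_to(cpu: &mut Cpu, prev: &mut Thread, next: &Thread) {
    // Always save the outgoing thread's FS base.
    prev.fs_base = cpu.fs_base_msr;
    // Switch address space only when crossing a process boundary.
    if next.pid != prev.pid {
        cpu.cr3_owner_pid = next.pid;
        cpu.cr3_switches += 1;
    }
    // Always restore the incoming thread's FS base, even for siblings.
    cpu.fs_base_msr = next.fs_base;
}
```

Skipping the restore for same-process switches would leave a sibling running on the previous thread's TLS block, which is the failure this rule exists to prevent.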

## Thread Identity In Waiters And Dispatch

The concrete identity type for in-process scheduling is:

```rust
ThreadRef {
    pid,
    process_generation,
    tid,
    thread_generation,
}
```

Process identity still governs authority and accounting, but wakeup and
blocking state must name a thread. 7.2 changes context-aware capability
dispatch so `CapCallContext` carries both the caller process id for authority
checks and the caller `ThreadRef` for wake/cancel decisions. Existing pid-only
records that can resume execution or write a caller CQE must be widened before
multiple threads can run in one process.

The migration target is:

- `TimerSleepWaiter` stores the sleeping `ThreadRef` and validates the
  generation before waking it;
- endpoint CALL, RECV, RETURN target, deferred-cancel, current-caller, and
  direct IPC handoff records store the blocked or target `ThreadRef`;
- terminal line input and any other `ProcessWaiter` consumer store the waiting
  `ThreadRef` and validate the generation before writing a CQE;
- `ProcessHandle.wait` records the waiting `ThreadRef` while the handle still
  names the child process;
- `ThreadHandle.join` records the waiting `ThreadRef` and the target
  `ThreadRef`;
- `cap_enter` blocks the current `ThreadRef` on that thread's ring endpoint;
- process-exit cleanup cancels every waiter whose `pid` and
  `process_generation` match the exiting process, regardless of thread id.

A generation mismatch on wake or completion is a stale waiter and must be
drained without writing to userspace. This mirrors current process-generation
behavior and prevents a reused thread slot from receiving another thread's
Timer, endpoint, join, or ring completion.
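
The stale-waiter rule can be sketched as a lookup that succeeds only when both generations match. The field widths, the `ProcessSlot`/`ThreadSlot` tables, and the `WakeOutcome` enum are assumptions for illustration; the four `ThreadRef` fields are the ones defined on this page.

```rust
// Field widths are assumed; this page does not specify them.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ThreadRef {
    pid: u32,
    process_generation: u64,
    tid: u32,
    thread_generation: u64,
}

// Hypothetical live-state tables.
struct ThreadSlot {
    tid: u32,
    thread_generation: u64,
}

struct ProcessSlot {
    pid: u32,
    process_generation: u64,
    threads: Vec<ThreadSlot>,
}

#[derive(Debug, PartialEq)]
enum WakeOutcome {
    Woken,
    // The waiter named a reused slot: drop it, write nothing to userspace.
    StaleDrained,
}

fn wake(processes: &[ProcessSlot], waiter: ThreadRef) -> WakeOutcome {
    let live = processes
        .iter()
        .find(|p| p.pid == waiter.pid && p.process_generation == waiter.process_generation)
        .and_then(|p| {
            p.threads
                .iter()
                .find(|t| t.tid == waiter.tid && t.thread_generation == waiter.thread_generation)
        });
    match live {
        Some(_) => WakeOutcome::Woken,
        None => WakeOutcome::StaleDrained,
    }
}
```

A mismatch at either level resolves to the same drained outcome, so a waiter recorded against an old process or an old thread slot can never deliver a completion to its successor.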

## Exit And Join

The current `exit(code)` syscall terminates the current thread. This preserves
single-thread process exit because the process exits when its last non-idle
thread exits, and it avoids tearing down a shared address space while sibling
threads are still current on other CPUs.

Thread exit does not add a new syscall. The initial implementation added
`ThreadControl.exitThread(code)` as a terminal capability-ring operation on
the current thread, with the same current-thread termination semantics as the
raw syscall. A successful invocation does not post a CQE back to the exiting
thread, because `cap_enter` will not return to that execution context. It
records the exit code, wakes or completes any valid join waiter, and removes
only the current thread from scheduling. If the last non-idle thread in a
process exits through `exit(code)` or `exitThread`, the process exits with that
thread's code and completes the parent-facing `ProcessHandle`.

Whole-process termination remains a `ProcessHandle` operation. It releases the
shared capability table, cancels process-owned endpoint state, removes
timer/park/ring waiters for every thread in the process, and completes the
parent-facing `ProcessHandle` after the process is no longer current on any
CPU.

`ThreadHandle.join` is process-local and one-shot. If the target thread already
exited and its status is retained, join returns its code immediately and marks
the status joined. If it is still live, join blocks the caller's thread until
the target exits. Self-join returns `Failed`. A second waiter, join after a
successful join, or join after detach returns `Failed`; it must not park an
ambiguous waiter. `ThreadHandle.exitCode` is nonblocking and may observe the
retained status while the handle is live, but it does not consume the one-shot
join right.
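
The one-shot join rules can be read as a small state machine. The sketch below is a simplified model, not the kernel implementation: the `ThreadJoinState`, `ExitStatus`, and `JoinResult` types are hypothetical, and returning `None` from `exit_code` once the status has been consumed is this sketch's choice, since the page does not specify that case.

```rust
#[derive(Debug, PartialEq)]
enum JoinResult {
    Ready(i64),
    Blocked,
    Failed,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum ExitStatus {
    Running,
    Exited(i64), // retained, not yet joined
    Consumed,    // joined already, or dropped because of detach
}

struct ThreadJoinState {
    target_tid: u32,
    status: ExitStatus,
    waiter: Option<u32>,
    detached: bool,
}

impl ThreadJoinState {
    fn new(target_tid: u32) -> Self {
        ThreadJoinState { target_tid, status: ExitStatus::Running, waiter: None, detached: false }
    }

    fn join(&mut self, caller_tid: u32) -> JoinResult {
        // Self-join, join after detach, and a second waiter all fail.
        if caller_tid == self.target_tid || self.detached || self.waiter.is_some() {
            return JoinResult::Failed;
        }
        match self.status {
            ExitStatus::Exited(code) => {
                self.status = ExitStatus::Consumed;
                JoinResult::Ready(code)
            }
            ExitStatus::Consumed => JoinResult::Failed,
            ExitStatus::Running => {
                self.waiter = Some(caller_tid);
                JoinResult::Blocked
            }
        }
    }

    // Nonblocking observer: does not consume the one-shot join right.
    fn exit_code(&self) -> Option<i64> {
        match self.status {
            ExitStatus::Exited(code) => Some(code),
            _ => None,
        }
    }

    // Called when the target exits. Returns the waiter to wake, if any.
    fn on_exit(&mut self, code: i64) -> Option<(u32, i64)> {
        if let Some(waiter) = self.waiter.take() {
            self.status = ExitStatus::Consumed;
            return Some((waiter, code));
        }
        self.status = if self.detached { ExitStatus::Consumed } else { ExitStatus::Exited(code) };
        None
    }
}
```

Rejecting the second waiter up front is what keeps the model free of ambiguous parked waiters: at most one thread is ever recorded against a target.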

Releasing the last `ThreadHandle` before the target exits detaches the target:
the thread continues to run, but no exit status is retained after it exits
unless a join waiter already pins the state. Releasing the handle after exit
but before join drops the retained status and releases the thread-record
charge. A pending join waiter pins the handle state until completion or process
exit, so cap release cannot create a use-after-free. The exiting thread's
kernel stack must not be freed while it is still executing on that stack; final
process teardown performs an explicit token-gated stack release after another
kernel stack is active, before the deferred `Process` value is dropped.

Fatal user faults remain process-fatal in the first implementation. Per-thread
fault isolation can be designed later, after the basic scheduler and futex
paths are stable.

## Capability Ring And Blocking

The first Ring v2 implementation keeps the initial thread's compatibility
ring at `RING_VADDR` and gives each spawned child thread a kernel-chosen ring
mapping inside the reserved process ring arena. Runtime-selected ring address
ranges remain a later `VirtualMemory` reservation extension.

`ThreadSpawner.create` allocates a ring record and user mapping for the new
thread, stores that mapping on the child `ThreadRef`, and passes the ring
address in the child start registers. `cap_enter` blocks the current thread
against that thread's own CQ, so same-process sibling threads may block in
`cap_enter` independently. Timer, endpoint, join, park, and cancellation paths
must route completions by generation-checked `ThreadRef` to the target
thread's ring endpoint.

The runtime's single-owner ring-client invariant remains local to each ring
client. Well-formed userspace serializes submission and completion matching per
thread ring through `capos-rt`; it must not have two consumers racing on the
same SQ/CQ. The scheduler still refuses to run the exact same `ThreadRef` on
two CPUs at once, but it no longer treats every multithreaded pid as tied to
one scheduler CPU.

This is sufficient for functional same-process sibling scheduling. The formal
accepted 1-to-2 `make run-thread-scale` capOS evidence is the `capos-bench`
2026-05-02 21:38 UTC pair (work `1.883x`, total `1.787x`, both clearing the
configured `1.6x` gates). The guest result row's `accepted` field remains
diagnostic; the host summary enforces the work-window and total-time gates, and
refuses speedup enforcement unless `CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS`
records the QEMU CPU pin set. Linux validates the repaired benchmark shape
through four workers on physical cores (`3.963x`/`3.858x`). The capOS
2026-05-02 4-worker row was diagnostic (`1.566x`/`1.538x`) and justified
Phase D's per-CPU WFQ queues plus bounded stealing. The 2026-05-10 Phase D rerun
recorded 1-to-4 work/total diagnostics `3.088x`/`2.700x`, manually accepted
for closeout; remaining risks are the shared scheduler lock, temporary CPU
pinning, CQ/join/exit/block/schedule overhead, broader workload classes, and
higher-thread-count evidence.

## Scheduling Policy And Context Authority

`SchedulingPolicyCap` is the caller-thread-bound surface for WFQ knobs.
Every method routes through `CapCallContext::caller_thread`; there is no
per-cap-object `ThreadHandle`, no badge-encoded thread id, and no
cross-thread mutation in this slice. Cross-thread authority is deferred to
the privileged scheduler-policy service plan. The schema shape is:

```capnp
interface SchedulingPolicyCap {
    setWeight @0 (weight :UInt16) -> ();
    setLatencyClass @1 (class :LatencyClass) -> ();
    snapshot @2 () -> (
        weight :UInt16,
        class :LatencyClass,
        runtimeNs :UInt64,
        virtualRuntimeNs :UInt64,
    );
}
```

`setWeight` validates against `[MIN_WEIGHT, MAX_WEIGHT]` at the cap
boundary and updates the caller thread's WFQ weight; the new weight
applies to the next enqueue's `virtual_finish_ns` tag and to subsequent
`virtual_runtime_ns` accounting. `setLatencyClass` swaps the per-thread
`LatencyClass` (`Normal`, `Interactive`, `IpcServer`, `Batch`) used to
scale the dispatcher slice. `snapshot` is a read-only observer over the
core WFQ state and does not expose the `measure`-only counters.

`SchedulingContext` is the schema-typed cap for dispatcher budget
authority:

```capnp
interface SchedulingContext {
    info @0 () -> (info :SchedulingContextInfo);
    create @1 (spec :SchedulingContextSpec) -> (
        contextIndex :UInt16,
        identity :SchedulingContextIdentity,
        result :SchedulingContextOperationResult,
        dispatchEffect :SchedulingContextDispatchEffect,
    );
    bindCallerThread @2 () -> (
        identity :SchedulingContextIdentity,
        binding :SchedulingContextBinding,
        result :SchedulingContextOperationResult,
        dispatchEffect :SchedulingContextDispatchEffect,
    );
    revoke @3 () -> (
        identity :SchedulingContextIdentity,
        previousGeneration :UInt64,
        result :SchedulingContextOperationResult,
        dispatchEffect :SchedulingContextDispatchEffect,
    );
    drainNotifications @4 () -> (
        notifications :SchedulingContextNotificationSnapshot,
    );
}
```

`create` returns a same-interface child context as transferred result
cap 0; the child becomes chargeable only after `bindCallerThread`. `revoke`
bumps the generation and clears any matching thread binding; later calls
through the stale cap generation report `staleGeneration` or fail closed
before mutating scheduler state. `drainNotifications` reads the fixed
per-context budget-depleted and deadline-or-timeout slots; the
scheduler updates these in place from hard paths without allocation,
including the holder identity and a `donatedHolder` bit for endpoint
donation/return. The bootstrap manifest grants `SchedulingPolicyCap` and
`SchedulingContext` only to focused-proof manifests; the default boot
manifest does not grant them.
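
The revoke behavior can be sketched with a generation counter. This is a minimal model under assumptions: the `Context`, `ContextCap`, and `OpResult` types are hypothetical, and the real results are carried in `SchedulingContextOperationResult` rather than this enum.

```rust
// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq)]
enum OpResult {
    Applied,
    StaleGeneration,
}

// A cap snapshot remembers the generation it was issued against.
#[derive(Clone, Copy)]
struct ContextCap {
    generation: u64,
}

struct Context {
    generation: u64,
    bound_thread: Option<u32>,
}

impl Context {
    fn bind_caller_thread(&mut self, cap: ContextCap, caller_tid: u32) -> OpResult {
        // Fail closed before touching any state.
        if cap.generation != self.generation {
            return OpResult::StaleGeneration;
        }
        self.bound_thread = Some(caller_tid);
        OpResult::Applied
    }

    // Returns the result and the generation that was current before the call.
    fn revoke(&mut self, cap: ContextCap) -> (OpResult, u64) {
        if cap.generation != self.generation {
            return (OpResult::StaleGeneration, self.generation);
        }
        let previous = self.generation;
        self.generation += 1;
        self.bound_thread = None;
        (OpResult::Applied, previous)
    }
}
```

After one successful `revoke`, every cap issued against the old generation is rejected by the first comparison, so no later call through it can rebind a thread.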

## Userspace API Surface

The `capos-rt` runtime exposes the threading caps as typed clients on top
of the per-thread ring:

- `ThreadControlClient` -- `get_fs_base`/`set_fs_base`/`exit_thread`,
  including `*_wait` blocking variants over `RuntimeRingClient`.
- `ThreadSpawnerClient::create` -- submits the `entry`/`stackTop`/`arg`/
  `fsBase`/`flags` ABI and returns an `OwnedCapability<ThreadHandle>`
  delivered as transferred result cap 0 in the CQE.
- `ThreadHandleClient` -- `join`, `exit_code` (nonblocking observer), and
  their `finish_*` helpers; `finish_join` decodes the one-shot exit
  code.
- `SchedulingPolicyClient` -- `set_weight`, `set_latency_class`, and
  `snapshot`, all caller-thread-bound.
- `SchedulingContextClient` -- `info`, `create`, `bind_caller_thread`,
  `revoke`, and `drain_notifications`.

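Typical spawn/join pseudocode against these clients:

```rust
// Pseudocode: `ring`, `entry_addr`, `user_stack_top`, `arg`, `fs_base`, and
// `timeout_ns` are supplied by the runtime.
let handle = thread_spawner.create_wait(
    &mut ring,
    entry_addr,
    user_stack_top, // must be 16-byte aligned
    arg,
    fs_base,
    /* flags */ 0, // reserved bits must be zero
    timeout_ns,
)?; // an `OwnedCapability<ThreadHandle>`
// ... runtime work on the parent thread ...
// `thread_handle` stands for a `ThreadHandleClient` over `handle`; how the
// client is obtained from the owned cap is elided in this pseudocode.
let exit_code = thread_handle
    .join_wait(&mut ring, timeout_ns)?;
```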

The userspace runtime is responsible for the user stack, TLS/TCB, and
any free-list bookkeeping for retired handles; the kernel only validates
the ABI fields and charges the per-process ledgers.

## Park Handoff

Park authority is defined in [Park Authority](park.md). The scheduler
changes above must leave room for a thread block reason that is not tied to the
process ring CQ. The frozen handoff is:

- park wait blocks the current thread, not the whole process;
- park wake makes selected generation-checked `ThreadRef` values runnable;
- timeouts use the same monotonic time base as `Timer`;
- private park keys are based on address-space identity plus user virtual
  address;
- shared-memory park keys are MemoryObject-derived identity plus offset;
- the first implementation starts with compact `CAP_OP_PARK` and
  `CAP_OP_UNPARK` operations rather than generic Cap'n Proto methods;
- park wait SQEs are thread-owned so ring dispatch cannot park a sibling
  thread under the waiter's `user_data`;
- blocking park wait is a syscall-context operation that releases runtime
  ring-client ownership before the thread parks, while `capos-rt` demultiplexes
  reserved park CQEs back to the waiting thread.
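
The two key shapes in the list above can be sketched as an enum used to index a waiter table. Everything here is illustrative: the `ParkKey` and `ParkTable` types, the `u64` identity fields, and FIFO wake order are assumptions, and the real ABI uses the compact `CAP_OP_PARK`/`CAP_OP_UNPARK` operations rather than these methods.

```rust
use std::collections::HashMap;

// Hypothetical key type: identity fields are assumed to be u64.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum ParkKey {
    // Private key: address-space identity plus user virtual address.
    Private { address_space_id: u64, vaddr: u64 },
    // Shared key: MemoryObject-derived identity plus offset.
    Shared { memory_object_id: u64, offset: u64 },
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ThreadRef {
    pid: u32,
    process_generation: u64,
    tid: u32,
    thread_generation: u64,
}

#[derive(Default)]
struct ParkTable {
    waiters: HashMap<ParkKey, Vec<ThreadRef>>,
}

impl ParkTable {
    // Park blocks one thread, never the whole process.
    fn park(&mut self, key: ParkKey, thread: ThreadRef) {
        self.waiters.entry(key).or_default().push(thread);
    }

    // Wake up to `max` waiters on `key`, oldest first (assumed order).
    fn unpark(&mut self, key: ParkKey, max: usize) -> Vec<ThreadRef> {
        let mut woken = Vec::new();
        let mut now_empty = false;
        if let Some(queue) = self.waiters.get_mut(&key) {
            let n = max.min(queue.len());
            woken.extend(queue.drain(..n));
            now_empty = queue.is_empty();
        }
        if now_empty {
            self.waiters.remove(&key);
        }
        woken
    }
}
```

Including the address-space identity in the private key means two processes that park on the same virtual address never wake each other's threads. A real wake path would still validate each returned `ThreadRef` generation before making it runnable.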

Pre-thread 4.5.4 measurement chose the compact capability-authorized shape for
failed wait and empty wake. 4.5.5 measured the real blocked/resume path through
`thread-lifecycle` under `make run-measure`, so the compact ParkSpace opcodes
remain the runtime ABI target for this slice.

## Security Invariants

- A thread never owns a separate capability table in the initial model.
- A thread cannot escape the authority of its containing process.
- A `ThreadHandle` names only a thread in the same process and is
  non-transferable in the first implementation.
- Thread creation is charged to one process-owned thread/kernel-stack ledger of
  record before the thread can become runnable.
- Process exit releases shared authority once, after all live threads are
  removed from scheduling.
- Per-process resource quotas are shared by all threads.
- `ThreadControl` changes only the current thread's FS base.
- `ThreadControl.exitThread` terminates only the current thread and is a
  capability-ring operation, not a syscall.
- Every waiter or direct handoff that can resume execution stores a generation
  checked `ThreadRef`.
- Process-owned user-buffer validation/copy/read paths hold the process
  `AddressSpace` lock; future shared-memory thread primitives still need
  mapping provenance or object pins when they derive keys from shared backing.

## Implementation Order

1. Add internal `Thread` state, make each process own one initial thread, move
   saved context / kernel stack / FS base / block state onto that thread, and
   charge the initial thread against private process ledgers.
   Done 2026-04-24 23:09 UTC.
2. Change scheduler queues, blocking, exit cleanup, and direct IPC targets from
   pid-oriented state to thread references while preserving one thread per
   process.
   Done 2026-04-24 23:33 UTC.
3. Add `ThreadSpawner`, `ThreadHandle`, and `ThreadControl.exitThread` with a
   QEMU smoke for create, join, detach, self-join rejection, second join
   rejection, and last-thread process exit.
   Done 2026-04-25.
4. Implement the ParkSpace private wait/wake path from
   [Park Authority](park.md) after the scheduler can block and wake
   individual threads, then run 4.5.5 blocked/resume measurements before
   declaring the park ABI stable.
   Done 2026-04-25.

## Validation

The `thread-lifecycle` proof creates multiple threads in one process, proves
they share the address space and CapSet, proves each has an independent FS
base, rejects invalid join cases, joins one thread from another, and lets the
last thread exit the process. The existing `make run-spawn` path keeps covering
`runtime-fs-base` and `single-thread-runtime` so regressions in the pre-thread
runtime contract stay visible. `make run-measure` additionally records the
private ParkSpace blocked/resume timings and proves process exit while a park
waiter is still parked. Phase D fairness/Interactive/weight-change smokes
(`make run-thread-fairness`, `make run-thread-fairness-interactive`,
`make run-thread-fairness-weight-change`) exercise the `SchedulingPolicyCap`
caller-thread-bound surface; the `thread-scale` proof carries the recorded
WFQ scaling evidence. The recorded 1-to-2 work/total speedup gate is the
host-enforced Phase D acceptance criterion; the 1-to-4 row remains a
manually accepted diagnostic. Safe runtime park wrappers and a focused
`SchedulingContext` budget/donation/notification smoke remain future
capos-rt and harness work.
