# Proposal: Scheduler Evolution

capOS should evolve its scheduler in layers. The goal is not one clever
algorithm; it is a capability-shaped CPU subsystem that scales ordinary work,
admits realtime islands, allows service/runtime-specific policy, and preserves
a small auditable kernel dispatch path.

This proposal complements, rather than replaces,
[Tickless and Realtime Scheduling](tickless-realtime-scheduling-proposal.md).
That proposal owns timer/tickless/SQPOLL-nohz details. This proposal owns the
broader scheduler architecture and roadmap.

## Design Grounding

Local grounding:

- [Scheduling](../architecture/scheduling.md)
- [In-Process Threading Contract](../architecture/threading.md)
- [Design Risks Register, Q9 -- CPU accounting and scheduling contexts](../design-risks-register.md#q9--cpu-accounting-and-scheduling-contexts)
- [SMP Phase C](../backlog/smp-phase-c.md)
- [SMP](smp-proposal.md)
- [Ring v2 For Full SMP](ring-v2-smp-proposal.md)
- [Tickless and Realtime Scheduling](tickless-realtime-scheduling-proposal.md)
- [Stateful Task and Job Graphs](stateful-task-job-graphs-proposal.md)
- [Future Scheduler Architecture](../research/future-scheduler-architecture.md)
- [NO_HZ, SQPOLL, and Realtime Scheduling](../research/nohz-sqpoll-realtime.md)
- [Out-of-kernel scheduling](../research/out-of-kernel-scheduling.md)
- [Completion rings and threaded runtimes](../research/completion-ring-threading.md)
- [Multimedia pipeline latency](../research/multimedia-pipeline-latency.md)
- [Robotics realtime control](../research/robotics-realtime-control.md)

## Goals

- Keep protected dispatch, budget enforcement, interrupt handling, and idle in
  the kernel.
- Replace the single global runnable queue with per-CPU runnable ownership and
  bounded cross-CPU wake/migration.
- Add CPU accounting before adopting policy that depends on runtime charge.
- Make ordinary best-effort scheduling fair by virtual time, with EEVDF-like
  virtual-deadline scheduling as the target after accounting exists.
- Represent admitted CPU time as `SchedulingContext` capability authority.
- Represent isolated CPU ownership as `CpuIsolationLease` authority.
- Support user-space scheduler policy services for admission and tuning
  without putting user-space calls on every dispatch path.
- Provide enough telemetry to distinguish scheduler cost, serial/MMIO logging,
  TLB/CR3 effects, QEMU/KVM artifacts, and workload contention.

## Full-SMP Scalability Focus

The scheduler work after the current Phase F chain should be judged by whether
capOS can keep useful throughput and bounded scheduling overhead on 16/32-core
machines, not by another small QEMU-only speedup row. The SMP proposal owns
CPU bring-up and APIC/TLB substrate; this proposal owns the scheduler changes
needed to make that substrate useful at higher core counts.

The scheduler side of the milestone should include:

- dynamic scheduler CPU sets derived from discovered topology instead of the
  temporary four-owner mask;
- per-CPU run queues and current-thread state that do not require one shared
  lock for ordinary local pick/requeue paths;
- narrower shared metadata locks for process/thread lookup, blocking waiters,
  exit cleanup, direct IPC handoff, and timer/deadline waiters;
- bounded cross-CPU wakeup and migration that records target, source, steal,
  reschedule-IPI, and failed-placement counters;
- topology-aware placement that separates physical cores, SMT siblings, and
  later NUMA/cache groups;
- total-time accounting for spawn/join/exit and service-bound workloads, not
  only syscall-free work windows;
- hardware-run artifacts that include native Linux baselines on the same
  machine and QEMU rows only as regression or virtualization context.

The benchmark shape should include static map/reduce, uneven dynamic tasks,
barrier-heavy phase loops, independent processes, same-process threads, and a
capability-call/service-bound workload. That matrix is intentionally broader
than the old thread-scale checksum row because high core counts often expose
lock convoying, wakeup storms, timer/IPI cost, TLB-shootdown scaling, and
runtime lifecycle overhead before pure compute saturates.

## Non-Goals

- Do not import Linux CFS/EEVDF, FreeBSD ULE, or sched_ext as code.
- Do not expose arbitrary user-supplied scheduler programs in the kernel in
  the near term.
- Do not make a user-space process the mandatory next-thread dispatcher.
- Do not claim hard realtime until admission, budget enforcement, IRQ/device
  behavior, kernel-path latency, and WCET evidence exist.
- Do not make nohz/full-nohz a thread flag. It is a CPU lease plus scheduler
  proof.

## Architecture

The target scheduler has four layers:

- Kernel mechanism: per-CPU run queues, current-thread state, idle, context
  switch, cross-CPU wake/migration, timer/IPI handling, CPU accounting, budget
  enforcement, and timeout/depletion faults.
- Kernel policy primitives: best-effort weights, virtual deadlines,
  scheduling contexts, CPU masks, isolation leases, direct IPC donation, and
  realtime-island hooks.
- Privileged scheduler policy service: admission, budget/profile selection,
  CPU partitioning, isolation grants, service/runtime hints, policy reload,
  and operator diagnostics.
- Application/runtime schedulers: work stealing, actors, async reactors,
  language M:N schedulers, request queues, and service-local priority and
  batching.

The hot path remains local and bounded: timer interrupt or wakeup, charge
runtime, update runnable state, pick from a per-CPU queue or a bounded steal
path, switch context. User-space policy participates at slower boundaries:
profile changes, thread/process creation, budget depletion, realtime admission,
lease grant/revoke, or explicit operator policy updates.

Stateful task/job graph coordinators sit above these layers. They may own
graph node queues, leases, retry state, cancellation, and assignment metadata,
but they do not own CPU dispatch. A graph node's `priority`, `deadline`,
`budget`, or `queue` field is workload policy until a capability-authorized
scheduler policy service maps it to a weight, scheduling context, CPU lease,
or request deadline.

## Stage 0: Evidence Before Policy

Before changing the default policy, the active thread-scale attribution work
must keep policy conclusions separated from benchmark artifacts. Current
mainline evidence now includes:

- scheduler candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock,
  timer interrupt, and CR3/TLB counters behind
  `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`;
- raw guest-PC samples for user-mode timer preemption points;
- logging-suppression A/B evidence through
  `CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1`;
- exact native Linux pthread baseline evidence, including
  compact-versus-padded result-slot diagnostics;
- larger-workload/Amdahl evidence through `CAPOS_THREAD_SCALE_TOTAL_BLOCKS` and
  `LINUX_THREAD_SCALE_TOTAL_BLOCKS`.

This evidence does not prove the primary remaining cause of non-scaling.
Per-CPU runnable ownership, accepted work/total speedup thresholds, and
optional symbolic guest attribution remain follow-on work before a scheduler
policy claim.

This protects the design from treating QEMU/KVM, serial MMIO, or benchmark
cache contention as a scheduler algorithm problem.

## Stage 1: Per-CPU Runnable Ownership

Split the scheduler's runnable state first. The accepted initial shape has
per-CPU run queues with a runnable `ThreadRef` deque or priority buckets,
current-thread state, a local reschedule flag, and local counters. Shared
scheduler state keeps process/thread metadata, sleeping/deadline waiters,
blocked waiters, migration records, and the global policy epoch.

Rules:

- A runnable `ThreadRef` is owned by exactly one CPU queue at a time.
- Cross-CPU wake enqueues to the target CPU or a policy-selected CPU and sends
  a bounded reschedule IPI when needed.
- Migration removes from one owner before publishing to another.
- Idle CPUs steal only through bounded policy, not by scanning every process.
- Process exit and thread exit keep cleanup bounded and must not allocate in
  interrupt, cancellation, or emergency paths.

This stage may still use round-robin within each CPU queue. The objective is
SMP structure and evidence, not perfect fairness.
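
The ownership rules above can be illustrated with a minimal sketch. It assumes a hypothetical `ThreadRef(u32)` stand-in and a `PerCpuQueues` type that do not come from the capOS tree, and it uses `std` collections for readability where the kernel would use `alloc` under its own locking. The only point it makes is that a runnable reference has one owner and that migration removes from the source before publishing to the destination.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the kernel's thread reference.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef(pub u32);

/// Illustrative per-CPU runnable queues; not the capOS data structure.
pub struct PerCpuQueues {
    queues: Vec<VecDeque<ThreadRef>>,
}

impl PerCpuQueues {
    /// Build one empty queue per scheduler CPU.
    pub fn new(cpus: usize) -> Self {
        Self { queues: (0..cpus).map(|_| VecDeque::new()).collect() }
    }

    /// Return the CPU whose queue currently owns `t`, if any.
    /// A linear scan keeps the sketch short; a kernel would track the owner.
    pub fn owner_of(&self, t: ThreadRef) -> Option<usize> {
        self.queues.iter().position(|q| q.contains(&t))
    }

    /// Enqueue on `cpu`, refusing to create a second owner.
    pub fn enqueue(&mut self, cpu: usize, t: ThreadRef) -> bool {
        if cpu >= self.queues.len() || self.owner_of(t).is_some() {
            return false;
        }
        self.queues[cpu].push_back(t);
        true
    }

    /// Migrate: remove from the current owner first, then publish to `dst`.
    pub fn migrate(&mut self, t: ThreadRef, dst: usize) -> bool {
        if dst >= self.queues.len() {
            return false;
        }
        let Some(src) = self.owner_of(t) else { return false };
        let Some(idx) = self.queues[src].iter().position(|x| *x == t) else {
            return false;
        };
        self.queues[src].remove(idx);
        self.queues[dst].push_back(t);
        true
    }

    /// Number of runnable references owned by `cpu`.
    pub fn len(&self, cpu: usize) -> usize {
        self.queues[cpu].len()
    }
}
```

In this sketch a second `enqueue` of the same reference is rejected rather than silently duplicated, which is one way to make the single-owner rule checkable in tests.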

First implementation evidence exists as commit `1a8bf909`: capOS introduced
four bounded per-scheduler-CPU FIFO runnable queues under the existing
global scheduler lock. That slice proved the basic ownership structure and
bounded steal path. Follow-up review fixes reserved per-CPU queue capacity
before a thread became runnable, using a live reservation count released on
process/thread exit or pre-publication rollback, so timer and unblock
requeues did not allocate after work moved between CPUs.

Update 2026-05-02: the per-CPU queues were collapsed back into a single
global runnable queue under the same scheduler lock with the per-CPU
run-queue-collapse cleanup slice (see `docs/backlog/scheduler-evolution.md`
and `docs/architecture/scheduling.md`).

Update 2026-05-07 23:45 UTC: Phase D Task 3 reintroduced the per-CPU
runnable queues, this time ordered ascending by `virtual_finish_ns`
(Weighted Fair Queueing) and balanced by a bounded steal path that picks
the most-overdue sibling Runnable candidate (each sibling queue's first
entry the destination CPU considers Runnable; ties broken by lower CPU id).
The queue ownership and migration contract is documented in the scheduling
architecture page. This does not close the stage: the scheduler still
needs stronger cross-CPU wake counters, further separation from shared
process/thread metadata, replacement of temporary pinning policy, and
accepted benchmark evidence before policy conclusions should change.

## Stage 2: CPU Accounting

Add a monotonic runtime charge model. `ThreadCpuAccount` records runtime,
last-start time, virtual runtime, context switches, preemptions, and voluntary
blocks. `SchedEntity` records weight, latency class, eligible time, and virtual
deadline.

Accounting must be stable enough to support fair scheduling, quotas, and
future scheduling contexts. It must account context switches, blocking
syscalls, endpoint direct handoff, timer preemption, thread exit, and idle.

Where exact cycle attribution is not yet credible, the implementation should
label the metric as diagnostic rather than enforcing policy from it.
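
A minimal sketch of the monotonic charge model described above, assuming the field list given for `ThreadCpuAccount`. The `start`/`stop` method names and the `StopReason` enum are hypothetical, and virtual runtime advances 1:1 with elapsed time here because weighting only arrives with the Stage 3 policy.

```rust
/// Hypothetical accounting record following the Stage 2 field list.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct ThreadCpuAccount {
    pub runtime_ns: u64,
    pub last_started_ns: Option<u64>,
    pub virtual_runtime_ns: u64,
    pub context_switches: u64,
    pub preemptions: u64,
    pub voluntary_blocks: u64,
}

/// Why the thread stopped running (illustrative, not a capOS type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopReason {
    Preempted,
    Blocked,
    Exited,
}

impl ThreadCpuAccount {
    /// Record that the thread was switched in at `now_ns`.
    pub fn start(&mut self, now_ns: u64) {
        self.last_started_ns = Some(now_ns);
        self.context_switches += 1;
    }

    /// Charge the elapsed run and classify the stop. Returns the charged
    /// nanoseconds, or 0 if the thread was not marked as running.
    pub fn stop(&mut self, now_ns: u64, reason: StopReason) -> u64 {
        let Some(started) = self.last_started_ns.take() else { return 0 };
        // Saturating arithmetic keeps the ledger monotonic even if the
        // clock reading is older than the start stamp.
        let elapsed = now_ns.saturating_sub(started);
        self.runtime_ns = self.runtime_ns.saturating_add(elapsed);
        self.virtual_runtime_ns = self.virtual_runtime_ns.saturating_add(elapsed);
        match reason {
            StopReason::Preempted => self.preemptions += 1,
            StopReason::Blocked => self.voluntary_blocks += 1,
            StopReason::Exited => {}
        }
        elapsed
    }
}
```

Calling `stop` without a matching `start` charges nothing and leaves every counter untouched, so a double-stop on an exit or handoff path cannot inflate the ledger.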

## Stage 3: Best-Effort Fair Policy

Stage 3's first implementation slice has landed. Phase D passed its Task 6
evidence gate at commit `77caafc0` (`2026-05-10 19:39 UTC`,
`docs(scheduler): record phase d thread-scale gate`) and closed in docs commit
`1a08ec23` (`2026-05-10 21:47 UTC`, `docs(scheduler): close phase d`) with
weighted fair queueing (WFQ) as the accepted best-effort policy. The
controlled Task 6 benchmark pair recorded capOS 1-to-4 work/total
speedups `3.088x` / `2.700x` at 4 workers, materially closing the
prior single-global-queue `1.566x` / `1.538x` diagnostic gap while
the matching Linux pthread baseline on the same host and physical-core
logical CPUs `0,1,2,3` recorded `3.974x` / `3.850x`. The completed
execution plan is archived at
`docs/backlog/scheduler-evolution.md`.

After Phase D, capOS should move ordinary best-effort scheduling
from WFQ toward virtual-time fairness with stronger eligibility
semantics only when that follow-on is explicitly selected.

The long-term target policy is EEVDF-like:

- runnable entities accrue lag against their fair share;
- eligible entities are ordered by virtual deadline;
- weights affect virtual runtime/deadline progression;
- latency-sensitive best-effort entities can request smaller slices within
  policy limits;
- migration preserves accounting so moving CPUs does not reset fairness.

The first implementation slice was intentionally narrower than EEVDF:
weighted fair queueing on top of the existing per-thread
runtime/vruntime accounting. That decision and its accepted evidence
are recorded in the next subsection.

### Phase D first-policy decision (2026-05-05 19:00 UTC)

**Decision: weighted fair queueing (WFQ) for the first Phase D slice; EEVDF
remains the deferred follow-on.** Recorded against `main` commit
`60e421ab` and the `2026-05-02 21:38 UTC` thread-scale evidence pair
against `main` commit `374f8556` (capOS work `1.566x` versus Linux
`3.963x` at 1-to-4 on the same physical-core pin set).

Rationale (concise):

- The 1-to-4 gap is dominated by single-global-queue scheduler-lock
  contention plus exit/join/block/schedule overhead, not by ordering. Any
  fair-share policy that successfully consumes a per-CPU split should
  close most of the gap. The simpler policy reaches that signal sooner
  with less risk.
- The existing `ThreadCpuAccounting` record separates the load-bearing ledger
  from benchmark diagnostics: `runtime_ns`, `virtual_runtime_ns`, and
  `last_started_ns` are unconditional, while `context_switches`,
  `preemptions`, `voluntary_blocks`, `migrations`, placement history, and
  blocked/exited stability probes stay behind `cfg(feature = "measure")`. WFQ
  needs only a per-thread weight and a virtual finish time derived from the
  unconditional vruntime; that mapping is direct. EEVDF additionally needs a
  per-thread request size, lag, eligibility deadline, and an ordered
  eligible-set structure (`BTreeMap` by virtual deadline). The runtime/vruntime
  accounting fields exist, but the eligibility/lag fields do not.
- The target environment is `no_std` plus `spin::Mutex` plus a single
  global scheduler lock. WFQ keeps the eligibility structure as a
  bucketed per-CPU FIFO ordered approximately by virtual finish time;
  that is a familiar `VecDeque`-shaped data structure that mirrors the
  current `run_queue: VecDeque<ThreadRef>` ownership. EEVDF requires an
  ordered set inside the scheduler-lock-protected dispatch state, which
  is a larger structural change than the slice the gap evidence
  motivates.
- Latency-class differentiation (interactive / batch / IPC server) is
  expressible in WFQ; Phase D pins the mapping below in the
  capability-surface section so the implementation slice and the
  short-sleeper smoke have one rule. The Phase H policy service can
  layer richer policy on top without requiring a tree representation
  underneath.
- Linux moved from CFS to EEVDF in mainline 6.6 (released 2023-10);
  WFQ has decades of stable OS lineage. Either choice is defensible.
  The weighted-fair slice does not lock capOS into WFQ permanently —
  the same accounting fields, capability surface, and migration
  contract carry directly into EEVDF when the eligibility structure
  is added.

Rejected alternative: **EEVDF-first.** It is the stronger long-term
policy and Linux's current default. We are not picking it for the first
slice because (1) the eligibility-set data structure is a larger
diff that mixes structural change with the per-CPU enqueue
reintroduction the 1-to-4 gap evidence already motivates; (2) the lag
accounting and request-size ABI are not load-bearing for closing the
single-global-queue contention bottleneck the recorded benchmark
exposes; (3) moving from WFQ to EEVDF is a localized policy-module
change once the capability surface, migration contract, and per-CPU
queue split are accepted. The deferred EEVDF follow-on is tracked as
a later policy-evaluation slice; it is not a Phase D blocker and does
not displace Phase E `SchedulingContext`, which is the next scheduler
authority phase after the accepted WFQ gate.

First-slice scope (smallest implementable surface that closes the
1-to-4 gap):

- per-thread `weight: u16` and `latency_class: LatencyClass` fields,
  default values matching the current single-class FIFO behavior;
  the cap-boundary path rejects `weight = 0` and any nonzero value
  outside `[MIN_WEIGHT, MAX_WEIGHT]` (Phase D constants) with
  `CapException::InvalidArgument` rather than silently clamping, so
  no later divide-by-zero or overflow path can be reached through
  `setWeight` and so callers see policy denial instead of a hidden
  mutation. The `invalidArgument` variant landed in `ExceptionType`
  alongside `SchedulingPolicyCap` and `LatencyClass` with Phase D
  Task 1 (commit `cb8c58b1`, 2026-05-07); see
  `docs/proposals/error-handling-proposal.md` for the updated
  client-response taxonomy. The full validation rule lives in the
  cap-surface authority section below; this bullet records only that
  the validation runs at the cap boundary, not the dispatch path;
- per-thread weighted vruntime charging at runtime-charge points: the
  existing `ThreadCpuAccounting.virtual_runtime_ns` advances by
  `elapsed_ns * REFERENCE_WEIGHT / weight` (instead of the current
  1:1 elapsed) on every `charge_runtime` call. `runtime_ns` continues
  to advance 1:1 with elapsed time so monotonic CPU accounting,
  measure-mode reporting, and snapshot APIs are unchanged. The
  weighted-vruntime change is the actual fairness mechanism; without
  it, weights affect only enqueue-order ties rather than cumulative
  share. This matches the CFS-lineage approach and keeps the WFQ
  derivation `virtual_finish = vruntime + slice * REFERENCE_WEIGHT
  / weight` purely as an ordering aid for the local bucket;
- per-thread `virtual_finish_ns: u64` recomputed at each enqueue from
  `virtual_runtime_ns + slice_ns * REFERENCE_WEIGHT / weight`. It is
  not stored across blocking and is never carried as committed
  state; it is the per-enqueue ordering tag only;
- per-CPU bounded `run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS]`
  (reintroduced) each ordered ascending by `virtual_finish_ns`; local
  selection scans the queue by index for the first
  destination-Runnable entry (RetryLater entries left in place; the
  first Runnable hit is also the lowest `virtual_finish_ns`
  candidate the destination can accept because the queue is
  ordered), then falls back to a bounded steal scan of sibling
  per-CPU queues;
- scheduler-lock-contained migration that keeps
  `virtual_runtime_ns` with the thread (per-thread state, not per-CPU)
  and re-inserts on the destination CPU at the post-migration
  virtual finish time;
- a capability-authorized policy path (see §"Phase D capability surface"
  below) that gates weight/latency-class mutation and reads;
- one-bisect-cycle single-global-queue fallback under
  `CAPOS_SCHED_DISABLE_WFQ=1`, now retired by Phase E preflight
  before `SchedulingContext` schema work.
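
The charging and ordering formulas in the scope list can be worked through in a short sketch. The value `REFERENCE_WEIGHT = 1024`, the `Queued` struct, and the function names are assumptions for illustration; only the two formulas and the ascending `virtual_finish_ns` order come from the text above. The intermediate product is widened to `u128` so the multiplication cannot overflow before the division.

```rust
use std::collections::VecDeque;

/// Assumed value for illustration; the Phase D constant is not stated here.
pub const REFERENCE_WEIGHT: u64 = 1024;

/// `elapsed_ns * REFERENCE_WEIGHT / weight`, widened to avoid overflow.
/// Returns `None` for weight 0, which the cap boundary is meant to reject.
pub fn weighted_delta_ns(elapsed_ns: u64, weight: u16) -> Option<u64> {
    if weight == 0 {
        return None;
    }
    let scaled = (elapsed_ns as u128) * (REFERENCE_WEIGHT as u128) / (weight as u128);
    Some(scaled.min(u64::MAX as u128) as u64)
}

/// Per-enqueue ordering tag: `vruntime + slice * REFERENCE_WEIGHT / weight`.
pub fn virtual_finish_ns(vruntime_ns: u64, slice_ns: u64, weight: u16) -> Option<u64> {
    Some(vruntime_ns.saturating_add(weighted_delta_ns(slice_ns, weight)?))
}

/// Hypothetical queue entry carrying only the ordering tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Queued {
    pub thread_id: u32,
    pub virtual_finish_ns: u64,
}

/// Insert so the queue stays ascending by tag; equal tags keep FIFO order.
pub fn insert_ordered(queue: &mut VecDeque<Queued>, entry: Queued) {
    let idx = queue.partition_point(|q| q.virtual_finish_ns <= entry.virtual_finish_ns);
    queue.insert(idx, entry);
}
```

With these assumed constants, a thread at twice the reference weight accrues virtual runtime at half the wall-clock rate, which is what gives it roughly twice the cumulative share against a reference-weight sibling.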

The first slice is accepted: the `2026-05-10 19:46 UTC`
`make run-thread-scale` evidence pair recorded in `docs/changelog.md`
and `docs/benchmarks.md` passed the harness-enforced 1-to-2 work/total gates,
and Phase D manually accepted the recorded 1-to-4 work/total diagnostics for
closeout. The historical success threshold lives in
`docs/backlog/scheduler-evolution.md`.

### Phase D capability surface (kernel-side authority, no ambient process fields)

Per `docs/capability-model.md` "the interface IS the permission", weight
and latency-class authority is granted by giving a process a
`SchedulingPolicyCap` with the appropriately scoped target. The kernel
rejects any state mutation that does not arrive through such a cap.

The schema landed with Phase D Task 1 (commit `cb8c58b1`, 2026-05-07). The
original sketch took a `target :ThreadHandle` per method, but the
methods carry no target argument because Phase D associates the
target through cap state, not a per-method handle parameter.

Phase D Task 2 (closeout 2026-05-07 22:51 UTC) selected the
**context-derived caller-thread fallback** binding from the three
sketched options. Every method routes to the calling thread,
looked up through `CapCallContext::caller_thread`. The kernel
cap object remains zero-sized (`SchedulingPolicyCap`); routing
moved from `call` to `call_with_context` so the dispatch path
sees the caller's `ThreadRef`. There is no per-cap-object
`ThreadHandle`, no badge-encoded thread id, and no cross-thread
or cross-process mutation in this slice; per-cap-object target
references and badge-encoded thread ids are reserved for the
Phase H privileged scheduler policy service that will need
cross-thread authority. Today the manifest grant path therefore
authorizes the holder's own threads in the strict sense: a
holder cannot reach another thread's `weight` or `latency_class`
through this cap.

Schema:

```capnp
enum LatencyClass {
    interactive @0;
    normal      @1;
    batch       @2;
    ipcServer   @3;
}

interface SchedulingPolicyCap {
    setWeight @0 (weight :UInt16) -> ();
    setLatencyClass @1 (class :LatencyClass) -> ();
    snapshot @2 ()
        -> (weight :UInt16, class :LatencyClass,
            runtimeNs :UInt64, virtualRuntimeNs :UInt64);
}
```

The snapshot return is intentionally narrow: the four fields it
exposes (`weight`, `class`, `runtimeNs`, `virtualRuntimeNs`) are
the ones the WFQ slice promotes out of `cfg(feature = "measure")`
unconditionally. The benchmark-only counters
(`context_switches`, `preemptions`, `voluntary_blocks`,
`migrations`) stay behind the `measure` feature because they are
not load-bearing for ordering and remain useful only for
benchmark instrumentation; a future operator-observability slice
can add them to a separate snapshot cap once a non-emergency-path
storage and reporting surface exists.

Authority rules:

- `setWeight` and `setLatencyClass` are kernel-checked: an SQE
  invocation must carry a live `SchedulingPolicyCap`. The methods
  carry no per-call `ThreadHandle`; the target binding (selected
  in Phase D Task 2) is the **context-derived caller-thread
  fallback**: the kernel routes through
  `CapCallContext::caller_thread`, so a holder can only mutate
  its own running thread by construction. If a future
  cross-process grant lets a holder invoke the cap without authority
  over its bound target, the call fails closed through the
  standard cap-revocation transport-error path (the
  `disconnected`-class `CapException` produced by the ring
  dispatcher when the cap is revoked or stale); the
  `ExceptionType` taxonomy has no `Denied` variant by design.
- `setWeight` validates the input at the cap boundary, not at
  the dispatch path. The validation rule is: `weight = 0` (which
  would make the WFQ derivation `slice_ns * REFERENCE_WEIGHT /
  weight` divide by zero) is rejected with
  `CapException::InvalidArgument`; any nonzero value outside
  `[MIN_WEIGHT, MAX_WEIGHT]` (Phase D constants) is also
  rejected with `CapException::InvalidArgument`. The kernel
  does not silently clamp out-of-range values, because a silent
  clamp masks caller bugs and hides cap-boundary policy from
  the audit surface. The `invalidArgument` variant landed in
  `ExceptionType` with Phase D Task 1 (commit `cb8c58b1`, 2026-05-07);
  the updated client-response taxonomy is in
  `docs/proposals/error-handling-proposal.md`.
- The bootstrap `SchedulingPolicyCap` is granted by manifest only.
  Its initial domain is `Self` (the holder's own threads). Wider
  authority (cross-process weight/class mutation) belongs to the
  Phase H privileged scheduler policy service; Phase D does not
  promise that grant in the default boot manifest. Phase D
  manifests grant only the focused-proof scope needed for the
  test-matrix smokes.
- Default policy: a thread without any explicit cap-driven mutation
  carries `weight = DEFAULT_WEIGHT` and
  `latency_class = LatencyClass::Normal`. Behavior with all defaults
  must preserve the pre-Phase-D default workload behavior at the limit
  (no fairness regressions for unmodified workloads).
- Stale-cap revoke: `SchedulingPolicyCap` mutations carry the
  generation/epoch model used elsewhere. A weight change submitted
  after the cap is revoked fails closed; partially applied changes
  on a thread that exits between SQE arrival and dispatch fail with
  the standard `Stale` outcome and do not leak weight state.
- The cap surface is a single typed interface; restriction is by
  granting a narrower wrapper (e.g., `SchedulingPolicyCap` whose
  authority domain is exactly one `ThreadHandle`). The kernel does
  not carry a parallel rights bitmask.
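
The `setWeight` rule above (reject, never clamp) can be sketched as follows. The numeric bounds, the `ThreadPolicy` struct, and the function names are hypothetical; only the rejection of `0` and of out-of-range values with `CapException::InvalidArgument` is taken from the authority rules.

```rust
/// Placeholder bounds; the real Phase D constants are not stated here.
pub const MIN_WEIGHT: u16 = 16;
pub const MAX_WEIGHT: u16 = 16_384;

/// Reduced stand-in for the cap error type named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CapException {
    InvalidArgument,
}

/// Cap-boundary validation: zero and out-of-range values are rejected.
pub fn validate_weight(weight: u16) -> Result<u16, CapException> {
    if weight == 0 || !(MIN_WEIGHT..=MAX_WEIGHT).contains(&weight) {
        return Err(CapException::InvalidArgument);
    }
    Ok(weight)
}

/// Hypothetical per-thread policy state.
pub struct ThreadPolicy {
    pub weight: u16,
}

/// Apply a weight only after validation, so a rejected call mutates nothing.
pub fn set_weight(target: &mut ThreadPolicy, requested: u16) -> Result<(), CapException> {
    target.weight = validate_weight(requested)?;
    Ok(())
}
```

Because validation runs before the assignment, a rejected request leaves the previous weight in place, so the dispatch path never has to re-check for a zero divisor.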

Latency-class semantics for Phase D (pinned mapping):

- `LatencyClass::Normal` is the baseline; `weight` alone determines
  the WFQ share. The selected `slice_ns` is the Phase D default
  quantum.
- `LatencyClass::Interactive` reduces the per-enqueue slice
  contribution by a Phase D constant
  (`INTERACTIVE_SLICE_DIVISOR`; Phase D Task 2 ships `2`): the
  WFQ derivation becomes
  `vruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) *
  REFERENCE_WEIGHT / weight`. This places the entity earlier in
  the per-CPU queue on each enqueue, so a short-sleeper that wakes
  on a Timer completion runs ahead of a same-weight CPU hog within
  the same scheduling window. The cumulative share is unchanged
  because vruntime accounting still advances at
  `elapsed_ns * REFERENCE_WEIGHT / weight`; the class only affects
  the per-enqueue tag, not the runtime-charge step.
- `LatencyClass::Batch` increases the per-enqueue slice contribution
  by a Phase D constant (`BATCH_SLICE_MULTIPLIER`; Phase D Task 2
  ships `4`): the derivation becomes
  `vruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) * REFERENCE_WEIGHT
  / weight`. This places the entity later in the per-CPU queue on
  each enqueue, so a CPU hog at `LatencyClass::Batch` yields
  wake-to-run latency to `LatencyClass::Normal` and `LatencyClass::Interactive`
  siblings without losing its weighted share over a long window.
- `LatencyClass::IpcServer` is treated identically to
  `LatencyClass::Normal` for the WFQ ordering tag in this slice.
  The class exists in the ABI so a Phase H policy service can
  later re-bind direct-IPC preference, server affinity, or
  scheduling-context donation rules without an ABI break; Phase D
  does not change the existing direct-IPC preference slot
  semantics for this class.
- The class is stored on `Thread` and read at every enqueue. A class
  change through `setLatencyClass` is observed on the next
  enqueue (next dequeue + re-enqueue, or next wake from blocked).
  No retroactive recomputation of an in-queue tag.
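
The pinned mapping above can be read as one function from class to effective slice, followed by the usual tag derivation. In this sketch the divisor `2` and multiplier `4` are the Task 2 values quoted above, while `REFERENCE_WEIGHT = 1024` and the function names are assumptions for illustration.

```rust
/// Mirrors the schema enum; variant names follow the text above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LatencyClass {
    Interactive,
    Normal,
    Batch,
    IpcServer,
}

pub const INTERACTIVE_SLICE_DIVISOR: u64 = 2; // value quoted for Task 2
pub const BATCH_SLICE_MULTIPLIER: u64 = 4; // value quoted for Task 2
pub const REFERENCE_WEIGHT: u64 = 1024; // assumed for illustration

/// Slice contribution used for the per-enqueue tag only.
pub fn effective_slice_ns(slice_ns: u64, class: LatencyClass) -> u64 {
    match class {
        LatencyClass::Interactive => slice_ns / INTERACTIVE_SLICE_DIVISOR,
        LatencyClass::Batch => slice_ns.saturating_mul(BATCH_SLICE_MULTIPLIER),
        // IpcServer orders exactly like Normal in this slice.
        LatencyClass::Normal | LatencyClass::IpcServer => slice_ns,
    }
}

/// `vruntime + effective_slice * REFERENCE_WEIGHT / weight`.
/// Returns `None` for weight 0, which the cap boundary rejects earlier.
pub fn enqueue_tag_ns(
    vruntime_ns: u64,
    slice_ns: u64,
    weight: u16,
    class: LatencyClass,
) -> Option<u64> {
    if weight == 0 {
        return None;
    }
    let slice = effective_slice_ns(slice_ns, class) as u128;
    let delta = (slice * (REFERENCE_WEIGHT as u128)) / (weight as u128);
    Some(vruntime_ns.saturating_add(delta.min(u64::MAX as u128) as u64))
}
```

For equal weight and equal virtual runtime, the tags order as Interactive, then Normal and IpcServer, then Batch. The runtime-charge step is not touched, which is why the cumulative share stays the same across classes.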

Phase D does **not** build the userspace policy service (Phase H). It
adds the kernel-side primitive that Phase H will consume.
`SchedulingContext` (Phase E) is a separate authority for
budget/period/CPU mask; weight/latency-class is the WFQ ordering knob,
not CPU-time authority. The two cap surfaces stay disjoint.

### Phase D migration fairness sketch

A thread migrating from CPU A to CPU B mid-quantum must preserve its
share. Rules:

- `virtual_runtime_ns` is per-thread, not per-CPU. It travels with
  the thread on every migration. The accounting record already
  encodes that (`ThreadCpuAccounting.virtual_runtime_ns` lives on
  `Thread`, not on a CPU slot). Phase D promotes that field
  out of `cfg(feature = "measure")` and changes the
  `charge_runtime` step so the field advances by
  `elapsed_ns * REFERENCE_WEIGHT / weight` rather than 1:1 with
  elapsed time; the migration contract is otherwise unchanged.
- Per-CPU local clocks are not used as a vruntime reference. The
  scheduler reads the global monotonic clocksource through
  `crate::arch::context::monotonic_ns()`, the same source the
  unconditional runtime/vruntime ledger uses. There is no
  per-CPU clock offset because there is no per-CPU vruntime
  reference.
- `virtual_finish_ns` is recomputed at enqueue on the destination
  CPU from the destination weight, not carried as committed state.
  The migration step is remove-from-source, recompute,
  insert-at-destination; the scheduler lock is held for the whole
  window.
- Cross-CPU steal: a CPU whose local queue has no runnable entry
  walks sibling per-CPU queues. For each sibling queue the scan
  walks indices ascending and stops at that queue's first entry
  the destination CPU considers `Runnable`; because each queue is
  ordered ascending by `virtual_finish_ns`, the first Runnable hit
  per queue is the lowest `virtual_finish_ns` candidate the
  destination can accept on that source. The steal target is then
  the source queue whose first-Runnable candidate has the
  **lowest** `virtual_finish_ns` globally — the same fair-share
  rule the local pick uses (most overdue first) — with ties broken
  by lower CPU id. The chosen entry is removed from its actual
  position on the source queue (not necessarily the head: a
  RetryLater or single-CPU-owner thread may sit at the front and
  stay there); the destination recomputes `virtual_finish_ns` and
  inserts at the destination ordered position. The steal is
  allocation-free because both queues are pre-reserved against the
  live runnable count.
- The `ThreadCpuAccounting.migrations` counter is incremented on
  each cross-CPU enqueue, both for placement-time spread and for
  steal. The behavior mirrors the prior pre-collapse counter; the
  Phase D slice keeps it under `cfg(feature = "measure")` until a
  permanent operator snapshot path lands.
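
The steal rule above can be sketched as a scan that takes each sibling queue's first destination-Runnable entry and keeps the lowest `virtual_finish_ns`, preferring the lower CPU id on ties. The `Entry` struct and its `allowed_cpus` bitmask are hypothetical stand-ins for the kernel's Runnable/RetryLater decision. The sketch assumes fewer than 64 CPUs and does not model the scan bound or the tag recomputation on the destination.

```rust
use std::collections::VecDeque;

/// Hypothetical queue entry; `allowed_cpus` models destination eligibility.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Entry {
    pub thread_id: u32,
    pub virtual_finish_ns: u64,
    pub allowed_cpus: u64,
}

impl Entry {
    /// True if the destination CPU would treat this entry as Runnable.
    fn runnable_on(&self, cpu: usize) -> bool {
        (self.allowed_cpus & (1u64 << cpu)) != 0
    }
}

/// Pick `(source_cpu, index)` of the steal target for `dst`, if any.
/// Each queue is assumed to be ordered ascending by `virtual_finish_ns`.
pub fn pick_steal(queues: &[VecDeque<Entry>], dst: usize) -> Option<(usize, usize)> {
    let mut best: Option<(u64, usize, usize)> = None;
    for (cpu, q) in queues.iter().enumerate() {
        if cpu == dst {
            continue;
        }
        // First Runnable hit is the lowest acceptable tag on this source.
        if let Some(idx) = q.iter().position(|e| e.runnable_on(dst)) {
            let tag = q[idx].virtual_finish_ns;
            // Strict `<` with ascending CPU iteration keeps the lower CPU id on ties.
            if best.map_or(true, |(b, _, _)| tag < b) {
                best = Some((tag, cpu, idx));
            }
        }
    }
    best.map(|(_, cpu, idx)| (cpu, idx))
}

/// Remove the chosen entry from its actual position on the source queue.
pub fn steal(queues: &mut [VecDeque<Entry>], dst: usize) -> Option<Entry> {
    let (src, idx) = pick_steal(queues, dst)?;
    queues[src].remove(idx)
}
```

An entry the destination cannot accept stays at the front of its source queue, so the steal may take an entry from the middle of a queue without disturbing the entries ahead of it.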

The one-bisect-cycle single-global-queue fallback has been retired
before Phase E. The accepted Phase D behavior is now always the
per-CPU WFQ queue shape described above.

### Phase D test matrix

Workload shapes the implementation slice verified before close:

- **CPU hogs (existing `make run-thread-scale`).** Equal-weight
  same-process threads must split CPU share within bench tolerance.
  Different-weight threads must split CPU share approximately in
  proportion to weights (e.g., weights `2:1` → roughly `2:1`
  runtime ratio). Phase D manually accepted the recorded 1-to-4 diagnostic at
  `3.088x` work speedup versus the recorded `1.566x` baseline.
- **Short sleepers.** Threads that block on `Timer.sleep` for short
  intervals must preempt CPU hogs within a one-quantum bound after
  wake. Latency-class `Interactive` should have lower
  observed wake-to-run latency than latency-class `Batch`. Phase D
  closed this with focused `make run-thread-fairness` and
  `make run-thread-fairness-interactive` QEMU smokes.
- **Direct IPC server/client pairs (existing `make run-spawn`).**
  An IPC server thread woken by an endpoint CALL must keep
  paired-call timing comparable to the current direct-IPC handoff.
  The direct-IPC preference slot must keep its existing
  generation-checked semantics under WFQ; a server should not
  starve when the global vruntime advances on other CPUs.
- **Multi-process load (existing `make run-smp-process-scale`).**
  Independent worker processes with default weights must preserve
  the recorded `2026-04-30` `1.6x` 1-to-2 gate. WFQ across processes
  (no shared address space) must not regress that proof.
- **Same-process sibling load.** This is the same workload shape
  as `make run-thread-scale`; it doubles as the per-CPU-queue
  reintroduction proof.

The exact historical per-workload acceptance numbers live in
`docs/backlog/scheduler-evolution.md`.

### Phase D overload behavior

Soft overload (the combined weighted demand of runnable entities exceeds
the selected CPU set's capacity):

- Each entity gets less than its weighted share. No entity is
  starved; vruntime ordering guarantees that the most-behind
  thread runs next.
- The scheduler does not refuse to enqueue. Phase D's WFQ does
  not implement strict admission; that belongs to Phase E
  (`SchedulingContext` budget/period) and Phase G (`RealtimeIsland`
  admission).

Hard overload (e.g., a `RealtimeIsland` admission attempt that
collides with an active `CpuIsolationLease`):

- Use the existing isolation/admission path; Phase D defers to
  Phase F's `CpuIsolationLease` and Phase G's `RealtimeIsland`
  for that behavior. WFQ continues to schedule best-effort work
  on the housekeeping CPU set.
- If an isolation lease holds CPU N and N has runnable best-effort
  work that cannot migrate (e.g., bound by manifest pinning), the
  lease attempt fails closed; existing CPU-mask validation
  remains the gate. Phase D does not introduce new pinning policy.

Strict admission, deadline overrun, and budget depletion are
explicitly out of scope for Phase D and stay in Phase E/G.

## Stage 4: Scheduling Contexts

CPU-time authority becomes a capability. `SchedulingContext` records budget,
period, relative deadline, priority or criticality, CPU mask, remaining
budget, replenishment state, timeout endpoint, and overrun policy.

The landed Phase E slices remain narrower than the full target above. The ABI
now has `SchedulingContextSpec` authority inputs for `budgetNs`, `periodNs`,
`relativeDeadlineNs`, byte-oriented `cpuMask`, and `overrunPolicy`, plus a
read-only `SchedulingContextInfo` snapshot with context identity, lifecycle
state, binding state, remaining budget, and an explicit dispatch-effect label.
`SchedulingContext.info()` remains method id 0. `SchedulingContext.create()`
creates a same-interface result cap for a validated spec,
`bindCallerThread()` records one caller-thread binding for the current
generation, and `revoke()` advances the generation and clears the matching
thread metadata binding. Bootstrap-granted contexts and contexts returned by
`create()` draw from the same non-wrapping context-id allocator, so the
`(contextId, generation)` binding key does not alias distinct cap objects.
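
A minimal sketch of the `(contextId, generation)` binding key may help show why the non-wrapping allocator matters. All type and method names below are hypothetical stand-ins for the ABI methods named above, and the `AlreadyBound` rejection is an assumption about how "one caller-thread binding" is enforced.

```rust
/// Non-wrapping id allocator: it fails closed instead of ever reusing an id.
pub struct ContextIdAllocator {
    pub next: u64,
}

impl ContextIdAllocator {
    pub fn new() -> Self {
        Self { next: 1 }
    }

    /// Returns `None` once the id space is exhausted (no wraparound).
    pub fn alloc(&mut self) -> Option<u64> {
        let id = self.next;
        self.next = self.next.checked_add(1)?;
        Some(id)
    }
}

/// The key a cap carries; it must match the live context exactly.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BindingKey {
    pub context_id: u64,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BindError {
    StaleGeneration,
    AlreadyBound, // assumption: a second binding is rejected
}

pub struct Context {
    pub key: BindingKey,
    pub bound_thread: Option<u32>,
}

impl Context {
    /// Record one caller-thread binding for the current generation.
    pub fn bind_caller_thread(&mut self, cap: BindingKey, thread: u32) -> Result<(), BindError> {
        if cap != self.key {
            return Err(BindError::StaleGeneration);
        }
        if self.bound_thread.is_some() {
            return Err(BindError::AlreadyBound);
        }
        self.bound_thread = Some(thread);
        Ok(())
    }

    /// Advance the generation and clear the matching thread binding.
    pub fn revoke(&mut self, cap: BindingKey) -> Result<(), BindError> {
        if cap != self.key {
            return Err(BindError::StaleGeneration);
        }
        self.key.generation += 1;
        self.bound_thread = None;
        Ok(())
    }
}
```

Because ids are never reused, a cap minted before `revoke()` can only ever compare unequal to the live key, so it cannot alias a different context object.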

Bound active contexts now install a fixed per-thread dispatcher budget ledger:
runtime charge decrements `remainingBudgetNs`, runnable selection replenishes
elapsed periods, and exhausted contexts remain queued but ineligible until the
next replenishment period. The effect label is `budgetEnforced` for active
contexts and stays `infoOnlyNoDispatchChange` for stale/revoked fail-closed
paths. Deadline-driven accounting now arms a sub-tick budget-exhaustion
one-shot when the selected thread's remaining budget would deplete before the
next periodic scheduler tick, and nohz re-arm folds the leased thread's budget
deadline into its existing nearest-deadline timer. When a budget one-shot
fires in kernel mode, the handler restores a live periodic timer before
returning to kernel code, so the ordinary and tick-masked paths no longer rely
on a full tick quantum to observe budget depletion.
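
The ledger behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the kernel's code: the struct layout and method names are hypothetical, and replenishment is assumed to reset to the full budget with no carry-over of unused time.

```rust
/// Hypothetical per-thread budget ledger; names are illustrative only.
#[derive(Debug)]
pub struct BudgetLedger {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub remaining_budget_ns: u64,
    pub period_start_ns: u64,
}

impl BudgetLedger {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        Self { budget_ns, period_ns, remaining_budget_ns: budget_ns, period_start_ns: now_ns }
    }

    /// Runtime charge decrements the remaining budget, saturating at zero.
    pub fn charge(&mut self, ran_ns: u64) {
        self.remaining_budget_ns = self.remaining_budget_ns.saturating_sub(ran_ns);
    }

    /// Called at runnable selection: skip over whole elapsed periods.
    /// Assumption: a new period restores the full budget (no carry-over).
    pub fn replenish(&mut self, now_ns: u64) {
        if self.period_ns == 0 {
            return; // invalid spec; leave the ledger untouched
        }
        let elapsed = now_ns.saturating_sub(self.period_start_ns) / self.period_ns;
        if elapsed > 0 {
            self.period_start_ns += elapsed * self.period_ns;
            self.remaining_budget_ns = self.budget_ns;
        }
    }

    /// An exhausted context stays queued but is not eligible to run.
    pub fn eligible(&self) -> bool {
        self.remaining_budget_ns > 0
    }

    /// Deadline for a sub-tick one-shot, if the budget would deplete
    /// before the next periodic tick; otherwise the tick is enough.
    pub fn one_shot_deadline(&self, now_ns: u64, next_tick_ns: u64) -> Option<u64> {
        let depletes_at = now_ns.checked_add(self.remaining_budget_ns)?;
        (self.remaining_budget_ns > 0 && depletes_at < next_tick_ns).then_some(depletes_at)
    }
}
```

For example, a 2 ms budget in a 10 ms period that has already consumed 1.5 ms at `t = 1.5 ms` would request a one-shot at `t = 2 ms` rather than waiting for a tick at `t = 10 ms`.
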

Synchronous endpoint donation/return now covers passive receiver threads.
Endpoint in-flight state carries an internal donation token, and receiver
runtime charges to the caller-donated context. RETURN, application-exception
RETURN, or invalid-result RETURN restores the reduced budget to the caller
before caller wake. A donor with an in-flight token is blocked from returning
to userspace until RETURN/cancel, using an atomic marker-to-block transition
that treats already-returned fast paths as normal completion. Nested donation
of an already donated context is rejected until stacked return tokens have a
dedicated design.

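
The donation round trip can be pictured with a small sketch. The types and functions here are hypothetical; the real token is internal endpoint in-flight state, and the donor-blocking transition is not modeled.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DonateError {
    AlreadyDonated,
}

/// Hypothetical caller-side view of a bound context.
pub struct CallerContext {
    pub remaining_budget_ns: u64,
    pub donated: bool,
}

/// Hypothetical token held by the endpoint while the call is in flight.
pub struct DonationToken {
    pub budget_ns: u64,
}

/// CALL: move the caller's remaining budget into an in-flight token.
/// Nested donation of an already donated context is rejected.
pub fn donate(ctx: &mut CallerContext) -> Result<DonationToken, DonateError> {
    if ctx.donated {
        return Err(DonateError::AlreadyDonated);
    }
    ctx.donated = true;
    Ok(DonationToken { budget_ns: ctx.remaining_budget_ns })
}

/// Receiver runtime is charged against the donated budget.
pub fn charge_receiver(token: &mut DonationToken, ran_ns: u64) {
    token.budget_ns = token.budget_ns.saturating_sub(ran_ns);
}

/// RETURN: the caller gets back the reduced budget, not a fresh one.
pub fn return_donation(ctx: &mut CallerContext, token: DonationToken) {
    ctx.remaining_budget_ns = token.budget_ns;
    ctx.donated = false;
}
```

Consuming the token by value in `return_donation` is a deliberate choice in this sketch: it makes a double RETURN a compile-time error rather than a runtime check.
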

Timeout/depletion notifications now use fixed per-context cells allocated at
context creation/bootstrap. The cells coalesce budget-depleted and
deadline-or-timeout events with typed sequence/count metadata, holder identity,
remaining budget, next timestamp, donated-holder marking, explicit-revoke
lifecycle state, and `ok`/`revoked`/`staleGeneration` observer results through
`SchedulingContext.drainNotifications()`. Notification publishing does not
allocate in scheduler hard paths, publish result caps, append to unbounded
queues, donate budget, reorder runnable entities, bypass throttling, or imply
nohz behavior. A pre-armed observer waiter/wakeup path, realtime admission,
SQPOLL, nohz, and CPU placement enforcement remain future work.

Stale caps report `staleGeneration` and cannot mutate the new generation's
scheduler metadata or budget ledger; revoked contexts report `revoked`.
Ordinary non-donated session logout now uses the same stale-generation rule:
after `UserSession.logout()` flips the liveness cell, the scheduler removes
matching non-donated bound thread contexts and marks the old cap generation
stale. The focused session-context proof covers stale `info`,
`bindCallerThread`, `create`, `revoke`, and notification-drain behavior without
result-cap publication or metadata mutation.

Donated receiver logout keeps the conservative skip policy: if logout observes
a receiver thread holding an endpoint-donated context, the hook counts the
skipped donated binding and leaves the donor blocked until endpoint
RETURN/cancel commits cleanup. The focused session-context proof covers the
RETURN case by showing the receiver logs out while holding the donation, the
donor stays blocked, the hook reports `donation_inflight_skipped=1`, and the
caller observes a bound context with reduced remaining budget after RETURN
rather than fresh budget.

Clean local owner-shell exit now calls the held `UserSession.logout()` before
process exit, and the shell smoke observes the same scheduler hook with no
bound local shell `SchedulingContext`.
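
The fixed per-context notification cells described earlier in this section can be sketched as a preallocated, coalescing record. The field set below is a reduced, hypothetical subset of the metadata listed above, and clearing the counts on drain is an assumption.

```rust
/// Hypothetical coalescing cell, preallocated once per context.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct NotificationCell {
    pub sequence: u64,
    pub budget_depleted_count: u32,
    pub deadline_or_timeout_count: u32,
    pub remaining_budget_ns: u64,
}

pub enum Event {
    BudgetDepleted,
    DeadlineOrTimeout,
}

impl NotificationCell {
    /// Publish from the scheduler path: only in-place updates, so no
    /// allocation and no unbounded queue growth.
    pub fn publish(&mut self, event: Event, remaining_budget_ns: u64) {
        self.sequence = self.sequence.wrapping_add(1);
        match event {
            Event::BudgetDepleted => {
                self.budget_depleted_count = self.budget_depleted_count.saturating_add(1);
            }
            Event::DeadlineOrTimeout => {
                self.deadline_or_timeout_count =
                    self.deadline_or_timeout_count.saturating_add(1);
            }
        }
        self.remaining_budget_ns = remaining_budget_ns;
    }

    /// Observer side: take a snapshot and clear the coalesced counts.
    /// The sequence number is kept so observers can detect missed drains.
    pub fn drain(&mut self) -> NotificationCell {
        let snapshot = *self;
        self.budget_depleted_count = 0;
        self.deadline_or_timeout_count = 0;
        snapshot
    }
}
```

A burst of events between two drains costs the observer one read, at the price of losing per-event ordering; that trade is what keeps the publish side bounded.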

`cpuMask` is a canonical little-endian bitset. CPU `n` maps to bit `n % 8` of
byte `n / 8`, with bit 0 as the least-significant bit of each byte. Empty data
means no CPUs are selected, not "all CPUs"; future admission/bind validation
rejects empty masks for runnable contexts. Producers omit trailing zero bytes:
the all-zero set is encoded as empty, and any non-empty canonical mask has a
nonzero final byte. This slice only snapshots that shape and does not enforce
placement from it.
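
The encoding rule above is small enough to state as code. The helper names below are hypothetical; only the bit layout and the trailing-zero rule come from the text.

```rust
/// Encode a CPU set as a canonical little-endian bitset.
/// CPU `n` maps to bit `n % 8` of byte `n / 8`.
pub fn encode_cpu_mask(cpus: &[u32]) -> Vec<u8> {
    let mut bytes: Vec<u8> = Vec::new();
    for &cpu in cpus {
        let index = (cpu / 8) as usize;
        if bytes.len() <= index {
            bytes.resize(index + 1, 0);
        }
        bytes[index] |= 1u8 << (cpu % 8);
    }
    // The vector only grows to hold a set bit, so the final byte is
    // nonzero and the empty set encodes as empty data.
    bytes
}

/// A mask is canonical when it has no trailing zero bytes.
pub fn is_canonical(mask: &[u8]) -> bool {
    mask.last().map_or(true, |&byte| byte != 0)
}

/// Bytes past the end of the data are treated as zero (CPU not selected).
pub fn contains_cpu(mask: &[u8], cpu: u32) -> bool {
    mask.get((cpu / 8) as usize)
        .map_or(false, |&byte| byte & (1u8 << (cpu % 8)) != 0)
}
```

For instance, the set `{0, 9}` encodes as the two bytes `0x01, 0x02`, while `0x01, 0x00` describes the set `{0}` but is not canonical because of its trailing zero byte.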

Remaining kernel responsibilities:

- prevent a thread without eligible CPU authority from running;
- charge runtime to exactly one authority target;
- add any pre-armed timeout/depletion observer wake path without allocating in
  emergency paths.

Policy-service responsibilities:

- admit or reject scheduling contexts;
- choose budget/period/priority;
- bind contexts to threads/services;
- revoke or adjust contexts safely;
- record operator-visible decisions.

`SQE.deadline_ns` remains request metadata. It may influence drop, freshness,
propagation, and telemetry, but it does not grant CPU budget.

## Stage 5: CPU Isolation Leases and SQPOLL

`CpuIsolationLease` grants placement and exclusivity, not CPU time. It records
the owner process/session/service, CPU set, mode, housekeeping exclusions,
accounting target, maximum revocation latency, and revoke endpoint.
The current Phase F implementation keeps ticks periodic but makes
housekeeping/deferred-work placement explicit: at least one online scheduler
housekeeping CPU must remain outside active lease candidates, and preflight
telemetry routes or rejects deferred cleanup, timer/deadline, network polling,
IRQ affinity, scheduler accounting, and cleanup latency before later SQPOLL or
nohz behavior can use the lease.

The Phase F substrate landed so far is:

- the one-SQ-consumer ring-ownership prerequisite that lets nohz/SQPOLL
  reason about a single submission consumer per ring;
- nohz activation telemetry that labels admit/reject decisions, rollback
  reasons, and current periodic-tick fallback state without changing
  dispatch behavior;
- housekeeping/deferred-work placement preflight, which fail-closes when
  unrelated timers, deferred cleanup, network polling, debug/watchdog work,
  or IRQ delivery would otherwise be pinned to a candidate isolated CPU;
- a bounded SQPOLL ring-mode worker (`MAX_SQPOLL_WORKERS = 16`) that
  records `tick_suppression=disabled` / `full_nohz=disabled` strings while
  the activation proof is still open, with generation-checked stale-owner
  rollback;
- a clockevent/deadline substrate independent of the periodic tick, so the
  scheduler can express "wake at deadline T" without depending on periodic
  ticks to enforce budget;
- a bounded non-periodic SQPOLL producer-wake progress path that lets a
  parked SQPOLL worker make forward progress on producer activity without
  reverting to a periodic tick.

Automatic nohz activation -- actually suppressing the periodic scheduler
tick on an admitted CPU and restoring it on rollback/revoke/stale
generation -- was closed for the first increment via
[`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-activation.md):
the `CpuIsolationLease` preflight now performs
real per-CPU periodic-tick suppression for the narrow single-runnable-entity
window, satisfying proof obligations for single runnable entity on the
target CPU, ready housekeeping CPU outside the lease, non-local
deferred-cleanup/timer/network/IRQ dependencies, valid accounting target,
bounded revocation latency, and generation-checked ring ownership, with
fail-closed rollback. SQPOLL-driven auto-nohz activation is also closed via
[`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-sqpoll.md):
a ring-coupled `kernelSqpoll` lease whose bound ring is in SQPOLL
running/sleeping mode with a live owner is admitted for tick suppression,
with the SQPOLL ring-state re-check as the decisive rollback gate. The
`tick_suppression`, `auto_nohz`, and `sqpoll` telemetry counters reflect
real suppression. Generic full-nohz for ordinary budgeted compute threads is
now admitted by explicit `SchedulingContext`-targeted `CpuIsolationLease`
preflight; production realtime island admission remains deferred independently
of these closed tasks.

Activation requires scheduler proof:

- at least one housekeeping CPU remains online;
- unrelated timers, deferred cleanup, network polling, and debug/watchdog work
  are not pinned to the isolated CPU;
- the active ring has exactly one SQ consumer;
- the accounting target is valid and chargeable;
- revocation latency fits the lease policy.
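
The proof obligations above read naturally as a fail-closed preflight: any unmet condition rejects activation with a labeled reason. The sketch below is illustrative only; the `Preflight` fields and `Reject` variants are hypothetical names, not the kernel's telemetry labels.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Reject {
    NoHousekeepingCpu,
    ForeignWorkPinned,
    SqConsumerCount,
    AccountingTarget,
    RevocationLatency,
}

/// Hypothetical snapshot of the facts the scheduler must prove.
#[derive(Debug, Clone, Copy)]
pub struct Preflight {
    pub online_housekeeping_cpus: u32,
    /// Unrelated timers, deferred cleanup, network polling, or
    /// debug/watchdog work pinned to the candidate isolated CPU.
    pub foreign_work_pinned_to_target: bool,
    pub sq_consumers: u32,
    pub accounting_target_chargeable: bool,
    pub worst_case_revocation_ns: u64,
    pub lease_max_revocation_ns: u64,
}

/// Fail closed: the first unmet obligation rejects activation.
pub fn check_activation(p: &Preflight) -> Result<(), Reject> {
    if p.online_housekeeping_cpus == 0 {
        return Err(Reject::NoHousekeepingCpu);
    }
    if p.foreign_work_pinned_to_target {
        return Err(Reject::ForeignWorkPinned);
    }
    if p.sq_consumers != 1 {
        return Err(Reject::SqConsumerCount);
    }
    if !p.accounting_target_chargeable {
        return Err(Reject::AccountingTarget);
    }
    if p.worst_case_revocation_ns > p.lease_max_revocation_ns {
        return Err(Reject::RevocationLatency);
    }
    Ok(())
}
```

Returning a distinct reason per obligation keeps rejection telemetry specific enough to tell an operator which precondition to fix.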

The scheduler idle path is now a per-CPU CPL0 (kernel-mode) idle thread;
the user-mode idle process was removed in commit e3c0df01 (2026-05-14 UTC).
There are two CPL0 idle paths: the cooperative boot/AP path that `hlt`s at
CPL0 on the per-CPU kernel stack, and the steady-state idle-thread path
reached from the four dispatch sites (`schedule`, `capos_block_current_syscall`,
`exit_current`, `exit_current_thread`). Both are described in detail in
[Scheduling](../architecture/scheduling.md).

SQPOLL uses the ring-mode contract in
[Tickless and Realtime Scheduling](tickless-realtime-scheduling-proposal.md).
The scheduler proposal adds the CPU-ownership and policy-service side of that
contract.

## Stage 6: Realtime Islands

A `RealtimeIsland` is an admitted graph, not a single priority. It records
scheduling contexts, memory reservations, device and IRQ reservations,
rings/endpoints/notifications, any CPU isolation leases, admission evidence,
and overrun/shutdown policy.

Use cases include local audio, realtime voice, robotics control, and selected
provider/runtime loops. Admission must fail closed if the graph cannot fit
the declared period/quantum and reservations.
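
As one illustration of "fail closed if the graph cannot fit", the sketch below applies a plain utilization bound to the island's CPU reservations. This is an assumed, necessary-only capacity check with hypothetical names; it is not the proposal's admission algorithm, and it ignores memory, device, and IRQ reservations entirely.

```rust
/// Hypothetical CPU reservation taken from one scheduling context.
pub struct Reservation {
    pub budget_ns: u64,
    pub period_ns: u64,
}

/// Fixed-point scale: utilization is tracked in parts per million.
const PPM: u128 = 1_000_000;

/// Admit only if total utilization fits the island's CPU count.
/// Each term is rounded up so rounding can never admit an overload.
pub fn admit(reservations: &[Reservation], cpus: u32) -> bool {
    let mut total_ppm: u128 = 0;
    for r in reservations {
        // Malformed specs fail closed rather than being skipped.
        if r.period_ns == 0 || r.budget_ns > r.period_ns {
            return false;
        }
        let budget = r.budget_ns as u128;
        let period = r.period_ns as u128;
        total_ppm += (budget * PPM + period - 1) / period;
    }
    total_ppm <= cpus as u128 * PPM
}
```

Passing this check does not prove the graph is schedulable; it only rules out islands that cannot fit under any policy, which is the cheapest place to reject.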

## Stage 7: User-Space Scheduler Policy

After kernel primitives are in place, a privileged scheduler policy service can
own:

- default resource profiles;
- session/account/service CPU policy;
- scheduling-context admission;
- CPU lease grant/revoke;
- runtime hints such as latency-sensitive, batch, driver, poller, or agent;
- AutoNoHz placement for ordinary threads that appear capable of utilizing
  a full CPU core (see
  [Policy-Service Userstories in tickless-realtime-scheduling-proposal](tickless-realtime-scheduling-proposal.md#policy-service-userstories-autonohz-placement-for-compute-capable-threads));
- operator-facing diagnostics and policy reload.

AutoNoHz placement is the policy-service surface that turns the "thread
appears capable of utilizing a full CPU core" observation into a bounded
`CpuIsolationLease` against a pre-authorized account or session CPU pool. The
lease adds isolation; it does not mint CPU-time authority. The thread still
consumes time through its existing `SchedulingContext` (or coarse
`ResourceLedger`); the lease just removes tick and scheduler noise while that
budget is being consumed. Bounds the policy service must enforce on every
auto-issued lease -- lifetime, revocation latency, accounting target,
auto-claim pool capacity, and fairness preemption -- are detailed in the
tickless proposal.

The kernel still owns emergency fallback. If the policy service is dead,
blocked, stale, or malicious, the kernel must continue to enforce safety,
revoke leases as policy permits, and schedule a minimal recovery path.

## Validation Gates

- Per-CPU queue work must preserve `run-smoke`, `run-spawn`,
  `run-thread-scale`, park/ring/process-exit smokes, and SMP smokes.
- A thread-scale milestone closeout must include repeated controlled
  `capos-bench` evidence and raw logs.
- CPU accounting must include sanity tests showing that measured runtime
  increases monotonically while a thread runs and stops increasing while it is
  blocked.
- Fair policy changes must include adversarial tests: CPU hogs, short
  sleepers, direct IPC handoff, multi-process load, and same-process sibling
  load.
- Scheduling-context work must include admission rejection, budget depletion,
  replenishment, endpoint donation/return, timeout notification, stale cap
  revocation tests, and any future pre-armed notification waiter coverage.
- CPU leases must include revocation, process exit, session close, and
  housekeeping fallback tests.
- Realtime island proofs must show preallocation, no allocation/blocking on
  admitted paths, deadline miss telemetry, and fail-closed overrun behavior.
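
The CPU accounting gate above can be made concrete with a small model. The `ThreadAccounting` type is hypothetical and stands in for whatever per-thread runtime counter the kernel exposes; a real proof would sample the kernel's counter rather than this model.

```rust
/// Hypothetical per-thread runtime accounting model.
#[derive(Debug, Default)]
pub struct ThreadAccounting {
    pub accumulated_ns: u64,
    pub running_since_ns: Option<u64>,
}

impl ThreadAccounting {
    /// The thread was dispatched onto a CPU at `now_ns`.
    pub fn on_dispatch(&mut self, now_ns: u64) {
        if self.running_since_ns.is_none() {
            self.running_since_ns = Some(now_ns);
        }
    }

    /// The thread blocked at `now_ns`; fold the open interval into the total.
    pub fn on_block(&mut self, now_ns: u64) {
        if let Some(start) = self.running_since_ns.take() {
            self.accumulated_ns += now_ns.saturating_sub(start);
        }
    }

    /// Measured runtime as observed at `now_ns`.
    pub fn runtime_ns(&self, now_ns: u64) -> u64 {
        match self.running_since_ns {
            Some(start) => self.accumulated_ns + now_ns.saturating_sub(start),
            None => self.accumulated_ns,
        }
    }
}
```

A sanity test then samples `runtime_ns` twice while the thread runs (the second sample must be larger) and twice while it is blocked (the samples must be equal).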

## Open Decisions

- ~~Whether the first best-effort fair policy should be weighted fair queueing or
  direct EEVDF.~~ Resolved 2026-05-05 19:00 UTC: WFQ first; EEVDF is a deferred
  follow-on. See "Phase D first-policy decision" above.
- Whether scheduling-context priority is a scalar, a criticality band, or both.
- Whether `SchedulingContext` should be bindable to a process default,
  individual thread, endpoint call path, or all three in the first ABI.
- Which scheduler telemetry is permanent ABI and which is benchmark-only.
- How much policy-service state belongs in the boot manifest versus mutable
  operator configuration.
- Whether the WFQ slice's bucketed `VecDeque` per-CPU queue is the long-term
  representation or a stepping stone to an EEVDF `BTreeMap`-based eligibility
  set. EEVDF is an evaluated follow-on policy, not a committed migration;
  re-evaluate only when the explicit Phase D follow-on EEVDF migration backlog
  item is selected.
- Phase F's one-SQ-consumer prerequisite, nohz telemetry,
  housekeeping/deferred-work placement, bounded SQPOLL ring mode,
  clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake
  progress have landed on top of the closed Phase E `SchedulingContext` gate;
  the first automatic nohz activation increment is also closed via
  [`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-activation.md)
  and SQPOLL-driven auto-nohz activation is closed via
  [`docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-sqpoll.md);
  timeout-based auto-revoke and ordinary-thread generic full-nohz admission are
  also landed. The policy-service AutoNoHz capstone and generic SQPOLL nohz for
  arbitrary rings remain open.
  Phase F.5 (full-SMP 16/32-core scalability) is still in planning.