# Scheduler Evolution Backlog

This backlog breaks the future scheduler architecture from
[Scheduler Evolution](../proposals/scheduler-evolution-proposal.md) into phased
work items. It also
retains the completed attribution and placement history that closed the
**In-Process Threading Scalability** milestone; new selected-milestone work now
continues from `docs/tasks/README.md`.

## Design Grounding Checklist

Before starting an implementation slice, read:

- `docs/architecture/scheduling.md`
- `docs/backlog/smp-phase-c.md`
- `docs/proposals/smp-proposal.md`
- `docs/proposals/ring-v2-smp-proposal.md`
- `docs/proposals/tickless-realtime-scheduling-proposal.md`
- `docs/proposals/stateful-task-job-graphs-proposal.md`
- `docs/proposals/scheduler-evolution-proposal.md`
- `docs/proposals/system-performance-benchmarks-proposal.md`
- `docs/proposals/hpc-parallel-patterns-proposal.md`
- `docs/research/future-scheduler-architecture.md`
- `docs/research/nohz-sqpoll-realtime.md`
- `docs/research/out-of-kernel-scheduling.md`
- `docs/research/completion-ring-threading.md`
- `docs/research/hpc-parallel-patterns.md`

For realtime or isolation slices, also read:

- `docs/research/multimedia-pipeline-latency.md`
- `docs/research/robotics-realtime-control.md`
- `docs/research/x2apic-and-virtualization.md`

## Phase A: Attribution and Guardrails

- [x] Finish first-pass thread-scale attribution guardrails. Scheduler
      candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock, timer
      interrupt, CR3/TLB, raw guest-PC sample, logging-suppression A/B, exact
      Linux pthread baseline, compact-versus-padded result-slot diagnostic, and
      larger-workload/Amdahl evidence now exist. The evidence does not identify
      the primary remaining non-scaling cause; it keeps per-CPU runnable
      ownership, accepted threshold-passing work/total evidence, and optional
      symbolic attribution as follow-on work.
- [x] Add bounded scheduler-lock site attribution before a structural lock
      split. As of 2026-05-01 09:52 UTC, measure builds keep the compatible
      aggregate `scheduler_lock` line and also emit aggregate plus per-phase
      `scheduler_lock_site` counters for generic, timer pre-ring, timer
      select, blocking, process exit, thread exit, start/idle selection,
      wake/unblock, and metadata classes. This is split-prep attribution only;
      it does not accept the in-process thread-scale milestone.
- [x] Add timer-fast-path attribution for the bounded continuation path. As of
      2026-05-01 10:58 UTC, measure builds extend the aggregate and per-phase
      `timer` counter lines with fast-path attempts, continues, and fallback
      reasons for slow-required/dirty summaries, skip-budget exhaustion,
      pending reschedule IPIs, no-current/non-idle CPUs, and inactive/invalid
      scheduler CPUs. The thread-scale harness requires those fields only for
      `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`. This is attribution only; it does
      not change scheduler behavior and does not close the current
      `accepted=false` work or total gates. Local one-run evidence in
      `target/thread-scale/20260501T110157Z/` passed with the new fields
      present in every 1/2/4-thread `measure.log`; the timed work phase
      recorded `fast_path_continues=0` for all three rows.
- [x] Add timer slow-summary reason attribution for dirty fast-path summaries.
      As of 2026-05-01 11:28 UTC, measure builds emit aggregate and per-phase
      `timer_slow_summary` lines with required/clean counts plus reason fields
      for nonempty run queues, direct IPC targets, handoff-current state,
      pending process termination/drop/stack release, timer sleeps, and timed
      cap-enter versus park waiters. The harness requires those lines only for
      `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`. Local one-run evidence in
      `target/thread-scale/20260501T112359Z/` passed with the new lines present
      in every 1/2/4-thread `measure.log`; the timed work phase reported
      dirty summaries attributable to `run_queue_nonempty` and
      `handoff_current` only, with `required=2/4/8`, `clean=0`, and timer
      sleeps/timed waiters at zero for the 1/2/4-thread rows. The subsequent
      fairness-only behavior slice keeps the same fields, but `required` now
      means that direct IPC, deferred cleanup, timer sleeps, or timed-waiter
      work still forces the next locked timer pass.
- [x] Complete thread-scale shared-kernel-state contention attribution
      guardrails beyond the first measure-only lock-counter slice. As of
      2026-05-01 08:07 UTC,
      `CAPOS_THREAD_SCALE_GUEST_MEASURE=1` emits aggregate and per-phase
      `shared_kernel_lock` counters for frame allocator alloc/free locks,
      ring-dispatch cap-table and ring-scratch locks before
      `cap::ring::process_ring`, endpoint inner/cancellation scratch locks,
      direct per-process address-space locks, and heap allocator locks.
      As of 2026-05-01 08:29 UTC, fresh thread-scale rows also carry explicit
      benchmark-class fields and the harness requires, validates, and exports
      those fields to `results.csv`; local one-run evidence is retained in
      `target/thread-scale/20260501T083254Z/`. As of 2026-05-01 08:49 UTC,
      guest-measure runs also emit and require aggregate and per-phase
      `network_poll` counters for initialized virtio-net
      scheduler/runtime/interface polling, the built-in TCP HTTP proof poll,
      virtqueue poll spins and completions, and pending network waiter scans.
      Local one-run evidence in `target/thread-scale/20260501T093505Z/` passed
      and retained zero aggregate and per-phase network/poll counters for the
      1/2/4-thread rows. The default thread-scale manifest has no virtio-net
      device, and the scheduler poll entry returns before the driver mutex in
      that no-device case. Zero counters are therefore the expected result for
      the CPU-bound thread-scale benchmark. They do not prove service throughput;
      future service/network benchmarks still need their own hot-section
      attribution and acceptance evidence.
- [x] Add a benchmark-kernel mode that suppresses per-context-switch logging
      during measured cases so serial MMIO cannot masquerade as scheduler cost.
      Completed with `CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1`; benchmark
      proof/error output and measure lines remain enabled.
- [x] Decide which counters are permanent observability and which stay behind
      `measure`. Completed 2026-05-01 04:55 UTC in
      `docs/architecture/scheduling.md`: all existing `kernel/src/measure.rs`
      counters remain benchmark-only behind the `measure` feature. Permanent
      scheduler observability should be added later through a separate
      low-overhead operator snapshot surface after the Phase C runtime
      accounting ledger exists, starting with runtime, context-switch,
      preemption, voluntary-block, migration, queue-depth, reschedule-IPI,
      TLB-shootdown, and policy admission/denial counts. Phase/cycle
      attribution, scheduler-lock wait/hold cycles, serial byte attribution,
      timer/TLB benchmark totals, raw user-PC samples, and thread-scale phase
      checkpoints stay behind `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`.
      Grounding read: `docs/architecture/scheduling.md`,
      `docs/proposals/scheduler-evolution-proposal.md`,
      `docs/research/future-scheduler-architecture.md`,
      `docs/research/out-of-kernel-scheduling.md`,
      `docs/research/nohz-sqpoll-realtime.md`, and
      `docs/research/completion-ring-threading.md`.
- [ ] Record controlled benchmark-VM evidence before and after each scheduler
      structure change.
      Latest follow-up after the first Phase C runtime-accounting slice reran
      the in-process thread-scale diagnostic at main commit `a88e7906` with
      QEMU pinned to physical-core logical CPUs `0-3` and SMT logical CPUs
      `0-7`. All rows remained `accepted=false`: physical 1/2/4 work speedups
      (2 and 4 workers relative to 1) were `1.000x` and `0.999x`, and SMT
      1/2/4/8 work speedups (2, 4, and 8 workers relative to 1) were `1.000x`,
      `1.001x`, and `0.333x`.
      Follow-up after the total-speedup host-summary gate landed reran current
      `main` commit `f198b099` on the benchmark VM with QEMU pinned to `0-3` and
      `0-7`. The harness now reports total-speedup diagnostics explicitly:
      physical 1/2/4 work speedups were `1.002x` and `1.002x`, total speedups
      were `0.911x` and `0.601x`; SMT diagnostic 1/2/4/8 work speedups were
      `1.001x`, `0.998x`, and `0.333x`, total speedups were `0.913x`,
      `0.621x`, and `0.200x`. Both host-summary gates remain unsatisfied.
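
The work and total speedups quoted above read as the one-worker median divided by the N-worker median; that reading reproduces the rounded values in the Phase B benchmark table. The sketch below shows the arithmetic under that assumption. The function names and both threshold parameters are hypothetical, since this backlog does not state the harness gate thresholds.

```rust
/// Speedup of an N-worker row relative to the 1-worker row.
/// Assumption: speedup = baseline median / row median (a lower median is faster).
fn speedup(baseline_median: u64, row_median: u64) -> f64 {
    baseline_median as f64 / row_median as f64
}

/// Format a speedup the way this backlog quotes it, for example `0.999x`.
fn format_speedup(value: f64) -> String {
    format!("{value:.3}x")
}

/// Hypothetical gate: a row is accepted only when the work speedup and the
/// total speedup each reach their own threshold.
fn passes_gates(work: f64, total: f64, work_threshold: f64, total_threshold: f64) -> bool {
    work >= work_threshold && total >= total_threshold
}
```

Feeding the physical `0-3` four-worker medians from the Phase B table into `speedup` gives `0.999x` for work and `0.595x` for total, so any gate that demands real scaling rejects that row.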

## Phase B: Per-CPU Runnable Ownership

- [x] Land the first bounded per-CPU runnable queue slice. Commit `1a8bf909`
      replaces the single global scheduler `VecDeque` with four
      per-scheduler-CPU FIFO queues under the existing global scheduler lock,
      centralizes enqueue/requeue/removal helpers, keeps single-owner
      capability processes on CPU0, prefers local work before bounded stealing,
      preserves direct IPC preference, and removes stale runnable entries for
      process/thread exit. Review fixes track live run-queue reservations,
      reserve all per-CPU queues to that count before publishing a new runnable
      thread, and release reservations on process/thread exit or
      pre-publication rollback, keeping timer and unblock requeue paths
      allocation-free after cross-CPU steals. Verification covered `run-spawn`,
      `run-smp2-smokes`, and controlled benchmark-VM 1/2/4/8-thread
      diagnostics. The default workload and total-case 64 MiB rows remain
      `accepted=false`, so this is structure evidence, not milestone closeout.
- [x] Finish `PerCpuRunQueue` ownership invariants as a documented contract.
      Completed 2026-05-01 02:13 UTC in
      `docs/architecture/scheduling.md`: a live generation-checked
      `ThreadRef` has at most one runnable dispatch owner across current slots,
      per-CPU run queues, and the direct IPC target; migration is a
      scheduler-lock-contained remove-before-publish transfer; local-first
      stealing is bounded by the scheduler CPU slots; and live run-queue
      reservations keep timer, unblock, direct-IPC fallback, steal retry, and
      steal requeue paths allocation-free.
- [x] Split current-thread and runnable ownership from shared process/thread
      metadata without widening emergency-path allocation. Completed
      2026-05-01 04:22 UTC in commit `d7221648`:
      `Scheduler::processes` remains
      the shared process/thread metadata table, while `SchedulerDispatch` now
      owns per-CPU run queues, current and handoff slots, idle slots, the
      direct IPC target, run-queue reservation count, pending process drops,
      and pending thread stack releases. The existing global scheduler lock and
      generation checks are unchanged, and the dispatch split keeps the
      pre-reserved run-queue capacity model for timer, unblock, direct-IPC
      fallback, steal retry, and steal requeue paths. Verification passed
      `make fmt-check`, `cargo build --features qemu`, a cached
      `make run-spawn` rerun, and `make run-smp2-smokes` in
      `target/smp2-smokes/20260501T042343Z/`.
      Controlled benchmark-VM timing after merge `56458b12` stayed
      `accepted=false`:

      | Pinning | Workers | Work Median | Total Median | Work Speedup | Total Speedup |
      | --- | ---: | ---: | ---: | ---: | ---: |
      | physical `0-3` | 1 | `56275842` | `140953762` | `1.000x` | `1.000x` |
      | physical `0-3` | 2 | `56290542` | `153327094` | `1.000x` | `0.919x` |
      | physical `0-3` | 4 | `56315094` | `237018874` | `0.999x` | `0.595x` |
      | SMT `0-7` | 1 | `56258010` | `140620194` | `1.000x` | `1.000x` |
      | SMT `0-7` | 2 | `56313324` | `153367860` | `0.999x` | `0.917x` |
      | SMT `0-7` | 4 | `56352472` | `237971426` | `0.998x` | `0.591x` |
      | SMT `0-7` | 8 | `169006414` | `727393630` | `0.333x` | `0.193x` |
- [x] Add a bounded timer continuation fast path before a broader scheduler
      lock split. Completed 2026-05-01 10:29 UTC: user-mode LAPIC timer ticks
      can continue the current non-idle thread without calling
      `sched::schedule()` only when a previous locked timer slow path published
      a clean hard-work summary, the current CPU is a valid active scheduler
      slot, no reschedule IPI is pending for that CPU, and the per-CPU one-skip
      budget is not exhausted. Dirty producers still force at least one locked
      pass before bypass, but the 2026-05-01 11:40 UTC follow-up lets that pass
      classify remaining nonempty run queues and handoff-current markers as
      fairness/protection-only state. Direct IPC targets, deferred
      termination/drop/stack cleanup, timer sleeps, and timed cap-enter/park
      waiters still keep the hard slow-path bit set; ordinary ring SQEs and
      indefinite cap wait scans are still serviced by forced slow-path ticks.
      This is a correctness-first split-prep slice, not a replacement for
      narrower scheduler metadata locks or accepted thread-scale evidence.
      Controlled benchmark-VM physical-core `0-3` before/after runs for the
      initial strict-clean version retained
      `accepted=false`: baseline
      `target/thread-scale/timer-fastpath-baseline-main-physical-20260501T102938/`
      recorded work speedups `0.998x` and `0.998x` plus total speedups
      `0.907x` and `0.620x`; after-change
      `target/thread-scale/timer-fastpath-after-physical-20260501T104700/`
      recorded work speedups `1.001x` and `0.999x` plus total speedups
      `0.909x` and `0.602x`. Controlled benchmark-VM physical-core `0-3`
      before/after runs for the fairness-only follow-up stayed
      `accepted=false`: baseline `target/thread-scale/20260501T120224Z/`
      recorded work speedups `1.001x` and `0.999x` plus total speedups
      `0.913x` and `0.587x`; after-change
      `target/thread-scale/20260501T120709Z/` recorded work speedups `1.001x`
      and `1.000x` plus total speedups `1.125x` and `0.828x`.
- [x] Add cross-CPU wake policy for endpoint, timer, park, process wait, and
      thread join completions. Completed 2026-05-01 03:06 UTC: queued wakeups
      now target the selected per-scheduler-CPU FIFO owner instead of scanning
      all idle scheduler CPUs.
- [x] Add explicit placement evidence and placement policy for newly runnable
      same-process worker threads. Completed 2026-05-01 12:37 UTC, refined
      2026-05-01 13:20 UTC, and repaired 2026-05-01 14:58 UTC after the
      blocking-parent benchmark exposed a placement regression. Measure builds
      emit aggregate and per-phase `thread_placement` lines with single-owner
      publish buckets, normal publish buckets, caller-current publish buckets,
      caller-aware avoid, fallback, and strict-load fallback counts, selected
      CPU buckets, first-selected CPU buckets, and migration totals/targets for
      CPU slots 0-3. `publish_created_thread()` receives the caller thread from
      `ThreadSpawner.create`, keeps single-owner processes on CPU0, and avoids
      the caller's current CPU only when another active ready scheduler CPU has
      a strictly lower non-idle dispatch load. On equal load, an active-ready
      caller CPU wins the tie instead of falling through to CPU0-biased
      least-loaded scanning; if the caller slot is unknown or ineligible,
      publication falls back to the least-loaded active scheduler CPU behavior.
      Timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths
      keep their existing allocation-free targeting behavior.

      The earlier avoid-caller policy passed the old spinning-parent 1-to-2
      gate but failed the repaired blocking-parent shape: before the strict-load
      fix, controlled capOS evidence regressed to 1-to-2 work/total speedups
      `0.886x`/`0.928x` because children were biased onto the non-caller queue
      even when the caller CPU had equal load. The repaired benchmark shape uses
      blocking parent join, 262,144 blocks (16 MiB), and `work_rounds=64`. The
      matching Linux baseline scales on the selected physical CPU set with
      1-to-4 work/total speedups `3.958x`/`3.834x`. Controlled capOS evidence
      on the same CPU set passed the enforced 1-to-2 work/total gates with
      `1.828x`/`1.687x`; the unsuppressed 1-to-4 diagnostic recorded
      `3.029x`/`2.386x`, and scheduler-switch-log-suppressed diagnostics
      recorded `3.272x`/`2.303x`. Remaining four-worker limits are now
      scheduler implementation issues, not benchmark-shape excuses: serial
      switch logging, global `Scheduler` lock contention, total-time
      exit/join/block/schedule overhead, and the temporary four-owner CPU mask.
- [x] Add bounded reschedule IPI behavior for idle-to-runnable transitions.
      Completed 2026-05-01 03:06 UTC: queued wakeups target at most one
      queue-owner CPU, direct IPC targets at most one eligible idle scheduler
      CPU, and measure builds emit wake scan, eligible idle CPU, target, sent,
      pending-skip, not-ready-skip, missing-target, and failure counters.
- [x] Preserve direct IPC handoff as a scheduling preference without bypassing
      per-CPU ownership or generation checks. Completed 2026-05-01 03:06 UTC:
      direct IPC still uses the single preference slot when available and falls
      back to the normal queued owner path when the target cannot run directly.
- [x] Prove process/thread exit cleanup cannot leave a stale runnable entry on
      any CPU queue. Completed 2026-05-01 03:14 UTC: process termination,
      current-process exit, and `ThreadControl.exitThread` cleanup now assert
      under the scheduler lock that the exiting process or thread no longer
      appears in any per-scheduler-CPU FIFO or in the direct IPC target slot.
      The focused spawn smoke asserts the serial proof markers emitted by the
      exercised process/thread exit paths.
- [x] Rerun `make run-thread-scale`, `make run-smp2-smokes`, ordinary smoke,
      spawn/thread, park, ring, and process-exit focused proofs. Completed
      2026-05-01 04:18 UTC: local serial reruns passed normal
      `make run-thread-scale` in
      `target/thread-scale/scheduler-phaseb-rerun-local-normal-20260501T034800Z/`
      and `make run-smp2-smokes` in `target/smp2-smokes/20260501T034414Z/`.
      Controlled benchmark-VM reruns at main commit `87be6e25` pinned QEMU to
      physical-core logical CPUs `0-3` and SMT logical CPUs `0-7`; all rows
      remained `accepted=false`, so this closes the Phase B rerun-evidence gate
      but not the selected in-process speedup milestone.
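
The same-process worker placement rule described in this phase (single-owner processes stay on CPU0, the caller CPU is avoided only for a strictly lower load elsewhere, and an unknown or ineligible caller falls back to a least-loaded scan) can be sketched as a pure function. This is an illustration only: `CpuSlot`, `least_loaded`, and `select_cpu` are hypothetical names, not the signature of `publish_created_thread()`, and choosing the least-loaded slot when the caller is avoided is an assumption.

```rust
/// One scheduler CPU slot as seen by the placement decision (hypothetical shape).
#[derive(Clone, Copy, Debug)]
struct CpuSlot {
    /// The slot is an active scheduler CPU that is ready to accept work.
    active_ready: bool,
    /// Non-idle dispatch load; lower means less loaded.
    load: u32,
}

/// Least-loaded active-ready slot; ties go to the lowest index (CPU0-biased).
fn least_loaded(slots: &[CpuSlot]) -> Option<usize> {
    slots
        .iter()
        .enumerate()
        .filter(|(_, s)| s.active_ready)
        .min_by_key(|(i, s)| (s.load, *i))
        .map(|(i, _)| i)
}

/// Pick the CPU that should own a newly published worker thread.
fn select_cpu(slots: &[CpuSlot], caller_cpu: Option<usize>, single_owner: bool) -> Option<usize> {
    if single_owner {
        return Some(0); // single-owner processes stay on CPU0
    }
    let fallback = least_loaded(slots);
    // Caller slot unknown or ineligible: plain least-loaded scan.
    let caller = match caller_cpu {
        Some(c) if slots.get(c).map_or(false, |s| s.active_ready) => c,
        _ => return fallback,
    };
    match fallback {
        // Avoid the caller only when another slot has a strictly lower load.
        Some(best) if slots[best].load < slots[caller].load => Some(best),
        // Equal load, or nothing better: the caller CPU wins the tie.
        _ => Some(caller),
    }
}
```

With four equally loaded slots and the caller on CPU 2, the sketch keeps the child on CPU 2 rather than drifting to CPU 0, which is the tie behavior the strict-load repair restored.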

## Phase C: CPU Accounting

- [x] Add monotonic runtime charge points when a running thread leaves the CPU
      at context switch, preemption, blocking syscall, direct IPC handoff, and
      thread exit. Completed 2026-05-01 05:08 UTC: running intervals are
      charged with `crate::arch::context::monotonic_ns()` when a current thread
      stops running through timer preemption, blocking `cap_enter`/ParkSpace,
      thread/process exit, and direct switch or handoff paths that select the
      next current thread.
- [x] Observe blocked runtime stability at unblock without charging non-running
      time. Completed 2026-05-01 05:08 UTC: unblock paths check the blocked
      runtime snapshot before making the thread ready.
- [x] Track per-thread runtime, virtual runtime seed, context switches,
      preemptions, voluntary blocks, and migrations. Completed 2026-05-01
      05:08 UTC: `ThreadCpuAccounting` is stored on each `Thread` record and
      updated under the scheduler/process lock. Context switch counters
      increment when a thread is selected, preemptions increment only for
      timer-driven running-to-ready requeue, voluntary blocks increment for
      blocking `cap_enter` and ParkSpace waits, and migrations increment when
      a thread runs on a different scheduler CPU than its previous run.
- [x] Add process/session/service aggregation only after the per-thread record
      has a single ledger of record. Completed 2026-05-22 13:50 UTC: a
      per-`Process` `ProcessCpuAccounting` ledger sums `runtime_ns` and a
      process-level `context_switches` dispatch count incrementally at the same
      scheduler/process-lock charge points that update `ThreadCpuAccounting`,
      so it captures exited threads' contributions. Only the always-present
      (non-`measure`) per-thread quantities are rolled up; the measure-gated
      `preemptions`/`voluntary_blocks`/`migrations` counters are intentionally
      not aggregated so the default-build proof stays meaningful. The kernel
      emits a `sched: process_cpu_accounting pid=... runtime_ns=...
      context_switches=...` line when each process exits, and `make run-spawn`
      asserts a nonzero aggregate. Session/service aggregation remains a
      stretch follow-on.
- [x] Add tests or QEMU diagnostics proving runtime increases while running and
      stops while blocked. Completed 2026-05-01 05:08 UTC: `make run-spawn`
      now asserts a compact scheduler proof line that requires nonzero runtime,
      context switches, preemptions, and voluntary blocks, plus stable blocked
      and exited runtime observations.
- [x] Keep runtime accounting independent of tickless idle by using the
      monotonic clocksource layer. Completed 2026-05-01 05:08 UTC: normal
      accounting uses `monotonic_ns()` and does not read
      `kernel/src/measure.rs` cycle counters.
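
The charge-point rules in this phase can be sketched as a small ledger: runtime is charged only when a running thread leaves the CPU, so time spent blocked never accrues. The counter field names follow the ones this backlog mentions; `LeaveReason`, `running_since_ns`, `last_cpu`, and the method names are hypothetical, and the sketch ignores locking and the `measure` feature gating.

```rust
/// Why a running thread left the CPU (hypothetical enum for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LeaveReason {
    Preempted,
    Blocked,
    Exited,
    Handoff,
}

#[derive(Debug, Default)]
struct ThreadCpuAccounting {
    runtime_ns: u64,
    context_switches: u64,
    preemptions: u64,
    voluntary_blocks: u64,
    migrations: u64,
    /// Monotonic timestamp of the current running interval, if any.
    running_since_ns: Option<u64>,
    /// Scheduler CPU of the previous run, if any.
    last_cpu: Option<usize>,
}

impl ThreadCpuAccounting {
    /// Called when the dispatcher selects this thread to run on `cpu`.
    fn on_selected(&mut self, now_ns: u64, cpu: usize) {
        self.context_switches += 1;
        if self.last_cpu.map_or(false, |prev| prev != cpu) {
            self.migrations += 1;
        }
        self.last_cpu = Some(cpu);
        self.running_since_ns = Some(now_ns);
    }

    /// Called at every charge point where the thread stops running.
    fn on_leave(&mut self, now_ns: u64, reason: LeaveReason) {
        // Charge only the interval that was actually spent running.
        if let Some(start) = self.running_since_ns.take() {
            self.runtime_ns += now_ns.saturating_sub(start);
        }
        match reason {
            LeaveReason::Preempted => self.preemptions += 1,
            LeaveReason::Blocked => self.voluntary_blocks += 1,
            LeaveReason::Exited | LeaveReason::Handoff => {}
        }
    }
}
```

A thread that runs from 100 ns to 350 ns, blocks until 900 ns, and then runs on another CPU until 1000 ns ends with `runtime_ns == 350`: the blocked interval contributes nothing, and the second run counts one migration.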

## Phase D: Best-Effort Fair Scheduling

Phase D accepted its Task 6 diagnostic closeout at commit `77caafc0`
(`2026-05-10 19:39 UTC`, `docs(scheduler): record phase d thread-scale gate`)
and closed in docs commit `1a08ec23` (`2026-05-10 21:47 UTC`,
`docs(scheduler): close phase d`). The first
Phase D policy is weighted fair queueing on top of the existing
per-thread `runtime_ns` / `virtual_runtime_ns` accounting, with a
capability-authorized `SchedulingPolicyCap` for weight and latency-class
mutation. The controlled Task 6 benchmark pair passed the harness-enforced
1-to-2 work/total gates; capOS recorded 1-to-4 work/total diagnostics
`3.088x` / `2.700x` at 4 workers versus the prior single-global-queue baseline
`1.566x` / `1.538x`, and that 1-to-4 row was manually accepted for Phase D
closeout. The matching Linux pthread baseline on the same host and
physical-core logical CPUs `0,1,2,3` recorded `3.974x` / `3.850x`. EEVDF is
now a follow-on policy evaluation, not a Phase D blocker. The design content is
in the "Phase D first-policy decision", "Phase D capability surface",
"Phase D migration fairness sketch", "Phase D test matrix", and
"Phase D overload behavior" sections of
`docs/proposals/scheduler-evolution-proposal.md`. The completed implementation plan is
archived at `docs/backlog/scheduler-evolution.md`.

The bullets below retain the closed acceptance gates and the
Phase D follow-ons that should be selected explicitly. Phase E
`SchedulingContext` is the next scheduler authority phase, followed
by Phase F auto-nohz / SQPOLL / tickless idle; generic full-nohz
remains deferred behind those prerequisites.

- [x] Choose initial weighted-fair or EEVDF-like policy based on accounting and
      queue data. Resolved `2026-05-05 19:00 UTC`: WFQ first; EEVDF deferred.
      See `docs/proposals/scheduler-evolution-proposal.md` "Phase D
      first-policy decision".
- [x] Add scheduler entity weights and latency class metadata through a
      capability-authorized policy path, not ambient process fields.
      Closed by `docs/backlog/scheduler-evolution.md` Tasks 1-2:
      `SchedulingPolicyCap` schema + kernel cap, per-thread
      `weight`/`latency_class` fields, weighted vruntime, and
      caller-thread cap binding.
- [x] Preserve fairness across CPU migration. Implementation tracked in
      `docs/backlog/scheduler-evolution.md` Task 4 (vruntime travels with
      the thread, `virtual_finish_ns` recomputed at destination
      enqueue, bounded steal targets the queue whose head has the
      lowest `virtual_finish_ns`, matching the local pick rule of
      taking the front of the ascending per-CPU queue). Closed
      `2026-05-08 00:53 UTC`: invariants made explicit on
      `refresh_virtual_finish_ns_locked` and at the steal-insert site;
      the `cfg(feature = "measure")`-gated `ThreadCpuAccounting.migrations`
      counter moved from the dispatch-time `scheduled_measure` path
      to enqueue-time `record_placement_spread_migration_locked` and
      `record_steal_migration_locked` arms; weight-change-while-
      enqueued contract proved by construction with a `debug_assert!`
      reinforcement in `Process::refresh_thread_virtual_finish_ns`.
- [x] Test CPU hogs, short sleepers, direct IPC server/client pairs,
      multi-process load, and same-process sibling load. Implementation
      tracked in `docs/backlog/scheduler-evolution.md` Task 5 (test matrix
      smokes) and Task 6 (the controlled `make run-thread-scale` evidence
      pair: harness-enforced 1-to-2 gates plus a manually accepted 1-to-4
      diagnostic closeout row). Closed `2026-05-10 19:46 UTC`: the
      benchmark-VM Task 6 run at commit `76025f0963a4` recorded capOS 1-to-4
      work/total diagnostics `3.088x` / `2.700x`; the 1-to-2 gate stayed green
      at `1.809x` / `1.774x`. The matching Linux pthread baseline
      on the same physical-core logical CPUs `0,1,2,3` recorded
      `3.974x` / `3.850x`.
- [x] Define overload behavior when runnable entities exceed the selected CPU
      set or when migration cannot keep up. Resolved at the design
      level `2026-05-05 19:00 UTC`: soft overload uses vruntime
      ordering (no entity is starved); hard overload defers to Phase F
      `CpuIsolationLease` and Phase G `RealtimeIsland`. See
      `docs/proposals/scheduler-evolution-proposal.md` "Phase D
      overload behavior".
- [ ] Phase D follow-on: EEVDF migration. Once the WFQ slice has
      accepted thread-scale evidence, evaluate replacing the bucketed
      per-CPU `VecDeque` with an EEVDF eligibility set
      (`BTreeMap`-by-virtual-deadline) plus per-thread request size
      and lag accounting. The accounting fields, capability surface,
      and migration contract carry directly; the change is localized
      to the dispatch ordering structure. Promote to its own design
      slice if and when selected; do not bundle it into the WFQ
      first-slice plan.
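
The migration-fairness rule above (vruntime travels with the thread, `virtual_finish_ns` is recomputed at enqueue, the local pick takes the front of an ascending queue, and a steal targets the remote queue whose head has the lowest `virtual_finish_ns`) can be sketched with plain `VecDeque`s. Everything beyond those field names is an assumption: `BASE_WEIGHT`, the `Entity` shape, the inverse-weight scaling, and the slice-based finish tag are textbook WFQ choices used for illustration, not the kernel's confirmed formulas.

```rust
use std::collections::VecDeque;

/// Reference weight: a thread with this weight accrues vruntime at wall-clock
/// rate. Hypothetical constant for this sketch.
const BASE_WEIGHT: u64 = 1024;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Entity {
    id: u32,
    weight: u64,
    virtual_runtime_ns: u64,
    virtual_finish_ns: u64,
}

impl Entity {
    /// Charge real runtime, scaled inversely by weight (assumed WFQ scaling).
    fn charge(&mut self, delta_ns: u64) {
        self.virtual_runtime_ns += delta_ns * BASE_WEIGHT / self.weight.max(1);
    }

    /// Recompute the finish tag for a requested slice; done at enqueue time,
    /// including enqueue on a destination CPU after migration.
    fn refresh_virtual_finish(&mut self, slice_ns: u64) {
        self.virtual_finish_ns =
            self.virtual_runtime_ns + slice_ns * BASE_WEIGHT / self.weight.max(1);
    }
}

/// Insert so the queue stays ascending by `virtual_finish_ns` (FIFO among equals).
fn enqueue(queue: &mut VecDeque<Entity>, entity: Entity) {
    let pos = queue.partition_point(|e| e.virtual_finish_ns <= entity.virtual_finish_ns);
    queue.insert(pos, entity);
}

/// Steal target: the remote queue whose head has the lowest `virtual_finish_ns`.
fn steal_target(queues: &[VecDeque<Entity>], local: usize) -> Option<usize> {
    queues
        .iter()
        .enumerate()
        .filter(|(cpu, _)| *cpu != local)
        .filter_map(|(cpu, q)| q.front().map(|head| (head.virtual_finish_ns, cpu)))
        .min()
        .map(|(_, cpu)| cpu)
}
```

Because the steal rule inspects only queue heads, it stays bounded by the number of scheduler CPUs and agrees with the local pick rule: whichever queue is chosen, the entity taken is that queue's front.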

## Phase E: SchedulingContext Capability

Phase E policy follow-ups are closed. Local owner-shell logout propagation is
recorded in
[`scheduler-phase-e-local-owner-shell-logout-propagation`](../tasks/done/2026-05-11/scheduler-phase-e-local-owner-shell-logout-propagation.md).
Endpoint donation/return, timeout/depletion notifications, and the
scheduler-observable session lifecycle hook are recorded on `main`:
[`scheduler-phase-e-endpoint-donation`](../tasks/done/2026-05-11/scheduler-phase-e-endpoint-donation.md),
[`scheduler-phase-e-timeout-depletion-notifications`](../tasks/done/2026-05-11/scheduler-phase-e-timeout-depletion-notifications.md), and
[`scheduler-session-lifecycle-hook`](../tasks/done/2026-05-11/scheduler-session-lifecycle-hook.md).
The donated-context logout policy is also closed as a conservative
counted/skipped return-path proof:
[`scheduler-phase-e-session-logout-donated-context-policy`](../tasks/done/2026-05-11/scheduler-phase-e-session-logout-donated-context-policy.md).
Timeout/depletion notifications now use fixed per-context notification cells
allocated at context creation/bootstrap. The ordinary non-donated
session-logout stale-context proof is complete through the
`UserSession.logout()` hook. In-flight endpoint donation uses the conservative
counted/skipped policy during logout and relies on endpoint RETURN/cancel to
finish the in-flight transfer/clear without returning donor budget early. Local
owner-shell exit now calls the same `UserSession.logout()` path on clean REPL
exit or terminal-close completion; the shell proof observes the scheduler hook
with no bound local shell `SchedulingContext`, while the focused
session-context proof remains the ordinary bound-context stale evidence.

- [x] Phase E preflight: retire the transitional
      `CAPOS_SCHED_DISABLE_WFQ=1` / `WakePolicy::QueueAny`
      single-global-queue fallback that Phase D kept for one
      bisect cycle. This is a scheduler-surface cleanup before
      `SchedulingContext` claims budget/period authority; do not treat
      it as an EEVDF blocker. Completed 2026-05-10 22:20 UTC:
      the source-level opt-out, queue-0 enqueue funnel, and `QueueAny`
      wake policy are gone.
- [x] Define the first `SchedulingContext` object shape. Phase E Task 1
      adds the minimal schema/control-plane cap shape: `SchedulingContextSpec`
      carries budget, period, relative deadline, byte-oriented CPU mask, and
      overrun policy; `SchedulingContextInfo` is a read-only snapshot with
      `remainingBudgetNs` as derived info-only state; and the kernel/runtime
      expose an info-only `SchedulingContext.info()` cap stub for focused
      grant/discovery and client decode coverage. The `cpuMask` field is a
      canonical little-endian bitset: CPU `n` is bit `n % 8` of byte
      `n / 8`, empty means no CPUs selected, producers omit trailing zero
      bytes, and non-empty canonical masks end in a nonzero byte. Dispatcher
      budget enforcement, replenishment, bind/revoke rules, donation/return,
      depletion notifications, realtime islands, SQPOLL, and nohz remain
      deferred.
- [x] Add capability creation/bind/revoke rules and generation identity. The
      second Phase E control-plane slice keeps `info()` method id 0 stable,
      adds same-interface context creation as a bounded result-cap transfer,
      records at most one caller-thread binding per context generation, and
      revokes by advancing the context generation and clearing the matching
      thread metadata binding. Bootstrap grants and created contexts use the
      same non-wrapping context-id allocator so distinct caps cannot alias the
      `(contextId, generation)` binding key. The focused
      `make run-scheduling-context` QEMU smoke proves distinct bootstrap
      identities, create result-cap adoption, bind/revoke, stale-generation
      calls, release cleanup, and the explicit `infoOnlyNoDispatchChange`
      dispatch-effect marker. Stale caps report `staleGeneration` and cannot
      mutate scheduler metadata; revoked contexts report `revoked`. Dispatch
      selection, WFQ ordering, runtime charging, replenishment, donation/return,
      timeout/depletion notification, realtime islands, SQPOLL, auto-nohz, and
      CPU placement enforcement remain future work.
- [x] Enforce budget and replenishment in the kernel dispatcher. First Phase E
      budget enforcement landed 2026-05-11 08:38 UTC: `bindCallerThread()`
      now installs a fixed per-thread budget ledger under the scheduler/process
      locking model, runtime charge decrements the bound context budget at the
      existing dispatch charge points, runnable selection replenishes elapsed
      periods without allocation, and exhausted contexts stay queued but
      `RetryLater` until their next period. Deadline-driven accounting closed
      the previous periodic-tick granularity caveat on 2026-06-04: the ordinary
      dispatch path arms a sub-tick budget-exhaustion one-shot when the selected
      thread's remaining budget would deplete before the next scheduler tick,
      kernel-mode one-shot fires restore a live periodic timer, nohz re-arm
      folds the leased thread's budget deadline into its existing nearest
      deadline, and nohz budget depletion restores the periodic tick with
      `reason=scheduling-context-budget-throttled`. `make run-scheduling-context`
      proves visible charge, replenishment to full budget, stale/revoked
      fail-closed behavior, and a throttled wall-clock window with
      `dispatch_effect=budgetEnforced`; the representative 5 ms deadline marker
      recorded `elapsed_since_arm_ns=5474819`, `overshoot_ns=474819`,
      `remaining_after_ns=0`, and `bounded_charge=true`. At that slice's landing,
      donation/return, depletion notifications, realtime islands, SQPOLL,
      auto-nohz, and CPU placement enforcement remained future work.
- [x] Add endpoint donation/return semantics for synchronous calls and passive
      services. Completed 2026-05-11 10:51 UTC: endpoint in-flight call state
      now carries a bounded internal donation token when a caller with a bound
      `SchedulingContext` delivers a synchronous CALL to a receiver thread
      without its own context. The scheduler charges pre-donation caller
      runtime before moving the ledger, charges passive-server runtime before
      returning the ledger, and returns the remaining budget to the caller before
      waking it when RETURN commits, commits an application exception, or fails
      with an invalid caller result buffer. RETURN
      preflight failures keep the in-flight donation intact;
      delivery/return cancellation paths return or clear the donation without
      allocating. A donor with an in-flight token is blocked from returning to
      userspace until the endpoint call returns or is canceled. Nested donation
      of an already donated context is rejected until stacked return tokens have
      a dedicated design. The focused
      `make run-scheduling-context` smoke now includes a same-process endpoint
      round trip with `endpoint_donation=ok`, `endpoint_return=ok`,
      `endpoint_exception_return=ok`, `endpoint_invalid_return=ok`, and
      `endpoint_nested_rejected=ok`, plus an `endpoint_donor_block=ok`
      delayed-server `cap_enter(0, 0)` proof, an `endpoint_donor_fast=ok`
      fast-return race proof, and remaining-budget fields for successful
      RETURN, application-exception RETURN, invalid-result RETURN,
      nested-donation rejection, donor blocking, and fast donor return. This is
      synchronous endpoint donation/return only; depletion
      notifications, realtime islands, SQPOLL, auto-nohz, CPU placement
      enforcement, and session-logout stale-context coverage remain future
      work.
- [x] Add a scheduler-observable session lifecycle hook from
      `UserSession.logout()` into scheduler-owned `SchedulingContext`
      stale-marking. The hook covers explicit logout plus the remote DTO
      gateway logout/connection-teardown paths that already call
      `UserSession.logout()`: after the liveness cell flips to logged out, the
      scheduler scans process/thread metadata for the same session liveness
      cell, removes non-donated matching bindings from its ledger, and advances
      the bound context generation as revoked so ordinary old grants become
      stale. The hook preserves the scheduler as the binding authority and
      avoids scheduler-lock to context-record-lock inversion by taking one
      binding under the scheduler lock, dropping that lock, and then marking
      the context stale through its cleanup token. In-flight endpoint donation
      bindings are explicitly skipped because returning donor budget before
      endpoint cancellation would violate the donor-blocking invariant. This
      hook unblocks focused stale-context proofs: ordinary non-donated logout,
      donated-context policy, and local owner-shell propagation are now closed
      by their dedicated task records.
- [x] Add timeout/depletion notifications with preallocated emergency-path
      storage. Completed in the timeout/depletion notification slice: every
      `SchedulingContext` owns a fixed
      notification cell allocated at context creation/bootstrap, with
      coalescing slots for budget depletion and deadline/timeout, sequence
      counters, bounded coalesced-event counts, holder identity,
      donated-holder marking, remaining budget, and next timestamp snapshots.
      Scheduler charging, timeout/deadline observation, donation-return, and
      cancellation paths update only that fixed state; they do not allocate,
      publish result caps, append unbounded queues, or require hard-path
      logging. `SchedulingContext.drainNotifications()` exposes typed
      `ok`, `revoked`, and `staleGeneration` observer results, plus
      `explicitRevoke` lifecycle state. The focused
      `make run-scheduling-context` smoke proves repeated budget-depletion
      coalescing, deadline notification, explicit revoke, stale observer
      labels, and endpoint-donated notification accounting. A pre-armed
      observer waiter/wakeup path remains a separate follow-up.
- [x] Extend stale-context proofs beyond the first revoke/generation contract
      to process and thread exit. The focused SchedulingContext smoke now
      proves that a context bound by an exiting thread becomes unbound without
      minting fresh budget on rebind, while process-exit and explicit
      process-termination children bind contexts and run the process cleanup
      path before cap-table release.
- [x] Extend stale-context proofs to session logout. Completed for ordinary
      non-donated contexts at 2026-05-11 17:44 UTC. This remains separate from
      process/thread exit because logout propagation is owned by the session
      lifecycle surface, not the scheduler dispatch loop. The focused
      session-context smoke now binds a `SchedulingContext` in a
      session-owned child, calls `UserSession.logout()`, observes the scheduler
      hook line, and proves the old cap is stale before budget refresh,
      caller-thread rebind, result-cap publication, or metadata mutation.
      Process/thread exit cleanup remains covered by `make run-scheduling-context`.
- [x] Prove donated receiver logout policy. Completed at 2026-05-11 18:19 UTC.
      Logout keeps the existing conservative counted/skipped behavior for
      receiver threads holding endpoint-donated `SchedulingContext` bindings.
      The focused session-context smoke has a donor call a guest-session
      receiver, the receiver logs out while holding the donated binding, the
      scheduler hook reports `stale_marked=0 donation_inflight_skipped=1`, the
      donor remains blocked in `cap_enter(0, 0)` until endpoint RETURN, and the
      donor context returns bound with reduced remaining budget rather than a
      refreshed or minted budget. Local owner-shell lifecycle propagation was
      closed separately by
      `scheduler-phase-e-local-owner-shell-logout-propagation`.
- [x] Propagate local owner-shell exit to session logout. Completed at
      2026-05-11 19:36 UTC. Clean local REPL `exit` and terminal-close
      completion now call the held `UserSession.logout()` before process exit,
      so the session liveness cell is marked logged out through the same
      kernel hook used by explicit logout and the remote DTO gateway. The shell
      smoke asserts the scheduler-observable hook line with
      `stale_marked=0 donation_inflight_skipped=0`; ordinary bound
      `SchedulingContext` stale behavior remains proven by the focused
      session-context smoke through the same hook. Process/thread-exit cleanup
      remains separate and unchanged.
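The budget-enforcement item in the list above describes a fixed per-thread ledger that is charged at dispatch charge points, replenished per elapsed period without allocation, and reported as `RetryLater` while exhausted. A minimal sketch of that accounting shape follows. The type and field names (`BudgetLedger`, `Selection`, `period_start_ns`) are hypothetical and are not taken from the kernel source, and locking is deliberately left out.

```rust
/// Hypothetical fixed-size budget ledger; names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BudgetLedger {
    pub budget_ns: u64,       // full budget granted per period
    pub period_ns: u64,       // replenishment period, must be non-zero
    pub remaining_ns: u64,    // budget left in the current period
    pub period_start_ns: u64, // monotonic start of the current period
}

/// Outcome of runnable selection for a thread bound to a ledger.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Selection {
    Run,
    /// Exhausted: the thread stays queued until the next period begins.
    RetryLater { next_period_ns: u64 },
}

impl BudgetLedger {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        assert!(period_ns > 0, "period must be non-zero");
        Self { budget_ns, period_ns, remaining_ns: budget_ns, period_start_ns: now_ns }
    }

    /// Charge observed runtime; overshoot saturates at zero remaining budget.
    pub fn charge(&mut self, ran_ns: u64) {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
    }

    /// Advance over whole elapsed periods and refill to the full budget.
    /// Pure arithmetic on fixed fields, so nothing is allocated.
    pub fn replenish(&mut self, now_ns: u64) {
        let elapsed = now_ns.saturating_sub(self.period_start_ns);
        let periods = elapsed / self.period_ns;
        if periods > 0 {
            self.period_start_ns += periods * self.period_ns;
            self.remaining_ns = self.budget_ns;
        }
    }

    /// Replenish first, then decide whether the thread may run now.
    pub fn select(&mut self, now_ns: u64) -> Selection {
        self.replenish(now_ns);
        if self.remaining_ns == 0 {
            let next = self.period_start_ns.saturating_add(self.period_ns);
            Selection::RetryLater { next_period_ns: next }
        } else {
            Selection::Run
        }
    }
}
```

With an assumed 5 ms budget and 20 ms period, charging the 5,474,819 ns recorded by the representative marker leaves zero remaining budget, selection reports `RetryLater` until the 20 ms boundary, and the next selection at or after that boundary sees the full budget again.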

## Phase F: CPU Isolation Lease and SQPOLL

The Phase E gates and the first Ring/SQPOLL ownership prerequisite are now
closed. Dispatch Phase F work through
[`scheduler-phase-f-auto-nohz-sqpoll`](../tasks/done/2026-05-16/scheduler-phase-f-auto-nohz-sqpoll.md)
only via its own Phase F authority, telemetry, rollback, and nohz/SQPOLL
tasks; this backlog entry does not implement Phase F behavior. The concrete
ring prerequisite is
[`scheduler-phase-f-one-sq-consumer-ring-ownership`](../tasks/done/2026-05-11/scheduler-phase-f-one-sq-consumer-ring-ownership.md),
closed on 2026-05-11: ring endpoints now have generation-checked syscall-mode
SQ-consumer leases, duplicate future SQPOLL acquisition is rejected while that
owner is live, stale owner generations cannot advance SQ head, teardown
releases the owner without clearing accepted completions, and bounded SQPOLL
admission metadata exists without starting a poller.
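The ownership rules in the ring prerequisite above (one live SQ consumer, duplicate acquisition rejected, stale generations unable to advance the SQ head, teardown that keeps accepted completions) can be sketched as a small generation-checked lease. This is a toy model under stated assumptions: `RingEndpoint`, `SqConsumerLease`, `RingError`, and the rule that each consumed entry yields one accepted completion are hypothetical and do not reflect the actual ring layout.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConsumerMode {
    Syscall,
    Sqpoll,
}

/// Hypothetical lease token handed to the single live SQ consumer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SqConsumerLease {
    pub mode: ConsumerMode,
    pub generation: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RingError {
    OwnerLive,       // another consumer still owns the SQ
    StaleGeneration, // the lease predates the current owner generation
}

#[derive(Debug, Default)]
pub struct RingEndpoint {
    generation: u64,
    owner: Option<ConsumerMode>,
    sq_head: u32,
    accepted_completions: u32,
}

impl RingEndpoint {
    /// Admit a consumer only when no owner is live.
    pub fn acquire(&mut self, mode: ConsumerMode) -> Result<SqConsumerLease, RingError> {
        if self.owner.is_some() {
            return Err(RingError::OwnerLive);
        }
        self.generation += 1;
        self.owner = Some(mode);
        Ok(SqConsumerLease { mode, generation: self.generation })
    }

    /// Check the lease before touching the SQ head, so a stale owner
    /// fails without consuming anything.
    pub fn advance_sq_head(&mut self, lease: &SqConsumerLease, n: u32) -> Result<u32, RingError> {
        if self.owner != Some(lease.mode) || lease.generation != self.generation {
            return Err(RingError::StaleGeneration);
        }
        self.sq_head = self.sq_head.wrapping_add(n);
        // Toy assumption: every consumed entry yields one accepted completion.
        self.accepted_completions = self.accepted_completions.wrapping_add(n);
        Ok(self.sq_head)
    }

    /// Release ownership and invalidate old leases; completions are kept.
    pub fn teardown(&mut self) {
        self.owner = None;
        self.generation += 1;
    }

    pub fn accepted_completions(&self) -> u32 {
        self.accepted_completions
    }
}
```

In this sketch a second `acquire` while the syscall-mode owner is live returns `OwnerLive`, and a lease kept across `teardown` gets `StaleGeneration` even after a new owner has been admitted.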
The first executable Phase F child task,
[`scheduler-phase-f-cpu-isolation-lease-scaffold`](../tasks/done/2026-05-16/scheduler-phase-f-cpu-isolation-lease-scaffold.md),
closed on 2026-05-12 12:02 UTC. It is limited to `CpuIsolationLease` authority,
activation preflight telemetry, and rollback scaffolding. It does not enable
SQPOLL, automatic nohz, tick suppression, automatic CPU isolation, or generic
full-nohz behavior. The second executable child task,
[`scheduler-phase-f-nohz-activation-telemetry`](../tasks/done/2026-05-16/scheduler-phase-f-nohz-activation-telemetry.md),
closed on 2026-05-12 14:18 UTC. It turns the disabled preflight into observable
activation/deactivation and rollback decisions while still leaving tick
suppression, SQPOLL, automatic CPU isolation, and generic full-nohz disabled.

The housekeeping/deferred-work placement child closed on 2026-05-12 18:36 UTC
by
[`scheduler-phase-f-housekeeping-deferred-work-placement`](../tasks/done/2026-05-16/scheduler-phase-f-housekeeping-deferred-work-placement.md):
the scheduler now records an explicit online housekeeping CPU placement input,
selected housekeeping mask, deferred cleanup/timer/network/IRQ/accounting
placement or rejection labels, and bounded revoke, process-exit,
service-replacement, and session-logout cleanup placement while ticks remain
periodic.

The bounded SQPOLL ring-mode child closed on 2026-05-12 20:29 UTC by
[`scheduler-phase-f-sqpoll-ring-mode-bounded-poller`](../tasks/done/2026-05-12/scheduler-phase-f-sqpoll-ring-mode-bounded-poller.md):
ring endpoints now transition explicitly through syscall, SQPOLL starting,
running, sleeping, stopping, and rollback modes; a `kernelSqpoll`
`CpuIsolationLease` admits one bounded periodic-tick poller for the caller
thread's ring; producer wakeups use `NEED_WAKEUP`; stale SQ owners fail before
SQ-head consumption; and poller stop/revoke preserves accepted CQEs while
releasing SQ ownership. Actual tick suppression is blocked until the
SQPOLL progress path no longer depends on periodic scheduler ticks. The
clockevent/deadline substrate child closed on 2026-05-12 23:07 UTC by
[`scheduler-phase-f-clockevent-deadline-substrate`](../tasks/done/2026-05-12/scheduler-phase-f-clockevent-deadline-substrate.md):
normal QEMU/x86_64 `monotonic_ns()` is backed by the calibrated TSC rather
than `TICK_COUNT`, the periodic LAPIC tick disciplines the TSC epoch while nohz
is disabled, `Timer.sleep`, finite `cap_enter`, and park waiters store
absolute monotonic deadlines, and the LAPIC clockevent backend can program a
bounded one-shot deadline and restore periodic mode. The substrate's firing
precision is now proven, not only its programming: the
`scheduler-lapic-oneshot-subtick-firing-precision` child (closed
2026-06-04 03:26 UTC, commit 49b36129) arms a `TICK_NS/2` one-shot over the live
periodic timer during boot and
measures the *actual* countdown-to-fire instant, asserting via
`make run-scheduling-context` that it fires sub-tick (~5 ms for a 5 ms request,
well under the 10 ms tick) with the current-count correctly reset to the
sub-tick value -- ruling out the suspected "`INITIAL_COUNT` write does not reset
the running countdown" root cause -- and that the kernel-mode-fire periodic
restore leaves a live timer (no lost-timer hang). Automatic nohz, tick
suppression, SQPOLL nohz, generic full-nohz, and production realtime admission
remain disabled. Known pre-existing gate flake (independent of the
firing-precision proof, which passed in 100% of measured boots): the
`scheduling-context-smoke` budget-timing proof exited early in ~20% of boots on
both `main` and this branch under host load -- its wall-clock budget-throttle
assertions are sensitive to host scheduling jitter. Run
`make run-scheduling-context` on an otherwise-idle host until the budget proof
is stabilized (a separate follow-up); it is orthogonal to the clockevent firing
assertions.

A second substrate prerequisite surfaced on 2026-06-04 from
`scheduler-deadline-driven-budget-accounting`'s Attempt 2: even with the LAPIC
one-shot firing precisely sub-tick, the monotonic clocksource *discipline* floored
a sub-tick interval to a full tick. A boot probe measured a real 5.0 ms interval
advancing `monotonic_ns` by 10.0 ms after one `discipline_clocksource_tick` step
(`monotonic_delta_ns=10000020` for `real_ns=5000118`, `floored=true`), because
`discipline_clocksource_tick` took `max(tsc_interpolated, epoch + TICK_NS)` on
every fire. That was the real cause of that task's Attempt 1 "9.85 ms" -- not the
LAPIC firing (fixed) and not the ordinary-path timer-ISR rechecks (which provably
no-op when no nohz/idle window is active). The prerequisite
[`scheduler-monotonic-clocksource-subtick-discipline`](../tasks/done/2026-06-04/scheduler-monotonic-clocksource-subtick-discipline.md)
closed it (2026-06-04): `discipline_clocksource_tick` now trusts the TSC
interpolation at sub-tick granularity, falling back to the `TICK_NS` floor only
when the interpolated advance is below `MIN_DISCIPLINED_ADVANCE_NS` (`TICK_NS / 8`)
so a degenerate (stalled/backward/mis-calibrated-slow) TSC still keeps a minimum
forward rate; the tick-derived fallback is unchanged. A boot proof
(`context::qemu_clocksource_subtick_discipline_proof`, emitted on
`make run-scheduling-context`) runs one real `TICK_NS / 2` discipline step and
asserts `monotonic_ns()` tracked the sub-tick delta -- measured
`monotonic_delta_ns=5055612` for `real_ns=5000474` (`floored=false`,
`subtick_tracked=true`). Deadline-driven budget accounting and generic full-nohz
can now observe a sub-tick deadline through the accounting clock.
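The difference between the floored discipline and the sub-tick discipline described above can be illustrated with two small functions. This is a sketch, not the kernel's `discipline_clocksource_tick`: the function names and signatures are hypothetical, and `TICK_NS` is assumed to be 10 ms, matching the tick length quoted in this section.

```rust
/// Assumed scheduler tick length (10 ms, per the figures in this section).
pub const TICK_NS: u64 = 10_000_000;
/// Minimum interpolated advance that is trusted as-is (`TICK_NS / 8`).
pub const MIN_DISCIPLINED_ADVANCE_NS: u64 = TICK_NS / 8;

/// Earlier behavior: every fire advanced the clock by at least a full
/// tick, so a real 5 ms interval was reported as 10 ms.
pub fn discipline_floored(epoch_ns: u64, tsc_interpolated_ns: u64) -> u64 {
    tsc_interpolated_ns.max(epoch_ns + TICK_NS)
}

/// Later behavior: trust the TSC interpolation at sub-tick granularity and
/// fall back to the full-tick floor only when the interpolated advance is
/// implausibly small (stalled, backward, or mis-calibrated-slow TSC).
pub fn discipline_subtick(epoch_ns: u64, tsc_interpolated_ns: u64) -> u64 {
    let advance = tsc_interpolated_ns.saturating_sub(epoch_ns);
    if advance < MIN_DISCIPLINED_ADVANCE_NS {
        epoch_ns + TICK_NS
    } else {
        tsc_interpolated_ns
    }
}
```

For an epoch `e` and an interpolated reading of `e + 5_000_118`, the first function returns `e + 10_000_000` while the second returns the interpolated value unchanged. A stalled reading equal to `e` still advances to `e + TICK_NS` in both.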
The SQPOLL nohz-progress child closed on 2026-05-13 00:06 UTC by
[`scheduler-phase-f-sqpoll-nohz-progress`](../tasks/done/2026-05-13/scheduler-phase-f-sqpoll-nohz-progress.md):
`cap_enter` now has a bounded current-thread SQPOLL service entry for
producer wakes and syscall kicks that borrows the SQPOLL owner lease, charges
the admitted accounting target, and reports non-periodic progress evidence
while ordinary periodic service remains active. Automatic policy-service nohz
issuance and production realtime admission remain future work; generic SQPOLL
nohz for explicitly leased caller-thread rings landed in the later Step 14
slice.

The tickless-idle child closed on 2026-05-23 09:12 UTC by
[`scheduler-tickless-idle-step6`](../tasks/done/2026-05-23/scheduler-tickless-idle-step6.md):
the CPL0 idle loop now admits an idle-only tickless window when no non-idle
work is runnable, no nohz lease is active, no local deferred cleanup is
pending, no cap-enter polling dependency is present, and the LAPIC one-shot
clockevent plus monotonic clocksource are available. The periodic tick is
restored before non-idle dispatch and on rollback. Legacy cap-enter polling
surfaces, including the terminal shell path, remain periodic until they gain
explicit deadline or housekeeping placement.
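The tickless-idle admission conditions listed above amount to a conjunction of per-CPU predicates followed by a bounded one-shot deadline. The sketch below assumes a hypothetical `IdleSnapshot` struct. It reads the 100 ms housekeeping floor from the matching checklist item as an upper bound on how far ahead the one-shot is armed, which is an interpretation rather than something this backlog states.

```rust
/// Housekeeping bound on the idle one-shot (100 ms).
pub const HOUSEKEEPING_FLOOR_NS: u64 = 100_000_000;

/// Hypothetical snapshot of the state the idle loop would inspect.
#[derive(Debug, Clone, Copy)]
pub struct IdleSnapshot {
    pub non_idle_runnable: bool,
    pub nohz_lease_active: bool,
    pub local_deferred_cleanup_pending: bool,
    pub cap_enter_polling_dependency: bool,
    pub oneshot_clockevent_available: bool,
    pub monotonic_clocksource_available: bool,
    /// Nearest pending `Timer`/`ParkSpace` deadline, if any.
    pub nearest_deadline_ns: Option<u64>,
}

/// Returns the absolute one-shot deadline when an idle-only tickless
/// window may be admitted, or `None` to stay on the periodic tick.
pub fn tickless_idle_deadline(s: &IdleSnapshot, now_ns: u64) -> Option<u64> {
    let admitted = !s.non_idle_runnable
        && !s.nohz_lease_active
        && !s.local_deferred_cleanup_pending
        && !s.cap_enter_polling_dependency
        && s.oneshot_clockevent_available
        && s.monotonic_clocksource_available;
    if !admitted {
        return None;
    }
    let bound = now_ns.saturating_add(HOUSEKEEPING_FLOOR_NS);
    let deadline = match s.nearest_deadline_ns {
        Some(d) => d.min(bound),
        None => bound,
    };
    // Never arm a deadline in the past.
    Some(deadline.max(now_ns))
}
```

Any single failed predicate keeps the CPU periodic, so a legacy cap-enter polling surface (the terminal shell path mentioned above) corresponds to `cap_enter_polling_dependency = true` and a `None` result.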

- [x] Define `CpuIsolationLease` authority separately from CPU-time budget.
      Completed 2026-05-12 12:02 UTC by
      `docs/tasks/done/2026/scheduler-phase-f-cpu-isolation-lease-scaffold.md`.
- [x] Add scheduler activation proof for housekeeping, deferred cleanup,
      timers, networking, IRQ affinity, live accounting target,
      one-SQ-consumer state, and revocation latency. The scaffold reports
      blocked eligibility and leaves ticks/nohz/SQPOLL disabled.
- [x] Enforce one live SQ consumer per ring before SQPOLL. Completed
      2026-05-11 by
      `docs/tasks/done/2026/scheduler-phase-f-one-sq-consumer-ring-ownership.md`.
- [x] Integrate SQPOLL ring mode only after this ownership prerequisite and
      `docs/tasks/done/2026/scheduler-phase-f-housekeeping-deferred-work-placement.md`
      have landed. Completed 2026-05-12 20:29 UTC by
      `docs/tasks/done/2026/scheduler-phase-f-sqpoll-ring-mode-bounded-poller.md`.
- [x] Add lease revocation on explicit revoke, process exit, service
      replacement, and session close. Completed by the focused
      `make run-scheduler-cpu-isolation-lease` proof.
- [x] Add nohz activation/deactivation telemetry. Completed 2026-05-12 14:18 UTC by
      `docs/tasks/done/2026/scheduler-phase-f-nohz-activation-telemetry.md`.
      The proof records active-candidate rejection, stale/revoked rollback,
      ready housekeeping CPUs under `-smp 4`, exactly-one-runnable target CPU
      evidence, deferred cleanup/timer/network/IRQ labels, valid accounting
      targets, explicit clocksource/accounting readiness or refusal, live
      syscall SQ-consumer state, revocation-latency policy, and disabled
      tick/SQPOLL/full-nohz guardrails.
- [x] Assign housekeeping and deferred-work placement before behavior.
      Completed 2026-05-12 18:36 UTC by
      `docs/tasks/done/2026/scheduler-phase-f-housekeeping-deferred-work-placement.md`.
      The proof keeps periodic ticks, SQPOLL, automatic CPU isolation, and
      generic full-nohz disabled.
- [x] Add bounded SQPOLL ring mode only after housekeeping/deferred-work
      placement. Completed 2026-05-12 20:29 UTC by
      `docs/tasks/done/2026/scheduler-phase-f-sqpoll-ring-mode-bounded-poller.md`.
      The proof covers one poller owner, bounded polling, stale queue-owner
      rejection, wake/sleep ordering, and teardown without losing completions
      while periodic ticks remain active.
- [x] Add clockevent/deadline substrate before automatic nohz activation.
      Completed 2026-05-12 23:07 UTC by
      `docs/tasks/done/2026/scheduler-phase-f-clockevent-deadline-substrate.md`.
      It split clocksource reads from clockevent programming, added a
      one-shot/restore timer backend, and converted tick-count waiters to
      absolute monotonic deadlines while ordinary scheduling remains periodic.
- [x] Add SQPOLL nohz progress that does not depend on periodic scheduler
      ticks. Completed 2026-05-13 00:06 UTC by
      `docs/tasks/done/2026/scheduler-phase-f-sqpoll-nohz-progress.md`.
      The proof preserves the one-SQ-consumer, `NEED_WAKEUP`, bounded
      polling, stale-owner rollback, and teardown/completion invariants while
      keeping periodic fallback service active.
- [x] Add automatic nohz activation only after placement, bounded SQPOLL
      behavior, the deadline substrate, and non-periodic SQPOLL progress.
      Completed 2026-05-14 09:01 UTC by
      `docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md`.
      The `CpuIsolationLease` activation preflight now performs real per-CPU
      periodic-tick suppression for the narrow single-runnable-entity window
      (`namedRing = none` compute lease on the preflight CPU): it masks the
      periodic LAPIC tick and arms a bounded one-shot deadline at
      `min(nearest pending timer wakeup, now + max revocation latency)`.
      Network polling and IRQ affinity stay read-only fail-closed admission
      gates -- any ring-coupled or device-owning mode keeps the conservative
      refusal. Every disqualifying change (stale lease generation, a second
      runnable entity, stealable sibling work, a local deferred-cleanup
      dependency, a target-CPU mismatch, or a one-shot backend that can no
      longer arm a deadline) rolls the CPU back to the periodic tick first.
      The `make run-scheduler-cpu-isolation-lease` proof asserts the
      activation and rollback log lines. Generic full-nohz and the broader
      SQPOLL-driven nohz state machine landed in later slices.
- [x] Measured suppressed-tick proof on the lease path (harness-hardening).
      Completed 2026-06-02 19:53 UTC by
      `docs/tasks/done/2026-06-02/scheduler-cpu-isolation-measured-suppressed-tick-proof.md`.
      Closes the review-identified honesty gap that the lease path proved
      suppression only by the `tick_suppression=active periodic_tick=masked`
      marker plus a no-hang progress loop, never that periodic timer interrupts
      actually stopped arriving. The kernel now counts genuine periodic LAPIC
      fires per CPU (`account_timer_fire` in the timer ISR increments only when
      neither the lease-backed nor idle tick-suppression bit is set, so the
      one-shot replacement is never miscounted), snapshots the count at
      activation, and on rollback emits
      `cpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w>
      expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>`; a bounded
      post-rollback `cpu-isolation: nohz restored-rate` line proves the periodic
      rate returns. The demo holds a childless compute lease on CPU 0 across a
      ~150 ms masked window, then a busy restore window; the harness asserts a
      masked window with `actual_periodic` near zero (`expected_periodic >= 10`,
      `suppressed >= 8`) and a restored window with `actual_periodic` tracking
      `expected_periodic` (`>= 8`). No activation behavior changed; the
      mask/one-shot mechanism is untouched. A durable `ticks_suppressed{cpu,mode}`
      telemetry field on a monitoring/status surface remains future work.
- [x] Timeout-based auto-revoke primitive on `CpuIsolationLease`. Landed via
      `docs/tasks/done/2026-05-30/scheduler-cpu-isolation-lease-timeout-auto-revoke.md`.
      Adds `leaseLifetimeNs @6` to `CpuIsolationLeaseSpec` (`0` = no expiry,
      preserving every existing producer); `read_spec` clamps to a one-hour
      ceiling and rejects a non-zero lifetime below `maxRevocationLatencyNs`
      (`invalidSpec`). A lease records `expires_at_ns` at creation; the first
      observation past the deadline auto-revokes through the existing
      generation-advancing cleanup (`reason=lease-expired`, registry
      unregister, SQPOLL stop, `rollback_nohz_for_lease`) and every subsequent
      `info`/`activationPreflight`/`revoke` reports `staleGeneration`. The nohz
      activation record carries the lifetime deadline so a tickless CPU under a
      lease that crosses its lifetime rolls back at the next timer/IPI recheck
      (`lease-lifetime-expired` disqualifier), bounded by `maxRevocationLatencyNs`.
      `make run-scheduler-cpu-isolation-lease` asserts the expiry release line,
      the post-expiry `staleGeneration`, and the `invalidSpec` rejection.
- [x] Enable tickless idle only when there is no runnable non-idle work and no
      cap-enter polling dependency. Completed 2026-05-23 09:12 UTC by
      `docs/tasks/done/2026/scheduler-tickless-idle-step6.md`. The idle path
      masks the periodic LAPIC tick only for true idle, arms a bounded
      one-shot at the nearest `Timer`/`ParkSpace` deadline or 100 ms
      housekeeping floor, and restores periodic mode before ordinary work.
      Ready-but-budget-throttled `SchedulingContext` retry windows remain
      periodic so budget replenishment and deadline notification timing stay on
      the existing scheduler accounting path.
- [ ] Keep automatic full-nohz behind the completed one-SQ-consumer ownership
      prerequisite and the narrower `CpuIsolationLease` telemetry/rollback
      proof. Generic full-nohz is not the first Phase F implementation task.
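The measured suppressed-tick item above reports `expected_periodic`, `actual_periodic`, and `suppressed` for a window and has the harness check thresholds on them. A minimal sketch of that arithmetic follows. It assumes a 10 ms `TICK_NS` and derives the expected count as `window_ns / TICK_NS`, which is an assumption rather than the kernel's documented formula. The `TickWindow` type and the function names are hypothetical.

```rust
/// Assumed periodic tick length (10 ms).
pub const TICK_NS: u64 = 10_000_000;

/// Hypothetical mirror of the fields on the `nohz suppressed-ticks` line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TickWindow {
    pub window_ns: u64,
    pub expected_periodic: u64,
    pub actual_periodic: u64,
    pub suppressed: u64,
}

/// Build the window record from a fire-count snapshot taken at activation
/// and the count observed at rollback.
pub fn tick_window(window_ns: u64, fires_at_start: u64, fires_at_end: u64) -> TickWindow {
    let expected = window_ns / TICK_NS;
    let actual = fires_at_end.saturating_sub(fires_at_start);
    TickWindow {
        window_ns,
        expected_periodic: expected,
        actual_periodic: actual,
        suppressed: expected.saturating_sub(actual),
    }
}

/// Masked-window thresholds quoted in the item above.
pub fn masked_window_ok(w: &TickWindow) -> bool {
    w.expected_periodic >= 10 && w.suppressed >= 8
}

/// Restored-window threshold quoted in the item above.
pub fn restored_window_ok(w: &TickWindow) -> bool {
    w.actual_periodic >= 8
}
```

Under these assumptions a 150 ms masked window with one stray fire gives `expected_periodic = 15` and `suppressed = 14`, which passes the masked check. A 100 ms restored window with ten fires passes the restored check and fails the masked one.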

## Phase F.5: Full-SMP Hardware Scalability

This phase is the planning slot for the next visible SMP milestone when the
project is ready to answer whether capOS uses 16/32-core machines well. It
does not replace the current Installable System selected milestone and should
not be dispatched as a QEMU-only benchmark cleanup. QEMU remains regression
infrastructure; the primary performance record should come from direct capOS
execution on a dedicated high-core perf runner or bare-metal/cloud-bare-metal
machine.

- [ ] Replace temporary four-owner scheduler assumptions with dynamic CPU
      topology: discovered scheduler CPU set, physical-core versus SMT sibling
      labeling, APIC id mapping, per-CPU allocation sizing, and boot/status
      output that makes the selected CPU set auditable.
- [ ] Add or select the APIC backend needed for high-core machines. xAPIC MMIO
      can remain the current low-core path, but x2APIC selection is the likely
      larger-APIC-id follow-up from `docs/research/x2apic-and-virtualization.md`.
- [ ] Shrink scheduler shared-state serialization. Local pick/requeue should
      avoid one global scheduler-lock critical section where possible, while
      shared process/thread metadata, blocking waiters, direct IPC handoff,
      timers/deadlines, and cleanup keep explicit ownership and rollback rules.
- [ ] Add topology-aware placement and observable migration policy. The record
      should distinguish local enqueue, cross-core wake, steal, SMT sibling
      placement, failed placement, reschedule IPI, and TLB-shootdown costs.
- [ ] Build the hardware benchmark profile from existing benchmark proposals:
      static map/reduce, uneven dynamic task pool, barrier phase loop,
      independent processes, same-process threads, and one
      capability-call/service-bound workload. Each workload reports work-window
      and total-time rows at 1/2/4/8/16/32 workers when hardware exists.
- [ ] Record matching native Linux rows on the same machine, plus capOS raw
      artifacts with source commit, toolchain, topology, frequency/isolation
      policy, run count, warmup policy, verifier output, medians, variance,
      speedup, efficiency, and scheduler counters.
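The benchmark items above ask for medians, speedup, and efficiency per worker count. One conventional way to derive those columns is sketched below, taking speedup as the one-worker median divided by the N-worker median and efficiency as speedup divided by N. The `ScalingRow` type and the function names are hypothetical, and the benchmark proposals may define the columns differently.

```rust
/// Median of raw timing samples in nanoseconds; `None` for an empty set.
pub fn median_ns(samples: &[u64]) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        // Midpoint of the two central samples, written to avoid overflow.
        sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2
    })
}

/// Hypothetical result row for one worker count.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ScalingRow {
    pub workers: u32,
    pub median_ns: u64,
    pub speedup: f64,
    pub efficiency: f64,
}

/// Derive speedup and efficiency against the one-worker median.
pub fn scaling_row(baseline_median_ns: u64, workers: u32, samples: &[u64]) -> Option<ScalingRow> {
    let median = median_ns(samples)?;
    if median == 0 || workers == 0 {
        return None;
    }
    let speedup = baseline_median_ns as f64 / median as f64;
    Some(ScalingRow {
        workers,
        median_ns: median,
        speedup,
        efficiency: speedup / workers as f64,
    })
}
```

The same function would be applied separately to the work-window and total-time samples, and to the capOS and native Linux rows, so each pair of rows stays directly comparable.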

## Phase G: Realtime Islands

- [ ] Define `RealtimeIsland` admission inputs: scheduling contexts, memory
      reservations, device/IRQ reservations, communication paths, CPU leases,
      and overrun policy.
- [ ] Add a small local-audio or synthetic periodic-control proof before
      robotics or provider workloads.
- [ ] Prove no allocation, blocking endpoint call, paging, or logging on the
      admitted realtime path.
- [ ] Record deadline misses and overrun handling as observable output.

## Phase H: Policy Service

- [ ] Define a privileged scheduler policy service interface for admission,
      budget/profile updates, CPU lease grant/revoke, and diagnostics.
- [ ] Keep kernel fallback scheduling independent of policy-service liveness.
- [ ] Add manifest/config hooks for default profiles without making policy
      changes require kernel rebuilds.
- [ ] Add operator diagnostics that explain why a thread or island was denied,
      throttled, migrated, or revoked.
- [ ] Define how stateful task/job graph assignment metadata maps into
      scheduler policy inputs: graph priority to weight/latency class, graph
      deadline to request freshness or admission input, graph budget to
      `SchedulingContext` reference, and graph queue to policy-service
      placement. The graph coordinator must not mint CPU authority by itself.
- [ ] Design the user-space policy-service AutoNoHz placement heuristic for
      ordinary threads that appear capable of utilizing a full CPU core. The
      policy service synthesizes the "thread appears capable of utilizing a
      full CPU core" decision from a future monitoring/status surface and
      issues a bounded `CpuIsolationLease` against a pre-authorized account
      or session CPU pool. The lease is placement only; it does not mint
      CPU-time authority. Required bounds on every auto-issued lease:
      lifetime shorter than admin-issued leases by default and renewable
      only by re-observing the signal; `max_revocation_latency_ns` bounded
      by `NoHzEligibility`; accounting target a live `SchedulingContext` or
      coarse `ResourceLedger`; CPU set restricted to the operator-declared
      auto-claim pool; priority-aware fairness preemption that terminates
      the lease (not just rolls back tick suppression) on arrival of an
      equal-or-higher priority runnable entity. Prerequisites:
      (a) a timeout-based auto-revoke primitive on `CpuIsolationLease`
      -- LANDED 2026-05-30 as `leaseLifetimeNs @6` (`0` = no expiry) with
      enforced first-observation auto-revoke and a `lease-lifetime-expired`
      nohz rollback; the auto-claim placement lease can now be granted with a
      bounded lifetime. The bounded `renew` half LANDED as
      `CpuIsolationLease.renew @4`, which pushes the deadline forward by at most
      the original lifetime while keeping the lease's identity / accounting /
      nohz state, leaving only the renewal-by-re-observation *heuristic* (when to
      call `renew`) to Phase H;
      (b) the monitoring/status surface that exports per-thread saturation
      observation -- LANDED 2026-05-30 as the non-`measure` per-thread
      saturation status surface. `voluntary_blocks` and `preemptions` were
      promoted out of `cfg(feature = "measure")`, an always-built
      `runnable_accumulated_ns` runnable-but-not-running accumulator was
      added (stamped at the run-queue enqueue chokepoint, accumulated at
      selection), and all three plus `runtime_ns` are exported through
      `SchedulingPolicyCap.snapshot @2` (proof `make run-thread-fairness`:
      hog `voluntary_blocks=0` with live `preemptions`/`runnable_ns`).
      `migrations` stays `measure`-gated. This read-side surface exports raw
      cumulative counters only; windowing and the saturation decision remain
      policy-service work;
      (c) the pool-grant authority shape that lets an operator pre-authorize
      an account's auto-claim pool. Declared-pool descriptor LANDED 2026-05-30:
      the `CpuIsolationLeaseSpec` carries `poolId @7` (`0` = the implicit
      default pool over every scheduler CPU), the kernel seeds a fixed
      declared-pool registry (`CpuIsolationPoolDescriptor`: default pool `0` plus
      one declared non-default pool `1` over a single CPU), and `read_spec`
      admits a lease only when its `poolId` is declared and its `allowedCpuMask`
      is a subset of the pool's CPU mask -- echoing the admitting pool's
      id/mask through `CpuIsolationLeaseInfo` (proof
      `make run-scheduler-cpu-isolation-lease`: `nondefault_pool=invalidSpec`
      (undeclared id), `declared_pool=ok admitted_pool_id=1
      admitted_pool_cpu_mask_subset=true`, `declared_pool_mask_violation=invalidSpec`,
      `default_pool_id=0`). Manifest-sourced pool table LANDED 2026-05-30: the
      declared-pool registry is sourced from the boot manifest
      `SystemConfig.cpuIsolationPools @14` (each entry a
      `CpuIsolationPoolDescriptor`), with the in-kernel constant as the
      fail-closed default when the manifest omits/empties the list; the kernel
      validates each entry fail-closed at boot (canonical CPU mask subset of the
      scheduler mask, default pool `0` synthesized if omitted, duplicate ids
      rejected) and emits `cpu-isolation: declared-pools source=manifest count=3
      ...` (proof `make run-scheduler-cpu-isolation-lease`; kernel-default
      fallback proven by `cargo test-config` decode/empty assertions). Per-pool
      live-lease capacity bound LANDED 2026-05-31: `CpuIsolationPoolDescriptor`
      carries `poolMaxLeases @2` (`0` = unbounded); a non-zero value caps the
      number of simultaneously live (non-revoked, current-generation) leases the
      kernel admits against that pool at create-time, counted from the existing
      `LEASE_REGISTRY` after `prune_dead`, rejecting an over-capacity create
      fail-closed `resourceExhausted`. The manifest bounds pool `2` at
      `poolMaxLeases: 2`; the proof admits two live leases, refuses a third
      (`cpu-isolation: pool-capacity-rejected admitted_pool_id=2 live_leases=2
      pool_max_leases=2 result=resourceExhausted`,
      `pool_capacity_exceeded=resourceExhausted`), and reclaims after a revoke
      (`pool_capacity_reclaimed=ok`) -- live-count, not cumulative. This is the
      count+reject mechanism the per-account `N` policy keys onto. Account
      identity + per-account `N` LANDED 2026-05-31:
      `CpuIsolationLeaseSpec` carries `accountId @8 :UInt64` (`0` =
      unattributed, caller-asserted and inert until counted, echoed read-only
      through `CpuIsolationLeaseInfo.accountId @6`) and
      `CpuIsolationPoolDescriptor` carries `poolMaxLeasesPerAccount @3 :UInt32`
      (`0` = unbounded per account). After the pool-wide check, `register`
      counts the requesting account's live entries (`admitted_pool_id` AND
      `account_id` both matching) against the per-account bound and rejects an
      over-bound create fail-closed `resourceExhausted` (`0` account or `0`
      bound skips the gate). The manifest bounds pool `2` at
      `poolMaxLeasesPerAccount: 1`; the proof admits one account-7 lease,
      refuses a second account-7 create (`cpu-isolation:
      account-capacity-rejected admitted_pool_id=2 account_id=7
      account_live_leases=1 pool_max_leases_per_account=1
      result=resourceExhausted`, `account_capacity_exceeded=resourceExhausted`),
      admits a different account-9 lease on that CPU
      (`account_capacity_other_account=ok` -- per-account, not pool-wide), and
      reclaims after revoking account-7 (`account_capacity_reclaimed=ok`). The
      account id is caller-asserted, not yet authenticated.
      Bootstrap pool-grant authentication LANDED 2026-05-31:
      `CpuIsolationPoolGrant` (`schema/capos.capnp`, source
      `cpu_isolation_pool_grant`, kernel
      `kernel/src/cap/cpu_isolation_pool_grant.rs`) introduced a
      bootstrap-staged grant binding one authenticated account to one declared
      pool.
      `createLease` stamps the bound account/pool onto the minted lease,
      overriding any caller-asserted `accountId`/`poolId`, and reuses the exact
      lease-create admission path (`cpu_isolation::create_lease_for_caller`), so
      the per-account bound is unforgeable: a holder can no longer assert another
      account to evade `poolMaxLeasesPerAccount`. The initial proof used one
      account-7/pool-2 grant; the current manifest-sourced proof below exercises
      multiple seeded grants.
      Manifest-declared multi-account grant table LANDED 2026-06-01: the grant
      binding is now operator-declared via `SystemConfig.cpuIsolationPoolGrants`
      (`schema/capos.capnp`, decoded in `capos-config`, seeded at boot by
      `cpu_isolation_pool_grant::seed_pool_grants` after `seed_declared_pools`),
      mirroring the manifest-sourced `cpuIsolationPools` table; the
      `cpu_isolation_pool_grant` / `cpu_isolation_pool_grant_secondary` sources
      stage seeded binding index `0` / `1`, so a manifest can pre-authorize
      multiple distinct `(account, pool)` grants, each staged as its own
      bootstrap cap. An absent/empty list falls back to one in-kernel binding at
      index `0`: account `7` bound to preferred pool `1` when active, otherwise
      account `7` bound to synthesized default pool `0`, so manifest-sourced
      pool tables that omit pool `1` still stage a usable default grant. Proof
      `make run-scheduler-cpu-isolation-pool-grant` now boots a two-entry grant
      table (account `5`/pool `1`, account `8`/pool `2`), holds both grant caps,
      and proves each stamps its OWN bound account (`pool-grant: create ok
      bound=A stamped_account_id=5 ...` / `bound=B stamped_account_id=8 ...`) with
      the per-account bound still enforced fail-closed under the manifest-sourced
      path; boot evidence `cpu-isolation: pool-grants source=manifest count=2`.
      Fallback proof
      `make run-scheduler-cpu-isolation-pool-grant-default` boots a
      manifest-sourced pool table that declares pool `2` and omits pool `1` plus
      an empty grant list; the kernel stages one default grant as `(account 7,
      pool 0)` and the smoke proves it can mint a stamped lease.
      Runtime grant minting landed (`CpuIsolationGrantMinter`): one cap mints a
      fresh `CpuIsolationPoolGrant` for an operator-chosen `(account, pool)` at
      call time, bounded by the declared
      `SystemConfig.cpuIsolationGrantMinterAllowlist` (an out-of-allowlist mint is
      refused `unauthorized`, so it is never an ambient grant-any authority; the
      minted grant reuses the same unforgeable `createLease` admission path). The
      same `run-scheduler-cpu-isolation-pool-grant` smoke now also mints a grant
      for the allowed `(account 6, pool 2)`, proves its `createLease` stamps
      account `6` and stays bounded by the per-account gate, and proves an
      out-of-allowlist `(account 99, pool 2)` mint is refused; boot evidence
      `cpu-isolation: grant-minter-allowlist source=manifest count=1`.
      Grant-revocation lifecycle landed (`CpuIsolationGrantMinter.revokeGrant`):
      a runtime-minted grant gets a revocable `(grantId, generation)` identity;
      `revokeGrant(grantId)` advances the grant generation so a stale grant
      handle's `createLease` fails `staleGeneration`, and cascades to every live
      lease minted through it -- reusing the landed fairness-termination cleanup
      (`reason=grant-revoked`, periodic-tick rollback, registry unregister) so the
      per-pool/per-account live-lease capacity frees immediately and a fresh grant
      is admitted into the reclaimed slot. Double-revoke is `alreadyRevoked` and an
      unknown `grantId` is `unknownGrant`, both fail-closed. The same
      `run-scheduler-cpu-isolation-pool-grant` smoke proves the full lifecycle.
      This closes Track C (prerequisite (c)) -- operator grant authority is now
      mint + revoke complete.
      The detailed design is in
      `docs/proposals/tickless-realtime-scheduling-proposal.md`
      "Policy-Service Userstories: AutoNoHz Placement for Compute-Capable
      Threads".
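
The count-and-reject admission described above (pool-wide bound first, then
the per-account bound) can be sketched on the host as follows. This is a
minimal illustration, not the kernel's `register` path: `LiveLease`,
`PoolBounds`, `AdmitError`, and `admit` are hypothetical names, and the slice
stands in for `LEASE_REGISTRY` after `prune_dead`.

```rust
// Hypothetical host-side sketch; names mirror the prose, not the kernel code.

#[derive(Debug, PartialEq, Eq)]
pub enum AdmitError {
    /// Pool-wide or per-account live-lease bound reached.
    ResourceExhausted,
}

/// One live (non-revoked, current-generation) lease in the registry.
#[derive(Clone, Copy)]
pub struct LiveLease {
    pub admitted_pool_id: u32,
    pub account_id: u64,
}

/// Bounds taken from a pool descriptor; `0` means unbounded.
#[derive(Clone, Copy)]
pub struct PoolBounds {
    pub pool_id: u32,
    pub pool_max_leases: u32,
    pub pool_max_leases_per_account: u32,
}

/// Live-count admission: the pool-wide check runs first, then per-account.
pub fn admit(live: &[LiveLease], pool: PoolBounds, account_id: u64) -> Result<(), AdmitError> {
    let pool_live = live
        .iter()
        .filter(|l| l.admitted_pool_id == pool.pool_id)
        .count();
    if pool.pool_max_leases != 0 && pool_live >= pool.pool_max_leases as usize {
        return Err(AdmitError::ResourceExhausted);
    }
    // A zero account id or a zero bound skips the per-account gate.
    if account_id != 0 && pool.pool_max_leases_per_account != 0 {
        let account_live = live
            .iter()
            .filter(|l| l.admitted_pool_id == pool.pool_id && l.account_id == account_id)
            .count();
        if account_live >= pool.pool_max_leases_per_account as usize {
            return Err(AdmitError::ResourceExhausted);
        }
    }
    Ok(())
}
```

Because the count is taken over live entries only, removing a revoked lease
from the slice frees its slot immediately, which is the "live-count, not
cumulative" behavior the proof lines record.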

### AutoNoHz Decomposition: Roadmap to Full Auto-NoHz

The status bullet above narrates *what landed*. This subsection is the
discrete dispatchable decomposition from the current landed state to full
operator-driven auto-nohz, so the path is written as concrete slices rather
than "future work" prose. Grounding: the proposal's "Policy-Service
Userstories: AutoNoHz Placement", "Bounds the policy service must enforce",
"Telemetry Requirements", and Implementation Sequence steps 7/14/17.

Landed substrate (not repeated below): the narrow manual per-CPU LAPIC
tick-mask for the single-runnable compute window and the SQPOLL-coupled
window, tickless idle, prerequisite (a) `leaseLifetimeNs @6` timeout
auto-revoke, prerequisite (b) the `SchedulingPolicyCap.snapshot @2`
saturation observation surface, and prerequisite (c) pool-grant authority now
mint + revoke complete (the manifest-declared multi-account
`cpuIsolationPoolGrants @15` table, runtime grant minting through
`CpuIsolationGrantMinter`, and the grant-revocation lifecycle that cascades to
minted leases). Fairness lease *termination* (Track D) and a *measured*
suppressed-tick proof have also landed, as have network-poll and IRQ-affinity
housekeeping routing, kernel-side generic full-nohz admission for ordinary
budgeted compute threads, and generic SQPOLL nohz admission for explicitly
leased caller-thread rings. What the name "auto nohz" still oversells today:
there is no production policy service, and broader userspace-poller/device-queue
issuance remains future work. Each remaining slice below closes one of those.

Conflict-domain note: every kernel slice here shares
`resource:scheduler-cpu-isolation` and writes `kernel/src/cap/cpu_isolation*`
or `kernel/src/sched.rs`, so they serialize against each other -- dispatch
the chain head first; the rest convert from this list into
`docs/tasks/` records as their `depends_on` closes. Slices marked
**ready** have a task record under `docs/tasks/`; the rest stay here
until their prerequisite lands.

Next increment (decomposed 2026-06-04 00:18 UTC; updated 2026-06-07 after
generic SQPOLL nohz landed): Track C, Track D, and the measured suppressed-tick
proof are all landed, and the ordinary-thread and SQPOLL-ring kernel admission
leaves are now done.
Task records under `docs/tasks/` mark these slices as done:
`scheduler-cpu-isolation-lease-renewal-on-reobservation` (renewal residual),
`scheduler-nohz-irq-affinity-housekeeping-routing`,
`scheduler-nohz-network-poll-housekeeping-routing`,
`scheduler-deadline-driven-budget-accounting`, and
`scheduler-generic-full-nohz-arbitrary-threads`. The remaining
operator-driven AutoNoHz capstone is the policy service.
These scheduler CPU-isolation slices serialize against each other on
`resource:scheduler-cpu-isolation` but are parallel-safe against the in-flight
Phase C network-stack lane, so the scheduler lane stays runnable whenever Phase
C 7c holds the kernel `cap/` surface.

Track C -- complete operator grant authority (prerequisite (c) residual):

- [x] `scheduler-cpu-isolation-runtime-grant-minting` -- behavior, normal,
      **LANDED 2026-06-02 22:24 UTC**. One cap (`CpuIsolationGrantMinter`) mints a fresh
      `CpuIsolationPoolGrant` for an operator-chosen `(account, pool)` at call
      time, bounded by the declared `SystemConfig.cpuIsolationGrantMinterAllowlist`
      (an out-of-allowlist pair is refused `unauthorized`), instead of only the
      boot-seeded table. The minted grant reuses the same unforgeable
      `createLease` admission path. Proof `make run-scheduler-cpu-isolation-pool-grant`.
      depends_on: manifest-multi-account grant table (landed).
- [x] `scheduler-cpu-isolation-grant-revocation-lifecycle` -- behavior,
      normal, **LANDED 2026-06-03 17:11 UTC**. `CpuIsolationGrantMinter.revokeGrant`
      revokes a runtime-minted grant by advancing its `(grantId, generation)` so
      later `createLease` through the stale handle fails `staleGeneration` and
      mints nothing; revocation cascades to every live lease minted through that
      grant, driving the landed fairness-termination cleanup
      (`reason=grant-revoked`, periodic-tick rollback, registry unregister) once
      per tagged lease so per-pool/per-account capacity frees immediately (a fresh
      grant's lease is admitted into the reclaimed slot in the proof). Double-revoke
      is `alreadyRevoked`, unknown `grantId` is `unknownGrant`, seeded grants stay
      un-revocable. Closes Track C. Proof
      `make run-scheduler-cpu-isolation-pool-grant`.
      depends_on: `scheduler-cpu-isolation-runtime-grant-minting` (landed),
      `scheduler-cpu-isolation-priority-aware-lease-termination` (landed).
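
The `(grantId, generation)` lifecycle in the Track C items can be sketched as
a small host-side model. Everything here is an assumption for illustration:
`GrantTable`, `GrantHandle`, and the `HashMap` storage are hypothetical, the
kernel is not expected to use `std`, and un-revocable seeded grants are not
modeled.

```rust
// Hypothetical host-side model of mint / createLease / revokeGrant.
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum GrantError {
    StaleGeneration,
    AlreadyRevoked,
    UnknownGrant,
}

/// A holder's handle: valid only while its generation matches the table.
#[derive(Clone, Copy)]
pub struct GrantHandle {
    pub grant_id: u32,
    pub generation: u32,
}

struct GrantState {
    generation: u32,
    revoked: bool,
    live_leases: Vec<u64>, // lease ids minted through this grant
}

#[derive(Default)]
pub struct GrantTable {
    grants: HashMap<u32, GrantState>,
    next_lease_id: u64,
}

impl GrantTable {
    /// Mint a runtime grant at generation 1.
    pub fn mint(&mut self, grant_id: u32) -> GrantHandle {
        let state = GrantState { generation: 1, revoked: false, live_leases: Vec::new() };
        self.grants.insert(grant_id, state);
        GrantHandle { grant_id, generation: 1 }
    }

    /// Mint a lease only if the handle's generation is still current.
    pub fn create_lease(&mut self, h: GrantHandle) -> Result<u64, GrantError> {
        let g = self.grants.get_mut(&h.grant_id).ok_or(GrantError::UnknownGrant)?;
        if g.generation != h.generation {
            return Err(GrantError::StaleGeneration);
        }
        self.next_lease_id += 1;
        g.live_leases.push(self.next_lease_id);
        Ok(self.next_lease_id)
    }

    /// Advance the generation and return the cascaded lease ids to clean up.
    pub fn revoke_grant(&mut self, grant_id: u32) -> Result<Vec<u64>, GrantError> {
        let g = self.grants.get_mut(&grant_id).ok_or(GrantError::UnknownGrant)?;
        if g.revoked {
            return Err(GrantError::AlreadyRevoked);
        }
        g.revoked = true;
        g.generation += 1;
        Ok(std::mem::take(&mut g.live_leases))
    }
}
```

The returned lease ids are what a caller would feed into the per-lease
cleanup; in the sketch a second revoke and an unknown id both fail closed
without touching any lease.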

Track D -- fairness preemption (proposal `fairness_preemption`):

- [x] `scheduler-cpu-isolation-priority-aware-lease-termination` -- behavior,
      normal, **LANDED 2026-06-02 21:17 UTC**. On arrival of an equal-or-higher
      policy-priority runnable on the leased CPU when no other CPU authorized by
      both the admitted pool and the lease `allowedCpuMask` is eligible, the
      kernel now terminates (revokes) the lease itself at the existing nohz
      rollback site (`fairness-preempted ... result=lease-terminated`), not just
      restores the periodic tick, bounded by `maxRevocationLatencyNs`. The
      recheck compares the static WFQ policy priority (`latency_class`, `weight`)
      of the arriving entity against the captured leased thread; a strictly-lower
      arrival or an eligible sibling CPU inside both masks keeps the existing
      tick-restore-only behavior. The termination runs the same
      generation-advancing cleanup `leaseLifetimeNs` expiry uses
      (`reason=fairness-preempted`) immediately after the scheduler restores the
      periodic tick, so a subsequent `info`/`revoke` reports `staleGeneration`
      and placement/account capacity is freed without waiting for the holder's
      next cap call. Proven in `make run-scheduler-cpu-isolation-lease` (default
      pool `0` with `allowedCpuMask=0x01`: an equal-priority sibling terminates
      and capacity is reclaimed, a strictly-lower sibling restores only). Out of
      scope: re-placement onto an eligible sibling CPU (the "no sibling eligible"
      condition is recorded; actual migration is generic-full-nohz work).
      depends_on: auto-nohz-activation (landed).
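
The Track D recheck reduces to one comparison plus one mask intersection. The
sketch below is a hedged illustration: `PolicyPriority::rank`, `on_arrival`,
and `eligible_mask` are hypothetical names, and the ordering (a lower
`latency_class` is more urgent, a higher `weight` wins within a class) is an
assumption rather than something this backlog states.

```rust
// Hypothetical sketch of the terminate-versus-restore decision.
use std::cmp::Reverse;

/// Static WFQ policy priority of a schedulable entity.
#[derive(Clone, Copy)]
pub struct PolicyPriority {
    pub latency_class: u8,
    pub weight: u32,
}

impl PolicyPriority {
    /// ASSUMPTION: lower latency_class ranks higher, then higher weight.
    fn rank(self) -> (Reverse<u8>, u32) {
        (Reverse(self.latency_class), self.weight)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    TickRestoreOnly,
    LeaseTerminated,
}

/// Decide what happens when `arriving` becomes runnable on the leased CPU.
pub fn on_arrival(
    arriving: PolicyPriority,
    leased: PolicyPriority,
    leased_cpu: u32,
    pool_mask: u64,
    allowed_mask: u64,
    eligible_mask: u64,
) -> Outcome {
    // A sibling counts only if both the admitted pool and the lease's
    // allowedCpuMask authorize it and it is currently eligible.
    let siblings = pool_mask & allowed_mask & eligible_mask & !(1u64 << leased_cpu);
    if arriving.rank() >= leased.rank() && siblings == 0 {
        Outcome::LeaseTerminated
    } else {
        Outcome::TickRestoreOnly
    }
}
```

With `allowed_mask = 0x01` no sibling can ever qualify, which matches the
shape of the landed proof: an equal-priority arrival terminates the lease and
a strictly-lower one only restores the tick.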

Lease lifetime renewal (proposal `lifetime_ns` renewal residual):

- [x] `scheduler-cpu-isolation-lease-renewal-on-reobservation` -- behavior,
      normal, **landed**. `CpuIsolationLease.renew @4` pushes `expires_at_ns`
      forward to `now + leaseLifetimeNs` (clamped to the same one-hour ceiling
      `read_spec` enforces), keeping the same `(leaseId, generation)`, accounting
      binding, and nohz activation state. Callable only before expiry: a revoked,
      auto-revoked, or past-deadline lease stays stale (`staleGeneration`) and is
      not resurrected, and an unbounded `leaseLifetimeNs = 0` (or factory) lease
      reports `notRenewable`. The renewed deadline is propagated to a tickless
      CPU's nohz activation record (`renew_nohz_lifetime_deadline_for_lease`) so
      the `lease-lifetime-expired` disqualifier no longer rolls it back at the old
      deadline. `CpuIsolationLeaseInfo.expiresAtNs` echoes the deadline read-only.
      This is the kernel primitive the policy service uses to renew an
      auto-issued lease after re-observing the saturation signal; the
      re-observation heuristic itself stays Phase H policy-service work. Proof
      `make run-scheduler-cpu-isolation-lease`.
      depends_on: timeout-auto-revoke (landed).
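
A minimal sketch of the renewal rules above, assuming a simplified `Lease`
struct. The type, the `revoked` flag, and the order in which the stale and
not-renewable checks run are illustrative assumptions; only the one-hour
ceiling, the `now + leaseLifetimeNs` deadline, and the two error names come
from the item.

```rust
// Hypothetical sketch of renew semantics; not the kernel's lease type.

/// One-hour ceiling on a lease lifetime, in nanoseconds.
pub const MAX_LEASE_LIFETIME_NS: u64 = 3_600 * 1_000_000_000;

#[derive(Debug, PartialEq, Eq)]
pub enum RenewError {
    StaleGeneration,
    NotRenewable,
}

pub struct Lease {
    pub lease_lifetime_ns: u64, // 0 = unbounded
    pub expires_at_ns: u64,
    pub revoked: bool,
}

impl Lease {
    /// Push the deadline to `now + lifetime` (clamped); never resurrects.
    pub fn renew(&mut self, now_ns: u64) -> Result<u64, RenewError> {
        let past_deadline = self.lease_lifetime_ns != 0 && now_ns >= self.expires_at_ns;
        if self.revoked || past_deadline {
            return Err(RenewError::StaleGeneration);
        }
        if self.lease_lifetime_ns == 0 {
            return Err(RenewError::NotRenewable);
        }
        let lifetime = self.lease_lifetime_ns.min(MAX_LEASE_LIFETIME_NS);
        self.expires_at_ns = now_ns.saturating_add(lifetime);
        Ok(self.expires_at_ns)
    }
}
```

The returned value is the new deadline, which is the number a caller would
also propagate to any nohz activation record tied to the lease.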

Honesty / telemetry (proposal Telemetry `ticks_suppressed{cpu,mode}`):

- [x] `scheduler-cpu-isolation-measured-suppressed-tick-proof` --
      harness-hardening, normal, **LANDED 2026-06-02 19:53 UTC**
      (`docs/tasks/done/2026-06-02/scheduler-cpu-isolation-measured-suppressed-tick-proof.md`).
      A kernel expected-vs-actual periodic-tick counter (`account_timer_fire`,
      counted only when no tick-suppression bit is set) over a bounded nohz
      window is asserted in `make run-scheduler-cpu-isolation-lease`
      (`cpu-isolation: nohz suppressed-ticks ...` plus a `restored-rate` line),
      so the proof shows the periodic tick actually stopped firing, not only that
      the mask write was issued and the CPU made progress. Closed the
      review-identified honesty gap. A durable `ticks_suppressed{cpu,mode}`
      telemetry field on a monitoring/status surface remains future work.
      depends_on: auto-nohz-activation (landed).
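
The expected-versus-actual comparison behind the measured proof can be
sketched as below. `TickCounter`, `TickWindow`, and the 10 ms period used in
the usage note are hypothetical; only the rule that `account_timer_fire`
counts a fire when no tick-suppression bit is set is taken from the item.

```rust
// Hypothetical sketch of the suppressed-tick arithmetic.

/// Counts periodic-tick fires; fires under suppression are not counted.
#[derive(Default)]
pub struct TickCounter {
    fires: u64,
}

impl TickCounter {
    pub fn account_timer_fire(&mut self, suppression_bits: u32) {
        if suppression_bits == 0 {
            self.fires += 1;
        }
    }

    pub fn fires(&self) -> u64 {
        self.fires
    }
}

/// A bounded measurement window over one CPU.
pub struct TickWindow {
    pub start_ns: u64,
    pub end_ns: u64,
    pub tick_period_ns: u64,
    pub actual_fires: u64,
}

impl TickWindow {
    /// Ticks a periodic timer would have delivered over the window.
    pub fn expected(&self) -> u64 {
        (self.end_ns - self.start_ns) / self.tick_period_ns
    }

    /// Ticks that did not fire: the quantity the proof asserts on.
    pub fn suppressed(&self) -> u64 {
        self.expected().saturating_sub(self.actual_fires)
    }
}
```

For a 100 ms window at an assumed 10 ms period, one counted fire gives
`expected() == 10` and `suppressed() == 9`; a restored-rate check would be
the same computation over a window taken after rollback.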

Step 7 -- network poll housekeeping/deadline routing:

- [x] `scheduler-nohz-network-poll-housekeeping-routing` -- behavior, normal,
      **landed 2026-06-04 04:48 UTC**. The in-kernel virtio-net poll
      (`virtio::poll_scheduler`) now routes off a lease-isolated (tickless) CPU:
      it consults `sched::current_cpu_lease_nohz_active()` and skips, emitting a
      bounded `cpu-isolation: network-poll routed ... result=skipped-on-isolated-cpu`
      record, while the always-ticking housekeeping CPU the admission requires
      keeps the poll progressing. The `network_polling` admission gate flips from
      the hard `rejected-periodic-network-polling-not-routed-to-housekeeping`
      refusal to a housekeeping-conditioned
      `routed-periodic-network-polling-to-housekeeping-cpu` admit (eligibility
      accepts the `routed-` prefix), and fails closed
      (`rejected-network-polling-no-housekeeping-cpu-to-relocate`) when no
      housekeeping CPU exists. The admitted `named_ring=None` lease carries the
      routed label tick-suppressed; the `CallerThread` compute-with-ring lease's
      network refusal is removed but it stays `ForcedPeriodic` because IRQ
      affinity routing is the separate slice below. Proof
      `make run-scheduler-cpu-isolation-lease`; regression `make run-net`.
      depends_on: housekeeping-deferred-work-placement (landed),
      auto-nohz-activation (landed).
- [x] `scheduler-nohz-irq-affinity-housekeeping-routing` -- behavior, normal,
      **landed** (`docs/tasks/done/2026-06-04/`). The activation path reroutes
      an opting-in leased CPU's legacy IO-APIC redirection-entry destinations
      onto the selected housekeeping CPU (mask-before-reprogram + read-back,
      restored on rollback/revoke) before admitting tick suppression, and keeps
      the conservative `rejected-irq-affinity-not-routed-to-housekeeping`
      refusal for a ring-coupled IRQ dependency that cannot be safely rerouted.
      Proof `make run-scheduler-cpu-isolation-lease`
      (`irq-affinity ok ... routed_admitted=true restored_on_revoke=true
      residual_forced_periodic=true`); DDF `run-interrupt-grant` /
      `run-devicemmio-grant` stay green. Scoped to a quiescent housekeeping
      destination: under the in-kernel KVM irqchip, reprogramming an IO-APIC
      redirection-entry destination onto an actively-scheduling CPU stalls that
      CPU's forward progress, so the live reroute is gated to a focused proof
      lease (reroute sentinel `maxRevocationLatencyNs`) whose destination is
      idle. A general busy-destination reroute remains future work behind a
      destination-quiescence gate or a non-KVM-irqchip delivery backend.
      depends_on: auto-nohz-activation (landed).
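
The network-poll half of Step 7 is a housekeeping-conditioned gate plus a
per-CPU skip. The sketch below is illustrative only: the three function names
are hypothetical, and `label_is_eligible` models just the `routed-` prefix
rule, not whatever other labels the real eligibility check accepts. The label
strings are the ones quoted in the item.

```rust
// Hypothetical sketch of the network_polling admission gate.

/// Admission label for the periodic network-polling dependency.
pub fn network_polling_gate(housekeeping_cpu: Option<u32>) -> &'static str {
    match housekeeping_cpu {
        Some(_) => "routed-periodic-network-polling-to-housekeeping-cpu",
        // Fail closed: nothing can take over the poll.
        None => "rejected-network-polling-no-housekeeping-cpu-to-relocate",
    }
}

/// ASSUMPTION: only the `routed-` prefix rule is modeled here.
pub fn label_is_eligible(label: &str) -> bool {
    label.starts_with("routed-")
}

/// The poll runs on a CPU only when no lease-driven nohz window is active.
pub fn should_poll_on_this_cpu(lease_nohz_active: bool) -> bool {
    !lease_nohz_active
}
```

The always-ticking housekeeping CPU sees `lease_nohz_active == false` and
keeps polling, so skipping on the isolated CPU does not stall the network
path.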

Step 14 -- generic SQPOLL nohz for arbitrary rings:

- [x] `scheduler-generic-sqpoll-nohz-arbitrary-rings` -- behavior, normal,
      done 2026-06-07. The SQPOLL nohz state machine now admits explicitly
      leased caller-thread rings when the SQPOLL worker is live, the ring is
      running/sleeping with a non-stale owner, exactly one SQ consumer is
      present, and producer wake/deadline rollback are bounded. The focused
      `make run-scheduler-generic-sqpoll-nohz` proof drives eligible entry,
      producer wake, SQPOLL service, rollback, and stale-owner rejection.
      Broader `AutoUserspacePoller` userspace-poller/device-queue issuance
      remains future policy-service work.
      depends_on: auto-nohz-sqpoll (landed),
      `scheduler-nohz-network-poll-housekeeping-routing`.
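
The Step 14 admission conditions read naturally as a fail-closed predicate
over a ring snapshot. This is a sketch under stated assumptions:
`RingSnapshot`, `RingState`, `SqpollReject`, and the check order are all
hypothetical, and the two boundedness conditions are folded into one flag for
brevity.

```rust
// Hypothetical sketch of SQPOLL nohz admission for a leased ring.

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum RingState {
    Running,
    Sleeping,
    Stopped,
}

pub struct RingSnapshot {
    pub worker_live: bool,
    pub state: RingState,
    pub owner_stale: bool,
    pub sq_consumers: u32,
    pub wake_and_rollback_bounded: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SqpollReject {
    WorkerNotLive,
    RingNotActive,
    StaleOwner,
    NotSingleConsumer,
    UnboundedWakeOrRollback,
}

/// Every condition must hold; the first failing one is reported.
pub fn admit_sqpoll_nohz(r: &RingSnapshot) -> Result<(), SqpollReject> {
    if !r.worker_live {
        return Err(SqpollReject::WorkerNotLive);
    }
    if !matches!(r.state, RingState::Running | RingState::Sleeping) {
        return Err(SqpollReject::RingNotActive);
    }
    if r.owner_stale {
        return Err(SqpollReject::StaleOwner);
    }
    if r.sq_consumers != 1 {
        return Err(SqpollReject::NotSingleConsumer);
    }
    if !r.wake_and_rollback_bounded {
        return Err(SqpollReject::UnboundedWakeOrRollback);
    }
    Ok(())
}
```

Returning the first disqualifier rather than a bare `bool` keeps a
stale-owner rejection distinguishable from the other refusals in a log line.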

Generic full-nohz for arbitrary threads (the kernel half of "auto"):

- [x] `scheduler-generic-full-nohz-arbitrary-threads` -- behavior, normal,
      done 2026-06-06. Ordinary budgeted compute threads can now enter
      full-nohz through an explicit `SchedulingContext`-targeted
      `CpuIsolationLease` when the single-runnable, budget-deadline,
      housekeeping, network-poll, IRQ-affinity, timer, lifetime, and rollback
      gates all pass. Missing thread budget, multiple runnable work, revoked or
      expired leases, unrouted dependencies, and no-housekeeping cases still
      fail closed. Issuance is still policy-service future work; this is only
      the kernel admission half.
      depends_on:
      `scheduler-cpu-isolation-priority-aware-lease-termination`,
      `scheduler-nohz-network-poll-housekeeping-routing`,
      `scheduler-nohz-irq-affinity-housekeeping-routing`.
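
"All gates pass, otherwise fail closed" can be modeled compactly with one bit
per gate. The bitmask encoding and both function names are hypothetical; the
gate names are the eight listed in the item above, in the same order.

```rust
// Hypothetical sketch: one bit per full-nohz admission gate.

pub const GATES: [&str; 8] = [
    "single-runnable",
    "budget-deadline",
    "housekeeping",
    "network-poll",
    "irq-affinity",
    "timer",
    "lifetime",
    "rollback",
];

/// Bit `i` of `passed` set means `GATES[i]` passed.
/// Returns the first gate that did not pass, if any.
pub fn first_failed_gate(passed: u8) -> Option<&'static str> {
    (0..GATES.len())
        .find(|&i| (passed & (1u8 << i)) == 0)
        .map(|i| GATES[i])
}

/// Fail closed: admission requires every gate bit to be set.
pub fn admit_full_nohz(passed: u8) -> bool {
    first_failed_gate(passed).is_none()
}
```

A thread with no budget would clear the `budget-deadline` bit, and a system
with no housekeeping CPU would clear `housekeeping`; either way admission is
refused and the CPU keeps its periodic tick.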

Step 17 -- user-space AutoNoHz policy service (capstone):

- [x] `scheduler-autonohz-policy-service-saturation-local-proof` -- behavior,
      normal, done 2026-06-07. A userspace AutoNoHz policy-service smoke now
      holds an operator-declared `CpuIsolationPoolGrant`, consumes
      `SchedulingPolicyCap.snapshot @2` runtime / runnable / voluntary-block /
      preemption counters, denies a voluntarily blocking worker, issues a
      bounded full-nohz lease only after a local saturation window, renews only
      after re-observing saturation, and proves stopped-renewal expiry leaves
      fallback periodic scheduling intact. The proof records the grant-stamped
      account/pool and the single allowed CPU mask that the kernel admitted.
      depends_on: `scheduler-cpu-isolation-runtime-grant-minting`,
      `scheduler-cpu-isolation-lease-renewal-on-reobservation`,
      `scheduler-cpu-isolation-priority-aware-lease-termination`.
- [ ] `scheduler-autonohz-production-policy-daemon` -- behavior, normal,
      blocked. Replace the local smoke's fixed single-process proof with a
      privileged reusable policy daemon: profile-driven smoothing/window
      selection, cross-process target discovery, operator policy plumbing,
      structured observability, and revocation/non-renewal decisions for
      multiple accounts and pools. The landed local proof keeps this future work
      replaceable without ABI churn.
      depends_on: `scheduler-autonohz-policy-service-saturation-local-proof`.
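
The local proof's decision loop (deny a voluntarily blocking worker, issue
only after a saturation window, renew only on re-observed saturation, let the
lease expire otherwise) can be sketched as a small state machine. All names,
the percentage threshold, and the consecutive-window rule are illustrative
assumptions, not the smoke's actual heuristics.

```rust
// Hypothetical sketch of a saturation-window policy loop.

/// Counter deltas over one observation window.
#[derive(Clone, Copy)]
pub struct Sample {
    pub runtime_ns: u64,
    pub window_ns: u64,
    pub voluntary_blocks: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Deny,
    Hold,
    Issue,
    Renew,
    LetExpire,
}

pub struct Policy {
    saturation_pct: u64,
    required_windows: u32,
    streak: u32,
    leased: bool,
}

impl Policy {
    pub fn new(saturation_pct: u64, required_windows: u32) -> Self {
        Policy { saturation_pct, required_windows, streak: 0, leased: false }
    }

    pub fn observe(&mut self, s: Sample) -> Decision {
        // Saturated: no voluntary blocking and runtime fills the window.
        let busy = s.runtime_ns.saturating_mul(100)
            >= s.window_ns.saturating_mul(self.saturation_pct);
        let saturated = s.voluntary_blocks == 0 && busy;
        if !saturated {
            self.streak = 0;
            if self.leased {
                // Stop renewing; the lease lifetime expiry does the rest.
                self.leased = false;
                return Decision::LetExpire;
            }
            return if s.voluntary_blocks > 0 { Decision::Deny } else { Decision::Hold };
        }
        self.streak += 1;
        if self.leased {
            return Decision::Renew;
        }
        if self.streak >= self.required_windows {
            self.leased = true;
            Decision::Issue
        } else {
            Decision::Hold
        }
    }
}
```

The sketch never revokes: non-renewal is its only exit, so a stopped policy
loop degrades to ordinary periodic scheduling once the lease lifetime runs
out.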

Independent hardening (makes auto-nohz budget-safe):

- [x] `scheduler-deadline-driven-budget-accounting` -- behavior, normal,
      done 2026-06-04. Charge `SchedulingContext` budget at
      monotonic-deadline granularity rather than per-periodic-tick so an
      auto-nohz thread cannot overshoot its budget by a full tick quantum while
      the tick is masked. Closes the "enforcement remains periodic-tick
      granularity" caveat that auto-nohz made load-bearing; the task ledger is
      `docs/tasks/done/2026-06-04/scheduler-deadline-driven-budget-accounting.md`.
      depends_on: Phase E budget enforcement (landed),
      `scheduler-lapic-oneshot-subtick-firing-precision` (done),
      `scheduler-monotonic-clocksource-subtick-discipline` (done).
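
The difference between per-tick and deadline-granularity charging is easiest
to see with numbers. The sketch below is a hedged host-side model: `Budget`,
`next_deadline_ns`, and `per_tick_overshoot_ns` are hypothetical names, and
the overshoot helper assumes the budget window starts on a tick boundary.

```rust
// Hypothetical sketch of monotonic-deadline budget charging.

pub struct Budget {
    pub remaining_ns: u64,
    last_charge_ns: u64,
}

impl Budget {
    pub fn new(budget_ns: u64, now_ns: u64) -> Self {
        Budget { remaining_ns: budget_ns, last_charge_ns: now_ns }
    }

    /// Charge elapsed monotonic time; returns true once exhausted.
    pub fn charge(&mut self, now_ns: u64) -> bool {
        let elapsed = now_ns.saturating_sub(self.last_charge_ns);
        self.last_charge_ns = now_ns;
        self.remaining_ns = self.remaining_ns.saturating_sub(elapsed);
        self.remaining_ns == 0
    }

    /// One-shot deadline to arm while the periodic tick is masked.
    pub fn next_deadline_ns(&self) -> u64 {
        self.last_charge_ns.saturating_add(self.remaining_ns)
    }
}

/// Overshoot when exhaustion is only noticed on a tick boundary.
/// ASSUMPTION: the budget window starts exactly on a tick.
pub fn per_tick_overshoot_ns(budget_ns: u64, tick_ns: u64) -> u64 {
    let ticks = (budget_ns + tick_ns - 1) / tick_ns; // round up
    ticks * tick_ns - budget_ns
}
```

With a 2.5 ms budget and an assumed 1 ms tick, per-tick enforcement notices
exhaustion at 3 ms (0.5 ms late), while the deadline model arms a one-shot
for exactly 2.5 ms and needs no periodic tick to do so.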

## Cleanup: Retire Benchmark-Driven Scaffolding Before Phase E

This section captures simplification work identified during the post-thread-scale
SMP/threading architecture review on 2026-05-01 23:20 EEST. None of these items
are regressions: the affected code is correct, gated behind the `measure`
feature where it should be, and was added intentionally during attribution and
placement slices that closed the **In-Process Threading Scalability** milestone.
They are recorded here so the next selected scheduler milestone does not extend
or formalize speculative SMP scaffolding that the current per-CPU WFQ scheduler
does not need.

The cleanup is **subordinate to the current selected milestone** and to
already-open review-finding task records. Pick it up as Phase E preflight work
before `SchedulingContext` claims the scheduler surface. Each removal must
preserve the documented runnable-ownership invariants from
`docs/architecture/scheduling.md` (single dispatch owner per live `ThreadRef`
across per-CPU `current`/`handoff_current` slots, the per-CPU WFQ run queues,
and the direct IPC target; scheduler-lock-contained migration; allocation-free
timer/unblock/direct-IPC-fallback/requeue/steal-requeue paths) and the recorded
benchmark-only counter policy. The 2026-05-02 per-CPU run-queue collapse and
the accepted 2026-05-10 Phase D WFQ reintroduction are now both historical
evidence: the single-global-queue shape had accepted 1-to-2 evidence but a
1-to-4 diagnostic gap (capOS `1.566x`/`1.538x` vs Linux `3.963x`/`3.858x`),
and Phase D manually accepted the 2026-05-10 per-CPU WFQ 1-to-4 diagnostic
(capOS `3.088x`/`2.700x`; matching Linux `3.974x`/`3.850x` on the same pin
set) after the harness-enforced 1-to-2 gates stayed green.

Grounding read before any slice:

- `docs/architecture/scheduling.md`
- `docs/proposals/scheduler-evolution-proposal.md`
- `docs/proposals/smp-proposal.md`
- `docs/backlog/smp-phase-c.md`
- `kernel/src/sched.rs`
- `kernel/src/process.rs`
- `kernel/src/measure.rs`
- `kernel/src/arch/x86_64/{smp.rs,lapic.rs,percpu.rs,tlb.rs}`

Acceptance rule for every slice below: each removal must land with a host or
QEMU test that fails without it, so a future reintroduction is explicit
authority work rather than silent regression of an undocumented feature.

- [x] 2026-05-02 08:07 UTC: Retired the timer continuation fast path,
      its per-CPU skip budget, and the slow-path-required mirror flags.
      Deleted `try_continue_current_on_timer_tick`, `mark_timer_slow_path_required`,
      `reset_current_cpu_timer_fast_path_skip_count`,
      `note_timer_slow_path_completed_locked` (both feature variants),
      `scheduler_has_hard_timer_slow_path_work_locked_excluding_endpoint_queue`,
      `scheduler_timer_slow_path_reasons_locked`, the
      `TimerBlockedWaiterKind` / `blocked_thread_*` helpers, and the four
      atomic mirrors `TIMER_SLOW_PATH_REQUIRED`,
      `TIMER_FAST_PATH_SKIP_COUNTS`, `CURRENT_NON_IDLE_CPUS`, and
      `TIMER_FAST_PATH_MAX_CONSECUTIVE_SKIPS`.
      `set_current_thread_locked` no longer publishes
      `CURRENT_NON_IDLE_CPUS`. The timer interrupt entry in
      `kernel/src/arch/x86_64/context.rs` now always calls
      `crate::sched::schedule(context)` instead of trying the lock-free
      fast path. The `mark_timer_slow_path_required()` call sites in
      `kernel/src/sched.rs` (run-queue publish, pending process drop,
      park-with-deadline, process termination queue, direct-IPC handoff,
      timer sleep enqueue, cap-enter-with-deadline, pending thread stack
      release, pending endpoint cancellation push) were also dropped -- they
      are no-ops once the fast path no longer exists. Verified that
      `make run-spawn` exits cleanly (`[init] Spawn cap-table exhaustion
      check ok.`, `proc: process 2 exited with code 0`,
      `sched: last process exited, halting`) and `make run-smoke` runs
      the scripted login flow to operator session. `cargo build --features
      qemu` is warning-free (project rule). Reintroduce the fast path only
      if a future Phase D or Phase F slice ships an evidence pair where it
      measurably reduces scheduler-lock hold time on a contended SMP run.

      Follow-up partial 2026-05-02 08:39 UTC: `kernel/src/measure.rs`
      lost the eight public API entry points (`timer_fast_path_attempt`,
      `timer_fast_path_continue`,
      `timer_fast_path_slow_required_fallback`,
      `timer_fast_path_skip_budget_fallback`,
      `timer_fast_path_pending_reschedule_fallback`,
      `timer_fast_path_no_current_non_idle_fallback`,
      `timer_fast_path_inactive_invalid_cpu_fallback`, and
      `timer_slow_summary`) plus the now-orphaned `TimerSlowSummaryReasons`
      struct and its `requires_slow_path` impl. `cargo build --features
      qemu,measure` is back to warning-free.

      Follow-up complete 2026-05-02 21:00 UTC: the deeper deletion slice
      removed the seven `TIMER_FAST_PATH_*` static counters, the
      `TimerCounter::FastPath*` enum variants, the
      `TimerSlowSummaryCounter` enum, the `TIMER_SLOW_SUMMARY_*` counter
      arrays (`TIMER_SLOW_SUMMARY_COUNTER_VALUES`,
      `CASE_START_TIMER_SLOW_SUMMARY_COUNTERS`,
      `PREVIOUS_TIMER_SLOW_SUMMARY_COUNTERS`,
      `PHASE_TIMER_SLOW_SUMMARY_COUNTERS`), the
      `(TimerSlowSummaryCounter, &str)` reporting table, the
      `Snapshot.timer_slow_summary_counters` field, and the matching
      reset/diff/print helpers and accessors. `TIMER_COUNTER_COUNT`
      shrank from 11 to 4 (interrupts, user_scheduler, kernel_only,
      bsp_tick_advances). The `measure: timer ...` line is now compact
      and the `measure: timer_slow_summary ...` line is no longer
      emitted at all. `tools/qemu-thread-scale-harness.sh` dropped the
      `fast_path_*` clauses and the `timer_slow_summary` aggregate /
      per-phase grep checks in the same slice, satisfying the
      "removal must land with a host or QEMU test that fails without it"
      acceptance rule. Verified with `make fmt-check`,
      `cargo build --features qemu` (warning-free),
      `cargo build --features qemu,measure` (warning-free),
      `cargo test-lib` (171 passed), `make run-spawn`, and `make
      run-measure` (proof line emitted, exit 0). A local one-iteration
      `CAPOS_THREAD_SCALE_RUNS=1 CAPOS_THREAD_SCALE_GUEST_MEASURE=1 make
      run-thread-scale` was used solely as functional verification of
      the harness parser against the new measure-output shape (no CPU
      pinning, single iteration; the run reported `qemu taskset cpus:
      none` and the resulting medians/speedups are diagnostic only).
      This slice is a measure-output cleanup, not a scheduler-structure
      change, so it does not require controlled benchmark-VM timing
      evidence under the Phase A "before/after each scheduler structure
      change" rule; the harness fail-without-the-kernel-change pairing
      is the acceptance gate.
- [x] 2026-05-01 22:01 UTC: Collapsed the asymmetric scheduler CPU sizing.
      `MAX_SCHEDULER_CPUS = 64` was deleted, `MAX_SCHEDULER_CLEANUP_CPUS = 4`
      was renamed to a single `SCHEDULER_CPUS = 4`, and
      `SchedulerDispatch.current[]` resized from 64 to `SCHEDULER_CPUS` to
      match `run_queues`, `handoff_current`, `idle_pids`, `idle_threads`,
      `pending_thread_stack_release`, `TIMER_FAST_PATH_SKIP_COUNTS`, and
      `SCHEDULER_CPU_MASK`. The dual `current_cpu_slot()` /
      `current_cleanup_slot()` helpers collapsed into a single
      `current_cpu_slot()` that bounds-checks against `SCHEDULER_CPUS` and
      panics on overflow with `"scheduler: CPU id {} exceeds scheduler-owned
      mask"`. `scheduler_cpu_slot(cpu_id) -> Option<usize>` is retained for the
      non-panicking lookup. The earlier "raw CPU id 0..63 vs scheduler slot
      0..3" indexing distinction is gone. Reintroduce a wider id-to-slot
      mapping only when a Phase D/F slice grows the scheduler-owned mask
      beyond the current four. Verified with `cargo build --features qemu`
      and `cargo build --features qemu,measure` (both warning-free) plus
      `make run-smoke` and `make run-spawn` on 2026-05-01.
- [x] 2026-05-02 09:26 UTC: Replaced the per-CPU run-queue array with a
      single global `run_queue: VecDeque<ThreadRef>`. `SchedulerDispatch`
      keeps `run_queue_live_reservations` as a single counter; the
      `reserve_run_queue_capacity_for_thread_locked` /
      `release_run_queue_capacity_reservations_locked` /
      `push_reserved_run_queue_locked` triple still bounds growth but
      operates on the single queue. `enqueue_ready_thread_on_cpu_locked`,
      `run_queue_target_cpu_locked`, the `created_thread_target_cpu_locked`
      placement chain (`active_ready_scheduler_cpu_mask`,
      `non_idle_dispatch_load_locked`, `least_loaded_scheduler_cpu_*`,
      `caller_current_scheduler_cpu_slot_locked`), the
      `CreatedThreadPublishPolicy` / `CreatedThreadTarget` types, the
      `scheduler_cpu_scan_order` helper, and the
      `crate::measure::thread_placement_publish_caller_*` reporting
      surface are all gone. `WakePolicy::QueueCpu(usize)` collapsed to
      `WakePolicy::QueueAny`. `wake_idle_scheduler_cpus_locked` walks
      eligible idle scheduler CPUs and stops only after the first one
      that accepts a fresh reschedule IPI; CPUs that already have a
      pending IPI (or that fail LAPIC delivery) are skipped without
      breaking, so a burst of ready work cross-wakes more than one
      neighbor for both queue and direct-target wakes.
      `publish_created_thread` no longer takes a `caller_thread`
      argument and no longer emits a per-CPU placement record: under the
      single global queue there is no per-CPU publish target, and
      hard-coding CPU0 misclassified normal worker publishes as
      single-owner-CPU0. Phase D later reintroduced the per-CPU split
      without restoring those publish counters; reintroduce them only
      through a separate operator-observability slice.

      Verified with `cargo build --features qemu` and `cargo build
      --features qemu,measure` (both warning-free) plus `make run-spawn`
      and `make run-smoke`. A post-collapse 3-run diagnostic
      `make run-thread-scale` on the benchmark VM (`taskset 0,1,2,3`,
      enforcement disabled) on 2026-05-02 10:42 UTC measured
      1-to-2 work/total `1.890x`/`1.792x` (slight improvement over the
      pre-collapse 1-to-2) and 1-to-4 work/total `1.504x`/`1.436x`
      (clear regression vs the pre-collapse 1-to-4): single-queue
      scheduler-lock contention dominates at 4 workers. The numbers
      live in `docs/benchmarks.md` as diagnostic. Phase D later
      brought per-CPU queues back with a fair-share enqueue policy and
      formal accepted evidence (capOS plus Linux baseline, full
      enforcement, multiple runs, recorded host caveats).
- [x] 2026-05-02 07:00 UTC: Lifted endpoint-cancellation retry storage out
      of the scheduler lock. The `pending_endpoint_cancellations: VecDeque`
      field is gone from `Scheduler`; it now lives in a dedicated
      `static PENDING_ENDPOINT_CANCELLATIONS: Lazy<Mutex<VecDeque<...>>>`
      with bounded `try_reserve_exact(MAX_PENDING_ENDPOINT_CANCELLATIONS)`
      reservation, eagerly forced in `init_idle` via `Lazy::force` so the
      allocation never lands in a timer/exit cleanup path. The queue's
      `len()` under its own mutex is the single source of truth for
      `pending_endpoint_cancellations` non-emptiness. Producers
      (`queue_pending_endpoint_cancellation`,
      `remove_pending_endpoint_cancellations_for_pid`,
      `remove_pending_endpoint_cancellations_for_thread`) and the drain
      (`drain_pending_endpoint_cancellations`) take only the queue mutex;
      the scheduler lock is acquired only briefly inside
      `queue_pending_endpoint_cancellation` to validate the target thread
      is live and has a ring scratch. `defer_endpoint_cancellation`
      previously re-acquired the scheduler lock just to push to the fallback
      queue; that re-acquisition is gone.
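
      The storage shape described above can be sketched in hosted Rust as
      follows. This is a minimal illustration, not the kernel code: it uses
      `std::sync::LazyLock` and `std::sync::Mutex` as stand-ins for the
      kernel's `Lazy` and `Mutex`, and `MAX_PENDING`, `Pending`,
      `init_pending_storage`, `queue_pending`, and `drain_pending` are
      hypothetical names with a made-up bound and payload type.

      ```rust
      use std::collections::VecDeque;
      use std::sync::{LazyLock, Mutex};

      // Hypothetical bound and payload; the kernel derives its real bound from
      // its own constants and stores a (thread, user_data) style record.
      const MAX_PENDING: usize = 64;
      type Pending = (u64, u64);

      static PENDING: LazyLock<Mutex<VecDeque<Pending>>> = LazyLock::new(|| {
          let mut queue = VecDeque::new();
          // Reserve the whole bound once so bounded pushes never allocate later.
          queue
              .try_reserve_exact(MAX_PENDING)
              .expect("pending-cancellation reservation failed");
          Mutex::new(queue)
      });

      /// Run once during early init so the allocation cannot land in a timer
      /// or exit cleanup path.
      fn init_pending_storage() {
          LazyLock::force(&PENDING);
      }

      /// Push under the queue's own mutex; refuse entries beyond the bound.
      fn queue_pending(entry: Pending) -> bool {
          let mut queue = PENDING.lock().unwrap();
          if queue.len() >= MAX_PENDING {
              return false;
          }
          queue.push_back(entry);
          true
      }

      /// Drain one entry at a time, holding the queue mutex only for the pop.
      fn drain_pending(mut deliver: impl FnMut(Pending)) -> usize {
          let mut delivered = 0;
          loop {
              let next = PENDING.lock().unwrap().pop_front();
              match next {
                  Some(entry) => {
                      deliver(entry);
                      delivered += 1;
                  }
                  None => return delivered,
              }
          }
      }
      ```

      Forcing the lazy value during init is what keeps the first `lock()` in
      an interrupt-adjacent path from paying for the reservation.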

      `note_timer_slow_path_completed_locked` (consumer) holds the queue
      mutex across both the `!is_empty()` check and the
      `TIMER_SLOW_PATH_REQUIRED.store`, and the producer
      `queue_pending_endpoint_cancellation` stores
      `TIMER_SLOW_PATH_REQUIRED = true` inside the queue lock alongside
      its push, so a concurrent producer cannot push between the
      consumer's read and store and have its slow-path mark be overwritten.
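
      The flag protocol in the paragraph above can be modelled with a plain
      mutex and an atomic. This is a sketch under assumptions: `QUEUE`,
      `SLOW_PATH_REQUIRED`, `produce`, `note_slow_path_completed`, and
      `drain_one` are hypothetical stand-ins, the payload is reduced to a
      `u64`, and the memory orderings are illustrative rather than taken from
      the kernel.

      ```rust
      use std::collections::VecDeque;
      use std::sync::atomic::{AtomicBool, Ordering};
      use std::sync::Mutex;

      static QUEUE: Mutex<VecDeque<u64>> = Mutex::new(VecDeque::new());
      static SLOW_PATH_REQUIRED: AtomicBool = AtomicBool::new(false);

      /// Producer: push and raise the flag inside one critical section.
      fn produce(entry: u64) {
          let mut queue = QUEUE.lock().unwrap();
          queue.push_back(entry);
          SLOW_PATH_REQUIRED.store(true, Ordering::Release);
      }

      /// Consumer: recompute the flag from the queue state while still holding
      /// the queue mutex, so no producer can push between the check and the
      /// store and then have its `true` overwritten with `false`.
      fn note_slow_path_completed() {
          let queue = QUEUE.lock().unwrap();
          SLOW_PATH_REQUIRED.store(!queue.is_empty(), Ordering::Release);
      }

      /// Remove one entry under the queue mutex.
      fn drain_one() -> Option<u64> {
          QUEUE.lock().unwrap().pop_front()
      }
      ```

      If the consumer released the mutex before its store, a producer could
      push and set the flag in the gap, and the consumer's stale `false`
      would hide the new entry until some later event raised the flag again.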

      The functional contract is preserved: a cancellation that cannot
      deliver immediately because the target ring scratch is contended
      still falls back to the bounded retry queue, still raises
      `TIMER_SLOW_PATH_REQUIRED`, and is still drained on the next
      scheduler tick. Bound is unchanged
      (`MAX_PENDING_ENDPOINT_CANCELLATIONS = MAX_CAP_SLOTS *
      MAX_ENDPOINT_CANCELLATION_OBJECT_SWEEPS *
      MAX_ENDPOINT_CANCEL_NOTIFICATIONS_PER_ENDPOINT * SCHEDULER_CPUS`);
      the open size-tightening question (whether the `SCHEDULER_CPUS`
      multiplier is still load-bearing now that producers no longer hold
      the scheduler lock) is deferred to a future slice with bench evidence.

      A possible follow-on slice would move retry storage to per-endpoint
      bounded slots so each endpoint object owns its own queue, but that
      requires reshaping the `(thread, user_data)` payload to be addressable
      from an endpoint object and is non-trivial. The current move is
      sufficient to get the storage out of the scheduler lock and unblock
      future scheduler-lock-hold-time analysis.

      Verified with `cargo build --features qemu` and
      `cargo build --features qemu,measure` (both warning-free) plus
      `make run-spawn` and `make run-smoke` on 2026-05-02. Review found and
      fixed a `Lazy` initialization reachable from interrupt paths and a
      slow-path-clearing race against producer publication.
- [x] 2026-05-01 21:38 UTC: Feature-gated the first
      `ThreadCpuAccounting` experiment end-to-end behind
      `cfg(feature = "measure")`. That slice temporarily compiled the whole
      accounting record, its accessors, and scheduler call sites only when the
      feature was enabled. Phase D later superseded this temporary shape:
      `runtime_ns`, `virtual_runtime_ns`, and `last_started_ns` are now
      unconditional normal-build fields because WFQ ordering,
      `SchedulingPolicyCap.snapshot`, and `SchedulingContext` budget charging
      depend on them. The remaining diagnostic counters
      (`context_switches`, `preemptions`, `voluntary_blocks`, `migrations`,
      `last_cpu`, blocked/exited stability observations, placement buckets,
      and per-phase attribution counters) stay behind
      `cfg(feature = "measure")`. The 2026-05-01 slice was verified with
      `cargo build --features qemu` and `cargo build --features qemu,measure`
      (both warning-free) plus `make run-spawn` (non-measure default) on
      2026-05-01. `make run-measure` was broken on `main` at the time of
      this slice for unrelated reasons; that regression was repaired on
      `2026-05-02 20:23 UTC` (see `docs/backlog/scheduler-evolution.md` and
      the `docs/changelog.md` Measure Mode Repair entry).
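
      The resulting split between unconditional and measure-only fields
      might look like the sketch below. Field names come from this entry, but
      the `u64` types and the `on_dispatch`/`on_switch_out` helpers are
      assumptions made for illustration; the real record has more fields and
      its own update sites.

      ```rust
      /// Sketch of a record with always-present scheduling fields and
      /// diagnostic counters that exist only under the `measure` feature.
      #[derive(Debug, Default)]
      pub struct ThreadCpuAccounting {
          // Needed by normal builds (ordering, snapshots, budget charging).
          pub runtime_ns: u64,
          pub virtual_runtime_ns: u64,
          pub last_started_ns: u64,
          // Diagnostic-only counters.
          #[cfg(feature = "measure")]
          pub context_switches: u64,
          #[cfg(feature = "measure")]
          pub preemptions: u64,
      }

      impl ThreadCpuAccounting {
          /// Hypothetical helper: record when the thread starts running.
          pub fn on_dispatch(&mut self, now_ns: u64) {
              self.last_started_ns = now_ns;
              #[cfg(feature = "measure")]
              {
                  self.context_switches += 1;
              }
          }

          /// Hypothetical helper: charge elapsed time when the thread stops.
          pub fn on_switch_out(&mut self, now_ns: u64, preempted: bool) {
              self.runtime_ns += now_ns.saturating_sub(self.last_started_ns);
              #[cfg(feature = "measure")]
              {
                  if preempted {
                      self.preemptions += 1;
                  }
              }
              // Keep the parameter used when the counters are compiled out.
              #[cfg(not(feature = "measure"))]
              let _ = preempted;
          }
      }
      ```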
- [x] 2026-05-01 21:02 UTC: Retired the
      `RUNNABLE_PROCESS_EXIT_CLEANUP_PROOF_PRINTED`,
      `RUNNABLE_THREAD_EXIT_CLEANUP_PROOF_PRINTED`, and
      `CPU_ACCOUNTING_PROOF_PRINTED` once-flag log lines along with their
      `Atomic*` gating booleans, the three `print_*_once` /
      `maybe_print_*_for_thread_locked` helpers in `kernel/src/sched.rs`,
      and their four call sites. The runnable-cleanup invariants remain
      enforced by the unconditional `assert_no_runnable_pid_entry_locked`
      and `assert_no_runnable_thread_entry_locked` panics already in
      `kernel/src/sched.rs`; a regression that leaves stale runnable owner
      state still panics the kernel and fails `make run-spawn`. The
      `tools/qemu-spawn-smoke.sh` harness lost its three matching
      `grep -Fq` lines for the same reason. The orphaned
      `Process::account_thread_exited_stable_observed` /
      `ThreadCpuAccounting::observe_exited_stable` helpers were deleted
      with the print; the remaining `ThreadCpuAccounting` writes stay
      untouched for the upcoming feature-gate slice. The
      `pub fn thread_cpu_accounting` accessor moved behind
      `cfg(feature = "measure")` because its only remaining caller is the
      measure-gated `account_thread_selected_locked` placement counter
      bridge.
- [ ] Cache the active CPU id in the per-CPU GS-relative slot.
      `arch::percpu::current_cpu_id` reads the LAPIC ID MMIO register
      and then linearly scans `CPU_LAPIC_IDS[0..64]` on every call.
      The timer fast-path consumer was retired on 2026-05-02 (see the
      "Retired the timer continuation fast path" entry above), but the
      function still runs from the syscall path and from non-syscall
      kernel contexts: `arch::context::advance_bsp_tick`, the
      scheduler's CPU-slot accounting and dispatch lookups in
      `sched.rs`, `arch::tlb::flush_pending_for_current_cpu`, and
      `mem::paging` invalidation paths. The hot caller is the syscall
      entry path; the non-syscall callers are why a drop-in GS-relative
      replacement is harder than the cleanup item first suggested.
      Conceptually the lookup is a single `mov %gs:offset, %eax`, but the
      slice is blocked on a kernel-mode GS-base invariant: today the kernel
      sets `KernelGsBase` via `set_kernel_gs_base` and only the syscall
      assembly does `swapgs` to make `gs:0..16` resolve at PerCpu while
      handling a syscall. In normal kernel context (timer ISR,
      scheduler from non-syscall paths, paging init, AP bring-up), the
      active GS base is whatever Limine left, not the PerCpu address.
      A drop-in replacement of `current_cpu_id` with `gs:[offset]`
      therefore faults outside syscall context (verified 2026-05-02:
      reordering `init_bsp` to set `KernelGsBase` before
      `set_kernel_entry_stack` is necessary but not sufficient because
      the active GS base is still not the PerCpu address). The
      enabling work is either establishing a kernel-mode invariant that
      GS_BASE = PerCpu in CPL0 (typically by `swapgs`-ing on every
      kernel entry/exit, including interrupt handlers) or adopting a
      hybrid: a GS-relative read in the syscall path plus the existing
      LAPIC-based path everywhere else. Both options are larger than a
      single retirement slice and should land with their own gates.
      Until then this item stays open and `current_cpu_id` keeps the
      LAPIC MMIO + `CPU_LAPIC_IDS` scan.
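
      The hybrid option can be modelled in hosted Rust to show the intended
      dispatch, without any real GS access. Everything here is a stand-in:
      `PerCpu`, `KernelContext`, `cpu_id_by_scan`, and the `current_cpu_id`
      signature are hypothetical, the cached field replaces the GS-relative
      load, and the `lapic_id` argument replaces the LAPIC MMIO read.

      ```rust
      const MAX_CPUS: usize = 64;

      /// Hypothetical stand-in for the per-CPU block that GS would point at.
      struct PerCpu {
          cpu_id: u32,
      }

      /// Where the caller entered the kernel from (hypothetical classification).
      #[derive(Clone, Copy)]
      enum KernelContext {
          /// Syscall entry has executed `swapgs`, so the cached slot is reachable.
          Syscall,
          /// Timer ISR, paging init, AP bring-up, and other non-syscall paths.
          Other,
      }

      /// Existing shape: map a LAPIC ID to a CPU index by linear scan.
      fn cpu_id_by_scan(lapic_ids: &[u32; MAX_CPUS], lapic_id: u32) -> Option<u32> {
          lapic_ids
              .iter()
              .position(|&id| id == lapic_id)
              .map(|index| index as u32)
      }

      /// Hybrid lookup: cached read in syscall context, scan everywhere else.
      fn current_cpu_id(
          ctx: KernelContext,
          percpu: &PerCpu,
          lapic_ids: &[u32; MAX_CPUS],
          lapic_id: u32,
      ) -> Option<u32> {
          match ctx {
              KernelContext::Syscall => Some(percpu.cpu_id),
              KernelContext::Other => cpu_id_by_scan(lapic_ids, lapic_id),
          }
      }
      ```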
- [ ] Reassess the scheduler-lock-site instrumentation breadth.
      `SchedulerLockSite`, the `SchedulerLockGuard`/`measured_lock` wrappers,
      the dual `cfg(feature = "measure")` `scheduler_lock`/`scheduler_lock_site`
      paths, and the eight per-site counter axes in `kernel/src/measure.rs`
      were added when the global scheduler lock was the suspected scaling
      bottleneck. After the runqueue/dispatch split landed and the documented
      per-CPU ownership invariants stabilized, decide which sites still
      justify dedicated counters and which should fold back into the
      aggregate `scheduler_lock` line. Keep the `cfg(feature = "measure")`
      gating; reduce the surface so the scheduler still reads as one
      lock-acquisition path under non-measure builds.
- [ ] Reassess `single_cpu_owner_pids`, `direct_ipc_target`, and
      `handoff_current` before Phase E starts. The single-owner pinning policy,
      the one-slot direct-IPC handoff, and the per-CPU handoff guard each
      special-case a small subset of the dispatch flow; document or delete
      each one against the accepted Phase D fair-policy behavior before
      `SchedulingContext` work depends on it. Do not delete them
      speculatively: the cross-process IPC and process/thread exit cleanup
      proofs depend on the current direct-IPC and handoff invariants.
- [x] Keep an honest scaling proof when scheduler work resumes.
      Completed `2026-05-02 21:38 UTC` on the benchmark VM against `main`
      commit `374f8556`. Five-run controlled paired evidence, with the capOS
      and Linux collections both pinned to physical-core logical CPUs
      `0,1,2,3` on a 4-core/8-thread `n2-highcpu-8` host with KVM:

      | Comparison | capOS | Linux pthread | capOS gate | capOS verdict |
      | --- | ---: | ---: | ---: | --- |
      | 1→2 work  | `1.883x` | `1.988x` | ≥ `1.6x` | accepted |
      | 1→2 total | `1.787x` | `1.987x` | ≥ `1.6x` | accepted |
      | 1→4 work  | `1.566x` | `3.963x` | ≥ `1.6x` | diagnostic |
      | 1→4 total | `1.538x` | `3.858x` | ≥ `1.6x` | diagnostic |

      Linux scales near-linearly on the same physical CPU set (1-to-2
      `1.99x`, 1-to-4 `3.96x`), so the workload shape is sound and the
      capOS 1-to-4 gap is a scheduler bottleneck, not a benchmark
      artifact. The 1-to-2 result was the formal accepted gate against
      the single-global-queue scheduler. The 1-to-4 result became the
      bottleneck-attribution diagnostic that justified Phase D's fair-share
      enqueue policy; Phase D later manually accepted the `2026-05-10` WFQ
      1-to-4 diagnostic pair recorded above while the harness-enforced gates
      remained the 1-to-2 work/total speedups.
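
      The verdict column and the scaling gap can be reproduced with a few
      lines of arithmetic. This sketch assumes a below-gate speedup is
      recorded as diagnostic rather than failing, as in the table;
      `Verdict`, `speedup`, `verdict`, and `parallel_efficiency` are
      hypothetical helpers, not the harness code.

      ```rust
      #[derive(Debug, PartialEq)]
      enum Verdict {
          Accepted,
          Diagnostic,
      }

      /// Speedup of the N-worker case over the 1-worker case.
      fn speedup(one_worker_ns: u64, n_worker_ns: u64) -> f64 {
          one_worker_ns as f64 / n_worker_ns as f64
      }

      /// Compare a measured speedup against a minimum gate such as 1.6.
      fn verdict(measured: f64, gate: f64) -> Verdict {
          if measured >= gate {
              Verdict::Accepted
          } else {
              Verdict::Diagnostic
          }
      }

      /// Fraction of ideal linear scaling achieved with `workers` workers.
      fn parallel_efficiency(measured: f64, workers: u32) -> f64 {
          measured / workers as f64
      }
      ```

      With the table's 1→4 work numbers, `parallel_efficiency(1.566, 4)` is
      about 0.39 for capOS against about 0.99 for the Linux baseline
      (`3.963 / 4`), which is the gap the paragraph above attributes to the
      scheduler.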

      Benchmark shape: blocking parent join, 262,144 blocks (16 MiB),
      `work_rounds=64`, 5 runs per case (the capOS harness default is 3
      runs; this collection explicitly set `CAPOS_THREAD_SCALE_RUNS=5`
      for parity with the Linux baseline default). Host caveats:
      internal benchmark VM in a single GCP zone, status `RUNNING`
      during collection, machine `n2-highcpu-8` with nested
      virtualization enabled, `/dev/kvm` readable+writable without
      sudo, SSH operator account, kernel `Linux 6.17.0-1012-gcp
      x86_64`, CPU `Intel(R) Xeon(R) CPU @ 2.80GHz`, distinct
      physical-core layout (logical CPUs 0-3 are core IDs 0-3 thread
      0; logical CPUs 4-7 are the SMT siblings), `qemu-system-x86_64
      8.2.2`, `rustc 1.97.0-nightly (c935696dd 2026-04-29)`.

      Exact commands:

      ```sh
      # capOS
      PATH="$HOME/.cargo/bin:$PATH" \
        CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
        CAPOS_THREAD_SCALE_RUNS=5 \
        CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1 \
        CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1 \
        CAPOS_THREAD_SCALE_TIMESTAMP=20260502T213544Z \
        make run-thread-scale

      # Linux pthread baseline
      PATH="$HOME/.cargo/bin:$PATH" \
        LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
        LINUX_THREAD_SCALE_RUNS=5 \
        LINUX_THREAD_SCALE_TIMESTAMP=20260502T213445Z \
        make run-linux-thread-scale-baseline
      ```

      Raw artifacts are on the benchmark VM at
      `target/thread-scale/20260502T213544Z/` and
      `target/linux-thread-scale/20260502T213445Z/`. The instance was
      stopped after collection.