# SMP Phase C Backlog

> **ARCHIVED — milestones complete; residual full-SMP-hardware work tracked in
> [`scheduler-evolution.md`](scheduler-evolution.md) "Phase F.5: Full-SMP
> Hardware Scalability".** Both visible milestones this backlog tracks landed:
> Multi-Process SMP Concurrency (the `make run-smp-process-scale` proof is
> complete) and In-Process Threading Scalability (closed at commit `136b72de`,
> `2026-05-01 14:58 UTC`). No SMP track is active in `docs/tasks/README.md`. This file is
> retained as historical context and as the proof-contract reference; do not
> select new work from it -- the next visible SMP milestone is the planning slot
> in `scheduler-evolution.md` Phase F.5.

This file holds detailed context for the selected SMP Phase C AP
scheduler-owner proof and the remaining full-concurrent-SMP and in-process
thread-scaling follow-on work.

## Visible Goal

Move from a single scheduler owner to multiple CPUs that can run independent
scheduler-owned kernel/user work concurrently, and prove that capability-owned
processes can improve wall-clock performance on a deterministic CPU-bound
workload under QEMU/KVM.

This backlog tracks two distinct visible milestones:

1. **Multi-Process SMP Concurrency**: `make run-smp-process-scale` should boot
   a focused manifest, run a deterministic SMP scaling demo across independent
   worker processes, print verified workload output, and report comparable
   1/2/4-process timing. The proof is complete only when repeated KVM-backed
   `-smp 1` and `-smp 2` runs show near-linear speedup for the selected
   workload, while the ordinary manifest, ring, thread, park, and process-exit
   smokes still pass under `-smp 2`.
2. **In-Process Threading Scalability**: `make run-thread-scale` should run a
   deterministic workload across sibling threads inside one process, verify the
   result, and report comparable 1/2/4-thread timing. This milestone closed
   at commit `136b72de` (`2026-05-01 14:58 UTC`) against the pre-collapse
   per-CPU placement model: caller-aware child publication and the existing
   timer fast-path slices produced repeated KVM-backed physical-core evidence
   above the configured 1-to-2 work and total speedup thresholds. The 4-worker
   row remained diagnostic rather than a linear-scaling claim. The 2026-05-02
   per-CPU run-queue collapse retired that placement chain (caller-aware
   publication, per-CPU runnable queues, local-first stealing, the
   `WakePolicy::QueueCpu(usize)` variant). A post-collapse 3-run diagnostic
   on `capos-bench` at 2026-05-02 10:42 UTC measured 1-to-2 work/total
   `1.890x`/`1.792x` (slight improvement) and 1-to-4 work/total
   `1.504x`/`1.436x` (clear regression attributed to single-queue
   scheduler-lock contention). The formal capOS+Linux accepted-evidence
   pair was recorded on `capos-bench` at 2026-05-02 21:38 UTC against
   `main` commit `374f8556` with the same single-global-queue scheduler:
   capOS work `1.883x` / total `1.787x` clear the configured 1-to-2
   gates, while the 1-to-4 row (capOS `1.566x`/`1.538x` vs Linux
   `3.963x`/`3.858x`) is the diagnostic that gates Phase D's fair-share
   enqueue policy.
   Reintroducing per-CPU runnable queues with that policy must
   materially close the capOS-vs-Linux 1-to-4 gap before per-CPU
   queues land back in the scheduler. See
   `docs/architecture/scheduling.md`, `docs/benchmarks.md`, and
   `docs/backlog/scheduler-evolution.md` for the current state.
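
One way to read the 1-to-4 gap quoted above is as parallel efficiency, meaning
speedup divided by worker count. The backlog does not use this framing itself.
The sketch below is illustrative arithmetic over the quoted ratios, and the
names `RECORDED` and `efficiency` are hypothetical.

```python
# Work/total speedups quoted above for the 2026-05-02 21:38 UTC pair.
RECORDED = {
    ("capOS", 2): (1.883, 1.787),
    ("capOS", 4): (1.566, 1.538),
    ("Linux", 4): (3.963, 3.858),
}


def efficiency(speedup: float, workers: int) -> float:
    """Fraction of ideal linear scaling: speedup divided by worker count."""
    return speedup / workers


for (system, workers), (work, total) in RECORDED.items():
    print(
        f"{system} 1-to-{workers}: "
        f"work {efficiency(work, workers):.3f}, "
        f"total {efficiency(total, workers):.3f}"
    )
```

On these figures the capOS 1-to-4 work window reaches roughly 39% of linear
scaling, against roughly 99% for the Linux baseline.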

Full concurrent SMP scheduling remains the underlying kernel goal for the
multi-process milestone. It means more than one CPU can own scheduler work
simultaneously, including per-CPU runnable ownership, cross-CPU
idle-to-runnable handoff, reschedule IPIs, safe current-thread tracking, and
reviewed lock/residency rules. The multi-process scaling demo is the first
user-visible acceptance test for that kernel capability.

## Completed Gates

- [x] Ground the multi-CPU scheduling slice in the SMP proposal, scheduler and
      threading docs, and relevant `docs/research/` files.
- [x] Migrate syscall entry/exit to the GS-base/`swapgs` per-CPU path,
      including non-`sysretq` scheduler/exit paths.
- [x] Add LAPIC timer, EOI, and IPI support for per-CPU ticks and cross-CPU
      coordination. The active backend is PIT-calibrated xAPIC MMIO with
      PIT/PIC fallback; x2APIC remains a later backend.
- [x] Add TLB shootdown before any user address space can run on more than one
      CPU over its lifetime.
- [x] Extend scheduler state from BSP-only ownership to per-CPU current-thread
      tracking with AP idle/runnable handoff. The first AP scheduler proof uses
      one AP as scheduler owner while the BSP stays in kernel idle, preserving
      the process-wide ring invariant.
- [x] Add QEMU proof that AP cpu=1 executes scheduler-owned work and the
      existing manifest/ring/thread/park smokes still pass under `-smp 2`.

## In-Process Threading Closeout Rules

- [x] Resolve the scheduler hot-lock blocker before calling the selected
      milestone a scalability proof. The implementation at the time had
      per-CPU runnable queues and dispatch state, but they remained under
      one global `Scheduler` lock. A closeout branch should either split
      the hot dispatch path so ordinary timer preemption, local run-queue
      selection, and sibling CPU-bound thread requeue do not serialize on
      one global lock, or explicitly narrow the milestone to "functional
      in-process threading" and select a follow-on scheduler-lock
      scalability milestone. Completed 2026-05-01 14:58 UTC after
      repairing the benchmark shape against Linux baseline evidence and
      tightening caller-aware child publication: the repaired
      blocking-parent 16 MiB/64-round shape scales on Linux, and
      controlled physical-core capOS evidence passed the enforced 1-to-2
      work and total gates. Four-worker capOS scaling remained a separate
      follow-up because total time still showed scheduler/exit/join
      overhead. (Update 2026-05-02: the per-CPU runnable queues and the
      caller-aware child publication described here were later collapsed
      into a single global runnable queue with the per-CPU
      run-queue-collapse cleanup slice; the recorded 1-to-2 capOS gates
      were against that pre-collapse placement model. The current
      single-global-queue scheduler now has its own formal accepted
      1-to-2 pair on `capos-bench` 2026-05-02 21:38 UTC against `main`
      commit `374f8556` (capOS work `1.883x` / total `1.787x`; Linux
      baseline `1.988x`/`1.987x`); the 1-to-4 row remains the
      diagnostic gating Phase D's fair-share enqueue policy. Per-CPU
      queues and caller-aware placement return when that policy ships
      and materially closes the capOS-vs-Linux 1-to-4 gap. See
      `docs/architecture/scheduling.md`, `docs/benchmarks.md`, and
      `docs/backlog/scheduler-evolution.md` for current state.)
- [x] Add a bounded timer continuation fast path as a conservative split-prep
      slice. Completed 2026-05-01 10:29 UTC: a user-mode LAPIC timer tick may
      keep running the current non-idle thread without entering
      `sched::schedule()` only when a previous locked slow path has published a
      clean hard-work summary, the CPU has no pending reschedule IPI, and the
      per-CPU one-skip budget has not been exhausted. The 2026-05-01 11:40 UTC
      follow-up keeps every dirty producer forcing at least one locked timer
      pass, then allows remaining run queues and handoff-current markers alone
      to be treated as fairness/protection state for one continued tick. Direct
      IPC, deferred cleanup, Timer sleeps, and timed cap-enter/Park waiters
      still keep the hard slow-path bit set. The full scheduler path remains
      authoritative and still runs regularly for ring SQEs, cap-wait scans,
      cleanup, and accounting. This narrows timer-side scheduler-lock
      contention but does not by itself close the selected scalability
      milestone. Controlled `capos-bench` physical-core `0-3` before/after
      evidence for the initial strict-clean version stayed `accepted=false`:
      baseline
      `target/thread-scale/timer-fastpath-baseline-main-physical-20260501T102938/`
      reported work speedups `0.998x` and `0.998x`; after-change
      `target/thread-scale/timer-fastpath-after-physical-20260501T104700/`
      reported work speedups `1.001x` and `0.999x`. Controlled
      `capos-bench` physical-core `0-3` evidence for the fairness-only
      follow-up also stayed `accepted=false`: baseline
      `target/thread-scale/20260501T120224Z/` recorded work speedups `1.001x`
      and `0.999x` plus total speedups `0.913x` and `0.587x`; after-change
      `target/thread-scale/20260501T120709Z/` recorded work speedups `1.001x`
      and `1.000x` plus total speedups `1.125x` and `0.828x`.
- [x] Add timer-fast-path attribution counters for guest-measure thread-scale
      runs. Completed 2026-05-01 10:58 UTC: aggregate and per-phase `timer`
      lines now report fast-path attempts, continues, and fallback reasons for
      slow-required/dirty summaries, skip-budget exhaustion, pending
      reschedule IPIs, no-current/non-idle CPUs, and inactive/invalid
      scheduler CPUs. These counters answer whether the bounded continuation
      path fires inside benchmark phases. They are benchmark-only
      instrumentation and do not close the current `accepted=false` speedup
      gate. Local one-run evidence in
      `target/thread-scale/20260501T110157Z/` passed with the new fields
      present in every 1/2/4-thread `measure.log`; the timed work phase
      recorded `fast_path_continues=0` for all three rows.
- [x] Add timer slow-summary reason attribution for guest-measure thread-scale
      runs. Completed 2026-05-01 11:28 UTC: aggregate and per-phase
      `timer_slow_summary` lines now report required/clean counts and the
      predicate reasons that keep `TIMER_SLOW_PATH_REQUIRED` set after a locked
      timer slow path. Reason fields cover nonempty run queues, direct IPC
      targets, handoff-current state, pending process termination/drop/stack
      release, timer sleeps, and timed cap-enter versus park waiters. Local
      one-run evidence in `target/thread-scale/20260501T112359Z/` passed; the
      work phase showed `required=2/4/8`, `clean=0`,
      `run_queue_nonempty=2/4/8`, `handoff_current=2/4/8`, and zero timer
      sleeps/timed waiters for the 1/2/4-thread rows. The behavior follow-up
      keeps the output shape but changes `required` to mean hard timer work,
      not run queues or handoff markers alone. This attribution does not close
      the selected `accepted=false` speedup gate.
- [x] Add explicit thread-placement evidence and conservative new-child
      publication spreading. Completed 2026-05-01 12:37 UTC, refined
      2026-05-01 13:20 UTC, and repaired 2026-05-01 14:58 UTC after the
      blocking-parent benchmark exposed a placement regression. Guest-measure
      runs now emit aggregate and per-phase `thread_placement` lines for
      publish targets; caller-current publish buckets; caller-aware avoid,
      fallback, and strict-load fallback counts; selected CPUs; first-selected
      CPUs; and migration events across CPU slots 0-3. Newly created
      non-single-owner threads avoid the caller's current CPU only when another
      active ready scheduler CPU has a strictly lower non-idle dispatch load
      under the scheduler lock; on equal load, an active-ready caller CPU wins
      the tie instead of falling through to CPU0-biased least-loaded scanning.
      Single-owner processes stay pinned to CPU0. Timer, unblock, direct-IPC
      fallback, steal retry, and steal requeue paths keep their existing
      allocation-free targeting behavior. (Update 2026-05-02: the per-CPU
      run queues described here were later collapsed into a single global
      run queue, retiring the caller-aware placement and steal scans. See
      `docs/architecture/scheduling.md` and the per-CPU run-queue collapse
      entry in `docs/backlog/scheduler-evolution.md` for current state.
      Per-CPU queues return with the fair-share enqueue policy that Phase D
      will own.)

      The earlier avoid-caller rule passed the old spinning-parent 1-to-2 gate
      but was wrong for the repaired blocking-parent benchmark: a controlled
      run before the strict-load fix regressed to 1-to-2 work/total speedups
      `0.886x`/`0.928x` because the children were biased away from an otherwise
      available caller CPU. After the strict-load fix, controlled physical-core
      evidence passed the enforced 1-to-2 work/total gates with
      `1.828x`/`1.687x`. The same run recorded diagnostic 1-to-4 work/total
      speedups `3.029x`/`2.386x`; with scheduler switch diagnostics suppressed,
      those 1-to-4 diagnostics recorded `3.272x`/`2.303x`. Four-worker capOS
      scaling remains a follow-up, not a completed linear-scaling claim.
- [x] Preserve correctness gates while narrowing the lock: generation-checked
      `ThreadRef` ownership, no stale runnable queue entries after process or
      thread exit, direct-IPC preference without bypassing ownership checks,
      allocation-free timer/unblock runnable publication, and clean
      `run-smp2-smokes` evidence. Completed 2026-05-01 14:58 UTC: the
      caller-aware publication change preserves single-owner pinning and leaves
      timer/unblock/requeue/direct-IPC targeting unchanged; ordinary `-smp 2`
      regression coverage passed.
- [x] Rerun controlled physical-core evidence after any scheduler hot-lock
      change. The milestone should stay open until host-summary work and total
      gates pass, or until the milestone scope is intentionally changed and
      recorded in `docs/tasks/README.md`, `docs/roadmap.md`, and this backlog.
      Completed 2026-05-01 14:58 UTC after benchmark repair: the matching Linux
      baseline validated the repaired blocking-parent 16 MiB/64-round shape on
      the selected physical CPU set with 1-to-2 work/total speedups
      `1.991x`/`1.990x` and 1-to-4 work/total speedups `3.958x`/`3.834x`.
      Controlled capOS evidence passed the enforced 1-to-2 work/total gates
      with `1.828x`/`1.687x`.
- [x] Track post-closeout 4-worker scalability caveats separately from
      the recorded 1-to-2 milestone. The repaired benchmark proved the
      configured 1-to-2 work and total thresholds only against the
      pre-2026-05-02 per-CPU placement model. Linux scaled under the
      same repaired shape, so the remaining 4-worker capOS gap could not
      be blamed on the benchmark shape. The strongest evidence at that time was:
      unsuppressed capOS 1-to-4 work/total speedups `3.029x`/`2.386x`,
      scheduler-switch-log-suppressed diagnostics `3.272x`/`2.303x`, and
      guest-measure runs that showed global `Scheduler` lock wait/hold
      cycles plus exit/join/block/schedule overhead while shared kernel
      locks were not visibly contended. Treat those numbers as
      historical: they are superseded by the formal `capos-bench` 2026-05-02
      21:38 UTC pair against `main` commit `374f8556` (capOS work
      `1.883x` / total `1.787x` clears the configured 1-to-2 gates;
      1-to-4 capOS `1.566x`/`1.538x` vs Linux `3.963x`/`3.858x`
      remains the diagnostic that gates Phase D's fair-share enqueue
      policy). Future four-core scaling claims should add an explicit
      1-to-4 gate, keep placement evidence enabled, separate
      work-window from total-time attribution, and continue splitting
      hot scheduler metadata/lock paths.
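
Two of the decision rules recorded in the closeout list above are compact
enough to sketch: the bounded timer continuation predicate and the strict-load
caller-aware placement rule. The placement rule belongs to the pre-collapse
per-CPU model. The Python below is a minimal model of the prose, not kernel
code: the type, field, and function names are hypothetical, and both the
skip-budget refill point and the "least loaded among strictly lower CPUs"
choice are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CpuTimerState:
    """Hypothetical per-CPU state; field names are illustrative only."""
    hard_work_clean: bool      # last locked slow path published a clean summary
    pending_resched_ipi: bool  # a reschedule IPI is pending for this CPU
    skip_budget: int           # remaining skips; the text describes a budget of one
    current_is_idle: bool      # True when the CPU is running its idle thread


def may_continue_tick(cpu: CpuTimerState) -> bool:
    """True when a user-mode timer tick may skip the locked scheduler path."""
    if cpu.current_is_idle or not cpu.hard_work_clean:
        return False
    if cpu.pending_resched_ipi or cpu.skip_budget <= 0:
        return False
    # Consume the skip. Assumption: the next locked slow path refills it.
    cpu.skip_budget -= 1
    return True


def pick_publish_cpu(caller_cpu, loads, ready, single_owner):
    """Choose the CPU for a newly created thread.

    loads[i] is the non-idle dispatch load of CPU i and ready[i] says
    whether CPU i is an active, ready scheduler CPU.
    """
    if single_owner:
        return 0  # single-owner processes stay pinned to CPU0
    candidates = [i for i in range(len(loads)) if ready[i]]
    if not candidates:
        return 0  # assumption: fall back to CPU0 when nothing else is ready
    if ready[caller_cpu]:
        lower = [
            i for i in candidates
            if i != caller_cpu and loads[i] < loads[caller_cpu]
        ]
        if not lower:
            return caller_cpu  # equal load: the caller CPU wins the tie
        return min(lower, key=lambda i: (loads[i], i))
    # Caller CPU unavailable: lowest-index least-loaded scan (CPU0-biased).
    return min(candidates, key=lambda i: (loads[i], i))
```

With equal loads, `pick_publish_cpu(1, [2, 2], [True, True], False)` keeps the
child on the caller's CPU 1, which is the behavior the strict-load repair
restored for the blocking-parent benchmark.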

## Multi-Process SMP Concurrency Gates

- [x] Split the current one-owner scheduler latch into per-CPU scheduler run
      queues or equivalent ownership that can keep more than one CPU executing
      scheduler-owned work at the same time. Completed in commit `20f6894`
      (`2026-04-30 05:30 UTC`) with per-CPU scheduler ownership, current and
      handoff tracking, per-CPU idle/fallback cleanup slots, and temporary BSP
      pinning for endpoint-, launcher-, spawner-, and thread-authority holders
      so process-wide ring paths remain single-owner during this milestone.
- [x] Add reschedule IPIs for idle-to-runnable handoff across scheduler owners.
      The current scheduler tree tracks pending reschedule IPIs per target CPU,
      wakes halted scheduler-owner loops for newly runnable work, and uses the
      same serialized fixed LAPIC IPI send path as TLB shootdown without
      claiming a general preemptive reschedule interrupt.
- [x] Prove concurrent scheduler-owned work on more than one CPU with
      independent worker processes first. This avoids process-wide capability
      ring races while still proving real multi-core execution. The focused
      proof harness is on mainline as of commit `c2790c0`
      (`2026-04-30 07:38 UTC`), and the completed milestone is recorded at
      commit `3fb89923` (`2026-04-30 09:45 UTC`).
- [x] Add an SMP scaling demo binary and focused manifest. The first workload
      is segmented prime counting over generated ranges. It partitions work
      statically by worker index, avoids hot-path syscalls and serial output,
      produces aggregate prime-count/checksum verification, and prints one
      compact result line per accepted case.
- [x] Add a host harness for `make run-smp-process-scale` that runs the same workload
      under `-smp 1`, `-smp 2`, and optionally `-smp 4`, captures raw logs, and
      reports worker count, CPU count, ticks or cycles, output checksum, and
      speedup. A single noisy QEMU run is not enough evidence for a scaling
      claim; keep raw repeated-run artifacts for review.
      `tools/qemu-smp-process-scale-harness.sh` builds/uses
      `capos-smp-process-scale.iso`, stores serial logs under
      `target/smp-process-scale/<timestamp>/`, defaults to five repetitions,
      reports per-case medians, and enforces the 1.6x 1-to-2 median threshold
      only when KVM-backed evidence is available.
- [x] Treat near-linear 1-to-2 CPU speedup as the first publishable target.
      Use a threshold high enough to reject accidental concurrency illusions
      but low enough for QEMU/KVM variance, for example at least 1.6x median
      speedup over repeated runs. Record the exact threshold in the harness
      when this milestone is selected for implementation.

## `make run-smp-process-scale` Proof Contract

This target is the acceptance test for **Multi-Process SMP Concurrency**. It
must stay narrower than the later in-process threading milestone: one process
ring per worker process, no sibling threads in the timed section, no shared
ParkSpace words, no IPC throughput loop, and no completion-ring demux claim.

The first implementation should add:

- a focused `system-smp-process-scale.cue` manifest;
- a coordinator binary that receives the manifest-granted `ProcessSpawner`,
  spawns a fixed set of worker process cases, waits for each child, verifies
  aggregate results, and prints the compact result lines;
- a worker binary or a small family of worker binaries that execute one static
  partition of the deterministic workload and report only their final result
  through a parent endpoint or other existing spawn-result path after the
  timed section finishes;
- a `tools/qemu-smp-process-scale-harness.sh` host harness wired to
  `make run-smp-process-scale`.

The workload should be segmented prime counting over generated integer ranges.
Each run case divides the same total range into `workers` contiguous segments.
Worker `i` handles segment `i` without terminal output, IPC calls, heap-heavy
allocation, or capability operations in the timed region. The coordinator
collects one post-compute result per worker and verifies the aggregate prime
count plus a stable checksum or hash against known constants before it accepts
timing evidence.
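
The partition-and-verify shape described above can be sketched in a few lines
of Python. This is an illustration only: it runs the segments sequentially
rather than in separate processes, treats ranges as half-open, splits by equal
width, and uses a sum-of-primes checksum modulo 2^64. All of those choices and
every function name are assumptions, not the demo's actual implementation.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def is_prime(n: int) -> bool:
    """Trial division by odd candidates up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def split_range(lo: int, hi: int, workers: int):
    """Split [lo, hi) into `workers` contiguous, near-equal-width segments."""
    width, extra = divmod(hi - lo, workers)
    segments = []
    start = lo
    for i in range(workers):
        end = start + width + (1 if i < extra else 0)
        segments.append((start, end))
        start = end
    return segments


def run_segment(lo: int, hi: int):
    """Worker body: count primes in [lo, hi) and fold them into a checksum."""
    count = 0
    checksum = 0
    for n in range(lo, hi):
        if is_prime(n):
            count += 1
            checksum = (checksum + n) & MASK64
    return count, checksum


def run_case(lo: int, hi: int, workers: int):
    """Coordinator body: collect one result per worker and aggregate."""
    results = [run_segment(a, b) for a, b in split_range(lo, hi, workers)]
    count = sum(c for c, _ in results)
    checksum = sum(s for _, s in results) & MASK64
    return count, checksum


# The aggregate must not depend on how the range was partitioned.
assert run_case(0, 1000, 4) == run_case(0, 1000, 1)
```

An additive checksum keeps the aggregate independent of worker count and
completion order. Equal-width segments are not equal-cost for trial division,
because later segments hold larger candidates; the accepted run recorded
further down used contiguous but cost-balanced ranges.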

The guest must print one line per accepted run case in this shape:

```text
[smp-process-scale] cpus=<n> workers=<n> range=<lo>..<hi> primes=<count> checksum=<hex> elapsed=<ticks-or-cycles> verified=true
```
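
A host harness has to recognize exactly this line shape and reject anything
else. The following parser is a sketch under stated assumptions: numeric fields
are decimal, `checksum` is hexadecimal with an optional `0x` prefix, and the
sample line is made up for illustration rather than copied from a real log.

```python
import re

# Mirrors the line shape shown above; the field encodings are assumptions.
LINE_RE = re.compile(
    r"^\[smp-process-scale\] cpus=(?P<cpus>\d+) workers=(?P<workers>\d+) "
    r"range=(?P<lo>\d+)\.\.(?P<hi>\d+) primes=(?P<primes>\d+) "
    r"checksum=(?P<checksum>(?:0x)?[0-9a-fA-F]+) elapsed=(?P<elapsed>\d+) "
    r"verified=true$"
)


def parse_result_line(line: str):
    """Return the parsed fields, or None when the line is not an accepted case."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "cpus": int(fields["cpus"]),
        "workers": int(fields["workers"]),
        "range": (int(fields["lo"]), int(fields["hi"])),
        "primes": int(fields["primes"]),
        "checksum": int(fields["checksum"], 16),
        "elapsed": int(fields["elapsed"]),
    }


# Illustrative sample only; not taken from a recorded run.
SAMPLE = (
    "[smp-process-scale] cpus=2 workers=2 range=0..100 "
    "primes=25 checksum=424 elapsed=1053 verified=true"
)
print(parse_result_line(SAMPLE))
```

Anchoring the pattern at both ends and requiring `verified=true` means a
truncated serial line or an unverified case yields `None` instead of a
partial timing sample.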

The exact time source can be monotonic ticks or a cycle counter, but it must be
an in-guest measurement that brackets only the worker-process computation after
spawn/setup and before serial reporting. If timer granularity makes the proof
too noisy, increase the total range instead of measuring host wall time as the
primary signal. Host wall time may be reported as secondary harness metadata.

The host harness policy is:

- default to `CAPOS_SMP_SCALE_RUNS=5` complete repetitions per CPU-count case;
- run and report the advertised 1/2/4-worker timing cases. At minimum that
  means `-smp 1`/one worker, `-smp 2`/two workers, and a 4-worker timing case;
  the preferred 4-worker case is `-smp 4` when the local QEMU/KVM host exposes
  four usable vCPUs, otherwise the harness must still report the 4-worker case
  under the largest available SMP count and mark why a 4-vCPU run was not
  collected;
- require KVM for a speedup claim. If `/dev/kvm` or QEMU KVM acceleration is
  unavailable, the target may run a functional verification mode, but it must
  report that publishable speedup evidence was not collected;
- keep raw serial and terminal logs under a stable `target/` subdirectory such
  as `target/smp-process-scale/<timestamp>/`;
- summarize the median verified elapsed value for each case and require at
  least `1.6x` median speedup from the `-smp 1`/one-worker baseline to the
  `-smp 2`/two-worker case before accepting the near-linear 1-to-2 speedup
  claim;
- rerun the ordinary manifest, ring, thread, park, and process-exit smokes
  under `-smp 2` before marking the selected milestone complete.
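
The median and threshold part of this policy reduces to a small calculation.
The sketch below assumes the harness has already collected verified elapsed
values per case. The function name, the `smp1`/`smp2` dictionary keys, and the
sample numbers are illustrative, not harness output.

```python
from statistics import median

THRESHOLD = 1.6  # 1-to-2 median speedup required by the policy above


def evaluate(elapsed_by_case, kvm_available, threshold=THRESHOLD):
    """Summarize medians and decide whether the 1-to-2 claim is accepted.

    elapsed_by_case maps a case name to its list of verified elapsed values.
    """
    medians = {
        case: median(values)
        for case, values in elapsed_by_case.items()
        if values
    }
    if "smp1" not in medians or "smp2" not in medians:
        # Fail closed: no baseline or no two-worker samples, no speedup.
        raise ValueError("missing verified smp1 or smp2 samples")
    speedup = medians["smp1"] / medians["smp2"]
    return {
        "medians": medians,
        "speedup_1_to_2": speedup,
        # Without KVM the run counts as functional verification only.
        "publishable": kvm_available,
        "accepted": kvm_available and speedup >= threshold,
    }


# Synthetic samples for illustration: five runs per case.
samples = {
    "smp1": [1000, 1010, 990, 1005, 995],
    "smp2": [600, 610, 590, 620, 605],
}
print(evaluate(samples, kvm_available=True))
```

Using the median rather than the mean keeps a single noisy QEMU run from
moving the result, and the `kvm_available` flag keeps a non-KVM run from ever
being reported as an accepted speedup.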

As of commit `3fb89923` (`2026-04-30 09:45 UTC`), the focused manifest,
process-scale demo, and host-side harness wiring produce passing default
repeated KVM-backed speedup evidence. The accepted run in
`target/smp-process-scale/cycle-balanced-default/` recorded medians
`smp1=1693`, `smp2=1053`, and `smp4=2314`; the 1-to-2 median speedup is
`1.608x` (`1693 / 1053`), satisfying the required `1.6x` threshold. The
worker-reported elapsed value is a scaled user-mode cycle count, and the
static worker ranges are contiguous but cost-balanced for the
prime-counting loop. The ordinary `-smp 2` smoke gate also passed:
`target/smp2-smokes/run-smoke.log` covers the default manifest smoke, and
`target/smp2-smokes/run-spawn.log` covers endpoint roundtrip, ring-reserved
opcodes, timer/runtime children, thread lifecycle, park cleanup, generic child
waits, and process exit. The Multi-Process SMP Concurrency milestone is
complete. The harness fails closed when the focused manifest, ISO, expected
compact proof lines, or speedup evidence are unavailable instead of fabricating
timing evidence.

`tools/linux-smp-process-scale-baseline.sh` is the reference-OS comparison for
this proof. It builds a tiny static Linux initramfs that runs the same forked,
deterministic prime-counting workload under the same QEMU/KVM CPU and memory
envelope, records raw logs under `target/linux-smp-process-scale/`, and uses
the same default five-run median policy. The script defaults now match capOS'
balanced contiguous splits; rerun the Linux comparison before publishing a new
OS-comparison table for the accepted capOS evidence.

The process-scale harnesses also expose an opt-in `smp8-smt` diagnostic through
`CAPOS_SMP_SCALE_INCLUDE_SMT=1` and `LINUX_SMP_SCALE_INCLUDE_SMT=1`. It uses
the same range and aggregate verifier with eight contiguous ranges and is
collected only when the host reports at least eight logical CPUs. This case is
for SMT behavior on 4-core/8-thread hosts; it must not be treated as 8-core
evidence or included in the accepted 1-to-2 speedup gate.
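
Case selection, including the 4-worker fallback from the harness policy and
this opt-in SMT diagnostic, can be pictured as a small planning function. This
is a hedged sketch rather than the harness logic: `plan_cases` and its
dictionary fields are hypothetical, and running the SMT case with eight vCPUs
is an assumption drawn from the `smp8-smt` name.

```python
import os


def plan_cases(host_cpus: int, include_smt: bool):
    """Return the timing cases to run as dicts of smp count, workers, and note."""
    cases = [
        {"smp": 1, "workers": 1, "note": ""},
        {"smp": 2, "workers": 2, "note": ""},
    ]
    if host_cpus >= 4:
        cases.append({"smp": 4, "workers": 4, "note": ""})
    else:
        # Still report the 4-worker case, under the largest usable SMP count.
        smp = max(1, host_cpus)
        cases.append({
            "smp": smp,
            "workers": 4,
            "note": f"host exposes {host_cpus} CPUs; 4-vCPU run not collected",
        })
    if include_smt and host_cpus >= 8:
        cases.append({
            "smp": 8,  # assumption: the smp8-smt case uses eight vCPUs
            "workers": 8,
            "note": "smp8-smt diagnostic; excluded from the 1-to-2 gate",
        })
    return cases


if __name__ == "__main__":
    include_smt = os.environ.get("CAPOS_SMP_SCALE_INCLUDE_SMT") == "1"
    for case in plan_cases(os.cpu_count() or 1, include_smt):
        print(case)
```

Keeping the reason in a `note` field mirrors the requirement that the harness
mark why a 4-vCPU run was not collected instead of silently dropping the row.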

The proof must not depend on KVM paravirtual APIC, IPI, or TLB-flush features.
The current architectural xAPIC MMIO LAPIC timer/IPI path remains the correctness
surface; paravirtual APIC acceleration is future performance work.

Before the scheduler implementation branch claims this target, review the
non-blocked findings that could invalidate the evidence:

- panic-surface hardening for guarded unwraps, stale queues, blocking waits,
  process/thread exit, endpoint cancellation, and rollback restoration paths
  touched by scheduler ownership changes;
- quota/exhaustion behavior for the child-process, process-handle, outstanding
  call, scratch, frame, and invalid-SQE paths used by the coordinator and
  workers;
- release/revoke epoch behavior only for capabilities the demo actually grants.

Findings unrelated to this proof, such as DMA provenance, shared ParkSpace
unmap/reuse, or same-process per-thread ring routing, should stay tracked in
the migrated review-finding task records but must not be represented as
blockers for independent worker-process SMP scaling.

## SMP Review-Finding Reconciliation

This section classifies the review-finding task records for the selected
multi-process SMP proof. It does not close those findings; it defines what the
next scheduler and harness branches must satisfy before they can depend on the
paths involved in the proof.

Blocking or proof-invalidating for this milestone:

- **Scheduler panic surfaces touched by ownership changes.** A branch that
  changes scheduler ownership, per-CPU queues, idle-to-runnable handoff, or
  process/thread exit cleanup must audit and either harden or explicitly test
  the relevant `docs/panic-surface-inventory.md` scheduler rows:
  `block_current_on_cap_enter`, `capos_block_current_syscall`, stale run-queue
  process references, `exit_current`, `current_ring_and_caps`, scheduler
  `start`, and context-restore CR3 assumptions. The branch should add targeted
  host or QEMU coverage for each panic surface it claims to close.
- **Process/resource exhaustion on paths used by the coordinator.** The proof
  depends on `ProcessSpawner`, `ProcessHandle.wait`, result-cap adoption, and
  likely a parent endpoint or equivalent post-compute result path. Those paths
  must keep controlled failures for cap-slot exhaustion, process-handle
  exhaustion, endpoint queue pressure, scratch/result-buffer pressure,
  outstanding call pressure, and frame-grant/frame-exhaustion pressure from
  loading worker ELF pages, stacks, and TLS. Existing endpoint pending-RECV and
  queued-CALL overload coverage can be reused, but new coordinator-specific
  resource pressure introduced by the demo needs matching coverage before the
  proof is used as milestone evidence.
- **Runtime invalid-SQE flood handling if the harness exercises malformed
  submissions.** The process-scaling demo should not need malformed SQEs. If a
  future scheduler or harness branch adds invalid-submission stress to this
  target, it inherits whatever invalid-submission review-finding task records
  remain open at that time. Runtime flood handling and log/rate-limit
  suppression should be evaluated separately because active remediation may
  close one without closing the other. Otherwise invalid-submission remediation
  remains a separate track and should not block the pure scaling proof.

Guardrails that must be preserved but are not standalone blockers for the
independent worker-process proof:

- **Explicit revoke/epoch tests.** The demo should use only the capabilities
  needed to spawn workers and collect their final results. It must not claim
  peer revocation, stale session rejection, or object-epoch behavior unless it
  grants revocable/session-sensitive authority and adds flow-specific revoke or
  expiry tests.
- **ParkSpace unmap/reuse enforcement.** Independent worker processes should
  avoid shared ParkSpace words in the timed workload. The ordinary park smoke
  still has to pass under `-smp 2` before milestone completion.
- **Process-wide capability ring constraint.** The proof remains valid only
  because each worker has its own process ring and the timed section avoids
  ring traffic. It must not be cited as evidence for same-process sibling
  thread scalability, per-thread completion routing, or Ring v2.
- **Raw evidence retention.** Local repeated KVM logs are enough for this
  development milestone, but production/reproducibility claims remain governed
  by the provenance finding. Keep raw `target/smp-process-scale/<timestamp>/`
  artifacts for review and avoid implying third-party reproducibility.

Out of scope for this milestone unless a branch expands the demo surface:

- DMA owner state, generation-checked DMA/MMIO/IRQ handles, stale interrupt
  proofs, and DMA ResourceLedger/OOM implementation;
- shared ParkSpace unmap/reuse beyond preserving existing park smokes;
- same-process thread creation, join, TLS, per-thread rings, and Ring v2
  completion routing.

## In-Process Threading Scalability Gates

- [x] Define the per-thread capability-ring/completion-routing contract needed
      before same-process sibling threads can claim independent scaling.
      Completed 2026-04-30 10:19 UTC in
      `docs/proposals/ring-v2-smp-proposal.md`: the first Ring v2 slice uses
      kernel-chosen child-thread ring mappings, a shared `RingEndpoint` record
      for initial and child rings, and `ThreadRef -> RingEndpoint` as the
      routing model.
- [x] Move capability-ring waiting/completion routing to the per-thread
      `ThreadRef` model before claiming same-process sibling threads scale
      independently on different CPUs. Endpoint, timer, park, process-wait,
      thread-join, deferred-cancel, and direct IPC completion paths must all
      route through the target thread's `RingEndpoint` before same-process
      scaling can be claimed. Completed through the Ring v2/thread-scale
      substrate: spawned child threads receive independent ring endpoints, and
      local/controlled thread-scale evidence verifies child rings.
- [x] Ensure thread creation, FS/TLS setup, thread exit, join, park waits,
      and process exit remain generation-checked and safe when sibling threads
      can be resident on different CPUs. Completed through the reviewed
      thread-scale implementation and the closeout `run-smp2-smokes` pass.
- [x] Add an in-process thread scaling demo that uses the same class of
      deterministic CPU-bound workload as the multi-process proof, but splits
      work across sibling threads in one process. Prefer fixed-size
      parallel hashing/checksum chunks over prime counting for this milestone:
      equal-byte chunks have much more uniform work than trial division over
      increasing integer ranges, still keep the timed region syscall-free, and
      verify through one deterministic root hash. Print one compact result line
      per run.
      Completed with the `demos/thread-scale` proof and reusable
      `demos/thread-scale-workload` crate.
- [x] Add a host harness for `make run-thread-scale` that runs 1/2/4-thread
      cases under matching QEMU CPU counts, captures raw logs, and rejects
      results until the verified median speedup reaches the accepted threshold.
      Completed 2026-05-01 14:58 UTC after benchmark repair: the harness
      enforces KVM-backed 1-to-2 work and total thresholds when requested,
      carries `parent_wait` and `work_rounds` through CSV metadata, and the
      repaired blocking-parent 16 MiB/64-round run passed both enforced
      physical-core gates.

      2026-04-30 12:34 UTC functional checkpoint: this branch adds the
      same-process demo and QEMU harness as diagnostic evidence only. The
      harness retains raw serial logs under `target/thread-scale/<timestamp>/`,
      parses exactly one verified `[thread-scale]` line per 1/2/4-thread
      case, and reports median elapsed values plus diagnostic speedups.

      Focused phase diagnostics now add guest cycle fields for `spawn_ready`,
      `work`, `shutdown`, and `total` to separate thread creation/ready time,
      the syscall-free workload window, and thread exit/join time. `elapsed`
      remains the workload value and is an alias of `work`, so harness speedup
      calculations continue to use only the timed workload.

      The retained artifacts are raw QEMU serial/terminal/stdout/stderr logs
      plus `results.csv` and `summary.log`. Host-side QEMU profiling is opt-in
      through `CAPOS_THREAD_SCALE_PROFILE=1`; it requires `perf` and stores
      `perf.data`, `perf.script`, `perf.report.txt`, and
      `profile-command.txt` plus `qemu.status` in each case-run artifact
      directory. These are host samples of the QEMU process and the preserved
      workload exit status, not guest symbol attribution by themselves, so the
      guest phase counters remain the default diagnostic.

      Guest-side kernel measurement is separately opt-in through
      `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`; it rebuilds the thread-scale ISO
      with the benchmark-only kernel `measure` feature and retains release
      symbols for that benchmark build only. It writes the kernel `measure:`
      segment summaries from each case-run serial log to that case-run's
      `measure.log` and records the per-case userspace symbol map path in
      `results.csv` under `guest_symbol_map`. It also writes a
      `user-pc-symbols.log` report beside each `measure.log` and records that
      path under `user_pc_symbol_report`; the report maps aggregate and
      per-phase `user_pc_samples` exact-RIP buckets to the nearest userspace
      symbol address not greater than the PC. Those segment counters cover
      scheduler choice, schedule save/requeue, timer and park wake paths,
      cap-wait scans, thread exit/join cleanup, and process exit/drop cleanup.

      First-slice shared-kernel contention counters now add aggregate and
      per-phase `shared_kernel_lock` lines for frame allocator alloc/free lock
      acquisitions, contention, and spin loops, plus the ring-dispatch
      cap-table and ring-scratch locks before `cap::ring::process_ring`.
      Follow-up counters also cover endpoint inner queue locks, endpoint
      cancellation scratch locks, and all direct per-process address-space lock
      sites. Heap attribution now routes the global allocator mutex through
      `SharedKernelLock::Heap` in measure builds; one-run guest-measure evidence
      recorded zero timed-work-phase heap acquisitions for the syscall-free
      benchmark and nonzero spawn/shutdown allocator activity. These remain
      benchmark-only `measure` attribution and do not close the broader
      shared-service contention finding.
      Fresh result rows now explicitly classify the benchmark hot section as
      syscall-free CPU work with ring and allocator activity limited to
      setup/shutdown, no endpoint or network activity, and result-only logging.
      The harness requires those benchmark-class fields for new QEMU parses,
      validates the expected values for this benchmark, carries them into
      `results.csv`, and keeps summary-only replay tolerant of legacy CSV files
      that predate the class columns. Local one-run evidence is retained in
      `target/thread-scale/20260501T083254Z/`.
      Network/polling attribution now adds aggregate and per-phase
      `measure: network_poll` lines for initialized virtio-net scheduler,
      runtime, and interface polling; the built-in TCP HTTP proof poll;
      virtqueue poll spins and completions; and pending network waiter scans.
      The guest-measure harness requires those lines. Local one-run evidence in
      `target/thread-scale/20260501T093505Z/` passed and retained zero
      aggregate and per-phase network/poll counters for the 1/2/4-thread rows.
      The default thread-scale manifest has no virtio-net device, and the
      scheduler poll entry returns before the driver mutex in that no-device
      case. For this CPU-bound benchmark the network/poll counters are
      zero-evidence guardrails, not service-throughput proof and not milestone
      acceptance.
      The symbol map and resolved report are benchmark-only nearest-symbol
      attribution aids for interpreting raw `user_pc_samples` buckets, not
      line-level profiling, a complete guest profiler, or normal-build guest
      attribution. These diagnostics are for reviewers, not speedup acceptance.
      The guest result line deliberately prints `accepted=false` as
      diagnostic guest-side state. Host acceptance is a separate summary
      decision: `CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1` requires KVM-backed
      evidence and the configured 1-to-2 median `work`/`elapsed` speedup
      threshold, but it does not fail merely because parsed guest rows carry
      `accepted=false`. The total-case summary gate is separate and opt-in:
      `CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1` requires KVM-backed
      evidence and the configured
      `CAPOS_THREAD_SCALE_TOTAL_SPEEDUP_THRESHOLD` against the 1-to-2 median
      `total` speedup. It is also supported by summary-only replay and is not
      enforced by default.
      `capos-bench` diagnostic run
      `target/thread-scale/capos-bench-thread-20260430T125613Z/` used
      `n2-highcpu-8` KVM with QEMU pinned to physical CPUs `0-3` for five runs
      per case. Median elapsed cycles were thread1 `56244112`, thread2
      `84429072`, and thread4 `140666438`; diagnostic speedups were
      thread1-to-thread2 `0.666x` and thread1-to-thread4 `0.400x`, with all
      rows still `accepted=false`.
      After phase diagnostics landed, `capos-bench` run
      `target/thread-scale/capos-bench-phase-20260430T134301Z/` used the same
      pinned physical CPU set and recorded five-run medians: thread1
      `elapsed/work=56285136`, `spawn_ready=43054612`, `shutdown=57693626`,
      `total=157008630`; thread2 `84432724`, `76247932`, `142200058`,
      `303096216`; thread4 `140768008`, `205527230`, `395434364`,
      `741943554`. The phase output shows shutdown/join cost increasing
      sharply with worker count, but all rows still remain `accepted=false`.
      After child-ring endpoints and the optional SMT8 diagnostic landed,
      `capos-bench` physical-core run `target/thread-scale/20260430T151909Z/`
      recorded five-run medians with QEMU pinned to logical CPUs `0-3`: thread1
      `elapsed/work=56215128`, `spawn_ready=41692656`, `shutdown=57753172`,
      `total=155536564`; thread2 `84420848`, `74791942`, `142065130`,
      `301170274`; thread4 `140697028`, `143691606`, `395397620`,
      `679786606`. Final SMT diagnostic run
      `target/thread-scale/capos-bench-final-smt8-20260430T154058Z/` at commit
      `19f2fc66` used logical CPUs `0-7` and recorded medians: thread1
      `56272620`, `54277322`, `57824172`, `168448508`; thread2 `84343990`,
      `72757730`, `142229724`, `299693446`; thread4 `140992614`,
      `144614212`, `396264522`, `681167764`; thread8 `253352976`,
      `290422132`, `1239856304`, `1786188514`. All rows remain
      `accepted=false`, and thread8 is informational SMT evidence only.
      Scheduler-unpin final diagnostic run
      `target/thread-scale/scheduler-unpin-final2-20260430T160700Z/` removed the
      scheduler's transient same-pid pinning and verified 1/2/4-thread cases
      without the child-ring map/unmap TLB shootdown panics seen during this
      slice. One-run medians were thread1 `elapsed/work=56293734`,
      `spawn_ready=39202342`, `shutdown=34848540`, `total=130344694`; thread2
      `57101752`, `95921604`, `69869786`, `222894030`; thread4 `274828354`,
      `275826356`, `407818252`, `958473044`. Diagnostic speedups were
      thread1-to-thread2 `0.986x` and thread1-to-thread4 `0.205x`; all rows
      remain `accepted=false`.
      Follow-up local checks passed `make run-smp2-smokes` in
      `target/smp2-smokes/20260430T160936Z/` and reran three thread-scale
      samples in `target/thread-scale/scheduler-unpin-rerun-20260430T161104Z/`.
      That rerun kept correctness intact but recorded thread4 `902520658`
      cycles under local oversubscription, so it remains diagnostic only.
      After guest-side measurement landed, `capos-bench` runs at commit
      `a5c4f789` recorded five-run medians with QEMU pinned to host logical
      CPUs `0-3`, which map to distinct physical cores on that host: thread1
      `56341030`, thread2 `56166300`, thread4 `70122044`
      (`1.003x`, `0.803x`). The SMT diagnostic pinned to logical CPUs `0-7`
      recorded medians thread1 `56315082`, thread2 `56233080`, thread4
      `62630052`, thread8 `125488946` (`1.001x`, `0.899x`, `0.449x`).
      The one-run guest-measure pass in
      `target/thread-scale/20260430T182824Z/` recorded per-case
      `measure.log` files. Top measured guest-side cycle totals were
      `ring_processing` and `method_body`, with `sched_choose_next` and
      `thread_exit_join_cleanup` growing at higher thread counts. A follow-up
      local phase-aware guest-measure pass in
      `target/thread-scale/20260430T184532Z/` verified that each case
      `measure.log` now includes final-summary `measure: checkpoint` and
      `measure: phase` attribution for `spawn_ready`, `work`, `shutdown`, and
      `final_total`; the harness rejects guest-measure runs missing any of
      those phase summaries. These runs remain diagnostic and
      `accepted=false`.
      After phase-aware guest measurement landed on main at commit `da92ed42`,
      `capos-bench` reran the diagnostic with QEMU pinned to host logical CPUs
      `0-3`, which map to distinct physical cores on that host. Run
      `target/thread-scale/capos-bench-phase-main-20260430T191146Z/` recorded
      five-run medians: thread1 `elapsed/work=56242252`,
      `spawn_ready=38789562`, `shutdown=34859130`, `total=130093430`;
      thread2 `56233998`, `91718518`, `61923280`, `205126974`; thread4
      `62926552`, `109723566`, `119015960`, `297970796`. SMT diagnostic run
      `target/thread-scale/capos-bench-phase-smt8-main-20260430T191408Z/`
      pinned QEMU to logical CPUs `0-7` and recorded medians: thread1
      `56198166`, `41134070`, `34781494`, `132161420`; thread2 `56196302`,
      `42453050`, `63546086`, `162449504`; thread4 `62361512`, `87093620`,
      `109458814`, `258043804`; thread8 `125378372`, `249877254`,
      `528656458`, `904149404`. A one-run host-profile plus guest-measure
      sample in
      `target/thread-scale/capos-bench-profile-phase-main-20260430T191703Z/`
      used temporary host perf access with QEMU pinned to logical CPUs `0-3`,
      then restored `kernel.perf_event_paranoid=4`. The host reports still
      show QEMU/KVM execution, `ioctl`, QEMU mutexes, and MMIO/read helpers
      near the top; guest phase counters show no ring dispatches in the
      measured work phase, while shutdown/join and scheduler choice costs grow
      with worker count. These results remain diagnostic and `accepted=false`.
      Artifact content verification after collection checked `summary.log` and
      `results.csv` for the two five-run diagnostics and the one-run profile
      sample, plus the profile sample's `measure.log` and `perf.report.txt`,
      against the recorded medians, pinning, `accepted=false` status, guest
      phase claims, and host-profile claims.
      Join-cleanup optimization follow-up on branch
      `workplan/thread-scale-join-cleanup` adds per-thread pending join-waiter
      accounting so exiting worker threads that never blocked in
      `ThreadHandle.join` skip the thread-handle waiter scan. Local evidence:
      `target/thread-scale/join-cleanup-local-20260430T193657Z/` passed
      functional guest-measure verification, and
      `target/thread-scale-join-cleanup-run-spawn.log` passed
      `make run-spawn`; local timing remains diagnostic because the
      host was not a controlled benchmark environment. Controlled
      `capos-bench` reruns for this branch kept all rows
      `accepted=false`: physical-core run
      `target/thread-scale/capos-bench-join-cleanup-20260430T194536Z/`
      recorded medians thread1 `56173118`, thread2 `56166224`, thread4
      `62070170` (`1.000x`, `0.905x`), and SMT diagnostic
      `target/thread-scale/capos-bench-join-cleanup-smt8-20260430T194734Z/`
      recorded medians thread1 `56251116`, thread2 `56197306`, thread4
      `62519276`, thread8 `122089762` (`1.001x`, `0.900x`, `0.461x`).
      Scheduler-choice cleanup follow-up on branch
      `workplan/thread-scale-scheduler-choice` removes a redundant
      blocked-thread scan from the idle fallback in `choose_next_locked`.
      Local functional evidence:
      `target/thread-scale/scheduler-choice-local-20260430T200257Z/` passed
      guest-measure verification. Controlled `capos-bench` run
      `target/thread-scale/capos-bench-scheduler-choice-20260430T201041Z/`
      recorded medians thread1 `56171526`, thread2 `56301462`, thread4
      `62433702` (`0.998x`, `0.900x`), so the cleanup does not close the
      milestone.
      The immediate review-finding note that the scheduler still had a two-CPU
      owner mask is addressed by raising the temporary scheduler-owned CPU slot
      count and wake mask to four, so the 4-thread diagnostic can exercise
      four scheduler owners. This is only a blocker-removal step. The open
      attribution, serial/logging, scheduler-lock counter, workload-baseline,
      and per-CPU run-queue findings in the migrated review-finding task
      records remain required before accepting a speedup claim. Initial local
      build gates passed. The first `make run-smp2-smokes` attempt in
      `target/smp2-smokes/four-scheduler-cpus-20260430T202129Z/` exposed an
      early boot failure after the enlarged static scheduler value crossed a
      fragile initialization path. The implementation now uses a
      capacity-reserved deferred process-drop queue instead of embedding one
      `Process` slot per scheduler CPU in the `Scheduler` static. Bounded
      `run-spawn` smoke evidence passed in
      `target/smp2-smokes/four-scheduler-cpus-spawn-pending-vec-20260430T203055Z/`.
      Full `make run-smp2-smokes` passed in
      `target/smp2-smokes/four-scheduler-cpus-full-20260430T203214Z/`.
      Local thread-scale guest-measure verification passed in
      `target/thread-scale/four-scheduler-cpus-local-20260430T203356Z/` with
      `CAPOS_THREAD_SCALE_RUNS=1`, QEMU pinned to local CPUs `0-1`, and cases
      through `-smp 4`; local timing remains noisy and is not controlled
      speedup evidence. Controlled `capos-bench` runs then verified the
      effect on the benchmark host. Physical-core run
      `target/thread-scale/capos-bench-four-scheduler-cpus-20260430T203733Z/`
      used QEMU pinned to logical CPUs `0-3`, recorded medians thread1
      `56144884`, thread2 `56190496`, thread4 `36386164`
      (`0.999x`, `1.543x`), and kernel logs show AP scheduler owners on CPUs
      1-3 starting benchmark threads. SMT diagnostic
      `target/thread-scale/capos-bench-four-scheduler-cpus-smt8-20260430T203945Z/`
      used logical CPUs `0-7`, recorded medians thread1 `56181720`, thread2
      `56191504`, thread4 `56213928`, thread8 `116270280` (`1.000x`,
      `0.999x`, `0.483x`). Both rows remain `accepted=false`; the physical
      4-thread speedup is close to but below the `1.6x` threshold, and the
      SMT8 row is informational because the scheduler owner mask remains four
      CPUs.
      Scheduler-attribution follow-up branch
      `workplan/thread-scale-scheduler-attribution` adds guest-side total and
      per-phase scheduler counters for direct-target, run-queue, and idle
      candidate classes; runnable/retry/drop outcomes; and reschedule IPI
      target/sent/skipped/failure counts. Local functional verification in
      `target/thread-scale/scheduler-attribution-local-20260430T210322Z/`
      passed all 1/2/4-thread cases with
      `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`, `CAPOS_THREAD_SCALE_RUNS=1`, and
      QEMU pinned to local CPUs `0-1`; the shell wrapper reported failure only
      because it reused zsh's read-only `status` parameter after the harness
      had already written a successful `summary.log`. The 4-thread work phase
      now records scheduler retry pressure (`55` run-queue candidate checks,
      `7` idle candidate checks, `28` runnable outcomes, and `34` retry
      outcomes) while still recording zero ring dispatches. This materially
      improves attribution but does not close the broader scheduler-lock,
      serial, CR3/TLB, guest-symbol, or workload-baseline requirements in the
      migrated review-finding task records.
      Serial-attribution follow-up adds guest-side total and per-phase serial
      byte counters to `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`.
      Bytes are counted after LF-to-CRLF expansion and after a UART byte is
      emitted, including emergency writes in measure kernels. Local functional
      verification in
      `target/thread-scale/serial-attribution-local-20260430T212243Z/` passed
      all 1/2/4-thread cases with `CAPOS_THREAD_SCALE_RUNS=1` and QEMU pinned
      to local CPUs `0-1`; the stricter harness now requires aggregate and
      per-phase serial lines. The run recorded total serial bytes of `4161`,
      `4788`, and `6295`; work-phase serial bytes stayed at `74` in each case,
      while shutdown serial bytes rose from `70` to `145` to `631`. This closes
      the serial-byte counter blind spot, but it does not close scheduler-lock,
      CR3/TLB, guest-symbol, workload-baseline, or logging-suppression A/B
      requirements in the migrated review-finding task records.
      Scheduler-lock attribution follow-up adds guest-side total and per-phase
      global scheduler-lock counters to `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`.
      It records acquisitions, contended acquisitions, try-lock failures as
      `spin_loops`, contended wait cycles, and hold cycles. Local functional
      verification in
      `target/thread-scale/lock-attribution-local-20260430T214854Z/` passed all
      1/2/4-thread cases with `CAPOS_THREAD_SCALE_RUNS=1` and QEMU pinned to
      local CPUs `0-1`; the stricter harness now requires aggregate and
      per-phase scheduler-lock lines. The local 4-thread final-total counters
      were `234` acquisitions, `104` contended acquisitions, `2,161,691` spin
      loops, `1,239,033,542` wait cycles, and `570,372,812` hold cycles; the
      4-thread work phase still had `15` acquisitions, `5` contended
      acquisitions, `95,047` spin loops, `37,181,792` wait cycles, and
      `32,762,392` hold cycles. This closes the first scheduler-lock counter
      blind spot; hold cycles include measure acquisition-counter update
      overhead and exclude release-counter update and unlock overhead, so they
      are first-pass attribution rather than exact critical-section time. At
      that point, CR3/TLB, guest-symbol, workload-baseline, logging-suppression
      A/B, and controlled benchmark-host confirmation requirements in migrated
      review-finding task records remained open; timer tick count attribution was
      queued for the follow-up recorded below.
      Controlled `capos-bench` reruns after this landed on main at commit
      `6eff7ae4` used QEMU pinned to logical CPUs `0-3` for physical-core
      evidence and `0-7` for the informational SMT diagnostic. Physical-core
      run
      `target/thread-scale/capos-bench-lock-main-physical-20260430T220944Z/`
      recorded medians thread1 `56309194`, thread2 `56302666`, thread4
      `28301916` (`1.000x`, `1.990x`); SMT diagnostic
      `target/thread-scale/capos-bench-lock-main-smt8-20260430T221246Z/`
      recorded medians thread1 `56379514`, thread2 `56186566`, thread4
      `28259776`, thread8 `131264324` (`1.003x`, `1.995x`, `0.430x`). A
      one-run guest-measure confirmation in
      `target/thread-scale/capos-bench-lock-main-measure-20260430T221543Z/`
      verified scheduler, serial, and scheduler-lock lines on the benchmark
      host. Host perf profiling was not collected because
      `perf_event_paranoid=4` blocked unprivileged perf on the restarted VM.
      Timer-attribution follow-up on branch
      `workplan/thread-scale-timer-attribution` adds guest-side total and
      per-phase timer counters to `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`,
      distinguishing user-mode timer interrupts entering the scheduler path
      from kernel-mode timer interrupts that only advance time and EOI, with
      separate BSP tick-advance counts. The harness now requires aggregate and
      per-phase timer lines. Local functional verification in
      `target/thread-scale/timer-attribution-local-20260430T223441Z/` passed
      all 1/2/4-thread cases with `CAPOS_THREAD_SCALE_RUNS=1`, QEMU pinned to
      local CPUs `0-1`, and guest measurement enabled. Aggregate timer counters
      were `7/7/0/7`, `25/17/8/9`, and `132/101/31/23`
      (`interrupts/user_scheduler/kernel_only/bsp_tick_advances`); the
      4-thread work phase recorded `7/7/0/1`. The remaining attribution
      requirements at that point were CR3/TLB, guest-symbol or guest-PC
      sampling, workload-baseline, and logging-suppression A/B evidence.
      CR3/TLB-attribution follow-up on branch
      `workplan/thread-scale-tlb-attribution` adds guest-side total and
      per-phase TLB counters to `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`, covering
      runtime CR3 writes, pending-flush checks, pending full TLB flushes,
      remote shootdown requests, target CPUs, shootdown IPIs, and deferred
      completion drains. The harness now requires aggregate and per-phase TLB
      lines. Local functional verification in
      `target/thread-scale/tlb-attribution-local-20260430T225628Z/` passed all
      1/2/4-thread cases with `CAPOS_THREAD_SCALE_RUNS=1`, QEMU pinned to local
      CPUs `0-1`, and guest measurement enabled. Aggregate TLB counters were
      `3/28/0/0/0/0/0`, `7/52/3/3/3/3/2`, and `14/139/17/7/17/17/4`
      (`cr3_writes/pending_flush_checks/pending_flush_all/shootdown_requests/shootdown_target_cpus/shootdown_ipis/deferred_completion_drains`);
      the 4-thread work phase recorded `0/10/0/0/0/0/0`. The remaining
      attribution requirements at that point were guest-symbol or guest-PC
      sampling, workload-baseline evidence, and logging-suppression A/B
      evidence.
      Logging-suppression A/B follow-up adds
      `CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1` to `make run-thread-scale`.
      The knob suppresses scheduler transition diagnostics in the benchmark
      kernel while preserving proof, error, and measurement output. Local
      one-run A/B verification with `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`,
      `CAPOS_THREAD_SCALE_RUNS=1`, and QEMU pinned to local CPUs `0-1`
      produced artifacts in
      `target/thread-scale/logging-ab-baseline-local-20260430T231800Z/` and
      `target/thread-scale/logging-ab-suppressed-local-20260430T232600Z/`.
      Targeted scheduler diagnostic line counts dropped from `7/12/18` to
      `0/0/0` for the 1/2/4-thread cases, and aggregate serial bytes dropped
      from `4161/4743/5889` to `3894/4280/5047`. This closes only the logging
      A/B blind spot; guest-symbol or guest-PC sampling and workload/cacheline
      baseline evidence remained open.
      Linux pthread baseline follow-up adds
      `make run-linux-thread-scale-baseline` for the exact fixed-size
      thread-scale checksum workload. Controlled native `capos-bench` runs at
      commit `370ce145` with taskset pinned to physical-core logical CPUs
      `0-3` recorded padded-slot capOS-shaped work-window medians of `306776`,
      `152293`, and `1120024` ns for 1/2/4 workers (`2.014x`, `0.274x`).
      Compact-slot medians were similar at `316388`, `152291`, and `1123534`
      ns (`2.078x`, `0.282x`), so result-slot false sharing is not the visible
      differentiator for the current workload shape. The SMT diagnostic pinned
      to `0-7` recorded padded work medians `303877`, `155565`, `170019`, and
      `243481` ns for 1/2/4/8 workers (`1.953x`, `1.787x`, `1.248x`). The
      exact baseline shows the one-megabyte workload and coordinator spin
      window are not a clean four-core linear-scaling reference. This closes
      the exact Linux pthread baseline and result-slot padding blind spots
      only; guest-symbol or guest-PC sampling and larger-workload/Amdahl-
      sensitivity evidence remain open.
      Benchmark repair follow-up completed 2026-05-01 14:58 UTC: the default
      host baselines now use blocking parent join, 262,144 blocks (16 MiB), and
      `work_rounds=64` instead of the old 1 MiB/spinning-parent shape.
      Controlled Linux evidence on the selected physical CPU set recorded
      1-to-2 work/total speedups `1.991x`/`1.990x` and 1-to-4 work/total
      speedups `3.958x`/`3.834x`, proving the repaired benchmark shape can
      scale on the host before capOS results are interpreted as scheduler
      evidence.
      Guest-PC sampling follow-up adds a measure-only exact-RIP histogram for
      user-mode timer interrupts while a thread-scale case is active. The
      harness now requires aggregate and per-phase `user_pc_samples` lines for
      `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`. Local one-run verification in
      `target/thread-scale/guest-pc-sampling-local-20260501T001500Z/` used
      `CAPOS_THREAD_SCALE_RUNS=1` with QEMU pinned to local CPUs `0-1` and
      passed all 1/2/4-thread cases. Aggregate PC sample counts were `6`,
      `17`, and `55` with zero overflow; the 4-thread phase counts were
      spawn-ready `13`, work `9`, shutdown `33`, and final-total `55`. This
      closes the guest-PC sampling blind spot only; the later symbol-map
      harness slice preserves a benchmark-only userspace map for interpreting
      those raw PC buckets, and larger-workload Amdahl-sensitivity evidence
      remained open until the follow-up below.
      Resolved PC attribution report follow-up completed 2026-05-01 06:13 UTC
      on branch `workplan/thread-scale-pc-symbol-report`: guest-measure
      case-runs now write `user-pc-symbols.log` beside `measure.log` and
      record it in `results.csv` under `user_pc_symbol_report`. Local
      verification in `target/thread-scale/20260501T060822Z/` used
      `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`, `CAPOS_THREAD_SCALE_RUNS=1`, and
      QEMU pinned to local CPUs `0-1`; the thread4 report resolves sampled PCs
      to `worker_entry`, `run_case`, and `RingClient::wait` nearest symbols and
      keeps PCs below the first symbol as explicit `<unmapped>` rows.
      Larger-workload/Amdahl follow-up adds `CAPOS_THREAD_SCALE_TOTAL_BLOCKS`
      and `LINUX_THREAD_SCALE_TOTAL_BLOCKS` so the same deterministic checksum
      workload can run beyond the default one-megabyte case. Controlled
      `capos-bench` runs at commit `32c066b8` used `1,048,576` blocks
      (64 MiB). With QEMU pinned to physical-core logical CPUs `0-3`, capOS
      work medians were `112590712`, `112511206`, and `36369098` cycles for
      1/2/4 workers (`1.001x`, `3.096x`), while total medians were
      `189204910`, `218898002`, and `205640850` cycles (`0.864x`,
      `0.920x`). The matching native Linux physical-core baseline recorded
      work medians `17766664`, `8961256`, and `7442107` ns (`1.983x`,
      `2.387x`) and total medians `17883289`, `9094596`, and `10090354` ns
      (`1.966x`, `1.772x`). SMT diagnostic rows pinned to `0-7` recorded
      capOS 1-to-2/4/8-worker work speedups of `1.002x`, `2.870x`, and `0.644x`
      and Linux speedups of `1.993x`, `2.458x`, and `2.658x`. Raw artifacts
      are under `target/thread-scale/amdahl-1048576-physical-20260501T003700Z/`,
      `target/thread-scale/amdahl-1048576-smt8-20260501T004200Z/`,
      `target/linux-thread-scale/amdahl-1048576-physical-20260501T003400Z/`,
      and `target/linux-thread-scale/amdahl-1048576-smt8-20260501T004000Z/`.
      This closes the larger-workload evidence blind spot, but the milestone
      remains open because 1-to-2 work scaling is flat and total-case scaling
      remains below 1x for 2/4 workers. The guest rows still carry diagnostic
      `accepted=false`; host-summary acceptance remains gated by KVM evidence
      and the configured 1-to-2 median work and opt-in total thresholds.
      Guest-measure runs now preserve the benchmark-only userspace symbol map
      needed to interpret raw PC buckets after collection.
      Post-threshold-policy `capos-bench` reruns at main commit `f198b099`
      verified the host-summary total-speedup fields while keeping the
      milestone open. Physical-core pinning `0-3` recorded work speedups
      `1.002x` and `1.002x` plus total speedups `0.911x` and `0.601x` for
      2/4 workers in
      `target/thread-scale/total-threshold-main-physical-20260501T065028Z/`.
      SMT diagnostic pinning `0-7` recorded 1-to-2/4/8 work speedups `1.001x`,
      `0.998x`, and `0.333x` plus total speedups `0.913x`, `0.621x`, and
      `0.200x` in
      `target/thread-scale/total-threshold-main-smt8-20260501T065443Z/`.
      Scheduler-lock site attribution follow-up completed 2026-05-01
      09:52 UTC: guest-measure kernels keep the existing aggregate
      `measure: scheduler_lock` line and add aggregate plus per-phase
      `measure: scheduler_lock_site` counters for generic, timer pre-ring,
      timer select, blocking, process exit, thread exit, start/idle selection,
      wake/unblock, and metadata classes. The harness requires those lines for
      `CAPOS_THREAD_SCALE_GUEST_MEASURE=1`. Local one-run evidence in
      `target/thread-scale/20260501T100202Z/` verified the new lines and still
      reported `accepted=false` with 1-to-2/1-to-4 work speedups `0.998x` and
      `1.001x` and total speedups `0.921x` and `0.509x`. This is bounded
      split-prep attribution for the known global scheduler-lock bottleneck,
      not speedup evidence; the later caller-aware placement closeout above is
      the controlled evidence that passed the work and total gates.
- [x] Record aggregate same-process worker placement for
      `make run-thread-scale` and fix creation-time local concentration.
      Completed 2026-05-01 12:37 UTC: guest-measure output recorded
      aggregate publish, selected-CPU, first-selected CPU, and migration
      buckets for CPU slots 0-3. Newly created non-single-owner threads
      were published to the least-loaded active scheduler CPU slot, while
      single-owner capability pinning, generation checks, direct-IPC
      preference, and allocation-free timer/unblock paths were preserved.
      This aggregate evidence proved the 4-worker first-selected
      distribution reached all four scheduler CPU slots, but it was not
      per-worker identity tracking and it was not speedup evidence.
      (Update 2026-05-02: the publish counters and the caller-aware
      placement chain were retired with the per-CPU run-queue collapse;
      `make run-thread-scale` and the kernel measure printer no longer
      emit the `publish_*_cpu*` / `publish_caller_*` fields. Selected-CPU,
      first-selected CPU, and migration buckets remain. Per-CPU placement
      evidence returns with the fair-share enqueue policy that Phase D
      will own.)
- [ ] If later attribution needs individual worker histories, add per-worker
      placement output for first scheduled CPU, latest scheduled CPU, migration
      count, and runnable-owner distribution without replacing the aggregate
      counters used by the thread-scale harness.
- [x] Treat same-process speedup as a separate claim from multi-process SMP
      concurrency. Passing `make run-smp-process-scale` must not imply this
      milestone is complete. Completed: same-process speedup was accepted only
      after `make run-thread-scale` controlled evidence on the thread-scale
      harness, separate from the earlier process-scale milestone.
- [x] Keep the ordinary `-smp 2` regression gate repeatable while the
      thread-scaling implementation evolves. The `make run-smp2-smokes` target
      runs the default manifest smoke and the spawn manifest smoke with
      `-smp 2`, retaining raw per-target logs under the configured target
      directory. Closeout evidence passed.
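
The diagnostic speedups quoted in the harness item above are ratios of
per-case medians, and the host summary gate is a separate decision from the
guest's `accepted=false` field. The following is a minimal sketch of that
arithmetic, not the harness itself: `median_speedup` and `summary_accepts` are
hypothetical names, and the `1.6` threshold in the example is borrowed from the
threshold figure mentioned above only for illustration, since the configured
1-to-2 value is not stated in this file.

```python
from statistics import median


def median_speedup(base_cycles, case_cycles):
    """Return median(base) / median(case); above 1.0 means the case ran faster."""
    return median(base_cycles) / median(case_cycles)


def summary_accepts(base_cycles, case_cycles, *, kvm_backed, threshold):
    """Hypothetical host-summary gate.

    The guest-side accepted=false flag is deliberately not consulted here:
    only KVM-backed evidence and the median speedup threshold decide.
    """
    if not kvm_backed:
        return False
    return median_speedup(base_cycles, case_cycles) >= threshold


# Medians quoted above for the commit 6eff7ae4 physical-core run, passed as
# single-sample lists because the raw per-run values are not in this file.
thread1, thread2, thread4 = [56309194], [56302666], [28301916]

print(f"1-to-2: {median_speedup(thread1, thread2):.3f}x")
print(f"1-to-4: {median_speedup(thread1, thread4):.3f}x")
print("gate:", summary_accepts(thread1, thread4, kvm_backed=True, threshold=1.6))
```

Feeding the quoted medians through the ratio reproduces the `1.000x` and
`1.990x` figures recorded for that run, which is a quick way to cross-check a
`summary.log` against its `results.csv`.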

## Task Selection

Choose a task that isolates scheduler and CPU parallelism rather than a
subsystem bottleneck. Both milestones should use workload shapes with these
properties:

- CPU-bound and deterministic, with no network, disk, terminal, or heap-heavy
  hot path.
- Naturally partitionable into independent chunks so workers do not share a
  lock, mutable buffer, or capability ring while the timed section runs.
- Verifiable by a compact checksum, count, or known-answer oracle.
- Long enough to dominate boot, process spawn, timer granularity, and serial
  logging overhead.
- Runnable as independent worker processes for the multi-process milestone,
  and runnable as sibling threads through the per-thread completion-routing
  model used by the in-process milestone.

Avoid using IPC throughput, capability-ring dispatch, park wake storms,
console logging, or allocator stress as the first SMP scaling claim. Those are
valid later benchmarks, but they measure shared kernel bottlenecks as much as
CPU scheduling. Same-process thread scaling remains a separate milestone
because it needs accepted per-thread-ring timing evidence, not only functional
sibling execution.
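
As a rough illustration of the workload properties listed above, the sketch
below partitions prime counting into disjoint ranges and checks the total
against a known-answer oracle. It is a host-side approximation only: host
threads stand in for independent worker processes, and `is_prime`,
`count_primes`, and `partitioned_count` are hypothetical names, not the demo's
actual code.

```rust
use std::thread;

/// Trial-division primality test: deterministic and allocation-free.
fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    let mut d = 2;
    while d * d <= n {
        if n % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}

/// Count primes in the half-open range [lo, hi).
fn count_primes(lo: u64, hi: u64) -> u64 {
    (lo..hi).filter(|&n| is_prime(n)).count() as u64
}

/// Split [0, limit) into `workers` disjoint ranges and count each range on
/// its own thread. Workers share no lock or mutable buffer while they run.
fn partitioned_count(limit: u64, workers: u64) -> u64 {
    assert!(workers > 0);
    let step = (limit + workers - 1) / workers; // ceiling division
    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                let lo = (w * step).min(limit);
                let hi = ((w + 1) * step).min(limit);
                s.spawn(move || count_primes(lo, hi))
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<u64>()
    })
}

fn main() {
    // Known-answer oracle: there are 1229 primes below 10_000.
    const LIMIT: u64 = 10_000;
    const EXPECTED: u64 = 1229;
    for workers in [1, 2, 4] {
        let got = partitioned_count(LIMIT, workers);
        assert_eq!(got, EXPECTED, "workers={workers}");
        println!("workers={workers} primes={got} ok");
    }
}
```

A real timed run would need a much larger limit so the counted section
dominates spawn and logging overhead; the small limit here only keeps the
oracle easy to check.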

For the in-process milestone, the default workload should be a uniform
fixed-size chunk workload such as BLAKE3-style tree hashing, CRC32C over
disjoint buffers, or a small native deterministic block-hash loop. The first
implementation does not need a cryptographic dependency; it does need
fixed-size chunks, per-thread private output slots, and a root checksum that
detects missing, duplicated, or reordered chunks. Prime counting remains valid
historical evidence for multi-process concurrency, but it is a weaker
same-process scaling workload because numeric range cost is not uniform.
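
The shape described above (fixed-size chunks, per-thread private output slots,
and an order-sensitive root checksum) could look roughly like the following
host-side sketch. It assumes FNV-1a as a stand-in for the "small native
deterministic block-hash loop"; `CHUNK_SIZE`, `chunk_digest`, `root_checksum`,
and `hash_parallel` are hypothetical names, and the checksum is not
cryptographic, so it detects chunk errors only with high probability.

```rust
use std::thread;

/// Fixed chunk size so every chunk costs the same amount of work.
const CHUNK_SIZE: usize = 4096;

const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a 64-bit digest of one chunk.
fn chunk_digest(chunk: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &b in chunk {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Fold the per-chunk digests in index order, mixing in each chunk index so
/// that a missing, duplicated, or reordered chunk changes the root.
fn root_checksum(digests: &[u64]) -> u64 {
    let mut root = FNV_OFFSET;
    for (i, d) in digests.iter().enumerate() {
        for word in [i as u64, *d] {
            for b in word.to_le_bytes() {
                root ^= b as u64;
                root = root.wrapping_mul(FNV_PRIME);
            }
        }
    }
    root
}

/// Hash `data` with `threads` workers. Each worker receives a contiguous run
/// of input chunks and a matching private slice of output slots.
fn hash_parallel(data: &[u8], threads: usize) -> u64 {
    assert!(threads > 0 && data.len() % CHUNK_SIZE == 0);
    let n_chunks = data.len() / CHUNK_SIZE;
    let mut digests = vec![0u64; n_chunks];
    // Ceiling division, clamped to 1 so `chunks` never sees a zero size.
    let per_thread = ((n_chunks + threads - 1) / threads).max(1);
    thread::scope(|s| {
        let inputs = data.chunks(per_thread * CHUNK_SIZE);
        let outputs = digests.chunks_mut(per_thread);
        for (input, slots) in inputs.zip(outputs) {
            s.spawn(move || {
                for (chunk, slot) in input.chunks(CHUNK_SIZE).zip(slots.iter_mut()) {
                    *slot = chunk_digest(chunk);
                }
            });
        }
    });
    root_checksum(&digests)
}

fn main() {
    // Deterministic input: 64 chunks filled from a simple counter pattern.
    let data: Vec<u8> = (0..64 * CHUNK_SIZE).map(|i| (i % 251) as u8).collect();
    let serial = hash_parallel(&data, 1);
    for threads in [2, 4] {
        assert_eq!(hash_parallel(&data, threads), serial);
    }
    println!("root checksum {serial:#018x} matches for 1/2/4 threads");
}
```

Because the slots are disjoint slices, no worker touches another worker's
output during the timed section, and the root is computed only after every
worker has joined.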

## Grounding Files

- `docs/proposals/smp-proposal.md`
- `docs/proposals/ring-v2-smp-proposal.md`
- `docs/architecture/scheduling.md`
- `docs/architecture/threading.md`
- `docs/research/completion-ring-threading.md`
- `docs/research/out-of-kernel-scheduling.md`
- `docs/research/sel4.md`
- `docs/research/zircon.md`
- `docs/research/x2apic-and-virtualization.md`

## Notes

Initial multi-CPU scheduling may keep the current process ring while the
runtime serializes process-ring consumption. Full SMP, in which sibling threads
from one process wait independently on different CPUs, should not keep the
process-wide CQ as the kernel ABI endpoint. The target transport model is
per-thread capability rings: `cap_enter(min_complete, timeout_ns)` waits on the
current thread's CQ, kernel waiters route completions by generation-checked
`ThreadRef`, and SQPOLL becomes a per-ring mode with one kernel SQ consumer.
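
To make the generation-checked routing idea concrete, here is a minimal
user-space model. It is not the kernel's data layout or ABI: the `ThreadRef`
fields, `ThreadTable`, `RouteError`, and every method below are assumptions
made for illustration. The model shows only that a completion aimed at a
thread that has exited is rejected, even after its slot has been reused.

```rust
use std::collections::VecDeque;

/// Hypothetical handle: a slot index plus the generation seen at creation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ThreadRef {
    slot: usize,
    generation: u32,
}

#[derive(Debug, PartialEq, Eq)]
enum RouteError {
    StaleThread,
}

/// One slot per thread. A live thread owns a private completion queue.
struct ThreadSlot {
    generation: u32,
    cq: Option<VecDeque<u64>>, // None once the thread has exited
}

struct ThreadTable {
    slots: Vec<ThreadSlot>,
}

impl ThreadTable {
    fn new() -> Self {
        Self { slots: Vec::new() }
    }

    /// Create a thread, reusing a dead slot (and bumping its generation).
    fn spawn(&mut self) -> ThreadRef {
        if let Some(slot) = self.slots.iter().position(|s| s.cq.is_none()) {
            let s = &mut self.slots[slot];
            s.generation = s.generation.wrapping_add(1);
            s.cq = Some(VecDeque::new());
            return ThreadRef { slot, generation: s.generation };
        }
        self.slots.push(ThreadSlot { generation: 0, cq: Some(VecDeque::new()) });
        ThreadRef { slot: self.slots.len() - 1, generation: 0 }
    }

    /// Mark the thread as exited; stale handles are ignored.
    fn exit(&mut self, t: ThreadRef) {
        if let Some(s) = self.slots.get_mut(t.slot) {
            if s.generation == t.generation {
                s.cq = None;
            }
        }
    }

    /// Route one completion to the target thread's own queue.
    fn complete(&mut self, t: ThreadRef, user_data: u64) -> Result<(), RouteError> {
        let slot = self.slots.get_mut(t.slot).ok_or(RouteError::StaleThread)?;
        if slot.generation != t.generation {
            return Err(RouteError::StaleThread);
        }
        match slot.cq.as_mut() {
            Some(cq) => {
                cq.push_back(user_data);
                Ok(())
            }
            None => Err(RouteError::StaleThread),
        }
    }

    /// Take every completion currently queued for the thread.
    fn drain(&mut self, t: ThreadRef) -> Vec<u64> {
        match self.slots.get_mut(t.slot) {
            Some(s) if s.generation == t.generation => {
                s.cq.as_mut().map(|cq| cq.drain(..).collect()).unwrap_or_default()
            }
            _ => Vec::new(),
        }
    }
}

fn main() {
    let mut table = ThreadTable::new();
    let a = table.spawn();
    assert_eq!(table.complete(a, 7), Ok(()));
    table.exit(a);
    let c = table.spawn(); // reuses a's slot with a new generation
    assert_eq!(table.complete(a, 8), Err(RouteError::StaleThread));
    assert!(table.drain(c).is_empty());
    println!("stale ThreadRef rejected after slot reuse");
}
```

The generation check is what keeps a late completion from landing in the queue
of an unrelated thread that happens to occupy the same slot.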

SharedParkSpace park-words still need MemoryObject mapping provenance or object
pins before shared-key derivation lands.

2026-04-25 11:36 UTC: commit `d88bca7` recorded the First AP Scheduler proof.
AP cpu=1 can run scheduler-owned user contexts under `-smp 2`, and a one-way
scheduler-owner latch prevents the BSP and AP from both entering
scheduler-owned user work while the process-wide ring remains the active
transport.