# Proposal: Tickless and Realtime Scheduling

This proposal captures the scheduling design from the 2026-04-29 discussion and
the subsequent implementation status: tickless idle is useful, full-nohz belongs
behind explicit CPU isolation authority, and realtime requires scheduling
contexts rather than only per-request deadlines.

## Design Grounding

The directly relevant grounding is:

- [NO_HZ, SQPOLL, and Realtime Scheduling](../research/nohz-sqpoll-realtime.md)
- [Out-of-kernel scheduling](../research/out-of-kernel-scheduling.md)
- [Completion rings and threaded runtimes](../research/completion-ring-threading.md)
- [Multimedia pipeline latency](../research/multimedia-pipeline-latency.md)
- [Robotics realtime control](../research/robotics-realtime-control.md)
- [x2APIC and APIC virtualization](../research/x2apic-and-virtualization.md)
- [Scheduling](../architecture/scheduling.md)
- [Ring v2 For Full SMP](ring-v2-smp-proposal.md)
- [SMP](smp-proposal.md)
- [Realtime Voice Agent Shell](realtime-voice-agent-shell-proposal.md)

External grounding is recorded in the research note so reviewers can audit the
prior-art claims without treating this proposal as the source of truth.

## Goals

- Add tickless idle: when a CPU has no runnable work, stop the periodic
  scheduler tick and program the local timer for the earliest known deadline.
- Split monotonic timekeeping from timer interrupt delivery.
- Convert scheduler timeout waiters to absolute monotonic deadlines.
- Stage full-nohz as an explicit CPU isolation/lease mode for SQPOLL and
  realtime executors, not as a generic scheduler default.
- Define `SQE.deadline_ns` as request freshness metadata.
- Define `SchedulingContext` as CPU-time authority.
- Define `RealtimeIsland` as the admission object for media, robotics,
  provider, and other bounded realtime graphs.

## Non-Goals

- No ambient Linux-style `NO_HZ_FULL` for arbitrary unbudgeted user threads.
  Ordinary-thread full-nohz requires an explicit budgeted `SchedulingContext`
  target and a `CpuIsolationLease`.
- No SQPOLL on the current process-wide ring.
- No second SQ consumer through timer-side polling for SQPOLL rings.
- No TSC-deadline or x2APIC requirement for the first tickless-idle milestone.
- No hard realtime claim before kernel-path, IRQ, device, locking, and
  worst-case execution time (WCET) evidence exists.
- No full realtime policy blob inside every SQE.

## CPU Authority Taxonomy

These terms must not drift into overlapping authority systems:

```text
ResourceProfile:
  policy template selected by identity, session, account, or service profile;
  it is not spendable authority by itself.

ResourceLedger:
  coarse accounting and quota owner for a resource class. It records and
  enforces limits, including non-realtime CPU share/runtime budgets where the
  scheduler has not minted finer scheduling contexts.

SchedulingContext:
  spendable CPU-time authority with budget, period, relative deadline,
  priority/criticality, CPU mask, and overrun policy.

CpuIsolationLease:
  placement, exclusivity, and nohz/noise-isolation authority for a CPU or CPU
  set. It does not grant CPU-time credit and must charge consumed time through
  a SchedulingContext or coarse scheduler ResourceLedger.

NoHzEligibility:
  a reviewed claim or hint that a thread, ring, poller, or island may use nohz
  isolation if the scheduler can prove the current CPU state allows it.

NoHzActivation:
  the scheduler-proven current CPU state that actually suppresses ticks.

RealtimeIsland:
  admitted bundle of SchedulingContexts, memory reservations, device
  reservations, rings, endpoint/service constraints, and optional
  CpuIsolationLeases.
```

Scheduling-context donation is not generic resource donation. It donates only
execution budget/deadline along a synchronous capability path; it does not
donate capability authority, invocation subject identity, disclosure scope,
memory budget, network budget, storage budget, or service-management authority.

## Layer 1: Tickless Idle

Tickless idle should be the first behavioral milestone. It applies only when
the CPU has no runnable thread and no local work that still depends on a
periodic scheduler tick.

### Clocksource

Add a monotonic clock layer:

```rust
pub fn monotonic_ns() -> u64;
```

The first backend can use the current periodic tick as a compatibility source
while the system is still periodic. The selected QEMU/x86_64 backend should
eventually use a calibrated stable counter, with SMP consistency handled when
multiple scheduler owners exist.

Required invariant:

```text
monotonic_ns() never moves backwards on one CPU.
```
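One way to hold this invariant regardless of backend is to publish every raw reading through an atomic maximum. The following is a minimal sketch, assuming a hypothetical per-CPU `MonotonicClock` slot; the type and its `publish` method are illustrative names, not the kernel's actual clocksource API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-CPU slot holding the last published monotonic time.
pub struct MonotonicClock {
    last_ns: AtomicU64,
}

impl MonotonicClock {
    pub const fn new() -> Self {
        Self { last_ns: AtomicU64::new(0) }
    }

    /// Publish a raw backend reading and return the value readers will see.
    /// `fetch_max` keeps the stored value non-decreasing even when the raw
    /// reading regresses (for example after a recalibration).
    pub fn publish(&self, raw_ns: u64) -> u64 {
        let prev = self.last_ns.fetch_max(raw_ns, Ordering::AcqRel);
        prev.max(raw_ns)
    }

    /// Read the last published value.
    pub fn monotonic_ns(&self) -> u64 {
        self.last_ns.load(Ordering::Acquire)
    }
}
```

This only enforces non-decreasing time; it says nothing about a minimum forward rate, which a backend would have to guarantee separately.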

### Clockevent

Add a small scheduler timer backend boundary:

```rust
trait ClockEvent {
    fn program_periodic(period_ns: u64);
    fn program_oneshot(delta_ns: u64);
    fn stop();
    fn min_delta_ns() -> u64;
    fn max_delta_ns() -> u64;
}
```

The first backend is the current PIT-calibrated xAPIC LAPIC timer on vector
48. PIT/PIC and periodic LAPIC remain fallback paths.
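Because waiters hold absolute deadlines while `program_oneshot` takes a relative delta, the caller has to convert and clamp. A minimal sketch of that conversion follows, assuming a hypothetical `OneShotLimits` value filled from `min_delta_ns()` and `max_delta_ns()`; the helper name `oneshot_delta_ns` is illustrative.

```rust
/// Hypothetical snapshot of the backend's programmable range.
#[derive(Clone, Copy)]
pub struct OneShotLimits {
    pub min_delta_ns: u64,
    pub max_delta_ns: u64,
}

/// Convert an absolute monotonic deadline into a programmable delta.
/// Requires `min_delta_ns <= max_delta_ns` (`clamp` panics otherwise).
pub fn oneshot_delta_ns(now_ns: u64, deadline_ns: u64, limits: OneShotLimits) -> u64 {
    // An already-expired deadline saturates to 0 instead of wrapping, and is
    // then raised to the smallest delta the hardware can arm.
    let delta = deadline_ns.saturating_sub(now_ns);
    // A far-future deadline is cut to the largest delta; the timer handler
    // is expected to re-arm when it fires before the real deadline.
    delta.clamp(limits.min_delta_ns, limits.max_delta_ns)
}
```

Clamping at the maximum means a long sleep may wake early once and re-arm, which keeps the conversion independent of the counter width of any particular timer.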

### Deadline Waiters

Convert timeout state from tick counts to absolute deadlines:

```rust
struct DeadlineWaiter {
    deadline_ns: u64,
    target: ThreadRef,
    kind: WaiterKind,
    user_data: u64,
}
```

Affected paths:

- `Timer.sleep`;
- `cap_enter(timeout_ns)`;
- ParkSpace timeout;
- future process/thread wait timeouts;
- network poll deadline through `NetworkPollClock`.

Waiter storage remains bounded. No interrupt path may allocate.
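The bounded, allocation-free requirement can be illustrated with a fixed-capacity table. This is a sketch under stated assumptions: `Waiter` is a reduced stand-in for `DeadlineWaiter` (the thread reference is a plain id), and `WaiterTable` with its linear scans is a hypothetical structure, not the scheduler's actual waiter storage.

```rust
/// Reduced stand-in for `DeadlineWaiter`, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Waiter {
    pub deadline_ns: u64,
    pub thread_id: u32,
}

/// Fixed-capacity table: insert, scan, and expiry never touch the heap.
pub struct WaiterTable<const N: usize> {
    slots: [Option<Waiter>; N],
}

impl<const N: usize> WaiterTable<N> {
    pub const fn new() -> Self {
        Self { slots: [None; N] }
    }

    /// Fails closed by handing the waiter back when the table is full.
    pub fn insert(&mut self, w: Waiter) -> Result<(), Waiter> {
        for slot in self.slots.iter_mut() {
            if slot.is_none() {
                *slot = Some(w);
                return Ok(());
            }
        }
        Err(w)
    }

    /// Earliest pending deadline, used to program the next one-shot.
    pub fn earliest_deadline_ns(&self) -> Option<u64> {
        self.slots.iter().flatten().map(|w| w.deadline_ns).min()
    }

    /// Remove and return one waiter whose deadline has passed, if any.
    pub fn pop_expired(&mut self, now_ns: u64) -> Option<Waiter> {
        for slot in self.slots.iter_mut() {
            if let Some(w) = *slot {
                if w.deadline_ns <= now_ns {
                    *slot = None;
                    return Some(w);
                }
            }
        }
        None
    }
}
```

A linear scan is O(N) per operation, which is acceptable only because N is a small compile-time bound; a larger table would want an intrusive heap or timer wheel with the same no-allocation property.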

### Network Poll Clock

The kernel-resident networking path is scheduler-polled. Rather than keep every
network-coupled lease in `ForcedPeriodic`, the in-kernel virtio-net poll is now
*routed* off a lease-isolated CPU (landed 2026-06-04,
`scheduler-nohz-network-poll-housekeeping-routing`). `virtio::poll_scheduler`
consults `sched::current_cpu_lease_nohz_active()` and skips driving the poll
from a CPU inside a lease-backed tick-suppression window, so that CPU no longer
needs the periodic tick to make network progress. The always-ticking
housekeeping CPU that lease admission already requires keeps servicing
virtqueue completions and pending network-waiter scans.

The `CpuIsolationLease` activation preflight reflects this with a
`network_polling=routed-periodic-network-polling-to-housekeeping-cpu` admit
label when a housekeeping CPU is available. Otherwise it fails closed with
`rejected-network-polling-no-housekeeping-cpu-to-relocate`, and the lease is
refused at create when no housekeeping CPU exists. The longer-term explicit
poll-deadline interface below remains the target for fully removing the
dependency on a housekeeping CPU continuing to tick:

```rust
trait NetworkPollClock {
    fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
    fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}
```

`next_poll_deadline_ns` lets the scheduler include TCP/runtime timers in
`earliest_global_deadline()`. `poll_until_budget` prevents network progress
from becoming an unbounded idle-exit or interrupt path. A CPU with active
networking may enter tickless idle only when the network runtime is inactive or
has exposed a bounded deadline through this interface.
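Folding the network deadline into the scheduler's earliest deadline is a small `Option` merge. The sketch below assumes the scheduler already has its own earliest waiter deadline as an `Option<u64>`; the two-argument signature is illustrative and is not the signature of the real `earliest_global_deadline()`.

```rust
/// Merge the scheduler's earliest waiter deadline with the network runtime's
/// next poll deadline. `None` means "this source has no pending deadline".
pub fn earliest_global_deadline(
    waiter_deadline_ns: Option<u64>,
    network_deadline_ns: Option<u64>,
) -> Option<u64> {
    match (waiter_deadline_ns, network_deadline_ns) {
        // Both sources have work: the CPU must wake for the earlier one.
        (Some(a), Some(b)) => Some(a.min(b)),
        // At most one source has work: use whichever is present.
        (a, b) => a.or(b),
    }
}
```

When the result is `None` the CPU has no timer-driven reason to wake and can wait for an IPI or device interrupt alone.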

### Kernel Idle

Tickless idle depends on replacing the user-mode idle process with a
kernel/per-CPU idle context. Timer IRQ handling must distinguish:

```text
IRQ from CPL3 user thread -> save/restore user context
IRQ from CPL0 idle        -> wake/check scheduler without fake user context
```

Idle entry shape:

```text
if no runnable work:
    deadline = earliest_global_deadline()
    clockevent.program_oneshot(deadline - now)
    enter_kernel_idle()
```

The idle loop enables interrupts, halts, wakes on timer/IPI/device interrupt,
then rechecks runnable work and deadline expiry.

### Tickless State

Per CPU:

```text
Periodic:
  normal scheduler tick active

TicklessIdle:
  no runnable thread
  one-shot local timer programmed for earliest deadline
  CPU in kernel idle

ForcedPeriodic:
  fallback when a subsystem still needs regular polling
```

Enter `TicklessIdle` only when:

```text
run queue empty
no direct IPC target
no deferred completion work
no timer-side ring work required
clockevent supports one-shot
kernel idle context available
network runtime inactive or deadline-driven
```

Keep periodic preemption whenever there is runnable contention. A CPU with even
one runnable user thread stays periodic until the Ring v2, CPU accounting, and
timer-side polling dependencies are resolved.
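The entry conditions can be read as a pure function from a per-CPU snapshot to a tick state. The following sketch assumes a hypothetical `CpuSnapshot` struct with one boolean per listed condition. How the failing conditions divide between `Periodic` and `ForcedPeriodic` is also an assumption of this sketch, since the text above does not fix that mapping.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum TickState {
    Periodic,
    TicklessIdle,
    ForcedPeriodic,
}

/// Hypothetical per-CPU snapshot mirroring the listed entry conditions.
#[derive(Clone, Copy)]
pub struct CpuSnapshot {
    pub run_queue_empty: bool,
    pub direct_ipc_target: bool,
    pub deferred_completion_work: bool,
    pub timer_side_ring_work: bool,
    pub oneshot_supported: bool,
    pub kernel_idle_available: bool,
    pub network_inactive_or_deadline_driven: bool,
}

pub fn next_tick_state(s: CpuSnapshot) -> TickState {
    // Runnable work always keeps the periodic preemption tick.
    if !s.run_queue_empty || s.direct_ipc_target {
        return TickState::Periodic;
    }
    // Idle, but something still depends on regular polling, or the backend
    // cannot arm a one-shot: stay periodic as an explicit fallback.
    let blocked = s.deferred_completion_work
        || s.timer_side_ring_work
        || !s.oneshot_supported
        || !s.kernel_idle_available
        || !s.network_inactive_or_deadline_driven;
    if blocked {
        TickState::ForcedPeriodic
    } else {
        TickState::TicklessIdle
    }
}
```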

## Layer 2: SQPOLL NoHz

SQPOLL full-nohz is a later CPU ownership mode:

```text
full-nohz is not a timer feature here;
it is part of the SQPOLL CPU ownership contract.
```

Required prerequisites:

- Ring v2 or equivalent per-thread rings;
- one SQ consumer per ring, including implemented syscall-mode leases and
  bounded SQPOLL mode transitions;
- per-CPU scheduler ownership;
- reschedule IPI and idle-to-runnable handoff;
- at least one housekeeping CPU;
- explicit placement of network polling away from isolated CPUs.

Current Phase F status:

- `CpuIsolationLease` and nohz telemetry exist.
- The housekeeping/deferred-work placement child records selected online
  housekeeping CPU masks plus deferred cleanup, timer/deadline, network
  polling, IRQ-affinity, accounting-target, and cleanup-latency placement or
  rejection labels.
- Bounded SQPOLL ring mode can progress from periodic service or one
  current-thread syscall/producer-wake batch.
- The clockevent/deadline substrate has split monotonic clocksource reads from
  LAPIC clockevent programming.

The clockevent one-shot's firing precision is proven, not just its
programming: a runtime-reprogrammed `TICK_NS/2` one-shot armed over the live
periodic timer is measured to *fire* at its requested sub-tick instant (~5 ms
for a 5 ms request, far under the 10 ms tick, with the current-count correctly
reset to the sub-tick value), and the kernel-mode-fire path restores a live
periodic timer so a one-shot consumed without running `schedule()` cannot
strand the CPU with no timer source (`make run-scheduling-context`).

The monotonic clocksource *discipline* is now sub-tick-accurate as well. The
periodic discipline step previously floored every fire to `epoch + TICK_NS`
(`max(tsc_interpolated, epoch + TICK_NS)`), which inflated a real sub-tick
interval to a full tick and hid sub-tick deadlines from the accounting clock.
`discipline_clocksource_tick` now trusts the TSC interpolation at sub-tick
granularity and falls back to the `TICK_NS` floor only when the interpolated
advance is implausibly small (below `MIN_DISCIPLINED_ADVANCE_NS`), preserving a
minimum forward rate against a degenerate TSC (`publish_monotonic_ns` enforces
only non-decreasing time, not a minimum rate). A boot proof advances a real
`TICK_NS/2` interval through one discipline step and asserts `monotonic_ns()`
tracked the sub-tick delta rather than the full-tick floor
(`make run-scheduling-context`).
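The difference between the old and new discipline rules can be shown side by side. This is a sketch of the rule as described above, not the kernel's `discipline_clocksource_tick`: the function names are illustrative, `TICK_NS` is taken as the 10 ms tick quoted earlier, and the value of `MIN_DISCIPLINED_ADVANCE_NS` is a placeholder because the proposal does not state it.

```rust
/// 10 ms scheduler tick, matching the tick length quoted in this section.
pub const TICK_NS: u64 = 10_000_000;
/// Placeholder plausibility threshold; the real value is not stated here.
pub const MIN_DISCIPLINED_ADVANCE_NS: u64 = 1_000;

/// Previous rule: every fire is floored to a full tick past the epoch.
pub fn discipline_floored(epoch_ns: u64, tsc_interpolated_ns: u64) -> u64 {
    tsc_interpolated_ns.max(epoch_ns + TICK_NS)
}

/// Current rule: trust the interpolation unless the advance is implausibly
/// small, in which case fall back to the full-tick floor.
pub fn discipline_sub_tick(epoch_ns: u64, tsc_interpolated_ns: u64) -> u64 {
    let advance = tsc_interpolated_ns.saturating_sub(epoch_ns);
    if advance < MIN_DISCIPLINED_ADVANCE_NS {
        epoch_ns + TICK_NS
    } else {
        tsc_interpolated_ns
    }
}
```

For a real `TICK_NS/2` interval the old rule reports a full tick of elapsed time while the new rule reports the half tick, which is the behavior the boot proof asserts.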

The first activation increment is now real: the `CpuIsolationLease`
activation preflight performs **real per-CPU periodic-tick suppression** for
the narrow single-runnable-entity window. When the preflight finds every
proof obligation satisfied -- exactly one runnable caller on the target CPU,
ready housekeeping CPU, no local deferred-cleanup/timer dependency, valid
accounting target, live monotonic clocksource, non-stale one-SQ-consumer, and
bounded revocation latency -- and the target CPU is the CPU running the
preflight, it masks the periodic LAPIC tick and arms a bounded one-shot
deadline at `min(nearest pending timer wakeup, now + max revocation latency)`.

Network polling is now routed to a housekeeping CPU rather than kept read-only
fail-closed (landed 2026-06-04): the in-kernel virtio-net poll skips driving
from a lease-isolated CPU (`virtio::poll_scheduler` consulting
`sched::current_cpu_lease_nohz_active()`), so the admission `network_polling`
gate flips to a `routed-periodic-network-polling-to-housekeeping-cpu` admit when
a housekeeping CPU is available and fails closed otherwise.

IRQ affinity is now routable in a bounded form (landed 2026-06-04): when a
lease opts in, the activation path reprograms the leased CPU's legacy IO-APIC
redirection-entry destinations onto the selected housekeeping CPU
(mask-before-reprogram + read-back, restored on rollback/revoke) before
admitting tick suppression, and keeps the conservative
`rejected-irq-affinity-not-routed-to-housekeeping` refusal for any ring-coupled
lease whose IRQ dependency cannot be safely rerouted. The live reroute is
presently scoped to a quiescent housekeeping destination: under the in-kernel
KVM irqchip, reprogramming an IO-APIC redirection-entry destination onto a CPU
that is actively scheduling stalls forward progress on that destination CPU, so
a general "reroute onto any housekeeping CPU regardless of occupancy" admission
remains future work behind a real destination-quiescence gate or a delivery
backend without that re-evaluation cost.

Every disqualifying change (stale lease generation, a second runnable entity,
stealable sibling work, a local deferred-cleanup dependency, a target-CPU
mismatch, or a one-shot backend that can no longer arm a deadline) rolls the
CPU back to the periodic LAPIC tick first, before ordinary work continues.

Generic full-nohz for ordinary budgeted compute threads is now admitted through
explicit `SchedulingContext`-targeted compute leases. A generic SQPOLL nohz
state machine now admits explicitly leased caller-thread rings when the ring is
in SQPOLL running/sleeping mode with a live owner, one SQ consumer, and bounded
producer-wake/deadline rollback. Broader userspace-poller/device-queue
admission and production realtime island admission remain future work; the
periodic tick stays the fail-closed fallback everywhere else.

Timeout-based auto-revoke has since landed: a lease created with
`leaseLifetimeNs > 0` auto-revokes on first observation past its deadline
(`reason=lease-expired`) and a tickless CPU under it rolls back at the next
recheck (`lease-lifetime-expired`)
(`docs/tasks/done/2026-05-30/scheduler-cpu-isolation-lease-timeout-auto-revoke.md`).
SQPOLL-driven activation is now proven by
`make run-scheduler-generic-sqpoll-nohz`: a ring-coupled `kernelSqpoll` lease
whose bound ring is in SQPOLL running/sleeping mode with a live owner is
admitted for tick suppression, producer wake drives bounded non-periodic
service, and revoke/stale-owner rollback fails closed.

The per-CPU idle thread has also landed -- the scheduler idle path is now a
CPL0 per-CPU kernel idle thread and the user-mode idle process is gone
(`docs/tasks/README.md`).

The non-atomic `createLease`-vs-`revokeGrant` SMP window
(`kernel/src/cap/cpu_isolation_pool_grant.rs:472-483`) -- a `createLease` that
passes the grant live-check on one CPU can `register` its lease just after a
concurrent `revokeGrant` on another CPU snapshotted the registry, so that lease
is not cascade-terminated and lingers until its own `leaseLifetimeNs` or an
explicit revoke -- is now a *modeled, bounded* residual rather than a prose-only
caveat. The Alloy lease/grant authority model represents it explicitly as the
`WindowLingering` set and checks that no live lease reaches a revoked grant
*outside* it. That the lingering lease was nonetheless legitimately authorized
(no lease is ever minted through an already-revoked grant) is a *temporal*
mint-time-vs-revoke property the static relational model does not itself check;
it rests on the code's create-time `minted_grant_live` gate
(`cpu_isolation_pool_grant.rs:484`), which fails closed before admission. Taken
together this is a bounded capacity-hold window, not an authority escalation. The
companion TLA+ model checks the two-lock teardown the cascade and prune share
(generation advances exactly once, no capacity double-free, no stranded
generation). Both run under `make model-scheduler-lease-alloy` /
`make model-scheduler-lease-tla`; see `models/scheduler/README.md`.

The nohz/tickless activation-rollback path -- the lock-free `NOHZ_ACTIVE_CPUS`
bit read from ISR context against the locked `dispatch.nohz_activation[slot]`
record, with IPI-delivered cross-CPU activation/rollback -- is likewise now a
*checked* model rather than a prose-only invariant. The TLA+ lifecycle model
(`models/scheduler/nohz_activation.tla`) checks that no scheduler CPU is ever
left timer-less (a fired one-shot always has the contention fallback re-arm
enabled, and is always eventually re-armed), that the lock-free bit and the
locked record always reconcile (the bit-set/record-cleared and
record-present/bit-cleared divergences the rollback and contention paths produce
are transient), and that a staled remote activation is dropped rather than
applied to a newer lease (a staled generation is never committed, and a
recorded generation staled by the cap-side `maybe_expire` path is always rolled
back by the `stale-lease-generation` disqualifier). A focused Loom test pins the
lock-free-bit ↔ locked-record reconciliation under the C11 memory model. Both
run under `make model-scheduler-nohz-tla` / `make model-scheduler-nohz-loom`;
see `models/scheduler/README.md`.

Ring mode:

```rust
enum RingMode {
    Syscall,
    SqpollStarting,
    Sqpoll,
    SqpollStopping,
}
```

In syscall mode, the owner thread's `cap_enter` drains SQ. In SQPOLL mode, a
kernel worker owns SQ head; userspace owns SQ tail and CQ head; `cap_enter`
waits for completions and may wake a sleeping poller, but it does not drain
SQ.

SQPOLL state:

```text
Disabled -> Starting -> Running -> IdleSpinning -> Sleeping -> Stopping
```

The wake protocol uses a `NEED_WAKEUP` flag. Userspace release-stores the SQ
tail, issues a full fence, acquire-loads the flags, and invokes the wake path
only if the poller has set `NEED_WAKEUP`.

The race-free sequence is normative.

Poller before sleeping:

```rust
flags.fetch_or(NEED_WAKEUP, SeqCst);

let tail = sq_tail.load(Acquire);
if sq_head != tail {
    flags.fetch_and(!NEED_WAKEUP, Release);
    continue;
}

park();
```

Producer:

```rust
write_sqe();
sq_tail.store(new_tail, Release);
fence(SeqCst);

let flags = flags.load(Acquire);
if flags & NEED_WAKEUP != 0 {
    wake_poller();
}
```

The poller must set `NEED_WAKEUP` before the final tail recheck. Otherwise a
producer can publish a new SQE after the poller checks the tail but before it
parks, losing the wake.

The `NEED_WAKEUP` publication must also be ordered before the final tail
recheck by a full store-to-load barrier. A `SeqCst` RMW is the simplest
portable rule for the ABI text; an implementation may substitute an explicitly
reviewed architecture-specific fence or park primitive that provides the same
ordering. A plain release store or release-only RMW is not sufficient for this
protocol.

The producer must likewise order the SQ tail publication before checking
`NEED_WAKEUP`. The normative sequence uses a full fence between
`sq_tail.store(..., Release)` and `flags.load(Acquire)`; an implementation may
substitute an explicitly reviewed equivalent that prevents the producer from
missing `NEED_WAKEUP` while the poller misses the new tail before parking.
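The two sequences can be exercised together in a small user-space model. This is a sketch under stated assumptions, not the ring implementation: `Ring` and `run_demo` are hypothetical names, `std::thread::park`/`unpark` stand in for the kernel park and wake primitives, and SQEs are reduced to a tail counter. The sketch also places an explicit `fence(SeqCst)` after the poller's `fetch_or`, a conservative choice so that both sides have a full fence between their store and their load.

```rust
use std::sync::atomic::Ordering::{Acquire, Release, SeqCst};
use std::sync::atomic::{fence, AtomicU32, AtomicU64};
use std::sync::Arc;
use std::thread;

const NEED_WAKEUP: u32 = 1;

struct Ring {
    sq_tail: AtomicU64, // written only by the producer
    flags: AtomicU32,   // NEED_WAKEUP is set and cleared by the poller
}

/// Run one poller and one producer; returns how many entries were consumed.
pub fn run_demo(total: u64) -> u64 {
    let ring = Arc::new(Ring { sq_tail: AtomicU64::new(0), flags: AtomicU32::new(0) });

    let poller_ring = Arc::clone(&ring);
    let poller = thread::spawn(move || {
        let mut sq_head = 0u64; // poller-owned consumer index
        while sq_head < total {
            let tail = poller_ring.sq_tail.load(Acquire);
            if sq_head != tail {
                sq_head = tail; // "consume" everything published so far
                continue;
            }
            // Announce the intent to sleep, then recheck the tail.
            poller_ring.flags.fetch_or(NEED_WAKEUP, SeqCst);
            fence(SeqCst); // conservative store-to-load barrier (sketch choice)
            if poller_ring.sq_tail.load(Acquire) != sq_head {
                poller_ring.flags.fetch_and(!NEED_WAKEUP, Release);
                continue;
            }
            thread::park(); // may return spuriously; the loop rechecks
            poller_ring.flags.fetch_and(!NEED_WAKEUP, Release);
        }
        sq_head
    });

    for new_tail in 1..=total {
        ring.sq_tail.store(new_tail, Release); // publish the "SQE"
        fence(SeqCst);
        if ring.flags.load(Acquire) & NEED_WAKEUP != 0 {
            poller.thread().unpark();
        }
    }
    poller.join().unwrap()
}
```

Calling `run_demo(1_000)` should return `1_000`: either the poller's recheck observes the new tail, or the producer observes `NEED_WAKEUP` and unparks it, so no published entry is left behind a sleeping poller. A redundant `unpark` is harmless because the poller rechecks the tail after every wake.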

An SQPOLL CPU may suppress the periodic tick only if:

```text
cpu role is SqpollIsolated
exactly one runnable entity is the poller
no ordinary user thread is runnable there
no timer-side SQ polling is enabled
no network scheduler polling is pinned there
no deferred cleanup is pinned there
stable clocksource/accounting exists
housekeeping CPU is online
```

If any condition fails, restore periodic tick or migrate the unrelated work.

### NoHz Activation Proof Obligations

To enter `SqpollNoHz` or future `AutoNoHz`, the scheduler must prove:

```text
exactly one runnable entity is assigned to the CPU
at least one housekeeping CPU is online
no local network polling dependency remains
no timer-side SQ polling can run for the active ring
no local deferred cleanup or unbound kernel worker is pinned there
no unmigratable IRQ targets that CPU unless explicitly allowed
clocksource and CPU accounting are boundary/counter driven, not tick driven
revocation latency is within the lease policy
```

The proof is dynamic. If any condition stops holding, the scheduler must
restore periodic tick, migrate unrelated work, revoke the lease, or leave nohz
mode before continuing.
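Because the proof is dynamic, it helps to picture it as a check that is rerun on every relevant state change and names the first obligation that no longer holds. The sketch below assumes a hypothetical `NoHzProofInput` snapshot and `NoHzDisqualifier` enum; the field and variant names are illustrative and do not correspond to the kernel's rejection labels.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum NoHzDisqualifier {
    NotSingleRunnable,
    NoHousekeepingCpu,
    LocalNetworkPolling,
    TimerSideSqPolling,
    LocalDeferredWork,
    UnmigratableIrq,
    TickDrivenAccounting,
    RevocationLatencyExceeded,
}

/// Hypothetical snapshot of the state the obligations range over.
#[derive(Clone, Copy)]
pub struct NoHzProofInput {
    pub runnable_entities: u32,
    pub housekeeping_cpus_online: u32,
    pub local_network_polling: bool,
    pub timer_side_sq_polling: bool,
    pub local_deferred_work: bool,
    pub unmigratable_irq: bool,
    pub irq_explicitly_allowed: bool,
    pub tick_driven_accounting: bool,
    pub revocation_latency_ns: u64,
    pub lease_max_revocation_latency_ns: u64,
}

/// Returns the first failed obligation, or `Ok(())` if nohz may be entered
/// or stay active. Callers restore the periodic tick on any `Err`.
pub fn check_nohz(p: &NoHzProofInput) -> Result<(), NoHzDisqualifier> {
    if p.runnable_entities != 1 {
        return Err(NoHzDisqualifier::NotSingleRunnable);
    }
    if p.housekeeping_cpus_online == 0 {
        return Err(NoHzDisqualifier::NoHousekeepingCpu);
    }
    if p.local_network_polling {
        return Err(NoHzDisqualifier::LocalNetworkPolling);
    }
    if p.timer_side_sq_polling {
        return Err(NoHzDisqualifier::TimerSideSqPolling);
    }
    if p.local_deferred_work {
        return Err(NoHzDisqualifier::LocalDeferredWork);
    }
    if p.unmigratable_irq && !p.irq_explicitly_allowed {
        return Err(NoHzDisqualifier::UnmigratableIrq);
    }
    if p.tick_driven_accounting {
        return Err(NoHzDisqualifier::TickDrivenAccounting);
    }
    if p.revocation_latency_ns > p.lease_max_revocation_latency_ns {
        return Err(NoHzDisqualifier::RevocationLatencyExceeded);
    }
    Ok(())
}
```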

## Layer 3: AutoNoHz CPU Lease

The long-term design should split eligibility from activation.

Eligibility says a thread, process, ring, or realtime island may use nohz
isolation:

```rust
enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

struct NoHzEligibility {
    kind: NoHzKind,
    max_revocation_latency_ns: u64,
    preferred_cpus: CpuSet,
    allow_busy_spin: bool,
    accounting_target: CpuAccountingTarget,
}

enum CpuAccountingTarget {
    CurrentSchedulingContext,
    SchedulerResourceLedger,
}
```

Activation is a scheduler proof that a CPU currently satisfies isolation
conditions. Without a lease, a latency-sensitive hint may influence placement
but must not grant exclusive CPU access.

Future lease shape:

```text
CpuIsolationLease:
  owner process/session
  allowed CPU set
  allowed mode: poller/compute/kernel-worker
  accounting target, not CPU-time credit
  revocation policy
```

Housekeeping must be explicit:

```text
Housekeeping CPU set:
  global timers
  deferred frees
  cleanup
  statistics
  non-critical kernel workers
  debug/watchdog
  load balancing and migration control
```

## Layer 4: Deadline Metadata

Deadline metadata lives in fixed ring ABI fields, not in a Cap'n Proto SQE
envelope and not in variable side metadata. The current fixed SQE layout should
not be silently reinterpreted; add these fields through a versioned
`CapSqeV2`/ring ABI gate when the transport is ready.

```rust
#[repr(C)]
struct CapSqeV2 {
    // existing fixed CapSqe fields, unchanged in order and meaning

    deadline_ns: u64,  // absolute monotonic deadline, 0 = none
    qos_flags: u32,    // drop/allow/reorder/propagate semantics
    sched_ctx_id: u32, // 0 = current/default scheduling context
}
```

`deadline_ns` is an absolute monotonic timestamp. It is request freshness
metadata, not a promise of nanosecond wakeup precision. The kernel may round
timer programming to clockevent granularity, coalesce timers where policy
allows, or report a miss when dispatch observes the timestamp has already
expired. The field remains `u64` nanoseconds because absolute `u64` ns values
are simple, tracing-friendly, and shared with existing timeout surfaces; a
`u64` microsecond field saves no ABI space.

Only consider a compact profile if SQE space becomes critical:

```rust
deadline_delta_us: u32
```

That profile would be a soft-deadline compact transport shape only. It is not
the primary realtime or `SchedulingContext` ABI and must not replace
`deadline_ns` for admitted realtime work.

ABI negotiation uses both bootstrap metadata and a runtime query surface:

```rust
struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}
```

- Process bootstrap passes the ring ABI version and fixed entry sizes alongside
  the ring address.
- `RuntimeBootInfo`, ring ABI version constants, and fixed SQE/CQE layouts live
  in `capos-config/src/ring.rs`; the kernel and `capos-rt` import the same
  definition rather than carrying local copies.
- A future `RuntimeInfo`/`SystemInfo` query returns the kernel-supported ring
  ABI range so language runtimes can fail before mapping incompatible rings.
- `cap_enter` rejects unsupported SQE versions or entry sizes with stable
  transport errors such as `CAP_ERR_UNSUPPORTED_RING_ABI` and
  `CAP_ERR_UNSUPPORTED_SQE_VERSION`.
- Runtimes in Rust, C, Go, and other languages must generate or mirror the
  exact fixed layout for the negotiated version.
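On the runtime side, the negotiation above reduces to comparing the bootstrap metadata against the layout the runtime was built with before any ring memory is touched. The sketch below assumes a hypothetical `CompiledRingLayout` describing what a runtime build supports and a hypothetical `RingAbiMismatch` error type; neither is part of `capos-config`, and the mapping to the `CAP_ERR_*` codes is left open.

```rust
/// Bootstrap metadata, with the fields listed in this section.
#[derive(Clone, Copy, Debug)]
pub struct RuntimeBootInfo {
    pub ring_addr: u64,
    pub ring_abi_version: u32,
    pub sqe_size: u16,
    pub cqe_size: u16,
}

/// Hypothetical description of what this runtime build was compiled against.
pub struct CompiledRingLayout {
    pub min_abi_version: u32,
    pub max_abi_version: u32,
    pub sqe_size: u16,
    pub cqe_size: u16,
}

/// Hypothetical runtime-side error; not one of the kernel's CAP_ERR codes.
#[derive(Debug, PartialEq, Eq)]
pub enum RingAbiMismatch {
    VersionOutOfRange { offered: u32 },
    EntrySize { sqe_size: u16, cqe_size: u16 },
}

/// Check the offered ring before mapping or dereferencing `ring_addr`.
pub fn check_ring_abi(
    boot: &RuntimeBootInfo,
    rt: &CompiledRingLayout,
) -> Result<(), RingAbiMismatch> {
    if boot.ring_abi_version < rt.min_abi_version || boot.ring_abi_version > rt.max_abi_version {
        return Err(RingAbiMismatch::VersionOutOfRange { offered: boot.ring_abi_version });
    }
    // Sizes must match exactly: a fixed layout cannot be partially understood.
    if boot.sqe_size != rt.sqe_size || boot.cqe_size != rt.cqe_size {
        return Err(RingAbiMismatch::EntrySize {
            sqe_size: boot.sqe_size,
            cqe_size: boot.cqe_size,
        });
    }
    Ok(())
}
```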

Suggested flags:

```text
DROP_IF_LATE:
  if now > deadline_ns before dispatch, post DEADLINE_EXPIRED

ALLOW_LATE:
  dispatch anyway, but CQE/telemetry marks late

PROPAGATE_DEADLINE:
  endpoint CALL/RETURN carries deadline metadata to server-side request

DEADLINE_ORDERED:
  SQPOLL may reorder within a bounded window only when all reorder-safety
  checks below pass

NO_BLOCKING_PATH:
  reject if target method/op is not declared realtime-safe
```
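The interaction of `DROP_IF_LATE` and `ALLOW_LATE` at dispatch time can be sketched as a single pre-dispatch decision. The bit positions below are placeholders, and two behaviors are assumptions of this sketch rather than decisions of the proposal: `DROP_IF_LATE` wins when both flags are set, and a late request with neither flag is dispatched and marked late.

```rust
// Placeholder bit assignments; the proposal does not fix flag values.
pub const DROP_IF_LATE: u32 = 1 << 0;
pub const ALLOW_LATE: u32 = 1 << 1;

#[derive(Debug, PartialEq, Eq)]
pub enum DispatchDecision {
    /// On time, or no deadline set.
    Dispatch,
    /// Late, dispatched anyway; CQE/telemetry should mark it late.
    DispatchMarkedLate,
    /// Late and dropped; the caller posts a deadline-expired completion.
    DeadlineExpired,
}

pub fn pre_dispatch(now_ns: u64, deadline_ns: u64, qos_flags: u32) -> DispatchDecision {
    // deadline_ns == 0 means "no deadline" in the SQE layout above.
    if deadline_ns == 0 || now_ns <= deadline_ns {
        return DispatchDecision::Dispatch;
    }
    if qos_flags & DROP_IF_LATE != 0 {
        // Assumption: drop takes precedence if both flags are set.
        DispatchDecision::DeadlineExpired
    } else {
        // Covers ALLOW_LATE and, by assumption, the no-flag case.
        DispatchDecision::DispatchMarkedLate
    }
}
```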

Do not put budget, period, priority, criticality, or CPU affinity into each
SQE. Deadline is per request. Budget is execution authority.

`DEADLINE_ORDERED` is valid only when all of the following are true:

```text
the ring mode permits reordering
the SQE marks this request reorderable
the target capability interface and method declare reorder-safe semantics
the reordering window is bounded
the operation does not depend on earlier same-ring requests for correctness
```

Ordered side effects such as `write A; write B; flush` or `lock; mutate;
unlock` must not be deadline-reordered unless the target method contract
explicitly defines that sequence as reorder-safe.

## Layer 5: SchedulingContext

CPU time should become a capability-controlled object:

```rust
struct SchedulingContext {
    budget_ns: u64,
    period_ns: u64,
    relative_deadline_ns: u64,
    priority: u16,
    criticality: u8,
    cpu_mask: CpuSet,
    overrun_policy: OverrunPolicy,
    timeout_endpoint: Option<EndpointRef>,
}
```

Kernel responsibilities:

- decrement remaining budget by actual runtime;
- replenish budget by period;
- throttle or fault a thread on depletion;
- enforce CPU mask and scheduling eligibility;
- dispatch among eligible contexts by the selected realtime policy;
- prevent untrusted SQE bytes from minting budget.
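The first three kernel responsibilities amount to a small budget account per context. The following is a minimal sketch, assuming a hypothetical `BudgetAccount` type and a simple replenishment rule (refill to the full budget once per period, unused budget does not carry over); the proposal does not fix the replenishment policy.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum BudgetState {
    Runnable,
    Depleted,
}

/// Hypothetical accounting state for one scheduling context.
pub struct BudgetAccount {
    budget_ns: u64,
    period_ns: u64,
    remaining_ns: u64,
    next_replenish_ns: u64,
}

impl BudgetAccount {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        assert!(period_ns > 0 && budget_ns <= period_ns);
        Self { budget_ns, period_ns, remaining_ns: budget_ns, next_replenish_ns: now_ns + period_ns }
    }

    fn state(&self) -> BudgetState {
        if self.remaining_ns == 0 { BudgetState::Depleted } else { BudgetState::Runnable }
    }

    /// Charge runtime measured between two scheduler boundaries.
    pub fn charge(&mut self, ran_ns: u64) -> BudgetState {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
        self.state()
    }

    /// Refill once the period boundary has passed, skipping missed periods.
    pub fn replenish(&mut self, now_ns: u64) -> BudgetState {
        if now_ns >= self.next_replenish_ns {
            let elapsed_periods = (now_ns - self.next_replenish_ns) / self.period_ns + 1;
            self.next_replenish_ns += elapsed_periods * self.period_ns;
            self.remaining_ns = self.budget_ns;
        }
        self.state()
    }

    pub fn next_replenish_ns(&self) -> u64 {
        self.next_replenish_ns
    }
}
```

A `Depleted` result is where the overrun policy would apply (throttle or fault); the account itself carries no policy.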

Policy-service responsibilities:

- admission control;
- budget/period/priority selection;
- CPU-isolation lease policy;
- overload response;
- telemetry and retuning.

## Layer 6: Donation

Synchronous capability calls need scheduling-context donation:

```text
client SchedulingContext -> passive server endpoint
server runs on donated budget/deadline
context returns on reply
timeout/overrun reports to caller or island policy
```

Without donation or inheritance, a realtime caller can be defeated by a
normal-priority server that holds the capability implementation path.

Donation semantics must be fixed before implementation:

```text
max donation call depth:
  bounded per SchedulingContext or RealtimeIsland; overflow fails closed.

nested donation:
  nested synchronous calls carry the current donated context until the depth
  bound, unless a callee uses its own admitted context by explicit policy.

cycle handling:
  a donated context may not re-enter a thread already on its donation stack;
  cycles fail with a typed realtime/donation error.

partial failure:
  budget already consumed stays charged to the context that ran the work.
  rollback of authority or memory is separate from CPU charge rollback.

timeout propagation:
  the earliest of request deadline, scheduling-context deadline, and explicit
  call timeout bounds downstream execution.

server-side blocking:
  a passive server running on donated context may block only on approved
  realtime-safe waits or synchronous calls that continue donation.

return on exception:
  application exceptions, transport errors, and cancellation return the
  context to its previous owner before CQE/error delivery.

async endpoint queues:
  donation does not cross ordinary async endpoint enqueue by default. Async
  donation requires an explicit future token/lease design.
```
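Three of these rules (depth bound, cycle rejection, and timeout propagation) are mechanical enough to sketch. The code below assumes a hypothetical fixed-capacity `DonationStack` keyed by plain thread ids and a free function `effective_deadline_ns`; both names are illustrative, and the order in which cycle and depth checks run is a choice of this sketch.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DonationError {
    DepthExceeded,
    Cycle,
}

/// Threads currently on one context's donation path, oldest first.
pub struct DonationStack<const MAX_DEPTH: usize> {
    threads: [u32; MAX_DEPTH],
    len: usize,
}

impl<const MAX_DEPTH: usize> DonationStack<MAX_DEPTH> {
    pub const fn new() -> Self {
        Self { threads: [0; MAX_DEPTH], len: 0 }
    }

    /// Record that the context is now lent to `callee_thread`.
    pub fn donate_to(&mut self, callee_thread: u32) -> Result<(), DonationError> {
        // Re-entering a thread already on the path is a cycle.
        if self.threads[..self.len].contains(&callee_thread) {
            return Err(DonationError::Cycle);
        }
        // Overflow fails closed instead of growing the stack.
        if self.len == MAX_DEPTH {
            return Err(DonationError::DepthExceeded);
        }
        self.threads[self.len] = callee_thread;
        self.len += 1;
        Ok(())
    }

    /// Reply, error, or cancellation: the innermost callee gives it back.
    pub fn return_from(&mut self) -> Option<u32> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1;
        Some(self.threads[self.len])
    }
}

/// Earliest of the request deadline, context deadline, and call timeout.
pub fn effective_deadline_ns(
    request: Option<u64>,
    context: Option<u64>,
    call_timeout: Option<u64>,
) -> Option<u64> {
    [request, context, call_timeout].iter().flatten().copied().min()
}
```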

Hot admitted paths should avoid blocking locks. If a shared resource cannot be
modeled as a passive service, it needs a reviewed priority/deadline-inheritance
primitive or a bounded try-lock/fail/drop policy.

## Layer 7: RealtimeIsland

`RealtimeIsland` admits a whole loop or graph:

```rust
struct RealtimeIslandSpec {
    period_ns: u64,
    deadline_ns: u64,
    cpu_set: CpuSet,
    nodes: Vec<NodeBudget>,
    rings: Vec<RingSpec>,
    memory: Vec<PreallocSpec>,
    devices: Vec<DeviceReservation>,
    overrun_policy: OverrunPolicy,
}
```

Admission requires:

- total budget fits period/deadline constraints;
- all hot-path buffers are preallocated;
- hot-path memory is committed and resident before start;
- guaranteed hot-path memory uses the OOM proposal's `MemoryResidency` policy
  as `pinned` or `secret`; `normal` memory is not admitted for guaranteed hot
  paths. A future lock-resident operation may transition ordinary memory into a
  pinned reservation before admission, but the admitted island sees the result
  as `pinned`, not as `normal`;
- all caps and policy decisions are resolved before start;
- no expected page faults on the hot path;
- no unbounded lock acquisition;
- no blocking endpoint calls inside callback loops;
- no allocation, logging, service discovery, or provider credential work on
  the realtime path;
- IRQ and deferred work are bounded or moved outside the island.
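The first admission requirement can be illustrated with the most conservative possible test: treat every node as running serially on one CPU and require the summed budgets to fit. This is a sketch under stated assumptions: `NodeBudget` is reduced to a single field, `deadline_ns <= period_ns` is assumed, and a real admission test over a `cpu_set` with parallel or pipelined nodes would be less pessimistic.

```rust
/// Reduced stand-in for the island's per-node budget entry.
pub struct NodeBudget {
    pub budget_ns: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdmissionError {
    DeadlineExceedsPeriod,
    BudgetOverflow,
    BudgetExceedsDeadline,
}

/// Serial, single-CPU admission test. Returns the slack before the deadline.
pub fn admit_budgets(
    period_ns: u64,
    deadline_ns: u64,
    nodes: &[NodeBudget],
) -> Result<u64, AdmissionError> {
    // Assumption of this sketch: constrained deadlines only.
    if deadline_ns > period_ns {
        return Err(AdmissionError::DeadlineExceedsPeriod);
    }
    // Checked addition so hostile specs cannot wrap the sum.
    let total = nodes
        .iter()
        .try_fold(0u64, |acc, n| acc.checked_add(n.budget_ns))
        .ok_or(AdmissionError::BudgetOverflow)?;
    if total > deadline_ns {
        return Err(AdmissionError::BudgetExceedsDeadline);
    }
    Ok(deadline_ns - total)
}
```

The remaining requirements in the list (residency, resolved capabilities, bounded locking) are not arithmetic and are not modeled here.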

Failure semantics must be typed:

```text
CAP_ERR_DEADLINE_EXPIRED
CAP_ERR_BUDGET_EXHAUSTED
CAP_ERR_REALTIME_UNSAFE_PATH
CAP_ERR_REALTIME_ADMISSION_DENIED
CAP_ERR_OVERRUN
CAP_ERR_STALE_INPUT
```

CQE/status should distinguish not-started-late, completed-late, dropped by
policy, throttled, and dependency-cancelled.

## Policy-Service Userstories: AutoNoHz Placement for Compute-Capable Threads

The Layer 1-7 primitives above are mechanism: `NoHzEligibility` is a reviewed
claim, `CpuIsolationLease` is the placement authority, `SchedulingContext` and
the coarse `ResourceLedger` own CPU-time budget, and `NoHzActivation` is the
scheduler proof that current CPU state allows tick suppression. They do not
answer *who decides* to issue an eligibility hint for an ordinary user thread
that was not pre-declared as a realtime island or kernel SQPOLL worker, or
*what observation* justifies the issuance. That decision is policy, and it
belongs in the user-space scheduler policy service described in
[Stage 7 of `scheduler-evolution-proposal`](scheduler-evolution-proposal.md#stage-7-user-space-scheduler-policy).
This section records the userstories that motivate the responsibility and the
bounds the policy service must enforce so auto-promotion never becomes an
implicit "unlimited CPU-hold" grant.

### Core property: promotion is placement, not budget

Auto-promotion adds isolation; it never mints CPU-time authority. A
policy-issued `CpuIsolationLease` only removes tick and scheduler noise while
its bound thread consumes time that was already authorized through its
`SchedulingContext` or coarse `ResourceLedger`. `SchedulingContext` budget
exhaustion is now folded into the same nearest-deadline timer as nohz
revocation/timer work, so a tick-masked CPU is re-observed at the budget
deadline rather than at a later periodic tick. When the budget is exhausted, or
when any Layer 2/3 activation obligation stops holding, the existing fail-closed
rollback path restores the periodic tick. Priority-aware revocation of the lease
itself when an equal-or-higher-priority runnable arrives is new Phase H surface
(see "Bounds the policy service must enforce" below); today's Phase F rollback
only restores ticks on the leased CPU and does not terminate the lease.

This separation answers the obvious objection. A busy-spinning thread cannot
escalate itself into permanent CPU exclusivity, because the spin drains its
allotted budget at the same rate periodic scheduling would have drained it.
If the operator has granted enough budget to saturate a core, auto-promotion
removes tick interference while that budget is consumed; if not, the same
authority that would have throttled the thread under periodic scheduling
still throttles it under nohz.

### Trigger: "thread appears capable of utilizing a full CPU core"

The trigger is not a fixed percentage threshold inside the kernel. The kernel
exports per-thread observation; the policy service synthesizes a
saturation-capability signal from those observations and decides what
"capable of utilizing a full CPU core" means for a given account, session,
or service profile. Plausible inputs the policy service may combine:

- runtime accumulated over a rolling window approaches the wall-clock window
  the thread had on its assigned CPU;
- voluntary-block count over the same window stays low (the thread is not
  IPC- or IO-bound at a rate that would lose the benefit);
- runnable-but-not-running time stays low when the thread is the only
  runnable entity on its CPU, or correlates with placement contention rather
  than IO when it is not.

Concrete window length, smoothing, and the synthesis rule are policy-service
choices, replaceable without ABI churn. As of 2026-05-30 the kernel exports
the observation inputs the heuristic consumes as ordinary (non-`measure`)
per-thread state: `runtime_ns`/`virtual_runtime_ns`, `voluntary_blocks`,
`preemptions`, and a cumulative `runnable_accumulated_ns`
(runnable-but-not-running time) are all returned by
`SchedulingPolicyCap.snapshot @2`. `voluntary_blocks` and `preemptions` were
promoted out of `cfg(feature = "measure")` and `runnable_accumulated_ns` was
added at the run-queue enqueue/select boundary; only `migrations` remains
`measure`-gated. This closes the Phase H "monitoring/status surface that
exports per-thread saturation observation" prerequisite. The surface exports
raw cumulative counters only: no fixed threshold and no windowing live in the
kernel -- the policy service synthesizes the saturation signal.
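
One way a policy service might turn two cumulative snapshots into a
saturation decision is sketched below. The counter field names follow the
snapshot fields listed above; `ThreadCounters`, `SaturationProfile`,
`appears_saturated`, and every threshold are assumptions made for
illustration, not the landed policy.

```rust
/// Cumulative per-thread counters taken at one point in time.
/// Hypothetical stand-in for the snapshot result.
#[derive(Clone, Copy, Debug)]
pub struct ThreadCounters {
    pub runtime_ns: u64,
    pub voluntary_blocks: u64,
    pub runnable_accumulated_ns: u64,
}

/// Hypothetical per-profile thresholds; the values are policy choices.
#[derive(Clone, Copy, Debug)]
pub struct SaturationProfile {
    /// Minimum share of the window spent running, in percent.
    pub min_runtime_pct: u64,
    /// Maximum voluntary blocks tolerated inside the window.
    pub max_voluntary_blocks: u64,
    /// Maximum share of the window spent runnable but not running, in percent.
    pub max_runnable_wait_pct: u64,
}

/// Windowing happens here, in the policy service: the kernel only supplied
/// the raw cumulative counters in `prev` and `cur`.
pub fn appears_saturated(
    prev: &ThreadCounters,
    cur: &ThreadCounters,
    window_ns: u64,
    profile: &SaturationProfile,
) -> bool {
    if window_ns == 0 {
        return false;
    }
    let runtime = cur.runtime_ns.saturating_sub(prev.runtime_ns);
    let blocks = cur.voluntary_blocks.saturating_sub(prev.voluntary_blocks);
    let waited = cur
        .runnable_accumulated_ns
        .saturating_sub(prev.runnable_accumulated_ns);
    // Widen to u128 so the percentage math cannot overflow.
    let pct = |part: u64| ((part as u128) * 100 / (window_ns as u128)) as u64;
    pct(runtime) >= profile.min_runtime_pct
        && blocks <= profile.max_voluntary_blocks
        && pct(waited) <= profile.max_runnable_wait_pct
}
```

Swapping the profile or the window length changes the decision without
touching the counters, which is the "replaceable without ABI churn" property.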

### User stories

1. **Long-running compute tenant with declared budget.** A model-training,
   video-encoding, or HPC build job is admitted with a `SchedulingContext`
   (or coarse `ResourceLedger` allocation) sized for sustained near-core
   utilization on a declared CPU pool. The policy service observes the
   thread saturating the pool's CPU share, issues a bounded
   `CpuIsolationLease` against the pool, the scheduler proves the activation
   obligations from Layer 2/3, and ticks are suppressed for as long as the
   thread keeps consuming the granted budget. The lease ends when the budget
   exhausts, the job completes, the operator revokes the pool, or the
   saturation signal subsides.

2. **Userspace poller that earned isolation.** A service polls a ring or
   device queue (a candidate `AutoUserspacePoller` in the `NoHzKind`
   taxonomy). The policy service sees consistent saturation with low
   voluntary blocking, recognizes the `AutoUserspacePoller` eligibility kind,
   and issues a lease. The bounds are the same as for the kernel SQPOLL path;
   only the consumer differs.

3. **Account-scoped auto-claim pool.** An operator pre-declares "account X
   may auto-claim up to N isolated CPUs from pool P, maximum auto-lease
   lifetime L, with revocation latency R, charging to ledger E." The policy
   service monitors threads owned by X, issues leases against P when
   saturation capability is observed, and refuses promotion when X already
   holds N leases or when no CPU in P currently satisfies the activation
   proof. Without the operator declaration the policy service does not
   auto-promote.

4. **Background agent that bursts to full-core compute.** A general-purpose
   agent process does not normally saturate a core. When it briefly does
   (a planning phase, a build step, a local inference call), the policy
   service may issue a short-lifetime lease if the agent's account has
   authorized auto-promotion. When the burst ends the signal subsides; the
   lease is not renewed.

### Bounds the policy service must enforce

For every auto-issued lease the policy service records:

```text
lifetime_ns:               bounded; shorter than admin-issued leases by
                           default; renewal requires re-observing the
                           saturation signal.
max_revocation_latency_ns: bounded by NoHzEligibility.max_revocation_latency_ns;
                           cannot exceed the operator/account policy.
accounting_target:         a live SchedulingContext or coarse ResourceLedger;
                           the lease does not mint CPU-time authority.
auto_claim_pool:           the pre-authorized CPU set; no implicit fallback to
                           system-wide isolation.
fairness_preemption:       another runnable entity at equal-or-higher policy
                           priority terminates the lease if no other CPU
                           authorized by both the pool and lease mask is
                           eligible.
```

Two of these bounds map to existing kernel-enforced surfaces:
`max_revocation_latency_ns` is already a field on `NoHzEligibility` and the
closed Phase F activation preflight; `accounting_target` is already a field
on `NoHzEligibility` and the live `SchedulingContext`/`ResourceLedger`
authority.

The other three bounds needed new kernel-enforced surfaces before the
heuristic could ship and were named as Phase H prerequisites. Their status:

- `lifetime_ns`: LANDED 2026-05-30. `CpuIsolationLeaseSpec` now carries
  `leaseLifetimeNs @6` (`0` = no expiry, the default). A lease records an
  absolute monotonic `expires_at_ns` at creation; the first observation past
  the deadline auto-revokes through the existing generation-advancing cleanup
  (`reason=lease-expired`), and the nohz activation record carries the
  lifetime deadline so a tickless CPU rolls back at the next timer/IPI recheck
  (`lease-lifetime-expired`), bounded by `maxRevocationLatencyNs`. This is the
  bounded-lifetime guarantee the auto-issued placement lease needs, so a
  compromised, blocked, or malfunctioning policy service cannot leave an
  auto-issued lease holding the CPU indefinitely. The bounded renewal
  *primitive* LANDED on top of this: `CpuIsolationLease.renew @4` pushes
  `expires_at_ns` forward to `now + leaseLifetimeNs` (clamped to the same
  one-hour ceiling `read_spec` enforces), keeping the same `(leaseId,
  generation)`, accounting binding, and nohz activation state -- distinct from
  re-minting a fresh lease. It is callable only before expiry (a revoked,
  auto-revoked, or past-deadline lease stays `staleGeneration` and is not
  resurrected; an unbounded `leaseLifetimeNs = 0` lease reports `notRenewable`),
  and the renewed deadline is propagated to a tickless CPU's nohz activation
  record so the `lease-lifetime-expired` disqualifier no longer rolls it back at
  the old deadline; `CpuIsolationLeaseInfo.expiresAtNs` echoes the deadline
  read-only. Only the Phase H renewal *heuristic* -- re-observing the saturation
  signal to decide *whether* to call `renew` on a near-expiry lease -- remains
  future policy-service work on top of this primitive.
- `auto_claim_pool` and per-account capacity (`N` in user story 3): the
  operator-declared CPU-pool descriptor LANDED 2026-05-30, making a non-default
  `poolId` meaningful for the first time. `CpuIsolationLeaseSpec` carries
  `poolId @7` (`0` = the implicit default pool over every scheduler CPU), and
  the kernel seeds a fixed declared-pool registry (`CpuIsolationPoolDescriptor`:
  the default pool `0` plus exactly one declared non-default pool `1` over a
  single CPU). The create-time admission gate now looks the pool up: an
  undeclared `poolId` is rejected `invalidSpec`; a declared pool whose CPU mask
  the lease's `allowedCpuMask` exceeds is rejected `invalidSpec`; a declared
  pool with a subset mask is admitted and its id/mask are echoed read-only
  through `CpuIsolationLeaseInfo` (`admittedPoolId`/`admittedPoolCpuMask`)
  (proof `make run-scheduler-cpu-isolation-lease`: `nondefault_pool=invalidSpec`
  for the undeclared id, `declared_pool=ok admitted_pool_id=1
  admitted_pool_cpu_mask_subset=true`, `declared_pool_mask_violation=invalidSpec`,
  `default_pool_id=0`). The declared-pool table is now operator-sourced
  (LANDED 2026-05-30): the kernel installs it from the boot manifest
  `SystemConfig.cpuIsolationPools @14` (a `List(CpuIsolationPoolDescriptor)`),
  with the in-kernel constant as the fail-closed default when the manifest
  omits the list, and validates each entry fail-closed at boot (canonical CPU
  mask subset of the scheduler mask, default pool `0` synthesized if omitted,
  duplicate ids rejected). The boot line `cpu-isolation: declared-pools
  source=manifest count=3 default_pool_id=0 nondefault_pool_id=1
  nondefault_pool_cpu_mask=0x2` proves the source (proof
  `make run-scheduler-cpu-isolation-lease`; the kernel-default fallback is
  proven by `cargo test-config` decode/empty assertions). The descriptor now
  also carries a per-pool live-lease capacity bound (`poolMaxLeases @2`, LANDED
  2026-05-31): a non-zero value caps the number of simultaneously live
  (non-revoked, current-generation) leases the kernel admits against that pool
  at create-time, counted from the existing `LEASE_REGISTRY` after `prune_dead`,
  rejecting an over-capacity create fail-closed `resourceExhausted` (`0` =
  unbounded, preserving the default pool `0` and every existing producer). The
  manifest bounds pool `2` at `poolMaxLeases: 2`; the proof admits two live
  leases, refuses a third non-overlapping create (`cpu-isolation:
  pool-capacity-rejected admitted_pool_id=2 live_leases=2 pool_max_leases=2
  result=resourceExhausted`, `pool_capacity_exceeded=resourceExhausted`), then
  reclaims after a revoke (`pool_capacity_reclaimed=ok`), proving the bound is
  live-count not cumulative. The account identity and per-account `N` then
  landed on top of this counter (LANDED 2026-05-31): `CpuIsolationLeaseSpec`
  carries `accountId @8 :UInt64` (`0` = unattributed, caller-asserted and inert
  until counted, echoed read-only through `CpuIsolationLeaseInfo.accountId @6`)
  and `CpuIsolationPoolDescriptor` carries `poolMaxLeasesPerAccount @3 :UInt32`
  (`0` = unbounded per account). After the pool-wide check, `register` counts
  the requesting account's live entries (matching both `admitted_pool_id` and
  `account_id`) against the per-account bound and rejects an over-bound create
  fail-closed `resourceExhausted` (`0` account or `0` bound skips the gate). The
  manifest bounds pool `2` at `poolMaxLeasesPerAccount: 1`; the proof admits one
  account-7 lease, refuses a second account-7 create (`cpu-isolation:
  account-capacity-rejected admitted_pool_id=2 account_id=7
  account_live_leases=1 pool_max_leases_per_account=1 result=resourceExhausted`,
  `account_capacity_exceeded=resourceExhausted`), admits a different account-9
  lease on that CPU (`account_capacity_other_account=ok` -- per-account, not
  pool-wide), and reclaims after revoking account-7
  (`account_capacity_reclaimed=ok`). The account id is caller-asserted on the
  plain lease path. The authentication half LANDED 2026-05-31: `CpuIsolationPoolGrant`
  (`schema/capos.capnp`; source `cpu_isolation_pool_grant`; kernel
  `kernel/src/cap/cpu_isolation_pool_grant.rs`) introduced a bootstrap-staged
  grant that binds one authenticated account to one declared pool. Its
  `createLease` stamps the bound account/pool onto the minted lease, overriding
  any caller-asserted `accountId`/`poolId`, and reuses the same lease-create
  admission path (`cpu_isolation::create_lease_for_caller`) -- so the per-account
  bound is unforgeable by cap-possession: a holder cannot assert another account
  to evade `poolMaxLeasesPerAccount`.
  The initial single-grant proof used account `7` bound to pool `2`; the current
  `make run-scheduler-cpu-isolation-pool-grant` proof boots manifest-declared
  grants. The grant binding is now operator-declared (LANDED 2026-06-01): the
  manifest `SystemConfig.cpuIsolationPoolGrants` table seeds the bound
  `(account, pool)` pairs (mirroring the `cpuIsolationPools` table), and the
  `cpu_isolation_pool_grant` / `cpu_isolation_pool_grant_secondary` sources stage
  seeded binding index `0` / `1`, so an operator can pre-authorize multiple
  distinct accounts/pools, each staged as its own bootstrap grant cap. An
  absent/empty list falls back to one in-kernel binding at index `0`: account `7`
  bound to preferred pool `1` when active, otherwise account `7` bound to
  synthesized default pool `0`, preserving a usable single default grant when a
  manifest-sourced pool table omits pool `1`.
  `make run-scheduler-cpu-isolation-pool-grant` now boots a two-entry table
  (account `5`/pool `1`, account `8`/pool `2`) and proves each grant stamps its
  OWN bound account with the per-account bound still enforced.
  `make run-scheduler-cpu-isolation-pool-grant-default` boots the empty-list
  fallback with pool `1` omitted and proves the synthesized `(account 7, pool 0)`
  grant is usable.
  Runtime grant minting landed 2026-06-02 22:24 UTC (`CpuIsolationGrantMinter`): one cap
  mints a fresh `CpuIsolationPoolGrant` for an operator-chosen `(account, pool)`
  at call time, bounded by the declared
  `SystemConfig.cpuIsolationGrantMinterAllowlist` (an out-of-allowlist pair is
  refused `unauthorized`, so the minter is never an ambient grant-any authority;
  the minted grant reuses the same unforgeable `createLease` admission path). The
  same `make run-scheduler-cpu-isolation-pool-grant` smoke mints a grant for the
  allowed `(account 6, pool 2)`, proves its `createLease` stamps account `6` and
  stays bounded by the per-account gate, and proves an out-of-allowlist mint is
  refused.
  Grant-revocation lifecycle landed 2026-06-03 17:11 UTC
  (`CpuIsolationGrantMinter.revokeGrant`), closing (c): a runtime-minted grant
  carries a revocable `(grantId, generation)`; `revokeGrant(grantId)` advances the
  grant generation so a stale grant handle's `createLease` fails `staleGeneration`
  and mints nothing, and revocation cascades to every live lease minted through
  that grant -- reusing the landed fairness-termination cleanup
  (`reason=grant-revoked`, periodic-tick rollback, registry unregister) once per
  tagged lease, so per-pool/per-account live-lease capacity frees immediately and a
  fresh grant is admitted into the reclaimed slot. Double-revoke is
  `alreadyRevoked` and an unknown `grantId` is `unknownGrant`, both fail-closed;
  seeded bootstrap grants are not minter-owned and stay un-revocable. The same
  `make run-scheduler-cpu-isolation-pool-grant` smoke proves the full lifecycle.
  No pool authority is minted from holding a lease cap; the kernel stays the
  fail-closed admission gate.
- `fairness_preemption`: LANDED 2026-06-02 21:17 UTC. The Phase F rollback path now
  compares policy priority at the existing nohz recheck site: when a second
  runnable entity appears on the leased CPU at equal-or-higher WFQ policy
  priority (`latency_class`, `weight`) than the captured leased thread, and no
  sibling CPU authorized by both the admitted pool and the lease `allowedCpuMask`
  is eligible to host the lease, the kernel terminates the `CpuIsolationLease`
  itself (`fairness-preempted ... result=lease-terminated`) rather than only
  restoring the periodic tick, bounded by `maxRevocationLatencyNs`. The
  termination runs the same generation-advancing cleanup `leaseLifetimeNs`
  expiry uses (`reason=fairness-preempted`) immediately after the scheduler
  restores the periodic tick, so a subsequent `info`/`revoke` reports
  `staleGeneration` and placement/account capacity is freed without waiting for
  the holder's next cap call; a strictly-lower arrival or an eligible sibling
  CPU inside both masks keeps the existing tick-restore-only behavior. The
  kernel supplies the comparison and fail-closed termination; the policy service
  remains the issuer and bookkeeper of the saturation signal. Re-placement of the
  leased thread onto an eligible sibling CPU (instead of terminating) remains
  generic-full-nohz work; the "no sibling eligible" condition is recorded.
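
The `fairness_preemption` decision described in the last bullet can be
sketched as a pure function. This is a minimal model, not the kernel code:
`PolicyPriority`, `FairnessOutcome`, and `on_second_runnable` are hypothetical
names, and the ordering (larger `latency_class` first, then larger `weight`,
meaning higher priority) is an assumption the source does not spell out.

```rust
/// WFQ policy priority. ASSUMPTION: derived lexicographic ordering on
/// (`latency_class`, `weight`), where a larger value means higher priority.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct PolicyPriority {
    pub latency_class: u8,
    pub weight: u32,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FairnessOutcome {
    /// Restore the periodic tick and keep the lease.
    RestoreTickOnly,
    /// Restore the tick and terminate the lease itself.
    TerminateLease,
}

/// Decide what happens when a second runnable entity lands on the leased CPU.
/// Masks use bit `n` for CPU `n`.
pub fn on_second_runnable(
    leased: PolicyPriority,
    arrival: PolicyPriority,
    pool_cpu_mask: u64,
    allowed_cpu_mask: u64,
    eligible_cpu_mask: u64,
    leased_cpu: u32,
) -> FairnessOutcome {
    // A strictly lower-priority arrival never terminates the lease.
    if arrival < leased {
        return FairnessOutcome::RestoreTickOnly;
    }
    // A sibling must sit inside BOTH the admitted pool and the lease mask,
    // be eligible right now, and differ from the CPU already leased.
    let this_cpu = 1u64.checked_shl(leased_cpu).unwrap_or(0);
    let siblings = pool_cpu_mask & allowed_cpu_mask & eligible_cpu_mask & !this_cpu;
    if siblings != 0 {
        FairnessOutcome::RestoreTickOnly
    } else {
        FairnessOutcome::TerminateLease
    }
}
```

Keeping the comparison free of side effects makes the three cases (lower
arrival, eligible sibling, no sibling) easy to cover in a host-side unit test.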

The policy service is the issuer and the bookkeeper of the synthesized
saturation signal; the kernel remains the authority gate, the activation
prover, and the fail-closed rollback path -- including for the three
Phase H surfaces above.
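
As an illustration of the kernel's role as the admission gate, the create-time
checks described above (declared pool, mask subset, pool-wide capacity,
per-account capacity) can be modeled as follows. The types and the `admit`
function are hypothetical simplifications; only the check order and the
`invalidSpec`/`resourceExhausted` outcomes come from the text.

```rust
/// Simplified pool descriptor; `0` in either bound means "unbounded".
#[derive(Clone, Copy, Debug)]
pub struct PoolDescriptor {
    pub pool_id: u32,
    pub cpu_mask: u64,
    pub pool_max_leases: u32,
    pub pool_max_leases_per_account: u32,
}

/// One live (non-revoked, current-generation) registry entry.
#[derive(Clone, Copy, Debug)]
pub struct LiveLease {
    pub admitted_pool_id: u32,
    pub account_id: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdmitError {
    InvalidSpec,
    ResourceExhausted,
}

/// Fail-closed create-time gate over the declared pools and live leases.
pub fn admit(
    pools: &[PoolDescriptor],
    live: &[LiveLease],
    pool_id: u32,
    allowed_cpu_mask: u64,
    account_id: u64,
) -> Result<(), AdmitError> {
    // An undeclared pool id is rejected.
    let pool = pools
        .iter()
        .find(|p| p.pool_id == pool_id)
        .ok_or(AdmitError::InvalidSpec)?;
    // The lease mask must be a subset of the pool mask.
    if (allowed_cpu_mask & !pool.cpu_mask) != 0 {
        return Err(AdmitError::InvalidSpec);
    }
    // Pool-wide live-lease bound.
    let pool_live = live.iter().filter(|l| l.admitted_pool_id == pool_id).count();
    if pool.pool_max_leases != 0 && pool_live >= pool.pool_max_leases as usize {
        return Err(AdmitError::ResourceExhausted);
    }
    // Per-account bound; account 0 or bound 0 skips the gate.
    if account_id != 0 && pool.pool_max_leases_per_account != 0 {
        let account_live = live
            .iter()
            .filter(|l| l.admitted_pool_id == pool_id && l.account_id == account_id)
            .count();
        if account_live >= pool.pool_max_leases_per_account as usize {
            return Err(AdmitError::ResourceExhausted);
        }
    }
    Ok(())
}
```

Counting from the live set on each call, rather than keeping a cumulative
counter, is what lets capacity return as soon as a lease is revoked.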

### Explicit non-goals

- The kernel does not contain a saturation-detection rule of its own. It
  exports observation; it does not synthesize the signal.
- Auto-promotion does not grant unlimited CPU-hold. The lease is bounded by
  lifetime, budget, revocation, and pool capacity; absent a pre-authorized
  pool, no auto-promotion occurs.
- Auto-promotion does not grant realtime authority. `RealtimeIsland`
  admission remains a separate, stricter path with preallocation, deadline,
  and no-blocking proofs.
- Auto-promotion does not bypass donation, fairness, or session-lifecycle
  invariants. Process exit, session logout, and explicit revoke still tear
  the lease down through the existing Layer 3 rollback.

## Telemetry Requirements

Tickless, nohz, SQPOLL, and realtime behavior must be observable through
future monitoring/status capability surfaces, not only through ad hoc debug
logs. The first counters should include:

```text
scheduler_tick_count{cpu}
ticks_suppressed{cpu,mode}
nohz_enter_count{cpu,kind}
nohz_exit_count{cpu,reason}
oneshot_deadline_miss_count
sqpoll_busy_ns
sqpoll_sleep_count
deadline_expired_count
budget_exhausted_count
realtime_overrun_count
donation_depth_max
housekeeping_offload_count
```

These counters are correctness evidence. Missing or surprising values should
fail focused nohz/realtime proofs rather than being treated as performance-only
diagnostics.

The `ticks_suppressed{cpu,mode}` / `scheduler_tick_count{cpu}` evidence is
realized as an asserted proof line on the lease path:
`make run-scheduler-cpu-isolation-lease` now counts genuine periodic LAPIC
fires per CPU (a fire is counted only when neither the lease-backed nor the
idle tick-suppression bit is set, so the one-shot replacement is never
miscounted) and, on lease nohz rollback, emits
`cpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w>
expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>`. The harness
asserts that over a bounded masked window the leased CPU recorded `actual` near
zero while `expected` was substantial -- the periodic tick demonstrably stopped,
not merely that the mask write was issued -- and that a bounded post-rollback
`cpu-isolation: nohz restored-rate` window shows the periodic rate returning.
This is bounded proof-line evidence, not yet a durable
`SchedulingPolicyCap`/monitoring telemetry field; the persistent
`ticks_suppressed` surface and the generic-full-nohz path's inheritance of the
same measured assertion remain future telemetry work.
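
The arithmetic behind the `suppressed-ticks` proof line is small enough to
sketch. The function and struct names below are hypothetical, and the tick
period is a parameter because this section does not fix it; the harness
thresholds (`min_expected`, `max_actual`) are likewise illustrative.

```rust
/// Fields mirrored from the proof line: expected, actual, and suppressed
/// periodic fires over one masked window.
#[derive(Debug, PartialEq, Eq)]
pub struct SuppressedTicks {
    pub expected_periodic: u64,
    pub actual_periodic: u64,
    pub suppressed: u64,
}

/// Derive the counts for a masked window. Returns `None` for a zero period.
pub fn suppressed_ticks(
    window_ns: u64,
    tick_period_ns: u64,
    actual_periodic: u64,
) -> Option<SuppressedTicks> {
    if tick_period_ns == 0 {
        return None;
    }
    let expected_periodic = window_ns / tick_period_ns;
    Some(SuppressedTicks {
        expected_periodic,
        actual_periodic,
        // suppressed = expected - actual, floored at zero.
        suppressed: expected_periodic.saturating_sub(actual_periodic),
    })
}

/// Harness-style check: the window must be long enough to be meaningful
/// (`expected` substantial) and almost no periodic fires may have occurred.
pub fn tick_demonstrably_stopped(s: &SuppressedTicks, min_expected: u64, max_actual: u64) -> bool {
    s.expected_periodic >= min_expected && s.actual_periodic <= max_actual
}
```

For example, with an assumed 10 ms period, a 500 ms masked window expects 50
periodic fires; observing one fire yields `suppressed=49`.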

## Implementation Sequence

1. Add timer/scheduler instrumentation around the existing periodic tick.
2. Add `monotonic_ns()` backed by a clocksource that is not derived from the
   scheduler tick, and switch `Timer.now` plus scheduler accounting to that
   clocksource while keeping periodic scheduling. Completed for normal
   QEMU/x86_64 by the Phase F clockevent/deadline substrate.
3. Convert timeout waiters to `deadline_ns`. Completed for `Timer.sleep`,
   finite `cap_enter`, and park timeouts by the Phase F
   clockevent/deadline substrate.
4. Add LAPIC one-shot programming, periodic restore state, and a focused
   one-shot smoke. Completed as a disabled-nohz substrate proof by the Phase F
   clockevent/deadline substrate.
5. Replace user-mode idle with kernel/per-CPU idle while keeping periodic
   ticks. Completed: the scheduler idle path is now a CPL0 per-CPU kernel idle
   thread and the user-mode idle process is gone (`docs/tasks/README.md`).
6. Enable tickless idle only when there is no runnable work. Completed by
   `docs/tasks/done/2026/scheduler-tickless-idle-step6.md`: true-idle CPUs
   with no runnable non-idle work, no active nohz lease, no local deferred
   cleanup, no cap-enter polling dependency, and a one-shot LAPIC clockevent
   mask the periodic tick and arm a bounded one-shot at the next
   `Timer`/`ParkSpace` deadline or the 100 ms idle housekeeping floor. The
   scheduler restores the periodic tick before ordinary non-idle
   dispatch, on reschedule IPIs, and on backend/refusal rollback. Cap-enter
   polling waiters and ready-but-budget-throttled `SchedulingContext` retry
   windows remain periodic until the legacy terminal/network/IRQ polling and
   scheduling-context retry surfaces move behind explicit deadlines or
   housekeeping placement.
7. Route the in-kernel virtio-net poll off a lease-isolated CPU to the
   housekeeping CPU (landed 2026-06-04); an explicit `NetworkPollClock` poll
   deadline remains the longer-term target.
8. Add ring mode state and refuse timer-side SQ processing for SQPOLL rings.
9. Land Ring v2 per-thread ring ownership and completion routing.
10. Add the SQPOLL wake/sleep protocol and a host or Loom-style lost-wakeup
    model.
11. Add kernel SQPOLL without full-nohz, under normal scheduler ticks.
12. Add CPU isolation leases and housekeeping CPU placement.
13. Prove SQPOLL progress through a wake/deadline path that does not depend on
    periodic scheduler ticks. Completed for bounded current-thread
    syscall/producer-wake progress by the Phase F SQPOLL nohz-progress child.
14. Enable SQPOLL nohz on isolated CPUs for explicitly leased caller-thread
    rings. Landed 2026-06-07 09:45 UTC; broader userspace-poller/device-queue
    policy issuance remains separate.
15. Add request `deadline_ns` metadata and typed late/drop CQE outcomes.
16. Add `SchedulingContext` and admission-controlled realtime islands.
17. Add generic full-nohz admission for ordinary budgeted compute threads
    through explicit `SchedulingContext`-targeted `CpuIsolationLease` preflight.
    Landed 2026-06-06 09:44 UTC; policy-service issuance remains separate.
18. Add the user-space policy-service AutoNoHz placement heuristic. The
    kernel exports per-thread saturation observation through the
    monitoring/status surface; the policy service synthesizes the "thread
    appears capable of utilizing a full CPU core" decision and issues
    bounded `CpuIsolationLease` grants against pre-authorized account or
    session CPU pools. The auto-revoke timeout primitive (`leaseLifetimeNs`)
    landed 2026-05-30 15:22 UTC at `84c1c5ba`, priority-aware fairness lease
    termination landed 2026-06-02 21:28 UTC at `cae825a4` with immediate release
    remediation at `ca28ef63`, runtime grant minting
    (`CpuIsolationGrantMinter`) landed 2026-06-02 22:25 UTC at `5c5c63cc`, and the
    grant-revocation lifecycle (`CpuIsolationGrantMinter.revokeGrant` with
    cascade-to-leases) landed 2026-06-03 17:11 UTC, completing the pool-grant
    authority surface. The local userspace policy-service proof landed
    2026-06-07: it reads the per-thread saturation counters, denies a
    voluntarily blocking worker, issues a finite grant-stamped full-nohz lease
    only after a saturated local window, renews only after re-observation, and
    lets stopped renewal expire fail-closed. A reusable production policy daemon
    with profile-driven smoothing, cross-process target discovery, and richer
    operator policy remains future work.
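
The renew-only-after-re-observation behavior in step 18 can be modeled on the
host. This is a sketch under assumptions: `Lease`, `RenewError`, and
`renew_if_still_saturated` are hypothetical names, the one-hour ceiling is
applied here by clamping the lifetime, and a call at exactly the deadline is
treated as expired.

```rust
/// One-hour ceiling on the effective lease lifetime.
pub const MAX_LEASE_LIFETIME_NS: u64 = 3_600_000_000_000;

#[derive(Debug, PartialEq, Eq)]
pub enum RenewError {
    /// Revoked, auto-revoked, or past-deadline lease; never resurrected.
    StaleGeneration,
    /// Lease was created with lifetime 0 (no expiry), so there is nothing to renew.
    NotRenewable,
}

#[derive(Clone, Copy, Debug)]
pub struct Lease {
    pub lease_lifetime_ns: u64,
    pub expires_at_ns: u64,
    pub revoked: bool,
}

impl Lease {
    /// Push the deadline to `now + lifetime`; callable only before expiry.
    pub fn renew(&mut self, now_ns: u64) -> Result<u64, RenewError> {
        if self.revoked {
            return Err(RenewError::StaleGeneration);
        }
        if self.lease_lifetime_ns == 0 {
            return Err(RenewError::NotRenewable);
        }
        if now_ns >= self.expires_at_ns {
            // First observation at or past the deadline revokes the lease.
            self.revoked = true;
            return Err(RenewError::StaleGeneration);
        }
        let lifetime = self.lease_lifetime_ns.min(MAX_LEASE_LIFETIME_NS);
        self.expires_at_ns = now_ns.saturating_add(lifetime);
        Ok(self.expires_at_ns)
    }
}

/// Policy-side rule: renew only when the saturation signal was re-observed.
/// Declining to renew lets the lease expire on its own (fail-closed).
pub fn renew_if_still_saturated(lease: &mut Lease, now_ns: u64, saturated: bool) -> Option<u64> {
    if !saturated {
        return None;
    }
    lease.renew(now_ns).ok()
}
```

If the policy service stalls or crashes, no call reaches `renew`, and the
existing deadline bounds how long the CPU stays isolated.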

## Verification

Tickless idle gates:

```text
make fmt-check
cargo test-lib
cargo test-config
make run-smoke
make run-spawn
```

Additional tickless proof:

```text
1 second idle interval does not produce 100 scheduler ticks
Timer.sleep still completes
cap_enter timeout still completes
ParkSpace timeout still completes
preemption fairness unchanged with runnable contention
```
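
The idle-interval and timeout proofs both depend on how an idle CPU picks its
one-shot deadline in step 6 of the implementation sequence. A minimal sketch
follows, reading "the next deadline or the 100 ms idle housekeeping floor" as
"whichever comes first"; `IdleCpuState` and `idle_oneshot_deadline` are
hypothetical names, not the scheduler's own.

```rust
/// Idle housekeeping bound from step 6: an idle CPU sleeps at most 100 ms.
pub const IDLE_HOUSEKEEPING_NS: u64 = 100_000_000;

/// Hypothetical view of the conditions step 6 lists for a true-idle CPU.
#[derive(Clone, Copy, Debug, Default)]
pub struct IdleCpuState {
    pub runnable_non_idle_work: bool,
    pub active_nohz_lease: bool,
    pub local_deferred_cleanup: bool,
    pub cap_enter_polling_dependency: bool,
    pub oneshot_clockevent: bool,
}

/// Returns the absolute one-shot deadline to arm, or `None` when the CPU
/// must keep its periodic tick.
pub fn idle_oneshot_deadline(
    state: &IdleCpuState,
    now_ns: u64,
    next_waiter_deadline_ns: Option<u64>,
) -> Option<u64> {
    let eligible = state.oneshot_clockevent
        && !state.runnable_non_idle_work
        && !state.active_nohz_lease
        && !state.local_deferred_cleanup
        && !state.cap_enter_polling_dependency;
    if !eligible {
        return None;
    }
    let housekeeping = now_ns.saturating_add(IDLE_HOUSEKEEPING_NS);
    // Wake for the earliest Timer/ParkSpace waiter, but never later than
    // the housekeeping bound.
    Some(match next_waiter_deadline_ns {
        Some(deadline) => deadline.min(housekeeping),
        None => housekeeping,
    })
}
```

Under this reading, a one-second idle interval with no waiters produces
roughly ten housekeeping wakeups rather than a periodic tick stream.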

SQPOLL gates:

```text
thread-lifecycle
timer-smoke
timer-flood
park wake/timeout
endpoint CALL/RECV/RETURN
mandatory host or Loom-style lost-wakeup model before any real SQPOLL worker:
  poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park
  producer: write SQE -> publish tail -> full barrier -> check NEED_WAKEUP -> wake
```
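
The two orderings in the lost-wakeup gate can be exercised on the host with
standard-library atomics. The sketch below is an illustrative model only:
`Ring`, `poller`, `submit`, and `run_demo` are hypothetical names, the SQE
write is elided, and `thread::park`/`unpark` stand in for the kernel's
park/wake path.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::{self, Thread};

/// Minimal shared ring state: the published tail and the wake-request flag.
pub struct Ring {
    tail: AtomicU64,
    need_wakeup: AtomicBool,
}

impl Ring {
    pub fn new() -> Self {
        Ring { tail: AtomicU64::new(0), need_wakeup: AtomicBool::new(false) }
    }
}

/// Poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park.
/// Returns the number of entries consumed once `total` have been seen.
pub fn poller(ring: &Ring, total: u64) -> u64 {
    let mut head = 0u64;
    while head < total {
        let tail = ring.tail.load(Ordering::Acquire);
        if tail != head {
            head = tail; // "process" every published entry
            continue;
        }
        ring.need_wakeup.store(true, Ordering::Relaxed);
        fence(Ordering::SeqCst); // full barrier
        if ring.tail.load(Ordering::Acquire) == head {
            // Still empty after the barrier: safe to sleep. A spurious
            // return is harmless because the loop rechecks the tail.
            thread::park();
        }
        ring.need_wakeup.store(false, Ordering::Relaxed);
    }
    head
}

/// Producer: write SQE (elided) -> publish tail -> full barrier ->
/// check NEED_WAKEUP -> wake.
pub fn submit(ring: &Ring, poller_thread: &Thread) {
    ring.tail.fetch_add(1, Ordering::Release);
    fence(Ordering::SeqCst); // full barrier
    if ring.need_wakeup.load(Ordering::Relaxed) {
        poller_thread.unpark();
    }
}

/// Runs one producer against one poller and returns the consumed count.
pub fn run_demo(total: u64) -> u64 {
    let ring = Arc::new(Ring::new());
    let poller_ring = Arc::clone(&ring);
    let handle = thread::spawn(move || poller(&poller_ring, total));
    for _ in 0..total {
        submit(&ring, handle.thread());
    }
    handle.join().expect("poller thread panicked")
}
```

The paired sequentially consistent fences mean at least one side observes the
other's store: either the poller sees the new tail and skips the park, or the
producer sees the flag and wakes it. A stress run like `run_demo` does not
replace the exhaustive Loom-style model the gate requires; it only shows the
shape of the protocol.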

Realtime gates:

```text
deadline ordering tests
budget depletion tests
donation/return tests through passive endpoint
admission denial tests
QEMU proof for late/drop/overrun behavior
telemetry counters prove ticks suppressed, deadlines expired, budgets
exhausted, and donation depth bounded as expected
```

## Decision

Adopt this staged direction:

```text
Tickless idle:
  yes, after the kernel/per-CPU idle context and activation proof. The
  clocksource/clockevent split is implemented.

Generic full-nohz:
  implemented for explicit budgeted compute leases targeting a live
  SchedulingContext. Automatic issuance and unbudgeted ordinary threads remain
  out of scope.

SQPOLL nohz:
  yes, for explicitly leased caller-thread rings whose SQPOLL poller is live,
  single-consumer, and bounded by producer wake plus rollback deadlines.

AutoNoHz placement for ordinary threads:
  yes, but only as a user-space policy-service decision that issues a
  bounded CpuIsolationLease against a pre-authorized CPU pool. The lease
  adds isolation; it never mints CPU-time authority. The "thread appears
  capable of utilizing a full CPU core" signal is synthesized in the
  policy service from observations the kernel exports as raw per-thread
  counters, not as a fixed kernel threshold.

Realtime:
  `SQE.deadline_ns` is useful metadata, but `SchedulingContext` is the
  authority that provides CPU time.
```
