# Research: Future Scheduler Architecture

This note records the prior art checked for future capOS scheduling work after
the first SMP and per-thread ring milestones exposed that scheduler structure,
not only timer programming, will decide whether capOS scales.

## Local Grounding

Existing capOS documents already cover part of the answer:

- [Scheduling](../architecture/scheduling.md): scheduler architecture including
  Phase D WFQ policy, Phase E `SchedulingContext`, Phase F `CpuIsolationLease`
  and CPL0 idle thread, LAPIC timer/IPI foundation, global run queue, per-thread
  rings, timer sleep waiters, park waiters, direct IPC handoff, and SMP state.
- [SMP Phase C](../backlog/smp-phase-c.md): active multi-CPU execution and
  in-process thread-scaling proof work.
- [SMP](../proposals/smp-proposal.md): accepted SMP direction.
- [Ring v2 For Full SMP](../proposals/ring-v2-smp-proposal.md): per-thread
  completion ownership and SQPOLL preconditions.
- [Tickless and Realtime Scheduling](../proposals/tickless-realtime-scheduling-proposal.md):
  tickless idle, SQPOLL nohz, deadlines, scheduling contexts, donation, and
  realtime islands.
- [Stateful Task and Job Graphs](../proposals/stateful-task-job-graphs-proposal.md):
  durable work graphs, assignment metadata, graph-run state, and the stop line
  that graph coordinators must not become authority-holding god objects.
- [NO_HZ, SQPOLL, and Realtime Scheduling](nohz-sqpoll-realtime.md): Linux
  NO_HZ, clocksource/clockevent, CPU isolation, SQPOLL, `SCHED_DEADLINE`,
  PREEMPT_RT, and seL4 MCS grounding.
- [Out-of-kernel scheduling](out-of-kernel-scheduling.md): kernel mechanism
  versus user-space policy split.
- [Completion rings and threaded runtimes](completion-ring-threading.md):
  completion ownership, `io_uring`, futex, and IOCP lessons.
- [Multimedia pipeline latency](multimedia-pipeline-latency.md) and
  [Robotics realtime control](robotics-realtime-control.md): admitted
  realtime-island use cases.

## External Sources Checked

- Linux kernel documentation,
  [CFS Scheduler](https://docs.kernel.org/scheduler/sched-design-CFS.html).
- Linux kernel documentation,
  [EEVDF Scheduler](https://www.kernel.org/doc/html/latest/scheduler/sched-eevdf.html).
- Linux kernel documentation,
  [Deadline Task Scheduling](https://www.kernel.org/doc/html/latest/scheduler/sched-deadline.html).
- Linux kernel documentation,
  [Extensible Scheduler Class](https://docs.kernel.org/scheduler/sched-ext.html).
- Linux kernel documentation,
  [NO_HZ: Reducing Scheduling-Clock Ticks](https://docs.kernel.org/timers/no_hz.html).
- Linux kernel documentation,
  [CPU Isolation](https://www.kernel.org/doc/html/next/admin-guide/cpu-isolation.html).
- Linux kernel documentation,
  [Housekeeping](https://www.kernel.org/doc/html/latest/core-api/housekeeping.html).
- Linux kernel documentation,
  [PREEMPT_RT theory of operation](https://docs.kernel.org/core-api/real-time/theory.html).
- FreeBSD manual,
  [ULE scheduler](https://man.freebsd.org/cgi/man.cgi?query=sched_ule&manpath=FreeBSD+11.0-RELEASE).
- seL4 documentation,
  [MCS Extensions tutorial](https://docs.sel4.systems/Tutorials/mcs.html).
- Anderson, Bershad, Lazowska, and Levy,
  [Scheduler Activations](https://homes.cs.washington.edu/~tom/pubs/sched_act.html).
- Ghosh et al.,
  [ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling](https://dl.acm.org/doi/10.1145/3477132.3483542).
- Ousterhout et al.,
  [Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads](https://www.usenix.org/conference/nsdi19/presentation/ousterhout).
- Fried et al.,
  [Caladan: Mitigating Interference at Microsecond Timescales](https://www.usenix.org/conference/osdi20/presentation/fried).
- Kaffes et al.,
  [Shinjuku: Preemptive Scheduling for microsecond-scale Tail Latency](https://www.usenix.org/conference/nsdi19/presentation/kaffes).
- Qin et al.,
  [Arachne: Core-Aware Thread Management](https://www.usenix.org/conference/osdi18/presentation/qin).

## Findings

### Fair General-Purpose Scheduling

Linux CFS established the now-common model that ordinary tasks should be
ordered by virtual runtime, not by a fixed time-slice list. Linux EEVDF keeps
the fair-scheduler lineage but chooses the eligible task with the earliest
virtual deadline, using request size and lag to improve latency and fairness.

The capOS consequence is not "import Linux CFS." It is:

- ordinary best-effort work should use virtual-time accounting (Phase D WFQ
  is now the active policy; the earlier FIFO round-robin was the bootstrap);
- latency-sensitive best-effort work should have bounded, policy-visible
  request sizes or weights rather than hidden scheduler magic;
- per-CPU run queues are a prerequisite before any EEVDF-like policy matters
  at SMP scale.

EEVDF is the strongest candidate for the next capOS best-effort policy
evolution after WFQ. It should follow WFQ rather than replace it immediately
because it depends on accurate runtime charging, per-CPU runnable ownership,
and migration accounting that are not yet in place.

### Per-CPU Run Queues and Topology

Linux and FreeBSD both make per-CPU scheduler state the normal SMP unit.
FreeBSD ULE additionally exposes CPU topology and affinity as first-class
placement concerns. This matches the current capOS scaling evidence: one
global scheduler lock and one global run queue make every CPU contend on the
same state even after per-thread rings remove the process-wide CQ bottleneck.

The near-term capOS scheduling architecture should split:

- per-CPU current thread and run queue ownership;
- cross-CPU wakeup and migration paths;
- shared process/thread metadata protected by narrower locks;
- placement policy from dispatch mechanism;
- diagnostic counters for lock hold/spin time, migration, steals, and IPIs.

### Realtime and Temporal Isolation

Linux `SCHED_DEADLINE` uses EDF plus Constant Bandwidth Server-style budget,
deadline, and period parameters. Its key lesson for capOS is admission:
deadline scheduling without bandwidth control is only a priority policy, not
a guarantee.

seL4 MCS is the more capability-native precedent. CPU time is represented by
scheduling-context objects. Passive servers can run on caller-donated CPU
time, avoiding priority inversion across synchronous IPC. This maps directly
to capOS endpoint services and direct IPC handoff.

The capOS split should remain:

- `SQE.deadline_ns`: request freshness and propagation metadata;
- `SchedulingContext`: spendable CPU-time authority;
- donation: temporary transfer of CPU budget/deadline along a synchronous
  capability path;
- `RealtimeIsland`: admitted bundle of scheduling contexts, memory/device
  reservations, communication paths, and overrun policy.

### Tickless, Isolation, and Housekeeping

Linux NO_HZ and CPU isolation reinforce that tick suppression is not one
feature. Idle tickless is a timer cleanup. Full-nohz is an isolation contract
that also needs housekeeping CPUs, accounting, timer migration, deferred work
placement, and revocation latency policy.

For capOS, this grounding shaped the implementation order: automatic nohz
activation for the narrow single-runnable-entity window and SQPOLL-driven
auto-nohz for ring-coupled leases are now implemented (Phase F), both tied
to the `CpuIsolationLease` with housekeeping, deferred-work placement,
clockevent deadline substrate, one-SQ-consumer ownership, and fail-closed
rollback prerequisites satisfied first. Generic full-nohz for explicitly
budgeted compute leases, timeout-based auto-revoke, and SQPOLL nohz for
explicitly leased caller-thread rings have since landed. Remaining future work
includes:

- full-nohz tied to policy-service issuance and durable monitoring telemetry;
- SQPOLL nohz beyond the current caller-thread ring-coupled lease shape;
- realtime island nohz after admission proves unrelated work, IRQs, deferred
  frees, and timers are excluded or bounded.

### Pluggable and User-Space Policy

Linux sched_ext and ghOSt show that fast scheduler experimentation is useful,
but they also preserve privileged dispatch and enforcement. sched_ext runs BPF
inside the kernel scheduler framework with fallback; ghOSt delegates policy to
user-space agents while retaining kernel mechanisms for safety and preemption.

For capOS, the safe architecture is:

- keep dispatch, budget enforcement, interrupt handling, idle, and fallback in
  the kernel;
- expose policy knobs through capabilities;
- let a privileged scheduler-policy service own admission, budget selection,
  CPU partitioning, isolation leases, and tuning;
- call the policy service on configuration changes, depletion/timeout faults,
  and coarse placement events, not on every context switch.

Dynamic policy loading is a later experiment. It should not become the first
way to make basic SMP scheduling scale.

### Datacenter Runtime Schedulers

Shenango, Caladan, Shinjuku, and Arachne target microsecond-scale service
latency by managing cores, preempting long request handlers, and separating
fast user-level scheduling from coarser kernel control. They are useful
because capOS will host services, agent runtimes, and network stacks that want
low tail latency.

The shared lessons are:

- core grants are different from CPU-time budgets;
- user-level worker schedulers need kernel-visible blocking and preemption
  boundaries;
- tail-latency policies need request-level telemetry, not only thread-level
  CPU shares;
- cross-core coordination must be cheap enough that it does not dominate the
  service latency it tries to reduce.

For capOS this argues for scheduler hints and policy capabilities above the
kernel mechanism, not a datacenter-specific kernel scheduler as the default.

### Stateful Work Graphs

The stateful task/job graph proposal is related at the workload layer. A graph
node can carry assignment metadata such as priority, deadline, budget, queue,
and lease, and a domain coordinator can decide which node attempt is runnable
inside that graph. That is not the same authority as kernel CPU scheduling.

The scheduler consequence is a clean boundary:

- graph/node priority is domain policy until translated by an authorized
  scheduler policy service;
- graph budgets reference resource profiles or scheduling contexts, but do not
  mint CPU time by themselves;
- graph deadlines may create request deadlines or admission inputs, but do not
  bypass scheduler admission;
- build, init, agent, and operator graph coordinators should lease work and
  consume scheduler primitives rather than owning a global CPU run queue;
- scheduler telemetry should be attachable to graph runs as artifacts, so a
  failed or slow job can explain whether it waited on authority, CPU budget,
  dependency state, I/O, or policy.

## Recommended capOS Direction

1. *(Done: Phase D/E/F)* Finish the current thread-scale evidence before
   larger policy changes. Phase D WFQ, Phase E `SchedulingContext`, and
   Phase F `CpuIsolationLease` / auto-nohz / SQPOLL-coupled nohz have landed.
2. Split scheduler state into per-CPU runnable ownership and bounded
   cross-CPU wake/migration. (Per-CPU queues remain future work; Phase F.5.)
3. Add precise CPU accounting and scheduler attribution before changing the
   default policy. (Attribution guardrails landed in Phase A; full per-CPU
   accounting is Phase F.5 follow-on.)
4. Move ordinary best-effort work toward an EEVDF-like virtual-deadline policy
   after accounting and per-CPU queues exist. (WFQ is current; EEVDF
   is a follow-on evaluation deferred until per-CPU queues exist.)
5. Keep `SCHED_DEADLINE`/EDF-CBS and seL4 MCS as the precedent for admitted
   realtime work, but express CPU authority as capOS `SchedulingContext`
   capabilities. (`SchedulingContext` is implemented; `RealtimeIsland`
   admission is Phase G future work.)
6. Keep user-space scheduler policy coarse-grained and capability-authorized;
   do not consult a user process on every timer interrupt or dispatch.
7. Treat SQPOLL, busy polling, and full-nohz as CPU-isolation leases with
   housekeeping and revocation constraints. (Ring-coupled SQPOLL nohz and
   generic full-nohz for explicitly budgeted compute leases are implemented;
   policy-service issuance remains future work.)
8. Keep runtime schedulers above per-thread rings, futex/park/notification
   primitives, timers, and explicit thread objects.

The resulting target is a layered scheduler:

- Kernel dispatch/enforcement: per-CPU queues, context switch, idle,
  accounting, budget enforcement, timeout faults, direct IPC donation, and
  cross-CPU wake/migration.
- Kernel policy primitives: weights, virtual deadlines, scheduling contexts,
  CPU masks, isolation leases, and realtime-island admission hooks.
- Userspace policy: profiles, admission, budget selection, service/runtime
  hints, placement, diagnostics, and policy reload.
- Userspace runtimes: work stealing, actor queues, async reactors, service
  request schedulers, and language-specific M:N scheduling.

## Open Questions

The questions below have been answered by Phase D/E/F implementation; they are
kept for record and context:

- *Answered (Phase D):* WFQ is the first virtual-time policy; EEVDF is
  deferred until per-CPU queues and runtime charging exist.
- *Answered (Phase D):* The thread-scale milestone did not require a per-CPU
  queue split; WFQ on a global queue with per-thread weight sufficed.
- *Answered (Phase E):* The initial `SchedulingContext` ABI uses
  `SchedulingContextSpec` (weight, latency class, budget, period, overrun
  policy) and `SchedulingContextInfo` for info-only read access;
  `SchedulingContext.info()` is method id 0 for stability. Donation/return
  through endpoints is implemented; realtime island admission is future work.
- *Answered (Phase F):* `CpuIsolationLease` revocation interacts with session
  logout and process exit through lease-generation staling, which is the
  load-bearing rollback trigger; service replacement and process exit cleanup
  go through the same generation-staling path.

Remaining open questions:

- What is the minimum per-CPU queue split that closes the full-SMP scalability
  milestone (Phase F.5) without prematurely designing the full fair scheduler?
- How should policy-service issuance select and renew generic full-nohz and
  SQPOLL nohz leases beyond the current explicit local proofs?
- Which scheduler telemetry belongs in the always-on kernel and which belongs
  behind the benchmark-only `measure` feature?
- What is the right `RealtimeIsland` admission shape for admitted scheduling
  contexts, memory/device reservations, and overrun policy (Phase G)?