Research: Out-of-Kernel Scheduling
Survey of whether capOS can move the CPU scheduler out of the kernel: which parts are normally kept privileged, and which policies prior systems have moved to user-space services or loadable policy modules.
Scope
“User-space scheduler” is an overloaded term. The question here is narrower than language/runtime scheduling: can the OS CPU scheduler itself be moved out of the kernel?
This report separates the relevant models:
| Model | Schedules | Kernel sees | Examples |
|---|---|---|---|
| User-controlled kernel scheduling | Kernel threads / scheduling contexts | Privileged mechanism plus user policy inputs | L4 user-level scheduling, seL4 MCS, ARINC partition schedulers on seL4 |
| Dynamic in-kernel policy | Kernel threads | Policy loaded from user space but executed in kernel | Linux sched_ext, Ekiben, Bossa |
| Whole-machine core arbitration | Cores granted to applications/runtimes | Kernel threads pinned, parked, or revoked | Arachne, Shenango, Caladan |
| In-process M:N runtime | Goroutines, virtual threads, fibers, async tasks | A smaller set of OS threads | Go, Java virtual threads, Erlang, Tokio |
| User-level thread package | User-level threads or tasklets | One or more kernel execution contexts | Capriccio, Argobots |
| Kernel-assisted two-level runtime scheduling | User threads plus kernel events | Virtual processors / activations | Scheduler activations, Windows UMS |
The common boundary in prior systems is: the kernel allocates protected execution resources, handles blocking and preemption, and enforces isolation. User space supplies domain policy: which goroutine, actor, task, request, or coroutine runs next.
Feasibility Assessment
Moving the entire scheduler out of the kernel is not viable in a protected, preemptive system if “scheduler” means the code that runs on timer interrupts, chooses an immediately runnable kernel thread, saves/restores CPU state, changes page tables, updates per-CPU state, and enforces CPU-time isolation. That mechanism is part of the CPU protection boundary.
Moving scheduler policy out of the kernel is viable. A capOS-like kernel can act as a small CPU driver that enforces runnable-state invariants, capability-authorized scheduling contexts, budgets, priorities, CPU affinity, timeout faults, and IPC donation. A privileged user-space scheduler service can own admission control, budgets, priorities, placement, CPU partitioning, and service-specific policy.
The design point supported by the surveyed systems is not “no scheduler in kernel.” It is “minimal kernel dispatch and enforcement, user-space policy.”
Executive Conclusions
- The next-thread dispatch path is normally kept in kernel mode. It runs when the current user process may be untrusted, blocked, faulting, or out of budget.
- User space can own policy if the kernel exposes scheduling contexts as capability-controlled CPU-time objects. Thread creation and thread handles should follow the same capability-first model.
- Consulting a user-space scheduler server on every timer tick adds context switches to the hottest path and creates a bootstrap problem when the scheduler server itself is not runnable.
- seL4 MCS is the most directly comparable model: scheduling contexts are explicit objects, budgets are enforced by the kernel, and passive servers can run on caller-donated scheduling contexts.
- L4 user-level scheduling experiments show that user-directed scheduling is possible, with reported overhead from 0 to 10 percent compared with a pure in-kernel scheduler for their workload. That is plausible for policy changes, not for every dispatch decision.
- seL4 user-mode partition schedulers show the downside: a prototype partitioned scheduler measured substantial overhead because each scheduling event crosses the user/kernel boundary.
- sched_ext and Ekiben are useful evidence for pluggable scheduler policy, but they still execute policy in or near the kernel. They do not prove that the dispatch mechanism can be a normal user process.
- Whole-machine core arbiters such as Arachne, Shenango, and Caladan support a different split: the kernel still schedules threads, while a privileged control plane grants, revokes, and places cores at coarser granularity.
- Direct-switch IPC and scheduling-context donation reduce the priority inversion and dispatch-overhead risks that appear when capability servers are scheduled only by per-process priorities.
- Pure M:1 user-level threads are insufficient for capOS as the only threading story. They are fast, but one blocking syscall, page fault wait, or long CPU loop can stall unrelated user threads unless every blocking operation is converted to async form.
- M:N runtimes need a small OS contract: capability-created kernel threads, TLS/FS-base state, capability-authorized futex-style wait/wake, monotonic timers, async I/O/event notification, and a way to detect or avoid kernel blocking.
- Scheduler activations solved the right conceptual problem but exposed a complicated upcall contract. A capability OS can get most of the benefit with simpler primitives: async capability rings, notification objects, futexes, and explicit thread objects.
- Work-stealing with per-worker local queues is the dominant general-purpose runtime design. It gives locality and scale, but it needs explicit fairness guards and I/O polling integration.
- SQPOLL-style polling is a scheduling decision. It trades a core for lower submission latency and depends on SMP plus explicit CPU ownership.
- A generic language scheduler in the kernel is a separate design from out-of-kernel CPU policy. Go, Rust async, actor runtimes, and POSIX layers need kernel mechanisms that let them implement their own policy.
Privileged Mechanisms
The following responsibilities are mechanism, not policy. Moving them to a normal user process either breaks protection or puts a user/kernel round trip on the critical path:
- Save and restore CPU register context.
- Switch page tables / address spaces.
- Update per-CPU current-thread state, kernel stack, TSS/RSP0, and syscall stack state.
- Handle timer interrupts and IPIs.
- Maintain a safe runnable/blocked/exited state machine.
- Enforce CPU budgets and preempt a thread that exceeds its budget.
- Choose an emergency runnable thread when the policy owner is dead, blocked, or malicious.
- Run idle and halt safely when no runnable work exists.
- Integrate scheduling with blocking syscalls, page faults, futex waits, and IPC wakeups.
- Preserve invariants under SMP races.
These are exactly the parts currently concentrated in `kernel/src/sched.rs` and the x86 context-switch path. They can be simplified and made more generic, but they remain required somewhere privileged.
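The runnable/blocked/exited state machine in the list above is small enough to sketch. A minimal Rust model, with illustrative states and transitions (not the actual capOS types):

```rust
// Sketch of the runnable/blocked/exited state machine the kernel must own.
// States and transition rules are illustrative, not the capOS types.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum ThreadState {
    Runnable,
    Running,
    Blocked,
    Exited,
}

/// Returns true if the transition preserves the invariants: only a
/// Running thread can block or exit, only a Blocked thread is woken
/// back to Runnable, and Exited is terminal.
pub fn transition_ok(from: ThreadState, to: ThreadState) -> bool {
    use ThreadState::*;
    matches!(
        (from, to),
        (Runnable, Running)       // dispatched by the kernel
            | (Running, Runnable) // preempted or yielded
            | (Running, Blocked)  // blocking syscall, futex wait, IPC
            | (Running, Exited)   // thread exit
            | (Blocked, Runnable) // wakeup
    )
}
```

Centralizing the check like this makes the SMP invariant auditable: every state change goes through one function that can be asserted under the run-queue lock.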
Policy Surface
The following are policy examples that can be owned by a privileged user-space service once scheduling contexts exist:
- Admission control: which process/thread is allowed to consume CPU time.
- Priority assignment and dynamic priority changes.
- Budget/period selection for temporal isolation.
- CPU affinity and CPU partitioning decisions.
- Core grants for SQPOLL, device polling, network stacks, and latency-sensitive services.
- Overload handling policy.
- Per-service or per-tenant fair-share policy.
- Instrumentation-driven tuning.
- Runtime-specific hints, such as “latency-sensitive”, “batch”, “driver”, or “poller”.
This split gives a capOS-like system policy freedom while preserving a small, auditable kernel CPU mechanism.
Viable Architectures
1. Minimal Kernel Scheduler Plus User Policy Service
This is one capOS-compatible design point.
The kernel implements:
- Thread states and per-CPU run queues.
- Priority/budget-aware dispatch.
- Scheduling-context objects.
- Timer-driven budget accounting.
- Timeout faults or notifications.
- Capability-checked operations to bind/unbind scheduling contexts to threads.
- Emergency fallback policy.
A user-space sched service implements:
- System policy loaded from the boot manifest.
- Resource partitioning between services.
- Priority/budget updates.
- CPU pinning and SQPOLL grants.
- Diagnostics and policy reload.
The policy service is invoked on configuration changes and timeout faults, not on every context switch.
2. seL4-MCS-Style Scheduling Contexts
seL4 MCS makes CPU time a first-class kernel object. A thread needs a scheduling context to run. A scheduling context carries budget, period, and priority. The kernel enforces the budget with a sporadic-server model. Passive servers can block without their own scheduling context; callers donate their scheduling context through synchronous IPC, and the context returns on reply.
This maps directly to capOS:
```
SchedContext {
    budget_ns
    period_ns
    priority
    cpu_mask
    remaining_budget
    timeout_endpoint
}
```
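A hedged Rust mirror of this object, showing the budget charge a kernel timer path would apply. Field names and the simplified whole-budget refill are assumptions, not the capOS ABI; seL4 MCS actually uses sporadic-server replenishment fragments:

```rust
// Illustrative Rust mirror of the SchedContext sketch above, with the
// budget-enforcement step the kernel would run from its timer path.
pub struct SchedContext {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub priority: u16,
    pub remaining_budget: u64,
}

#[derive(Debug, PartialEq)]
pub enum TickOutcome {
    KeepRunning,
    /// Budget exhausted: preempt and notify the timeout endpoint.
    Depleted,
}

impl SchedContext {
    /// Charge `elapsed_ns` of CPU time against the remaining budget.
    pub fn charge(&mut self, elapsed_ns: u64) -> TickOutcome {
        self.remaining_budget = self.remaining_budget.saturating_sub(elapsed_ns);
        if self.remaining_budget == 0 {
            TickOutcome::Depleted
        } else {
            TickOutcome::KeepRunning
        }
    }

    /// Simplified refill at the start of the next period.
    pub fn replenish(&mut self) {
        self.remaining_budget = self.budget_ns;
    }
}
```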
Kernel responsibilities:
- Enforce budget and period.
- Dispatch runnable threads with eligible scheduling contexts.
- Donate and return contexts across direct-switch IPC.
- Notify user space on timeout or depletion.
User-space responsibilities:
- Create and distribute scheduling-context capabilities.
- Decide budgets and priorities.
- Build passive service topologies.
- React to timeout faults.
This moves scheduling policy out without moving the hot dispatch mechanism out.
3. Hierarchical User-Level Scheduler
L4 research evaluated exporting scheduling to user level through a hierarchical user-level scheduling architecture. The reported application overhead was 0 to 10 percent compared with a pure in-kernel scheduler in their evaluation, and the design enabled user-directed scheduling.
This is possible, but the cost model is sensitive:
- Every policy decision that requires a scheduler-server round trip is expensive.
- The scheduler server needs guaranteed CPU time, or the system can deadlock.
- Faults and interrupts still need kernel fallback.
- SMP multiplies races around run queues, CPU ownership, and migration.
This architecture is viable for coarse-grained partition scheduling, VM scheduling, or policy control. As a first general dispatch path, it has higher latency and bootstrap risk than an in-kernel dispatcher.
4. Dynamic In-Kernel Policy
Linux sched_ext lets user space load BPF scheduler programs, but the policy runs inside the kernel scheduler framework. The kernel preserves integrity by falling back to the fair scheduler if the BPF scheduler errors or stalls runnable tasks. Ekiben similarly targets high-velocity Linux scheduler development with safe Rust policies, live upgrade, and userspace debugging.
This model is a later-stage option for dynamic scheduler experiments, but it is not “scheduler in user space.” It also adds verifier/runtime complexity.
5. Core Arbiter / Resource Manager
Arachne, Shenango, and Caladan move high-level core allocation decisions out of the ordinary kernel scheduler path. Applications or runtimes know which cores they own, while an arbiter grants and revokes cores based on load or interference.
This model is useful for capOS after SMP:
- grant cores to NIC drivers, network stacks, or SQPOLL workers;
- revoke poller cores under CPU pressure;
- isolate latency-sensitive services from batch work;
- expose CPU ownership through capabilities.
It does not remove the kernel dispatcher. It changes the granularity of policy from “which thread next” to “which service owns this CPU budget.”
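The grant/revoke bookkeeping of such an arbiter is simple to model. A toy sketch with illustrative names and no real placement or interference policy:

```rust
use std::collections::BTreeMap;

// Toy core-arbiter bookkeeping in the Arachne/Shenango style: a
// privileged control plane grants and revokes whole cores.
pub struct CoreArbiter {
    free: Vec<u32>,               // idle physical cores
    owned: BTreeMap<u32, String>, // core id -> owning service
}

impl CoreArbiter {
    pub fn new(cores: u32) -> Self {
        Self { free: (0..cores).collect(), owned: BTreeMap::new() }
    }

    /// Grant one core to a service, if any core is free.
    pub fn grant(&mut self, service: &str) -> Option<u32> {
        let core = self.free.pop()?;
        self.owned.insert(core, service.to_string());
        Some(core)
    }

    /// Revoke a core, e.g. a poller core under CPU pressure.
    pub fn revoke(&mut self, core: u32) -> bool {
        if self.owned.remove(&core).is_some() {
            self.free.push(core);
            true
        } else {
            false
        }
    }
}
```

The real systems differ in the revocation signal (Shenango repolls every 5 microseconds; Caladan also watches interference counters), but the capability-shaped ownership record is the part that maps onto capOS.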
Classic Problem: Kernel Threads vs User Threads
The scheduler activations paper is still the cleanest statement of the core problem: kernel threads integrate correctly with blocking and preemption, while user-level threads offer cheaper context switching and better policy control. The failure mode of user-level threads layered naively on kernel threads is that kernel events are hidden from the runtime. A kernel thread can block in the kernel while runnable user threads exist, and the kernel can preempt a kernel thread without telling the runtime which user thread was stopped.
Scheduler activations address this by giving each address space a “virtual multiprocessor.” The kernel allocates processors to address spaces and vectors events to the user scheduler when processors are added, preempted, blocked, or unblocked. The activation is both an execution context and a notification vehicle.
The lesson for capOS is not to copy the full activation API. The durable idea is the split:
- Kernel owns physical CPU allocation, protection, preemption, and blocking.
- Runtime owns which application-level work item runs on a granted execution context.
- Kernel-visible blocking must create a runtime-visible event, or it must be avoided by making the operation async.
For capOS, async capability rings already avoid many blocking syscalls. The remaining hard cases are futex waits, page faults that require I/O, synchronous IPC, and preemption of long-running runtime tasks.
Runtime Schedulers in Practice
Go
Go uses an M:N scheduler with three central concepts:
- G: goroutine.
- M: worker thread.
- P: processor token required to execute Go code.
The Go runtime distributes runnable goroutines over worker threads, keeps per-P queues for scalability, uses global queues and netpoller integration for fairness and I/O, and parks/unparks OS threads conservatively to avoid wasting CPU. Its own source comments call out why centralized state and direct handoff were rejected: centralization hurts scalability, while eager handoff hurts locality and causes thread churn.
Preemption is mixed. Go has synchronous safe points and asynchronous preemption using OS mechanisms such as signals. The runtime can only safely stop a goroutine at points where stack and register state can be scanned.
Implications for capOS:
- Initial `GOOS=capos` can run with `GOMAXPROCS=1` and cooperative preemption, but useful Go requires kernel threads, futexes, FS-base/TLS, a monotonic timer, and an async network poller.
- A signal clone is not strictly required if capOS provides a runtime-visible timer/preemption notification and the Go port accepts cooperative-first behavior.
- The kernel must schedule threads, not processes, before Go can use multiple cores.
Java Virtual Threads
JDK virtual threads use M:N scheduling: many virtual threads are mounted on a
smaller number of platform threads. The default scheduler is a FIFO-mode
work-stealing ForkJoinPool; the platform thread currently carrying a virtual
thread is called its carrier.
The design is intentionally not pure cooperative scheduling from the application’s perspective: most JDK blocking operations unmount the virtual thread, freeing the carrier. But some operations pin the virtual thread to the carrier, notably native calls and some synchronized regions. The JEP also notes that the scheduler does not currently implement CPU time-sharing for virtual threads.
Implications for capOS:
- “Blocking” compatibility requires library/runtime cooperation, not just a scheduler. The runtime needs blocking operations to yield carriers.
- Native calls and pinned regions remain a general M:N hazard. capOS cannot make that disappear in the kernel.
Tokio and Rust Async Executors
Tokio represents the async executor model rather than stackful green threads.
Tasks run until they return Poll::Pending, so fairness depends on cooperative
yield points and wakeups. Tokio’s multi-thread scheduler uses one global queue,
per-worker local queues, work stealing, an event interval for I/O/timer checks,
and a LIFO slot optimization for locality.
Implications for capOS:
- A `capos-rt` async executor can integrate capability-ring completions, notification objects, and timers as wake sources.
- A cooperative budget is mandatory. A future that never awaits can monopolize a worker until kernel preemption takes the whole OS thread away.
- A single global CQ per process can become an executor bottleneck if many worker threads consume completions. Per-thread or sharded wake queues are likely needed after SMP.
Erlang/BEAM
BEAM schedulers run lightweight Erlang processes on scheduler threads. The runtime exposes scheduler count and binding controls, and Erlang processes are preempted by reductions rather than OS timer slices. This shows a different point in the design space: the language VM owns fairness because it controls execution of bytecode.
Implications for capOS:
- Managed runtimes can implement stronger fairness than native async libraries because they control instruction dispatch or compiler-inserted safe points.
- Native Rust/C userspace cannot rely on that unless the compiler/runtime inserts yield or safe-point checks.
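Reduction-style accounting can be approximated in native code only if yield checks are inserted explicitly at safe points. A minimal sketch, with an assumed quantum API:

```rust
// BEAM-style reduction accounting sketched for a native runtime: every
// unit of work charges a counter, and the task must yield at zero.
// This only provides fairness if the compiler or runtime actually
// inserts the charge calls; a tight native loop bypasses it entirely.
pub struct Reductions {
    left: u32,
    quantum: u32,
}

impl Reductions {
    pub fn new(quantum: u32) -> Self {
        Self { left: quantum, quantum }
    }

    /// Charge `cost` reductions; returns true when the task must yield.
    pub fn charge(&mut self, cost: u32) -> bool {
        self.left = self.left.saturating_sub(cost);
        self.left == 0
    }

    /// Called by the scheduler when the task is resumed.
    pub fn reset(&mut self) {
        self.left = self.quantum;
    }
}
```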
Capriccio and Argobots
Capriccio showed that a user-level thread package can scale to very high concurrency by combining cooperative user-level threads, asynchronous I/O, O(1) thread operations, linked stacks, and resource-aware scheduling. The important lesson is that the thread abstraction can survive high concurrency when the runtime controls stacks and blocking.
Argobots generalizes lightweight execution units into user-level threads and tasklets over execution streams. It is designed as a substrate for higher-level systems such as OpenMP and MPI, with customizable schedulers. This is directly relevant to capOS because it argues for low-level runtime mechanisms, not one global scheduling policy.
Lithe
Lithe targets composition of parallel libraries. Its thesis is that a universal task abstraction or one global scheduler does not compose well when multiple parallel libraries are nested. Instead, physical hardware threads are shared through an explicit resource interface, while each library keeps its own task representation and scheduling policy.
Implications for capOS:
- Avoid oversubscription by making CPU grants visible to user space.
- A future `CpuSet` or scheduling-context capability could let runtimes know how much parallelism they are actually allowed to use.
- Nested runtimes benefit from the ability to donate or yield execution resources without going through a process-global policy singleton.
Kernel Interfaces That Matter
Futexes
Futexes are the standard split-lock design: user space does the uncontended fast path with atomics, and the kernel only participates to sleep or wake threads. Linux also has priority-inheritance futex operations for cases where the kernel must manage lock-owner priority propagation.
For capOS:
- Implement futex as a capability-authorized primitive. Do not assume generic Cap’n Proto method encoding is acceptable for the hot path; measure it against a compact operation before fixing the ABI.
- Key futex wait queues by `(address_space, user_virtual_address)` for private futexes. Shared-memory futexes eventually need a memory-object identity plus offset.
- Support timeout against monotonic time first. Requeue and PI futexes can wait.
Restartable Sequences
Linux rseq lets user space maintain per-CPU data without heavyweight atomics and lets a thread cheaply read its current CPU/node. The current kernel docs also describe scheduler time-slice extensions for short critical sections.
For capOS:
- rseq-style current-CPU access becomes useful after SMP and per-CPU run queues.
- It is not a first threading prerequisite. Futex, TLS, and kernel threads come first.
- If added, expose a small per-thread ABI page with `cpu_id`, `node_id`, and an abort-on-migration critical-section protocol.
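A possible shape for such an ABI page, with hypothetical fields and a migration-sequence check standing in for a real rseq abort protocol:

```rust
// Hypothetical layout for a per-thread ABI page. #[repr(C)] pins field
// order so user space and kernel agree on offsets. Real rseq uses an
// abort-handler protocol; this sketch only detects migration after
// the fact.
#[repr(C)]
pub struct ThreadAbiPage {
    /// CPU the thread is currently running on; kernel updates on migration.
    pub cpu_id: u32,
    pub node_id: u32,
    /// Incremented by the kernel on every migration. A critical section
    /// records it on entry and retries if it changed.
    pub migration_seq: u64,
}

/// User-space check at the end of a per-CPU critical section.
pub fn critical_section_ok(page: &ThreadAbiPage, seq_at_entry: u64) -> bool {
    page.migration_seq == seq_at_entry
}
```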
io_uring SQPOLL
SQPOLL moves submission from syscall-driven to polling-driven. A kernel thread polls the submission queue and submits work as soon as userspace publishes SQEs. This reduces submission latency and syscall overhead for sustained I/O, but it burns CPU and needs careful affinity.
capOS already has an io_uring-inspired capability ring, so the analogy is direct:
- Current tick-driven ring processing is correct for a toy system but couples invocation latency to timer frequency.
- A kernel-side SQ polling thread interacts badly with single-CPU systems. On a single CPU it competes with the application it is supposed to accelerate.
- Make SQPOLL a scheduling/capability decision: the process donates or is granted a CPU budget for the poller.
- Completion handling remains a separate problem. A runtime still needs to poll CQs or block on notifications.
sched_ext
Linux sched_ext is not a normal user-level thread scheduler. It is a scheduler class whose behavior is defined by BPF programs loaded from user space. The kernel docs emphasize that sched_ext can be enabled and disabled dynamically, can group CPUs freely, and falls back to the default scheduler if the BPF scheduler misbehaves. The docs also warn that the scheduler API has no stability guarantee.
For capOS:
- The relevant idea is safe, dynamically replaceable policy with kernel integrity fallback.
- Copying the BPF ABI is not required. capOS can get a smaller version through privileged scheduler-policy capabilities later.
- Keep early scheduling policy in kernel Rust until the invariants are clear.
Whole-Machine User-Space/Core Schedulers
Arachne
Arachne is a user-level thread system for very short-lived threads. It is core-aware: applications know which cores they own and control placement of work on those cores. A central arbiter reallocates cores among applications. The published results report strong memcached and RAMCloud improvements, and the implementation requires no Linux kernel modifications.
Takeaway: user-level scheduling gets much better when the runtime has explicit core ownership. Blindly creating more kernel threads and hoping the OS scheduler does the right thing is a weaker contract.
Shenango
Shenango targets datacenter services with microsecond-scale tail-latency goals. It uses kernel-bypass networking and an IOKernel on a dedicated core to steer packets and reallocate cores across applications every 5 microseconds. The key policy is rapid core reallocation based on whether queued work is waiting long enough to imply congestion.
Takeaway: a dedicated scheduling/control core can be worthwhile when latency SLOs are tighter than normal kernel scheduling reaction times. It is expensive and only justified for sustained latency-sensitive workloads.
Caladan
Caladan extends the idea from load to interference. It uses a centralized scheduler core and kernel module to monitor and react to memory hierarchy and hyperthread interference at microsecond scale. Its main claim is that static partitioning of cores, caches, and memory bandwidth is neither necessary nor sufficient for rapidly changing workloads.
Takeaway: CPU scheduling is not only “which runnable thread next.” On modern machines it is also placement relative to caches, sibling SMT threads, memory bandwidth, and bursty workload phase changes.
Design Axes
| Axis | Options | Practical conclusion |
|---|---|---|
| Stack model | Stackless tasks, segmented/growing stacks, fixed stacks | Rust async uses stackless futures; Go/Java need runtime-managed stacks; POSIX threads need fixed or growable user stacks |
| Preemption | Cooperative, safe-point, signal/upcall, timer-forced OS preemption | Kernel preemption alone protects the system; runtime fairness needs safe points or cooperative budgets |
| Blocking | Convert all operations to async, add carriers, kernel upcalls | Async caps reduce blocking; Go/POSIX still need kernel threads and futexes |
| Queueing | Global queue, per-worker queues, work stealing, priority queues | Per-worker queues plus stealing are the default; add global fairness escape hatches |
| CPU ownership | Invisible OS scheduling, affinity hints, explicit CPU grants | Explicit grants matter for high-performance runtimes and SQPOLL |
| Cross-process calls | Queue through scheduler, direct switch, scheduling donation | Direct switch and scheduling-context donation reduce sync IPC overhead and inversion |
| Isolation | Best-effort fairness, priorities, budget/period contexts | Cloud-oriented capOS eventually needs budget/period scheduling contexts |
capOS Design Options
Option: Minimal Kernel Mechanism Plus User Policy
This option keeps dispatch and enforcement in the kernel, replaces the current round-robin process scheduler with a minimal kernel CPU mechanism, and moves policy to user space through scheduling-context capabilities.
The kernel side covers:
- dispatching the next runnable thread on each CPU;
- enforcing budget/period/priority invariants;
- handling interrupts, blocking, wakeups, and exits;
- direct-switch IPC and scheduling-context donation;
- an emergency fallback policy.
The user-space scheduler service covers:
- policy configuration from the manifest;
- per-service budgets, periods, priorities, and CPU masks;
- admission control for new processes and threads;
- SQPOLL/core grants;
- response to timeout faults and overload telemetry.
This gives a capOS-like system the exokernel/microkernel benefit of policy freedom without putting a user-space server on the context-switch fast path.
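One concrete policy the service could own is utilization-based admission control for scheduling contexts. A sketch under the simplifying assumption that budget/period utilizations just sum against CPU supply; this is an illustration, not a real schedulability test:

```rust
// Admission check a user-space scheduler service could run before
// creating a scheduling context: reject a new (budget_ns, period_ns)
// pair if total utilization would exceed the CPU supply.
pub fn admit(existing: &[(u64, u64)], new: (u64, u64), num_cpus: u64) -> bool {
    // Work in parts-per-million to stay in integer arithmetic.
    fn util_ppm(budget: u64, period: u64) -> u64 {
        if period == 0 {
            return u64::MAX; // degenerate context: never admit
        }
        budget.saturating_mul(1_000_000) / period
    }
    let total = existing
        .iter()
        .chain(std::iter::once(&new))
        .map(|&(b, p)| util_ppm(b, p))
        .fold(0u64, |acc, u| acc.saturating_add(u));
    total <= num_cpus.saturating_mul(1_000_000)
}
```

Because this runs in the policy service, it can be replaced per deployment without touching the kernel's enforcement path.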
Possible Implementation Sequence
- Thread scheduler in kernel. Convert from process scheduling to thread scheduling, with per-thread kernel stack, saved registers, FS base, and shared process address space/cap table.
- Scheduling contexts. Add kernel objects that carry budget, period, priority, CPU mask, and timeout endpoint. Initially assign one default context per thread.
- ThreadSpawner and ThreadHandle capabilities. Expose thread creation and lifecycle through capabilities from the start. Bootstrap grants `init` the initial authority; `init` or a scheduler service delegates it under quota.
- Scheduling-context donation for IPC. Baseline direct-switch IPC handoff exists for blocked Endpoint receivers; add budget/priority donation and return once scheduling contexts exist.
- User-space policy service. Let init or a `sched` service create and update scheduling contexts via capabilities.
- SMP core ownership. After per-CPU run queues and TLB shootdown exist, allow the scheduler service to manage CPU masks and SQPOLL/poller grants.
- Optional dynamic policy. Much later, consider sched_ext-like policy modules if Rust/verifier infrastructure exists. This is not a prerequisite.
Minimal Kernel API Sketch
```capnp
interface SchedulerControl {
  createContext @0 (budgetNs :UInt64, periodNs :UInt64, priority :UInt16)
      -> (context :SchedulingContext);
  setCpuMask @1 (context :SchedulingContext, mask :Data) -> ();
  bind @2 (thread :ThreadHandle, context :SchedulingContext) -> ();
  unbind @3 (thread :ThreadHandle) -> ();
  setTimeoutEndpoint @4 (context :SchedulingContext, endpoint :Endpoint) -> ();
  stats @5 (context :SchedulingContext) -> (consumedNs :UInt64, throttled :Bool);
}

interface SchedulingContext {
  yieldTo @0 (thread :ThreadHandle) -> ();
  consumed @1 () -> (consumedNs :UInt64);
}

interface ThreadSpawner {
  create @0 (
    entry :UInt64,
    stackTop :UInt64,
    arg :UInt64,
    context :SchedulingContext,
    flags :UInt64
  ) -> (thread :ThreadHandle);
}

interface ThreadHandle {
  join @0 (timeoutNs :UInt64) -> (status :Int64);
  exitCode @1 () -> (exited :Bool, status :Int64);
  bind @2 (context :SchedulingContext) -> ();
}
```
The hot path does not invoke these methods; they are control-plane operations.
Dependency: In-Process Threading
Kernel threads inside a process are a dependency for sophisticated user-level thread support:
- `Thread` object with saved registers, per-thread kernel stack, user stack pointer, FS base, state, and parent process reference.
- Scheduler runs threads, not processes.
- Process owns address space and cap table; threads share both.
- Process context switch saves/restores FS base today; thread scheduling must make that state per-thread.
- Thread creation is exposed first as a `ThreadSpawner` capability; bootstrap grants initial authority to `init`, and later policy delegates it through the capability graph.
- Thread exit reclaims the thread stack and wakes joiners if join exists.
This directly unblocks Go phase 2, POSIX pthread compatibility, native
thread-local storage, and any multi-worker Rust async executor.
Dependency: Futex and Timer
A minimal capability-authorized futex primitive has this shape:
```
futex_wait(futex_space, uaddr, expected, timeout_ns) -> Result
futex_wake(futex_space, uaddr, max_count) -> usize
```
Required semantics:
- `wait` checks that `*uaddr == expected` while holding the futex wait-lock equivalent, then blocks the current thread.
- `wake` makes up to `max_count` waiters runnable.
- Timeouts use monotonic ticks or a timer wheel/min-heap.
- Return values must distinguish woken, timed out, interrupted, and value mismatch.
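These semantics can be modeled compactly. A toy sketch of the value check and wake ordering, with thread ids standing in for real blocking; real kernel code must hold one lock across the check and the enqueue to avoid lost wakeups:

```rust
use std::collections::{HashMap, VecDeque};

// Toy model of futex_wait's value check and futex_wake's FIFO ordering,
// with wait queues keyed by user virtual address. Blocking itself is
// elided: WouldBlock marks where the kernel would suspend the thread.
#[derive(Debug, PartialEq)]
pub enum WaitResult {
    WouldBlock,    // value matched; thread is queued and would sleep
    ValueMismatch, // EAGAIN-style failure: retry the user-space fast path
}

#[derive(Default)]
pub struct FutexSpace {
    queues: HashMap<usize, VecDeque<u64>>, // uaddr -> waiting thread ids
}

impl FutexSpace {
    /// `current` stands in for the user word `*uaddr`.
    pub fn wait(&mut self, uaddr: usize, current: u32, expected: u32, tid: u64) -> WaitResult {
        if current != expected {
            return WaitResult::ValueMismatch;
        }
        self.queues.entry(uaddr).or_default().push_back(tid);
        WaitResult::WouldBlock
    }

    /// Wake up to `max_count` waiters; returns the woken thread ids.
    pub fn wake(&mut self, uaddr: usize, max_count: usize) -> Vec<u64> {
        match self.queues.get_mut(&uaddr) {
            Some(q) => (0..max_count).filter_map(|_| q.pop_front()).collect(),
            None => Vec::new(),
        }
    }
}
```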
The authority should be capability-based from the start, for example through a `FutexSpace`, `WaitSet`, or memory-object-derived capability. The encoding is still a measurement question. Generic capnp capability calls may be acceptable if measured overhead is close to a compact operation; otherwise futex should use a dedicated compact capability-authorized operation, because the primitive sits on the runtime parking path.
Measure this before fixing the ABI:
- `CAP_OP_NOP`: ring validation plus CQE post, with no cap lookup or capnp.
- Empty and small `NullCap` calls through normal cap lookup, method dispatch, capnp param decode, and capnp result encode.
- Futex-shaped compact operation carrying `cap_id`, `uaddr`, `expected`, and `timeout`/`max_count`, initially returning without blocking.
- Later, real blocking paths: failed wait, wake with no waiters, wait-to-block, wake-to-runnable, and wake-to-resume.
The useful decision is not “capability or syscall”; it is “generic capnp method or compact capability-authorized scheduler primitive.” Authority remains in the capability model either way.
Near Term: Runtime Event Integration
For capos-rt, design the executor around kernel completion sources:
- Capability-ring CQ entries wake tasks waiting on cap invocations.
- Notification objects wake tasks waiting on interrupts, timers, or service events.
- Futex wakes resume parked worker threads.
- Timers can be integrated as wakeups instead of periodic polling.
The executor policy can start simple:
- One worker per kernel thread.
- Local FIFO queue per worker.
- One global injection queue.
- Work stealing when local and global queues are empty.
- Cooperative operation budget, then requeue.
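The queue discipline above can be sketched as plain data structures, ignoring synchronization and real futures:

```rust
use std::collections::VecDeque;

// Single-threaded model of the executor queue discipline: a local FIFO
// per worker, one global injection queue, and stealing when both are
// empty. Task ids stand in for real futures; locking is omitted.
pub struct Executor {
    locals: Vec<VecDeque<u64>>,
    global: VecDeque<u64>,
}

impl Executor {
    pub fn new(workers: usize) -> Self {
        Self { locals: vec![VecDeque::new(); workers], global: VecDeque::new() }
    }

    /// Cross-worker spawns and I/O wakeups land in the global queue.
    pub fn inject(&mut self, task: u64) {
        self.global.push_back(task);
    }

    /// Same-worker spawns keep locality via the local queue.
    pub fn spawn_local(&mut self, worker: usize, task: u64) {
        self.locals[worker].push_back(task);
    }

    /// Dispatch order for `worker`: local queue, then global queue,
    /// then steal from the tail of a sibling's local queue.
    pub fn next_task(&mut self, worker: usize) -> Option<u64> {
        if let Some(t) = self.locals[worker].pop_front() {
            return Some(t);
        }
        if let Some(t) = self.global.pop_front() {
            return Some(t);
        }
        for victim in 0..self.locals.len() {
            if victim != worker {
                if let Some(t) = self.locals[victim].pop_back() {
                    return Some(t);
                }
            }
        }
        None
    }
}
```

A production version would also check the global queue periodically even when local work exists (Tokio's event interval) so injected tasks cannot starve behind a busy local queue.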
Stage 6: IPC Scheduling
For synchronous IPC, direct switch has been introduced before priority scheduling:
- If client A calls server B and B is blocked in receive, switch A -> B directly without picking an unrelated runnable thread. This is implemented for the current single-CPU Endpoint path.
- Mark A blocked on reply.
- Future fastpath work can transfer a small message inline; use shared buffers for large data.
Scheduling-context donation then adds the budget/priority transfer:
- The server runs the request using the caller’s scheduling context.
- The caller’s budget covers client + server work.
- Passive servers can exist without independent CPU budget and only run when a caller donates one.
This avoids priority inversion through the capability graph and matches the service architecture better than per-process priorities alone.
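The direct-switch decision itself is small. A hedged sketch with illustrative types, showing only the dispatch choice, not the context-switch mechanics:

```rust
// Decision sketch for direct-switch IPC: if the endpoint has a receiver
// blocked in receive, switch to it immediately instead of going through
// the run queue. Thread ids and the Next type are illustrative.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Next {
    /// Run the server immediately; the caller is marked blocked-on-reply.
    /// With scheduling contexts, the caller's context is donated here
    /// and returned on reply.
    DirectSwitch { to: u64 },
    /// No blocked receiver: queue the message and reschedule normally.
    Reschedule,
}

pub fn on_call(receiver_blocked_in_recv: Option<u64>) -> Next {
    match receiver_blocked_in_recv {
        Some(server_tid) => Next::DirectSwitch { to: server_tid },
        None => Next::Reschedule,
    }
}
```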
Stage 7: SMP and Core Ownership
Once per-CPU scheduler queues exist, these become policy surfaces:
- CPU affinity depends on correct migration and TLB shootdown.
- A `CpuSet` or `SchedulingContext` capability can describe allowed CPUs, budget, period, and priority.
- Cheap current-CPU exposure depends on a stable per-thread ABI page.
- SQPOLL can be gated on available CPU budget to avoid unlimited poller creation.
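A CpuSet capability could carry a plain bitmask. An illustrative sketch, assuming bit n means CPU n (the real encoding would be part of the capability ABI):

```rust
// Minimal CpuSet-style affinity mask as a u64 bitmask: bit n = CPU n.
// A migration or placement target must test inside the mask.
#[derive(Clone, Copy)]
pub struct CpuSet(pub u64);

impl CpuSet {
    pub fn single(cpu: u32) -> Self {
        CpuSet(1u64 << cpu)
    }

    pub fn allows(&self, cpu: u32) -> bool {
        cpu < 64 && self.0 & (1u64 << cpu) != 0
    }

    /// First allowed CPU, usable as a placement fallback.
    pub fn first(&self) -> Option<u32> {
        if self.0 == 0 { None } else { Some(self.0.trailing_zeros()) }
    }
}
```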
Risks and Failure Modes
- M:1 green threads do not provide Go or POSIX compatibility by themselves.
- A normal user-space process choosing the next thread on every timer tick puts a context-switch round trip on the hot path.
- Recovery from scheduler-service failure cannot depend solely on the scheduler service being runnable.
- A Go-like G/M/P scheduler in the kernel couples language runtime policy to the kernel.
- Generic Cap’n Proto capability calls may be too heavy for every synchronization primitive. Measure generic calls against compact capability-authorized operations before fixing the futex ABI.
- sched_ext-like dynamic policy loading depends on mature scheduler invariants and verifier/runtime machinery.
- SQPOLL on a single-core system can compete with the application it is meant to accelerate.
Open Questions
- Does capOS need scheduler-activation-style upcalls? Async caps and notification objects cover many of the same cases with less machinery.
- How can runtime preemption work without Unix signals? Options are cooperative-only, timer notification to a runtime handler, or a kernel forced safe-point ABI. Cooperative-only is a viable first option for supporting Go.
- How are shared-memory futex keys represented? Private futexes can key on address space and virtual address. Shared futexes need memory-object identity and offset.
- Should futex wait/wake use generic capnp capability methods or a compact capability-authorized operation? The answer should come from the measurement plan above, not from assumption.
- How much policy belongs in the boot manifest versus a long-running sched service? Static embedded systems can use manifest policy. Cloud or developer systems need runtime policy updates.
- What is the emergency fallback if the scheduler service exits? Options are a tiny kernel round-robin fallback for privileged recovery threads, a pinned immortal scheduler thread, or panic. The first is the only robust development choice.
Source Notes
- Anderson et al., “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism” (SOSP 1991): https://polaris.imag.fr/vincent.danjean/papers/anderson.pdf
- “Towards Effective User-Controlled Scheduling for Microkernel-Based Systems” (L4 user-level scheduling): https://os.itec.kit.edu/21_738.php
- Asberg and Nolte, “Towards a User-Mode Approach to Partitioned Scheduling in the seL4 Microkernel”: https://www.es.mdh.se/pdf_publications/2641.pdf
- Kang et al., “A User-Mode Scheduling Mechanism for ARINC653 Partitioning in seL4”: https://link.springer.com/chapter/10.1007/978-981-10-3770-2_10
- L4Re overview: https://l4re.org/doc/l4re_intro.html
- Liedtke, “On micro-kernel construction”: https://elf.cs.pub.ro/soa/res/lectures/papers/lietdke-1.pdf
- seL4 MCS tutorial: https://docs.sel4.systems/Tutorials/mcs.html
- seL4 design principles: https://microkerneldude.org/2020/03/11/sel4-design-principles/
- Linux kernel sched_ext documentation: https://www.kernel.org/doc/html/next/scheduler/sched-ext.html
- Arun et al., “Agile Development of Linux Schedulers with Ekiben”: https://arxiv.org/abs/2306.15076
- Williams, “An Implementation of Scheduler Activations on the NetBSD Operating System” (USENIX 2002): https://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa.html
- Microsoft, “User-Mode Scheduling”: https://learn.microsoft.com/en-us/windows/win32/procthread/user-mode-scheduling
- Go runtime scheduler source: https://go.dev/src/runtime/proc.go
- Go preemption source: https://go.dev/src/runtime/preempt.go
- OpenJDK JEP 444, “Virtual Threads”: https://openjdk.org/jeps/444
- Tokio runtime scheduling documentation: https://docs.rs/tokio/latest/tokio/runtime/
- von Behren et al., “Capriccio: Scalable Threads for Internet Services” (SOSP 2003): https://web.stanford.edu/class/archive/cs/cs240/cs240.1046/readings/capriccio-sosp-2003.pdf
- Argobots paper page: https://www.anl.gov/argonne-scientific-publications/pub/137165
- Argobots project: https://www.argobots.org/
- Pan et al., “Lithe: Enabling Efficient Composition of Parallel Libraries” (HotPar 2009): https://www.usenix.org/legacy/event/hotpar09/tech/full_papers/pan/pan_html/
- Linux futex(2) manual: https://man7.org/linux/man-pages/man2/futex.2.html
- Linux kernel restartable sequences documentation: https://docs.kernel.org/userspace-api/rseq.html
- io_uring_sqpoll(7) manual: https://manpages.debian.org/testing/liburing-dev/io_uring_sqpoll.7.en.html
- Qin et al., “Arachne: Core-Aware Thread Management” (OSDI 2018): https://www.usenix.org/conference/osdi18/presentation/qin
- Ousterhout et al., “Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads” (NSDI 2019): https://www.usenix.org/conference/nsdi19/presentation/ousterhout
- Fried et al., “Caladan: Mitigating Interference at Microsecond Timescales” (OSDI 2020): https://www.usenix.org/conference/osdi20/presentation/fried