Proposal: Symmetric Multi-Processing (SMP)
How capOS goes from single-CPU execution to utilizing all available processors.
This document has three phases: a per-CPU foundation (prerequisite plumbing), AP startup (bringing secondary CPUs online), and SMP correctness (making shared state safe under concurrency).
Depends on: Stage 5 (Scheduling) – needs a working timer, context switch, and run queue on the BSP before adding more CPUs.
Can proceed in parallel with: Stage 6 (IPC and Capability Transfer).
Current State
Everything is single-CPU. Specific assumptions that SMP breaks:
| Component | File | Assumption |
|---|---|---|
| Syscall stack switching | kernel/src/arch/x86_64/syscall.rs | Global SYSCALL_KERNEL_RSP / SYSCALL_USER_RSP statics |
| GDT, TSS, kernel stacks | kernel/src/arch/x86_64/gdt.rs | One static GDT, one TSS, one kernel stack, one double-fault stack |
| IDT | kernel/src/arch/x86_64/idt.rs | Single static IDT (shareable – IDT can be the same across CPUs) |
| SYSCALL MSRs | kernel/src/arch/x86_64/syscall.rs | STAR/LSTAR/SFMASK/EFER set once on BSP only |
| Current process | kernel/src/sched.rs | SCHEDULER with BTreeMap<Pid, Process> + current: Option<Pid> — single global behind Mutex |
| Frame allocator | kernel/src/mem/frame.rs | Single global ALLOCATOR behind one spinlock |
| Heap allocator | kernel/src/mem/heap.rs | linked_list_allocator behind one spinlock |
The comment in syscall.rs:12 already anticipates the fix: “Will be replaced by per-CPU data (swapgs) for SMP.”
Phase A: Per-CPU Foundation
Establish per-CPU data structures on the BSP. No APs are started yet – this phase makes the BSP’s own code SMP-ready so Phase B is a clean addition.
Per-CPU Data Region
Each CPU needs a private data area accessible via the GS segment base. On
x86_64, swapgs switches between user-mode GS (usually zero) and
kernel-mode GS (pointing to per-CPU data). The kernel sets KernelGSBase
MSR on each CPU during init.
```rust
/// Per-CPU data, one instance per processor.
/// Accessed via GS-relative addressing after `swapgs`.
#[repr(C)]
struct PerCpu {
    /// Self-pointer for accessing the struct from GS:0.
    self_ptr: *const PerCpu,
    /// Kernel stack pointer for syscall entry (replaces SYSCALL_KERNEL_RSP).
    kernel_rsp: u64,
    /// Saved user RSP during syscall (replaces SYSCALL_USER_RSP).
    user_rsp: u64,
    /// CPU index (0 = BSP).
    cpu_id: u32,
    /// LAPIC ID (from Limine SMP info or CPUID).
    lapic_id: u32,
    /// Pointer to the currently running process on this CPU.
    current_process: *mut Process,
}
```
The syscall entry stub changes from:
```
movq %rsp, SYSCALL_USER_RSP(%rip)
movq SYSCALL_KERNEL_RSP(%rip), %rsp
```
to:
```
swapgs
movq %rsp, %gs:16    # PerCpu.user_rsp
movq %gs:8, %rsp     # PerCpu.kernel_rsp
```
And symmetrically on return:
```
movq %gs:16, %rsp    # restore user RSP
swapgs
sysretq
```
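Because the self-pointer sits at offset 0, kernel Rust code can recover a reference to its own `PerCpu` without knowing its address. A minimal sketch, assuming the layout above and that the kernel GS base is active (i.e. after `swapgs`); the `this_cpu` name is illustrative, not existing code:

```rust
use core::arch::asm;

/// Return a reference to this CPU's PerCpu block.
///
/// Safety: must only be called while the kernel GS base is loaded
/// (between syscall-entry swapgs and exit swapgs, or in kernel threads).
unsafe fn this_cpu() -> &'static PerCpu {
    let ptr: *const PerCpu;
    // Read the self-pointer stored at PerCpu offset 0 (gs:[0]).
    asm!("mov {}, gs:[0]", out(reg) ptr, options(nostack, preserves_flags));
    &*ptr
}
```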
Per-CPU GDT, TSS, and Stacks
Each CPU needs its own:
- GDT – the TSS descriptor encodes the linear address of that CPU’s TSS, so each CPU needs a GDT with its own TSS entry. The segment layout (kernel CS/DS, user CS/DS) is identical across CPUs.
- TSS – `privilege_stack_table[0]` (kernel stack for interrupts from Ring 3) and IST entries (double-fault stack) must be per-CPU.
- Kernel stack – each CPU needs its own stack for syscall/interrupt handling. Current size: 16 KB (4 pages). Same size per CPU.
- Double-fault stack – each CPU needs its own IST stack. Current size: 20 KB (5 pages).
```rust
/// Allocate and initialize per-CPU structures for one CPU.
fn init_per_cpu(cpu_id: u32, lapic_id: u32) -> &'static PerCpu {
    // Allocate kernel stack (4 pages) and double-fault stack (5 pages).
    let kernel_stack = alloc_stack(4);
    let df_stack = alloc_stack(5);

    // The TSS (and the GDT referencing it) must outlive this function —
    // the CPU uses them for its entire lifetime — so leak them to 'static.
    let tss = Box::leak(Box::new(TaskStateSegment::new()));
    tss.privilege_stack_table[0] = kernel_stack.top();
    tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = df_stack.top();

    // Create a GDT with this CPU's TSS descriptor.
    let (gdt, selectors) = create_gdt(tss);

    // Allocate and populate the PerCpu struct.
    let per_cpu = Box::leak(Box::new(PerCpu {
        self_ptr: core::ptr::null(), // filled below
        kernel_rsp: kernel_stack.top().as_u64(),
        user_rsp: 0,
        cpu_id,
        lapic_id,
        current_process: core::ptr::null_mut(),
    }));
    per_cpu.self_ptr = per_cpu as *const PerCpu;
    per_cpu
}
```
LAPIC Initialization
Stage 5 uses the 8254 PIT (100 Hz) and 8259A PIC (IRQ0 → vector 32) for preemption on the BSP. The PIT is a single shared device and cannot deliver per-CPU timer interrupts, so Phase A must migrate the BSP from the PIT to the LAPIC timer before any APs come online. Phase A sets up the full LAPIC, which is needed for:
- Per-CPU timer – replace PIT with LAPIC timer (required for SMP)
- IPI – inter-processor interrupts for TLB shootdown and AP startup
- Spurious interrupt vector – must be configured per-CPU
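Since the LAPIC timer frequency is not architecturally defined, it is typically calibrated once against a known clock (the PIT, before it is retired) and the initial-count register is derived from that. A sketch of the arithmetic only — the 10 ms calibration window and the function name are illustrative assumptions:

```rust
/// Compute the LAPIC timer initial-count value for a desired tick rate.
///
/// `ticks_in_10ms` is the number of LAPIC timer ticks counted over a
/// 10 ms PIT-timed calibration window (at the divider that will be used
/// at runtime); `hz` is the desired interrupt frequency.
fn lapic_initial_count(ticks_in_10ms: u32, hz: u32) -> u32 {
    // Ticks per second at the calibration divider.
    let ticks_per_sec = ticks_in_10ms as u64 * 100;
    // Initial count so the timer fires `hz` times per second.
    (ticks_per_sec / hz as u64) as u32
}

fn main() {
    // E.g. 1,000,000 ticks in 10 ms => 100 MHz bus => count 1,000,000 at 100 Hz.
    assert_eq!(lapic_initial_count(1_000_000, 100), 1_000_000);
    assert_eq!(lapic_initial_count(1_000_000, 1000), 100_000);
}
```

Each CPU runs the same calibration (or reuses the BSP's result) and programs its own LVT timer entry with the computed count.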
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| x2apic or manual MMIO | LAPIC/IOAPIC access | yes |
The x86_64 crate (already a dependency) provides MSR access. LAPIC
register access can use the existing HHDM for MMIO, or x2apic crate for
the MSR-based interface.
Migration Path
Phase A is a refactor of existing single-CPU code, not an addition:
- Add the `PerCpu` struct, allocate one instance for the BSP
- Set the BSP’s `KernelGSBase` MSR, add `swapgs` to syscall entry/exit
- Replace the `SYSCALL_KERNEL_RSP` / `SYSCALL_USER_RSP` globals with GS-relative accesses
- Replace the scheduler’s global `SCHEDULER.current` with `PerCpu.current_process`
- Move GDT/TSS creation into `init_per_cpu()`, call it for the BSP
- Migrate the BSP from the PIT to the LAPIC timer (PIT initialized in Stage 5)
After Phase A, the kernel still runs on one CPU but the per-CPU plumbing is
in place. Existing tests (make run) continue to pass.
Phase B: AP Startup
Bring Application Processors (APs) online. Each AP runs the same kernel code with its own per-CPU state.
Limine SMP Request
Limine provides an SMP response with per-CPU records. Each record contains
the LAPIC ID and a goto_address field – writing a function pointer there
starts the AP at that address.
```rust
use limine::request::SmpRequest;

#[used]
#[unsafe(link_section = ".requests")]
static SMP_REQUEST: SmpRequest = SmpRequest::new();

fn start_aps() {
    let smp = SMP_REQUEST.get_response().expect("no SMP response");
    for cpu in smp.cpus() {
        if cpu.lapic_id == smp.bsp_lapic_id {
            continue; // skip the BSP
        }
        let per_cpu = init_per_cpu(cpu.id, cpu.lapic_id);
        // Limine starts the AP at ap_entry with the cpu info pointer.
        cpu.goto_address.write(ap_entry);
    }
}
```
AP Entry
Each AP must:
- Load its per-CPU GDT and TSS
- Load the shared IDT
- Set its `KernelGSBase` MSR to its `PerCpu` pointer
- Configure SYSCALL MSRs (STAR, LSTAR, SFMASK, EFER.SCE)
- Initialize its LAPIC (enable, set timer, set spurious vector)
- Signal “ready” to BSP (atomic flag or counter)
- Enter the scheduler idle loop
```rust
/// AP entry point. Called by Limine with the SMP info pointer.
unsafe extern "C" fn ap_entry(info: &limine::smp::Cpu) -> ! {
    let per_cpu = /* retrieve PerCpu for this LAPIC ID */;

    // Load this CPU's GDT + TSS (assumes PerCpu also stores the GDT and
    // selectors created in init_per_cpu).
    per_cpu.gdt.load();
    unsafe { load_tss(per_cpu.selectors.tss); }

    // Shared IDT (same across all CPUs).
    idt::load();

    // Set GS base for swapgs.
    unsafe { wrmsr(IA32_KERNEL_GS_BASE, per_cpu as *const _ as u64); }

    // Configure syscall MSRs (same values as the BSP).
    syscall::init_msrs();

    // Initialize the local APIC.
    lapic::init_local();

    // Signal ready.
    AP_READY_COUNT.fetch_add(1, Ordering::Release);

    // Enter the scheduler.
    scheduler::idle_loop();
}
```
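The BSP side of the ready handshake can be a bounded spin on the counter. A sketch using core atomics — `AP_READY_COUNT`, `wait_for_aps`, and the spin-count timeout are assumptions, not existing kernel code (a real kernel would time out via the LAPIC timer):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Count of APs that have finished ap_entry initialization.
static AP_READY_COUNT: AtomicU32 = AtomicU32::new(0);

/// Spin until `expected` APs have signalled ready, or give up after
/// `max_spins` iterations.
fn wait_for_aps(expected: u32, max_spins: u64) -> bool {
    for _ in 0..max_spins {
        if AP_READY_COUNT.load(Ordering::Acquire) >= expected {
            return true;
        }
        core::hint::spin_loop();
    }
    false
}

fn main() {
    // Simulate two APs signalling ready from other threads.
    let aps: Vec<_> = (0..2)
        .map(|_| std::thread::spawn(|| {
            AP_READY_COUNT.fetch_add(1, Ordering::Release);
        }))
        .collect();
    for ap in aps { ap.join().unwrap(); }
    assert!(wait_for_aps(2, 1_000_000));
    println!("all APs online");
}
```

The Acquire/Release pairing ensures the BSP observes all per-CPU initialization that happened before each AP's `fetch_add`.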
Scheduler Integration
Stage 5 establishes a single run queue. Phase B extends it:
- Per-CPU run queues – each CPU pulls work from its local queue. Avoids global lock contention on the scheduler hot path.
- Global overflow queue – when a CPU’s local queue is empty, it steals from the global queue (or from other CPUs’ queues).
- CPU affinity – optional, not needed initially. All processes are eligible to run on any CPU.
- Idle loop – when no work is available, `hlt` until the next timer interrupt or IPI.
The Process struct gains a cpu field indicating which CPU it’s currently
running on (or None if queued).
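The per-CPU queue with global overflow described above can be sketched as plain data structures. This is a sketch of the queueing policy only — the `RunQueues` type, `Pid` alias, and lock choices are illustrative assumptions, not the existing scheduler:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

type Pid = u64;

/// One run queue per CPU, plus a global overflow queue.
struct RunQueues {
    local: Vec<Mutex<VecDeque<Pid>>>,
    global: Mutex<VecDeque<Pid>>,
}

impl RunQueues {
    fn new(cpus: usize) -> Self {
        RunQueues {
            local: (0..cpus).map(|_| Mutex::new(VecDeque::new())).collect(),
            global: Mutex::new(VecDeque::new()),
        }
    }

    /// Pick the next process for `cpu`: local queue first, then the
    /// global overflow queue. Only the local lock is taken on the hot path.
    fn next(&self, cpu: usize) -> Option<Pid> {
        if let Some(pid) = self.local[cpu].lock().unwrap().pop_front() {
            return Some(pid);
        }
        self.global.lock().unwrap().pop_front()
    }

    /// Enqueue on a CPU's local queue.
    fn push_local(&self, cpu: usize, pid: Pid) {
        self.local[cpu].lock().unwrap().push_back(pid);
    }
}

fn main() {
    let rq = RunQueues::new(2);
    rq.push_local(0, 7);
    rq.global.lock().unwrap().push_back(9);
    assert_eq!(rq.next(0), Some(7)); // local first
    assert_eq!(rq.next(1), Some(9)); // falls back to global
    assert_eq!(rq.next(1), None);    // nothing left
}
```

Stealing directly from other CPUs' local queues can be layered on later without changing this interface.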
Boot Sequence
```
BSP: kernel init (GDT, IDT, memory, heap, caps, scheduler)
BSP: init_per_cpu(0, bsp_lapic_id)
BSP: start_aps()
AP1: ap_entry() → init GDT/TSS/LAPIC → idle_loop()
AP2: ap_entry() → init GDT/TSS/LAPIC → idle_loop()
...
BSP: wait for all APs ready
BSP: load init process, schedule it
BSP: enter scheduler
```
Phase C: SMP Correctness
With multiple CPUs running, shared mutable state needs careful handling.
TLB Shootdown
When the kernel modifies page tables that other CPUs may have cached in their TLBs, it must send an IPI to those CPUs to invalidate the affected entries.
Scenarios requiring shootdown:
- Process exit – unmapping user pages. Only the CPU running the process has the mapping cached, but if the process migrated recently, stale TLB entries may exist on the old CPU.
- Shared kernel mappings – changes to the kernel half of page tables (e.g., heap growth, MMIO mappings) require all-CPU shootdown.
- Capability-granted shared memory – if future stages allow shared memory regions between processes, modifications require targeted shootdown.
Implementation: IPI vector + bitmap of target CPUs + invlpg on each
target. Linux uses a more sophisticated batching scheme, but a simple
broadcast IPI with single-page invlpg is sufficient initially.
```rust
/// Flush a TLB entry on all CPUs except the caller.
fn tlb_shootdown(addr: VirtAddr) {
    // Record the address to flush and reset the ack counter.
    SHOOTDOWN_ADDR.store(addr.as_u64(), Ordering::Release);
    SHOOTDOWN_ACK.store(0, Ordering::Release);
    // Send an IPI to all other CPUs.
    lapic::send_ipi_all_excluding_self(TLB_SHOOTDOWN_VECTOR);
    // Wait for all other CPUs to acknowledge.
    wait_for_shootdown_ack();
}

/// IPI handler on the receiving CPU.
fn handle_tlb_shootdown_ipi() {
    let addr = VirtAddr::new(SHOOTDOWN_ADDR.load(Ordering::Acquire));
    x86_64::instructions::tlb::flush(addr);
    SHOOTDOWN_ACK.fetch_add(1, Ordering::Release);
}
```
Lock Audit
Existing spinlocks need review for SMP safety:
| Lock | Current Use | SMP Concern |
|---|---|---|
| SERIAL | COM1 output | Safe, but high contention if many CPUs print. Acceptable for debug output. |
| ALLOCATOR | Frame bitmap | Hot path. Holding the lock during a full bitmap scan is O(n). Consider per-CPU free lists. |
| KERNEL_CAPS | Kernel cap table | Low contention (init only). Safe. |
| SCHEDULER.current | Single global running-process slot | Split into PerCpu.current_process in Phase A. |
Interrupt + spinlock deadlock: if CPU A holds a spinlock and takes an interrupt whose handler tries to acquire the same lock, it deadlocks. This is already noted in REVIEW.md. Fix: disable interrupts while holding locks that interrupt handlers may need (frame allocator, serial), either via an interrupt-safe mutex wrapper that saves and restores the interrupt flag around the critical section, or via manual cli/sti wrappers.
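One way to implement the interrupt-safe pattern is an RAII guard taken before acquiring any lock an interrupt handler may need: save RFLAGS.IF, execute cli, and restore the prior state on drop. A sketch using the `x86_64` crate's `interrupts` module — the `IrqGuard` type is an assumption, not existing code:

```rust
use x86_64::instructions::interrupts;

/// RAII guard: interrupts are disabled while this value is alive, and the
/// previous interrupt state is restored on drop. Nests correctly: an inner
/// guard created with interrupts already off restores "off".
struct IrqGuard {
    was_enabled: bool,
}

impl IrqGuard {
    fn new() -> Self {
        let was_enabled = interrupts::are_enabled();
        interrupts::disable(); // cli
        IrqGuard { was_enabled }
    }
}

impl Drop for IrqGuard {
    fn drop(&mut self) {
        if self.was_enabled {
            interrupts::enable(); // sti, only if it was on before
        }
    }
}
```

Usage: `let _irq = IrqGuard::new(); let frames = ALLOCATOR.lock();` — the lock is then released before interrupts are re-enabled, since Rust drops in reverse declaration order. The crate's `interrupts::without_interrupts(|| …)` closure provides the same pattern without a named guard.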
Allocator Scaling
The frame allocator is behind a single spinlock with O(n) bitmap scan. Under SMP, this becomes a contention bottleneck.
Options (in order of complexity):
- Per-CPU free list cache – each CPU maintains a small cache of free frames (e.g., 64 frames). Refill from the global allocator when empty, return batch when full. Reduces lock acquisitions by ~64x.
- Region partitioning – divide physical memory into per-CPU regions. Each CPU owns a bitmap partition. Cross-CPU allocation falls back to a global lock. More complex, better NUMA behavior (future).
Option 1 is recommended for initial SMP. ~50-100 lines.
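Option 1 can be sketched as a small per-CPU cache in front of the global allocator. The batch sizes, the `FrameCache` type, and the `GlobalFrames` stand-in (a plain vector of frame numbers in place of the bitmap allocator) are illustrative assumptions:

```rust
use std::sync::Mutex;

const BATCH: usize = 64;

/// Stand-in for the global frame allocator (frame numbers only).
struct GlobalFrames(Mutex<Vec<u64>>);

/// Per-CPU cache of free frames; refilled and flushed in batches so the
/// global lock is taken roughly once per BATCH operations.
struct FrameCache {
    free: Vec<u64>,
}

impl FrameCache {
    fn alloc(&mut self, global: &GlobalFrames) -> Option<u64> {
        if self.free.is_empty() {
            // Refill: take up to BATCH frames under one lock acquisition.
            let mut g = global.0.lock().unwrap();
            let take = BATCH.min(g.len());
            let start = g.len() - take;
            self.free.extend(g.drain(start..));
        }
        self.free.pop()
    }

    fn free(&mut self, frame: u64, global: &GlobalFrames) {
        self.free.push(frame);
        if self.free.len() > 2 * BATCH {
            // Return a batch to the global pool when the cache overflows.
            let drained: Vec<u64> = self.free.drain(..BATCH).collect();
            global.0.lock().unwrap().extend(drained);
        }
    }
}

fn main() {
    let global = GlobalFrames(Mutex::new((0..256).collect()));
    let mut cache = FrameCache { free: Vec::new() };
    let f = cache.alloc(&global).unwrap();
    // The refill took BATCH frames; one went to the caller.
    assert_eq!(cache.free.len(), BATCH - 1);
    cache.free(f, &global);
    assert_eq!(cache.free.len(), BATCH);
}
```

In the kernel the cache would live in `PerCpu`, so `alloc` needs no lock at all in the common case.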
The heap allocator (linked_list_allocator) is also behind a single lock.
For a research OS this is acceptable initially – heap allocations in the
kernel should be infrequent compared to frame allocations.
Cap’n Proto Schema Additions
SMP introduces a kernel-internal CpuManager capability for inspecting and
controlling CPU state. This is not exposed to userspace initially but follows
the “everything is a capability” principle.
```capnp
interface CpuManager {
  # Number of online CPUs.
  cpuCount @0 () -> (count :UInt32);
  # Per-CPU info.
  cpuInfo @1 (cpuId :UInt32) -> (lapicId :UInt32, online :Bool);
}
```
This capability would be held by init (or a system monitor process) for diagnostics. It’s additive and can be deferred until the mechanism is useful.
Estimated Scope
| Phase | New/Changed Code | Depends On |
|---|---|---|
| Phase A: Per-CPU foundation | ~300-400 lines (PerCpu struct, swapgs migration, per-CPU GDT/TSS) | Stage 5 |
| Phase B: AP startup | ~200-300 lines (SmpRequest, ap_entry, scheduler extension) | Phase A |
| Phase C: SMP correctness | ~200-300 lines (TLB shootdown, allocator cache, lock audit) | Phase B |
| Total | ~700-1000 lines | |
Milestones
- M1: Per-CPU data on BSP – `swapgs`-based syscall entry, per-CPU GDT/TSS, global current-process state split. Single CPU still. `make run` passes.
- M2: APs running – secondary CPUs reach `idle_loop()`. BSP prints “N CPUs online”. `make run` still runs init on the BSP.
- M3: Multi-CPU scheduling – processes can run on any CPU. The existing boot-manifest service set still works, but the scheduler distributes work across CPUs once runnable processes are available (runtime spawning still depends on `ProcessSpawner`).
- M4: TLB shootdown – page table modifications are safe across CPUs. Process exit on one CPU doesn’t leave stale mappings on others.
Open Questions
- LAPIC vs x2APIC. Modern hardware supports x2APIC (MSR-based, no MMIO). Should we require x2APIC, support both, or start with xAPIC? QEMU supports both. x2APIC is simpler (no MMIO mapping needed).
- Idle strategy. `hlt` is the simplest idle. `mwait` is more power-efficient and can wake on memory writes. Overkill for QEMU, but worth noting for future hardware targets.
- CPU hotplug. Limine starts all CPUs at boot. Runtime CPU online/offline is a future concern, not needed initially.
- NUMA awareness. Multi-socket systems have non-uniform memory access. Per-CPU frame allocator regions could be NUMA-aware. Deferred – QEMU emulates flat memory by default.
- Scheduler policy. Round-robin per-CPU queues with global overflow is the simplest starting point. Work stealing, priority scheduling, and CPU affinity are future refinements.
References
Specifications
- Intel SDM Vol. 3, Chapter 8 – Multiple-Processor Management (AP startup, APIC, IPI)
- Intel SDM Vol. 3, Chapter 10 – APIC (Local APIC, I/O APIC, x2APIC)
- OSDev Wiki: SMP
- OSDev Wiki: APIC
Limine
- Limine SMP Feature – `SmpRequest`/`SmpResponse` API, AP startup mechanism
Prior Art
- Redox SMP – per-CPU contexts, LAPIC timer, IPI-based TLB shootdown
- xv6-riscv SMP – minimal multi-core OS, clean per-CPU implementation
- Hermit SMP – Rust unikernel with SMP support via per-core data and APIC
- BlogOS – educational x86_64 Rust OS (single-CPU, but good APIC coverage)