
Proposal: Capability-Oriented GPU/CUDA Integration

Purpose

Define a minimal, capability-safe path to integrate NVIDIA/CUDA-capable GPUs into the capOS architecture without expanding kernel trust.

The kernel keeps direct control of hardware arbitration and trust boundaries. GPU hardware interaction is performed by a dedicated userspace service that is invoked through capability calls.

Positioning Against Current Project State

capOS currently provides:

  • Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz, round-robin, context switching).
  • A global and per-process capability table with CapObject dispatch.
  • Shared-memory capability ring (io_uring-inspired) with syscall-free SQE writes, plus a cap_enter syscall for ordinary CALL dispatch and completion waits.
  • No ACPI/PCI/interrupt infrastructure yet in-kernel.

GPU integration must therefore be staged: it should begin as a capability-model exercise, with real hardware I/O added once the underlying kernel subsystems exist.

Design Principles

  • Keep policy in kernel, execution in userspace.
  • Never expose raw PCI/MMIO/IRQ details to untrusted processes.
  • Make GPU access explicit through narrow capabilities.
  • Treat every stateful resource (session, buffer, queue, fence) as a capability.
  • Require revocability and bounded lifetime for every GPU-facing object.
  • Avoid a Linux-driver-in-kernel compatibility dependency.

Proposed Architecture

  • capOS kernel (minimal): exposes only resource and mediation capabilities.
  • gpu-device service (userspace): receives device-specific caps and exposes a stable GPU capability surface to clients.
  • Application: receives only GpuSession/GpuBuffer/GpuFence capabilities.

Kernel responsibilities

  • Discover GPUs from PCI/ACPI layers.
  • Map/register BAR windows and grant a scoped DeviceMmio capability.
  • Set up interrupt routing and expose scoped IRQ signaling capability.
  • Enforce DMA trust boundaries for process memory offered to the driver.
  • Enforce revocation when sessions are closed.
  • Handle all faulting paths so that GPU or service misbehavior cannot crash the kernel.

User-space GPU service responsibilities

  • Open/initialize one GPU device from device-scoped caps.
  • Allocate and track GPU contexts and queues.
  • Implement command submission, buffer lifecycle, and synchronization.
  • Translate capability calls into driver-specific operations.
  • Expose only narrow, capability-typed handles to callers.

Capability Contract (schema additions)

Add to schema/capos.capnp:

  • GpuDeviceManager
    • listDevices() -> (devices: List(GpuDeviceInfo))
    • openDevice(capabilityIndex :UInt32) -> (session :GpuSession)
  • GpuSession
    • createBuffer(bytes :UInt64, usage :Text) -> (buffer :GpuBuffer)
    • destroyBuffer(buffer :UInt32) -> ()
    • launchKernel(program :Text, grid :UInt32, block :UInt32, bufferList :List(UInt32), fence :GpuFence) -> ()
    • submitMemcpy(dst :UInt32, src :UInt32, bytes :UInt64) -> ()
    • submitFenceWait(fence :UInt32) -> ()
  • GpuBuffer
    • mapReadWrite() -> (addr :UInt64, len :UInt64)
    • unmap() -> ()
    • size() -> (bytes :UInt64)
    • close() -> ()
  • GpuFence
    • poll() -> (status :Text)
    • wait(timeoutNanos :UInt64) -> (ok :Bool)
    • close() -> ()

Exact wire fields are intentionally flexible to keep this proposal at the interface level; method IDs and concrete argument packing should be finalized in the implementation PR.
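
As a concrete sketch, the additions above might render in Cap'n Proto schema form as follows. The file ID, method ordinals, and the GpuDeviceInfo struct (assumed declared elsewhere in the schema) are placeholders; the choice of interface-typed versus index-typed arguments simply mirrors the list above and is subject to the implementation PR.

```capnp
# Sketch of the proposed additions to schema/capos.capnp.
# Ordinals and argument packing are illustrative, not final wire format.
interface GpuDeviceManager {
  listDevices @0 () -> (devices :List(GpuDeviceInfo));
  openDevice @1 (capabilityIndex :UInt32) -> (session :GpuSession);
}

interface GpuSession {
  createBuffer @0 (bytes :UInt64, usage :Text) -> (buffer :GpuBuffer);
  destroyBuffer @1 (buffer :UInt32) -> ();
  launchKernel @2 (program :Text, grid :UInt32, block :UInt32,
                   bufferList :List(UInt32), fence :GpuFence) -> ();
  submitMemcpy @3 (dst :UInt32, src :UInt32, bytes :UInt64) -> ();
  submitFenceWait @4 (fence :UInt32) -> ();
}

interface GpuBuffer {
  mapReadWrite @0 () -> (addr :UInt64, len :UInt64);
  unmap @1 () -> ();
  size @2 () -> (bytes :UInt64);
  close @3 () -> ();
}

interface GpuFence {
  poll @0 () -> (status :Text);
  wait @1 (timeoutNanos :UInt64) -> (ok :Bool);
  close @2 () -> ();
}
```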

Implementation Phases

Phase 0 (prerequisite): Stage 4 kernel capability syscalls

  • Implement capability-call syscall ABI.
  • Add cap_id, method_id, params_ptr, params_len path.
  • Add kernel/user copy/validation of capnp messages.
  • Validate user process permissions before dispatch.
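
The kernel-side shape of this syscall path might look like the following sketch. The field names follow the bullets above; the validation limits (MAX_PARAMS_LEN, the table and method counts) are illustrative assumptions, not a fixed ABI.

```rust
// Sketch of Phase 0 capability-call validation as seen from the
// kernel side. Limits and error variants are placeholders.

const MAX_PARAMS_LEN: u64 = 64 * 1024; // cap on copied capnp message size

#[derive(Debug, Clone, Copy)]
struct CapCall {
    cap_id: u32,     // index into the caller's capability table
    method_id: u32,  // method ordinal within the CapObject interface
    params_ptr: u64, // user pointer to the serialized capnp params
    params_len: u64, // length of the params buffer in bytes
}

#[derive(Debug, PartialEq)]
enum CapError {
    BadCapId,
    BadMethod,
    ParamsTooLarge,
}

/// Pre-dispatch validation: check the capability exists, the method
/// ordinal is in range, and the params buffer is bounded before any
/// user memory is copied into the kernel.
fn validate_call(call: &CapCall, table_len: u32, method_count: u32) -> Result<(), CapError> {
    if call.cap_id >= table_len {
        return Err(CapError::BadCapId);
    }
    if call.method_id >= method_count {
        return Err(CapError::BadMethod);
    }
    if call.params_len > MAX_PARAMS_LEN {
        return Err(CapError::ParamsTooLarge);
    }
    Ok(())
}
```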

Phase 1: Device mediation foundations

  • Add kernel caps:
    • DeviceManager/DeviceMmio/InterruptHandle/DmaBuffer.
  • Add enough PCI/ACPI discovery to identify NVIDIA-compatible PCI functions.
  • Add guarded BAR mapping and scoped grant to an init-privileged service.
  • Add minimal GpuDeviceManager service scaffold returning synthetic/empty device handles.
  • Add manifest entries for a GPU service binary and launch dependencies.
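
The new kernel caps could be sketched as follows; the struct layout and clipping rule are assumptions meant to show the scoping intent, not a final design. The key property is that a granted MMIO window is always a sub-range of a discovered BAR.

```rust
// Sketch of Phase 1 kernel capability objects for device mediation.
// Field layout is illustrative.

#[derive(Debug, Clone, Copy)]
struct DeviceMmio {
    phys_base: u64, // BAR physical base plus granted offset
    len: u64,       // size of the granted window, never the whole address space
}

#[derive(Debug)]
enum KernelCap {
    DeviceManager,
    Mmio(DeviceMmio),
    InterruptHandle { vector: u8 },
    DmaBuffer { phys: u64, len: u64 },
}

/// Grant a scoped BAR sub-window to the init-privileged GPU service.
/// The request is validated against the discovered BAR bounds, so the
/// service can never receive MMIO outside its device.
fn grant_mmio(bar_base: u64, bar_len: u64, offset: u64, len: u64) -> Option<KernelCap> {
    let end = offset.checked_add(len)?; // reject arithmetic overflow
    if end > bar_len || len == 0 {
        return None;
    }
    Some(KernelCap::Mmio(DeviceMmio { phys_base: bar_base + offset, len }))
}
```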

Phase 2: Service-based mock backend

  • Implement a gpu-mock userspace service with the same Gpu* interface.
  • Support no-op buffers and synthetic fences.
  • Prove end-to-end:
    • init spawns driver
    • process opens session
    • buffer create/map/wait flows via capability calls
  • Add regression checks in integration boot path output.
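
The mock backend's internals could be as small as the following sketch: buffers are plain heap allocations and fences signal immediately, which is enough to exercise the end-to-end create/map/submit/wait flow. All names here are illustrative, not the final service types.

```rust
// Sketch of the Phase 2 gpu-mock service state: no-op buffers and
// synthetic fences, just enough to prove the capability plumbing.

use std::collections::HashMap;

#[derive(Default)]
struct GpuMock {
    next_id: u32,
    buffers: HashMap<u32, Vec<u8>>, // buffer id -> backing storage
    fences: HashMap<u32, bool>,     // fence id -> signaled
}

impl GpuMock {
    fn create_buffer(&mut self, bytes: u64) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.buffers.insert(id, vec![0u8; bytes as usize]);
        id
    }

    /// Synthetic submission: no real work is done, so the returned
    /// fence is signaled immediately.
    fn launch_kernel(&mut self, _buffers: &[u32]) -> u32 {
        let fence = self.next_id;
        self.next_id += 1;
        self.fences.insert(fence, true);
        fence
    }

    fn fence_wait(&self, fence: u32) -> bool {
        self.fences.get(&fence).copied().unwrap_or(false)
    }
}
```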

Phase 3: Real backend integration

  • Add an actual backend adapter for one concrete GPU driver API available in the environment.
  • Add:
    • queue lifecycle
    • fence lifecycle
    • DMA registration/validation
    • command execution path
    • interrupt completion path to service and return through caps
  • Keep backend replacement possible via trait-like abstraction in userspace service.
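
The trait-like abstraction mentioned above might be sketched like this, so the Phase 2 mock and a Phase 3 real adapter are interchangeable inside the userspace service. The method set and error type are assumptions.

```rust
// Sketch of the userspace backend abstraction: the service dispatches
// capability calls against this trait, unaware of the concrete driver.

trait GpuBackend {
    fn create_queue(&mut self) -> Result<u32, String>;
    /// Submit a command payload; returns a fence id for completion.
    fn submit(&mut self, queue: u32, cmds: &[u8]) -> Result<u32, String>;
    fn fence_signaled(&self, fence: u32) -> bool;
}

/// Trivial backend used until a real driver adapter exists.
struct NullBackend {
    next: u32,
}

impl GpuBackend for NullBackend {
    fn create_queue(&mut self) -> Result<u32, String> {
        self.next += 1;
        Ok(self.next)
    }
    fn submit(&mut self, _queue: u32, _cmds: &[u8]) -> Result<u32, String> {
        self.next += 1;
        Ok(self.next) // fence id; signaled immediately in this stub
    }
    fn fence_signaled(&self, _fence: u32) -> bool {
        true
    }
}
```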

Phase 4: Security hardening

  • Add per-session limits for mapped pages and in-flight submissions.
  • Add bounded queue depth and timeout enforcement.
  • Add explicit revocation propagation:
    • session close => all child caps revoked.
    • driver crash => all active caps fail closed.
  • Add explicit audit hooks for submit/launch calls.
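
Per-session accounting and fail-closed revocation could be enforced with admission checks like the sketch below; the limit values and field names are placeholders for whatever policy supplies.

```rust
// Sketch of Phase 4 per-session limits. Every map/submit passes an
// admission check, and a revoked session fails closed regardless of
// remaining budget.

struct SessionLimits {
    max_mapped_pages: u64,
    max_in_flight: u32,
}

struct SessionState {
    mapped_pages: u64,
    in_flight: u32,
    revoked: bool,
}

impl SessionState {
    fn admit_map(&mut self, pages: u64, limits: &SessionLimits) -> bool {
        if self.revoked || self.mapped_pages + pages > limits.max_mapped_pages {
            return false;
        }
        self.mapped_pages += pages;
        true
    }

    fn admit_submit(&mut self, limits: &SessionLimits) -> bool {
        if self.revoked || self.in_flight >= limits.max_in_flight {
            return false;
        }
        self.in_flight += 1;
        true
    }

    /// Session close or driver crash: mark revoked so every descendant
    /// capability operation fails from now on.
    fn revoke(&mut self) {
        self.revoked = true;
    }
}
```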

Security Model

The kernel does not grant a user process direct MMIO access.

Processes only receive:

  • GpuSession / GpuBuffer / GpuFence capabilities.

The service process receives:

  • DeviceMmio, InterruptHandle, and memory caps derived from its policy.

This ensures:

  • No userland process can program BAR registers.
  • No userland process can claim untrusted memory for DMA.
  • No userland process can observe or reset another session's state.

Dependencies and Alignment

This proposal depends on:

  • Stage 4 capability syscalls.
  • Kernel networking/PCI/interrupt groundwork from cloud deployment roadmap.
  • Stage 6/7 for richer cross-process IPC and SMP behavior.

It complements:

  • Device and service architecture proposals.
  • Storage/service manifest execution flow.
  • In-process threading work (future queue completion callbacks).

Minimal acceptance criteria

  • make run boots and prints GPU service lifecycle messages.
  • Init spawns GPU service and grants only device-scoped caps.
  • A sample userspace client can:
    • create session
    • allocate and map a GPU buffer
    • submit a synthetic job
    • wait on a fence and receive completion
  • Attempts to submit unsupported/malformed operations return explicit capnp errors.
  • Removing service/session capabilities invalidates descendants without kernel restart.

Risks

  • Real NVIDIA closed stack integration may require vendor-specific adaptation.
  • Buffer mapping semantics can become complex with paging and fragmentation.
  • Interrupt-heavy completion paths require robust scheduling before user-visible completion guarantees.

Open Questions

  • Is CUDA support required from the first integration, or should the initial surface be command-focused (GPU kernel payloads passed as opaque bytes), with CUDA runtime specifics layered on later?
  • Should memory registration initially support only pinned physical memory?
  • Which isolation level is needed: multi-tenant isolation from the start, or single-tenant for the first phase?