DMA Isolation Design
S.11 gates PCI, virtio, and later userspace device-driver work on an explicit DMA authority model. The immediate goal is narrow: let the kernel bring up a QEMU virtio-net smoke test without creating a user-visible raw physical-memory escape hatch.
Short-Term Decision
Use kernel-owned bounce buffers for the first in-kernel QEMU virtio-net smoke.
The kernel allocates DMA-capable pages from its own frame allocator, owns the virtqueue descriptor tables and packet buffers, programs the device with the corresponding physical addresses, and copies packet payloads between those buffers and the networking stack. No userspace process receives a DMA buffer capability, a physical address, a virtqueue pointer, or a BAR mapping for this smoke.
This is deliberately conservative:
- It works before ACPI/DMAR or AMD-Vi parsing, IOMMU page-table management, MSI/MSI-X routing, and userspace driver lifecycle supervision exist.
- It keeps all physical-address programming inside the kernel, where the same code that allocates the frames also bounds the descriptors that reference them.
- It does not make the current FrameAllocator capability part of the DMA path. FrameAllocator can expose raw frames today and is already tracked in REVIEW_FINDINGS.md; DMA must not build new untrusted-driver semantics on that interface.
- It gives the smoke a disposable implementation path. When NIC or block drivers move to userspace, bounce-buffer authority becomes a typed DMAPool object instead of an ad hoc physical-address grant.
An IOMMU-backed DMA-domain model remains the target for direct device access from mutually untrusted userspace drivers, but it is not a prerequisite for the first QEMU smoke. Without an IOMMU, a malicious bus-mastering device can still DMA to arbitrary RAM at the hardware level; the short-term smoke assumes QEMU-provided virtio hardware and protects against confused or untrusted userspace, not hostile hardware.
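As a rough illustration of the bounce-buffer path, the sketch below models one kernel-owned buffer whose physical address stays private to the kernel. `BounceBuffer`, `stage_tx`, and `copy_rx` are hypothetical names chosen for this example, and a `Vec<u8>` stands in for the DMA-capable pages; the real allocator and virtqueue layout are not shown.

```rust
/// Hypothetical kernel-owned bounce buffer. All names are illustrative.
pub struct BounceBuffer {
    /// Physical address the kernel programs into a descriptor; never exported.
    phys_addr: u64,
    /// Stand-in for the contents of the DMA-capable pages.
    data: Vec<u8>,
}

impl BounceBuffer {
    pub fn new(phys_addr: u64, capacity: usize) -> Self {
        BounceBuffer { phys_addr: phys_addr, data: vec![0; capacity] }
    }

    /// Copy an outgoing packet in and return the (address, length) pair that
    /// only kernel code would write into a virtqueue descriptor.
    pub fn stage_tx(&mut self, payload: &[u8]) -> Option<(u64, u32)> {
        if payload.len() > self.data.len() || payload.len() > u32::MAX as usize {
            return None; // fail closed: payload does not fit
        }
        self.data[..payload.len()].copy_from_slice(payload);
        Some((self.phys_addr, payload.len() as u32))
    }

    /// Copy a received packet out. `used_len` is reported by the device and
    /// is treated as untrusted, so it is checked against the buffer size.
    pub fn copy_rx(&self, used_len: u32) -> Option<Vec<u8>> {
        self.data.get(..used_len as usize).map(|bytes| bytes.to_vec())
    }
}
```

The networking stack only ever sees the copied bytes; the address returned by `stage_tx` is consumed by kernel descriptor code and goes no further.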
Authority Model
Device authority is split into three independent capabilities:
- DMAPool: authority to allocate, expose, and revoke device-visible memory within a kernel-owned physical range or IOMMU domain.
- DeviceMmio: authority to map and access one device’s register windows.
- Interrupt: authority to wait for and acknowledge one interrupt source.
Holding one of these capabilities never implies the others. A driver needs all three for a normal device, but the kernel and init can grant, revoke, and audit them separately.
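One way to picture the split is three unrelated types held in separate slots, so that granting or revoking one never touches the others. The field names and the `DriverGrants` container below are assumptions made for this sketch, not the kernel's actual object model.

```rust
/// Hypothetical stand-ins for the three capabilities. Fields are illustrative.
#[derive(Debug, PartialEq)]
pub struct DMAPool { pub pool_id: u32 }
#[derive(Debug, PartialEq)]
pub struct DeviceMmio { pub device_id: u32 }
#[derive(Debug, PartialEq)]
pub struct Interrupt { pub source_id: u32 }

/// What one driver currently holds. Each slot is granted and revoked alone.
#[derive(Default)]
pub struct DriverGrants {
    pub dma: Option<DMAPool>,
    pub mmio: Option<DeviceMmio>,
    pub irq: Option<Interrupt>,
}

impl DriverGrants {
    /// A normal device needs all three before the driver can operate it.
    pub fn can_drive_device(&self) -> bool {
        self.dma.is_some() && self.mmio.is_some() && self.irq.is_some()
    }

    /// Revoking the interrupt leaves the DMA and MMIO slots untouched.
    pub fn revoke_interrupt(&mut self) -> Option<Interrupt> {
        self.irq.take()
    }
}
```

Because the types share no conversion path, holding a `DeviceMmio` value gives code no way to construct a `DMAPool` or an `Interrupt`.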
DMAPool Invariants
DMAPool is the only future userspace-facing authority that may cause a
device-visible DMA address to exist.
- Authority: A holder may allocate buffers only from the pool object it was granted. It may not request arbitrary physical frames, import caller virtual memory by address, or derive another pool.
- Physical range: Every exported device address must resolve to pages owned by the pool. The kernel records the allowed host-physical page set and validates every descriptor mapping against that set before a device can use it. If an IOMMU domain backs the pool, the exported address is an IOVA, not raw host physical memory.
- Ownership: Each DMA buffer has one pool owner, one device-domain owner, and explicit CPU mappings. Sharing a buffer with another process requires a later typed memory-object transfer; copying packet data is the default until that object exists.
- No raw grants: Userspace never receives an unrestricted host-physical address. A driver may receive an opaque DMA handle or an IOVA meaningful only to its DMAPool/device domain. It cannot turn that value into access to unrelated RAM.
- Bounds: Buffer length, alignment, segment count, and queue depth are bounded by the pool. Descriptor chains that point outside an allocated buffer, wrap arithmetic, exceed device limits, or reference freed buffers fail closed before doorbell writes.
- Revocation: Revoking the pool first quiesces the device path using it, prevents new descriptors, waits for or cancels in-flight descriptors, then removes IOMMU mappings or invalidates bounce-buffer handles before freeing pages.
- Reset: If in-flight DMA cannot be proven stopped, revocation escalates to device reset through the owning device object before pages are reused.
- Residual state: Pages returned from a pool are zeroed or otherwise scrubbed before reuse by a different owner. Receive buffers are treated as device-written untrusted input until validated by the driver or stack.
For the in-kernel QEMU smoke, the kernel is the only DMAPool holder. The
same invariants apply internally even though no userspace capability object is
exposed yet.
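The bounds and stale-handle rules can be sketched as a single check that runs before a descriptor becomes device-visible. This is a minimal sketch under assumptions: `PoolBuffer`, `DmaHandle`, and `validate_segment` are hypothetical names, and a per-slot generation counter is one possible way to detect freed-buffer reuse, not necessarily the one the kernel will use.

```rust
#[derive(Debug, PartialEq)]
pub enum DescError { StaleHandle, ZeroLength, Overflow, OutOfBounds }

/// One allocated buffer inside a pool (hypothetical layout).
pub struct PoolBuffer {
    pub len: u64,
    /// Bumped when the buffer is freed, so older handles stop validating.
    pub generation: u32,
}

/// Opaque handle as a driver would hold it: slot index plus generation.
#[derive(Clone, Copy)]
pub struct DmaHandle { pub slot: usize, pub generation: u32 }

/// Check that [offset, offset + len) lies inside the live buffer named by
/// `handle`. Every failure is an error; nothing is clamped or truncated.
pub fn validate_segment(
    buffers: &[PoolBuffer],
    handle: DmaHandle,
    offset: u64,
    len: u64,
) -> Result<(), DescError> {
    let buf = buffers.get(handle.slot).ok_or(DescError::StaleHandle)?;
    if buf.generation != handle.generation {
        return Err(DescError::StaleHandle); // freed or reallocated buffer
    }
    if len == 0 {
        return Err(DescError::ZeroLength);
    }
    // checked_add rejects wrapping arithmetic instead of silently wrapping.
    let end = offset.checked_add(len).ok_or(DescError::Overflow)?;
    if end > buf.len {
        return Err(DescError::OutOfBounds);
    }
    Ok(())
}
```

A submission path built on this check would validate every segment of a chain and write the doorbell only after all of them return `Ok`.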
DeviceMmio Invariants
DeviceMmio is register authority, not memory authority.
- Authority: A holder may map only BARs or subranges recorded in the claimed device object. It may not map PCI config space globally, another function’s BAR, RAM, ROM, or synthetic kernel pages.
- Physical range: Each mapping is bounded to the BAR’s decoded physical range, page-rounded by the kernel, and tagged as device memory with cache attributes appropriate for MMIO. Partial BAR grants must preserve page-level isolation; otherwise the grant must cover the whole page-aligned register window and be treated as that much authority.
- Ownership: At most one mutable driver owner controls a device function’s MMIO at a time. Management capabilities may inspect topology, but register writes require the claimed DeviceMmio object.
- No DMA implication: Mapping registers does not grant any DMA buffer, frame allocation, interrupt, or config-space authority. Doorbell writes are accepted only as effects of register access; descriptor validity is enforced by DMAPool before queues are made visible to the device.
- Revocation: Revocation unmaps the driver’s register pages, marks the device object unavailable for new calls, and invalidates outstanding MMIO handles. Stale mappings or calls fail closed.
- Reset: Revoking the final mutable DeviceMmio owner resets or disables the device unless a higher-level device manager explicitly transfers ownership without exposing it to an untrusted holder.
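The physical-range rule can be illustrated with a small range computation. The sketch assumes a 4 KiB page size and takes the strict reading that a grant is refused when page rounding would expose anything outside the BAR; `BarRange` and `mmio_grant_range` are hypothetical names.

```rust
const PAGE_SIZE: u64 = 4096; // assumed page size for this sketch

/// Decoded physical range of one claimed BAR (hypothetical record).
pub struct BarRange { pub base: u64, pub len: u64 }

/// Return the page-aligned [start, end) physical range that would be mapped
/// for `len` bytes at `offset` within the BAR. Returns None if the request,
/// or its page-rounded extent, reaches outside the BAR.
pub fn mmio_grant_range(bar: &BarRange, offset: u64, len: u64) -> Option<(u64, u64)> {
    if len == 0 {
        return None;
    }
    let req_end = offset.checked_add(len)?;
    if req_end > bar.len {
        return None; // request extends past the end of the BAR
    }
    let bar_end = bar.base.checked_add(bar.len)?;
    // Round the start down and the end up to page boundaries.
    let start = bar.base.checked_add(offset)? & !(PAGE_SIZE - 1);
    let end = bar.base.checked_add(req_end)?.checked_add(PAGE_SIZE - 1)? & !(PAGE_SIZE - 1);
    // The rounded range is the authority actually granted, so it must not
    // cover bytes that belong to anything other than this BAR.
    if start < bar.base || end > bar_end {
        return None;
    }
    Some((start, end))
}
```

The returned range, not the requested one, is what should be recorded as the holder's authority, since the MMU cannot enforce anything finer than a page.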
Interrupt Invariants
Interrupt is event authority for one routed source.
- Authority: A holder may wait for, mask/unmask where supported, and acknowledge only its assigned vector, line, or MSI/MSI-X table entry. It may not reprogram arbitrary interrupt controllers or claim another source.
- Ownership: Each interrupt source has one delivery owner at a time. Shared legacy lines must be represented as a kernel-demultiplexed object with explicit device membership, not as ambient access to the whole line.
- Range: The capability records the hardware source, vector, trigger mode, polarity, and target CPU/routing state. User-visible operations are checked against that record.
- Revocation: Revocation masks or detaches the source, drains pending notifications for the old holder, invalidates waiters, and prevents stale acknowledgements from affecting a new owner.
- Reset: If the source cannot be detached cleanly, the owning device is reset or disabled before the interrupt is reassigned.
- No MMIO or DMA implication: Interrupt delivery does not grant register access, DMA buffers, or packet memory.
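The ownership and revocation rules can be sketched with a generation counter on the source record, so that a capability from a previous owner stops working the moment the source is reassigned. `IrqSource`, `IrqCap`, and the method names are hypothetical, and for simplicity this sketch drops events raised while the source is masked, which real interrupt hardware may instead latch.

```rust
/// Kernel-side record for one interrupt source (hypothetical fields).
pub struct IrqSource {
    generation: u64,
    masked: bool,
    pending: u32,
}

/// What a holder presents with each operation.
#[derive(Clone, Copy)]
pub struct IrqCap { generation: u64 }

impl IrqSource {
    pub fn new() -> Self {
        IrqSource { generation: 0, masked: true, pending: 0 }
    }

    /// Hand the source to a new owner: mask first, drain pending
    /// notifications, then bump the generation so old capabilities go stale.
    pub fn assign(&mut self) -> IrqCap {
        self.masked = true;
        self.pending = 0;
        self.generation += 1;
        IrqCap { generation: self.generation }
    }

    fn check(&self, cap: IrqCap) -> Result<(), &'static str> {
        if cap.generation == self.generation { Ok(()) } else { Err("stale capability") }
    }

    pub fn unmask(&mut self, cap: IrqCap) -> Result<(), &'static str> {
        self.check(cap)?;
        self.masked = false;
        Ok(())
    }

    /// Simulate the device raising the interrupt.
    pub fn raise(&mut self) {
        if !self.masked {
            self.pending += 1;
        }
    }

    /// Acknowledge one pending notification; stale capabilities fail closed.
    pub fn ack(&mut self, cap: IrqCap) -> Result<(), &'static str> {
        self.check(cap)?;
        if self.pending == 0 {
            return Err("nothing pending");
        }
        self.pending -= 1;
        Ok(())
    }
}
```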
Revocation Ordering
Device revocation must follow a fixed order:
1. Stop new submissions by invalidating the driver’s user-visible handles.
2. Revoke MMIO write authority by write-blocking or unmapping BAR pages, or by disabling the device before any DMA teardown starts.
3. Mask or detach interrupts.
4. Quiesce virtqueues or device command queues.
5. Reset or disable the device if in-flight DMA cannot be accounted for.
6. Remove IOMMU mappings or invalidate bounce-buffer handles.
7. Scrub and free DMA pages.
This order prevents a stale driver from racing revocation with doorbell writes, interrupt acknowledgement, or descriptor reuse. Logical handle invalidation is not sufficient while a BAR remains mapped; register-write authority must be removed or the device must be disabled before descriptor or DMA-buffer ownership is reclaimed.
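One way to make the fixed order checkable is a small tracker that refuses any step taken out of sequence. The step names below paraphrase the list in this section; `RevokeStep`, `Revocation`, and `RevokeError` are hypothetical names, and the sketch only models ordering, not the work each step performs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RevokeStep {
    InvalidateHandles,
    RevokeMmioWrites,
    MaskInterrupts,
    QuiesceQueues,
    ResetIfInFlight,
    RemoveMappings,
    ScrubAndFree,
}

#[derive(Debug, PartialEq)]
pub enum RevokeError {
    OutOfOrder { expected: RevokeStep },
    AlreadyComplete,
}

/// The required sequence, in the order given in this section.
pub const ORDER: [RevokeStep; 7] = [
    RevokeStep::InvalidateHandles,
    RevokeStep::RevokeMmioWrites,
    RevokeStep::MaskInterrupts,
    RevokeStep::QuiesceQueues,
    RevokeStep::ResetIfInFlight,
    RevokeStep::RemoveMappings,
    RevokeStep::ScrubAndFree,
];

/// Tracks how far one device revocation has progressed.
pub struct Revocation { next: usize }

impl Revocation {
    pub fn new() -> Self {
        Revocation { next: 0 }
    }

    /// Record that `step` ran. Any step other than the next expected one is
    /// rejected and leaves the tracker unchanged.
    pub fn advance(&mut self, step: RevokeStep) -> Result<(), RevokeError> {
        match ORDER.get(self.next) {
            None => Err(RevokeError::AlreadyComplete),
            Some(&expected) if expected == step => {
                self.next += 1;
                Ok(())
            }
            Some(&expected) => Err(RevokeError::OutOfOrder { expected: expected }),
        }
    }

    pub fn is_complete(&self) -> bool {
        self.next == ORDER.len()
    }
}
```

With a tracker like this, an attempt to scrub pages before register-write authority is removed surfaces as an `OutOfOrder` error instead of a silent race.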
Future Userspace-Driver Transition Criteria
Moving NIC or block drivers out of the kernel is gated by S.11.2. The gate is only open when all rows below are implemented and demonstrated.
| Gate item | Required state | Must-have proof |
|---|---|---|
| S.11.2.0 DMA-objected buffers | DMAPool owns every driver-visible DMA mapping. | A driver receives opaque buffer handles or IOVA-only values; no path hands out raw host physical addresses. |
| S.11.2.1 Bound checks | Allocation, descriptor chain length, alignment, segment length, and ring depth are bounded and validated in constant time before ring submission. | Ring submissions fail closed on overflow, wrap, stale-handle, and freed-handle reuse attempts. |
| S.11.2.2 Explicit remap/ownership | DeviceMmio can only grant claimed BAR pages; cache attributes and write policy are enforced. | Driver cannot access unclaimed BARs, ROM, RAM pages, config-space globals, or stale mappings after revoke. |
| S.11.2.3 Interrupt correctness | Interrupt owns exactly one logical source at a time and drains/waits only for that source. | Reassigning an owner invalidates old waiters and masks or detaches the source first. |
| S.11.2.4 Quiesce + reset contract | Device manager can force reset/disable on failed revoke or teardown. | No in-flight descriptor may continue touching freed buffers after driver removal. |
| S.11.2.5 Process lifecycle | Capability release, process exit, and process-spawn cleanup paths cannot leak DMA pages, MMIO mappings, or interrupt ownership. | Crash-path teardown removes holds and invalidates user-visible handles before page free. |
| S.11.2.6 Isolation and accounting | S.9 quota and authority ledgers include DMA, MMIO, and interrupt hold edges. | A malicious or buggy driver cannot consume more than its allocated authority budget. |
| S.11.2.7 Hostile-smoke coverage | QEMU/CI smokes cover stale handles, descriptor abuse, revoke races, and exit-under-DMA. | Smoke output has explicit closed-case proof lines for each failure mode above. |
For each row, the transition requires an owner, implementation notes, and a CI-backed
verification path. Until all rows pass, Phase 4.2 NIC/block drivers remain in-kernel for
functionality, and only kernel-mapped bounce-buffer mode is allowed for prototype DMA.
S.11.2 Decision Record
S.11.2 is not complete until the kernel has a dedicated device manager object model
that can produce, transfer, and revoke DMAPool, DeviceMmio, and Interrupt in a
single ownership transaction for a driver process.
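The "single ownership transaction" requirement can be read as all-or-nothing: a driver either receives all three capabilities for a device or none of them. The sketch below illustrates that property only; `DeviceSlot`, `DriverBundle`, and the boolean availability flags are assumptions made for the example and do not describe the eventual device manager.

```rust
/// Hypothetical device-manager record: which capabilities are free to grant.
pub struct DeviceSlot {
    pub dma_free: bool,
    pub mmio_free: bool,
    pub irq_free: bool,
}

/// Proof that one driver holds all three capabilities for `device`.
#[derive(Debug, PartialEq)]
pub struct DriverBundle { pub device: u32 }

impl DeviceSlot {
    /// Grant all three capabilities or none. The availability check happens
    /// before any state changes, so no partial grant is ever observable.
    pub fn grant_all(&mut self, device: u32) -> Option<DriverBundle> {
        if !(self.dma_free && self.mmio_free && self.irq_free) {
            return None;
        }
        self.dma_free = false;
        self.mmio_free = false;
        self.irq_free = false;
        Some(DriverBundle { device: device })
    }

    /// Take the bundle back by value so the old holder cannot reuse it.
    pub fn revoke_all(&mut self, _bundle: DriverBundle) {
        self.dma_free = true;
        self.mmio_free = true;
        self.irq_free = true;
    }
}
```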
Current status: transition remains blocked pending implementation of the conditions above.