Proposal: Live Upgrade
Replacing a running service with a new binary, without dropping outstanding capability references or losing in-flight work.
Problem
In a Linux-like system, “upgrading a service” is one of:
- Restart: stop the old process, start the new one. Clients holding file descriptors, sockets, or pipes to the old process receive ECONNRESET or EPIPE and must reconnect. Session state is lost unless clients serialize it themselves.
- Graceful restart (nginx -s reload, unicorn, systemd socket activation): new process starts alongside old, inherits the listening socket, old drains in-flight requests. Works only for request/response protocols where the session is the request. Does nothing for stateful sessions.
- Live patch (kpatch, ksplice): binary-level function replacement. Narrow, fragile, no schema for state layout changes.
None of these compose with a capability OS. A CapId held by a client
points at a specific process; if that process exits, the cap is dead.
There is no “the service” abstraction the kernel could re-bind — the
point of capabilities is that they identify a specific reference, not
a name that could be redirected after the fact.
But capOS has a kernel-side primitive the Linux model lacks: the kernel
already owns the authoritative table of every CapId and which process
serves it. Rewriting “cap X is served by process v1” → “cap X is served
by process v2” is a table update. The question is when it is safe, and
how v2 inherits enough state to answer the next call.
Three Cases
Live upgrade has three distinct cost profiles. The right design is to make each one explicit rather than pretend the hard case doesn’t exist.
Case 1: Stateless services
Each SQE is independent; the service holds no state that matters across calls. A request router, a pure codec, a logger that flushes to an external sink.
Upgrade is trivial: start v2, retarget every CapId from v1 to v2,
exit v1. Clients may observe a small latency spike; no DISCONNECTED
CQE fires. Only the kernel primitive is needed.
Case 2: State externalized into other caps
The service’s in-memory data is a cache or dispatch table; durable state
lives behind caps the service holds (Store, SessionMap, Namespace).
v1’s held caps are passed to v2 at spawn time (via the supervisor, per
the manifest), kernel retargets client caps, v1 exits.
Architecturally this is the idiomatic capOS pattern: services stay thin, state is factored into dedicated holders with their own caps. The Fetch/HttpEndpoint split in the service-architecture proposal already pushes in this direction. In that world, most services fall into this bucket by construction.
Case 3: Stateful services requiring migration
The service has in-memory state that matters: a JIT’s code cache, a codec’s ring buffer, a parser’s arena, session data not yet flushed. Upgrade requires v1 to hand its state to v2.
capOS’s contribution here is that the state wire format is already capnp — the same format the service uses for IPC. v1 serializes its state as a capnp message; v2 consumes it. There is no separate serialization layer to build and no opportunity for it to drift from the IPC format.
The contract extends the service’s capnp interface:
```capnp
interface Upgradable {
  # Called on v1 by the supervisor. Returns a snapshot of service
  # state and stops accepting new calls. Calls already in flight
  # complete before the snapshot returns.
  quiesce @0 () -> (state :Data);

  # Called on v2 after spawn. Loads state from the snapshot. After
  # this returns, v2 is ready to serve calls.
  resume @1 (state :Data) -> ();
}
```
The state schema is service-defined. Schema evolution follows capnp’s standard rules: adding fields is backward-compatible, renaming requires care, removing requires a major version bump.
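To make the contract concrete, here is a minimal Rust sketch of a service implementing quiesce/resume. This is illustrative, not the capos-rt helper: the trait shape, `SessionService`, and its fields are assumptions, and plain key/value pairs stand in for the capnp-encoded state the real contract specifies.

```rust
use std::collections::HashMap;

// Hypothetical Rust-side mirror of the Upgradable capnp interface.
// In real capOS the snapshot would be a capnp message; sorted
// key/value pairs stand in for it here.
trait Upgradable {
    /// Stop accepting new calls and return a snapshot of state.
    fn quiesce(&mut self) -> Vec<(String, String)>;
    /// Load state from a snapshot; the service is ready afterwards.
    fn resume(&mut self, state: Vec<(String, String)>);
}

struct SessionService {
    accepting: bool,
    sessions: HashMap<String, String>,
}

impl SessionService {
    fn new() -> Self {
        SessionService { accepting: true, sessions: HashMap::new() }
    }
}

impl Upgradable for SessionService {
    fn quiesce(&mut self) -> Vec<(String, String)> {
        self.accepting = false; // no new calls after this point
        let mut snap: Vec<_> = self.sessions.drain().collect();
        snap.sort(); // deterministic snapshot order
        snap
    }

    fn resume(&mut self, state: Vec<(String, String)>) {
        self.sessions = state.into_iter().collect();
        self.accepting = true;
    }
}
```

The v2 binary can add fields to its snapshot schema freely; under capnp evolution rules, a v1 snapshot decodes in v2 with the new fields at their defaults.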
Kernel Primitive: CapRetarget
The kernel exposes the retarget as a capability method, not a syscall:
```capnp
interface ProcessControl {
  # Atomically redirect every CapId currently served by `old` to
  # be served by `new`. Requires: `new` implements a schema
  # superset of `old` (schema-id compatibility), `new` is Ready,
  # `old` is Quiesced (graceful) or the caller has permission to
  # force.
  retargetCaps @0 (old :ProcessHandle, new :ProcessHandle,
                   mode :RetargetMode) -> ();
}

enum RetargetMode {
  graceful @0;  # old must be Quiesced; in-flight calls drain on old
  force @1;     # caps redirect immediately; in-flight calls fail
}
```
Only a process holding a ProcessControl cap to both processes can
perform this — typically the supervisor that spawned them. The kernel
never initiates upgrades.
Atomicity is per-CapId. From a client’s perspective, the retarget is a
single point in time: a CALL SQE submitted before retarget goes to v1;
a CALL SQE submitted after goes to v2. A CALL already dispatched to v1
either completes there (graceful) or returns a DISCONNECTED CQE
(force).
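The retarget semantics can be simulated with a few lines of Rust. This is a sketch of the kernel-side table rewrite under stated assumptions: `CapId`, `Pid`, and the `ProcState` names are illustrative, not the real capOS kernel types, and the single-lock atomicity is only noted in a comment.

```rust
use std::collections::HashMap;

type CapId = u64;
type Pid = u32;

#[derive(Clone, Copy, PartialEq)]
enum ProcState { Ready, Quiesced }

enum RetargetMode { Graceful, Force }

// Hypothetical mini-model of the kernel's authoritative cap table.
struct Kernel {
    serves: HashMap<CapId, Pid>, // "cap X is served by process P"
    states: HashMap<Pid, ProcState>,
}

impl Kernel {
    fn retarget_caps(&mut self, old: Pid, new: Pid, mode: RetargetMode)
        -> Result<(), &'static str>
    {
        // `new` must be Ready in both modes.
        if self.states.get(&new) != Some(&ProcState::Ready) {
            return Err("new process not Ready");
        }
        // Graceful mode additionally requires `old` to be Quiesced.
        if let RetargetMode::Graceful = mode {
            if self.states.get(&old) != Some(&ProcState::Quiesced) {
                return Err("old process not Quiesced");
            }
        }
        // The table update itself: every cap served by `old` now
        // points at `new`. In a real kernel this runs under one
        // lock, so each CapId flips at a single point in time.
        for pid in self.serves.values_mut() {
            if *pid == old { *pid = new; }
        }
        Ok(())
    }
}
```

Note that caps served by unrelated processes are untouched; only `old`'s entries are rewritten.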
Supervisor-Level Upgrade Protocol
The primitives above compose into a protocol the supervisor runs:
1. spawn v2 from the new binary in the manifest
2. Case 1 & 2: v2.resume(EMPTY_STATE)
   Case 3: state = v1.quiesce(); v2.resume(state)
3. kernel.retargetCaps(v1, v2, graceful)
4. wait for v1 to drain (graceful mode)
5. v1.exit()
If any step fails, the supervisor rolls back: kill v2, resume v1 (if quiesced), log the failure. Because the retarget hasn’t happened yet, clients never observe the aborted attempt.
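A compressed Rust sketch of the Case 3 path, including the rollback branch. The `Proc` struct and the `resume_ok` flag are stand-ins for real process handles and a fallible `v2.resume` call; steps 3–4 are noted in comments rather than modeled.

```rust
// Hypothetical process handle; fields are illustrative only.
struct Proc {
    alive: bool,
    quiesced: bool,
    state: Vec<u8>, // capnp-encoded snapshot in the real system
}

/// Supervisor-side upgrade for a stateful (Case 3) service.
/// `resume_ok` simulates whether v2.resume succeeds.
fn upgrade(v1: &mut Proc, mut v2: Proc, resume_ok: bool)
    -> Result<Proc, &'static str>
{
    // Step 2 (Case 3): quiesce v1 and hand its snapshot to v2.
    v1.quiesced = true;
    v2.state = v1.state.clone();
    if !resume_ok {
        // Rollback: kill v2, un-quiesce v1. The retarget has not
        // happened yet, so clients never observe the attempt.
        v2.alive = false;
        v1.quiesced = false;
        return Err("v2.resume failed; rolled back");
    }
    // Steps 3-4: retargetCaps(v1, v2, graceful), drain v1 (elided).
    // Step 5: v1 exits.
    v1.alive = false;
    Ok(v2)
}
```

The key invariant the sketch preserves is that v1 is only ever torn down after v2 has accepted the state and the retarget point has passed.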
In-Flight Calls
The subtle case is a client that has already posted a CALL SQE to v1 when the retarget happens. Two options:
- Graceful mode. v1 finishes the call, kernel routes the CQE back to the client on v1’s ring. v1 exits only after its ring is empty. This preserves call semantics; v1 and v2 coexist briefly.
- Force mode. The in-flight CALL returns DISCONNECTED. Client retries against v2. Appropriate when v1 is wedged and quiesce won't return.
In graceful mode the client cannot distinguish “call landed on v1” from “call landed on v2” — which is the point. Capability identity survives the upgrade; process identity does not.
Relationship to Fault Containment
Live upgrade and fault containment (driver panics → supervisor respawns) share machinery. The difference is one step of the protocol:
- Fault containment: v1 has crashed; kernel has already marked it dead and epoch-bumped its caps. Supervisor spawns v2, issues a graceful retarget (no quiesce — v1 is gone; in-flight CALLs already delivered DISCONNECTED). Clients reconnect to v2.
- Live upgrade: v1 is healthy; supervisor initiates quiesce → state transfer → retarget, and no CQE ever reports DISCONNECTED to any caller.
The epoch-based revocation work from Stage 6 is the foundation for both. CapRetarget is one additional primitive layered on top.
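The shared machinery can be shown in miniature: both paths rewrite the same cap entry, and only the crash path bumps the epoch. The `CapEntry` fields and function names below are assumptions for illustration, not the real Stage 6 types.

```rust
// Hypothetical kernel cap entry; calls are stamped with the epoch
// the client observed when it acquired the cap.
#[derive(Clone, Copy)]
struct CapEntry {
    server: u32, // pid of the serving process
    epoch: u32,  // bumped on revocation
}

// Crash path: v1 died, so the entry is revoked by bumping the epoch
// before pointing it at the respawned process. Stale calls fail.
fn on_crash(e: &mut CapEntry, respawned: u32) {
    e.epoch += 1;
    e.server = respawned;
}

// Live-upgrade path: same field rewrite, but the epoch is preserved,
// so no caller ever observes DISCONNECTED.
fn on_retarget(e: &mut CapEntry, v2: u32) {
    e.server = v2;
}

fn call_ok(e: &CapEntry, stamped_epoch: u32) -> bool {
    stamped_epoch == e.epoch // mismatch => DISCONNECTED CQE
}
```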
Security and Trust
Live upgrade does not expand the trust model. The supervisor already holds the authority to kill, restart, and reassign caps for services it spawned — upgrade is a refinement of that authority, not a new principal. Requirements:
- Only a holder of ProcessControl caps to both old and new can call retargetCaps. By construction this is the supervisor that spawned them.
- The new binary must be legitimately obtained — in practice, loaded from the same content-addressed store as everything else (ties to Content-Addressed Boot).
- Schema compatibility (new is a superset of old) is checked by the kernel before retarget. This prevents an upgrade from silently narrowing the interface clients depend on.
Non-Goals
- Code hot-patching. No binary-level function replacement. Upgrade is at the process boundary, not the symbol boundary.
- Kernel live replacement. Covered by Reboot-Proof / process persistence (reboot with state preserved, not live replacement). The kernel is a single trust domain; replacing it in place needs a different design.
- Automatic schema migration across incompatible changes. If v2’s state schema is not a capnp-evolution-compatible superset of v1’s, the service author writes the migration. The kernel does not.
- System-wide registry of upgradable services. The supervisor knows what it spawned; there is no ambient discovery.
Phased Implementation
- CapRetarget primitive. Kernel operation + ProcessControl cap. Useful immediately for stateless services (Case 1) and as the foundation of Fault Containment (respawn with a new process, point its caps to a fresh instance).
- Upgradable interface. Schema, contract documentation, and a Rust helper in capos-rt that services derive.
- Graceful drain. Quiesce + in-flight call completion + v1 exit synchronization.
- Stateful demo. A service maintaining session state, upgraded live with zero session loss. This is the Live Upgrade observable milestone.
Related Work
- Erlang/OTP code_change/3 is the closest prior art: processes upgrade their behavior module in place, with a callback to migrate state. capOS differs only in that state transport goes through capnp rather than Erlang term format, and that the process boundary is an OS process rather than a BEAM process.
- Fuchsia component updates rebind component instances in the routing graph. Similar primitive in a different mechanism.
- nginx -s reload is graceful restart for request/response servers. The design here generalizes it by exposing the state migration point explicitly rather than relying on "the session is the request."