# Benchmarks

capOS benchmark rows are evidence records. Each row should say what workload
ran, what was verified, how time was measured, what machine envelope was used,
and where the raw artifacts were stored. A faster row whose verifier did not
complete is not a performance result.
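The rule above can be sketched as a small record type. This is only an illustration: the class name, field names, and the `fastest_valid` helper are hypothetical and do not describe the actual capOS harness format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    """One benchmark evidence record (hypothetical field names)."""
    workload: str           # what ran, e.g. "run-thread-scale"
    verifier_passed: bool   # whether the verifier completed successfully
    timing_method: str      # how time was measured
    machine_envelope: str   # host, VM, and pinning description
    artifact_dir: str       # where the raw artifacts were stored
    median: float           # measured value in the harness's own unit

    def is_performance_result(self) -> bool:
        """A row counts only if it is verified and fully described."""
        described = all([self.workload, self.timing_method,
                         self.machine_envelope, self.artifact_dir])
        return self.verifier_passed and described


def fastest_valid(rows):
    """Return the lowest-median row among rows that count as results."""
    valid = [row for row in rows if row.is_performance_result()]
    return min(valid, key=lambda row: row.median) if valid else None
```

With this shape, a fast row whose verifier failed is filtered out before any comparison happens, so it can never be reported as the best result.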

The broader benchmark model is in
[System Performance Benchmarks](proposals/system-performance-benchmarks-proposal.md).
Future parallel-pattern coverage is in
[HPC Parallel Processing Patterns](proposals/hpc-parallel-patterns-proposal.md).

## Current CPU Workloads

capOS currently has two CPU-scaling workloads:

| Workload | Target | Timed region | Verifier | Primary use |
| --- | --- | --- | --- | --- |
| `run-smp-process-scale` | Independent worker processes | worker compute only, after setup and before result reporting | aggregate prime count and checksum | Exercises multiple process-owned rings running CPU work on more than one scheduler CPU. |
| `run-thread-scale` | Sibling threads in one process | checksum work window, separate from spawn/join/shutdown totals | deterministic root checksum and metadata checks | Measures same-process thread scheduling, per-thread rings, and scheduler overhead. |

Both workloads keep serial and harness artifacts under `target/`. The capOS
rows below were collected under QEMU/KVM. The matching Linux rows use the same
workload shape where possible, but units differ by harness (for example, scaled
cycles on capOS and nanoseconds on Linux in the process-scale tables), so
absolute values should not be compared across systems. Compare speedup ratios
within a row instead.

## Process-Scale SMP

`make run-smp-process-scale` boots a focused manifest, runs independent worker
processes, and times the CPU-bound worker window. Each worker owns its own
process ring. The timed section avoids syscalls and serial output; the
coordinator verifies the aggregate result after workers finish.

The current workload counts primes over `2..3_000_000` using balanced
contiguous splits. capOS reports a worker-side user-mode cycle counter shifted
right by 20 bits. Linux reports guest `clock_gettime` nanoseconds.
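A minimal Python sketch of the workload shape, not the capOS worker code. It assumes that "balanced contiguous splits" means equal-sized adjacent ranges, and that `2..3_000_000` is half-open. The function names are hypothetical, and trial division is used only for brevity.

```python
def balanced_splits(start, stop, workers):
    """Split [start, stop) into contiguous ranges whose sizes differ by at most 1."""
    base, extra = divmod(stop - start, workers)
    ranges = []
    lo = start
    for index in range(workers):
        hi = lo + base + (1 if index < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def count_primes(lo, hi):
    """Count primes in [lo, hi) by trial division (simple, not fast)."""
    count = 0
    for n in range(max(lo, 2), hi):
        d = 2
        is_prime = True
        while d * d <= n:
            if n % d == 0:
                is_prime = False
                break
            d += 1
        count += is_prime
    return count


def aggregate_prime_count(start, stop, workers):
    """Sum per-worker counts, as a coordinator would after workers finish."""
    return sum(count_primes(lo, hi) for lo, hi in balanced_splits(start, stop, workers))


def scaled_cycles(raw_cycles):
    """Apply the documented scaling: shift the cycle count right by 20 bits."""
    return raw_cycles >> 20
```

Because the ranges are contiguous and cover the interval exactly once, the aggregate count does not depend on the worker count, which is what makes it usable as a verifier. The 20-bit shift divides by 1,048,576, so one scaled unit is roughly a million cycles.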

Controlled benchmark-VM reruns were recorded on GCE `n2-highcpu-8` at capOS
commit `0d89a91b` (`2026-04-30 11:09 UTC`). The environment was nested QEMU/KVM
on Ubuntu with kernel `6.17.0-1012-gcp`, QEMU `8.2.2`, and Rust nightly
`1.97.0-nightly` (`c935696dd 2026-04-29`). Host logical CPUs `0,1,2,3` map to
distinct physical cores, with SMT siblings `4,5,6,7`.

<!-- capos-benchmark-results:multi-process-smp start -->
| System | smp1 median | smp2 median | smp4 median | 1-to-2 speedup | 1-to-4 speedup |
|---|---:|---:|---:|---:|---:|
| capOS | 1,639 scaled cycles | 875 scaled cycles | 1,111 scaled cycles | 1.873x | 1.475x |
| Linux | 1,275,187,210 ns | 659,218,025 ns | 337,877,986 ns | 1.934x | 3.774x |
<!-- capos-benchmark-results:multi-process-smp end -->

At 4 vCPUs, capOS improved over the 1-vCPU median (1.475x) but was slower than
at 2 vCPUs (1.873x). Linux continued improving through 4 vCPUs under the same
pinning and workload. Raw capOS artifacts are under
`target/smp-process-scale/pinned-20260430T1113Z/`; raw Linux artifacts are
under `target/linux-smp-process-scale/pinned-20260430T1118Z/`.
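The speedup columns can be recomputed from the medians in the table above. The sketch below assumes the speedup is the 1-vCPU median divided by the N-vCPU median, which matches the published values. The dictionary layout and function names are illustrative.

```python
# Medians copied from the process-scale table above.
MEDIANS = {
    "capOS": {1: 1_639, 2: 875, 4: 1_111},                        # scaled cycles
    "Linux": {1: 1_275_187_210, 2: 659_218_025, 4: 337_877_986},  # nanoseconds
}


def speedup(medians, workers):
    """Ratio of the 1-vCPU median to the N-vCPU median (unit-free)."""
    return medians[1] / medians[workers]


def speedup_row(medians):
    """Format every non-baseline speedup to three decimals."""
    return {n: f"{speedup(medians, n):.3f}x" for n in sorted(medians) if n != 1}


for system, medians in MEDIANS.items():
    print(system, speedup_row(medians))
```

The ratio cancels the unit, which is why speedups can be read side by side even though capOS reports scaled cycles and Linux reports nanoseconds.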

### SMT Run

The same harness can run an eight-logical-CPU case on the benchmark VM. That
machine exposes four physical cores and eight SMT threads, so the `smp8-smt`
result measures SMT scaling on a 4-core host, not scaling across eight
physical cores.

The SMT run was recorded at commit `7c15dd47`
(`2026-04-30 11:45 UTC`) with QEMU pinned to logical CPUs
`0,1,2,3,4,5,6,7`.

<!-- capos-benchmark-results:smp8-smt-medians start -->
| System | smp1 median | smp2 median | smp4 median | smp8-smt median |
|---|---:|---:|---:|---:|
| capOS | 1,500 scaled cycles | 787 scaled cycles | 1,052 scaled cycles | 1,595 scaled cycles |
| Linux | 1,274,507,854 ns | 647,611,418 ns | 337,479,795 ns | 198,903,231 ns |
<!-- capos-benchmark-results:smp8-smt-medians end -->

<!-- capos-benchmark-results:smp8-smt-speedups start -->
| System | 1-to-2 speedup | 1-to-4 speedup | 1-to-8 speedup |
|---|---:|---:|---:|
| capOS | 1.906x | 1.426x | 0.940x |
| Linux | 1.968x | 3.777x | 6.408x |
<!-- capos-benchmark-results:smp8-smt-speedups end -->

Raw capOS SMT artifacts are under `target/smp-process-scale/smt8-20260430T1148Z/`.
Raw Linux SMT artifacts are under
`target/linux-smp-process-scale/smt8-20260430T1151Z/`.

## In-Process Thread Scaling

`make run-thread-scale` runs sibling threads inside one process. Child threads
use per-thread rings. The workload computes fixed-size checksum blocks; the
default shape is a blocking parent join, 262,144 blocks (16 MiB), and
`work_rounds=64`.

The harness records both a work-window time and a total time. The work window
brackets the checksum computation. Total time includes thread startup,
synchronization, shutdown, and join overhead. For scheduler analysis, both
numbers matter: work speedup shows CPU placement and dispatch during the
syscall-free section, while total speedup shows the cost of the surrounding
thread lifecycle.
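The difference between the two timers is mostly a matter of where the timestamps are taken. The Python sketch below shows only that placement. It is not the capOS harness: CRC32 stands in for the real checksum, and the 64-byte block size is inferred from 262,144 blocks = 16 MiB. Python threads will not show real CPU scaling because of the interpreter lock.

```python
import threading
import time
import zlib

BLOCK_SIZE = 64  # assumption: 16 MiB / 262,144 blocks = 64 bytes per block


def run_once(workers, total_blocks=4096, work_rounds=2):
    """Return (work_ns, total_ns, checksum) for one run with `workers` threads."""
    per_worker = total_blocks // workers
    partial = [0] * workers
    started = [0] * workers
    finished = [0] * workers
    gate = threading.Barrier(workers)

    def worker(index):
        first = index * per_worker
        last = total_blocks if index == workers - 1 else first + per_worker
        gate.wait()                                # release all workers together
        started[index] = time.perf_counter_ns()    # work window opens
        acc = 0
        for _ in range(work_rounds):
            for block in range(first, last):
                data = block.to_bytes(8, "little") * (BLOCK_SIZE // 8)
                acc = (acc + zlib.crc32(data)) & 0xFFFFFFFF
        finished[index] = time.perf_counter_ns()   # work window closes
        partial[index] = acc

    total_start = time.perf_counter_ns()           # before any thread exists
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    total_end = time.perf_counter_ns()             # after the last join

    work_ns = max(finished) - min(started)         # first start to last finish
    total_ns = total_end - total_start             # includes spawn and join
    checksum = sum(partial) & 0xFFFFFFFF           # independent of the split
    return work_ns, total_ns, checksum


for n in (1, 2, 4):
    print(n, *run_once(n))
```

The total window always contains the work window, so a total speedup that trails the work speedup points at startup, synchronization, or join cost rather than at the compute section.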

The old 1 MiB workload with a spinning parent is kept as historical context
only: the matching Linux pthread baseline also stayed flat at four workers,
which suggests the workload shape rather than the capOS scheduler limited
scaling. The current rows use the repaired 16 MiB blocking-parent shape unless
noted.

Recorded evidence:

<!-- capos-benchmark-results:thread-scale-current start -->
| System / mode | Placement | Runs | 1-to-2 work | 1-to-2 total | 1-to-4 work | 1-to-4 total | Notes |
|---|---|---:|---:|---:|---:|---:|---|
| Linux pthread baseline (benchmark VM, 2026-05-10 19:46 UTC) | physical-core logical CPUs `0,1,2,3` | 5 | 1.996x | 1.995x | 3.974x | 3.850x | Same checksum workload and pin set as the 2026-05-10 capOS row. |
| capOS (Phase D WFQ, benchmark VM, 2026-05-10 19:32 UTC) | physical-core logical CPUs `0,1,2,3` | 5 | 1.809x | 1.774x | 3.088x | 2.700x | Per-thread weights/latency classes, per-CPU WFQ queues, bounded steal path. |
| Linux pthread baseline (benchmark VM, 2026-05-02 21:34 UTC) | physical-core logical CPUs `0,1,2,3` | 5 | 1.988x | 1.987x | 3.963x | 3.858x | Same repaired workload before Phase D. |
| capOS (single global queue, benchmark VM, 2026-05-02 21:35 UTC) | physical-core logical CPUs `0,1,2,3` | 5 | 1.883x | 1.787x | 1.566x | 1.538x | Shows the four-worker cost of the single global runnable queue. |
| Linux pthread baseline (2026-05-01 report) | physical-core logical CPUs | 5 | 1.991x | 1.990x | 3.958x | 3.834x | Repaired-shape baseline recorded in `docs/changelog.md`; target artifact directory is not named in the source record. |
| capOS (pre-collapse placement, 2026-05-01 report) | physical-core logical CPUs | 5 | 1.828x | 1.687x | 3.029x | 2.386x | Commit `136b72de`; per-CPU placement model later replaced by the queue-collapse cleanup; target artifact directory is not named in the source record. |
| capOS, switch logs suppressed (pre-collapse, 2026-05-01 report) | physical-core logical CPUs | 5 | 1.913x | 1.636x | 3.272x | 2.303x | Same commit and model with scheduler switch logs suppressed; target artifact directory is not named in the source record. |
| capOS (post-collapse, single global queue, 2026-05-02 10:42 UTC) | physical-core logical CPUs `0,1,2,3` on the benchmark VM | 3 | 1.890x | 1.792x | 1.504x | 1.436x | Queue-collapse row recorded in `docs/backlog/scheduler-evolution.md`; target artifact directory is not named in the source record. |
<!-- capos-benchmark-results:thread-scale-current end -->

The 2026-05-10 Phase D WFQ row uses the same repaired shape as the 2026-05-02
pair: blocking parent join, 262,144 blocks, `work_rounds=64`, five runs,
KVM-backed QEMU pinned to physical-core logical CPUs `0,1,2,3`, and a matching
Linux pthread baseline on the same pin set. Raw capOS artifacts are under
`target/thread-scale/20260510T193200Z/`; raw Linux artifacts are under
`target/linux-thread-scale/20260510T194600Z/`.

The 2026-05-02 capOS/Linux pair used `main` commit `374f8556`; raw capOS
artifacts are under `target/thread-scale/20260502T213544Z/`, and raw Linux
artifacts are under `target/linux-thread-scale/20260502T213445Z/`.

Compared with the 2026-05-02 21:35 UTC single-global-queue row, the Phase D
WFQ row improved the four-worker work speedup from `1.566x` to `3.088x` and
the four-worker total speedup from `1.538x` to `2.700x`. Linux on the same
host and pin set recorded `3.974x` work and `3.850x` total at four workers.
The remaining gap is the scheduler/runtime optimization target for later work.
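One way to read the four-worker numbers is as parallel efficiency, meaning speedup divided by worker count. This is a generic metric rather than something the capOS harness is stated to report. The sketch below applies it to the speedups quoted above; the row labels and function name are illustrative.

```python
# Four-worker speedups copied from the thread-scale table above.
ROWS = {
    "capOS Phase D WFQ": {"work": 3.088, "total": 2.700},
    "capOS single global queue": {"work": 1.566, "total": 1.538},
    "Linux pthread (2026-05-10)": {"work": 3.974, "total": 3.850},
}
WORKERS = 4


def efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup / workers


for name, row in ROWS.items():
    work = efficiency(row["work"], WORKERS)
    total = efficiency(row["total"], WORKERS)
    print(f"{name}: work {work:.3f}, total {total:.3f}")
```

Efficiency makes rows with different worker counts comparable on one scale, which becomes more useful once 8, 16, or 32 worker rows exist.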

Guest-side attribution is available with
`CAPOS_THREAD_SCALE_GUEST_MEASURE=1`. It emits aggregate and per-phase
measurements for `spawn_ready`, `work`, `shutdown`, and `final_total`,
including scheduler choice, lock, timer, TLB, serial, shared-kernel-lock,
network-poll, thread-placement, and sampled user-PC buckets. Host-side QEMU
profiling is available with `CAPOS_THREAD_SCALE_PROFILE=1`.

## Interpreting CPU Counts

CPU-count rows are meaningful only with a recorded topology:

- Physical-core rows require enough physical cores for the vCPU count.
- SMT rows should say they are SMT rows and list the logical CPU set.
- Pinning QEMU with `taskset` is useful, but it is not CPU isolation by itself.
  Stronger runs should record `isolcpus`/`nohz_full`/`rcu_nocbs`, cpuset, or
  systemd affinity policy when used.
- Pinning QEMU to fewer host logical CPUs than guest vCPUs measures
  oversubscription behavior, not core scaling.
- Current QEMU/KVM results should stay separate from future direct cloud or
  bare-metal runs.
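The first four rules can be checked mechanically before a row is labeled. The sketch below is a hypothetical helper, not part of the capOS tooling. It reads SMT sibling sets from the standard Linux sysfs file `thread_siblings_list` and classifies a pin set. The three labels and the function names are assumptions made for illustration.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_siblings(root="/sys/devices/system/cpu"):
    """Map each logical CPU to the set of logical CPUs sharing its core."""
    siblings = {}
    for path in Path(root).glob("cpu[0-9]*/topology/thread_siblings_list"):
        cpu = int(path.parent.parent.name[3:])  # strip the "cpu" prefix
        siblings[cpu] = parse_cpu_list(path.read_text())
    return siblings


def classify_pin_set(pin_set, siblings, vcpus):
    """Label a run as "oversubscribed", "smt", or "physical-core"."""
    if len(pin_set) < vcpus:
        return "oversubscribed"   # fewer host logical CPUs than guest vCPUs
    cores = {frozenset(siblings[cpu]) for cpu in pin_set}
    if len(cores) < len(pin_set):
        return "smt"              # at least two pinned CPUs share a core
    return "physical-core"
```

For example, `classify_pin_set(parse_cpu_list("0-3"), read_siblings(), 4)` would label a four-vCPU run on a Linux host. This check covers placement only: it says nothing about isolation, so `isolcpus`, cpuset, or affinity policy still needs to be recorded separately.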

The current capOS benchmark tables cover up to four physical cores, plus one
eight-logical-CPU SMT case on a 4-core/8-thread VM. They do not yet measure
16-core or 32-core systems.

## Next CPU-Scaling Work

The next CPU-scaling milestone should be designed around direct hardware or a
dedicated perf runner rather than nested QEMU as the primary evidence source.
The benchmark suite needs:

- hardware discovery records for socket/core/SMT topology, APIC mode, timer
  source, frequency policy, memory size, and firmware/device model;
- workload rows at 1, 2, 4, 8, 16, and 32 workers where the machine has enough
  physical cores, plus separately labeled SMT rows;
- at least one static map/reduce checksum workload, one uneven dynamic-task
  workload, one barrier-heavy phase loop, and one IPC/service-bound workload;
- work-window and total-time reporting for every workload;
- matching Linux native baselines on the same hardware where a comparable
  workload exists;
- scheduler/runtime counters for queue depth, migrations, steals, reschedule
  IPIs, TLB shootdowns, timer ticks, lock wait/hold time, blocked time, and
  runnable-but-not-running time;
- raw artifacts with source commit, toolchain, kernel config, host topology,
  run count, warmup policy, and verifier output.

QEMU should remain useful for boot and regression coverage, but it should not
be the primary source for a 16/32-core SMP scalability milestone.

## Commands

Run the capOS process-scale workload:

```bash
make run-smp-process-scale
```

Run the process-scale workload with QEMU pinned to selected host CPUs:

```bash
CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1 make run-smp-process-scale
```

Run the process-scale SMT row on a host with at least eight logical CPUs:

```bash
CAPOS_SMP_SCALE_INCLUDE_SMT=1 \
  CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1,2,3,4,5,6,7 \
  make run-smp-process-scale
```

Run the thread-scale workload with five runs and QEMU pinned to host CPUs `0,1,2,3`:

```bash
CAPOS_THREAD_SCALE_RUNS=5 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale
```

Run the larger-workload Amdahl row (1,048,576 blocks, four times the default 262,144):

```bash
CAPOS_THREAD_SCALE_RUNS=5 \
  CAPOS_THREAD_SCALE_TOTAL_BLOCKS=1048576 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale
```

Run a one-sample host-side QEMU profiling pass:

```bash
CAPOS_THREAD_SCALE_PROFILE=1 \
  CAPOS_THREAD_SCALE_RUNS=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale
```

Run a one-sample guest-side measurement pass:

```bash
CAPOS_THREAD_SCALE_GUEST_MEASURE=1 \
  CAPOS_THREAD_SCALE_RUNS=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale
```

Run only the host summary parser against an existing `results.csv` without
booting QEMU:

```bash
CAPOS_THREAD_SCALE_SUMMARY_ONLY=1 \
  CAPOS_THREAD_SCALE_SUMMARY_CSV=<results.csv> \
  CAPOS_THREAD_SCALE_SUMMARY_KVM_EVIDENCE=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  CAPOS_THREAD_SCALE_TOTAL_BLOCKS=262144 \
  CAPOS_THREAD_SCALE_PARENT_WAIT=join \
  CAPOS_THREAD_SCALE_WORK_ROUNDS=64 \
  tools/qemu-thread-scale-harness.sh
```

Run the native Linux pthread baseline for the thread-scale checksum workload:

```bash
LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
  make run-linux-thread-scale-baseline
```

Run the Linux process-scale comparison:

```bash
LINUX_SMP_SCALE_KERNEL=target/linux-smp-process-scale/kernel/vmlinuz \
  tools/linux-smp-process-scale-baseline.sh
```

On hosts where `/boot/vmlinuz` is not readable by the current user, copy a
kernel image into ignored `target/` storage first through the host's normal
administrative path, then pass it as `LINUX_SMP_SCALE_KERNEL`. The script does
not invoke `sudo` itself.