ADR-0002 — Performance Baseline and Benchmark Harness
| Number | 0002 |
| Title | Performance baseline and benchmark harness for CopyPasteCollator |
| Status | Accepted |
| Author | @NoeFontana |
| Created | 2026-04-20 |
| Updated | 2026-04-20 |
| Tag | ADR-0002 |
Every Phase 1 (P1) change that touches CopyPasteCollator, the copy-paste
augmentation pipeline, or the placement logic should reference ADR-0002
when it ships a performance claim. Superseding a decision means writing a
new ADR that supersedes this one, not quietly drifting.
Context
Phase 1's exit criterion requires a ≥2× throughput improvement over the
v0.9.x CopyPasteCollator baseline. ADR-0001 explicitly scoped the
performance budget as out of scope — this ADR closes that gap so the 2×
target is a measurement, not an argument.
The repository prior to this ADR has no benchmark harness, no committed baseline, and no regression gate. Any perf claim in a P1 PR would be unverifiable without this infrastructure.
This ADR pins interface-level decisions only (workload, measurement contract, acceptance and gating policy, refresh policy). The GPU lane, durable time-series storage, and per-operation microbenchmarks are deliberately deferred.
Part (i) — Measurement contract
Canonical workload
The CopyPasteCollator baseline is measured against a single synthetic
workload defined by benchmarks/_fixture.py::build_batch:
- Batch size:
8. - Image size:
512 × 512(see note below). kpasted instances per image: \(k \sim \mathcal{U}\{1, \dots, 5\}\) (inclusive endpoints, 5 possible values).- Image dtype:
uint8, random values in[0, 256). - Bounding boxes: xyxy
float32, width and height each drawn from \(\mathcal{U}[64, 256]\), position clamped so boxes stay inside the image. - Labels:
int64, drawn from \(\mathcal{U}[1, 80]\) (COCO-class cardinality). - Masks:
[k, H, W]bool, the interior of each bounding box (guaranteesarea >= 1for the canonical config).
Reproducibility: each sample is seeded from seed * 8191 + i, and the
harness pre-builds 64 distinct batches and cycles through them by index
so the 2000 measurement iterations are bit-identical across runs with the
same seed.
Image size deviation from the original brief
The initial P0.C brief called for 1024² with 500 warmup + 2000
measured iterations. On ubuntu-latest (GitHub's 2-core default
runner) a single collator call at 1024² measures ~500 ms locally,
which projects to ~22 minutes for a full pass at the original counts
— too close to the 25-minute job timeout and nearly 40 minutes on a
colder 2-core runner. Two deviations from the brief resolve this:
512²replaces1024²as the canonical image size, bringing per-iter cost to ~100 ms.512²is also a realistic training resolution (Mask R-CNN baseline, LVIS short side).- Warmup is 100 iterations; measurement is 800 iterations (down from 500 / 2000). 800 samples is ample for a stable median and IQR on wall-clock data, and the 100-iter warmup is sufficient on CPU where there is no cuDNN autotune to settle.
Combined, a full pass is ~1.5 minutes instead of ~22. The Part (ii)
acceptance rule (three runs with IQR/median < 3%) is applied under
these reduced counts.
512² is also a realistic training resolution (Mask R-CNN baseline,
LVIS short side). A future runner upgrade or a GPU lane (see Part iv)
may motivate a new ADR that moves the canonical size back to 1024².
Timer usage
torch.utils.benchmark.Timer is the sole measurement primitive:
num_threads=1pins the PyTorch thread pool for the measurement window so runner-to-runner core-count differences do not leak into the metric.- Warmup: 100 calls to
stmt(), discarded. On CUDA, atorch.cuda.synchronize()is emitted once after warmup. - Measurement: 800 calls via
timer.timeit(1).median, collected as raw per-iter nanoseconds. blocked_autorangeis deliberately not used: it batches inner iterations and collapses them into a singleMeasurement, which prevents IQR computation.
JSON schema (version 1)
Every bench run emits a JSON report with the following shape. The
schema_version field is bumped when any field is removed or changes
type; additive changes do not bump it.
{
"schema_version": 1,
"label": "CopyPasteCollator",
"device": "cpu",
"cpu_model": "string or null",
"batch_size": 8,
"image_size": 512,
"k_range": [1, 5],
"warmup": 100,
"iters": 800,
"median_ns": 0,
"q1_ns": 0,
"q3_ns": 0,
"iqr_ns": 0,
"iqr_over_median": 0.0,
"python": "3.12.x",
"torch": "2.8.x",
"torchvision": "0.23.x",
"segpaste_version": "0.9.x",
"commit_sha": "string or null",
"runner": "string",
"timestamp": "ISO8601"
}
Part (ii) — Acceptance and gating
Runner pinning
The committed baseline.json and the bench-cpu-pr gate both run on
ubuntu-latest. A baseline captured on any other runner (dev box,
self-hosted, ubuntu-latest-large) is not comparable and must not be
committed. When GitHub changes the default ubuntu-latest image in a
way that shifts the median, the baseline is refreshed per Part (iii).
Baseline acceptance
A new benchmarks/baseline.json is accepted when three sequential
local runs with identical arguments satisfy
where \(m_i\) is the median_ns of the \(i\)-th run. One of the three is
committed verbatim; the other two are retained as evidence in the PR
description.
Why run-to-run RSD of medians and not within-run IQR/median. The
collator has substantial data-dependent per-call variance
(torch.equal early termination, variable object counts after
placement, mask-subtraction cost scaling with pasted area). Within a
single run, iqr_over_median is typically 15–25% — that is a property
of the workload, not of the measurement. The median itself, however,
is a stable run-to-run statistic at this iteration count. The
regression gate in Part (ii) compares medians, so the acceptance rule
measures the right thing: run-to-run stability of the quantity the
gate uses. iqr_over_median is retained in the JSON as a diagnostic
for noticing structural workload shifts that change distribution
shape without moving the median.
Regression gate
A PR fails the bench-cpu-pr job when:
i.e. a median regression strictly greater than 5%. The baseline is the
file on the PR's base branch (main), not the head. batch_size,
image_size, and device must match between baseline and current —
mismatches exit with code 2 and must be resolved by refreshing the
baseline.
Label escape hatches
Two PR labels alter the gate's behavior:
skip-perf-gate— the comparison step is skipped entirely; the harness still runs and the artifact is still uploaded. Reserved for (a) the P0.C PR itself (which creates the baseline), (b) PRs that refresh the baseline (see Part iii), and (c) PRs that intentionally land perf-affecting refactors whose baseline will be refreshed in a follow-up. Use requires a justification in the PR description.retry-on-perf-flake— ongithub.run_attempt > 1(i.e. a re-run), the comparison step is skipped. First-attempt runs still enforce the gate. This is a one-shot escape hatch for genuine runner noise; repeated use indicates the gate is too tight and should be widened via a new ADR (additive change).
Part (iii) — Baseline refresh policy
The baseline is refreshed only via an opt-in PR that:
- Is labeled
skip-perf-gate. - References
ADR-0002in the commit message or PR description. - Includes, in the PR description, the three local runs'
median_nsandiqr_over_medianvalues from Part (ii)'s acceptance criterion. - Touches
benchmarks/baseline.jsonand only that file in its commit (i.e. refresh is not bundled with feature work).
A refresh that changes batch_size, image_size, k_range, warmup,
or iters is a breaking change to the contract and requires a new
ADR that supersedes this one. Refreshes that only reflect runner-image
updates, torch-version bumps, or unchanged-workload re-captures are
additive and land under this ADR.
Part (iv) — Out of scope
- Durable time-series storage. Nightly reports land as GitHub Actions artifacts with 14-day retention. A follow-up ADR pins the long-term storage mechanism (orphan branch, external TSDB, etc.) once Phase 1 demonstrates the metric is load-bearing.
- Per-operation microbenchmarks. The baseline is end-to-end on
CopyPasteCollator.__call__. Probing_try_place_single_object,_blend_object_on_target, etc. is an internal-profile activity and does not belong in the regression gate. - Memory / allocator metrics. Only wall-clock median and IQR are pinned. Peak-RSS and allocator-stats are useful diagnostics but would broaden the contract beyond what P1's 2× throughput criterion needs.
The GPU/CUDA lane, originally listed here as deferred to P0.D, is discharged by Part (v) below.
Part (v) — GPU throughput lane
Added by ADR-0008. The CPU entry point
(CopyPasteCollator) is deleted in the same workstream and replaced by
BatchCopyPaste(nn.Module), which becomes the subject of the GPU
measurement. The CPU baseline.json (Part (i)) remains the anchor for
the ≥ 2× Phase 1 exit criterion.
Canonical GPU workload
Same shape as Part (i), lifted to CUDA:
- Batch size:
8. - Image size:
1024 × 1024. (The CPU baseline lowered to512²to fitubuntu-latest's 25-minute timeout. The A100 SXM runner is not constrained the same way;1024²matches the original P0.C brief and is a realistic training resolution for panoptic / Cityscapes-class workloads.) kpasted instances per image:k ∼ U{1, …, 5}(unchanged).- Source: intra-batch, via
torch.multinomialwith the diagonal masked (ADR-0008 §3). A futureInstanceBank(ADR-0009) will expand the source pool without changing the workload shape under this ADR.
Timer usage
Same as Part (i), with the CUDA-specific addition:
torch.cuda.synchronize() is emitted once after the warmup window and
once after the measurement window before collecting per-iter nanoseconds.
Warmup at 200 calls (the CPU baseline uses 100) to let cuDNN autotune
settle; measurement at 1000 calls (the CPU baseline uses 800) to
absorb the larger per-iter variance from kernel-launch jitter.
JSON schema delta
The bench report adds two fields to the Part (i) schema without
bumping schema_version:
compile:bool— whethertorch.compile(mode='reduce-overhead', dynamic=True, fullgraph=True)was applied.max_memory_allocated_bytes:int— peak device memory reported bytorch.cuda.max_memory_allocated()during the measurement window.
CI lane and gating
- Workflow:
.github/workflows/bench-gpu.yml, triggered byworkflow_dispatchonly at M4. Runsbenchmarks/bench_batch_copy_paste.py --device cuda --compile --batch-size 8 --image-size 1024. The A100 SXM runner is self-hosted and manually provisioned per run. - Committed artifact:
benchmarks/baseline_gpu.json, initially an empty placeholder, populated on the first successful dispatch. - Gate.
benchmarks/compare.py --mode speedup-vs-cpu-baselinecomparesbaseline_gpu.json.median_nsagainst the CPUbaseline.json.median_nsand exits non-zero when the speedup is below2.0. The gate runs only on manual dispatch; PR-level GPU gating is deferred until a persistent self-hosted runner lands. - Memory cap. The bench asserts
max_memory_allocated_bytes < 40 × 2^30(40 GB) on the Cityscapes panopticB=8, 2048²fixture; failure exits the bench with code 3.
Compile-clean gate
Every PR runs scripts/compile_explain.py, which invokes
torch._dynamo.explain on a CPU trace of BatchCopyPaste.forward and
diffs the graph-break-reason list against scripts/compile_allowlist.txt.
The allow-list is empty at M4 and additions require an ADR-0008
amendment. This runs on ubuntu-latest; no GPU required.
Baseline refresh
Part (iii)'s refresh policy applies to baseline_gpu.json with two
substitutions: the runner pin is the self-hosted A100 SXM runner (not
ubuntu-latest), and the three-run acceptance criterion uses the
iters / warmup pair specified above.
Verification
uv run mkdocs build --strictpasses with this ADR in the nav.uv run python -m benchmarks.bench_copy_paste --device cpu --out /tmp/r.jsonruns end-to-end locally.uv run python -m benchmarks.compare --baseline benchmarks/baseline.json --current /tmp/r.jsonexits 0 on an unchanged workload.uv run pyrightanduv run ruff check .pass withbenchmarks/included in their scopes.
Status and supersession
- Accepted when this file lands on
mainwithStatus: Acceptedin the header andbenchmarks/baseline.jsonis committed alongside it. - Any later decision that contradicts Part (i)–(iii) requires a new
ADR (
ADR-000N) that explicitly supersedes this one, updating this file'sStatustoSuperseded by ADR-000N.