Hub Regression Performance Report (2026-04-23)
Accompanies PR #198 (hub dataset rename + robustness suite). Captures
accuracy, pose metrology, and latency for every hub-backed regression test
after migration to the renamed locus_v1_* subsets and inclusion of four new
single-tag robustness variants (tag16h5, high_iso, low_key,
raw_pipeline).
Environment
| Component | Value |
|---|---|
| CPU | AMD EPYC-Milan (4 cores / 8 threads) |
| L3 cache | 32 MiB |
| RAM | 32 GiB |
| Arch | x86_64 |
| Build profile | --release --all-features --features bench-internals |
| Threads | cargo test -- --test-threads=1 (sequential) |
Latencies are mean_total_ms per frame as reported by the harness stdout
(the same value that is redacted as [DURATION] in the YAML snapshot for
stability). Accuracy figures come directly from the accepted .snap files.
1. Accuracy Baseline — tag36h11 across resolutions (high_accuracy profile)
| Resolution | Images | Recall | Precision | RMSE (px) | Repro RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|---|---|
| 640×480 | 50 | 86.00% | 100.00% | 0.2449 | 0.1993 | 0.0005 | 0.119 | 2.02 |
| 1280×720 | 50 | 90.00% | 100.00% | 0.1873 | 0.1696 | 0.0004 | 0.064 | 4.86 |
| 1920×1080 | 50 | 94.00% | 100.00% | 0.2029 | 0.1943 | 0.0004 | 0.054 | 11.41 |
| 3840×2160 | 50 | 94.00% | 100.00% | 0.1676 | 0.1506 | 0.0008 | 0.049 | 43.50 |
high_accuracy is used as the baseline because its 4-6× tighter pose
bounds catch pose-solver regressions that standard's wider thresholds
mask. The 6-14 points of recall ceded (vs standard's 100%) is the cost
of EdLines + no-sharpen; small-tag recall signal is instead picked up
by the robustness suite (§ 5), which stays on standard.
Latency scales roughly linearly with pixel count (11.4 ms × 4 → 43.5 ms
at 4K) as expected for a preprocessing-dominated pipeline. high_accuracy
is also ~2× faster than standard because EdLines is cheaper than
ContourRdp.
2. Pose-Mode Variant — tag36h11 1080p, standard profile
| Mode | Recall | RMSE (px) | Repro RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|
| Fast | 100.00% | 1.3403 | 4.8908 | 0.0073 | 2.002 | 18.70 |
The Fast-mode test stays on standard so that it contrasts cleanly against
the standard-profile Accurate metrics derivable from § 4 and the
robustness baseline. Fast mode preserves recall and pixel RMSE but its
pose accuracy is substantially worse than standard-Accurate (5×
reprojection error, 6× rotation error) at only ~13% latency saving —
consistent with the LM solver not dominating a 50-tag 1080p frame.
3. Refinement Variant — tag36h11 1080p, GWLF on standard
| Refinement | Recall | RMSE (px) | Repro RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|
| GWLF | 100.00% | 0.9891 | 0.8502 | 0.0032 | 0.067 | 15.08 |
GWLF-on-standard is retained as a refinement variant since it's the only
standard-profile configuration with tight rotation P50 — useful for
catching regressions in the covariance-consuming weighted LM path without
paying the high_accuracy recall tax. Counter-intuitively it's faster
than standard default (15.1 ms vs 21.4 ms) because the weighted solver
converges in fewer iterations when fed GWLF priors.
The high_accuracy-profile 1080p test was removed from this module — it
is now the baseline at § 1.
4. Quad-Extraction Variants
tag36h11 720p:
| Config | Recall | RMSE (px) | Repro RMSE (px) | Latency (ms) |
|---|---|---|---|---|
| EdLines + no refinement | 90.00% | 0.1714 | 0.1599 | 7.46 |
| EdLines + GWLF | 90.00% | 0.5794 | 0.5804 | 9.21 |
tag36h11 1080p:
| Config | Recall | RMSE (px) | Repro RMSE (px) | Latency (ms) |
|---|---|---|---|---|
| Default + moments culling | 100.00% | 1.3403 | 0.9294 | 18.20 |
| EdLines | 96.00% | 0.5861 | 0.5809 | 24.14 |
| EdLines + moments culling | 96.00% | 0.5861 | 0.5809 | 23.76 |
Moments culling shaves ~3 ms off the default 1080p run (18.2 vs 21.4 ms) without changing accuracy. Paired with EdLines it has no material effect — EdLines' own filtering already rejects the same candidates.
5. Robustness Suite — NEW
Single-tag subsets stressing the detector under conditions the golden
tag36h11 rendering does not exercise. All at 1920×1080, Accurate mode,
standard profile — the robustness suite stays on standard precisely
because it depends on the recall path that high_accuracy trades away.
| Subset | Family | Images | Recall | Precision | RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|---|---|
tag16h5 |
AprilTag16h5 | 100 | 100.00% | 27.64% | 1.0799 | 0.0057 | 0.463 | 16.96 |
high_iso |
AprilTag36h11 | 50 | 100.00% | 100.00% | 1.3361 | 0.0035 | 0.309 | 21.95 |
low_key |
AprilTag36h11 | 50 | 10.00% | 100.00% | 0.2578 | 0.0029 | 0.459 | 15.24 |
raw_pipeline |
AprilTag36h11 | 50 | 58.00% | 100.00% | 0.7594 | 0.0013 | 0.389 | 17.40 |
Observations (baselining new behaviour — each row is a forward-looking watchlist, not a regression):
tag16h5— precision cliff. 100% recall but only 27.6% precision. The 16h5 family has a dense codebook (30 codes in 16 bits) and rampant false-positive decodes are the dominant failure mode. This is a known property of the family rather than a detector bug; any future detector change that moves 16h5 precision should be treated as a regression if it drops, and a win if it climbs.low_key— 10% recall. Intentional: the dataset simulates extreme low-dynamic-range captures. Current adaptive thresholding leaves most tags buried in the black level. Surfacing this at 10% gives us a clear KPI for future contrast-robust threshold work.raw_pipeline— 58% recall. Raw-like pipeline variant exposes the detector to less-processed sensor output. Useful middle-ground target.high_iso— no regression. The detector's demosaic/denoise chain appears robust to synthetic high-noise captures (100% recall matches cleantag36h11_1920x1080).
5.1 Configuration-only tuned variants (follow-up)
Each subset above was re-run with a bespoke JSON profile (embedded at
crates/locus-core/tests/fixtures/robustness/<subset>_tuned.json) to
quantify how far the existing DetectorConfig knob set can move each KPI
without algorithmic changes. The baselines are retained as the KPI
watchlist; these rows are additive.
| Subset | Baseline | Tuned | Δ | Knobs that moved the number |
|---|---|---|---|---|
tag16h5 |
R 100% / P 27.64% / 16.96 ms | R 100% / P 68.85% / 16.78 ms | Precision ×2.5 at zero recall cost | decoder.max_hamming_error: 2→1, decoder.min_contrast: 20→30, quad.min_edge_score: 4→8, quad.min_fill_ratio: 0.10→0.18 |
low_key |
R 10.00% / P 100% / 15.24 ms | R 22.00% / P 100% / 28.08 ms | Recall ×2.2, latency ×1.8 | threshold.tile_size: 8→2, threshold.min_range: 10→0, threshold.constant: 0→-20 |
raw_pipeline |
R 58.00% / P 100% / 17.40 ms | R 60.00% / P 100% / 19.05 ms | Recall +2 pts (marginal) | quad.min_area: 16→8, quad.min_edge_length: 4→2, quad.min_edge_score: 4→1 |
Findings:
tag16h5is the clear configuration win. Halving the allowed Hamming distance is the dominant lever; raising the decoder's contrast gate and the upstream quad edge-score/fill-ratio gates trim the remaining false-positive pool without sacrificing a single true positive. A+41.2precision-point lift at zero recall cost is the ceiling we could find; pushing further (min_edge_score: 10,min_fill_ratio: 0.22) recovers ~3 more precision points at the cost of 1-2 recall points — not a trade we took. The remaining ~31% of false positives are dense structural ambiguities in the 16h5 codebook itself and are no longer configuration-bound.low_keyhas a configuration ceiling at 22% recall. A finer adaptive-threshold grid (tile_size=2) with zeromin_rangeand a strongly negativeconstantunlocks the tags buried in the low-DR black level. A full sweep (tile_size ∈ {2,4,8,16},min_range ∈ {0,2,5,10},constant ∈ {0,-3,-5,-10,-15,-20,-30,-50}, plus decoder and quad-gate relaxation, plus structure-tensor radius and adaptive window) converges to the same 22% ceiling. Confirms the report's framing: this subset is the KPI for future contrast-robust threshold work.raw_pipelineis largely config-bound at its baseline. Threshold and decoder relaxation do not move the needle; only quad-gate relaxation adds a marginal +2 recall points.EdLinesextraction gives the same recall at materially better RMSE (0.33 vs 0.76 px) but we did not swap the default in this batch because the suite's baseline isContourRdpand switching extractors is a separate regression axis. The remaining 40% are likely noise-fragmented contours that would require a pre-filtering pass to recover.
6. Distortion — AprilGrid
All 1920×1080, 50 images per subset, AprilTag36h11, Accurate pose mode. Pose-mode coverage lives in § 2; this suite isolates the undistortion path.
| Model | Recall | Precision | RMSE (px) | Repro RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|---|
| Brown–Conrady | 92.93% | 99.43% | 1.0947 | 1.1336 | 0.0634 | 2.151 | 26.52 |
| Kannala–Brandt | 81.30% | 98.97% | 1.5204 | 3.0482 | 0.0926 | 7.313 | 24.76 |
Kannala–Brandt (fisheye) recall sits ~11 points below Brown–Conrady at equivalent pixel RMSE, confirming that the wider field of view pushes a larger fraction of tags into the distortion-heavy periphery where the detector's undistorted assumptions cost yield.
7. Board-Level (reference — dataset unchanged)
Included for completeness. These snapshots pre-date PR #198 and were not regenerated; the board datasets themselves were not re-rendered.
| Board | Frames | Frames with board | Mean Trans Err (m) | Mean Rot Err (°) | Mean Coverage |
|---|---|---|---|---|---|
charuco_golden_v1_1920x1080 |
150 | 149 | 0.0081 | 1.929 | 99.0% |
aprilgrid_golden_v1_1920x1080 |
145 | 145 | 0.0108 | 2.035 | 95.0% |
Reproduction
TRACY_NO_INVARIANT_CHECK=1 \
LOCUS_HUB_DATASET_DIR=tests/data/hub_cache \
cargo test --release \
--test regression_render_tag \
--test regression_render_tag_robustness \
--test regression_distortion_hub \
--test regression_board_hub \
--features bench-internals \
-- --test-threads=1 --nocapture
The regression_render_tag_robustness suite includes both the baseline
watchlist runs (regression_hub_<subset>_1080p) and the configuration-
tuned follow-up runs (regression_hub_<subset>_1080p_tuned). The tuned
variants load their JSON profiles from
crates/locus-core/tests/fixtures/robustness/*_tuned.json.