Hub Regression Performance Report (2026-04-23)

Accompanies PR #198 (hub dataset rename + robustness suite). Captures accuracy, pose metrology, and latency for every hub-backed regression test after migration to the renamed locus_v1_* subsets and inclusion of four new single-tag robustness variants (tag16h5, high_iso, low_key, raw_pipeline).

Environment

| Component | Value |
|---|---|
| CPU | AMD EPYC-Milan (4 cores / 8 threads) |
| L3 cache | 32 MiB |
| RAM | 32 GiB |
| Arch | x86_64 |
| Build profile | --release --all-features --features bench-internals |
| Threads | cargo test -- --test-threads=1 (sequential) |

Latencies are mean_total_ms per frame as reported by the harness stdout (the same value that is redacted as [DURATION] in the YAML snapshot for stability). Accuracy figures come directly from the accepted .snap files.

1. Accuracy Baseline — tag36h11 across resolutions (high_accuracy profile)

| Resolution | Images | Recall | Precision | RMSE (px) | Repro RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|---|---|
| 640×480 | 50 | 86.00% | 100.00% | 0.2449 | 0.1993 | 0.0005 | 0.119 | 2.02 |
| 1280×720 | 50 | 90.00% | 100.00% | 0.1873 | 0.1696 | 0.0004 | 0.064 | 4.86 |
| 1920×1080 | 50 | 94.00% | 100.00% | 0.2029 | 0.1943 | 0.0004 | 0.054 | 11.41 |
| 3840×2160 | 50 | 94.00% | 100.00% | 0.1676 | 0.1506 | 0.0008 | 0.049 | 43.50 |

high_accuracy is used as the baseline because its 4-6× tighter pose bounds catch pose-solver regressions that standard's wider thresholds mask. The 6-14 points of recall ceded (vs standard's 100%) are the cost of EdLines + no-sharpen; the small-tag recall signal is instead picked up by the robustness suite (§ 5), which stays on standard.

Latency scales roughly linearly with pixel count (going from 1080p to 4K quadruples the pixels and takes latency from 11.4 ms to 43.5 ms, a 3.8× increase), as expected for a preprocessing-dominated pipeline. high_accuracy is also ~2× faster than standard because EdLines is cheaper than ContourRdp.
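The scaling claim can be checked directly against the § 1 table. The sketch below normalises each latency by the frame's megapixel count; the figures are copied from the table, while the variable and function names are illustrative only.

```python
# Latency per megapixel for the § 1 baseline rows (values copied from the table).
rows = {
    (640, 480): 2.02,
    (1280, 720): 4.86,
    (1920, 1080): 11.41,
    (3840, 2160): 43.50,
}

def ms_per_megapixel(width, height, latency_ms):
    """Return latency normalised by pixel count, in ms per megapixel."""
    return latency_ms / (width * height / 1e6)

per_mp = {res: ms_per_megapixel(*res, ms) for res, ms in rows.items()}
for (w, h), value in per_mp.items():
    print(f"{w}x{h}: {value:.2f} ms/MP")

# 4K has exactly 4x the pixels of 1080p; the latency ratio comes out slightly below 4.
ratio_4k_vs_1080p = rows[(3840, 2160)] / rows[(1920, 1080)]
print(f"4K / 1080p latency ratio: {ratio_4k_vs_1080p:.2f}")
```

The per-megapixel cost stays in a narrow band of roughly 5.2 to 6.6 ms across a 27× range of pixel counts, which is what "roughly linear" means here. The smallest frame is the most expensive per pixel, plausibly because fixed per-frame overhead weighs more at low resolution.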

2. Pose-Mode Variant — tag36h11 1080p, standard profile

| Mode | Recall | RMSE (px) | Repro RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|
| Fast | 100.00% | 1.3403 | 4.8908 | 0.0073 | 2.002 | 18.70 |

The Fast-mode test stays on standard so that it contrasts cleanly against the standard-profile Accurate metrics derivable from § 4 and the robustness baseline. Fast mode preserves recall and pixel RMSE but its pose accuracy is substantially worse than standard-Accurate (5× reprojection error, 6× rotation error) at only ~13% latency saving — consistent with the LM solver not dominating a 50-tag 1080p frame.
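The reprojection and latency ratios quoted above can be reproduced from numbers elsewhere in this report. The sketch assumes that the "Default + moments culling" row of § 4 is an acceptable stand-in for standard-Accurate reprojection error, and uses the 21.4 ms standard-default latency quoted in §§ 3 and 4. The clean standard-Accurate rotation P50 is not tabulated here, so the rotation ratio is not recomputed.

```python
# Figures copied from § 2 (Fast), § 4 (Default + moments culling) and the
# 21.4 ms standard-default latency quoted in the text of §§ 3 and 4.
fast_repro_rmse_px = 4.8908
accurate_repro_rmse_px = 0.9294  # assumption: § 4 default row stands in for standard-Accurate
fast_latency_ms = 18.70
standard_latency_ms = 21.4

repro_ratio = fast_repro_rmse_px / accurate_repro_rmse_px
latency_saving = 1.0 - fast_latency_ms / standard_latency_ms

print(f"reprojection error ratio: {repro_ratio:.2f}x")
print(f"latency saving: {latency_saving:.1%}")
```

Both results land on the rounded figures in the paragraph above: a bit over 5× the reprojection error for a latency saving just under 13%.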

3. Refinement Variant — tag36h11 1080p, GWLF on standard

| Refinement | Recall | RMSE (px) | Repro RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|
| GWLF | 100.00% | 0.9891 | 0.8502 | 0.0032 | 0.067 | 15.08 |

GWLF-on-standard is retained as a refinement variant since it's the only standard-profile configuration with tight rotation P50 — useful for catching regressions in the covariance-consuming weighted LM path without paying the high_accuracy recall tax. Counter-intuitively it's faster than standard default (15.1 ms vs 21.4 ms) because the weighted solver converges in fewer iterations when fed GWLF priors.

The high_accuracy-profile 1080p test was removed from this module — it is now the baseline at § 1.

4. Quad-Extraction Variants

tag36h11 720p:

| Config | Recall | RMSE (px) | Repro RMSE (px) | Latency (ms) |
|---|---|---|---|---|
| EdLines + no refinement | 90.00% | 0.1714 | 0.1599 | 7.46 |
| EdLines + GWLF | 90.00% | 0.5794 | 0.5804 | 9.21 |

tag36h11 1080p:

| Config | Recall | RMSE (px) | Repro RMSE (px) | Latency (ms) |
|---|---|---|---|---|
| Default + moments culling | 100.00% | 1.3403 | 0.9294 | 18.20 |
| EdLines | 96.00% | 0.5861 | 0.5809 | 24.14 |
| EdLines + moments culling | 96.00% | 0.5861 | 0.5809 | 23.76 |

Moments culling shaves ~3 ms off the default 1080p run (18.2 vs 21.4 ms) without changing accuracy. Paired with EdLines it has no material effect — EdLines' own filtering already rejects the same candidates.

5. Robustness Suite — NEW

Single-tag subsets stressing the detector under conditions the golden tag36h11 rendering does not exercise. All at 1920×1080, Accurate mode, standard profile — the robustness suite stays on standard precisely because it depends on the recall path that high_accuracy trades away.

| Subset | Family | Images | Recall | Precision | RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|---|---|
| tag16h5 | AprilTag16h5 | 100 | 100.00% | 27.64% | 1.0799 | 0.0057 | 0.463 | 16.96 |
| high_iso | AprilTag36h11 | 50 | 100.00% | 100.00% | 1.3361 | 0.0035 | 0.309 | 21.95 |
| low_key | AprilTag36h11 | 50 | 10.00% | 100.00% | 0.2578 | 0.0029 | 0.459 | 15.24 |
| raw_pipeline | AprilTag36h11 | 50 | 58.00% | 100.00% | 0.7594 | 0.0013 | 0.389 | 17.40 |

Observations (baselining new behaviour — each row is a forward-looking watchlist, not a regression):

  • tag16h5 — precision cliff. 100% recall but only 27.6% precision. The 16h5 family has a dense codebook (30 codes in 16 bits) and rampant false-positive decodes are the dominant failure mode. This is a known property of the family rather than a detector bug; any future detector change that moves 16h5 precision should be treated as a regression if it drops, and a win if it climbs.
  • low_key — 10% recall. Intentional: the dataset simulates extreme low-dynamic-range captures. Current adaptive thresholding leaves most tags buried in the black level. Surfacing this at 10% gives us a clear KPI for future contrast-robust threshold work.
  • raw_pipeline — 58% recall. Raw-like pipeline variant exposes the detector to less-processed sensor output. Useful middle-ground target.
  • high_iso — no regression. The detector's demosaic/denoise chain appears robust to synthetic high-noise captures (100% recall matches clean tag36h11_1920x1080).
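The tag16h5 precision figure is easier to picture as a false-positive count. The sketch below assumes one ground-truth tag per image (the subsets are described as single-tag), so that 100% recall on 100 images corresponds to 100 true positives. That assumption and the helper name are illustrative and are not taken from the harness.

```python
def implied_false_positives(true_positives, precision):
    """Solve precision = TP / (TP + FP) for FP."""
    return true_positives * (1.0 / precision - 1.0)

# Assumption: one ground-truth tag per image, so TP equals the image count at 100% recall.
tp_tag16h5 = 100
fp_baseline = implied_false_positives(tp_tag16h5, 0.2764)
print(f"tag16h5 baseline: ~{fp_baseline:.0f} false positives for {tp_tag16h5} true positives")
```

Under that assumption the baseline produces between two and three spurious decodes per image on average, which is why precision rather than recall is the KPI to watch for this subset.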

5.1 Configuration-only tuned variants (follow-up)

Each subset above was re-run with a bespoke JSON profile (embedded at crates/locus-core/tests/fixtures/robustness/<subset>_tuned.json) to quantify how far the existing DetectorConfig knob set can move each KPI without algorithmic changes. The baselines are retained as the KPI watchlist; these rows are additive.

| Subset | Baseline | Tuned | Δ | Knobs that moved the number |
|---|---|---|---|---|
| tag16h5 | R 100% / P 27.64% / 16.96 ms | R 100% / P 68.85% / 16.78 ms | Precision ×2.5 at zero recall cost | decoder.max_hamming_error: 2→1, decoder.min_contrast: 20→30, quad.min_edge_score: 4→8, quad.min_fill_ratio: 0.10→0.18 |
| low_key | R 10.00% / P 100% / 15.24 ms | R 22.00% / P 100% / 28.08 ms | Recall ×2.2, latency ×1.8 | threshold.tile_size: 8→2, threshold.min_range: 10→0, threshold.constant: 0→-20 |
| raw_pipeline | R 58.00% / P 100% / 17.40 ms | R 60.00% / P 100% / 19.05 ms | Recall +2 pts (marginal) | quad.min_area: 16→8, quad.min_edge_length: 4→2, quad.min_edge_score: 4→1 |
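To illustrate what a tuned profile might carry, the sketch below expands the dotted knob names from the tag16h5 row into a nested JSON document. The nesting is an assumption inferred from the dotted names; the actual schema of the *_tuned.json fixtures is not shown in this report and may differ.

```python
import json

# Tuned values from the tag16h5 row, keyed by dotted knob name.
tag16h5_tuned = {
    "decoder.max_hamming_error": 1,
    "decoder.min_contrast": 30,
    "quad.min_edge_score": 8,
    "quad.min_fill_ratio": 0.18,
}

def nest(flat):
    """Expand {"a.b": v} into {"a": {"b": v}} (hypothetical layout)."""
    nested = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested

profile = nest(tag16h5_tuned)
print(json.dumps(profile, indent=2))
```

Keeping the overrides as a flat mapping of dotted names makes the diff against the baseline easy to read, while the nested form is closer to how a serialised config struct is usually laid out.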

Findings:

  • tag16h5 is the clear configuration win. Halving the allowed Hamming distance is the dominant lever; raising the decoder's contrast gate and the upstream quad edge-score/fill-ratio gates trims the remaining false-positive pool without sacrificing a single true positive. A +41.2 precision-point lift at zero recall cost is the ceiling we could find; pushing further (min_edge_score: 10, min_fill_ratio: 0.22) recovers ~3 more precision points at the cost of 1-2 recall points, which is not a trade we took. The ~31% of detections that remain false positives stem from dense structural ambiguities in the 16h5 codebook itself and can no longer be addressed through configuration.
  • low_key has a configuration ceiling at 22% recall. A finer adaptive-threshold grid (tile_size=2) with zero min_range and a strongly negative constant unlocks the tags buried in the low-DR black level. A full sweep (tile_size ∈ {2,4,8,16}, min_range ∈ {0,2,5,10}, constant ∈ {0,-3,-5,-10,-15,-20,-30,-50}, plus decoder and quad-gate relaxation, plus structure-tensor radius and adaptive window) converges to the same 22% ceiling. This confirms the framing in § 5: this subset is the KPI for future contrast-robust threshold work.
  • raw_pipeline barely responds to configuration. Threshold and decoder relaxation do not move the needle; only quad-gate relaxation adds a marginal +2 recall points. EdLines extraction gives the same recall at materially better RMSE (0.33 vs 0.76 px), but we did not swap the default in this batch because the suite's baseline is ContourRdp and switching extractors is a separate regression axis. The 40% of tags still missed are likely noise-fragmented contours that would require a pre-filtering pass to recover.
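The threshold part of the low_key sweep can be enumerated as below. This is only a sketch of the grid: the values are those quoted in the finding, the dotted key names follow the § 5.1 table, and the decoder, quad-gate, structure-tensor and adaptive-window dimensions are left out because their sweep values are not listed in this report.

```python
from itertools import product

# Grid values quoted in the low_key finding.
tile_sizes = [2, 4, 8, 16]
min_ranges = [0, 2, 5, 10]
constants = [0, -3, -5, -10, -15, -20, -30, -50]

grid = [
    {"threshold.tile_size": t, "threshold.min_range": r, "threshold.constant": c}
    for t, r, c in product(tile_sizes, min_ranges, constants)
]
print(len(grid))  # 4 * 4 * 8 = 128 threshold combinations

# The tuned low_key profile is one point of this grid.
tuned = {"threshold.tile_size": 2, "threshold.min_range": 0, "threshold.constant": -20}
assert tuned in grid
```

With 128 threshold combinations all converging on the same 22% recall, the ceiling is unlikely to be an artefact of a coarse search over these three knobs.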

6. Distortion — AprilGrid

All 1920×1080, 50 images per subset, AprilTag36h11, Accurate pose mode. Pose-mode coverage lives in § 2; this suite isolates the undistortion path.

| Model | Recall | Precision | RMSE (px) | Repro RMSE (px) | Trans P50 (m) | Rot P50 (°) | Latency (ms) |
|---|---|---|---|---|---|---|---|
| Brown–Conrady | 92.93% | 99.43% | 1.0947 | 1.1336 | 0.0634 | 2.151 | 26.52 |
| Kannala–Brandt | 81.30% | 98.97% | 1.5204 | 3.0482 | 0.0926 | 7.313 | 24.76 |

Kannala–Brandt (fisheye) recall sits ~12 points below Brown–Conrady (81.30% vs 92.93%), with pixel RMSE about 0.4 px higher (1.52 vs 1.09 px). This is consistent with the wider field of view pushing a larger fraction of tags into the distortion-heavy periphery, where the detector's undistorted-image assumptions cost yield.
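To give a feel for why the periphery is harder under a fisheye model, the sketch below uses the ideal equidistant projection, which is the Kannala–Brandt model with all higher-order coefficients set to zero. The calibration of the dataset's cameras is not given in this report, so the numbers are purely illustrative and the function name is hypothetical.

```python
import math

def equidistant_radius(r_pinhole):
    """Image-plane radius under the ideal equidistant fisheye model, r_d = theta.

    r_pinhole is the normalised pinhole radius tan(theta); this is the
    Kannala-Brandt model with every higher-order coefficient set to zero.
    """
    return math.atan(r_pinhole)

for r in (0.1, 0.5, 1.0, 2.0):
    compression = 1.0 - equidistant_radius(r) / r
    print(f"r = {r:.1f}: radial compression {compression:.1%}")
```

Radial compression grows from well under 1% near the optical axis to more than 20% at 45° off-axis, so a tag of fixed physical size covers fewer pixels and has more strongly curved edges the farther out it sits.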

7. Board-Level (reference — dataset unchanged)

Included for completeness. These snapshots pre-date PR #198 and were not regenerated; the board datasets themselves were not re-rendered.

| Board | Frames | Frames with board | Mean Trans Err (m) | Mean Rot Err (°) | Mean Coverage |
|---|---|---|---|---|---|
| charuco_golden_v1_1920x1080 | 150 | 149 | 0.0081 | 1.929 | 99.0% |
| aprilgrid_golden_v1_1920x1080 | 145 | 145 | 0.0108 | 2.035 | 95.0% |

Reproduction

TRACY_NO_INVARIANT_CHECK=1 \
LOCUS_HUB_DATASET_DIR=tests/data/hub_cache \
cargo test --release \
  --test regression_render_tag \
  --test regression_render_tag_robustness \
  --test regression_distortion_hub \
  --test regression_board_hub \
  --features bench-internals \
  -- --test-threads=1 --nocapture

The regression_render_tag_robustness suite includes both the baseline watchlist runs (regression_hub_<subset>_1080p) and the configuration-tuned follow-up runs (regression_hub_<subset>_1080p_tuned). The tuned variants load their JSON profiles from crates/locus-core/tests/fixtures/robustness/*_tuned.json.