Benchmarking Lessons (consolidated, 2026-05-02)
This doc captures the durable learnings from two months of detector
optimization (Mar–Apr 2026). It supersedes
historical_evolution.md, sota_metrology_20260321.md,
edlines_gauss_newton_20260321.md, and three point-in-time progress
reports — what survives below is the part that should still inform
future work even after the surrounding numbers go stale.
Note (2026-05-03): Soft decode (
DecodeMode::Soft) was removed from the codebase entirely. The Soft-vs-Hard analysis below is kept as historical context — the lesson is preserved so the rationale for rejecting LLR-based decoding on real-camera data isn't relearned.Note (2026-05-15):
PoseEstimationMode(theFast/Accuratetoggle) was removed in the same spirit. Fast was latency-neutral onhigh_accuracy, ≤ 13 % faster onstandard, and 5–6× worse pose RMSE on every measured workload. The kwarg existed because a Fast / Accurate toggle felt like API "completeness" — but its only effect at the default value was to make pose worse. The underlying unweighted Huber LM still survives as an internal fallback when callers skip both the image view and external covariances (e.g. pure pose-only refits). Default-value artifacts of this kind get deleted, not preserved as configuration knobs.
For current numbers, see release_performance_20260418.md,
hub_regression_20260423.md, render_tag_sota_20260425.md,
render_tag_2160p_20260425.md, and quad_truncation_fix_20260426.md.
§1 Optimization timeline (Mar 2 → Apr 26)
Mar 02 55.94 ms/frame ── Initial baseline (decoding = 70 % of frame)
Mar 03 14.55 ms/frame ── SoA migration (3.8× total)
Mar 12 accuracy ── GWLF: 7× rotation improvement vs ContourRdp+ERF
Mar 16 0.063 ms/1024 ── DDA-SIMD decode: decode is no longer the bottleneck
Mar 19 ~3× CCL ── SIMD CCL Fusion (RLE + LSL): defeated the memory wall
Mar 21 accuracy ── EdLines + Joint Gauss-Newton (planarity-constrained corners)
Mar 21 presets ── Three SOTA profiles ship: high_accuracy / standard / grid
Apr 18 v0.3.1 ── All shipped profiles default to Hard decode (precision floor)
Apr 23 regressions ── Hub-rename + robustness suite (tag16h5 / low_key / raw_pipeline)
Apr 25 SOTA ── render_tag_hub profile beats AprilTag-C on every t-percentile
Apr 25 2160p fix ── pixel_count_descending_order — MAX_CANDIDATES truncation bug
Apr 26 correction ── Truncate post-filter, not pre-filter (distortion +6.5 pp)
Per-stage speedups that still hold (single-thread, 1080p):
| Stage | Before SIMD | After (Apr 18) |
|---|---|---|
| Thresholding | 4.26 ms | 1.51 ms (2.8×) |
| Segmentation (CCL) | ~9 ms scalar | 2.57 ms |
| Quad extraction | 34.35 ms | 23.79 ms |
| Decoding (1024 cands) | ~200 ms | 0.063 ms |
Decode is no longer the bottleneck; preprocessing dominates at every resolution above 720p. Future micro-optimization cycles should target thresholding and quad extraction first.
§2 Architecture-shaping decisions
These are the decisions that constrain future work. Violating any of them re-introduces a class of bug we've already paid for once.
§2.1 SoA DetectionBatch with hard MAX_CANDIDATES = 1024
Why it matters. Identity = index. corners[7], homographies[7],
ids[7], poses[7] belong to the same physical marker. Phase
isolation (Phase A/B/C/D) is enforced by
contract_detection_batch.rs. The L1-friendly layout is what made the
3.8× SoA migration win.
The 1024 ceiling is real. EdLines accepts more candidate quads
than ContourRdp (more permissive geometric gates → smaller noise
components survive). On dense scenes the candidate count exceeds
MAX_CANDIDATES. Two bugs already cost us 6 weeks of confusion:
- Pre-filter truncation by raster order dropped tag-bearing
candidates with high label indices when noise filled the first
1024 slots (2160p recall stuck at 47/50). Fix: sort by
pixel_countdescending. (render_tag_2160p_20260425.md) - Pre-filter truncation by pixel-count kept giant background blobs
that geometric gates would have rejected anyway, and starved the
downstream survivor set of true tag candidates (Brown–Conrady
recall stuck at 0.870). Fix: truncate post-filter; let Rayon's
ordered
collect()preserve descending order so the smallest survivors are the ones dropped. (quad_truncation_fix_20260426.md)
Forward rule: any change to candidate ordering or truncation must be regression-tested across all suites — render-tag is order-insensitive (clean PSF/Blender renders), but ICRA and distortion are order- and content-sensitive and will silently lose 2–3 pp recall.
§2.2 Zero-allocation hot loop
bumpalo::Bump per frame; reset (cursor → 0) at frame start. No
Vec::new, no Box::new, no HashMap::new in detect(). Stack
allocation (SmallVec, fixed [T; N]) only.
The extract_quads_soa API was over-engineered to take a &Bump
parameter that nobody read; the per-component allocator is
WORKSPACE_ARENA.with(...) per Rayon worker. Lesson: pass the
allocator at the level it's used, not "for symmetry".
§2.3 GIL release across the entire detect pass
The Python wrapper releases the GIL for the whole pipeline. Concurrent
detection of N frames scales linearly. Adding any Python callback
inside detect() would re-acquire the GIL and serialize all workers —
don't.
§2.4 Telemetry erasure
tracing deps must carry features = ["release_max_level_info"] in
both locus-core and locus-py. debug!/trace! spans are
compile-time erased. Without this, per-pixel debug spans cost
~5–10 % even when no subscriber is attached.
TELEMETRY_MODE env is mutually exclusive: tracy (high-fidelity GUI)
or json (structured dumps). Never both — the JSON serialization
pollutes Tracy's nanosecond ring buffers (the "Observer Effect").
§3 Profile philosophy: four shipped JSON presets
Detector settings live in crates/locus-core/profiles/*.json,
include_str!'d into the Rust crate and re-exposed to Python via
_shipped_profile_json. The JSON is authoritative; if a Rust
constant disagrees, the JSON wins.
The four presets target mutually incompatible scenarios:
| Profile | Target | Quad extract | Refinement | Sharpen | Decode | Notes |
|---|---|---|---|---|---|---|
high_accuracy |
Pose-precision SOTA (synthetic + lab metrology) | AdaptivePpb (ContourRdp+Erf low / EdLines+None high) + axis-imbalance gate | None on EdLines route | off | Hard | Beats AprilTag-C on every translation percentile; subsumes the former render_tag_hub profile (PR #2xx) |
standard |
Dense multi-tag production tracking | ContourRdp | Erf | on | Hard | v0.3.1+ ships Hard for 100 % precision (was Soft Mar 21) |
grid |
Touching-tag checkerboard | ContourRdp + 4-conn | Erf | off | Hard | 4-connectivity separates touching borders; sharpening creates halos |
max_recall_adaptive |
Production tracking with per-candidate PPB routing | AdaptivePpb (ContourRdp+Erf low / EdLines+None high) | Erf on low / None on high | on | Hard | Adds max_hamming_error=2 and post_decode_refinement=true for blurry / distant tags |
§3.1 Why four and not one
- HighAccuracy needs the lowest corner RMSE. EdLines + GN gives it (0.16–0.29 px on hub vs 0.99–1.15 px for production); AdaptivePpb routing falls back to ContourRdp+Erf for tags below ~2.5 PPB so ICRA-class data doesn't collapse inside EdLines, and the axis-imbalance gate rescues near-axis-aligned synthetic tags.
- Standard needs maximum recall without losing precision. ContourRdp + sharpening keeps recall at 100 % on dense scenes. Soft decode would add +19 pp recall on ICRA but causes a 10–22 % precision collapse when paired with EdLines (which surfaces more background-edge candidates) — so Standard ships ContourRdp + Hard.
- Grid has non-negotiable topological constraints (4-connectivity, relaxed contrast / edge-score gates) that hurt isolated-tag performance.
- MaxRecallAdaptive trades the two specialty pose-tail knobs
(χ² gate, σ_gate=0.5) for
max_hamming_error=2andpost_decode_refinement=trueto recover blurry / distant tags in real-camera tracking.
§3.2 Decode mode is profile-bound, not user-tunable
| Decode | Use with | Effect |
|---|---|---|
| Hard | Default everywhere | Standard threshold → bit majority |
| Soft | ContourRdp on synthetic ICRA-like data only | +19 pp recall, ~+15 % RMSE |
Soft + EdLines is forbidden by every shipped profile — the precision collapse is well-characterized and will recur if anyone re-enables it.
§4 Algorithm tradeoffs that should inform future tuning
§4.1 Refinement modes
GN corners from EdLines Phase 5 are already sub-pixel. With EdLines as the extractor, the post-refinement stage is counter-productive for raw RMSE:
| EdLines + refinement (720p hub) | Corner RMSE | Repro RMSE | Rot P50 | Use case |
|---|---|---|---|---|
None (GN direct) |
0.166 px | 0.541 | 0.27° | Best for metrology |
Erf |
0.592 px | 0.848 | 0.28° | (don't — degrades GN) |
Gwlf |
0.563 px | 0.564 | 0.06° | Best rotation stability |
Rule: with QuadExtractionMode::EdLines, default to
CornerRefinementMode::None. Use Gwlf only when angular stability
matters more than per-corner pixel error.
ERF was designed for integer ContourRdp corners — it expects to push
them to sub-pixel positions. Applied to already-sub-pixel GN
corners, it just adds noise. The Gwlf row improved over earlier
numbers (was 0.621 / 0.656 / 0.16°) once the quad-stage refit was
removed from the EdLines + Gwlf path: the gradient-peak refit between
EdLines GN and the detector-level GWLF pass was degrading the
initialization GWLF received.
Warm-start matters for ContourRdp + Gwlf — the opposite call.
ContourRdp's coarse corners are integer-precision midpoints. Without a
quad-stage warm-start, the detector-level GWLF pass converges to a
worse minimum and regresses mean RMSE by ~15 % and p90 rotation by
~210 % on the 1080p render-tag hub. Resolution:
refine_quad_corners in crates/locus-core/src/refinement.rs splits
on (extraction, refinement) and runs a gradient-peak warm-start only
on ContourRdp + Gwlf; EdLines + Gwlf passes through unrefined. The split is empirical, not symmetric, and codifies the
trade-off this section describes.
§4.2 GN trades corner RMSE for pose precision
Joint 8-DOF Gauss-Newton (Phase 5) optimizes all four corners simultaneously under the planarity constraint. Result: ~1 % worse corner RMSE, but 34–39 % better median rotation and 26–43 % better P90 rotation. Geometrically consistent corners → better poses, even if individual pixel positions move slightly from their locally-optimal locations.
§4.3 EdLines axis-aligned imbalance gate
EdLines's Phase 1 partitions the boundary at the topmost / rightmost / bottommost / leftmost extremals (TRBL). On a near-axis-aligned tag, two adjacent corners can collapse onto the same TRBL extremal — lumping their shared edge into one arc and compressing the opposite to near-zero. Phases 2–5 fit a wrong-but-validation-passing quad and the decoder rejects it on Hamming margin.
The fix:
DetectorConfig.edlines_imbalance_gate: EdLinesImbalanceGatePolicy
(default Disabled). When Enabled, AXIS-mode arcs that hit > 40 % / < 16 %
imbalance divert to DIAG-mode (NW/NE/SE/SW extremals) which gives
clean four-way partitioning on axis-aligned tags.
Why the high_accuracy profile can leave the gate on. The
upstream guard at crates/locus-core/src/detector.rs:194-198
rejects EdLines under any non-trivial distortion, routing distorted
inputs through ContourRdp. So the gate is unreachable on the
distortion-bearing inputs (Brown–Conrady aprilgrid sub-tags in the
8–15 % min-arc band) where it would have collapsed legitimate
detections. The previous "opt-in" reasoning predates that guard;
post-guard, the gate is safe wherever EdLines runs.
§4.4 PPB (pixels per bit) sets the recall floor
| PPB | Symptom | Recommendation |
|---|---|---|
| > 5.0 | Clean | high_accuracy is fine |
| 1.5–5.0 | Marginal | Sharpening + EdLines OK |
| 1.2–1.5 | EdLines collapses to < 1 % recall | Use ContourRdp |
| < 1.2 | Sub-pixel bit boundaries | Hard limit — Soft decode + ContourRdp + sharpening best we can do |
ICRA forward has ~18 % of tags below 1.2 PPB; circle has
thousands below 1.0 PPB (some at 0.03 PPB). This is why
high_accuracy reports 46.3 % recall on ICRA but 95.6 % on hub
1080p. Real-world robotics tracking should default to standard,
not high_accuracy.
§4.5 Pruned algorithms (don't reintroduce)
- Bilateral filtering distorts the PSF; the GN solver's weighted least-squares already gives better noise rejection without geometric distortion. Removed Apr 11.
- GridFit corner refinement assumes a rigid 2D grid template and is mathematically incompatible with edge-driven GN. Removed Apr 11.
- Iterate-until-converged threshold-tile invalid propagation,
larger threshold tile sizes, opt-in 2× decimation were tried as
fixes for the 2160p recall gap. None moved the needle on detector
recall — the gap was the SoA truncation bug, not segmentation.
Documented in
render_tag_2160p_20260425.md§2.
§4.6 AdaptivePpb gracefully falls back on distortion
EdLines is geometrically incompatible with distorted intrinsics (Huber IRLS line fit + GN solver assume Euclidean pixel geometry). Two guards work together to keep this safe:
- Static
EdLines+ distortion: the run-pipeline guard atdetector.rs:194-198errors withEdLinesUnsupportedWithDistortion. This is the user-explicit-misconfiguration signal — they asked for EdLines on a distorted frame, we tell them. - AdaptivePpb with
EdLineshigh route + distortion: the per-tag distortion extraction path (extract_single_quad_with_camera) silently degrades to ContourRdp, paired with the low-route's refinement (typically Erf) rather than the high-route's refinement. The high-PPB route'sNonerefinement would skip sub-pixel refinement entirely on aprilgrid sub-tags, costing ~5 pp recall on the Brown-Conrady regression suite.
The graceful-fallback for AdaptivePpb means a single profile (e.g.
max_recall_adaptive, currently bound to rectified-only tests) can
load on either rectified or distorted frames without per-camera
profile switching. Static EdLines stays explicit-error so the
misconfiguration doesn't silently degrade.
§5 Benchmark methodology traps
Things that look like algorithmic regressions but aren't:
§5.1 Pose convention misalignment between detectors
| Detector | Convention | Fix in tools/bench/utils.py |
|---|---|---|
AprilTag-C (pupil_apriltags) |
Tag frame 180° rotated about z vs Hub GT and Locus | R @ diag(-1, -1, 1) |
OpenCV cv2.aruco |
solvePnP(SOLVEPNP_ITERATIVE) with y-down centered object points |
Wrap with explicit object points; never use IPPE_SQUARE (picks an inconsistent branch on ~half of symmetric synthetic tags) |
| Locus | Center-origin pose (centered_tag_corners) |
— |
Without these wrapper fixes, rotation error reads ~180° for AprilTag-C and OpenCV even when translation is correct.
§5.2 Cargo integration test CWD
Cargo runs each integration test with CWD = package root, not
workspace root. Always resolve dataset paths via CARGO_MANIFEST_DIR,
never bare canonicalize. The 2026-04-26 ICRA "byte-identical"
errata was caused by tests/common::resolve_dataset_root falling
through to a 1-frame stub when LOCUS_ICRA_DATASET_DIR resolved
relative to a wrong CWD — masking a real 2–3 pp regression.
§5.3 Dataset priority (when results conflict)
- render-tag (Hub, PSF/Blender) — production-relevant, treat as ground truth.
- distortion (Brown–Conrady, Kannala–Brandt) — production-relevant for non-pinhole optics.
- ICRA forward / checkerboard — community research benchmark; useful regression signal, but its synthetic imaging differs from real cameras (no PSF, hard pixel edges), so never trade render-tag pose tail or mean RMSE for ICRA recall gains.
- robustness (tag16h5, low_key, high_iso, raw_pipeline) — KPI watchlist, not a regression bar.
§5.4 Hardware metadata is mandatory
Every performance report must include lscpu / /proc/cpuinfo
output captured in the same session as the benchmark. State the
build profile (--release), thread count, and RAYON_NUM_THREADS.
Placeholder hardware specs (e.g. "Intel CPU" without verification)
are treated as a quality-gate failure.
§5.5 Soft decode is a footgun outside specific scenarios
| Combo | Result |
|---|---|
| ContourRdp + Soft + ICRA | +19 pp recall, precision holds at 100 % |
| EdLines + Soft + hub | Precision collapses to 78–90 % (more candidates surfaced from background edges) |
| Anything + Soft + render-tag PSF | Marginal at best; EdLines+sharp+Hard is already at 100 % recall |
§6 KPI watchlist (set by regression_render_tag_robustness)
These four 1080p single-tag subsets stress the detector under conditions the golden render-tag suite doesn't exercise. Treat each as a forward-looking watchlist, not a regression bar.
| Subset | Family | Baseline | Tuned ceiling | Knob |
|---|---|---|---|---|
tag16h5 |
AprilTag16h5 | R 100 % / P 27.6 % | P 68.9 % | decoder.max_hamming_error: 2→1 is the dominant lever |
high_iso |
AprilTag36h11 | R 100 % | — | No regression — high-noise demosaic is fine |
low_key |
AprilTag36h11 | R 10.0 % | R 22.0 % | threshold.tile_size: 8→2 + neg constant — 22 % is the config ceiling, future contrast-robust threshold work needed |
raw_pipeline |
AprilTag36h11 | R 58.0 % | R 60.0 % | Largely config-bound; remaining 40 % is noise-fragmented contours |
Movement on tag16h5 precision (up = win, down = regression) and
low_key recall (up = win, down = regression) is interesting in
either direction. Movement on the others toward the baseline is a
regression.
§7 ICRA 2020 community-benchmark fixture
The ICRA 2020 fiducial-tag benchmark is a long-standing research
dataset used widely in the AprilTag / fiducial-marker literature. Its
synthesis pipeline produces hard pixel edges and no PSF (different
imaging characteristics from real-camera and Blender-rendered data),
which is fine for the role it plays — head-to-head detector
comparison on a shared, reproducible reference. Its
forward/pure_tags_images subset has ~18 % of tags below 1.2 PPB
(sub-pixel bit boundaries), so the production-default recall on this
subset is ~74 % — the rest are structurally unrecoverable on Hard
decode without changing the decoder's noise model.
Soft decode bridges the gap on this dataset's imaging characteristics
specifically: it treats each bit as an LLR and accepts ambiguous
decodes that Hard would reject. The trade-off (precision falls on
real-camera data with PSF) is documented in §3.2 / §5.5, so the
tuning lives in a test-only fixture at
crates/locus-core/tests/fixtures/icra_synthetic.json rather than as
a shipped profile. Production users get Hard decode by default;
researchers reproducing literature numbers on ICRA 2020 can opt into
the fixture-driven Soft pathway.
The regression_icra_forward_synthetic test records the published
recall using the winning combination from the original sweep:
Soft decode + threshold tile_size=4. Recall = 0.9403, with
~6 fewer false positives per worst-case scene than Soft alone.
Sweep findings (from the original 4-variant exploration, kept here as documentation rather than as a permanent matrix of fixtures):
- Soft decode is the only recall lever. ContourRdp + Hard sits at
0.738 on
forward/pure_tags_images; Soft lifts it to 0.940 (+20.2 pp). The remaining ~6 pp miss is structural — those tags are below 1.2 PPB and decode-failed at the bit level. min_area=20has zero ICRA effect. Neither recall nor FP rate changed in the sweep. Negative result: this knob is not a useful ICRA lever.tile_size=4is a precision win on top of Soft. Same recall as Soft alone, but ~6 fewer FPs per worst-case scene. Finer threshold tiles let the gate reject background-edge candidates that Soft's probabilistic decode would otherwise call. This is what the shipped fixture uses.- The published number for community comparison is recall = 0.9403
from
regression_icra_forward_synthetic.
Why this fixture is not a shipped profile. Soft decode causes a
10-22 % precision collapse on data with PSF (real cameras, Blender-
rendered hub data) per §5.5. ICRA 2020's imaging characteristics don't
trigger that collapse because there's no PSF to amplify the LLR's
spurious-decode candidates — but production users almost always have
PSF and would lose precision. Keep the fixture in tests/fixtures/
so researchers can opt in for community comparison; never promote it
to crates/locus-core/profiles/.
§8 Where to look next
- Preprocessing dominates above 720p. Thresholding + integral image is the next micro-optimization target.
low_keyrecall ceiling at 22 %. Future contrast-robust threshold work — anything beyond histogram-equalization-style preprocessing.tag16h5precision floor at 27 %. Codebook-aware FP rejection (the remaining FPs are dense structural ambiguities in the 16h5 dictionary itself, not detector bugs).raw_pipelinerecall at 58 %. Pre-filtering pass for noise- fragmented contours.- Rotation-tail residual on 1080p hub. Post PR #216 we lead OpenCV on rotation p99 (0.77° vs 1.23°). The remaining residual is EdLines pose ambiguity on grazing-angle synthetic renders; further headroom comes from the diagnostic-driven Phase 1 work (corner outlier / branch-flip dominance analysis).