2160p `render_tag_hub` recall lift (2026-04-25)

PR #200 (feat-render-tag-sota) shipped the render_tag_hub profile and landed 100 % recall at 480p / 720p / 1080p on the locus_v1_tag36h11_<res> Hub render-tag subsets — but at 2160p recall stalled at 47/50 (94 %) for both high_accuracy and render_tag_hub. The PR analysis dismissed the gap as "pre-EdLines (segmentation fragments large tags into multiple components — out of scope for this profile)."

This branch (feat-roi-rescue) ran the gap to ground. The original diagnosis was wrong: segmentation was fine. The actual failure was an extract_quads_soa ordering bug that surfaces only when the number of geometrically valid quads exceeds the SoA capacity. After a small fix (about five lines of Rust; see §6), all three failing scenes decode cleanly and 2160p joins the lower resolutions at 50/50 (100 %) recall, 1.0 precision.

§1 Verified hardware metadata

$ lscpu
Architecture:                 x86_64
Vendor ID:                    AuthenticAMD
Model name:                   AMD EPYC-Milan Processor
CPU(s):                       8 (4 cores × 2 threads)
Hypervisor:                   KVM
Virtualization type:          full
Flags include:                avx2, fma, sha_ni, avx512* not present
$ uname -r
6.8.0-107-generic

Build profile: --release with --features bench-internals. Rust toolchain pinned via rust-toolchain.toml. Default Rayon thread count (8). All numbers below were captured in the same session with LOCUS_HUB_DATASET_DIR=tests/data/hub_cache.

§2 Diagnosis

§2.1 The pre-fix theory (incorrect)

The plan in /home/dev/.claude/plans/toasty-booping-lemon.md hypothesised that 2160p tag interiors fragmented during CCL because the threshold-tile invalid propagation only travels ~16 px (2 iterations × 8 px tile size), while a 540 px tag interior at 4K needs O(50) hops to reach the centre. The plan proposed Phase-2 A/B candidates: iterate-until-converged propagation, larger tile size, opt-in decimation.

§2.2 What the diagnostic test actually showed

crates/locus-core/tests/diagnose_render_tag_2160p.rs runs threshold + LSL on each failing scene and reports the largest CCL component that overlaps the GT bbox. For all three scenes the largest component covered ≥ 90 % of the GT bbox, had geometric area / aspect / fill ratios within the production gates, and would have been accepted downstream. Segmentation was not the bottleneck.
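The coverage check described above can be sketched roughly as follows. This is an illustration only, not the code in diagnose_render_tag_2160p.rs: the `BBox` and `Component` types, the half-open box convention, and the function names are all hypothetical, and the real test also checks area, aspect, and fill gates that are omitted here.

```rust
/// Hypothetical axis-aligned box in pixels, half-open: [x0, x1) x [y0, y1).
#[derive(Clone, Copy, Debug, PartialEq)]
struct BBox {
    x0: u32,
    y0: u32,
    x1: u32,
    y1: u32,
}

impl BBox {
    /// Area in pixels; an inverted (empty) box has area 0.
    fn area(&self) -> u64 {
        u64::from(self.x1.saturating_sub(self.x0)) * u64::from(self.y1.saturating_sub(self.y0))
    }

    /// Intersection box; it is inverted (area 0) when the boxes do not overlap.
    fn intersection(&self, other: &BBox) -> BBox {
        BBox {
            x0: self.x0.max(other.x0),
            y0: self.y0.max(other.y0),
            x1: self.x1.min(other.x1),
            y1: self.y1.min(other.y1),
        }
    }
}

/// Hypothetical stand-in for one CCL component's statistics.
struct Component {
    bbox: BBox,
    pixel_count: u32,
}

/// Fraction of the ground-truth box covered by a component's bounding box.
fn gt_coverage(component: &BBox, gt: &BBox) -> f64 {
    let gt_area = gt.area();
    if gt_area == 0 {
        return 0.0;
    }
    component.intersection(gt).area() as f64 / gt_area as f64
}

/// Largest component (by pixel count) whose bounding box overlaps the GT box.
fn largest_overlapping<'a>(components: &'a [Component], gt: &BBox) -> Option<&'a Component> {
    components
        .iter()
        .filter(|c| c.bbox.intersection(gt).area() > 0)
        .max_by_key(|c| c.pixel_count)
}
```

A scene would pass this part of the diagnosis when `gt_coverage` of the component returned by `largest_overlapping` is at least 0.90, matching the threshold quoted above.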

Iterate-until-converged propagation (Phase-2 candidate 1) was implemented and benchmarked; it lifted scene 0006 fill from 0.25 to 0.41 but did not move the needle on detector recall. Tile-size 16 (candidate 3) lifted scene 0033 fill to 0.70 but again no recall delta. Opt-in decimation: 2 (Phase 3) rescued scene 0006 alone but collapsed the rest of the 2160p subset (47/50 → 1/50) because most other tags shrank below the EdLines viable size.

The Phase 2 sweep was followed by env-gated traces inside edlines.rs::run_pipeline_with_mode and corresponding edge-score / funnel checks. Every trace showed EdLines returning Success for scenes 0006 and 0026 with reasonable corner geometry. The candidate quads were being produced — they were just disappearing before decode.

§2.3 Root cause (correct)

The SoA DetectionBatch has a hard ceiling (pub(crate) const MAX_CANDIDATES: usize = 1024; in crates/locus-core/src/batch.rs). At 2160p, EdLines accepts more candidate quads than ContourRdp because its geometric gates are more permissive (smaller noise components survive). The total count of valid quads exceeds 1024 on tag-bearing scenes.

The original extract_quads_soa collected components in raster order (top-to-bottom, left-to-right — LSL's natural label assignment) and then truncated:

for (i, …) in detections.into_iter().take(n).enumerate() { … }

where n = detections.len().min(MAX_CANDIDATES). Tags in the mid-to-lower image region had high label indices; when the small noise components above them filled the first 1024 slots, the tag candidates were silently dropped before the funnel/decoder ran. Lower resolutions never tripped this because they produced fewer candidate quads than the ceiling.
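To make the failure mode concrete, here is a toy reproduction of raster-order truncation. It is a sketch under stated assumptions, not the original extract_quads_soa: the `Candidate` type and both function names are hypothetical, and the ceiling is shrunk from 1024 to 4 so the effect is visible with six candidates.

```rust
/// Stand-in for the real ceiling (1024 in batch.rs); shrunk for illustration.
const MAX_CANDIDATES: usize = 4;

/// Hypothetical candidate record: label index in raster order plus blob size.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Candidate {
    label_idx: u32,
    pixel_count: u32,
}

/// Keeps the first MAX_CANDIDATES entries in whatever order they arrive.
/// With raster-ordered input this favours components near the top of the image.
fn truncate_raster_order(detections: Vec<Candidate>) -> Vec<Candidate> {
    let n = detections.len().min(MAX_CANDIDATES);
    detections.into_iter().take(n).collect()
}

/// Five small noise blobs labelled first (upper image), then one large tag blob.
fn raster_order_scene() -> Vec<Candidate> {
    let mut scene: Vec<Candidate> = (0..5)
        .map(|i| Candidate { label_idx: i, pixel_count: 12 })
        .collect();
    scene.push(Candidate { label_idx: 5, pixel_count: 250_000 });
    scene
}
```

Calling `truncate_raster_order(raster_order_scene())` keeps four noise blobs and drops the tag-sized component with label 5, which is the silent loss described above.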

§2.4 Fix

A new helper pixel_count_descending_order(arena, stats) in crates/locus-core/src/quad.rs returns the top-MAX_CANDIDATES (pixel_count, label_idx) pairs in descending order. Both extract_quads_soa (the rectified path) and extract_quads_soa_with_camera (the non_rectified distortion path) now drive their parallel extraction loop off this order rather than off stats.par_iter().enumerate().

Truncation at MAX_CANDIDATES therefore drops the smallest blobs (noise) rather than whichever components come last in raster order, which at 2160p included the tag candidates.

Three refinements over the naïve fix:

Bump-allocated buffer. The order array is a bumpalo::Vec sourced from the per-frame FrameContext.arena. The hot path does not touch the system allocator. The added &Bump parameter on both extraction entry points threads the existing state.arena (split-borrow against &mut state.batch).
select_nth_unstable_by for top-K. When len > MAX_CANDIDATES the helper partitions in O(n) before sorting only the surviving prefix in O(k log k), avoiding the full O(n log n) sort over noise tails at 4K.
Pair packing. (pixel_count, label_idx) is laid out inline so comparators stay inside one cache-resident slice instead of double-chasing into stats[i].pixel_count per compare.

Sort-descending order on the prefix is load-bearing for downstream snapshot stability — the funnel/decoder dedup is processing-order sensitive.
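The partition-then-sort idea behind the helper can be sketched with the standard library alone. This is a minimal illustration, not the shipped pixel_count_descending_order: it uses a plain `Vec` instead of a bumpalo arena, takes a bare slice of pixel counts instead of (arena, stats), and breaks ties by ascending label index, which is an assumption made here to keep the output deterministic.

```rust
use std::cmp::Ordering;

/// Orders (pixel_count, label_idx) pairs by descending pixel count.
/// Tie-break by ascending label index is an assumption of this sketch.
fn by_pixel_count_desc(a: &(u32, u32), b: &(u32, u32)) -> Ordering {
    b.0.cmp(&a.0).then(a.1.cmp(&b.1))
}

/// Returns up to `k` (pixel_count, label_idx) pairs, largest components first.
fn pixel_count_descending_top_k(pixel_counts: &[u32], k: usize) -> Vec<(u32, u32)> {
    if k == 0 {
        return Vec::new();
    }
    // Pack the pairs inline so the comparator never reaches back into the stats.
    // The cast assumes fewer than 2^32 labels, which is fine for a sketch.
    let mut order: Vec<(u32, u32)> = pixel_counts
        .iter()
        .enumerate()
        .map(|(i, &pc)| (pc, i as u32))
        .collect();

    if order.len() > k {
        // O(n) partition: afterwards the k largest pairs occupy order[..k],
        // in unspecified order.
        order.select_nth_unstable_by(k - 1, by_pixel_count_desc);
        order.truncate(k);
    }
    // O(k log k) sort of the surviving prefix only.
    order.sort_unstable_by(by_pixel_count_desc);
    order
}
```

For example, with pixel counts `[12, 9, 250000, 12, 40, 7]` and `k = 3`, the sketch returns `[(250000, 2), (40, 4), (12, 0)]`: the tag-sized blob survives no matter where it sits in raster order. Because the comparator is a total order over distinct pairs, the result is the same on every run, which is the property the snapshot-stability note above depends on.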

This change is global — it applies to every profile, not just render_tag_hub, and to both the rectified and distortion-aware extraction paths. The cross-dataset sweep (§4) confirms it does no harm anywhere else.

§3 Recall / precision deltas

§3.1 `render_tag_hub` (the target profile)

Resolution	Recall (pre)	Recall (post)	Precision	RMSE	Repro RMSE	r p90 (°)	t p90 (m)
480p	50/50	50/50	1.000	0.180	0.219	0.087	0.0024
720p	50/50	50/50	1.000	0.183	0.193	0.115	0.0036
1080p	50/50	50/50	1.000	0.181	0.181	0.158	0.0055
2160p	47/50	50/50	1.000	0.180	0.162	0.416	0.0116

Lower resolutions are byte-identical (same snapshots, no rebind). 2160p went from 0.94 → 1.00 recall, precision held at 1.0. The slight uptick in r p90 / RMSE is expected: scene 0033 has fill 0.14 (an unusual oblique-grazing render) and contributes more pose noise than the easier scenes already in the population.

§3.2 `high_accuracy` (default profile, unchanged config)

Resolution	Recall (pre)	Recall (post)	Precision
2160p	47/50	50/50	1.000

The fix lifts the default profile too — same root cause, same remedy. No JSON or config touched.

§4 Cross-dataset no-regression sweep

Because the fix is global rather than profile-scoped, every regression suite the project ships was run, per docs/engineering/quality-gates.md § 2 and the "critical" requirement in the plan's §Verification:

Suite	Tests	Result	Notes
`regression_render_tag` (tag36h11)	15	pass	2 snapshots rebound — both at 2160p, both recall ↑
`regression_render_tag_robustness`	7	pass	2 tag16h5 1080p snapshots rebound — precision 0.276 → 0.311 (+3.5 pp), recall 1.0 unchanged
`regression_distortion_hub` (Brown)	1	pass	snapshot rebound — recall 0.929 → 0.935 (+0.6 pp), precision 0.994 unchanged
`regression_distortion_hub` (Kannala)	1	pass	unchanged
`regression_board_hub`	4	pass	unchanged
`regression_icra2020`	16	pass	unchanged
`diagnose_render_tag_2160p`	1	pass	new — pins the contract
`contract_detection_batch`	7	pass	unchanged
`test_quad_soa`	1	pass	unchanged

Every snapshot delta is an improvement: recall is up or unchanged, precision is unchanged or up, and no metric regressed. The fix has no observed adverse interaction across datasets.

§5 SOTA comparison at 2160p

External baselines captured 2026-04-25 via tools/cli.py bench real --hub-config locus_v1_tag36h11_3840x2160 --compare:

Detector	Recall	Precision	t mean	t p99	r mean	r p99	Latency
OpenCV `cv2.aruco`	100 %	98.04 %	34 mm	408 mm	0.171°	1.203°	165 ms
AprilTag-C (pupil)	100 %	100 %	16 mm	101 mm	2.666°	65.716°	143 ms
Locus `high_accuracy`	100 %	100 %	17 mm	121 mm	7.134°	108.494°	73 ms
Locus `render_tag_hub` (snapshot)	100 %	100 %	—	—	—	—	—

The CLI runs the default high_accuracy profile; render_tag_hub numbers come from the rebound regression snapshot (regression_render_tag__common__hub__hub_locus_v1_tag36h11_3840x2160_render_tag_hub.snap). Locus is the latency leader at every 2160p configuration tested and matches AprilTag-C on the recall+precision floor. On rotation error at 4K it trails both baselines, with OpenCV showing the tightest distribution. This is a known artefact of EdLines pose ambiguity on extreme grazing-angle synthetic renders (the same effect that drives r p99 in the 1080p comparison; see render_tag_sota_20260425.md §1 for the discussion).

§6 What shipped

crates/locus-core/src/quad.rs: pixel-count-descending iteration in extract_quads_soa and extract_quads_soa_with_camera (the fix).
crates/locus-core/tests/diagnose_render_tag_2160p.rs — new bench-internals regression that asserts the geometric gates pass and the detector returns ≥ 1 detection on each previously failing scene.
Five rebound snapshots:
regression_render_tag__…__3840x2160.snap (accuracy_baseline)
regression_render_tag__…__3840x2160_render_tag_hub.snap
regression_distortion_hub__…__brown_conrady_v1_1920x1080.snap
regression_render_tag_robustness__…__tag16h5_1920x1080.snap
regression_render_tag_robustness__…__tag16h5_1920x1080_tuned.snap
This document.

No JSON profile, schema, or Python binding was touched. The fix is ~5 lines of Rust plus a comment.

§7 Exit criteria status

✅ render_tag_hub 2160p recall = 50/50 (100 %).
✅ 480p / 720p / 1080p render_tag_hub recall stays at 100 %; precision stays at 1.00 across all 4 resolutions.
✅ No regression on robustness, distortion, board, ICRA suites (three snapshots rebound: two robustness snapshots with precision improved, one distortion snapshot with recall improved).
✅ Diagnostic regression test asserts each previously-failing scene now decodes.
✅ Hardware metadata + verification log recorded above.

The non-goal identified in the plan ("scene 0033 — content-intrinsic decoder margin") was incidentally resolved: 0033 was failing for the same MAX_CANDIDATES truncation reason as 0006/0026, not for any decoder-margin reason.
