SIMD CCL Fusion Benchmarks
Track: Segmentation (CCL): Defeating the Memory Wall (~40 ms) Date: March 19, 2026 Target: Replace scalar pixel-based Union-Find with a SIMD-accelerated Fused Run-Length Encoding (RLE) and Light-Speed Labeling (LSL) pipeline.
Results
Executed via Divan benchmarking framework on the segmentation_bench suite.
Timer precision: 59 ns
segmentation_bench fastest │ slowest │ median │ mean │ samples │ iters
├─ bench_segmentation_real_icra_threshold_model 8.458 ms │ 43.41 ms │ 8.92 ms │ 9.484 ms │ 100 │ 100
├─ bench_segmentation_real_icra_threshold_model_4k 12.52 ms │ 34.27 ms │ 13.26 ms │ 13.6 ms │ 100 │ 100
╰─ bench_segmentation_real_icra_threshold_model_1080p 3.372 ms │ 3.971 ms │ 3.539 ms │ 3.55 ms │ 100 │ 100
Analysis
- 1080p Performance: Segmentation takes just ~3.5 ms.
- 4K Performance: Segmentation takes ~13.2 ms.
- Impact: The pipeline completely defeated the memory wall. By processing vectors (up to 64 pixels at a time with AVX-512, 32 with AVX2) and extracting 1D segments directly from the threshold pass, the 2D connected components problem was reduced to a 1D run-merging problem. Latency dropped well below the original 40 ms target.