System Architecture
This document provides a high-level overview of the Locus system architecture, designed for high-performance fiducial marker detection. For deeper dives, see:
- Detection Pipeline — chronological data flow and stage-by-stage execution
- Memory Model — SoA layout, arena lifecycle, and zero-copy FFI
- Algorithms — mathematical foundations of each solver
High-Level Overview
Locus is built as a hybrid Rust/Python system. The core logic resides in a high-performance Rust crate (locus-core), which is exposed to Python via pyo3 bindings (locus-py). All operations are conducted through the Detector class, which manages persistent state and enforces strict zero-copy, GIL-free execution.
flowchart TD
User[User / Application] -->|Images| PyBindings["Python Bindings<br/>(locus-py)"]
PyBindings -->|PyReadonlyArray2| RustCore["Rust Core<br/>(locus-core)"]
subgraph RustCore
Pipeline["Detection Pipeline"]
Memory["Arena Memory<br/>(Bumpalo)"]
SIMD["SIMD Kernels<br/>(Multiversion)"]
end
RustCore -->|Detections| PyBindings
PyBindings -->|"DetectionBatch (Vectorized)"| User
Component Diagram
The system is structured around the Detector struct, which manages configuration and state.
classDiagram
class Detector {
-DetectorConfig config
-DetectorState state
-Vec~Box~TagDecoder~~ decoders
+detect(image) DetectionBatch
}
class DetectorConfig {
+usize threshold_tile_size
+bool enable_bilateral
+f64 quad_min_edge_score
+...
}
class TagDecoder {
<<interface>>
+decode(bits) Option~id, hamming~
+sample_points()
+rotated_codes()
}
class DecodingStrategy {
<<interface>>
+Code from_intensities(intensities, thresholds)
+u32 distance(code, target)
+decode(code, decoder)
}
class HardStrategy {
+Code = u64
}
class SoftStrategy {
+Code = SoftCode (stack-allocated)
}
class AprilTag36h11 {
+decode()
}
class ArUco4x4 {
+decode()
}
class Detection {
+u32 id
+Point center
+Point[4] corners
+Pose pose
}
Detector *-- DetectorConfig
Detector o-- TagDecoder
TagDecoder <|-- AprilTag36h11
TagDecoder <|-- ArUco4x4
Detector ..> Detection : Produces
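One way to read the DecodingStrategy interface in the diagram is as a trait with an associated `Code` type. The sketch below is an illustration under stated assumptions, not the locus-core implementation: the exact signatures, the `SoftCode` layout, the bit ordering, and the soft penalty rule are all hypothetical, and the diagram's `decode(code, decoder)` method is omitted.

```rust
/// Hypothetical mirror of the `DecodingStrategy` interface from the diagram.
trait DecodingStrategy {
    type Code;
    /// Build a code from sampled intensities and per-sample thresholds.
    fn from_intensities(intensities: &[f32], thresholds: &[f32]) -> Self::Code;
    /// Distance between an observed code and a dictionary entry.
    fn distance(code: &Self::Code, target: u64) -> u32;
}

/// Hard decisions: each sample becomes one bit, packed into a `u64`.
struct HardStrategy;

impl DecodingStrategy for HardStrategy {
    type Code = u64;

    fn from_intensities(intensities: &[f32], thresholds: &[f32]) -> u64 {
        // Assumes at most 64 samples; the first sample ends up as the most significant bit.
        let mut code = 0u64;
        for (&i, &t) in intensities.iter().zip(thresholds) {
            code = (code << 1) | u64::from(i > t);
        }
        code
    }

    fn distance(code: &u64, target: u64) -> u32 {
        // Plain Hamming distance.
        (code ^ target).count_ones()
    }
}

/// Soft decisions: keep the signed margin of every sample in a fixed-size,
/// stack-allocated buffer (assumed layout).
#[derive(Clone, Copy)]
struct SoftCode {
    margins: [f32; 64],
    len: usize,
}

struct SoftStrategy;

impl DecodingStrategy for SoftStrategy {
    type Code = SoftCode;

    fn from_intensities(intensities: &[f32], thresholds: &[f32]) -> SoftCode {
        let len = intensities.len().min(thresholds.len()).min(64);
        let mut margins = [0.0f32; 64];
        for k in 0..len {
            margins[k] = intensities[k] - thresholds[k];
        }
        SoftCode { margins, len }
    }

    fn distance(code: &SoftCode, target: u64) -> u32 {
        // Assumed penalty rule: a disagreeing bit costs |margin|, so
        // low-contrast mistakes are cheaper than confident ones.
        let mut penalty = 0.0f32;
        for k in 0..code.len {
            let expected_one = (target >> (code.len - 1 - k)) & 1 == 1;
            let observed_one = code.margins[k] > 0.0;
            if expected_one != observed_one {
                penalty += code.margins[k].abs();
            }
        }
        penalty.round() as u32
    }
}
```

Both strategies share the same call shape, so a pipeline could be written generically over `S: DecodingStrategy` and choose hard or soft decoding at compile time.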
Design Principles
- Encapsulated Facade: The `Detector` struct provides a single, robust entry point that owns all complex memory lifetimes (arenas, SoA batches), removing the cognitive burden of resource management from the user.
- Zero-Copy Integration: Utilizes the Python Buffer Protocol to access NumPy arrays directly. Python results are returned as a vectorized `DetectionBatch` dataclass containing zero-copy NumPy views of the internal SoA layout, maximizing throughput for downstream consumers.
- Thread Concurrency (GIL-Free): Releases the Python Global Interpreter Lock (GIL) during the heavy perception pipeline, allowing true multi-threaded execution and preventing blocking in concurrent Python applications.
- Arena Memory: Internal per-frame scratchpad (`bumpalo`) eliminates `malloc`/`free` overhead in the hot path. See Memory Model for details.
- Cache Locality: Algorithms (thresholding, CCL) process data in linear, cache-friendly passes.
- Runtime SIMD Dispatch: Uses `multiversion` to target AVX2, AVX-512, or NEON based on host CPU capabilities.
- Structure of Arrays (SoA) Layout: Built around the `DetectionBatch`, which eliminates L1 cache misses during math-heavy passes and ensures SIMD-alignment for all mathematical operations.
- Semantic Configuration: Employs a `DetectorBuilder` to provide a human-friendly API for pipeline tuning, abstracting 20+ fine-grained parameters into high-level semantic methods.
- Fast-Path Funnel: Implements a multi-stage rejection gate and sampling routine. It uses an O(1) contrast gate to reject background artifacts early, followed by SIMD-accelerated coordinate generation via Digital Differential Analyzer (DDA) and vectorized bilinear interpolation.
- Fast-Math Sampling: Rewrites homography projection and bilinear interpolation using hardware reciprocal approximation (`rcp_nr_v8`) and vectorized FMA instructions to minimize latency.
- Hybrid ROI Caching: Minimizes L1 cache misses by copying tag candidates into contiguous stack (small tags) or arena (large tags) buffers before sampling.
- Hybrid Parallelism: Scales via `rayon` for data-parallel tasks while maintaining sequential cache-coherence for state-heavy stages.
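To make the Structure of Arrays principle concrete, here is a minimal sketch that contrasts a per-detection record with a column-per-field batch. The type names, field names, and the corner-centroid pass are assumptions chosen for illustration; the real `DetectionBatch` layout is described in the Memory Model document and may differ.

```rust
/// Array-of-Structures record, shaped after the `Detection` class in the
/// component diagram (pose omitted for brevity). Hypothetical type.
#[derive(Clone, Copy, Debug)]
struct DetectionRecord {
    id: u32,
    center: [f32; 2],
    corners: [[f32; 2]; 4],
}

/// Structure-of-Arrays batch: one contiguous column per field, so a pass
/// over a single field touches only the memory it needs. Hypothetical type.
#[derive(Default)]
struct DetectionBatchSketch {
    ids: Vec<u32>,
    center_x: Vec<f32>,
    center_y: Vec<f32>,
    corner_x: [Vec<f32>; 4],
    corner_y: [Vec<f32>; 4],
}

impl DetectionBatchSketch {
    fn len(&self) -> usize {
        self.ids.len()
    }

    /// Reset for the next frame while keeping the allocated capacity.
    fn clear(&mut self) {
        self.ids.clear();
        self.center_x.clear();
        self.center_y.clear();
        for k in 0..4 {
            self.corner_x[k].clear();
            self.corner_y[k].clear();
        }
    }

    /// Scatter one record into the columns.
    fn push(&mut self, d: &DetectionRecord) {
        self.ids.push(d.id);
        self.center_x.push(d.center[0]);
        self.center_y.push(d.center[1]);
        for k in 0..4 {
            self.corner_x[k].push(d.corners[k][0]);
            self.corner_y[k].push(d.corners[k][1]);
        }
    }

    /// Example math pass: set each center to the centroid of its corners.
    /// Every load is a linear walk over a column, which is the access
    /// pattern that vectorizes well.
    fn recompute_centers(&mut self) {
        for i in 0..self.len() {
            self.center_x[i] = 0.25
                * (self.corner_x[0][i] + self.corner_x[1][i]
                    + self.corner_x[2][i] + self.corner_x[3][i]);
            self.center_y[i] = 0.25
                * (self.corner_y[0][i] + self.corner_y[1][i]
                    + self.corner_y[2][i] + self.corner_y[3][i]);
        }
    }
}
```

Because each column is a plain contiguous buffer, it is also the kind of storage that can be exposed to NumPy as a view without copying.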
Observability & Debugging
Locus includes built-in instrumentation for performance profiling and visual debugging, designed for high-resolution visibility without runtime overhead.
- Zero-Cost Tracing: Uses the `tracing` crate to emit static spans for the 6 major pipeline stages. Production builds enable the `release_max_level_info` feature, which compiles out spans and events more verbose than INFO so that this instrumentation has zero runtime cost when deployed.
- Mutually Exclusive Telemetry Matrix: To eliminate the "Observer Effect" during profiling, the regression suite implements a decoupled telemetry architecture via the `TELEMETRY_MODE` environment variable. This ensures that heavy JSON serialization does not pollute high-fidelity Tracy timings.
  - `TELEMETRY_MODE=tracy`: Enables the high-fidelity `TracyLayer` for deep GUI-based pipeline analysis.
  - `TELEMETRY_MODE=json`: Enables a non-blocking JSON subscriber, dumping structured pipeline timings to `target/profiling/{test_id}_events.json` for AI analysis.
  - Unset: Telemetry remains silent for maximum general test performance.
- Visual Debugging (Rerun): When enabled, Locus logs intermediate processing artifacts to the Rerun SDK for real-time inspection. This system is designed for zero production overhead:
  - Zero-Copy Views: Rejected quads and Hamming distances are exposed via zero-copy slices from the existing SoA batch.
  - Arena-Allocated Telemetry: Complex diagnostics (subpixel jitter vectors, reprojection RMSE) are computed on demand and allocated in the frame-local `arena` scratch pool.
  - Remote/Edge Ready: Supports remote connectivity via `--rerun-addr` (using `rerun+http://` schemes) and local web serving, allowing seamless debugging of edge devices from a local host.
- Developer CLI: Provides a unified `tools/cli.py` (executed via `uv run`) for benchmarking, visualization, and dictionary validation.
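The mutually exclusive telemetry modes can be modeled as a single enum resolved once from the environment. This is a sketch using only the standard library; the enum, the function names, and the choice to treat unrecognized values as silent are assumptions, and the actual subscriber wiring (`TracyLayer`, the JSON subscriber) is left out.

```rust
use std::path::PathBuf;

/// Exactly one mode is active per run, so the modes cannot interfere.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum TelemetryMode {
    Tracy,
    Json,
    Silent,
}

/// Map the raw `TELEMETRY_MODE` value to a mode.
/// Assumption: anything other than "tracy" or "json" is treated as unset.
fn parse_telemetry_mode(value: Option<&str>) -> TelemetryMode {
    match value.map(str::trim) {
        Some("tracy") => TelemetryMode::Tracy,
        Some("json") => TelemetryMode::Json,
        _ => TelemetryMode::Silent,
    }
}

/// Read the mode from the process environment.
fn telemetry_mode_from_env() -> TelemetryMode {
    parse_telemetry_mode(std::env::var("TELEMETRY_MODE").ok().as_deref())
}

/// Output path used in JSON mode, following the pattern quoted above.
fn json_events_path(test_id: &str) -> PathBuf {
    PathBuf::from(format!("target/profiling/{test_id}_events.json"))
}
```

A test harness would call `telemetry_mode_from_env()` once at startup and install at most one subscriber based on the result.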
Performance Characteristics
Locus targets a low per-frame latency budget for high-resolution frames on modern CPUs. The table below breaks down the cost of one 720p frame containing 50 tags.
| Stage | Complexity | Latency (50 Tags, 720p) | Notes |
|---|---|---|---|
| Preprocessing | \(O(N)\) | ~0.9 ms | Adaptive thresholding + Integral Image. |
| Segmentation | \(O(N)\) | ~0.5 ms | SIMD Fused RLE + Light-Speed Labeling (LSL). |
| Quad Extraction | \(O(K \cdot M)\) | ~1.5 ms | Massive gain from SoA extraction. |
| Decoding (Hard) | \(O(Q)\) | ~10.0 ms | SoA math pass; SIMD bilinear sampling. |
| Pose Refinement | \(O(V)\) | ~0.2 ms | Partitioned solver (Valid tags only). |
Note: Total latency is ~14.5 ms for 50 tags at 720p on a modern desktop CPU (e.g., Zen 4).
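As a quick sanity check on the budget, the per-stage figures can be summed and compared with the quoted total. The snippet below only restates the numbers from the table; the function names are made up for this example.

```rust
/// Per-stage latencies copied from the table (milliseconds, 50 tags, 720p).
const STAGES_MS: [(&str, f64); 5] = [
    ("Preprocessing", 0.9),
    ("Segmentation", 0.5),
    ("Quad Extraction", 1.5),
    ("Decoding (Hard)", 10.0),
    ("Pose Refinement", 0.2),
];

/// Sum of the itemized stage latencies.
fn itemized_total_ms() -> f64 {
    STAGES_MS.iter().map(|&(_, ms)| ms).sum()
}

/// Name of the slowest stage and its share of the itemized total.
fn dominant_stage() -> (&'static str, f64) {
    let mut best = STAGES_MS[0];
    for &stage in STAGES_MS.iter() {
        if stage.1 > best.1 {
            best = stage;
        }
    }
    (best.0, best.1 / itemized_total_ms())
}

/// Upper bound on frames per second when frames are processed back to back.
fn max_fps(frame_ms: f64) -> f64 {
    1000.0 / frame_ms
}
```

With these numbers the itemized stages sum to about 13.1 ms, hard decoding accounts for roughly 76% of that, and a 14.5 ms frame corresponds to at most about 69 frames per second. The source does not say what makes up the ~1.4 ms difference between the itemized sum and the quoted total.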
Extensibility
Locus is designed to support new fiducial marker systems without modifying the core pipeline.
Adding a New Tag Family
The TagDecoder trait serves as the extension point. To add a new family (e.g., STag or a custom ArUco dictionary):
- Implement `TagDecoder`: Define the grid dimension and bit extraction logic.
- Define `TagDictionary`: Provide the hamming distance lookup table.
- Register: Pass the new decoder to the detector (typically via `DetectorBuilder`).
struct MyCustomDecoder;

impl TagDecoder for MyCustomDecoder {
    fn name(&self) -> &str { "CustomTags" }
    fn dimension(&self) -> usize { 4 } // 4x4 grid
    // ... implementation ...
}

// Usage (shown here with a built-in family)
let mut detector = DetectorBuilder::new()
    .with_family(TagFamily::AprilTag36h11)
    .build();
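For readers who want something that compiles on its own, the following is a self-contained sketch of what a decoder along these lines might look like. The trait here is a stand-in declared locally: `name` and `dimension` come from the snippet above and `decode` follows the component diagram, but the exact signatures, the row-major bit layout, the rotation helper, and the `max_hamming` field are assumptions. Registration with `DetectorBuilder` is not shown because the source does not spell out that call.

```rust
/// Local stand-in for the `TagDecoder` trait (hypothetical signatures).
trait TagDecoder {
    fn name(&self) -> &str;
    fn dimension(&self) -> usize;
    /// Returns `(id, hamming_distance)` of the best match, if close enough.
    fn decode(&self, bits: u64) -> Option<(u32, u32)>;
}

/// Rotate a `dim` x `dim` bit grid by 90 degrees clockwise.
/// Assumed layout: cell (row, col) lives at bit `row * dim + col`, dim <= 8.
fn rotate90(bits: u64, dim: usize) -> u64 {
    let mut out = 0u64;
    for r in 0..dim {
        for c in 0..dim {
            let src = (dim - 1 - c) * dim + r;
            out |= ((bits >> src) & 1) << (r * dim + c);
        }
    }
    out
}

/// Toy dictionary-backed decoder; the index in `codes` is the tag id.
struct MyCustomDecoder {
    codes: Vec<u64>,
    max_hamming: u32,
}

impl TagDecoder for MyCustomDecoder {
    fn name(&self) -> &str { "CustomTags" }
    fn dimension(&self) -> usize { 4 } // 4x4 grid

    fn decode(&self, bits: u64) -> Option<(u32, u32)> {
        let mut best: Option<(u32, u32)> = None;
        let mut candidate = bits;
        // Try all four orientations of the observed grid.
        for _ in 0..4 {
            for (id, &code) in self.codes.iter().enumerate() {
                let dist = (code ^ candidate).count_ones();
                if best.map_or(true, |(_, d)| dist < d) {
                    best = Some((id as u32, dist));
                }
            }
            candidate = rotate90(candidate, self.dimension());
        }
        // Reject matches that exceed the error-correction budget.
        best.filter(|&(_, d)| d <= self.max_hamming)
    }
}
```

A linear scan like this is only practical for small dictionaries; a precomputed lookup table, as suggested by the `TagDictionary` step, would replace the inner loop.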
Packaging & Distribution
Locus uses maturin to bridge the Rust and Python worlds, creating a native Python extension module.
flowchart LR
subgraph Build ["Build Process"]
RustSrc["Rust Source<br/>(locus-core)"] -->|Cargo| CoreLib["Static Lib"]
CoreLib -->|Maturin| PyMod["Python Module<br/>(locus.abi3.so)"]
PyStub["Type Stubs<br/>(.pyi)"] -->|Maturin| Wheel
end
subgraph Dist ["Distribution"]
PyMod --> Wheel[".whl File"]
Wheel -->|pip install| Env["User Environment"]
end
Source Code Organization
The locus-core crate is organized into logical modules mirroring the pipeline stages.
| Module | Description | Key Structs |
|---|---|---|
| `image` | Zero-copy image views and pixel access. | `ImageView` |
| `threshold` | Adaptive thresholding and integral images. | `ThresholdEngine` |
| `segmentation` | Connected components labeling. | `UnionFind` |
| `simd_ccl_fusion` | SIMD Fused RLE & LSL. | `extract_rle_segments` |
| `quad` | Contour tracing and quad fitting. | `extract_quads` |
| `gwlf` | Gradient-Weighted Line Fitting. | `refine_quad_gwlf` |
| `homography` | Projective geometry primitives (DLT, DDA). | `Homography`, `HomographyDda` |
| `decoder` | Bit extraction and hamming decoding. | `TagDecoder` |
| `funnel` | Fast-path rejection gate (O(1) contrast). | `apply_funnel_gate` |
| `pose` | 3D pose estimation (PnP). | `Pose`, `CameraIntrinsics` |
| `pose_weighted` | Structure Tensor & Weighted LM. | `refine_pose_lm_weighted` |
| `gradient` | Image gradients & Sub-pixel windows. | `compute_structure_tensor` |
| `filter` | Pre-processing filters (Bilateral, Sharpen). | `bilateral_filter` |
| `edge_refinement` | Unified ERF sub-pixel refinement. | `ErfEdgeFitter` |
| `simd::math` | Centralized math kernels (erf, rcp). | `erf_approx`, `erf_approx_v4` |
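The DDA entry for the `homography` module refers to stepping a projective mapping incrementally instead of re-evaluating it at every sample. A scalar sketch of that idea follows; it is not the `HomographyDda` type from the table, and the `Mat3` alias and function names are invented for this example. Along a row of the tag grid, the two numerators and the denominator are each linear in `u`, so they can be advanced with one addition per step.

```rust
/// Row-major 3x3 homography matrix (hypothetical alias for this sketch).
type Mat3 = [[f64; 3]; 3];

/// Direct projection of tag-space (u, v) into image space.
fn project(h: &Mat3, u: f64, v: f64) -> (f64, f64) {
    let w = h[2][0] * u + h[2][1] * v + h[2][2];
    let x = h[0][0] * u + h[0][1] * v + h[0][2];
    let y = h[1][0] * u + h[1][1] * v + h[1][2];
    (x / w, y / w)
}

/// Project `n` points along a row, starting at (u0, v) and stepping `u` by `du`.
fn project_row_dda(h: &Mat3, u0: f64, v: f64, du: f64, n: usize) -> Vec<(f64, f64)> {
    // Evaluate the three linear forms once at the start of the row.
    let mut nx = h[0][0] * u0 + h[0][1] * v + h[0][2];
    let mut ny = h[1][0] * u0 + h[1][1] * v + h[1][2];
    let mut w = h[2][0] * u0 + h[2][1] * v + h[2][2];
    // Constant per-step increments: the DDA part.
    let (dx, dy, dw) = (h[0][0] * du, h[1][0] * du, h[2][0] * du);

    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        // One reciprocal and two multiplies per sample. An approximate
        // reciprocal kernel could be substituted here.
        let inv_w = 1.0 / w;
        out.push((nx * inv_w, ny * inv_w));
        nx += dx;
        ny += dy;
        w += dw;
    }
    out
}
```

The incremental form accumulates a small rounding error per step, which is usually negligible over the few samples in one tag row but worth checking against `project` when the row is long.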