Skip to content

System Architecture

This document provides a high-level overview of the Locus system architecture, designed for high-performance fiducial marker detection. For deeper dives, see:

High-Level Overview

Locus is built as a hybrid Rust/Python system. The core logic resides in a high-performance Rust crate (locus-core), which is exposed to Python via pyo3 bindings (locus-py). All operations go through a single Detector class: single-frame via detect(), or concurrent batch via detect_concurrent() when constructed with max_concurrent_frames > 1.

flowchart TD
    User[User / Application] -->|Images| PyBindings["Python Bindings<br/>(locus-py)"]
    PyBindings -->|PyReadonlyArray2| RustCore["Rust Core<br/>(locus-core)"]

    subgraph RustCore
        Pipeline["Detection Pipeline"]
        Memory["Arena Memory<br/>(Bumpalo)"]
        SIMD["SIMD Kernels<br/>(Multiversion)"]
    end

    RustCore -->|Detections| PyBindings
    PyBindings -->|"DetectionBatch (Vectorized)"| User

Component Diagram

The system is structured around a single Detector class that wraps an immutable LocusEngine (config + decoders + pool) and a dedicated FrameContext for single-frame use. LocusEngine and FrameContext are internal implementation details; the Python API exposes only Detector.

classDiagram
    class Detector {
        -LocusEngine engine
        -FrameContext ctx
        +detect(image) DetectionBatch
        +detect_concurrent(images) List~DetectionBatch~
    }

    class LocusEngine {
        -DetectorConfig config
        -Vec~Box~TagDecoder~~ decoders
        -ArrayQueue~FrameContext~ pool
    }

    class FrameContext {
        -Bump arena
        -DetectionBatch batch
        -Vec~u8~ upscale_buf
    }

    class DetectorConfig {
        +usize threshold_tile_size
        +f64 quad_min_edge_score
        +...
    }

    class TagDecoder {
        <<interface>>
        +decode(bits) Option~id, hamming~
        +sample_points()
        +rotated_codes()
    }

    class AprilTag36h11 {
        +decode()
    }

    class ArUco4x4 {
        +decode()
    }

    class Detection {
        +u32 id
        +Point center
        +Point[4] corners
        +Pose pose
    }

    Detector *-- LocusEngine
    Detector *-- FrameContext
    LocusEngine *-- DetectorConfig
    LocusEngine o-- TagDecoder
    LocusEngine --> FrameContext : pool
    TagDecoder <|-- AprilTag36h11
    TagDecoder <|-- ArUco4x4
    Detector ..> Detection : Produces

Design Principles

  1. Direct Memory Access: Locus uses the Python Buffer Protocol to read NumPy arrays directly, avoiding copies at the FFI boundary.
  2. GIL-Free Pipeline: The Python Global Interpreter Lock is released for the entire detection pass, enabling true multi-threaded perception in Python.
  3. Arena Allocation: A per-frame bumpalo arena handles all ephemeral scratch memory, resulting in zero malloc/free calls in the detection hot-path.
  4. Structure of Arrays (SoA): Internal state is stored in parallel arrays (DetectionBatch) to maximize L1 cache hits and enable SIMD-aligned loads.
  5. Runtime SIMD Dispatch: Mathematical kernels (bilinear sampling, DDA, thresholding) are specialized for AVX2, AVX-512, or NEON at runtime.
  6. Fast-Path Rejection: A multi-stage funnel rejects 70-80% of false-positive candidates using O(1) photometric gates before expensive bit-sampling.
  7. Immutable Topology: Board geometries (AprilGrid, ChAruco) are immutable structs shared across threads via Arc, with strict validation at construction.
  8. Zero-Overhead Telemetry: Performance tracing is compiled out in release builds unless explicitly requested via debug_telemetry=True.

Observability & Debugging

Locus includes built-in instrumentation for performance profiling and visual debugging, designed for high-resolution visibility without runtime overhead.

  1. Zero-Cost Tracing: Uses the tracing crate to emit static spans for the 6 major pipeline stages. Production builds utilize compile-time erasure (release_max_level_info) to ensure zero runtime cost when deployed.
  2. Mutually Exclusive Telemetry Matrix: To eliminate the "Observer Effect" during profiling, the regression suite implements a decoupled telemetry architecture via the TELEMETRY_MODE environment variable. This ensures that heavy JSON serialization does not pollute high-fidelity Tracy timings.
    • TELEMETRY_MODE=tracy: Enables the high-fidelity TracyLayer for deep GUI-based pipeline analysis.
    • TELEMETRY_MODE=json: Enables a non-blocking JSON subscriber, dumping structured pipeline timings to target/profiling/{test_id}_events.json for AI analysis.
    • Unset: Telemetry remains silent for maximum general test performance.
  3. Visual Debugging (Rerun): When enabled, Locus logs intermediate processing artifacts to the Rerun SDK for real-time inspection. This system is designed for zero production overhead:
    • Zero-Copy Views: Rejected quads and Hamming distances are exposed via zero-copy slices from the existing SoA batch.
    • Arena-Allocated Telemetry: Complex diagnostics (subpixel jitter vectors, reprojection RMSE) are computed on-demand and allocated in the frame-local arena scratching pool.
    • Remote/Edge Ready: Supports remote connectivity via --rerun-addr (using rerun+http:// schemes) and local web serving, allowing seamless debugging of edge devices from a local host.
  4. Developer CLI: Provides a unified tools/cli.py (executed via uv run) for benchmarking, visualization, and dictionary validation.

Performance Characteristics

Targets a low latency budget for high-resolution frames on modern CPUs.

Stage Complexity Latency (50 Tags, 720p) Notes
Preprocessing \(O(N)\) ~0.9 ms Adaptive thresholding + Integral Image.
Segmentation \(O(N)\) ~0.5 ms SIMD Fused RLE + Light-Speed Labeling (LSL).
Quad Extraction \(O(K \cdot M)\) ~1.5 ms Massive gain from SoA extraction.
Decoding (Hard) \(O(Q)\) ~10.0 ms SoA math pass; SIMD bilinear sampling.
Pose Refinement \(O(V)\) ~0.2 ms Partitioned solver (Valid tags only).

Note: Total latency ~14.5ms for 50 tags (720p) on a modern desktop CPU (e.g., Zen 4).

Extensibility

Locus is designed to support new fiducial marker systems without modifying the core pipeline.

Adding a New Tag Family

The TagDecoder trait serves as the extension point. To add a new family (e.g., STag or a custom ArUco dictionary):

  1. Implement TagDecoder: Define the grid dimension and bit extraction logic.
  2. Define TagDictionary: Provide the hamming distance lookup table.
  3. Register: Pass the new decoder to the detector (typically via the Rust DetectorBuilder).
struct MyCustomDecoder;

impl TagDecoder for MyCustomDecoder {
    fn name(&self) -> &str { "CustomTags" }
    fn dimension(&self) -> usize { 4 } // 4x4 grid

    // ... implementation ...
}

// Usage (Rust)
let mut detector = DetectorBuilder::new()
    .with_family(TagFamily::AprilTag36h11)
    .build();

Packaging & Distribution

Locus uses maturin to bridge the Rust and Python worlds, creating a native Python extension module.

flowchart LR
    subgraph Build ["Build Process"]
        RustSrc["Rust Source<br/>(locus-core)"] -->|Cargo| CoreLib["Static Lib"]
        CoreLib -->|Maturin| PyMod["Python Module<br/>(locus.abi3.so)"]
        PyStub["Type Stubs<br/>(.pyi)"] -->|Maturin| Wheel
    end

    subgraph Dist ["Distribution"]
        PyMod --> Wheel[".whl File"]
        Wheel -->|pip install| Env["User Environment"]
    end

Source Code Organization

The locus-core crate is organized into logical modules mirroring the pipeline stages.

Module Description Key Structs
image Zero-copy image views and pixel access. ImageView
threshold Adaptive thresholding and integral images. ThresholdEngine
segmentation Connected components labeling. UnionFind
simd_ccl_fusion SIMD Fused RLE & LSL. extract_rle_segments
quad Contour tracing and quad fitting. extract_quads
gwlf Gradient-Weighted Line Fitting. refine_quad_gwlf
homography Projective geometry primitives (DLT, DDA). Homography, HomographyDda
decoder Bit extraction and hamming decoding. TagDecoder
funnel Fast-path rejection gate (O(1) contrast). apply_funnel_gate
pose 3D pose estimation (PnP). Pose, CameraIntrinsics
pose_weighted Structure Tensor & Weighted LM. refine_pose_lm_weighted
gradient Image gradients & Sub-pixel windows. compute_structure_tensor
filter Pre-processing filters (Sharpen). laplacian_sharpen
edge_refinement Unified ERF sub-pixel refinement. ErfEdgeFitter
simd::math Centralized math kernels (erf, rcp). erf_approx, erf_approx_v4
board Board topology types and multi-tag pose estimation. AprilGridTopology, CharucoTopology, BoardEstimator
charuco ChAruco saddle-point extraction and board pose. CharucoRefiner, CharucoResult