vernier.instance

Detection / segmentation / boundary / keypoints evaluation. Built around the Evaluator config + evaluate(...) call, with a BackgroundEvaluator variant for overlapping the kernel with the rest of the training step.

Instance-segmentation / detection / keypoints evaluation surface.

Per ADR-0029, the AP-fold evaluation paradigm (bbox, segm, boundary, keypoints) lives under vernier.instance. Sibling to :mod:vernier.panoptic and :mod:vernier.semantic.

CompressedRLE

Bases: TypedDict

COCO compressed RLE shape (6-bit ASCII bytes, as emitted by pycocotools.mask.encode).

counts is the compressed bytes payload, validated as UTF-8 ASCII at ingest. size is (height, width) in COCO order.

Detections

Bases: TypedDict

One per-image detection batch in array form.

Fields are gated by iou_type:

  • bbox: image_id, boxes, scores, labels.
  • segm / boundary: above plus rles.
  • keypoints: image_id, boxes, scores, labels, keypoints.

Required dtypes (no silent promotion — opt in via cast_inputs=True):

  • boxes: float64 (N, 4) C-contiguous, xywh.
  • scores: float64 (N,).
  • labels: int64 (N,).
  • rles[i] (uncompressed dict): counts: uint32 1-D contiguous, size: (h, w).
  • rles[i] (compressed dict): counts: bytes (COCO 6-bit ASCII), size: (h, w).
  • rles[i] (bitmask): 2-D bool or uint8, shape (H, W), C- or F-order.
  • keypoints: float64 (N, K, 3) C-contiguous.
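The dtype discipline above can be checked mechanically before submission. A minimal numpy sketch with toy values (the helper name is illustrative, not part of the API):

```python
import numpy as np

# Sketch: assembling one bbox-mode Detections dict with the exact dtypes
# the strict ingest path (cast_inputs=False) requires. Toy values only.
def make_bbox_detections(image_id: int, n: int) -> dict:
    return {
        "image_id": image_id,
        "boxes": np.zeros((n, 4), dtype=np.float64, order="C"),  # xywh
        "scores": np.linspace(1.0, 0.5, n, dtype=np.float64),    # (N,)
        "labels": np.ones(n, dtype=np.int64),                    # (N,)
    }

det = make_bbox_detections(image_id=7, n=3)
assert det["boxes"].dtype == np.float64 and det["boxes"].flags["C_CONTIGUOUS"]
assert det["scores"].dtype == np.float64 and det["scores"].shape == (3,)
assert det["labels"].dtype == np.int64
```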

UncompressedRLE

Bases: TypedDict

COCO RLE shape on the array-ingest path (uncompressed counts).

counts is the uncompressed run-length array (uint32, contiguous). size is (height, width) in COCO order.
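For orientation, the uncompressed counts expand into a bitmask as alternating runs of background and foreground in COCO's column-major order. A pure-numpy sketch of that expansion (illustrative; the real decoding lives in the kernel):

```python
import numpy as np

# Sketch: expanding an uncompressed COCO RLE (counts start with the run of
# zeros, column-major pixel order) into a (h, w) bitmask. Toy 3x4 example.
def rle_to_mask(counts, size):
    h, w = size
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, val = 0, 0                  # runs alternate 0, 1, 0, 1, ...
    for run in counts:
        flat[pos:pos + int(run)] = val
        pos += int(run)
        val ^= 1
    return flat.reshape((w, h)).T    # COCO RLE is column-major

rle = {"counts": np.array([2, 3, 7], dtype=np.uint32), "size": (3, 4)}
mask = rle_to_mask(rle["counts"], rle["size"])
assert mask.shape == (3, 4)
assert int(mask.sum()) == 3          # one run of 3 foreground pixels
```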

BackgroundEvaluator

BackgroundEvaluator(
    gt: bytes | CocoDataset,
    *,
    iou_type: Literal[
        "bbox", "segm", "boundary", "keypoints"
    ] = ...,
    parity_mode: Literal["strict", "corrected"] = ...,
    max_dets: list[int] = ...,
    use_cats: bool = ...,
    memory_budget_bytes: int | None = ...,
    dilation_ratio: float = ...,
    sigmas: dict[int, list[float]] | None = ...,
    queue_capacity: int = ...,
    worker_affinity: int | None = ...,
    worker_nice: int = ...,
    shutdown_timeout_seconds: float = ...,
    retain_iou: bool = ...,
    cast_inputs: bool = ...,
    rank_id: int | None = ...,
    record_latency_samples: bool = ...,
)

Background-evaluator surface (ADR-0014). Wraps a worker thread that owns the StreamingEvaluator<K>; every public method either sends on the channel or reads atomic counters. Not frozen — finalize() and __exit__ need to mutate state.

detections_seen property

detections_seen: int

Mirror of StreamingEvaluator::detections_seen(). Advisory.

images_seen property

images_seen: int

Mirror of StreamingEvaluator::images_seen(). Advisory — updated by the worker after each successful submit.

memory_used_bytes property

memory_used_bytes: int

Mirror of StreamingEvaluator::memory_used_bytes(). Advisory.

queue_depth property

queue_depth: int

Approximate count of Update messages waiting in the channel.

drain_latency_samples_ns method descriptor

drain_latency_samples_ns() -> list[int]

Drain the worker's accumulated submit-latency samples (B5).

Each sample is the wall-time of one submit() call's channel-send leg, in nanoseconds. The buffer is reset to empty on each call so subsequent submits keep accumulating; returns an empty list when the evaluator was constructed without record_latency_samples=True (the default) or after finalize has consumed the worker.

finalize method descriptor

finalize() -> Summary

Drain the queue, finalize the evaluator, and join the worker. Subsequent calls raise "already finalized".

finalize_to_partial method descriptor

finalize_to_partial() -> bytes

ADR-0031 / ADR-0035: drain the queue, serialize the worker's final state as a partial blob, and shut the worker down. Subsequent calls raise "already finalized".

finalize_with_tables method descriptor

finalize_with_tables(
    per_image: bool = False,
    per_class: bool = False,
    per_detection: bool = False,
    per_pair: bool = False,
    per_pair_iou_floor: float = 0.1,
    per_pair_max_rows: int = 10000000,
    per_detection_with_geometry: bool = False,
) -> _TablesResult

Tables-aware finalize. Drains the queue and consumes the worker.

submit method descriptor

submit(
    detections: DetectionsInput,
    *,
    timeout: float | None = None,
) -> None

Submit a detection batch to the worker. Accepts either loadRes-shaped JSON bytes (legacy) or an ADR-0030 Detections dict / sequence of Detections dicts (numpy/DLPack). timeout controls backpressure:

  • None (default) → block until a slot is free
  • 0.0 → single non-blocking attempt; raise QueueFullError if the queue is full
  • t > 0.0 → wait up to t seconds; raise QueueFullError on timeout
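The three timeout regimes mirror standard bounded-queue semantics. A stdlib illustration with queue.Queue standing in for the worker channel (queue.Full plays the role QueueFullError plays in the real API):

```python
import queue

# Illustration of the three submit() backpressure regimes on a bounded
# stdlib queue. This is a semantic sketch, not the real channel.
q = queue.Queue(maxsize=1)
q.put("batch-0")                      # fill the single slot

# timeout=0.0 -> single non-blocking attempt
try:
    q.put("batch-1", block=False)
    raised = False
except queue.Full:
    raised = True
assert raised

# timeout=t > 0 -> wait up to t seconds, then give up
try:
    q.put("batch-1", timeout=0.01)
    raised = False
except queue.Full:
    raised = True
assert raised

# timeout=None -> would block until the consumer drains a slot
q.get()
q.put("batch-1")                      # succeeds immediately now
assert q.qsize() == 1
```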

Breakdown

Python wrapper around [Breakdown] / [ClassGroupBreakdown].

axis property

axis: str

Axis name (e.g., "area", "vehicle_taxonomy").

buckets property

buckets: list[tuple[str, float, float]]

Range buckets as a list of (label, lo, hi) triples in construction order.

Raises AttributeError if this Breakdown was built via from_class_groups. Use class_groups instead.

class_groups property

class_groups: list[tuple[str, list[int]]]

Class-id groups as a list of (label, class_ids) pairs in construction order.

Raises AttributeError if this Breakdown was built via from_ranges. Use buckets instead.

kind property

kind: Literal['range', 'class_groups']

Variant discriminator: "range" for from_ranges-constructed breakdowns, "class_groups" for from_class_groups-constructed ones. Use this to dispatch in validators that accept a Breakdown of a specific shape.

from_class_groups builtin

from_class_groups(
    axis: str, groups: Sequence[tuple[str, Sequence[int]]]
) -> Breakdown

Construct from class-id-keyed groups.

groups is a sequence of (label, class_ids) pairs, one per group. Group order on input determines the group axis index (first pair is index 0). Strict partition discipline is enforced — no class id may appear in two groups.

Raises ValueError on:

  • empty groups;
  • any group with empty class_ids;
  • duplicate group labels;
  • the same class id appearing in more than one group.
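The strict-partition rules above can be restated as a short validation pass. A plain-Python sketch of the documented checks (the real validation lives in the extension module):

```python
# Sketch of the documented from_class_groups validation rules.
def validate_groups(groups):
    if not groups:
        raise ValueError("empty groups")
    seen_labels, seen_ids = set(), set()
    for label, class_ids in groups:
        if not class_ids:
            raise ValueError(f"group {label!r} has empty class_ids")
        if label in seen_labels:
            raise ValueError(f"duplicate group label {label!r}")
        seen_labels.add(label)
        for cid in class_ids:
            if cid in seen_ids:
                raise ValueError(f"class id {cid} appears in two groups")
            seen_ids.add(cid)

validate_groups([("vehicles", [2, 3]), ("people", [1])])   # ok
try:
    validate_groups([("a", [1, 2]), ("b", [2])])           # id 2 overlaps
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
assert overlap_rejected
```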

from_ranges builtin

from_ranges(
    axis: str, buckets: Sequence[tuple[str, float, float]]
) -> Breakdown

Construct from f64-keyed buckets.

buckets is a sequence of (label, lo, hi) triples, one per bucket. [lo, hi] is closed on both ends per ADR-0016 (quirk D6); an annotation whose key sits exactly on a boundary lands in both adjacent buckets.

Raises ValueError on:

  • empty buckets;
  • NaN or infinite lo / hi;
  • lo < 0;
  • lo > hi;
  • duplicate bucket labels.
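The closed-interval membership rule (quirk D6) is easy to see on toy buckets; note the double membership at a shared boundary:

```python
# Sketch of the closed-interval bucket rule: a key that sits exactly on a
# shared boundary is a member of both adjacent buckets. Toy area buckets.
def buckets_for(key, buckets):
    return [label for label, lo, hi in buckets if lo <= key <= hi]

area_buckets = [("small", 0.0, 1024.0), ("medium", 1024.0, 9216.0)]
assert buckets_for(500.0, area_buckets) == ["small"]
assert buckets_for(1024.0, area_buckets) == ["small", "medium"]  # boundary
assert buckets_for(5000.0, area_buckets) == ["medium"]
```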

CocoDataset

Parsed-once COCO ground-truth dataset.

Construct with [PyDataset::from_json]; pass to the evaluate_*_summary_with_dataset family. Reusing the same instance across evaluate calls reuses the GT-side derivations the cached kernels populate on first use (per ADR-0020). The handle is frozen — its identity is the cache key.

boundary_cache_len property

boundary_cache_len: int

Observability-only: count of GT annotations whose boundary band is currently cached (ADR-0020). Useful for debugging or tests that need to assert cache reuse; not a stable contract, and the value can change shape as new cache slots are added.

category_frequency property

category_frequency: (
    Mapping[int, LvisFrequencyLiteral] | None
)

Per-category frequency tag as the LVIS single-letter form ("r" / "c" / "f"; quirk AB1). None when this dataset is not federated.

is_federated property

is_federated: bool

True when this dataset carries LVIS federated metadata — equivalent to pos_category_ids is not None. Cheap shortcut for orchestration code that gates behaviour on the federated flag.

neg_category_ids property

neg_category_ids: Mapping[int, frozenset[int]] | None

Per-image negative-category set (quirk AA2). None when this dataset is not federated.

not_exhaustive_category_ids property

not_exhaustive_category_ids: (
    Mapping[int, frozenset[int]] | None
)

Per-image not-exhaustive-category set (quirk AA3). None when this dataset is not federated.

num_annotations property

num_annotations: int

Number of GT annotations carried by the dataset.

num_categories property

num_categories: int

Number of categories.

num_images property

num_images: int

Number of images.

pos_category_ids property

pos_category_ids: Mapping[int, frozenset[int]] | None

Per-image positive-category set (quirk AA1, derived from GTs at load). None when this dataset was loaded via [Self::from_json] (COCO path) rather than [Self::from_lvis_json].

segm_cache_len property

segm_cache_len: int

Observability-only: count of GT annotations whose segm bbox+area derivation is currently cached (ADR-0020). Same caveats as [Self::boundary_cache_len].

clear_cache method descriptor

clear_cache() -> None

Drops every cached GT-side derivation. Reset point for users who want to free memory between long-lived training cycles without dropping the dataset itself.

from_json staticmethod

from_json(gt_json: bytes) -> CocoDataset

Parses a COCO ground-truth JSON payload into a reusable [Dataset] handle. Raises ValueError on malformed JSON.

from_lvis_json staticmethod

from_lvis_json(gt_json: bytes) -> CocoDataset

Parses an LVIS v1 ground-truth JSON payload into a reusable [Dataset] handle. The handle exposes the federated metadata (pos_category_ids, neg_category_ids, not_exhaustive_category_ids, category_frequency) the orchestrator reads to apply LVIS evaluation semantics (ADR-0026).

Raises ValueError on malformed JSON, on the disjointness violations of quirk AA7 (a category in both not_exhaustive and neg, or a neg category that has GT on the same image), or on missing frequency tags (quirk AB6).

Migration guide for users coming from lvis-api: docs/explanation/lvis-migration.md. Lead with the silent-federated-semantics gotcha — loading LVIS-shaped JSON via Dataset.from_json (the COCO loader) silently drops the federated extras and produces systematically lower AP under COCO semantics.

MemoryBudgetWarning

Bases: builtins.UserWarning

Streaming evaluator's memory usage crossed the soft-warn threshold.

OutOfBudgetError

Bases: builtins.RuntimeError

Memory budget for the streaming evaluator was exceeded.

Attributes: used_bytes, budget_bytes, breakdown.

PartialDatasetMismatch

Bases: builtins.RuntimeError

Distributed-eval partial was computed against a different dataset than the receiving evaluator.

Attributes: expected (bytes), actual (bytes).

PartialFormatMismatch

Bases: builtins.RuntimeError

Distributed-eval partial blob is structurally malformed (magic / version / CRC / kernel kind / parity / retain_iou / grid dimensions / rkyv archive).

Attributes: kind (string discriminator).

PartialParamsMismatch

Bases: builtins.RuntimeError

Distributed-eval partial was computed against different evaluation params than the receiving evaluator.

Attributes: expected (bytes), actual (bytes).

PartialPartitionOverlap

Bases: builtins.RuntimeError

Two distributed-eval partials cover the same image_id (sampler bug).

Attributes: rank_a, rank_b, image_id.

PartialRankCollision

Bases: builtins.RuntimeError

Two strict-mode distributed-eval partials share a rank_id.

Attributes: rank_id.

QueueFullError

Bases: builtins.RuntimeError

Background evaluator's submit queue was full.

Attributes: queue_capacity, timeout.

Summary

Pythonic view over a [vernier_core::Summary]. Frozen — the underlying value is constructed once by [evaluate_bbox_summary] and never mutated (per ADR-0006).

stats property

stats: list[float]

12 detection stats in canonical pycocotools order.

pretty_lines method descriptor

pretty_lines() -> list[str]

One pretty-printed line per stat, matching the pycocotools Average Precision (AP) @[ ... ] = 0.xxx shape.

FpIouHistogram dataclass

FpIouHistogram(
    iou_same: ndarray,
    iou_cross: ndarray,
    kernel: KernelName,
    t_f: float,
    n_total_dts: int,
    n_fps: int,
)

FP-IoU histogram for ADR-0022 t_b ratification.

For every detection that bin assignment classifies as a false positive (Cls / Loc / Both / Dupe / Bkg — anything that's not TP and not Ignore), :attr:iou_same and :attr:iou_cross carry the best same-class and cross-class IoUs at the time of the bin pick.

Bin-as-Bkg fraction at a candidate t_b is::

max_iou = np.maximum(h.iou_same, h.iou_cross)
bkg_fraction = (max_iou < t_b).mean()

Sweeping t_b over a range and plotting bkg_fraction(t_b) surfaces the "valley" between genuine backgrounds (IoU≈0) and near-misses; the right t_b sits in that valley.

See the analysis CLI at tests/python/integration/real_models/tide/extract_fp_histogram.py for the recipe.
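The sweep described above can be run on synthetic arrays to see the shape of the curve. A numpy sketch with deterministic toy IoUs (80 genuine backgrounds at IoU 0, 20 near-misses spread over [0.2, 0.45]):

```python
import numpy as np

# Sketch: sweeping candidate t_b values over synthetic per-FP IoUs to
# locate the valley between genuine backgrounds and near-misses.
iou_same = np.concatenate([np.zeros(80), np.linspace(0.2, 0.45, 20)])
iou_cross = np.zeros_like(iou_same)
max_iou = np.maximum(iou_same, iou_cross)

sweep = {t_b: float((max_iou < t_b).mean()) for t_b in (0.05, 0.1, 0.15, 0.25)}
# Below the near-miss cluster the fraction plateaus at the background mass;
# once t_b crosses into the cluster it starts absorbing near-misses.
assert sweep[0.05] == sweep[0.1] == sweep[0.15] == 0.8
assert abs(sweep[0.25] - 0.84) < 1e-9
```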

TideConfig dataclass

TideConfig(t_f: float, t_b: float, kernel: KernelName)

Resolved TIDE configuration recorded on every :class:TideReport.

Mirrors :class:vernier_core::tide::report::TideConfig (see crates/vernier-core/src/tide/report.rs). The fields carry the resolved thresholds the call ran under — never None — so a report screenshot is self-describing per ADR-0022.

cross_class_topk (per ADR-0023) is intentionally absent from this Python surface in 0.5.0. The Rust default (None = materialize the full per-detection cross-class IoU vector) is the only behavior reachable from Python today; the knob will be exposed once a workload demands it.

TideReport dataclass

TideReport(
    baseline_map: float,
    delta: dict[str, float],
    delta_all_fp_removed: float,
    config: TideConfig,
)

Six-bin TIDE decomposition of a detection model's mAP gap.

Returned by :func:error_decomposition. Each :attr:delta entry is the mAP increase the model would achieve if every detection assigned to that bin were corrected; :attr:baseline_map is the headline number before any correction; :attr:delta_all_fp_removed is the paper's "perfect rejection" upper bound (what mAP would be if every FP were dropped). The per-bin deltas should sum to at most :attr:delta_all_fp_removed — useful as a sanity check that the rewrite layer is internally consistent.

Fields mirror :class:vernier_core::tide::report::TideReport (see crates/vernier-core/src/tide/report.rs). The :attr:config field is the resolved :class:TideConfig so a single report tells the reader which thresholds it was produced under (ADR-0022).

Bin names follow the TIDE paper:

  • cls — Classification: matched a GT of the wrong class.
  • loc — Localization: right class, IoU in [t_b, t_f).
  • both — Both classification and localization wrong.
  • dupe — Duplicate: a higher-scoring detection already matched.
  • bkg — Background: IoU < t_b against every GT.
  • missed — Missed GT: no detection survived to match.

See the debugging tutorial (docs/tutorials/debugging-with-tide.md) and ADR-0021 for the algorithmic spec.
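The internal-consistency check the docstring describes can be applied mechanically to a report's fields. Toy numbers below, not real output of :func:error_decomposition:

```python
# Sketch: the sanity check on a TideReport's fields, with made-up deltas.
delta = {"cls": 2.1, "loc": 1.4, "both": 0.6,
         "dupe": 0.3, "bkg": 1.2, "missed": 2.0}
delta_all_fp_removed = 8.0

# Per the docstring, the per-bin deltas should sum to at most the
# perfect-rejection upper bound.
assert sum(delta.values()) <= delta_all_fp_removed
```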

CategoryFilterAll dataclass

CategoryFilterAll()

Match every category. The COCO default.

CategoryFilterByGrouping dataclass

CategoryFilterByGrouping(label: str)

Match every class id in the named group of the active class_grouping breakdown.

Only meaningful when the Evaluator's class_grouping is also set; the validator at __post_init__ rejects ByGrouping when no grouping is configured or when label is not a grouping label.

CategoryFilterByIds dataclass

CategoryFilterByIds(ids: frozenset[int])

Match an explicit set of class / category ids.

CategoryFilterFrequency dataclass

CategoryFilterFrequency(tag: Literal['r', 'c', 'f'])

Match by LVIS frequency tag ("r", "c", "f").

Valid only on instance evaluation against an LVIS-shaped dataset (ADR-0026). Semantic and panoptic Evaluators reject this variant at construction time per ADR-0041 / ADR-0042 — frequency tags are a sum type that doesn't generalize to non-numeric axes; class groupings carry the user's per-group rollup intent on those paradigms.

EvalResult dataclass

EvalResult(
    summary: Summary | None,
    _per_image_batch: object | None = None,
    _per_class_batch: object | None = None,
    _per_detection_batch: object | None = None,
    _per_pair_batch: object | None = None,
)

Result of an opt-in result-tables evaluate(...) call.

Returned only when tables= is non-None on :meth:vernier.Evaluator.evaluate. The default tables=None path still returns the bit-identical :class:vernier.Summary it always has.

Tables are exposed as cached :class:polars.DataFrame properties; polars is imported lazily on first attribute access (installed via the vernier[tables] extra). pandas / duckdb / pyarrow consumers can round-trip on the returned DataFrame, or call the underlying Arrow producer (self._per_image_batch.__arrow_c_array__()) directly — the leading-underscore name signals implementation detail.

stats property

stats: list[float]

Pass-through to self.summary.stats. Raises :class:AttributeError on ADR-0040 custom-grid results — the slot-indexed summary doesn't apply; read per-axis tables instead.

per_image cached property

per_image: DataFrame

One row per image rollup. Raises RuntimeError if per_image was not in the tables= request.

per_class cached property

per_class: DataFrame

One row per category. Raises RuntimeError if per_class was not in the tables= request.

per_detection cached property

per_detection: DataFrame

One row per detection. Raises RuntimeError if per_detection was not in the tables= request.

per_pair cached property

per_pair: DataFrame

One row per (DT, GT) pair. Raises RuntimeError if per_pair was not in the tables= request.

IncompatibleSummaryPlan

IncompatibleSummaryPlan(
    *,
    field: str,
    value: object,
    plan: str,
    remediation: str,
)

Bases: ValueError

Raised when a custom evaluation grid is incompatible with the canonical fixed-shape summary plan.

The COCO 12-stat / keypoints 10-stat / LVIS 13-stat summary plans address slots in the (T, R, K, A, M) accumulator by hardcoded indices — AP_S is "the second area-bucket entry of the all-IoU slice at maxDet=100", not "the small-area slot". A custom iou_thresholds ladder, recall_thresholds ladder, or area_ranges breakdown breaks this index assumption.

Per ADR-0040, custom-grid users get the result-tables surface (Evaluator.evaluate_tables(...), ADR-0019), which carries explicit labels per row and composes cleanly with arbitrary grid layouts. The remediation field on this exception names the method to call.

InvalidEvalParams

InvalidEvalParams(
    *, field: str, value: object, remediation: str
)

Bases: ValueError

Base for paradigm-specific Evaluator construction errors.

Raised at Evaluator.__post_init__ time by every paradigm in response to invalid parameter values (out of range, wrong shape, duplicate, conflicting, etc.). Per ADR-0039, validation runs at construction so misconfiguration surfaces fast — evaluate() cannot fail on misconfigured params, only on bad data.

Each subclass carries the offending field name, the offending value, and a one-line remediation pointer (typically the relevant ADR or doc page).

InvalidInstanceParams

InvalidInstanceParams(
    *, field: str, value: object, remediation: str
)

Bases: InvalidEvalParams

Invalid vernier.instance.Evaluator parameter (ADR-0040).

TablesConfig dataclass

TablesConfig(
    per_pair_iou_floor: float = 0.1,
    per_pair_max_rows: int = 10000000,
    per_detection_with_geometry: bool = False,
)

Configuration knobs for the expensive result tables. Inert when the corresponding flag is not requested via tables=.

Bbox dataclass

Bbox()

Bounding-box IoU kernel selector. No parameters.

Segm dataclass

Segm()

Segmentation-mask IoU kernel selector. No parameters.

Boundary dataclass

Boundary(dilation_ratio: float = DEFAULT_DILATION_RATIO)

Boundary IoU kernel selector (ADR-0010).

dilation_ratio is the boundary band width as a fraction of the image diagonal. 0.02 is the COCO default; 0.008 is the LVIS variant.

Keypoints dataclass

Keypoints(
    sigmas: Mapping[int, tuple[float, ...]] = {},
)

OKS (Object Keypoint Similarity) kernel selector (ADR-0012).

sigmas maps category_id -> per-keypoint sigma tuple. An empty mapping (the default) uses pycocotools' COCO-person 17-sigma table for every category. Per-category overrides honor quirk F1 ("corrected"): pycocotools hard-codes the COCO-person sigmas; vernier accepts a per-category mapping while keeping the default byte-identical on single-category-person datasets.

Evaluator dataclass

Evaluator(
    iou: IouKind = Bbox(),
    parity_mode: ParityMode = "corrected",
    max_dets: tuple[int, ...] | None = None,
    iou_thresholds: tuple[float, ...] | None = None,
    recall_thresholds: tuple[float, ...] | None = None,
    area_ranges: Breakdown | None = None,
    use_cats: bool = True,
    cast_inputs: bool = False,
)

Extended-API COCO-style evaluator.

The instance is immutable per ADR-0006: construct once, call :meth:evaluate per dataset/detections pair. To change a parameter, use :meth:with_options (which returns a new evaluator).

Defaults match pycocotools' detection eval grid, except for parity_mode, which defaults to "corrected" (the ADR-0002 recommendation for net-new users); migrating users wanting bit-exact pycocotools behavior should set parity_mode="strict".

The iou field is a discriminated dataclass union (:data:IouKind); each variant carries its own kernel-specific parameters (per ADR-0011). Use Bbox() / Segm() / Boundary(dilation_ratio=...).

max_dets defaults to None, meaning "use the canonical ladder for the selected iou kernel" (ADR-0012). Resolution happens at dispatch via :data:_KERNEL_MAX_DETS; explicit values always win. The current three kernels all resolve to (1, 10, 100).

cast_inputs (ADR-0030) gates one-shot f32→f64 / i32→i64 promotion when array-form Detections are passed to :meth:evaluate; off by default to preserve the strict ADR-0004 boundary. JSON-bytes detections ignore this flag.

with_options

with_options(
    *,
    iou: IouKind | None = None,
    parity_mode: ParityMode | None = None,
    max_dets: tuple[int, ...] | None | _UnsetType = _UNSET,
    iou_thresholds: tuple[float, ...]
    | None
    | _UnsetType = _UNSET,
    recall_thresholds: tuple[float, ...]
    | None
    | _UnsetType = _UNSET,
    area_ranges: Breakdown | None | _UnsetType = _UNSET,
    use_cats: bool | None = None,
    cast_inputs: bool | None = None,
) -> Evaluator

Return a copy of this evaluator with the given fields overridden.

Sentinel-keyed fields (max_dets, iou_thresholds, recall_thresholds, area_ranges) are three-valued: the default _UNSET leaves the field unchanged, None resets to the kernel-canonical default, and a value sets an explicit override.

evaluate

evaluate(
    gt: bytes | CocoDataset,
    dt: DetectionsInput,
    *,
    tables: None = None,
    tables_config: TablesConfig | None = None,
) -> Summary
evaluate(
    gt: bytes | CocoDataset,
    dt: DetectionsInput,
    *,
    tables: Literal["all"] | tuple[TableName, ...],
    tables_config: TablesConfig | None = None,
) -> EvalResult
evaluate(
    gt: bytes | CocoDataset,
    dt: DetectionsInput,
    *,
    tables: Literal["all"]
    | tuple[TableName, ...]
    | None = None,
    tables_config: TablesConfig | None = None,
) -> Summary | EvalResult

Run the evaluation pipeline against a GT/DT pair.

dt accepts the COCO loadRes-shaped JSON payload as bytes, or the array-form Detections shapes introduced by ADR-0030 (a single per-image dict or a sequence of them). The array path skips JSON serialization end-to-end and reads NumPy / DLPack buffers directly into the kernel.

gt is either the GT JSON bytes (parse-and-discard, identical to prior behavior) or a :class:CocoDataset handle (parsed-once, with the cache reused across calls — see ADR-0020).

tables= is the opt-in keyword for result tables. Defaults to None, returning :class:Summary (existing behavior, bit-identical to 0.0.1). Pass "all" or a tuple of :data:TableName\ s to opt into the wider :class:EvalResult return type.

Per ADR-0040, raises :class:IncompatibleSummaryPlan when iou_thresholds / recall_thresholds / area_ranges is set explicitly: the canonical 12-stat summary plan is keyed on hardcoded slot indices that don't generalize. Use :meth:evaluate_tables for tabular output that carries explicit labels per row.

evaluate_tables

evaluate_tables(
    gt: bytes | CocoDataset,
    dt: DetectionsInput,
    *,
    tables: Literal["all"] | tuple[TableName, ...] = "all",
    tables_config: TablesConfig | None = None,
) -> EvalResult

Tables-only evaluate path (ADR-0040 redirect target).

Equivalent to :meth:evaluate with tables= set, but bypasses the :class:IncompatibleSummaryPlan redirect so custom-grid users can reach the result tables. Honors iou_thresholds / recall_thresholds / area_ranges when set, falling through to the canonical COCO grid otherwise.

evaluate_to_partial

evaluate_to_partial(
    gt: bytes, dt: DetectionsInput, *, rank_id: int
) -> bytes

Run the evaluation as a per-rank streaming submit and return the serialized partial bytes (ADR-0031, ADR-0035).

rank_id identifies this evaluator's rank in a multi-process eval. The partial bytes can be gathered across ranks (e.g. via torch.distributed.all_gather_object) and merged on the head rank with :meth:from_partials to produce a global Summary bit-equal to a batch :meth:evaluate over the union (in parity_mode="strict" once the (score, rank_id, local_position) tiebreak lands; under ADR-0004's 4-ULP envelope today).

Per ADR-0040, raises :class:InvalidInstanceParams when any of iou_thresholds / recall_thresholds / area_ranges is set: extending the ADR-0031 wire format to carry the resolved custom grid + bumping params_hash to cover the new fields is a follow-up. Batch :meth:evaluate_tables already honors the custom grid; pair it with a single-rank run until the streaming follow-up ships.

from_partials classmethod

from_partials(
    gt: bytes,
    partials: Sequence[bytes],
    /,
    *,
    iou: IouKind | None = None,
    parity_mode: ParityMode = "corrected",
    max_dets: tuple[int, ...] | None = None,
    use_cats: bool = True,
    cast_inputs: bool = False,
) -> Summary

Merge partials (one per rank) into a global :class:Summary (ADR-0031, ADR-0035).

The kwargs mirror :class:Evaluator's config fields and must match what each rank used to produce its partial. Mismatches raise the structured Partial* errors (re-exported on this module).

background

background(
    gt: bytes | CocoDataset,
    *,
    memory_budget_bytes: int | None = None,
    queue_capacity: int = 8,
    worker_affinity: int | None = None,
    worker_nice: int = 5,
    shutdown_timeout_seconds: float = 5.0,
    retain_iou: bool = False,
    rank_id: int | None = None,
    record_latency_samples: bool = False,
) -> BackgroundEvaluator

Build a :class:BackgroundEvaluator (ADR-0014, ADR-0020) that shares this evaluator's iou, parity_mode, max_dets, use_cats, and cast_inputs.

Passing a :class:CocoDataset for gt reuses the parsed-once handle's per-kernel GT-side derivation caches across every submit() round (ADR-0020). For boundary IoU this collapses the dominant per-epoch cost — building the GT band per annotation — from O(epochs) to O(1). Bbox and keypoints have no GT-side cache today, so the win there is just the JSON parse; segm sits between.

The five queueing / scheduling knobs mirror the keyword-only parameters on :class:BackgroundEvaluator's constructor.

confusion_matrix

confusion_matrix(
    gt: bytes | CocoDataset,
    dt: bytes,
    *,
    iou: IouKind | None = None,
    t_f: float = 0.5,
    max_dets_per_image: int = 100,
    use_cats: bool = True,
    parity_mode: ParityMode = "corrected",
) -> DataFrame

Confusion matrix counts in long format (ADR-0023).

Counts (true_class, predicted_class) pairs across the dataset using the same cross-class IoU side pass that powers :func:vernier.error_decomposition. One per-image walk produces:

  • Diagonal cells (gt_class == dt_class) — true positives.
  • Off-diagonal cells (gt_class != dt_class) — classification confusion (a detection of class B was the best overlap of a GT of class A at IoU >= t_f).
  • __none__ row (gt_class == "__none__") — false positives: the detection had no overlapping GT at the threshold.
  • __none__ column (dt_class == "__none__") — missed GTs: a non-ignore GT was not covered by any detection at the threshold.

Output is a :class:polars.DataFrame with three columns:

  • gt_class: str — the true class id as a decimal string, or the sentinel "__none__" for false-positive rows.
  • dt_class: str — the predicted class id as a decimal string, or "__none__" for missed-GT columns.
  • count: i64 — number of (gt_class, dt_class) pairs in the dataset.

The class columns are typed str rather than mixed int|str because polars does not have a clean dtype for the union (the __none__ sentinel is fundamentally not an integer). Callers wanting numeric ids can cast with df.with_columns(pl.col("gt_class").cast(pl.Int64, strict=False)); __none__ rows surface as null, the natural representation of "no class".

Output is long-format (one row per non-zero cell) rather than wide-format (a square matrix) because long-format composes better with polars' filter / group / agg idioms. Pivot to wide via df.pivot(values="count", index="gt_class", on="dt_class") if needed for visualization.
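The long-to-wide pivot can be sketched in plain Python on toy long-format rows (the real call returns a polars DataFrame, where df.pivot does this in one line):

```python
# Sketch: pivoting toy long-format confusion rows into a wide matrix.
# Rows are (gt_class, dt_class, count); classes are decimal strings
# plus the "__none__" sentinel, matching the documented schema.
long_rows = [
    ("1", "1", 40), ("1", "2", 3),         # class 1: 40 right, 3 as class 2
    ("2", "2", 25), ("__none__", "2", 5),  # 5 false positives of class 2
    ("1", "__none__", 7),                  # 7 missed GTs of class 1
]

classes = sorted({r[0] for r in long_rows} | {r[1] for r in long_rows})
wide = {g: {d: 0 for d in classes} for g in classes}
for gt_class, dt_class, count in long_rows:
    wide[gt_class][dt_class] += count

assert wide["1"]["1"] == 40
assert wide["1"]["2"] == 3               # off-diagonal: confusion
assert wide["__none__"]["2"] == 5        # FP row
assert wide["1"]["__none__"] == 7        # missed-GT column
```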

Parameters:

Name Type Description Default
gt bytes | CocoDataset

Ground-truth COCO JSON payload as bytes (the same shape pycocotools.COCO(...) consumes). The :class:CocoDataset handle from ADR-0020 is not yet wired through this path — passing one raises :class:NotImplementedError.

required
dt bytes

Detection COCO JSON payload as bytes.

required
iou IouKind | None

Kernel selector. Pass :class:Bbox() (default), :class:Segm(), or :class:Boundary(dilation_ratio=...). :class:Keypoints is rejected per ADR-0024 (OKS is single-class in COCO; cross-class confusion is undefined).

None
t_f float

Foreground IoU threshold for declaring a (gt, dt) pair matched. Default 0.5 matches the COCO convention.

0.5
max_dets_per_image int

Per-image detection cap (matches the matching path's cap). Default 100.

100
use_cats bool

Reserved; must be True. A category-collapsed evaluation has no meaningful confusion matrix (every cell collapses to a single virtual class).

True
parity_mode ParityMode

"strict" or "corrected" per ADR-0002. Defaults to "corrected".

'corrected'

Returns:

Name Type Description
A DataFrame

class:polars.DataFrame with columns gt_class,

DataFrame

dt_class, count.

Raises:

Type Description
NotImplementedError

iou=Keypoints(...) (ADR-0024) or gt is a :class:CocoDataset handle (ADR-0020 forward-compat marker not yet wired through).

ValueError

t_f outside [0, 1], max_dets_per_image < 1, or use_cats=False.

ImportError

polars not installed (install via pip install 'vernier[tables]').

Example

import vernier
import polars as pl

df = vernier.confusion_matrix(gt_bytes, dt_bytes, iou=vernier.Bbox())
df.filter(pl.col("gt_class") != pl.col("dt_class"))            # only mistakes
df.pivot(values="count", index="gt_class", on="dt_class")      # wide

error_decomposition

error_decomposition(
    gt: bytes | CocoDataset,
    dt: bytes,
    *,
    iou: object = None,
    t_f: float | None = None,
    t_b: float | None = None,
    max_dets_per_image: int = 100,
    use_cats: bool = True,
    parity_mode: ParityMode = "corrected",
) -> TideReport

TIDE error decomposition (Bolya et al. 2020).

Splits the gap between a model's measured mAP and the perfect-mAP upper bound into six interpretable bins (Cls / Loc / Both / Dupe / Bkg / Missed), telling the user which kind of error is costing them the most points. Eight evaluation passes per call (one baseline plus one per bin plus the all-FPs-removed sanity total); expect roughly 6x the cost of a single :class:Evaluator.evaluate call.

gt is the GT JSON bytes (the same shape pycocotools' COCO constructor consumes). dt is the detections JSON bytes (the shape COCO.loadRes consumes). The :class:vernier.CocoDataset parsed-once handle (ADR-0020) is accepted in the type signature for forward-compat but raises :class:NotImplementedError today — the TIDE FFI is not yet wired through the CocoDataset cache. Tracked as a 0.5.x follow-up.

iou selects the kernel: Bbox() (default), Segm(), or Boundary(dilation_ratio=...). Keypoints(...) raises :class:NotImplementedError per ADR-0024 — TIDE on OKS has no published convention and the Cls/Both bins are structurally empty on COCO keypoints (single-class).

t_f (foreground / match threshold) and t_b (background threshold) carve the bin assignment. None resolves to the per-kernel defaults from ADR-0022:

+-----------+--------+--------+
| Kernel    | t_f    | t_b    |
+===========+========+========+
| bbox      | 0.5    | 0.1    |
+-----------+--------+--------+
| segm      | 0.5    | 0.1    |
+-----------+--------+--------+
| boundary  | 0.5    | 0.05   |
+-----------+--------+--------+

The bbox row matches the TIDE paper; segm and boundary rows are defensible-by-extrapolation defaults (segm) and geometry-anchored (boundary), both tentative pending the empirical work tracked in ADR-0022's "Decision gate" section. Override per call by passing explicit t_f / t_b floats; the report's :attr:config records the resolved values either way.

max_dets_per_image defaults to 100 (the largest rung of the standard COCO detection ladder). use_cats defaults to True (per-class evaluation, the COCO standard); set False for class-agnostic decomposition.

parity_mode follows :class:vernier.Evaluator: "corrected" (default) applies vernier's opinionated fixes for known pycocotools quirks; "strict" reproduces pycocotools bit-exactly (per ADR-0002).

Returns a :class:TideReport carrying the six per-bin ΔmAP values, the baseline mAP, the all-FPs-removed sanity total, and the resolved :class:TideConfig.

See the debugging tutorial (docs/tutorials/debugging-with-tide.md) for a worked example, ADR-0021 for the algorithmic spec, ADR-0022 for the threshold defaults, and ADR-0024 for the keypoints deferral.

.. note:: The opt-in mode="per_threshold" variant of TIDE (10x passes, one per IoU threshold in the AP grid) is not exposed in 0.5.0; planned as a 0.5.x follow-up. The single-t_f form is the paper-faithful default. Per-class drill-down on :class:TideReport is similarly deferred to a 0.5.x follow-up; composing this call with :class:vernier.Evaluator's tables="per_class" path is the recommended workaround until it lands.

fp_iou_histogram

fp_iou_histogram(
    gt: bytes | CocoDataset,
    dt: bytes,
    *,
    iou: object = None,
    t_f: float | None = None,
    max_dets_per_image: int = 100,
    use_cats: bool = True,
    parity_mode: ParityMode = "corrected",
) -> FpIouHistogram

Extract per-FP (iou_same, iou_cross) for ADR-0022 ratification.

Sister entry point to :func:error_decomposition. Same dispatch logic (kernel selection, parity mode, max-dets) but emits the raw IoU pairs instead of the six-bin ΔmAP. Caller bins the values Python-side to compute the bin-as-Bkg fraction at candidate t_b.

Parameters:

Name Type Description Default
gt bytes | CocoDataset

GT JSON bytes (CocoDataset handle deferred, same as :func:error_decomposition).

required
dt bytes

Detection JSON bytes.

required
iou object

Kernel selector — :class:vernier.Bbox (default), :class:vernier.Segm, or :class:vernier.Boundary. :class:vernier.Keypoints raises per ADR-0024.

None
t_f float | None

Foreground threshold for identifying TP / Ignore. Defaults to 0.5 (ADR-0022 standard); the t_b parameter on :func:error_decomposition is not consumed here.

None
max_dets_per_image int

Per-image detection cap; same default as :func:error_decomposition.

100
use_cats bool

Per-class evaluation; same default.

True
parity_mode ParityMode

Same as :func:error_decomposition.

'corrected'

Returns:

Type Description
FpIouHistogram

A :class:FpIouHistogram carrying parallel iou_same / iou_cross numpy arrays plus the metadata the report consumer needs.

Type aliases