vernier.instance
Instance-segmentation / detection / boundary / keypoints evaluation
surface. Built around the Evaluator config + evaluate(...) call, with a
BackgroundEvaluator variant for overlapping the kernel with the rest of
the training step.
Per ADR-0029, the AP-fold evaluation paradigm (bbox, segm, boundary,
keypoints) lives under vernier.instance. Sibling to
:mod:vernier.panoptic and :mod:vernier.semantic.
CompressedRLE
Bases: TypedDict
COCO compressed RLE shape (6-bit ASCII bytes, as emitted by
pycocotools.mask.encode).
counts is the compressed bytes payload, validated as UTF-8 ASCII at ingest.
size is (height, width) in COCO order.
Detections
Bases: TypedDict
One per-image detection batch in array form.
Fields are gated by iou_type:
- bbox: image_id, boxes, scores, labels.
- segm / boundary: the above plus rles.
- keypoints: image_id, boxes, scores, labels, keypoints.
Required dtypes (no silent promotion — opt in via
cast_inputs=True):
- boxes: float64 (N, 4), C-contiguous, xywh.
- scores: float64 (N,).
- labels: int64 (N,).
- rles[i] (uncompressed dict): counts uint32 1-D contiguous, size (h, w).
- rles[i] (compressed dict): counts bytes (COCO 6-bit ASCII), size (h, w).
- rles[i] (bitmask): 2-D bool or uint8, shape (H, W), C- or F-order.
- keypoints: float64 (N, K, 3), C-contiguous.
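The dtype contract reads more concretely as code. A sketch that stages a bbox-mode batch with plain NumPy (make_bbox_batch is an illustrative helper, not part of the API):

```python
import numpy as np

def make_bbox_batch(image_id, boxes, scores, labels):
    """Stage a bbox-mode Detections dict with the dtypes the array-ingest
    path requires (float64/int64, C-contiguous)."""
    batch = {
        "image_id": int(image_id),
        "boxes": np.ascontiguousarray(boxes, dtype=np.float64).reshape(-1, 4),
        "scores": np.asarray(scores, dtype=np.float64).reshape(-1),
        "labels": np.asarray(labels, dtype=np.int64).reshape(-1),
    }
    # Mirror the ingest-side checks: wrong dtypes would be rejected unless
    # the evaluator was built with cast_inputs=True.
    assert batch["boxes"].dtype == np.float64 and batch["boxes"].flags.c_contiguous
    assert batch["scores"].dtype == np.float64
    assert batch["labels"].dtype == np.int64
    return batch

batch = make_bbox_batch(1, [[10.0, 20.0, 30.0, 40.0]], [0.9], [3])
```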
UncompressedRLE
Bases: TypedDict
COCO RLE shape on the array-ingest path (uncompressed counts).
counts is the uncompressed run-length array (uint32, contiguous).
size is (height, width) in COCO order.
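For intuition about the uncompressed counts layout: COCO run-lengths alternate runs of zeros and ones in column-major order, starting with zeros. A small illustrative decoder (not part of the API), assuming that convention:

```python
import numpy as np

def decode_uncompressed_rle(rle):
    """Expand {'counts': uint32 runs, 'size': (h, w)} into an (h, w) bitmask.
    Runs alternate 0s then 1s, laid out column-major per COCO convention."""
    h, w = rle["size"]
    counts = np.asarray(rle["counts"], dtype=np.uint32)
    values = np.arange(len(counts)) % 2           # 0, 1, 0, 1, ...
    flat = np.repeat(values, counts).astype(np.uint8)
    assert flat.size == h * w, "run lengths must cover the full image"
    return flat.reshape((h, w), order="F")        # column-major

mask = decode_uncompressed_rle(
    {"counts": np.array([1, 2, 1], dtype=np.uint32), "size": (2, 2)}
)
# mask is [[0, 1], [1, 0]]
```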
BackgroundEvaluator
BackgroundEvaluator(
gt: bytes | CocoDataset,
*,
iou_type: Literal[
"bbox", "segm", "boundary", "keypoints"
] = ...,
parity_mode: Literal["strict", "corrected"] = ...,
max_dets: list[int] = ...,
use_cats: bool = ...,
memory_budget_bytes: int | None = ...,
dilation_ratio: float = ...,
sigmas: dict[int, list[float]] | None = ...,
queue_capacity: int = ...,
worker_affinity: int | None = ...,
worker_nice: int = ...,
shutdown_timeout_seconds: float = ...,
retain_iou: bool = ...,
cast_inputs: bool = ...,
rank_id: int | None = ...,
record_latency_samples: bool = ...,
)
Background-evaluator surface (ADR-0014). Wraps a worker thread that
owns the StreamingEvaluator<K>; every public method either sends on
the channel or reads atomic counters. Not frozen — finalize() and
__exit__ need to mutate state.
detections_seen
property
Mirror of StreamingEvaluator::detections_seen(). Advisory.
images_seen
property
Mirror of StreamingEvaluator::images_seen(). Advisory — updated
by the worker after each successful submit.
memory_used_bytes
property
Mirror of StreamingEvaluator::memory_used_bytes(). Advisory.
drain_latency_samples_ns
method descriptor
Drain the worker's accumulated submit-latency samples (B5).
Each sample is the wall-time of one submit() call's
channel-send leg, in nanoseconds. The buffer is reset to empty
on each call so subsequent submits keep accumulating; returns
an empty list when the evaluator was constructed without
record_latency_samples=True (the default) or after
finalize has consumed the worker.
finalize
method descriptor
finalize() -> Summary
Drain the queue, finalize the evaluator, and join the worker. Subsequent calls error with the "already finalized" message.
finalize_to_partial
method descriptor
ADR-0031 / ADR-0035: drain the queue, serialize the worker's final state as a partial blob, and shut the worker down. Subsequent calls raise "already finalized".
finalize_with_tables
method descriptor
finalize_with_tables(
per_image: bool = False,
per_class: bool = False,
per_detection: bool = False,
per_pair: bool = False,
per_pair_iou_floor: float = 0.1,
per_pair_max_rows: int = 10000000,
per_detection_with_geometry: bool = False,
) -> _TablesResult
Tables-aware finalize. Drains the queue and consumes the worker.
submit
method descriptor
submit(
detections: DetectionsInput,
*,
timeout: float | None = None,
) -> None
Submit a detection batch to the worker. Accepts either loadRes-
shaped JSON bytes (legacy) or an ADR-0030 Detections dict /
sequence of Detections dicts (numpy/DLPack). timeout controls
backpressure:
- None (default) → block until a slot is free.
- 0.0 → single non-blocking attempt; raise QueueFullError if the queue is full.
- t > 0.0 → wait up to t seconds; raise QueueFullError on timeout.
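The three regimes mirror the stdlib queue.Queue semantics; a self-contained sketch with a plain Queue, where queue.Full stands in for QueueFullError:

```python
import queue

q = queue.Queue(maxsize=1)
q.put("batch-0")                       # the single slot is now taken

# timeout=0.0 → one non-blocking attempt; a full queue raises immediately.
try:
    q.put("batch-1", block=False)
    nonblocking_raised = False
except queue.Full:
    nonblocking_raised = True

# t > 0.0 → wait up to t seconds, then raise.
try:
    q.put("batch-1", timeout=0.01)
    timed_raised = False
except queue.Full:
    timed_raised = True

# timeout=None → q.put("batch-1") would block until a consumer frees a
# slot; not exercised here because there is no consumer thread.
```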
Breakdown
Python wrapper around [Breakdown] / [ClassGroupBreakdown].
buckets
property
Range buckets as a list of (label, lo, hi) triples in
construction order.
Raises AttributeError if this Breakdown was built via
from_class_groups. Use class_groups instead.
class_groups
property
Class-id groups as a list of (label, class_ids) pairs in
construction order.
Raises AttributeError if this Breakdown was built via
from_ranges. Use buckets instead.
kind
property
Variant discriminator: "range" for from_ranges-constructed
breakdowns, "class_groups" for from_class_groups-constructed
ones. Use this to dispatch in validators that accept a
Breakdown of a specific shape.
from_class_groups
builtin
from_class_groups(
axis: str, groups: Sequence[tuple[str, Sequence[int]]]
) -> Breakdown
Construct from class-id-keyed groups.
groups is a sequence of (label, class_ids) pairs, one per
group. Group order on input determines the group axis index
(first pair is index 0). Strict partition discipline is enforced
— no class id may appear in two groups.
Raises ValueError on:
- empty groups;
- any group with empty class_ids;
- duplicate group labels;
- the same class id appearing in more than one group.
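Callers can pre-check the partition discipline before constructing the Breakdown. A sketch of the same validation rules (check_class_groups is a hypothetical helper, not part of the API):

```python
def check_class_groups(groups):
    """Raise ValueError under the same conditions from_class_groups does:
    empty groups, empty class_ids, duplicate labels, overlapping ids."""
    if not groups:
        raise ValueError("groups must be non-empty")
    seen_labels, seen_ids = set(), set()
    for label, class_ids in groups:
        if not class_ids:
            raise ValueError(f"group {label!r} has empty class_ids")
        if label in seen_labels:
            raise ValueError(f"duplicate group label {label!r}")
        seen_labels.add(label)
        overlap = seen_ids.intersection(class_ids)
        if overlap:
            raise ValueError(f"class ids {sorted(overlap)} appear in two groups")
        seen_ids.update(class_ids)

check_class_groups([("small-vocab", [1, 2]), ("rest", [3, 4])])   # ok
```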
from_ranges
builtin
from_ranges(
axis: str, buckets: Sequence[tuple[str, float, float]]
) -> Breakdown
Construct from f64-keyed buckets.
buckets is a sequence of (label, lo, hi) triples, one per
bucket. [lo, hi] is closed on both ends per ADR-0016 (quirk
D6); an annotation whose key sits exactly on a boundary lands in
both adjacent buckets.
Raises ValueError on:
- empty buckets;
- NaN or infinite lo / hi;
- lo < 0;
- lo > hi;
- duplicate bucket labels.
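Because buckets are closed on both ends, a key sitting exactly on a shared boundary matches two buckets. A pure-Python illustration of that membership rule (bucket_indices is a hypothetical helper, not part of the API):

```python
def bucket_indices(key, buckets):
    """Return the indices of every [lo, hi] bucket containing key;
    closed on both ends, so boundary keys match twice (quirk D6)."""
    return [i for i, (_, lo, hi) in enumerate(buckets) if lo <= key <= hi]

# The COCO small/medium area boundary sits at 32**2.
buckets = [("small", 0.0, 32.0**2), ("medium", 32.0**2, 96.0**2)]
assert bucket_indices(500.0, buckets) == [0]
assert bucket_indices(32.0**2, buckets) == [0, 1]   # boundary → both buckets
```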
CocoDataset
Parsed-once COCO ground-truth dataset.
Construct with [PyDataset::from_json]; pass to the
evaluate_*_summary_with_dataset family. Reusing the same instance
across evaluate calls reuses the GT-side derivations the cached
kernels populate on first use (per ADR-0020). The handle is frozen
— its identity is the cache key.
boundary_cache_len
property
Observability-only: count of GT annotations whose boundary band is currently cached (ADR-0020). Useful for debugging or tests that need to assert cache reuse; not a stable contract, and the value can change shape as new cache slots are added.
category_frequency
property
Per-category frequency tag as the LVIS single-letter form
("r" / "c" / "f"; quirk AB1). None when this
dataset is not federated.
is_federated
property
True when this dataset carries LVIS federated metadata —
equivalent to pos_category_ids is not None. Cheap shortcut
for orchestration code that gates behaviour on the federated
flag.
neg_category_ids
property
Per-image negative-category set (quirk AA2). None when
this dataset is not federated.
not_exhaustive_category_ids
property
Per-image not-exhaustive-category set (quirk AA3). None
when this dataset is not federated.
pos_category_ids
property
Per-image positive-category set (quirk AA1, derived from
GTs at load). None when this dataset was loaded via
[Self::from_json] (COCO path) rather than
[Self::from_lvis_json].
segm_cache_len
property
Observability-only: count of GT annotations whose segm
bbox+area derivation is currently cached (ADR-0020). Same
caveats as [Self::boundary_cache_len].
clear_cache
method descriptor
Drops every cached GT-side derivation. Reset point for users who want to free memory between long-lived training cycles without dropping the dataset itself.
from_json
staticmethod
from_json(gt_json: bytes) -> CocoDataset
Parses a COCO ground-truth JSON payload into a reusable
[Dataset] handle. Raises ValueError on malformed JSON.
from_lvis_json
staticmethod
from_lvis_json(gt_json: bytes) -> CocoDataset
Parses an LVIS v1 ground-truth JSON payload into a reusable
[Dataset] handle. The handle exposes the federated metadata
(pos_category_ids, neg_category_ids,
not_exhaustive_category_ids, category_frequency) the
orchestrator reads to apply LVIS evaluation semantics
(ADR-0026).
Raises ValueError on malformed JSON, on the disjointness
violations of quirk AA7 (a category in both not_exhaustive
and neg, or a neg category that has GT on the same image),
or on missing frequency tags (quirk AB6).
Migration guide for users coming from lvis-api:
docs/explanation/lvis-migration.md. Note the silent-federated-semantics
gotcha first: loading LVIS-shaped JSON via Dataset.from_json (the COCO
loader) silently drops the federated extras and produces systematically
lower AP under COCO semantics.
MemoryBudgetWarning
Bases: builtins.UserWarning
Streaming evaluator's memory usage crossed the soft-warn threshold.
OutOfBudgetError
Bases: builtins.RuntimeError
Memory budget for the streaming evaluator was exceeded.
Attributes: used_bytes, budget_bytes, breakdown.
PartialDatasetMismatch
Bases: builtins.RuntimeError
Distributed-eval partial was computed against a different dataset than the receiving evaluator.
Attributes: expected (bytes), actual (bytes).
PartialFormatMismatch
Bases: builtins.RuntimeError
Distributed-eval partial blob is structurally malformed (magic / version / CRC / kernel kind / parity / retain_iou / grid dimensions / rkyv archive).
Attributes: kind (string discriminator).
PartialParamsMismatch
Bases: builtins.RuntimeError
Distributed-eval partial was computed against different evaluation params than the receiving evaluator.
Attributes: expected (bytes), actual (bytes).
PartialPartitionOverlap
Bases: builtins.RuntimeError
Two distributed-eval partials cover the same image_id (sampler bug).
Attributes: rank_a, rank_b, image_id.
PartialRankCollision
Bases: builtins.RuntimeError
Two strict-mode distributed-eval partials share a rank_id.
Attributes: rank_id.
QueueFullError
Bases: builtins.RuntimeError
Background evaluator's submit queue was full.
Attributes: queue_capacity, timeout.
Summary
Pythonic view over a [vernier_core::Summary]. Frozen — the underlying
value is constructed once by [evaluate_bbox_summary] and never
mutated (per ADR-0006).
FpIouHistogram
dataclass
FpIouHistogram(
iou_same: ndarray,
iou_cross: ndarray,
kernel: KernelName,
t_f: float,
n_total_dts: int,
n_fps: int,
)
FP-IoU histogram for ADR-0022 t_b ratification.
For every detection that bin assignment classifies as a false
positive (Cls / Loc / Both / Dupe / Bkg — anything that's not TP
and not Ignore), :attr:iou_same and :attr:iou_cross carry the
best same-class and cross-class IoUs at the time of the bin pick.
Bin-as-Bkg fraction at a candidate t_b is::
max_iou = np.maximum(h.iou_same, h.iou_cross)
bkg_fraction = (max_iou < t_b).mean()
Sweeping t_b over a range and plotting bkg_fraction(t_b)
surfaces the "valley" between genuine backgrounds (IoU≈0) and
near-misses; the right t_b sits in that valley.
See the analysis CLI at
tests/python/integration/real_models/tide/extract_fp_histogram.py
for the recipe.
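The sweep can be run against synthetic stand-in arrays to see the plateau; only NumPy is needed (the iou_same / iou_cross values below are fabricated for illustration, not real evaluator output):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for h.iou_same / h.iou_cross: a clump of genuine
# backgrounds near IoU 0 and a clump of near-misses around IoU 0.4.
iou_same = np.concatenate(
    [rng.uniform(0.0, 0.05, 300), rng.uniform(0.3, 0.5, 100)]
)
iou_cross = np.zeros_like(iou_same)

max_iou = np.maximum(iou_same, iou_cross)
for t_b in (0.05, 0.1, 0.2, 0.3):
    bkg_fraction = (max_iou < t_b).mean()
    print(f"t_b={t_b:.2f}  bkg_fraction={bkg_fraction:.3f}")
```

Between 0.05 and 0.3 the fraction plateaus at 0.75: that flat region is the "valley" between the two clumps, and a real sweep looks for the same shape.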
TideConfig
dataclass
Resolved TIDE configuration recorded on every :class:TideReport.
Mirrors :class:vernier_core::tide::report::TideConfig (see
crates/vernier-core/src/tide/report.rs). The fields carry the
resolved thresholds the call ran under — never None — so a
report screenshot is self-describing per ADR-0022.
cross_class_topk (per ADR-0023) is intentionally absent from
this Python surface in 0.5.0. The Rust default (None =
materialize the full per-detection cross-class IoU vector) is the
only behavior reachable from Python today; the knob will be exposed
once a workload demands it.
TideReport
dataclass
TideReport(
baseline_map: float,
delta: dict[str, float],
delta_all_fp_removed: float,
config: TideConfig,
)
Six-bin TIDE decomposition of a detection model's mAP gap.
Returned by :func:error_decomposition. Each :attr:delta entry is
the mAP increase the model would achieve if every detection assigned
to that bin were corrected; :attr:baseline_map is the headline
number before any correction; :attr:delta_all_fp_removed is the
paper's "perfect rejection" upper bound (what mAP would be if every
FP were dropped). The per-bin deltas should sum to at most
:attr:delta_all_fp_removed — useful as a sanity check that the
rewrite layer is internally consistent.
Fields mirror :class:vernier_core::tide::report::TideReport (see
crates/vernier-core/src/tide/report.rs). The :attr:config
field is the resolved :class:TideConfig so a single report tells
the reader which thresholds it was produced under (ADR-0022).
Bin names follow the TIDE paper:
- cls — Classification: matched a GT of the wrong class.
- loc — Localization: right class, IoU in [t_b, t_f).
- both — Both classification and localization wrong.
- dupe — Duplicate: a higher-scoring detection already matched.
- bkg — Background: IoU < t_b against every GT.
- missed — Missed GT: no detection survived to match.
See the debugging tutorial (docs/tutorials/debugging-with-tide.md)
and ADR-0021 for the algorithmic spec.
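The sanity check reduces to one comparison. A sketch against a hand-built, report-shaped dict (illustrative numbers, not real evaluator output; check_tide_consistency is a hypothetical helper):

```python
def check_tide_consistency(report):
    """Sanity check from the docstring: the per-bin ΔmAP values should
    sum to at most the all-FPs-removed upper bound (small float slack)."""
    total = sum(report["delta"].values())
    return total <= report["delta_all_fp_removed"] + 1e-9

# Report-shaped dict with fabricated numbers for illustration.
report = {
    "baseline_map": 0.412,
    "delta": {"cls": 0.031, "loc": 0.054, "both": 0.004,
              "dupe": 0.002, "bkg": 0.018, "missed": 0.046},
    "delta_all_fp_removed": 0.160,
}
assert check_tide_consistency(report)
```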
CategoryFilterByGrouping
dataclass
Match every class id in the named group of the active
class_grouping breakdown.
Only meaningful when the Evaluator's class_grouping is also
set; the validator at __post_init__ rejects ByGrouping
when no grouping is configured or when label is not a
grouping label.
CategoryFilterByIds
dataclass
Match an explicit set of class / category ids.
CategoryFilterFrequency
dataclass
Match by LVIS frequency tag ("r", "c", "f").
Valid only on instance evaluation against an LVIS-shaped dataset (ADR-0026). Semantic and panoptic Evaluators reject this variant at construction time per ADR-0041 / ADR-0042 — frequency tags are a sum type that doesn't generalize to non-numeric axes; class groupings carry the user's per-group rollup intent on those paradigms.
EvalResult
dataclass
EvalResult(
summary: Summary | None,
_per_image_batch: object | None = None,
_per_class_batch: object | None = None,
_per_detection_batch: object | None = None,
_per_pair_batch: object | None = None,
)
Result of an opt-in result-tables evaluate(...) call.
Returned only when tables= is non-None on
:meth:vernier.Evaluator.evaluate. The default tables=None path
still returns the bit-identical :class:vernier.Summary it always
has.
Tables are exposed as cached :class:polars.DataFrame properties;
polars is imported lazily on first attribute access (installed via
the vernier[tables] extra). pandas / duckdb / pyarrow consumers
can round-trip on the returned DataFrame, or call the underlying
Arrow producer (self._per_image_batch.__arrow_c_array__())
directly — the leading-underscore name signals implementation detail.
stats
property
Pass-through to self.summary.stats. Raises
:class:AttributeError on ADR-0040 custom-grid results — the
slot-indexed summary doesn't apply; read per-axis tables instead.
per_image
cached
property
One row per image rollup. Raises RuntimeError if
per_image was not in the tables= request.
per_class
cached
property
One row per category. Raises RuntimeError if
per_class was not in the tables= request.
per_detection
cached
property
One row per detection. Raises RuntimeError if
per_detection was not in the tables= request.
IncompatibleSummaryPlan
Bases: ValueError
Raised when a custom evaluation grid is incompatible with the canonical fixed-shape summary plan.
The COCO 12-stat / keypoints 10-stat / LVIS 13-stat summary plans
address slots in the (T, R, K, A, M) accumulator by hardcoded
indices — AP_S is "the second area-bucket entry of the all-IoU
slice at maxDet=100", not "the small-area slot". A custom
iou_thresholds ladder, recall_thresholds ladder, or
area_ranges breakdown breaks this index assumption.
Per ADR-0040, custom-grid users get the result-tables surface
(Evaluator.evaluate_tables(...), ADR-0019), which carries
explicit labels per row and composes cleanly with arbitrary grid
layouts. The remediation field on this exception names the
method to call.
InvalidEvalParams
Bases: ValueError
Base for paradigm-specific Evaluator construction errors.
Raised at Evaluator.__post_init__ time by every paradigm in
response to invalid parameter values (out of range, wrong shape,
duplicate, conflicting, etc.). Per ADR-0039, validation runs at
construction so misconfiguration surfaces fast — evaluate()
cannot fail on misconfigured params, only on bad data.
Each subclass carries the offending field name, the offending value, and a one-line remediation pointer (typically the relevant ADR or doc page).
InvalidInstanceParams
Bases: InvalidEvalParams
Instance-paradigm specialization of :class:InvalidEvalParams, raised at
Evaluator.__post_init__ time for invalid instance-eval parameters.
TablesConfig
dataclass
TablesConfig(
per_pair_iou_floor: float = 0.1,
per_pair_max_rows: int = 10000000,
per_detection_with_geometry: bool = False,
)
Configuration knobs for the expensive result tables. Inert when
the corresponding flag is not requested via tables=.
Boundary
dataclass
Boundary IoU kernel selector (ADR-0010).
dilation_ratio is the boundary band width as a fraction of the
image diagonal. 0.02 is the COCO default; 0.008 is the LVIS
variant.
Keypoints
dataclass
Keypoints(
sigmas: Mapping[int, tuple[float, ...]] = {},
)
OKS (Object Keypoint Similarity) kernel selector (ADR-0012).
sigmas maps category_id -> per-keypoint sigma tuple. An empty
mapping (the default) uses pycocotools' COCO-person 17-sigma table
for every category. Per-category overrides honor quirk F1
("corrected"): pycocotools hard-codes the COCO-person sigmas; vernier
accepts a per-category mapping while keeping the default byte-identical
on single-category-person datasets.
Evaluator
dataclass
Evaluator(
iou: IouKind = Bbox(),
parity_mode: ParityMode = "corrected",
max_dets: tuple[int, ...] | None = None,
iou_thresholds: tuple[float, ...] | None = None,
recall_thresholds: tuple[float, ...] | None = None,
area_ranges: Breakdown | None = None,
use_cats: bool = True,
cast_inputs: bool = False,
)
Extended-API COCO-style evaluator.
The instance is immutable per ADR-0006: construct once, call
:meth:evaluate per dataset/detections pair. To change a parameter,
use :meth:with_options (which returns a new evaluator).
Defaults match pycocotools' detection eval grid, except for
parity_mode, which defaults to "corrected" (the ADR-0002
recommendation for net-new users); migrating users wanting bit-exact
pycocotools behavior should set parity_mode="strict".
The iou field is a discriminated dataclass union (:data:IouKind);
each variant carries its own kernel-specific parameters (per ADR-0011).
Use Bbox() / Segm() / Boundary(dilation_ratio=...).
max_dets defaults to None, meaning "use the canonical ladder
for the selected iou kernel" (ADR-0012). Resolution happens at
dispatch via :data:_KERNEL_MAX_DETS; explicit values always win.
The current three kernels all resolve to (1, 10, 100).
cast_inputs (ADR-0030) gates one-shot f32→f64 / i32→i64
promotion when array-form Detections are passed to
:meth:evaluate; off by default to preserve the strict ADR-0004
boundary. JSON-bytes detections ignore this flag.
with_options
with_options(
*,
iou: IouKind | None = None,
parity_mode: ParityMode | None = None,
max_dets: tuple[int, ...] | None | _UnsetType = _UNSET,
iou_thresholds: tuple[float, ...]
| None
| _UnsetType = _UNSET,
recall_thresholds: tuple[float, ...]
| None
| _UnsetType = _UNSET,
area_ranges: Breakdown | None | _UnsetType = _UNSET,
use_cats: bool | None = None,
cast_inputs: bool | None = None,
) -> Evaluator
Return a copy of this evaluator with the given fields overridden.
Sentinel-keyed fields (max_dets, iou_thresholds,
recall_thresholds, area_ranges) are three-valued:
the default _UNSET leaves the field unchanged, None
resets to the kernel-canonical default, and a value sets an
explicit override.
evaluate
evaluate(
gt: bytes | CocoDataset,
dt: DetectionsInput,
*,
tables: None = None,
tables_config: TablesConfig | None = None,
) -> Summary
evaluate(
gt: bytes | CocoDataset,
dt: DetectionsInput,
*,
tables: Literal["all"] | tuple[TableName, ...],
tables_config: TablesConfig | None = None,
) -> EvalResult
evaluate(
gt: bytes | CocoDataset,
dt: DetectionsInput,
*,
tables: Literal["all"]
| tuple[TableName, ...]
| None = None,
tables_config: TablesConfig | None = None,
) -> Summary | EvalResult
Run the evaluation pipeline against a GT/DT pair.
dt accepts the COCO loadRes-shaped JSON payload as
bytes, or the array-form Detections shapes
introduced by ADR-0030 (a single per-image dict or a sequence
of them). The array path skips JSON serialization end-to-end
and reads NumPy / DLPack buffers directly into the kernel.
gt is either the GT JSON bytes (parse-and-discard, identical
to prior behavior) or a :class:CocoDataset handle (parsed-once,
with the cache reused across calls — see ADR-0020).
tables= is the opt-in keyword for result tables. Defaults
to None, returning :class:Summary (existing behavior,
bit-identical to 0.0.1). Pass "all" or a tuple of
:data:TableName\ s to opt into the wider :class:EvalResult
return type.
Per ADR-0040, raises :class:IncompatibleSummaryPlan when
iou_thresholds / recall_thresholds / area_ranges
is set explicitly: the canonical 12-stat summary plan is keyed
on hardcoded slot indices that don't generalize. Use
:meth:evaluate_tables for tabular output that carries
explicit labels per row.
evaluate_tables
evaluate_tables(
gt: bytes | CocoDataset,
dt: DetectionsInput,
*,
tables: Literal["all"] | tuple[TableName, ...] = "all",
tables_config: TablesConfig | None = None,
) -> EvalResult
Tables-only evaluate path (ADR-0040 redirect target).
Equivalent to :meth:evaluate with tables= set, but
bypasses the :class:IncompatibleSummaryPlan redirect so
custom-grid users can reach the result tables. Honors
iou_thresholds / recall_thresholds / area_ranges
when set, falling through to the canonical COCO grid otherwise.
evaluate_to_partial
evaluate_to_partial(
gt: bytes, dt: DetectionsInput, *, rank_id: int
) -> bytes
Run the evaluation as a per-rank streaming submit and return the serialized partial bytes (ADR-0031, ADR-0035).
rank_id identifies this evaluator's rank in a multi-process
eval. The partial bytes can be gathered across ranks (e.g. via
torch.distributed.all_gather_object) and merged on the head
rank with :meth:from_partials to produce a global Summary
bit-equal to a batch :meth:evaluate over the union (in
parity_mode="strict" once the (score, rank_id,
local_position) tiebreak lands; under ADR-0004's 4-ULP
envelope today).
Per ADR-0040, raises :class:InvalidInstanceParams when any
of iou_thresholds / recall_thresholds / area_ranges
is set: extending the ADR-0031 wire format to carry the
resolved custom grid + bumping params_hash to cover the
new fields is a follow-up. Batch :meth:evaluate_tables
already honors the custom grid; pair it with a single-rank
run until the streaming follow-up ships.
from_partials
classmethod
from_partials(
gt: bytes,
partials: Sequence[bytes],
/,
*,
iou: IouKind | None = None,
parity_mode: ParityMode = "corrected",
max_dets: tuple[int, ...] | None = None,
use_cats: bool = True,
cast_inputs: bool = False,
) -> Summary
Merge partials (one per rank) into a global :class:Summary
(ADR-0031, ADR-0035).
The kwargs mirror :class:Evaluator's config fields and must
match what each rank used to produce its partial. Mismatches
raise the structured Partial* errors (re-exported on this
module).
background
background(
gt: bytes | CocoDataset,
*,
memory_budget_bytes: int | None = None,
queue_capacity: int = 8,
worker_affinity: int | None = None,
worker_nice: int = 5,
shutdown_timeout_seconds: float = 5.0,
retain_iou: bool = False,
rank_id: int | None = None,
record_latency_samples: bool = False,
) -> BackgroundEvaluator
Build a :class:BackgroundEvaluator (ADR-0014, ADR-0020) that
shares this evaluator's iou, parity_mode, max_dets,
use_cats, and cast_inputs.
Passing a :class:CocoDataset for gt reuses the parsed-once
handle's per-kernel GT-side derivation caches across every
submit() round (ADR-0020). For boundary IoU this collapses
the dominant per-epoch cost — building the GT band per
annotation — from O(epochs) to O(1). Bbox and keypoints have
no GT-side cache today, so the win there is just the JSON
parse; segm sits between.
The five queueing / scheduling knobs mirror the keyword-only
parameters on :class:BackgroundEvaluator's constructor.
confusion_matrix
confusion_matrix(
gt: bytes | CocoDataset,
dt: bytes,
*,
iou: IouKind | None = None,
t_f: float = 0.5,
max_dets_per_image: int = 100,
use_cats: bool = True,
parity_mode: ParityMode = "corrected",
) -> DataFrame
Confusion matrix counts in long format (ADR-0023).
Counts (true_class, predicted_class) pairs across the dataset
using the same cross-class IoU side pass that powers
:func:vernier.error_decomposition. One per-image walk produces:
- Diagonal cells (gt_class == dt_class) — true positives.
- Off-diagonal cells (gt_class != dt_class) — classification confusion (a detection of class B was the best overlap of a GT of class A at IoU >= t_f).
- __none__ row (gt_class == "__none__") — false positives: the detection had no overlapping GT at the threshold.
- __none__ column (dt_class == "__none__") — missed GTs: a non-ignore GT was not covered by any detection at the threshold.
Output is a :class:polars.DataFrame with three columns:
- gt_class: str — the true class id as a decimal string, or the sentinel "__none__" for false-positive rows.
- dt_class: str — the predicted class id as a decimal string, or "__none__" for missed-GT columns.
- count: i64 — number of (gt_class, dt_class) pairs in the dataset.
The class columns are typed str rather than mixed int|str
because polars does not have a clean dtype for the union (the
__none__ sentinel is fundamentally not an integer). Callers
wanting numeric ids can df.with_columns(pl.col("gt_class").cast(pl.Int64,
strict=False)) — __none__ rows surface as null, the
natural representation of "no class".
Output is long-format (one row per non-zero cell) rather than
wide-format (a square matrix) because long-format composes better
with polars' filter / group / agg idioms. Pivot to wide via
df.pivot(values="count", index="gt_class", on="dt_class") if
needed for visualization.
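When polars is unavailable, the long-to-wide pivot needs nothing beyond a dict of dicts. A pure-Python sketch over rows shaped like the documented output (fabricated counts for illustration):

```python
from collections import defaultdict

rows = [  # long-format cells, one per non-zero (gt_class, dt_class) pair
    {"gt_class": "1", "dt_class": "1", "count": 40},
    {"gt_class": "1", "dt_class": "2", "count": 3},
    {"gt_class": "__none__", "dt_class": "2", "count": 5},  # false positives
    {"gt_class": "1", "dt_class": "__none__", "count": 7},  # missed GTs
]

# Pivot: outer key = gt_class (row), inner key = dt_class (column).
wide = defaultdict(dict)
for r in rows:
    wide[r["gt_class"]][r["dt_class"]] = r["count"]

# wide["1"] == {"1": 40, "2": 3, "__none__": 7}
```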
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gt | bytes \| CocoDataset | Ground-truth COCO JSON payload as bytes (the same shape pycocotools' COCO constructor consumes), or a parsed-once CocoDataset handle. | required |
| dt | bytes | Detection COCO JSON payload as bytes. | required |
| iou | IouKind \| None | Kernel selector. Pass Bbox(), Segm(), or Boundary(...); None selects Bbox(). | None |
| t_f | float | Foreground IoU threshold for declaring a match. | 0.5 |
| max_dets_per_image | int | Per-image detection cap (matches the matching path's cap). | 100 |
| use_cats | bool | Reserved; must be True. | True |
| parity_mode | ParityMode | | 'corrected' |
Returns:
| Type | Description |
|---|---|
| DataFrame | Long-format polars.DataFrame with gt_class / dt_class / count columns. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | |
| ValueError | |
| ImportError | |
Example:

    import vernier
    import polars as pl

    df = vernier.confusion_matrix(gt_bytes, dt_bytes, iou=vernier.Bbox())
    df.filter(pl.col("gt_class") != pl.col("dt_class"))   # only mistakes
    df.pivot(values="count", index="gt_class", on="dt_class")  # wide
error_decomposition
error_decomposition(
gt: bytes | CocoDataset,
dt: bytes,
*,
iou: object = None,
t_f: float | None = None,
t_b: float | None = None,
max_dets_per_image: int = 100,
use_cats: bool = True,
parity_mode: ParityMode = "corrected",
) -> TideReport
TIDE error decomposition (Bolya et al. 2020).
Splits the gap between a model's measured mAP and the perfect-mAP
upper bound into six interpretable bins (Cls / Loc / Both / Dupe /
Bkg / Missed), telling the user which kind of error is costing
them the most points. Eight evaluation passes per call (one
baseline plus one per bin plus the all-FPs-removed sanity total);
expect roughly 6x the cost of a single :class:Evaluator.evaluate
call.
gt is the GT JSON bytes (the same shape pycocotools'
COCO constructor consumes). dt is the detections JSON
bytes (the shape COCO.loadRes consumes). The
:class:vernier.CocoDataset parsed-once handle (ADR-0020) is accepted
in the type signature for forward-compat but raises
:class:NotImplementedError today — the TIDE FFI is not yet wired
through the CocoDataset cache. Tracked as a 0.5.x follow-up.
iou selects the kernel: Bbox() (default), Segm(), or
Boundary(dilation_ratio=...). Keypoints(...) raises
:class:NotImplementedError per ADR-0024 — TIDE on OKS has no
published convention and the Cls/Both bins are structurally empty
on COCO keypoints (single-class).
t_f (foreground / match threshold) and t_b (background
threshold) carve the bin assignment. None resolves to the
per-kernel defaults from ADR-0022:
+-----------+--------+--------+
| Kernel    | t_f    | t_b    |
+===========+========+========+
| bbox      | 0.5    | 0.1    |
+-----------+--------+--------+
| segm      | 0.5    | 0.1    |
+-----------+--------+--------+
| boundary  | 0.5    | 0.05   |
+-----------+--------+--------+
The bbox row matches the TIDE paper; segm and boundary rows are
defensible-by-extrapolation defaults (segm) and geometry-anchored
(boundary), both tentative pending the empirical work tracked in
ADR-0022's "Decision gate" section. Override per call by passing
explicit t_f / t_b floats; the report's :attr:config
records the resolved values either way.
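Threshold resolution can be pictured as a small lookup keyed on the kernel name (resolve_thresholds and the table constant are hypothetical helpers mirroring the defaults above):

```python
_TIDE_DEFAULTS = {          # kernel -> (t_f, t_b), per the ADR-0022 table
    "bbox": (0.5, 0.1),
    "segm": (0.5, 0.1),
    "boundary": (0.5, 0.05),
}

def resolve_thresholds(kernel, t_f=None, t_b=None):
    """Explicit floats win; None falls back to the per-kernel default."""
    default_tf, default_tb = _TIDE_DEFAULTS[kernel]
    return (t_f if t_f is not None else default_tf,
            t_b if t_b is not None else default_tb)

assert resolve_thresholds("boundary") == (0.5, 0.05)
assert resolve_thresholds("bbox", t_b=0.2) == (0.5, 0.2)   # explicit t_b wins
```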
max_dets_per_image defaults to 100 (the largest rung of the
standard COCO detection ladder). use_cats defaults to True
(per-class evaluation, the COCO standard); set False for
class-agnostic decomposition.
parity_mode follows :class:vernier.Evaluator: "corrected"
(default) applies vernier's opinionated fixes for known
pycocotools quirks; "strict" reproduces pycocotools bit-exactly
(per ADR-0002).
Returns a :class:TideReport carrying the six per-bin ΔmAP values,
the baseline mAP, the all-FPs-removed sanity total, and the
resolved :class:TideConfig.
See the debugging tutorial (docs/tutorials/debugging-with-tide.md)
for a worked example, ADR-0021 for the algorithmic spec, ADR-0022
for the threshold defaults, and ADR-0024 for the keypoints
deferral.
.. note::
The opt-in mode="per_threshold" variant of TIDE (10x
passes, one per IoU threshold in the AP grid) is not exposed in
0.5.0; planned as a 0.5.x follow-up. The single-t_f form is
the paper-faithful default. Per-class drill-down on
:class:TideReport is similarly deferred to a 0.5.x follow-up;
composing this call with :class:vernier.Evaluator's
tables="per_class" path is the recommended workaround until
it lands.
fp_iou_histogram
fp_iou_histogram(
gt: bytes | CocoDataset,
dt: bytes,
*,
iou: object = None,
t_f: float | None = None,
max_dets_per_image: int = 100,
use_cats: bool = True,
parity_mode: ParityMode = "corrected",
) -> FpIouHistogram
Extract per-FP (iou_same, iou_cross) for ADR-0022 ratification.
Sister entry point to :func:error_decomposition. Same dispatch
logic (kernel selection, parity mode, max-dets) but emits the raw
IoU pairs instead of the six-bin ΔmAP. Caller bins the values
Python-side to compute the bin-as-Bkg fraction at candidate t_b.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gt | bytes \| CocoDataset | GT JSON bytes (CocoDataset handle deferred, same as :func:error_decomposition). | required |
| dt | bytes | Detection JSON bytes. | required |
| iou | object | Kernel selector — Bbox() (default), Segm(), or Boundary(...). | None |
| t_f | float \| None | Foreground threshold for identifying TP / Ignore. Defaults to the per-kernel ADR-0022 value. | None |
| max_dets_per_image | int | Per-image detection cap; same default as :func:error_decomposition. | 100 |
| use_cats | bool | Per-class evaluation; same default. | True |
| parity_mode | ParityMode | Same as :func:error_decomposition. | 'corrected' |
Returns:
| Type | Description |
|---|---|
| FpIouHistogram | The raw per-FP (iou_same, iou_cross) arrays plus the metadata (kernel, resolved t_f, detection/FP counts) a downstream consumer needs. |