
vernier.semantic

Semantic-segmentation evaluation (mIoU + per-class confusion matrix). Evaluator + evaluate(...) mirror the instance API; the *_IGNORE_LABEL / *_N_CLASSES constants are convenience defaults for common public datasets.

Semantic-segmentation evaluation surface (ADR-0028).

Per ADR-0029, the semantic-segmentation evaluation paradigm lives under vernier.semantic, sibling to :mod:vernier.instance (the AP fold) and :mod:vernier.panoptic (panoptic quality). The Rust kernel ships in the vernier-semantic crate; this module is a thin Python wrapper.

Surface:

  • :class:Dataset / :class:Predictions — frozen dataclasses carrying the per-image label maps + dataset-level config (n_classes / ignore_label).
  • :class:Evaluator — frozen dataclass holding parity_mode and optional label_remap. Evaluator.evaluate(gt, dt) returns a :class:Summary (the FFI pyclass).
  • :class:Summary / :class:ClassSemanticStats / :class:ConfusionMatrix — re-exported FFI pyclasses (under their unprefixed names per ADR-0029).
  • Per-dataset presets — :meth:Dataset.cityscapes, :meth:Dataset.ade20k, :meth:Dataset.pascal_voc — bake the canonical ignore-label and class-count conventions; the user only passes the PNG paths.
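The Rust kernel owns the actual fold, but the arithmetic is small enough to state in full. Below is a pure-numpy sketch (function names are illustrative, not part of the surface) of the per-image confusion fold and the four headline scalars, under the simplifying assumption that zero-support classes are excluded from the means; the kernel's exact NaN-vs-0.0 disposition is quirk AL2 and may differ.

```python
import numpy as np

def fold_confusion(gt, dt, n_classes, ignore_label=None):
    """Accumulate one image's (gt, dt) label maps into an (N, N) matrix.

    Assumes all non-ignored labels are < n_classes. Rows index the
    ground-truth class, columns the predicted class.
    """
    gt = gt.ravel().astype(np.int64)
    dt = dt.ravel().astype(np.int64)
    if ignore_label is not None:
        keep = gt != ignore_label
        gt, dt = gt[keep], dt[keep]
    # Flatten each (gt_class, pred_class) pair into a single bincount index.
    idx = gt * n_classes + dt
    counts = np.bincount(idx, minlength=n_classes * n_classes)
    return counts.reshape(n_classes, n_classes)

def headline_scalars(m):
    """mIoU / FWIoU / pixel accuracy / mean accuracy from an (N, N) matrix."""
    tp = np.diag(m).astype(np.float64)
    gt_pixels = m.sum(axis=1).astype(np.float64)   # row sums: GT support
    dt_pixels = m.sum(axis=0).astype(np.float64)   # column sums: predictions
    union = gt_pixels + dt_pixels - tp
    with np.errstate(invalid="ignore", divide="ignore"):
        iou = tp / union
        acc = tp / gt_pixels
    present = gt_pixels > 0                        # zero-support classes dropped
    return {
        "miou": float(iou[present].mean()),
        "fwiou": float((gt_pixels[present] / gt_pixels.sum() * iou[present]).sum()),
        "pixel_accuracy": float(tp.sum() / m.sum()),
        "mean_accuracy": float(acc[present].mean()),
    }
```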

BackgroundEvaluator

BackgroundEvaluator(
    n_classes: int,
    parity_mode: str,
    *,
    ignore_label: int | None = ...,
    rank_id: int | None = ...,
    queue_capacity: int = ...,
    worker_affinity: int | None = ...,
    worker_nice: int = ...,
    shutdown_timeout_seconds: float = ...,
)

Background semantic-segmentation evaluator (ADR-0014 + ADR-0032).

Wraps a single dedicated worker thread that owns the underlying [StreamingSemanticEvaluator]. submit(image_id, gt, dt) posts one image; snapshot() / finalize() block on a worker reply.

Mirrors the panoptic and instance background surfaces: same constructor knobs (queue_capacity, worker_affinity, worker_nice, shutdown_timeout_seconds), same context-manager lifecycle, same to_partial / finalize_to_partial for distributed-eval gather (ADR-0032).

n_classes property

n_classes: int

Number of evaluation classes (constant for the lifetime of the evaluator).

n_images property

n_images: int

Mirror of the underlying evaluator's n_images. Advisory — updated by the worker after each successful submit.

queue_depth property

queue_depth: int

Approximate count of Update messages waiting in the channel.

finalize method descriptor

finalize() -> SemanticSummary

Drain the queue, finalize the evaluator, and join the worker.

finalize_to_partial method descriptor

finalize_to_partial() -> bytes

ADR-0032 / ADR-0035: drain, serialize the final state, and shut the worker down.

submit method descriptor

submit(
    image_id: int,
    gt: NDArray[unsignedinteger[Any]],
    dt: NDArray[unsignedinteger[Any]],
    *,
    timeout: float | None = None,
) -> None

Submit one image's (gt, dt) label-map pair to the worker. Accepts uint8 / uint16 / uint32 2-D ndarrays (ADR-0037); the worker walks at native dtype without an upcast. timeout mirrors the instance background:

  • None (default) → block until a slot is free
  • 0.0 → single non-blocking attempt; raise QueueFullError if the queue is full
  • t > 0.0 → wait up to t seconds; raise QueueFullError on timeout
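The three modes map onto the stdlib queue.Queue.put semantics. A sketch of the correspondence against a bounded queue, with queue.Full standing in for QueueFullError (illustrative only; the real channel lives on the Rust side):

```python
import queue

def submit_with_timeout(q, item, timeout=None):
    """Mirror of the three submit timeout modes on a bounded queue.

    None  -> block until a slot frees
    0.0   -> one non-blocking attempt; queue.Full if the queue is full
    t > 0 -> wait up to t seconds; queue.Full on timeout
    """
    if timeout is None:
        q.put(item)                    # blocks indefinitely
    elif timeout == 0.0:
        q.put_nowait(item)             # raises queue.Full immediately
    else:
        q.put(item, timeout=timeout)   # raises queue.Full after timeout
```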

submit_png method descriptor

submit_png(
    image_id: int,
    gt_png_bytes: bytes,
    dt_png_bytes: bytes,
    *,
    timeout: float | None = None,
) -> None

Submit one image's (gt_png_bytes, dt_png_bytes) 8-bit grayscale PNG pair to the worker (ADR-0037). Decodes synchronously on the FFI thread (under py.detach) and sends the native-width u8 label maps across the channel; the worker folds at native width without a 4× upcast.

Strictly equivalent to decoding the same PNG bytes yourself and calling submit(image_id, gt.astype(np.uint32), dt.astype(np.uint32), ...); the difference is wall-time, not correctness.

Format contract: 8-bit grayscale only. timeout mirrors submit.

Breakdown

Python wrapper around [Breakdown] / [ClassGroupBreakdown].

axis property

axis: str

Axis name (e.g., "area", "vehicle_taxonomy").

buckets property

buckets: list[tuple[str, float, float]]

Range buckets as a list of (label, lo, hi) triples in construction order.

Raises AttributeError if this Breakdown was built via from_class_groups. Use class_groups instead.

class_groups property

class_groups: list[tuple[str, list[int]]]

Class-id groups as a list of (label, class_ids) pairs in construction order.

Raises AttributeError if this Breakdown was built via from_ranges. Use buckets instead.

kind property

kind: Literal['range', 'class_groups']

Variant discriminator: "range" for from_ranges-constructed breakdowns, "class_groups" for from_class_groups-constructed ones. Use this to dispatch in validators that accept a Breakdown of a specific shape.

from_class_groups builtin

from_class_groups(
    axis: str, groups: Sequence[tuple[str, Sequence[int]]]
) -> Breakdown

Construct from class-id-keyed groups.

groups is a sequence of (label, class_ids) pairs, one per group. Group order on input determines the group axis index (first pair is index 0). Strict partition discipline is enforced — no class id may appear in two groups.

Raises ValueError on:

  • empty groups;
  • any group with empty class_ids;
  • duplicate group labels;
  • the same class id appearing in more than one group.
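A pure-Python sketch of the partition discipline (illustrative, not the Rust-side validator):

```python
def validate_class_groups(groups):
    """Enforce the from_class_groups invariants: non-empty groups, non-empty
    class_ids, unique labels, and no class id shared between groups."""
    if not groups:
        raise ValueError("groups must be non-empty")
    seen_labels, seen_ids = set(), set()
    for label, class_ids in groups:
        if not class_ids:
            raise ValueError(f"group {label!r} has empty class_ids")
        if label in seen_labels:
            raise ValueError(f"duplicate group label {label!r}")
        seen_labels.add(label)
        for cid in class_ids:
            if cid in seen_ids:
                raise ValueError(f"class id {cid} appears in more than one group")
            seen_ids.add(cid)
```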

from_ranges builtin

from_ranges(
    axis: str, buckets: Sequence[tuple[str, float, float]]
) -> Breakdown

Construct from f64-keyed buckets.

buckets is a sequence of (label, lo, hi) triples, one per bucket. [lo, hi] is closed on both ends per ADR-0016 (quirk D6); an annotation whose key sits exactly on a boundary lands in both adjacent buckets.

Raises ValueError on:

  • empty buckets;
  • NaN or infinite lo / hi;
  • lo < 0;
  • lo > hi;
  • duplicate bucket labels.
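A one-line sketch of the closed-interval membership rule, showing the quirk-D6 boundary double-count:

```python
def bucket_indices(buckets, key):
    """Indices of every bucket whose closed interval [lo, hi] contains key.

    A key sitting exactly on a shared boundary matches both adjacent buckets.
    """
    return [i for i, (_label, lo, hi) in enumerate(buckets) if lo <= key <= hi]
```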

ClassSemanticStats

Per-class semantic-segmentation row exposed to Python.

Mirrors [vernier_semantic::ClassSemanticStats] one-to-one. The NaN-vs-0.0 disposition for zero-support classes (quirk AL2) is already baked in by the time the row reaches Python; the Python side reads the values as-is.

accuracy property

accuracy: float

class_id property

class_id: int

iou property

iou: float

n_dt_pixels property

n_dt_pixels: int

n_gt_pixels property

n_gt_pixels: int

precision property

precision: float

ConfusionMatrix

(N, N) confusion-matrix view exposed to Python.

The flat Vec<u64> storage on the Rust side is exposed as a 2-D numpy.ndarray via [PyConfusionMatrix::counts]. ADR-0028 §F1 promotes the matrix to a first-class output: downstream calibration / error-decomposition / model-diff tools consume it directly.

n_classes property

n_classes: int

Number of evaluation classes; the matrix is (n_classes, n_classes).

total property

total: int

Total pixel count across all cells. Equals sum(counts) and is useful for sanity checks.

trace property

trace: int

Trace (number of correct-class pixels). Equals sum(diag(counts)) and is the numerator of pixel accuracy.

counts method descriptor

counts() -> NDArray[uint64]

(N, N) numpy view of the confusion matrix as a fresh numpy.uint64 array. The buffer is materialized on each call (cheap for typical N ≤ 150) so the FFI boundary doesn't have to manage a long-lived borrow into the Rust-side Vec<u64>.

get method descriptor

get(g: int, d: int) -> int

counts[g, d] lookup. Returns the integer pixel count for the given (gt_class, pred_class) cell. Bounds-checked.

PartialDatasetMismatch

Bases: builtins.RuntimeError

Distributed-eval partial was computed against a different dataset than the receiving evaluator.

Attributes: expected (bytes), actual (bytes).

PartialFormatMismatch

Bases: builtins.RuntimeError

Distributed-eval partial blob is structurally malformed (magic / version / CRC / kernel kind / parity / retain_iou / grid dimensions / rkyv archive).

Attributes: kind (string discriminator).

PartialParamsMismatch

Bases: builtins.RuntimeError

Distributed-eval partial was computed against different evaluation params than the receiving evaluator.

Attributes: expected (bytes), actual (bytes).

PartialPartitionOverlap

Bases: builtins.RuntimeError

Two distributed-eval partials cover the same image_id (sampler bug).

Attributes: rank_a, rank_b, image_id.

PartialRankCollision

Bases: builtins.RuntimeError

Two strict-mode distributed-eval partials share a rank_id.

Attributes: rank_id.

Summary

Top-level semantic-evaluation result. Read via field accessors.

Sibling to [crate::PySummary] (instance) and [crate::panoptic::PyPanopticSummary] (panoptic) per ADR-0028 §"Public Python surface".

confusion_matrix property

confusion_matrix: ConfusionMatrix

The accumulated (N, N) confusion matrix as a Python wrapper. Constructed fresh on each access from a clone of the underlying matrix (the matrix is small — at most 150² = 22500 cells — so the clone is cheap).

fwiou property

fwiou: float

mean_accuracy property

mean_accuracy: float

miou property

miou: float

pixel_accuracy property

pixel_accuracy: float

per_class method descriptor

per_class() -> dict[int, ClassSemanticStats]

Per-class rows keyed by class id. Returns a Python dict (constructed fresh on each call from the underlying BTreeMap).

per_group method descriptor

per_group() -> dict[str, GroupSemanticStats]

Per-group rollup keyed by group label (ADR-0041). Empty when the evaluator was run without class_grouping. Returns a fresh dict on each call.

CategoryFilterAll dataclass

CategoryFilterAll()

Match every category. The COCO default.

CategoryFilterByGrouping dataclass

CategoryFilterByGrouping(label: str)

Match every class id in the named group of the active class_grouping breakdown.

Only meaningful when the Evaluator's class_grouping is also set; the validator at __post_init__ rejects ByGrouping when no grouping is configured or when label is not a grouping label.

CategoryFilterByIds dataclass

CategoryFilterByIds(ids: frozenset[int])

Match an explicit set of class / category ids.

CategoryFilterFrequency dataclass

CategoryFilterFrequency(tag: Literal['r', 'c', 'f'])

Match by LVIS frequency tag ("r", "c", "f").

Valid only on instance evaluation against an LVIS-shaped dataset (ADR-0026). Semantic and panoptic Evaluators reject this variant at construction time per ADR-0041 / ADR-0042 — frequency tags are a sum type that doesn't generalize to non-numeric axes; class groupings carry the user's per-group rollup intent on those paradigms.

InvalidEvalParams

InvalidEvalParams(
    *, field: str, value: object, remediation: str
)

Bases: ValueError

Base for paradigm-specific Evaluator construction errors.

Raised at Evaluator.__post_init__ time by every paradigm in response to invalid parameter values (out of range, wrong shape, duplicate, conflicting, etc.). Per ADR-0039, validation runs at construction so misconfiguration surfaces fast — evaluate() cannot fail on misconfigured params, only on bad data.

Each subclass carries the offending field name, the offending value, and a one-line remediation pointer (typically the relevant ADR or doc page).

InvalidSemanticParams

InvalidSemanticParams(
    *, field: str, value: object, remediation: str
)

Bases: InvalidEvalParams

Invalid vernier.semantic.Evaluator parameter (ADR-0041).

EvalResult dataclass

EvalResult(
    summary: SemanticSummary,
    _per_class_batch: object | None = None,
)

Opt-in result of :meth:Evaluator.evaluate when tables= is passed. Carries :class:Summary plus a polars DataFrame view of the per-class semantic breakdown in the canonical mmseg / ADE20K column order. Tables on :meth:Evaluator.evaluate_from_pngs (ADR-0037 fused path) are deferred.

per_class cached property

per_class: DataFrame

One row per class. Columns: category_id, iou, accuracy, precision, n_gt_pixels, n_dt_pixels, tp_pixels, fp_pixels, fn_pixels.
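The tp_pixels / fp_pixels / fn_pixels columns are pure functions of the confusion matrix. A numpy sketch of the derivation, assuming rows index the ground-truth class and columns the predicted class as in :class:ConfusionMatrix:

```python
import numpy as np

def per_class_pixel_counts(m):
    """Derive the tp/fp/fn pixel columns from an (N, N) confusion matrix."""
    tp = np.diag(m)
    fp = m.sum(axis=0) - tp   # predicted as c, but the GT class is something else
    fn = m.sum(axis=1) - tp   # GT class is c, but predicted as something else
    return tp, fp, fn
```

With these three columns in hand, the iou column is just tp / (tp + fp + fn).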

Dataset dataclass

Dataset(
    label_maps: Mapping[int, ndarray],
    n_classes: int,
    ignore_label: int | None = None,
)

Semantic-segmentation ground truth.

Carries per-image label maps, the evaluation class count, and the ignore label. Constructors:

  • :meth:from_arrays — pre-decoded uint8 / uint16 / uint32 ndarray per image. Zero-copy on the FFI hot path; the kernel walks at native dtype (ADR-0037).
  • :meth:from_files — PNG paths; decoded via Pillow.
  • :meth:cityscapes / :meth:ade20k / :meth:pascal_voc — per-dataset presets that bake the canonical n_classes and ignore_label conventions.

Predictions are constructed via the sibling :class:Predictions type — same shape, no class-count metadata (the dataset is the authoritative source).

from_arrays classmethod

from_arrays(
    label_maps: Mapping[int, ndarray],
    n_classes: int,
    ignore_label: int | None = None,
) -> Dataset

Construct from pre-decoded label-map ndarray per image.

Each label_maps[image_id] must be a 2-D integer array. The FFI boundary accepts uint8 / uint16 / uint32 natively (ADR-0037); the kernel walks at native dtype, so passing uint8 from a torch.Tensor avoids the 4x upcast that earlier wheel versions paid here.

from_files classmethod

from_files(
    png_paths: Mapping[int, str | Path],
    n_classes: int,
    ignore_label: int | None = None,
) -> Dataset

Decode PNG paths via Pillow into a :class:Dataset.

Each PNG must be a single-channel label map. Multi-channel (RGB-encoded) panoptic PNGs belong in :class:vernier.panoptic.Dataset — they are rejected here with a :class:ValueError.

cityscapes classmethod

cityscapes(png_paths: Mapping[int, str | Path]) -> Dataset

Cityscapes 19-class evaluation preset.

n_classes=19, ignore_label=255. PNGs are expected to be *_labelTrainIds.png files (already in trainId space — the 30+ raw label space pre-mapped to 19 classes by the cityscapesscripts/preparation/createTrainIdLabelImgs.py helper).

Users with predictions in the raw 30+ label space should construct :class:Evaluator with an explicit label_remap dict that maps raw ids to trainIds.
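Since label_remap is a plain raw-id to trainId dict, the same mapping can also be applied up front on the arrays themselves (which is what the streaming and fused paths require). A lookup-table sketch, assuming all ids fit in uint8; the two ids shown (road 7 → 0, car 26 → 13) are from the standard Cityscapes mapping:

```python
import numpy as np

def apply_label_remap(label_map, remap, n_labels=256, fill=255):
    """Vectorized raw-id -> trainId remap via a lookup table.

    Ids absent from `remap` map to `fill` (the ignore label).
    """
    lut = np.full(n_labels, fill, dtype=np.uint8)
    for raw_id, train_id in remap.items():
        lut[raw_id] = train_id
    return lut[label_map]
```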

ade20k classmethod

ade20k(png_paths: Mapping[int, str | Path]) -> Dataset

ADE20K (SceneParse150) preset.

n_classes=150, ignore_label=0. Class 0 is the "other/unlabeled" sentinel; predictions are expected in [1, 150].

pascal_voc classmethod

pascal_voc(png_paths: Mapping[int, str | Path]) -> Dataset

Pascal VOC preset.

n_classes=21, ignore_label=255. 20 object classes plus a background class indexed at 0; the 255 sentinel marks void pixels at object boundaries.

Predictions dataclass

Predictions(label_maps: Mapping[int, ndarray])

Semantic-segmentation predictions.

Carries per-image label maps only — class-count and ignore-label metadata live on the sibling :class:Dataset (the authoritative source). Constructors mirror :class:Dataset:

  • :meth:from_arrays — pre-decoded uint8 / uint16 / uint32 ndarray per image (ADR-0037).
  • :meth:from_files — PNG paths; decoded via Pillow.
  • :meth:from_binary_masks — per-class binary masks merged into a single class-id label map per image (quirk AN2).

from_arrays classmethod

from_arrays(
    label_maps: Mapping[int, ndarray],
) -> Predictions

Construct from pre-decoded label-map ndarray per image.

Accepts uint8 / uint16 / uint32 natively (ADR-0037); the FFI walks at the input dtype.

from_files classmethod

from_files(
    png_paths: Mapping[int, str | Path],
) -> Predictions

Decode PNG paths via Pillow into a :class:Predictions.

See :meth:Dataset.from_files for the PNG shape contract.

from_binary_masks classmethod

from_binary_masks(
    masks: Mapping[int, NDArray[uint8]],
    *,
    merge: Literal[
        "argmax", "first", "highest_class_id"
    ] = "argmax",
    unlabeled_class: int = 255,
) -> Predictions

Merge per-image (n_classes, H, W) binary mask stacks into single class-id label maps (quirks AN2, AN3, AN4).

merge selects the precedence rule for overlapping pixels:

  • "argmax" (default) — argmax over channels; on score input the highest score wins, on binary input ties go to the lowest class id.
  • "first" — the first channel with value 1 wins (order-dependent).
  • "highest_class_id" — the highest class id with value 1 wins (convention-dependent).

unlabeled_class is written into pixels where every mask is zero (default 255 matches the Cityscapes / Pascal VOC ignore convention).
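A numpy sketch of the three merge rules (illustrative, not the shipped implementation). Note that on strictly binary input "first" and "argmax" coincide, since argmax breaks ties toward the lowest channel index:

```python
import numpy as np

def merge_binary_masks(masks, merge="argmax", unlabeled_class=255):
    """Merge an (n_classes, H, W) binary stack into an (H, W) label map."""
    any_hit = masks.any(axis=0)
    if merge in ("argmax", "first"):
        # argmax returns the first maximal index: lowest class id wins ties,
        # which on 0/1 input is also the first channel with a 1.
        labels = masks.argmax(axis=0)
    elif merge == "highest_class_id":
        # Scan channels high -> low so the highest class id with a 1 wins.
        labels = masks.shape[0] - 1 - masks[::-1].argmax(axis=0)
    else:
        raise ValueError(merge)
    # Pixels where every mask is zero get the unlabeled sentinel.
    return np.where(any_hit, labels, unlabeled_class).astype(np.uint16)
```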

Evaluator dataclass

Evaluator(
    parity_mode: ParityMode = "corrected",
    label_remap: Mapping[int, int] | None = None,
    class_filter: CategoryFilter | None = None,
    class_grouping: Breakdown | None = None,
)

Semantic-segmentation evaluator (ADR-0028, ADR-0041).

Sibling to :class:vernier.instance.Evaluator (AP fold) and :class:vernier.panoptic.Evaluator (panoptic-quality). Computes mIoU, FWIoU, pixel accuracy, mean accuracy, per-class IoU, and per-class accuracy from per-image confusion matrices accumulated into a global confusion matrix.

The instance is immutable per ADR-0006: construct once, call :meth:evaluate per dataset/predictions pair.

Defaults match net-new-user expectations: parity_mode="corrected" (per ADR-0002 recommendation); migrating users wanting bit-exact mmsegmentation behavior should set parity_mode="strict".

class_filter and class_grouping (ADR-0041) parameterize the evaluation scope and rollup. class_filter restricts the headline scalars (miou etc.) to a subset of classes; class_grouping adds an optional per-group rollup alongside per_class. Both default to None (kernel-canonical: every class contributes to headline scalars; no per-group rollup).

PR scope cut: the kernel-side plumbing for honoring custom class_filter / class_grouping is a follow-up to ADR-0041 (per_group lives on SemanticSummary once the Rust struct gains it). Until that lands, evaluate() raises :class:NotImplementedError when either is set; the surface (fields, validation, with_options threading) is in place.

evaluate

evaluate(
    gt: Dataset, dt: Predictions, *, tables: None = None
) -> SemanticSummary
evaluate(
    gt: Dataset,
    dt: Predictions,
    *,
    tables: Literal["all"] | tuple[TableName, ...],
) -> EvalResult
evaluate(
    gt: Dataset,
    dt: Predictions,
    *,
    tables: Literal["all"]
    | tuple[TableName, ...]
    | None = None,
) -> SemanticSummary | EvalResult

Run the semantic-segmentation evaluation.

gt and dt may be constructed via any combination of the per-class constructors (:meth:Dataset.from_arrays, :meth:Predictions.from_files, etc.); the FFI consumes whatever uint8 / uint16 / uint32 ndarrays the constructors produced (ADR-0037).

Returns a :class:Summary carrying the four headline scalars (mIoU / FWIoU / pixel_accuracy / mean_accuracy), the per-class breakdown, and the accumulated :class:ConfusionMatrix.

tables= is the opt-in keyword for result tables (ADR-0038). Defaults to None, returning :class:Summary (existing behavior). Pass "all" or a tuple of :data:TableName values to opt into the wider :class:EvalResult return type.

evaluate_from_pngs

evaluate_from_pngs(
    gt_paths: Mapping[int, str | Path],
    dt_paths: Mapping[int, str | Path],
    *,
    n_classes: int,
    ignore_label: int | None = None,
) -> SemanticSummary

Run the evaluation directly against 8-bit grayscale PNG label maps on disk (ADR-0037).

Skips the Pillow → ndarray → astype(np.uint32) pipeline: the libpng decode and the per-image confusion-matrix fold run in Rust under py.detach, so the GIL is released for the whole batch and only one decoded label-map per side is in flight at a time. Memory ceiling is bounded by image size, not dataset size.

Format contract: 8-bit grayscale PNGs only (the natural width for class-id label maps up to 256 classes; covers every dataset in the per-paradigm presets — Cityscapes, ADE20K, Pascal VOC, plus panoptic-derived semantic). Wider class-id ranges should use :meth:evaluate with np.uint16 / np.uint32 ndarrays.

label_remap does not propagate to the fused path; callers needing a remap should rewrite the DT PNGs upstream or fall back to :meth:evaluate.

evaluate_to_partial

evaluate_to_partial(
    gt: Dataset, dt: Predictions, *, rank_id: int
) -> bytes

Run the evaluation as a per-rank streaming submit and return the serialized partial bytes (ADR-0032, ADR-0035).

rank_id identifies this evaluator's rank in a multi-process eval. The partial bytes can be gathered across ranks and merged with :meth:from_partials to produce a global :class:Summary bit-equal to a batch :meth:evaluate over the union (semantic confusion-matrix sums are u64-additive, so strict-mode bit-equality is unconditional per ADR-0032).

label_remap does not propagate to the streaming path — callers needing remap apply it on the DT arrays themselves before passing the dataset in.

from_partials classmethod

from_partials(
    n_classes: int,
    partials: Sequence[bytes],
    /,
    *,
    parity_mode: ParityMode = "corrected",
    ignore_label: int | None = None,
) -> SemanticSummary

Merge partials (one per rank) into a global :class:Summary (ADR-0032, ADR-0035).

n_classes, parity_mode, and ignore_label must match what each rank used to produce its partial. Mismatches raise the structured Partial* errors re-exported on this module.
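The merge itself is nothing more than an elementwise u64 sum of per-rank confusion matrices, which is why strict-mode bit-equality over the union is unconditional. A self-contained numpy sketch (fold here is an illustrative stand-in for the kernel's per-image update, not the wire format):

```python
import numpy as np

def fold(gt, dt, n):
    """Confusion fold over one rank's labels (rows = GT, cols = predicted)."""
    idx = gt.ravel().astype(np.int64) * n + dt.ravel().astype(np.int64)
    return np.bincount(idx, minlength=n * n).reshape(n, n)

def merge_partials(matrices):
    """The whole of the semantic merge step: elementwise integer sum."""
    return np.sum(matrices, axis=0)
```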

background

background(
    n_classes: int,
    ignore_label: int | None = None,
    *,
    rank_id: int | None = None,
    queue_capacity: int = 8,
    worker_affinity: int | None = None,
    worker_nice: int = 5,
    shutdown_timeout_seconds: float = 5.0,
) -> BackgroundSemanticEvaluator

Build a :class:BackgroundEvaluator (ADR-0014 + ADR-0032) that shares this evaluator's parity_mode.

The returned wrapper owns a single dedicated worker thread running a :class:StreamingEvaluator of the same shape; calls to :meth:BackgroundEvaluator.submit enqueue per-image (gt, dt) pairs and return immediately. Use this when the confusion-matrix fold measurably stalls the training loop (most users won't notice — semantic is fast — but per-image update on dense Cityscapes pixels is the case where this starts to matter).

rank_id, when set, identifies this evaluator's rank in a multi-process eval (ADR-0032). The queueing / scheduling knobs (queue_capacity, worker_affinity, worker_nice, shutdown_timeout_seconds) mirror :meth:vernier.instance.Evaluator.background.

label_remap does not propagate to the background path — callers needing remap on a streaming path apply it on the DT arrays themselves before each submit call.

decode_label_map_png

decode_label_map_png(path: str | Path) -> NDArray[uint32]

Load a single-channel PNG label map as a (H, W) uint32 array.

Lazy-imports Pillow; raises a structured :class:ImportError if Pillow is not installed. Multi-channel (RGB-encoded) panoptic PNGs belong in :func:vernier.panoptic.decode_label_map_png — they are rejected here with :class:ValueError.

Type aliases

Per-dataset constants