vernier.semantic
Semantic-segmentation evaluation (mIoU + per-class confusion matrix).
Evaluator + evaluate(...) mirror the instance API; the
*_IGNORE_LABEL / *_N_CLASSES constants are convenience defaults for
common public datasets.
Semantic-segmentation evaluation surface (ADR-0028).
Per ADR-0029, the semantic-segmentation evaluation paradigm lives
under vernier.semantic. Sibling to :mod:vernier.instance (the
AP fold) and :mod:vernier.panoptic (panoptic-quality). The Rust
kernel ships in the vernier-semantic crate; this module is a thin
Python wrapper.
Surface:
- :class:Dataset / :class:Predictions: frozen dataclasses carrying the per-image label maps + dataset-level config (n_classes / ignore_label).
- :class:Evaluator: frozen dataclass holding parity_mode and optional label_remap. Evaluator.evaluate(gt, dt) returns a :class:Summary (the FFI pyclass).
- :class:Summary / :class:ClassSemanticStats / :class:ConfusionMatrix: re-exported FFI pyclasses (under their unprefixed names per ADR-0029).
- Per-dataset presets (:meth:Dataset.cityscapes, :meth:Dataset.ade20k, :meth:Dataset.pascal_voc): bake the canonical ignore-label and class-count conventions; the user only passes the PNG paths.
BackgroundEvaluator
BackgroundEvaluator(
n_classes: int,
parity_mode: str,
*,
ignore_label: int | None = ...,
rank_id: int | None = ...,
queue_capacity: int = ...,
worker_affinity: int | None = ...,
worker_nice: int = ...,
shutdown_timeout_seconds: float = ...,
)
Background semantic-segmentation evaluator (ADR-0014 + ADR-0032).
Wraps a single dedicated worker thread that owns the underlying
[StreamingSemanticEvaluator]. submit(image_id, gt, dt) posts
one image; snapshot() / finalize() block on a worker reply.
Mirrors the panoptic and instance background surfaces: same
constructor knobs (queue_capacity, worker_affinity,
worker_nice, shutdown_timeout_seconds), same context-manager
lifecycle, same to_partial / finalize_to_partial for
distributed-eval gather (ADR-0032).
n_classes
property
Number of evaluation classes (constant for the lifetime of the evaluator).
n_images
property
Mirror of the underlying evaluator's n_images. Advisory —
updated by the worker after each successful submit.
finalize
method descriptor
finalize() -> SemanticSummary
Drain the queue, finalize the evaluator, and join the worker.
finalize_to_partial
method descriptor
ADR-0032 / ADR-0035: drain, serialize the final state, and shut the worker down.
submit
method descriptor
submit(
image_id: int,
gt: NDArray[unsignedinteger[Any]],
dt: NDArray[unsignedinteger[Any]],
*,
timeout: float | None = None,
) -> None
Submit one image's (gt, dt) label-map pair to the worker.
Accepts uint8 / uint16 / uint32 2-D ndarrays (ADR-0037);
the worker walks at native dtype without an upcast. timeout
mirrors the instance background:
- None (default) → block until a slot is free
- 0.0 → single non-blocking attempt; raise QueueFullError if the queue is full
- t > 0.0 → wait up to t seconds; raise QueueFullError on timeout
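The three timeout modes follow the usual bounded-queue pattern. As a rough illustration only, the sketch below reproduces them with the standard-library queue.Queue; put_with_timeout is a hypothetical helper, and queue.Full stands in for QueueFullError. It does not show how the worker channel is implemented.

```python
import queue
from typing import Any, Optional


def put_with_timeout(q: "queue.Queue[Any]", item: Any,
                     timeout: Optional[float] = None) -> None:
    """Enqueue `item` using the three timeout modes listed above."""
    if timeout is None:
        q.put(item)                    # block until a slot is free
    elif timeout == 0.0:
        q.put_nowait(item)             # one attempt; raises queue.Full if no slot
    else:
        q.put(item, timeout=timeout)   # wait up to `timeout` s; raises queue.Full


q = queue.Queue(maxsize=1)             # capacity 1, so a second put finds it full
put_with_timeout(q, "image-0")         # fits; returns immediately

outcomes = []
for t in (0.0, 0.05):
    try:
        put_with_timeout(q, "image-1", timeout=t)
        outcomes.append("accepted")
    except queue.Full:
        outcomes.append("full")

print(outcomes)  # ['full', 'full']
```

In a training loop, timeout=0.0 suits a "drop or retry later" policy, while the default None applies backpressure to the producer.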
submit_png
method descriptor
submit_png(
image_id: int,
gt_png_bytes: bytes,
dt_png_bytes: bytes,
*,
timeout: float | None = None,
) -> None
Submit one image's (gt_png_bytes, dt_png_bytes) 8-bit grayscale
PNG pair to the worker (ADR-0037). Decodes synchronously on the
FFI thread (under py.detach) and sends the native-width u8
label maps across the channel; the worker folds at native width
without a 4× upcast.
Strictly equivalent to
submit(image_id, decode_label_map_png(gt_path).astype(uint32),
decode_label_map_png(dt_path).astype(uint32), ...); the diff
is wall-time, not correctness.
Format contract: 8-bit grayscale only. timeout mirrors submit.
Breakdown
Python wrapper around [Breakdown] / [ClassGroupBreakdown].
buckets
property
Range buckets as a list of (label, lo, hi) triples in
construction order.
Raises AttributeError if this Breakdown was built via
from_class_groups. Use class_groups instead.
class_groups
property
Class-id groups as a list of (label, class_ids) pairs in
construction order.
Raises AttributeError if this Breakdown was built via
from_ranges. Use buckets instead.
kind
property
Variant discriminator: "range" for from_ranges-constructed
breakdowns, "class_groups" for from_class_groups-constructed
ones. Use this to dispatch in validators that accept a
Breakdown of a specific shape.
from_class_groups
builtin
from_class_groups(
axis: str, groups: Sequence[tuple[str, Sequence[int]]]
) -> Breakdown
Construct from class-id-keyed groups.
groups is a sequence of (label, class_ids) pairs, one per
group. Group order on input determines the group axis index
(first pair is index 0). Strict partition discipline is enforced
— no class id may appear in two groups.
Raises ValueError on:
- empty groups;
- any group with empty class_ids;
- duplicate group labels;
- the same class id appearing in more than one group.
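The four rejection rules can be restated as a small pure-Python check. This is a sketch of the documented contract, not the library's validator: validate_class_groups is a hypothetical name, and a class id repeated within a single group is tolerated here because the text above does not say how that case is treated.

```python
from typing import Dict, Sequence, Set, Tuple


def validate_class_groups(groups: Sequence[Tuple[str, Sequence[int]]]) -> None:
    """Raise ValueError for any of the four conditions listed above."""
    if not groups:
        raise ValueError("groups must not be empty")
    seen_labels: Set[str] = set()
    owner: Dict[int, str] = {}          # class id -> label of the group holding it
    for label, class_ids in groups:
        if not class_ids:
            raise ValueError(f"group {label!r} has no class ids")
        if label in seen_labels:
            raise ValueError(f"duplicate group label {label!r}")
        seen_labels.add(label)
        for cid in set(class_ids):      # assumption: in-group repeats are ignored
            if cid in owner:
                raise ValueError(
                    f"class id {cid} is in both {owner[cid]!r} and {label!r}"
                )
            owner[cid] = label


# A valid partition passes silently.
validate_class_groups([("vehicles", [13, 14, 15]), ("humans", [11, 12])])

# Class id 13 in two groups violates the partition rule.
try:
    validate_class_groups([("vehicles", [13, 14]), ("cars", [13])])
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
```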
from_ranges
builtin
from_ranges(
axis: str, buckets: Sequence[tuple[str, float, float]]
) -> Breakdown
Construct from f64-keyed buckets.
buckets is a sequence of (label, lo, hi) triples, one per
bucket. [lo, hi] is closed on both ends per ADR-0016 (quirk
D6); an annotation whose key sits exactly on a boundary lands in
both adjacent buckets.
Raises ValueError on:
- empty buckets;
- NaN or infinite lo / hi;
- lo < 0;
- lo > hi;
- duplicate bucket labels.
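Because both ends of [lo, hi] are closed, a key that sits exactly on a shared boundary matches two buckets. The sketch below shows that membership rule in plain Python; buckets_containing is a hypothetical helper and the bucket values are illustrative, not defaults of the library.

```python
from typing import List, Sequence, Tuple

Bucket = Tuple[str, float, float]  # (label, lo, hi), closed on both ends


def buckets_containing(key: float, buckets: Sequence[Bucket]) -> List[str]:
    """Return the labels of every bucket whose closed range contains `key`."""
    return [label for label, lo, hi in buckets if lo <= key <= hi]


# Illustrative area buckets; adjacent buckets share their boundary value.
buckets = [
    ("small", 0.0, 1024.0),
    ("medium", 1024.0, 9216.0),
    ("large", 9216.0, 1e10),
]

print(buckets_containing(500.0, buckets))   # ['small']
print(buckets_containing(1024.0, buckets))  # ['small', 'medium']
```

Per-bucket counts therefore need not sum to the total number of annotations when keys fall on boundaries.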
ClassSemanticStats
Per-class semantic-segmentation row exposed to Python.
Mirrors [vernier_semantic::ClassSemanticStats] one-to-one. The
NaN-vs-0.0 disposition for zero-support classes (quirk AL2) is
already baked in by the time the row reaches Python; the Python
side reads the values as-is.
ConfusionMatrix
(N, N) confusion-matrix view exposed to Python.
The flat Vec<u64> storage on the Rust side is exposed as a 2-D
numpy.ndarray via [PyConfusionMatrix::counts]. ADR-0028 §F1
promotes the matrix to a first-class output: downstream calibration
/ error-decomposition / model-diff tools consume it directly.
n_classes
property
Number of evaluation classes; the matrix is (n_classes, n_classes).
total
property
Total pixel count across all cells. Equals
sum(counts) and is useful for sanity checks.
trace
property
Trace (number of correct-class pixels). Equals
sum(diag(counts)) and is the numerator of pixel accuracy.
counts
method descriptor
(N, N) numpy view of the confusion matrix as a fresh
numpy.uint64 array. The buffer is materialized on each call
(cheap for typical N ≤ 150) so the FFI boundary doesn't have
to manage a long-lived borrow into the Rust-side Vec<u64>.
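The relationships among counts, total, and trace can be checked on a stand-in array. The sketch below uses a hand-written NumPy matrix rather than a real ConfusionMatrix object, so only the arithmetic is shown; the cell values are made up for illustration.

```python
import numpy as np

# Stand-in for the (N, N) uint64 array that counts() returns, with N = 3.
counts = np.array(
    [[50, 2, 0],
     [3, 40, 1],
     [0, 4, 20]],
    dtype=np.uint64,
)

total = int(counts.sum())        # all pixels that entered the matrix
trace = int(np.trace(counts))    # pixels on the diagonal (correct class)
pixel_accuracy = trace / total   # trace is the numerator, total the denominator

print(total, trace, round(pixel_accuracy, 4))  # 120 110 0.9167
```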
PartialDatasetMismatch
Bases: builtins.RuntimeError
Distributed-eval partial was computed against a different dataset than the receiving evaluator.
Attributes: expected (bytes), actual (bytes).
PartialFormatMismatch
Bases: builtins.RuntimeError
Distributed-eval partial blob is structurally malformed (magic / version / CRC / kernel kind / parity / retain_iou / grid dimensions / rkyv archive).
Attributes: kind (string discriminator).
PartialParamsMismatch
Bases: builtins.RuntimeError
Distributed-eval partial was computed against different evaluation params than the receiving evaluator.
Attributes: expected (bytes), actual (bytes).
PartialPartitionOverlap
Bases: builtins.RuntimeError
Two distributed-eval partials cover the same image_id (sampler bug).
Attributes: rank_a, rank_b, image_id.
PartialRankCollision
Bases: builtins.RuntimeError
Two strict-mode distributed-eval partials share a rank_id.
Attributes: rank_id.
Summary
Top-level semantic-evaluation result. Read via field accessors.
Sibling to [crate::PySummary] (instance) and
[crate::panoptic::PyPanopticSummary] (panoptic) per ADR-0028
§"Public Python surface".
confusion_matrix
property
confusion_matrix: ConfusionMatrix
The accumulated (N, N) confusion matrix as a Python wrapper.
Constructed fresh on each access from a clone of the underlying
matrix (the matrix is small — at most 150² = 22500 cells —
so the clone is cheap).
per_class
method descriptor
per_class() -> dict[int, ClassSemanticStats]
Per-class rows keyed by class id. Returns a Python dict
(constructed fresh on each call from the underlying BTreeMap).
CategoryFilterByGrouping
dataclass
Match every class id in the named group of the active
class_grouping breakdown.
Only meaningful when the Evaluator's class_grouping is also
set; the validator at __post_init__ rejects ByGrouping
when no grouping is configured or when label is not a
grouping label.
CategoryFilterByIds
dataclass
Match an explicit set of class / category ids.
CategoryFilterFrequency
dataclass
Match by LVIS frequency tag ("r", "c", "f").
Valid only on instance evaluation against an LVIS-shaped dataset (ADR-0026). Semantic and panoptic Evaluators reject this variant at construction time per ADR-0041 / ADR-0042 — frequency tags are a sum type that doesn't generalize to non-numeric axes; class groupings carry the user's per-group rollup intent on those paradigms.
InvalidEvalParams
Bases: ValueError
Base for paradigm-specific Evaluator construction errors.
Raised at Evaluator.__post_init__ time by every paradigm in
response to invalid parameter values (out of range, wrong shape,
duplicate, conflicting, etc.). Per ADR-0039, validation runs at
construction so misconfiguration surfaces fast — evaluate()
cannot fail on misconfigured params, only on bad data.
Each subclass carries the offending field name, the offending value, and a one-line remediation pointer (typically the relevant ADR or doc page).
InvalidSemanticParams
EvalResult
dataclass
EvalResult(
summary: SemanticSummary,
_per_class_batch: object | None = None,
)
Opt-in result of :meth:Evaluator.evaluate when tables= is
passed. Carries :class:Summary plus a polars DataFrame view of
the per-class semantic breakdown in the canonical mmseg / ADE20K
column order. Tables on :meth:Evaluator.evaluate_from_pngs
(ADR-0037 fused path) are deferred.
Dataset
dataclass
Semantic-segmentation ground truth.
Carries per-image label maps, the evaluation class count, and the ignore label. Constructors:
- :meth:from_arrays takes a pre-decoded uint8 / uint16 / uint32 ndarray per image. Zero-copy on the FFI hot path; the kernel walks at native dtype (ADR-0037).
- :meth:from_files takes PNG paths, decoded via Pillow.
- :meth:cityscapes / :meth:ade20k / :meth:pascal_voc are per-dataset presets that bake the canonical n_classes and ignore_label conventions.
Predictions are constructed via the sibling :class:Predictions
type — same shape, no class-count metadata (the dataset is the
authoritative source).
from_arrays
classmethod
from_arrays(
label_maps: Mapping[int, ndarray],
n_classes: int,
ignore_label: int | None = None,
) -> Dataset
Construct from pre-decoded label-map ndarray per image.
Each label_maps[image_id] must be a 2-D integer array. The
FFI boundary accepts uint8 / uint16 / uint32 natively
(ADR-0037); the kernel walks at native dtype, so passing
uint8 from a torch.Tensor avoids the 4x upcast that
earlier wheel versions paid here.
from_files
classmethod
from_files(
png_paths: Mapping[int, str | Path],
n_classes: int,
ignore_label: int | None = None,
) -> Dataset
Decode PNG paths via Pillow into a :class:Dataset.
Each PNG must be a single-channel label map. Multi-channel
(RGB-encoded) panoptic PNGs belong in
:class:vernier.panoptic.Dataset — they are rejected here
with a :class:ValueError.
cityscapes
classmethod
cityscapes(png_paths: Mapping[int, str | Path]) -> Dataset
Cityscapes 19-class evaluation preset.
n_classes=19, ignore_label=255. PNGs are expected to
be *_labelTrainIds.png files (already in trainId space —
the 30+ raw label space pre-mapped to 19 classes by the
cityscapesscripts/preparation/createTrainIdLabelImgs.py
helper).
Users with predictions in the raw 30+ label space should
construct :class:Evaluator with an explicit label_remap
dict that maps raw ids to trainIds.
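For paths where a remap has to be applied to the arrays directly, a lookup table is one common approach. The sketch below is an assumption-laden illustration: the mapping is a small subset of the raw-id to trainId table as commonly listed in cityscapesscripts (verify against its labels.py before relying on it), build_lut is a hypothetical helper, and sending unmapped ids to 255 is a choice made for this example.

```python
import numpy as np

# Illustrative subset only: road, sidewalk, building, person, car.
RAW_TO_TRAIN = {7: 0, 8: 1, 11: 2, 24: 11, 26: 13}


def build_lut(mapping, fill=255, size=256):
    """Build a uint8 lookup table; ids missing from `mapping` become `fill`."""
    lut = np.full(size, fill, dtype=np.uint8)
    for raw_id, train_id in mapping.items():
        lut[raw_id] = train_id
    return lut


lut = build_lut(RAW_TO_TRAIN)

# A tiny raw-id prediction map; id 0 is not in the mapping above.
raw = np.array([[7, 8, 0],
                [26, 24, 11]], dtype=np.uint8)

train = lut[raw]   # fancy indexing remaps every pixel in one pass
print(train.tolist())  # [[0, 1, 255], [13, 11, 2]]
```

The result stays uint8, so it can be passed on without widening the label map.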
ade20k
classmethod
ade20k(png_paths: Mapping[int, str | Path]) -> Dataset
ADE20K (SceneParse150) preset.
n_classes=150, ignore_label=0. Class 0 is the
"other/unlabeled" sentinel; predictions are expected in
[1, 150].
pascal_voc
classmethod
pascal_voc(png_paths: Mapping[int, str | Path]) -> Dataset
Pascal VOC preset.
n_classes=21, ignore_label=255. 20 object classes plus
a background class indexed at 0; the 255 sentinel marks
void pixels at object boundaries.
Predictions
dataclass
Semantic-segmentation predictions.
Carries per-image label maps only — class-count and ignore-label
metadata live on the sibling :class:Dataset (the authoritative
source). Constructors mirror :class:Dataset:
- :meth:from_arrays takes a pre-decoded uint8 / uint16 / uint32 ndarray per image (ADR-0037).
- :meth:from_files takes PNG paths, decoded via Pillow.
- :meth:from_binary_masks takes per-class binary masks and merges them into a single class-id label map per image (quirk AN2).
from_arrays
classmethod
from_arrays(
label_maps: Mapping[int, ndarray],
) -> Predictions
Construct from pre-decoded label-map ndarray per image.
Accepts uint8 / uint16 / uint32 natively (ADR-0037);
the FFI walks at the input dtype.
from_files
classmethod
from_files(
png_paths: Mapping[int, str | Path],
) -> Predictions
Decode PNG paths via Pillow into a :class:Predictions.
See :meth:Dataset.from_files for the PNG shape contract.
from_binary_masks
classmethod
from_binary_masks(
masks: Mapping[int, NDArray[uint8]],
*,
merge: Literal[
"argmax", "first", "highest_class_id"
] = "argmax",
unlabeled_class: int = 255,
) -> Predictions
Merge per-image (n_classes, H, W) binary mask stacks into
single class-id label maps (quirks AN2, AN3, AN4).
merge selects the precedence rule for overlapping pixels:
- "argmax" (default): argmax over channels; on score input the highest score wins, on binary input ties go to the lowest class id.
- "first": the first channel with value 1 wins (order-dependent).
- "highest_class_id": the highest class id with value 1 wins (convention-dependent).
unlabeled_class is written into pixels where every mask
is zero (default 255 matches the Cityscapes / Pascal VOC
ignore convention).
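The precedence rules can be illustrated with plain NumPy on a strictly binary stack. merge_binary_masks below is a hypothetical stand-in written for this example, not the library's code; it assumes 0/1 input and fewer than 256 classes.

```python
import numpy as np


def merge_binary_masks(stack, merge="argmax", unlabeled_class=255):
    """Collapse an (n_classes, H, W) 0/1 stack into an (H, W) uint8 label map."""
    stack = np.asarray(stack)
    n_classes = stack.shape[0]
    if merge == "argmax":
        # np.argmax returns the first maximum, so binary ties go to the lowest id.
        labels = stack.argmax(axis=0)
    elif merge == "first":
        # First channel whose value is exactly 1.
        labels = (stack == 1).argmax(axis=0)
    elif merge == "highest_class_id":
        # Reverse the channel axis so argmax finds the last set channel.
        labels = n_classes - 1 - (stack[::-1] == 1).argmax(axis=0)
    else:
        raise ValueError(f"unknown merge rule {merge!r}")
    labels = labels.astype(np.uint8)
    labels[~stack.any(axis=0)] = unlabeled_class  # pixels with no mask set
    return labels


# Three classes on a 2x2 image; two pixels overlap, one pixel is unlabeled.
stack = np.array([
    [[1, 0], [0, 0]],   # class 0
    [[1, 1], [0, 0]],   # class 1
    [[0, 1], [0, 1]],   # class 2
], dtype=np.uint8)

print(merge_binary_masks(stack, "argmax").tolist())            # [[0, 1], [255, 2]]
print(merge_binary_masks(stack, "highest_class_id").tolist())  # [[1, 2], [255, 2]]
```

On binary input "argmax" and "first" agree; they differ only when the stack carries scores.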
Evaluator
dataclass
Evaluator(
parity_mode: ParityMode = "corrected",
label_remap: Mapping[int, int] | None = None,
class_filter: CategoryFilter | None = None,
class_grouping: Breakdown | None = None,
)
Semantic-segmentation evaluator (ADR-0028, ADR-0041).
Sibling to :class:vernier.instance.Evaluator (AP fold) and
:class:vernier.panoptic.Evaluator (panoptic-quality). Computes
mIoU, FWIoU, pixel accuracy, mean accuracy, per-class IoU, and
per-class accuracy from per-image confusion matrices accumulated
into a global confusion matrix.
The instance is immutable per ADR-0006: construct once, call
:meth:evaluate per dataset/predictions pair.
Defaults match net-new-user expectations: parity_mode="corrected"
(per ADR-0002 recommendation); migrating users wanting bit-exact
mmsegmentation behavior should set parity_mode="strict".
class_filter and class_grouping (ADR-0041) parameterize the
evaluation scope and rollup. class_filter restricts the headline
scalars (miou etc.) to a subset of classes; class_grouping
adds an optional per-group rollup alongside per_class. Both
default to None (kernel-canonical: every class contributes to
headline scalars; no per-group rollup).
PR scope cut: the kernel-side plumbing for honoring custom
class_filter / class_grouping is a follow-up to ADR-0041
(per_group lives on SemanticSummary once the Rust struct gains
it). Until that lands, evaluate() raises
:class:NotImplementedError when either is set; the surface
(fields, validation, with_options threading) is in place.
evaluate
evaluate(
gt: Dataset, dt: Predictions, *, tables: None = None
) -> SemanticSummary
evaluate(
gt: Dataset,
dt: Predictions,
*,
tables: Literal["all"] | tuple[TableName, ...],
) -> EvalResult
evaluate(
gt: Dataset,
dt: Predictions,
*,
tables: Literal["all"]
| tuple[TableName, ...]
| None = None,
) -> SemanticSummary | EvalResult
Run the semantic-segmentation evaluation.
gt and dt may be constructed via any combination of
the per-class constructors (:meth:Dataset.from_arrays,
:meth:Predictions.from_files, etc.); the FFI consumes
whatever uint32 ndarrays the constructors produced.
Returns a :class:Summary carrying the four headline scalars
(mIoU / FWIoU / pixel_accuracy / mean_accuracy), the per-class
breakdown, and the accumulated :class:ConfusionMatrix.
tables= is the opt-in keyword for result tables (ADR-0038).
Defaults to None, returning :class:Summary (existing
behavior). Pass "all" or a tuple of :data:TableName
values to opt into the wider :class:EvalResult return type.
evaluate_from_pngs
evaluate_from_pngs(
gt_paths: Mapping[int, str | Path],
dt_paths: Mapping[int, str | Path],
*,
n_classes: int,
ignore_label: int | None = None,
) -> SemanticSummary
Run the evaluation directly against 8-bit grayscale PNG label maps on disk (ADR-0037).
Skips the Pillow → ndarray → astype(np.uint32) pipeline:
the libpng decode and the per-image confusion-matrix fold run
in Rust under py.detach, so the GIL is released for the
whole batch and only one decoded label-map per side is in
flight at a time. Memory ceiling is bounded by image size,
not dataset size.
Format contract: 8-bit grayscale PNGs only (the natural width
for class-id label maps up to 256 classes; covers every dataset
in the per-paradigm presets — Cityscapes, ADE20K, Pascal VOC,
plus panoptic-derived semantic). Wider class-id ranges should
use :meth:evaluate with np.uint16 / np.uint32 ndarrays.
label_remap does not propagate to the fused path; callers
needing a remap should rewrite the DT PNGs upstream or fall
back to :meth:evaluate.
evaluate_to_partial
evaluate_to_partial(
gt: Dataset, dt: Predictions, *, rank_id: int
) -> bytes
Run the evaluation as a per-rank streaming submit and return the serialized partial bytes (ADR-0032, ADR-0035).
rank_id identifies this evaluator's rank in a multi-process
eval. The partial bytes can be gathered across ranks and merged
with :meth:from_partials to produce a global :class:Summary
bit-equal to a batch :meth:evaluate over the union (semantic
confusion-matrix sums are u64-additive, so strict-mode
bit-equality is unconditional per ADR-0032).
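The bit-equality claim rests on integer addition being associative and commutative. The sketch below checks that property on random stand-in matrices; it illustrates the arithmetic only and says nothing about the serialized partial format. The round-robin split into two "ranks" is a made-up sharding for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 4

# Stand-ins for six per-image confusion matrices.
per_image = [
    rng.integers(0, 1000, size=(n_classes, n_classes), dtype=np.uint64)
    for _ in range(6)
]

# Batch evaluation: sum every image in one pass.
batch = np.stack(per_image).sum(axis=0, dtype=np.uint64)

# Distributed evaluation: each rank sums its own shard, then shards are merged.
rank0 = np.stack(per_image[0::2]).sum(axis=0, dtype=np.uint64)
rank1 = np.stack(per_image[1::2]).sum(axis=0, dtype=np.uint64)
merged = rank0 + rank1

# Integer sums do not depend on grouping or order, unlike float accumulation.
identical = bool(np.array_equal(batch, merged))
print(identical)  # True
```

Any metric derived from the merged matrix is then computed from exactly the same integers as in the batch case.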
label_remap does not propagate to the streaming path —
callers needing remap apply it on the DT arrays themselves
before passing the dataset in.
from_partials
classmethod
from_partials(
n_classes: int,
partials: Sequence[bytes],
/,
*,
parity_mode: ParityMode = "corrected",
ignore_label: int | None = None,
) -> SemanticSummary
Merge partials (one per rank) into a global :class:Summary
(ADR-0032, ADR-0035).
n_classes, parity_mode, and ignore_label must match
what each rank used to produce its partial. Mismatches raise
the structured Partial* errors re-exported on this module.
background
background(
n_classes: int,
ignore_label: int | None = None,
*,
rank_id: int | None = None,
queue_capacity: int = 8,
worker_affinity: int | None = None,
worker_nice: int = 5,
shutdown_timeout_seconds: float = 5.0,
) -> BackgroundSemanticEvaluator
Build a :class:BackgroundEvaluator (ADR-0014 + ADR-0032)
that shares this evaluator's parity_mode.
The returned wrapper owns a single dedicated worker thread
running a :class:StreamingEvaluator of the same shape; calls
to :meth:BackgroundEvaluator.submit enqueue per-image
(gt, dt) pairs and return immediately. Use this when the
confusion-matrix fold measurably stalls the training loop
(most users won't notice — semantic is fast — but per-image
update on dense Cityscapes pixels is the case where this
starts to matter).
rank_id, when set, identifies this evaluator's rank in a
multi-process eval (ADR-0032). The five queueing /
scheduling knobs mirror :class:vernier.instance.Evaluator.background.
label_remap does not propagate to the background path —
callers needing remap on a streaming path apply it on the DT
arrays themselves before each submit call.
decode_label_map_png
Load a single-channel PNG label map as a (H, W) uint32
array.
Lazy-imports Pillow; raises a structured :class:ImportError if
Pillow is not installed. Multi-channel (RGB-encoded) panoptic
PNGs belong in :func:vernier.panoptic.decode_label_map_png —
they are rejected here with :class:ValueError.