Calibration Features
The winnow.calibration.features module provides a modular feature extraction system for calibrating the confidence of peptide-spectrum matches (PSMs). Features are computed from PSM data and used by the ProbabilityCalibrator to transform raw confidence scores into calibrated probabilities.
Quick Reference
| Feature | Description | Requires |
|---|---|---|
| Mass Error Features | Precursor mass accuracy | precursor_mz, precursor_charge |
| Beam Features | Beam search diversity metrics | Beam predictions |
| Fragment Match Features | Theoretical vs observed spectrum agreement | precursor_charge, spectrum data |
| Chimeric Features | Runner-up peptide spectrum match | Beam predictions, spectrum data |
| Retention Time Feature | Chromatographic retention time error | retention_time |
| Sequence Features | Peptide sequence properties | prediction, precursor_charge |
| Token Score Features | Position-level confidence metrics | Beam predictions with token log-probs |
CalibrationFeatures Base Class
All features inherit from CalibrationFeatures and implement a common interface:
from winnow.calibration.features import CalibrationFeatures, FeatureDependency
from winnow.datasets.calibration_dataset import CalibrationDataset
from typing import List


class CustomFeature(CalibrationFeatures):
    @property
    def name(self) -> str:
        """Human-readable name for the feature."""
        return "My Custom Feature"

    @property
    def columns(self) -> List[str]:
        """Column names that will be added to dataset.metadata."""
        return ["custom_feature_1", "custom_feature_2"]

    @property
    def dependencies(self) -> List[FeatureDependency]:
        """Other features/computations that must run first."""
        return []

    def prepare(self, dataset: CalibrationDataset) -> None:
        """One-time setup before compute (e.g., model training)."""
        pass

    def compute(self, dataset: CalibrationDataset) -> None:
        """Compute and add feature columns to dataset.metadata."""
        # computed_values_1 and computed_values_2 are placeholders for the
        # per-PSM values your feature calculates.
        dataset.metadata["custom_feature_1"] = computed_values_1
        dataset.metadata["custom_feature_2"] = computed_values_2
Key Methods
| Method | Description |
|---|---|
| name | Property returning a human-readable identifier |
| columns | Property returning list of column names this feature produces |
| dependencies | Property returning list of FeatureDependency objects |
| prepare(dataset) | Called once during ProbabilityCalibrator training, useful for feature-specific model training |
| compute(dataset) | Computes features and adds columns to dataset.metadata |
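To make the contract concrete, here is a minimal self-contained sketch. It does not import winnow: SketchDataset, SketchFeature, and PeptideLengthFeature are hypothetical stand-ins, and metadata is a plain dict of column lists rather than the real dataset's metadata table. It only illustrates the pattern that compute should add exactly the columns that columns declares.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchDataset:
    """Hypothetical stand-in for CalibrationDataset: metadata is a dict of columns."""

    def __init__(self, metadata: Dict[str, list]) -> None:
        self.metadata = metadata


class SketchFeature(ABC):
    """Hypothetical stand-in mirroring the interface described above."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def columns(self) -> List[str]: ...

    def prepare(self, dataset: SketchDataset) -> None:
        """Optional one-time setup; a no-op by default."""

    @abstractmethod
    def compute(self, dataset: SketchDataset) -> None: ...


class PeptideLengthFeature(SketchFeature):
    """Toy feature: number of residues in each predicted peptide string."""

    @property
    def name(self) -> str:
        return "Peptide Length"

    @property
    def columns(self) -> List[str]:
        return ["peptide_length"]

    def compute(self, dataset: SketchDataset) -> None:
        dataset.metadata["peptide_length"] = [
            len(peptide) for peptide in dataset.metadata["prediction"]
        ]


dataset = SketchDataset({"prediction": ["PEPTIDE", "ACDK"]})
feature = PeptideLengthFeature()
before = set(dataset.metadata)
feature.prepare(dataset)   # setup runs first
feature.compute(dataset)   # then the columns are added
added = set(dataset.metadata) - before
```

Comparing added against feature.columns is a cheap sanity check to run on a new feature before handing it to the calibrator.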
Feature Dependencies
Features can declare dependencies on shared computations using FeatureDependency. This enables:
- Shared computation: Dependencies are computed once and reused
- Reference counting: Automatic cleanup when no longer needed
- Ordered execution: Dependencies always run before dependent features
from winnow.calibration.features import FeatureDependency
from winnow.datasets.calibration_dataset import CalibrationDataset


class TheoreticalSpectrumDependency(FeatureDependency):
    """Example dependency that computes theoretical spectra once."""

    def compute(self, dataset: CalibrationDataset) -> None:
        # Expensive computation done once
        dataset.metadata["theoretical_mz"] = compute_spectra(...)

    def cleanup(self, dataset: CalibrationDataset) -> None:
        # Remove intermediate columns when no longer needed
        del dataset.metadata["theoretical_mz"]
The ProbabilityCalibrator automatically handles dependency resolution:
- Collects all dependencies from added features
- Computes each unique dependency once (reference counted)
- Executes feature computations in correct order
- Cleans up dependencies when reference count reaches zero
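The reference-counting idea in the steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the calibrator's actual implementation: Dep, Feature, and run are hypothetical names, the dataset is a plain dict, and the real ProbabilityCalibrator may order or batch these steps differently.

```python
from collections import Counter
from typing import Dict, List


class Dep:
    """Hypothetical dependency: records calls instead of doing real work."""

    def __init__(self, key: str, log: List[str]) -> None:
        self.key = key
        self.log = log

    def compute(self, data: Dict[str, list]) -> None:
        self.log.append(f"compute {self.key}")
        data[self.key] = [0.0]  # placeholder intermediate column

    def cleanup(self, data: Dict[str, list]) -> None:
        self.log.append(f"cleanup {self.key}")
        del data[self.key]


class Feature:
    """Hypothetical feature that relies on its dependencies' columns."""

    def __init__(self, name: str, deps: List[Dep], log: List[str]) -> None:
        self.name = name
        self.deps = deps
        self.log = log

    def compute(self, data: Dict[str, list]) -> None:
        self.log.append(f"feature {self.name}")
        data[self.name] = [1.0]  # placeholder feature column


def run(features: List[Feature], data: Dict[str, list]) -> None:
    # 1. Count how many features reference each unique dependency.
    refs = Counter(dep.key for f in features for dep in f.deps)
    computed = set()
    for feature in features:
        # 2. Compute each dependency the first time it is needed.
        for dep in feature.deps:
            if dep.key not in computed:
                dep.compute(data)
                computed.add(dep.key)
        # 3. Run the feature once its dependencies are available.
        feature.compute(data)
        # 4. Release references; clean up when the count reaches zero.
        for dep in feature.deps:
            refs[dep.key] -= 1
            if refs[dep.key] == 0:
                dep.cleanup(data)


log: List[str] = []
spectra = Dep("theoretical_mz", log)
features = [
    Feature("fragment_match", [spectra], log),
    Feature("chimeric", [spectra], log),
]
data: Dict[str, list] = {}
run(features, data)
```

Here two features share one dependency, so it is computed once, kept alive while the second feature still needs it, and removed only after both have run.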
Handling Missing Features
Koina-dependent features (FragmentMatchFeatures, ChimericFeatures, RetentionTimeFeature) may not be computable for all peptides due to model-specific constraints:
- Peptides exceeding the model's maximum length
- Precursor charges exceeding the model's maximum
- Unsupported modifications or residue types
- Lack of runner-up sequences for chimeric features
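As a rough illustration, the length, charge, and residue constraints can be expressed as a single predicate. The function name is_supported is hypothetical, peptides are assumed to be plain one-letter strings without modification notation, and the limits of 30 and 6 are simply the values used in the examples on this page. The runner-up check for chimeric features is not modelled.

```python
from typing import Iterable


def is_supported(
    peptide: str,
    charge: int,
    max_peptide_length: int = 30,
    max_precursor_charge: int = 6,
    unsupported_residues: Iterable[str] = (),
) -> bool:
    """Return True if a PSM passes the length, charge, and residue checks."""
    if len(peptide) > max_peptide_length:
        return False
    if charge > max_precursor_charge:
        return False
    blocked = set(unsupported_residues)
    return not any(residue in blocked for residue in peptide)


checks = {
    "ok": is_supported("PEPTIDEK", 2),
    "too_long": is_supported("A" * 31, 2),
    "high_charge": is_supported("PEPTIDEK", 7),
    "bad_residue": is_supported("PEPTUDEK", 2, unsupported_residues=["U", "O"]),
}
```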
Winnow provides two strategies controlled by the learn_from_missing parameter:
Filter Strategy (learn_from_missing=False, default)
Invalid PSMs are removed from the dataset before feature computation.
from winnow.calibration.features import FragmentMatchFeatures

feature = FragmentMatchFeatures(
    mz_tolerance=20,
    mz_tolerance_unit="ppm",
    learn_from_missing=False,  # Default
    max_peptide_length=30,
    max_precursor_charge=6,
)
Behaviour:
- Invalid rows are automatically filtered before Koina is called
- A UserWarning is emitted reporting how many PSMs were removed
- Filtered PSMs are gone entirely; no indicator column is added
- Calibrator trains only on remaining clean data
Use when: You want strict data quality and don't mind losing some PSMs.
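The filter behaviour can be approximated with the standard library alone. This is a sketch, not winnow's code: filter_invalid and within_limits are hypothetical helpers, rows are plain dicts standing in for dataset rows, and the warning text is invented for illustration.

```python
import warnings
from typing import Callable, Dict, List

Row = Dict[str, object]


def filter_invalid(rows: List[Row], is_valid: Callable[[Row], bool]) -> List[Row]:
    """Drop invalid rows and warn about how many were removed."""
    kept = [row for row in rows if is_valid(row)]
    removed = len(rows) - len(kept)
    if removed:
        warnings.warn(
            f"Filtered {removed} of {len(rows)} PSMs that the model cannot handle.",
            UserWarning,
        )
    return kept


def within_limits(row: Row) -> bool:
    # Limits of 30 residues and charge 6 follow the example configuration.
    return len(row["prediction"]) <= 30 and row["precursor_charge"] <= 6


rows = [
    {"prediction": "PEPTIDEK", "precursor_charge": 2},
    {"prediction": "A" * 31, "precursor_charge": 2},    # exceeds length limit
    {"prediction": "ACDEFGHK", "precursor_charge": 7},  # exceeds charge limit
]

# Capture warnings so the example can inspect what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = filter_invalid(rows, within_limits)
```

Note that the returned rows carry no trace of what was dropped, which is why downstream code cannot distinguish a filtered dataset from one that was clean to begin with.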
Learn Strategy (learn_from_missing=True)
Invalid PSMs are retained with imputed feature values and an indicator column.
feature = FragmentMatchFeatures(
    mz_tolerance=20,
    mz_tolerance_unit="ppm",
    learn_from_missing=True,
    max_peptide_length=30,
    max_precursor_charge=6,
)
Behaviour:
- All rows are retained in the dataset
- Invalid rows get zero/default feature values
- An is_missing_* indicator column is added (e.g., is_missing_fragment_match_features)
- Calibrator can learn patterns associated with missing data
Use when: You want to maximise recall and let the model learn from missingness patterns.
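A minimal sketch of the learn behaviour, again using plain dicts instead of the real dataset: impute_missing is a hypothetical helper and ion_match_rate is a made-up feature column used only for illustration. Zero is assumed as the imputed value.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def impute_missing(
    rows: List[Row],
    is_valid: Callable[[Row], bool],
    feature_columns: List[str],
    indicator: str,
) -> List[Row]:
    """Keep every row; zero-fill features on invalid rows and flag them."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        missing = not is_valid(row)
        if missing:
            for column in feature_columns:
                new_row[column] = 0.0
        new_row[indicator] = missing
        out.append(new_row)
    return out


rows = [
    {"precursor_charge": 2, "ion_match_rate": 0.8},
    {"precursor_charge": 7, "ion_match_rate": None},  # model could not score this
]

result = impute_missing(
    rows,
    is_valid=lambda row: row["precursor_charge"] <= 6,
    feature_columns=["ion_match_rate"],
    indicator="is_missing_fragment_match_features",
)
```

The indicator column is what lets a downstream model tell an imputed zero apart from a genuinely poor match.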
Configuration
The defaults match Prosit model constraints. Adjust for other Koina models:
# Example for a model with different constraints
feature = FragmentMatchFeatures(
    mz_tolerance=20,
    mz_tolerance_unit="ppm",
    max_peptide_length=50,  # Model supports longer peptides
    max_precursor_charge=8,  # Model supports higher charges
    unsupported_residues=["U", "O"],  # Selenocysteine and pyrrolysine
)
See the configuration guide for details on model-specific constraints.