Retention Time Feature¶

Trains a per-experiment linear regressor that maps observed retention time (RT) to indexed retention time (iRT) from a Koina iRT model. The absolute error between the sequence-based iRT prediction and the regressor-predicted iRT is used as a calibration feature.

Purpose¶

Retention time is orthogonal to fragmentation-based features; it depends on peptide hydrophobicity and chromatographic conditions rather than fragmentation patterns. A peptide that elutes at an unexpected time may be:

Incorrectly identified
A modification variant
Subject to unusual chromatographic behaviour

By comparing predicted vs observed retention times, the calibrator gains an independent signal for PSM quality assessment.

Implementation¶

Step 1: Koina iRT Prediction¶

We call a Koina iRT model (e.g., Prosit_2019_irt) with the predicted peptide sequences. The model returns predicted indexed retention time (iRT) values on a standardised scale.

Step 2: Per-experiment linear calibration¶

The RT-to-iRT mapping is inherently experiment-specific because different LC-MS experiments have different chromatographic conditions (column, gradient, temperature, etc.).

Training phase (prepare method):

Self-supervised training — High-confidence de novo predictions (top train_fraction by confidence score, descending) serve as pseudo-labels. The Koina iRT model is called on these peptide sequences to obtain iRT values, then a LinearRegression is fitted from observed RT to iRT. No database labels are needed.
Per-experiment fitting — Spectra are grouped by their experiment_name column. One regressor is fitted per experiment. If experiment_name is absent, a single global regressor is fitted with a warning.
Always re-fitted — The regressor is fitted at both training and inference time (in prepare()). It is not persisted inside the calibrator pickle. Given the same data and random seed, the same regressor is produced.

Prediction phase (compute method):

Predict iRT for all peptides using Koina
Use the per-experiment regressor to predict what the iRT "should be" given the observed RT
Compute the error: |Koina_iRT - regressor_predicted_iRT|

`experiment_name` column¶

For multi-experiment data, each spectrum must have an experiment_name column:

MGF files: Derived automatically from the file stem (e.g., data/run1.mgf produces experiment_name = "run1").
Parquet / IPC files: If the column already exists in the file, it is stringified and used as-is. If not, no experiment name is inferred.

Regressor checkpoint workflow¶

For within-experiment use cases, especially well-characterised species where the unlabelled data has an unreliable confidence distribution, you can save the regressors trained during the training step and load them at inference time:

# Train: saves calibrator AND per-experiment iRT regressors
winnow train ... irt_regressor_output_path=./irt_regressors.pkl

# Predict: loads regressors from training; skips re-fitting for known experiments
winnow predict ... calibrator.irt_regressor_path=./irt_regressors.pkl

When pre-fitted regressors are loaded, prepare() skips re-fitting for those experiments. Experiments in the inference data that were not in the training checkpoint are still fitted from scratch.

This is separate from the calibrator model itself and should not be confused with the general pretrained calibrator workflow, where regressors are always re-fitted automatically from the inference data.

Regressors can also be saved and loaded programmatically:

# After fitting (e.g., after calibrator.fit(dataset))
rt_feature = calibrator.feature_dict["iRT Feature"]
rt_feature.save_regressors("irt_regressors.pkl")

# Before prediction on new data
rt_feature.load_regressors("irt_regressors.pkl")

Columns¶

Column	Unit	Description
`irt_error`	iRT units (dimensionless)	Absolute difference between Koina-predicted iRT and regressor-predicted iRT. Large errors suggest the peptide elutes at an unexpected time.
`iRT`	iRT units	The raw Koina iRT prediction (stored for reference)
`predicted iRT`	iRT units	The regressor-predicted iRT based on observed retention time

When learn_from_missing=True, an additional indicator column is produced:

Column	Unit	Description
`is_missing_irt_error`	Boolean	`True` when the prediction cannot be passed to the Koina iRT model (e.g., exceeds length limits or contains unsupported residues)

Note: The iRT scale is dimensionless but standardised such that the Biognosys iRT kit peptides span approximately -25 to +120 iRT units.

Usage¶

from winnow.calibration.features import RetentionTimeFeature

feature = RetentionTimeFeature(
    train_fraction=0.1,
    min_train_points=10,
    unsupported_residues=["N[UNIMOD:7]", "Q[UNIMOD:7]"],
    max_peptide_length=30,
    irt_model_name="Prosit_2019_irt",
    learn_from_missing=True,
)
calibrator.add_feature(feature)

Parameters¶

Parameter	Type	Default	Description
`train_fraction`	`float`	`0.1`	Top fraction of spectra by confidence (descending) used to train the regressor. Only assumes higher confidence is better.
`min_train_points`	`int`	`10`	Minimum training points needed per experiment after applying `train_fraction`. Raises a `ValueError` if fewer are available.
`seed`	`int`	`42`	Random seed for reproducibility
`unsupported_residues`	`List[str]`	`[]`	Residue tokens not supported by the Koina model
`max_peptide_length`	`int`	`30`	Maximum peptide length supported by the model
`irt_model_name`	`str`	`"Prosit_2019_irt"`	Name of the Koina iRT model
`learn_from_missing`	`bool`	`True`	Whether to impute missing features or filter invalid rows

Requirements¶

The dataset must contain:

retention_time: Observed retention time values (instrument scale)
prediction: Predicted peptide sequence tokens

For multi-experiment data, each spectrum should also have an experiment_name column (see Implementation above).

Notes¶

The RT-to-iRT regressor is trained during the ProbabilityCalibrator training step via prepare(), so the same dataset used for calibrator training should be representative of the chromatographic conditions
iRT error is always positive (absolute value)
Peptides with unsupported residues or exceeding length limits may not be computable due to Koina model constraints. The defaults match the Prosit model family; if you use a different Koina model, adjust max_peptide_length and unsupported_residues accordingly. See the configuration guide for details.

Handling missing data¶

Winnow provides two strategies controlled by learn_from_missing:

Learn strategy (learn_from_missing=True, default):

Includes an is_missing_irt_error indicator column
Invalid rows get imputed error values (zero)
Calibrator learns patterns associated with missing data
Uses all available data, maximising recall

Filter strategy (learn_from_missing=False):

Invalid PSMs are automatically filtered from the dataset before Koina is called
A warning is emitted reporting how many PSMs were removed and which constraints applied
Filtered PSMs are gone entirely; no indicator column is added
Calibrator trains only on the remaining clean data

Use the filter strategy when you want strict data quality requirements. See Handling Missing Features for the general pattern across Koina-dependent features.