Retention Time Feature¶
Trains a per-experiment linear regressor that maps observed retention time (RT) to indexed retention time (iRT) from a Koina iRT model. The absolute error between the sequence-based iRT prediction and the regressor-predicted iRT is used as a calibration feature.
Purpose¶
Retention time is orthogonal to fragmentation-based features; it depends on peptide hydrophobicity and chromatographic conditions rather than fragmentation patterns. A peptide that elutes at an unexpected time may be:
- Incorrectly identified
- A modification variant
- Subject to unusual chromatographic behaviour
By comparing predicted vs observed retention times, the calibrator gains an independent signal for PSM quality assessment.
Implementation¶
Step 1: Koina iRT Prediction¶
We call a Koina iRT model (e.g., Prosit_2019_irt) with the predicted peptide sequences. The model returns predicted indexed retention time (iRT) values on a standardised scale.
Step 2: Per-experiment linear calibration¶
The RT-to-iRT mapping is inherently experiment-specific because different LC-MS experiments have different chromatographic conditions (column, gradient, temperature, etc.).
Training phase (prepare method):
- Self-supervised training: High-confidence de novo predictions (top train_fraction by confidence score, descending) serve as pseudo-labels. The Koina iRT model is called on these peptide sequences to obtain iRT values, then a LinearRegression is fitted from observed RT to iRT. No database labels are needed.
- Per-experiment fitting: Spectra are grouped by their experiment_name column. One regressor is fitted per experiment. If experiment_name is absent, a single global regressor is fitted with a warning.
- Always re-fitted: The regressor is fitted at both training and inference time (in prepare()). It is not persisted inside the calibrator pickle. Given the same data and random seed, the same regressor is produced.
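The training steps above can be sketched in plain Python. This is a minimal illustration, not Winnow's implementation: the row keys (experiment_name, confidence, retention_time, koina_irt), the function names, and the rule for rounding train_fraction to a row count are all assumptions, and a hand-written least-squares fit stands in for the LinearRegression the feature uses.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (needs non-constant xs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_rt_to_irt(rows, train_fraction=0.1, min_train_points=10):
    """Fit one (slope, intercept) pair per experiment, mapping observed RT to iRT.

    `rows` is a list of dicts with hypothetical keys: "experiment_name",
    "confidence", "retention_time" and "koina_irt".
    """
    by_experiment = defaultdict(list)
    for row in rows:
        by_experiment[row["experiment_name"]].append(row)

    regressors = {}
    for name, group in by_experiment.items():
        # Keep the top fraction of spectra by confidence, highest first.
        ranked = sorted(group, key=lambda r: r["confidence"], reverse=True)
        n_train = int(len(ranked) * train_fraction)  # rounding rule is an assumption
        if n_train < min_train_points:
            raise ValueError(
                f"{name}: {n_train} training points, need at least {min_train_points}"
            )
        top = ranked[:n_train]
        regressors[name] = fit_line(
            [r["retention_time"] for r in top],
            [r["koina_irt"] for r in top],
        )
    return regressors
```

Calling fit_rt_to_irt(rows) returns a dictionary keyed by experiment name, so each experiment keeps its own slope and intercept.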
Prediction phase (compute method):
- Predict iRT for all peptides using Koina
- Use the per-experiment regressor to predict what the iRT "should be" given the observed RT
- Compute the error:
|Koina_iRT - regressor_predicted_iRT|
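As a worked example of the error computation, the sketch below applies a linear calibration to an observed RT and compares the result with a Koina iRT value. The function name and the numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def irt_error(koina_irt, retention_time, slope, intercept):
    """Absolute gap between the Koina iRT and the iRT implied by the observed RT."""
    regressor_predicted_irt = slope * retention_time + intercept
    return abs(koina_irt - regressor_predicted_irt)


# Hypothetical calibration for one experiment: iRT = 2.0 * RT - 10.0
on_time = irt_error(koina_irt=50.0, retention_time=30.0, slope=2.0, intercept=-10.0)
late = irt_error(koina_irt=50.0, retention_time=45.0, slope=2.0, intercept=-10.0)
print(on_time, late)  # 0.0 30.0
```

A peptide that elutes where its sequence predicts yields an error near zero; one that elutes 15 RT units late under this calibration yields an error of 30 iRT units.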
experiment_name column¶
For multi-experiment data, each spectrum must have an experiment_name column:
- MGF files: Derived automatically from the file stem (e.g., data/run1.mgf produces experiment_name = "run1").
- Parquet / IPC files: If the column already exists in the file, it is stringified and used as-is. If not, no experiment name is inferred.
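The file-stem rule for MGF files can be illustrated with the standard library. This shows what "file stem" means in pathlib terms; the helper name is hypothetical, and whether Winnow uses pathlib internally is an assumption.

```python
from pathlib import PurePosixPath


def experiment_name_from_path(path):
    """Return the file name without its directory or final suffix."""
    return PurePosixPath(path).stem


print(experiment_name_from_path("data/run1.mgf"))  # run1
```

Note that pathlib strips only the last suffix, so a name such as run1.raw.mgf would give the stem run1.raw.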
Regressor checkpoint workflow¶
For within-experiment use cases, especially for well-characterised species where the unlabelled data has an unreliable confidence distribution, you can save the regressors fitted during the training step and load them at inference time:
# Train: saves calibrator AND per-experiment iRT regressors
winnow train ... irt_regressor_output_path=./irt_regressors.pkl
# Predict: loads regressors from training; skips re-fitting for known experiments
winnow predict ... calibrator.irt_regressor_path=./irt_regressors.pkl
When pre-fitted regressors are loaded, prepare() skips re-fitting for those experiments. Experiments in the inference data that were not in the training checkpoint are still fitted from scratch.
This is separate from the calibrator model itself and should not be confused with the general pretrained calibrator workflow, where regressors are always re-fitted automatically from the inference data.
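The reuse-or-refit behaviour described above can be sketched as a dictionary lookup with a fallback. Everything here is illustrative: the checkpoint is assumed to be a pickled mapping from experiment name to a (slope, intercept) pair, which may not match the format Winnow writes, and resolve_regressors and fit_fn are hypothetical names.

```python
import pickle
import tempfile
from pathlib import Path


def resolve_regressors(experiments, loaded, fit_fn):
    """Reuse loaded regressors; fit only experiments missing from the checkpoint."""
    resolved = {}
    for name in experiments:
        if name in loaded:
            resolved[name] = loaded[name]   # known experiment: skip re-fitting
        else:
            resolved[name] = fit_fn(name)   # unseen experiment: fit from scratch
    return resolved


# Round-trip a hypothetical checkpoint through a pickle file.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "irt_regressors.pkl"
    checkpoint.write_bytes(pickle.dumps({"run1": (2.0, -10.0)}))
    loaded = pickle.loads(checkpoint.read_bytes())

fitted = []  # records which experiments had to be fitted


def fit_fn(name):
    fitted.append(name)
    return (1.0, 0.0)  # placeholder standing in for a real fit


regressors = resolve_regressors(["run1", "run2"], loaded, fit_fn)
print(fitted)  # ['run2']
```

Only run2 is fitted because run1 was already present in the loaded checkpoint.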
Regressors can also be saved and loaded programmatically:
# After fitting (e.g., after calibrator.fit(dataset))
rt_feature = calibrator.feature_dict["iRT Feature"]
rt_feature.save_regressors("irt_regressors.pkl")
# Before prediction on new data
rt_feature.load_regressors("irt_regressors.pkl")
Columns¶
| Column | Unit | Description |
|---|---|---|
| irt_error | iRT units (dimensionless) | Absolute difference between Koina-predicted iRT and regressor-predicted iRT. Large errors suggest the peptide elutes at an unexpected time. |
| iRT | iRT units | The raw Koina iRT prediction (stored for reference) |
| predicted iRT | iRT units | The regressor-predicted iRT based on observed retention time |
When learn_from_missing=True, an additional indicator column is produced:
| Column | Unit | Description |
|---|---|---|
| is_missing_irt_error | Boolean | True when the prediction cannot be passed to the Koina iRT model (e.g., exceeds length limits or contains unsupported residues) |
Note: The iRT scale is dimensionless but standardised such that the Biognosys iRT kit peptides span approximately -25 to +100 iRT units.
Usage¶
from winnow.calibration.features import RetentionTimeFeature
feature = RetentionTimeFeature(
    train_fraction=0.1,
    min_train_points=10,
    unsupported_residues=["N[UNIMOD:7]", "Q[UNIMOD:7]"],
    max_peptide_length=30,
    irt_model_name="Prosit_2019_irt",
    learn_from_missing=True,
)
calibrator.add_feature(feature)
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| train_fraction | float | 0.1 | Top fraction of spectra by confidence (descending) used to train the regressor. Only assumes higher confidence is better. |
| min_train_points | int | 10 | Minimum training points needed per experiment after applying train_fraction. Raises a ValueError if fewer are available. |
| seed | int | 42 | Random seed for reproducibility |
| unsupported_residues | List[str] | [] | Residue tokens not supported by the Koina model |
| max_peptide_length | int | 30 | Maximum peptide length supported by the model |
| irt_model_name | str | "Prosit_2019_irt" | Name of the Koina iRT model |
| learn_from_missing | bool | True | Whether to impute missing features or filter invalid rows |
Requirements¶
The dataset must contain:
- retention_time: Observed retention time values (instrument scale)
- prediction: Predicted peptide sequence tokens
For multi-experiment data, each spectrum should also have an experiment_name column (see Implementation above).
Notes¶
- The RT-to-iRT regressor is trained during the ProbabilityCalibrator training step via prepare(), so the dataset used for calibrator training should be representative of the chromatographic conditions
- iRT error is always non-negative (absolute value)
- Peptides with unsupported residues or exceeding length limits may not be computable due to Koina model constraints. The defaults match the Prosit model family; if you use a different Koina model, adjust max_peptide_length and unsupported_residues accordingly. See the configuration guide for details.
Handling missing data¶
Winnow provides two strategies controlled by learn_from_missing:
Learn strategy (learn_from_missing=True, default):
- Includes an is_missing_irt_error indicator column
- Invalid rows get imputed error values (zero)
- Calibrator learns patterns associated with missing data
- Uses all available data, maximising recall
Filter strategy (learn_from_missing=False):
- Invalid PSMs are automatically filtered from the dataset before Koina is called
- A warning is emitted reporting how many PSMs were removed and which constraints applied
- Filtered PSMs are gone entirely; no indicator column is added
- Calibrator trains only on the remaining clean data
Use the filter strategy when you want strict data quality requirements. See Handling Missing Features for the general pattern across Koina-dependent features.
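The two strategies can be contrasted with a small sketch. The validity check (length above max_peptide_length, or any token in unsupported_residues) follows the description on this page, but the function names, the PSM dictionary layout, and the exact boundary handling are assumptions rather than Winnow's code.

```python
def is_computable(tokens, max_peptide_length=30, unsupported_residues=()):
    """Return True if a token list passes the assumed length and residue checks."""
    if len(tokens) > max_peptide_length:
        return False
    return not any(token in unsupported_residues for token in tokens)


def handle_missing(psms, learn_from_missing, **limits):
    """Apply the learn or filter strategy to a list of PSM dicts."""
    kept = []
    for psm in psms:
        valid = is_computable(psm["prediction"], **limits)
        if learn_from_missing:
            # Learn strategy: keep every row and flag the invalid ones.
            row = dict(psm, is_missing_irt_error=not valid)
            if not valid:
                row["irt_error"] = 0.0  # imputed value for invalid rows
            kept.append(row)
        elif valid:
            # Filter strategy: drop invalid rows, add no indicator column.
            kept.append(dict(psm))
    return kept


psms = [
    {"prediction": ["P", "E", "P", "K"]},
    {"prediction": ["N[UNIMOD:7]", "K"]},  # unsupported residue
    {"prediction": ["A"] * 31},            # longer than max_peptide_length
]
limits = {
    "max_peptide_length": 30,
    "unsupported_residues": ("N[UNIMOD:7]", "Q[UNIMOD:7]"),
}
learned = handle_missing(psms, learn_from_missing=True, **limits)
filtered = handle_missing(psms, learn_from_missing=False, **limits)
print(len(learned), len(filtered))  # 3 1
```

With the learn strategy all three PSMs survive and two carry the missing indicator; with the filter strategy only the first PSM remains.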