Fragment Match Features¶
Extracts features by comparing the observed fragmentation spectrum against a theoretical spectrum predicted by a Koina intensity model.
Purpose¶
The quality of the match between observed and theoretical fragmentation patterns is a strong indicator of identification correctness. True identifications typically show:
- High fraction of predicted ions observed
- Intensity patterns matching theoretical predictions
- Consecutive ion series without gaps
- Low unexplained intensity
False identifications often show poor spectral agreement even when the de novo sequencer reports high confidence.
Implementation¶
Step 1: Theoretical Spectrum Generation¶
We call a Koina intensity prediction model (e.g., Prosit_2020_intensity_HCD) with:
- Predicted peptide sequence
- Precursor charge
- Collision energy (only required by some Koina models)
- Fragmentation type (only required by some Koina models)
The model returns:
- Theoretical m/z values for all possible b- and y-ions
- Predicted relative intensities for each ion
- Ion annotations (e.g., "b1", "y3", "b2+2" for doubly-charged)
Step 2: Peak Matching¶
For each theoretical peak, we search for the nearest observed peak using binary search. A match is recorded if the m/z difference is within the configured tolerance (default: 20 ppm). This produces a set of matched peaks containing:
- Theoretical m/z and intensity
- Observed intensity
- Ion annotation
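The matching step can be sketched in a few lines of Python. This is a minimal illustration rather than the library's implementation: `match_peaks` and its arguments are hypothetical names, and the sketch assumes the observed m/z values are already sorted in ascending order.

```python
from bisect import bisect_left


def match_peaks(theoretical_mz, observed_mz, observed_intensity,
                tolerance=20.0, unit="ppm"):
    """Return {theoretical index: observed intensity} for matched peaks.

    Hypothetical helper; assumes observed_mz is sorted ascending.
    """
    matches = {}
    if not observed_mz:
        return matches
    for i, mz in enumerate(theoretical_mz):
        # Binary search for the insertion point, then compare both neighbours
        # to find the nearest observed peak.
        pos = bisect_left(observed_mz, mz)
        candidates = [j for j in (pos - 1, pos) if 0 <= j < len(observed_mz)]
        nearest = min(candidates, key=lambda j: abs(observed_mz[j] - mz))
        delta = abs(observed_mz[nearest] - mz)
        # Convert the tolerance into an absolute m/z window.
        limit = mz * tolerance * 1e-6 if unit.lower() == "ppm" else tolerance
        if delta <= limit:
            matches[i] = observed_intensity[nearest]
    return matches
```

With a ppm tolerance the allowed window grows with m/z, so the same absolute offset can match a heavy fragment and miss a light one.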
Columns¶
Basic Match Metrics¶
| Column | Unit | Description |
|---|---|---|
| `ion_matches` | Fraction (0-1) | Number of matched theoretical peaks / total theoretical peaks. A high value indicates that much of the predicted peptide's ion ladder is present in the observed spectrum. Low values suggest missing fragment coverage or an incorrect identification. |
| `ion_match_intensity` | Fraction (0-1) | Sum of observed intensities for matched peaks / total observed intensity, including the isotopic envelope (four additional peaks). A high value indicates a prediction that explains most of the spectral evidence. A low value could indicate contamination, co-eluting peptides, or an incorrect identification. |
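As a rough illustration of the two ratios in this table, the sketch below computes them from peaks that have already been matched. `basic_match_metrics` is a hypothetical helper, and for brevity it ignores the isotopic envelope that `ion_match_intensity` accounts for.

```python
def basic_match_metrics(n_theoretical, matched_intensities, observed_intensities):
    """Return (ion_matches, ion_match_intensity) as fractions in [0, 1].

    Simplified sketch: isotopic-envelope peaks are not included.
    """
    total_observed = sum(observed_intensities)
    # Fraction of theoretical peaks that found an observed partner.
    ion_matches = len(matched_intensities) / n_theoretical if n_theoretical else 0.0
    # Fraction of the observed signal explained by matched peaks.
    ion_match_intensity = (
        sum(matched_intensities) / total_observed if total_observed else 0.0
    )
    return ion_matches, ion_match_intensity
```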
Ion Coverage Features¶
| Column | Unit | Description |
|---|---|---|
| `longest_b_series` | Count (integer) | Longest consecutive run of matched b-ions (e.g., b1, b2, b3 = 3). |
| `longest_y_series` | Count (integer) | Same as above, for y-ions. |
| `complementary_ion_count` | Count (integer) | Number of peptide bond positions where both the b-ion and complementary y-ion are matched. For a peptide of length n, bond position i produces b_i and y_(n-i). Finding both provides stronger evidence of a correct identification. |
| `max_ion_gap` | Daltons (Da) | Largest m/z difference between two consecutive matched theoretical peaks when sorted by m/z. Large gaps may indicate missing fragmentation coverage. |
| `b_y_intensity_ratio` | Ratio | Ratio of total matched b-ion intensity to total matched y-ion intensity (including isotopic envelopes). |
| `spectral_angle` | Score (0-1) | Normalised spectral angle similarity between theoretical and matched observed intensity vectors. A value of 1 indicates perfect correlation, 0 indicates orthogonal vectors. |
| `xcorr` | Score | SEQUEST fast cross-correlation score. Measures overall agreement between the observed and theoretical spectra with local background correction. Higher values indicate better matches. |
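The series and complementarity counts can be sketched from the ion annotations of the matched peaks. All function names here are hypothetical. The sketch assumes annotations follow the "b1", "y3", "b2+2" pattern described earlier, and that different charge states of the same ion count as a single position; whether the library collapses charge states this way is an assumption.

```python
import re

# Ion type, ion index, and an optional "+charge" suffix, e.g. "b2+2".
ANNOTATION = re.compile(r"^([by])(\d+)(?:\+(\d+))?$")


def matched_positions(annotations, ion_type):
    """Collect matched ion indices for one series, ignoring charge state."""
    positions = set()
    for annotation in annotations:
        m = ANNOTATION.match(annotation)
        if m and m.group(1) == ion_type:
            positions.add(int(m.group(2)))
    return positions


def longest_series(positions):
    """Length of the longest run of consecutive integers in positions."""
    best = 0
    for p in positions:
        if p - 1 not in positions:  # p is the start of a run
            length = 1
            while p + length in positions:
                length += 1
            best = max(best, length)
    return best


def complementary_ion_count(annotations, peptide_length):
    """Count bond positions i where both b_i and y_(n-i) are matched."""
    b = matched_positions(annotations, "b")
    y = matched_positions(annotations, "y")
    return sum(
        1 for i in range(1, peptide_length)
        if i in b and peptide_length - i in y
    )


def max_ion_gap(matched_theoretical_mz):
    """Largest m/z difference between neighbouring matched peaks (0.0 if < 2)."""
    ordered = sorted(matched_theoretical_mz)
    return max((hi - lo for lo, hi in zip(ordered, ordered[1:])), default=0.0)
```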
Cross-correlation Score¶
The `xcorr` column implements the fast cross-correlation score function from SEQUEST (Eng et al., 2008). Unlike the other features, which operate on individually matched peaks, `xcorr` evaluates the overall pattern of the observed spectrum against the theoretical spectrum using a background-corrected dot product.
The observed spectrum is preprocessed by:
- Binning into near-unit-dalton bins (~1.0005 Da) with square-root intensity compression
- Window normalization — the maximum intensity within each of 10 equal m/z windows is normalized to a fixed value, making the score robust to different intensity scales
- Background subtraction — the mean intensity from ±75 neighbouring bins is subtracted from each bin, so that only signal specifically at the correct m/z positions contributes positively
The score is then computed as the dot product of the preprocessed observed spectrum with a binary theoretical spectrum (1.0 at each predicted fragment ion bin position). This background correction naturally penalizes unmatched theoretical ions and discounts noise.
Reference: Eng JK, Fischer B, Grossmann J, MacCoss MJ. A fast SEQUEST cross correlation algorithm. J Proteome Res. 2008 Oct;7(10):4598-602. doi: 10.1021/pr800420s.
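The preprocessing and scoring steps described above can be sketched in plain Python. This is an illustrative approximation, not the library's code. Several details are assumptions that the text does not specify: the fixed normalization value (`window_max=50.0`), keeping the largest peak per bin, spreading the windows over a fixed number of bins (`n_bins`), excluding the centre bin from the background mean, treating bins beyond the spectrum edges as zero, and omitting any final scaling constant.

```python
import math

BIN_WIDTH = 1.0005  # near-unit-dalton bin width from the text


def to_bin(mz):
    return int(mz / BIN_WIDTH)


def preprocess(observed_mz, observed_intensity, n_bins,
               n_windows=10, window_max=50.0, offset=75):
    # 1. Bin with square-root compression, keeping the largest value per bin.
    binned = [0.0] * n_bins
    for mz, intensity in zip(observed_mz, observed_intensity):
        b = to_bin(mz)
        if 0 <= b < n_bins:
            binned[b] = max(binned[b], math.sqrt(intensity))
    # 2. Scale each equal-width window so its maximum equals window_max.
    size = math.ceil(n_bins / n_windows)
    for start in range(0, n_bins, size):
        window = binned[start:start + size]
        peak = max(window)
        if peak > 0:
            binned[start:start + size] = [v * window_max / peak for v in window]
    # 3. Subtract the mean of the +/- offset neighbouring bins (centre excluded;
    #    bins outside the spectrum are treated as zero).
    corrected = []
    for i in range(n_bins):
        lo, hi = max(0, i - offset), min(n_bins, i + offset + 1)
        neighbours = sum(binned[lo:hi]) - binned[i]
        corrected.append(binned[i] - neighbours / (2 * offset))
    return corrected


def xcorr(observed_mz, observed_intensity, theoretical_mz, n_bins=2000):
    """Dot product of the corrected spectrum with a binary theoretical spectrum."""
    corrected = preprocess(observed_mz, observed_intensity, n_bins)
    bins = {to_bin(mz) for mz in theoretical_mz if 0 <= to_bin(mz) < n_bins}
    return sum(corrected[b] for b in bins)
```

Because the background is subtracted everywhere, a theoretical ion that lands on an empty bin near real peaks contributes a small negative value, which is how unmatched predictions lower the score.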
Usage¶
    from winnow.calibration.features import FragmentMatchFeatures

    feature = FragmentMatchFeatures(
        mz_tolerance=20,
        mz_tolerance_unit="ppm",
        unsupported_residues=["N[UNIMOD:7]", "Q[UNIMOD:7]"],
        intensity_model_name="Prosit_2020_intensity_HCD",
        max_precursor_charge=6,
        max_peptide_length=30,
        model_input_constants={"collision_energies": 25},
        learn_from_missing=True,
    )

    # Register the feature with an existing calibrator instance.
    calibrator.add_feature(feature)
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `mz_tolerance` | `float` | (required) | Tolerance magnitude for matching fragment ions. |
| `mz_tolerance_unit` | `str` | (required) | Unit for `mz_tolerance`: "ppm" or "da" (case-insensitive). |
| `unsupported_residues` | `List[str]` | `[]` | Residue tokens not supported by the Koina model. |
| `intensity_model_name` | `str` | `"Prosit_2020_intensity_HCD"` | Name of the Koina intensity model. |
| `max_precursor_charge` | `int` | `6` | Maximum charge state supported by the model. |
| `max_peptide_length` | `int` | `30` | Maximum peptide length supported by the model. |
| `model_input_constants` | `Dict` | `{}` | Constant values for model inputs (e.g., collision energy). |
| `model_input_columns` | `Dict` | `{}` | Column names for per-row model inputs. |
| `learn_from_missing` | `bool` | `True` | Whether to impute missing features or filter invalid rows. |
Requirements¶
The dataset must contain:
- `precursor_charge`: Precursor charge state
- `mz_array`: Observed m/z values (list per row)
- `intensity_array`: Observed intensities (list per row)
- `prediction`: Predicted peptide sequence tokens
For some Koina-hosted intensity prediction models, the dataset may also require:
- `collision_energies`: Kinetic energy used to fragment the peptide
- `fragmentation_types`: Method used to break the ions
Notes¶
- Different Koina models have different constraints. See configuration guide for details.
- When `learn_from_missing=True`, invalid rows get zero feature values and an `is_missing_fragment_match_features` indicator column.
- `b_y_intensity_ratio` is computed as `b_total / (y_total + epsilon)`, where epsilon is a small constant providing numerical stability when no y-ions are matched.
- Spectral angle calculations do not take into account isotopic envelopes. Only the first matching peak's intensity is compared against the corresponding theoretical intensity value.
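
The last two notes can be illustrated with a short sketch. Both functions are hypothetical. The spectral angle uses the commonly cited normalised form, 1 - 2·arccos(cosine similarity)/π, which matches the 0-1 range described above but is assumed rather than confirmed for this library. The `epsilon` default is likewise an assumed value.

```python
import math


def spectral_angle(theoretical, observed):
    """Normalised spectral angle: 1 - 2 * arccos(cosine similarity) / pi.

    Inputs are paired intensity vectors for the matched peaks
    (one observed intensity per theoretical ion, no isotopic envelope).
    """
    dot = sum(t * o for t, o in zip(theoretical, observed))
    norm_t = math.sqrt(sum(t * t for t in theoretical))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_t == 0 or norm_o == 0:
        return 0.0
    # Clamp to guard against floating-point drift outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_t * norm_o)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def b_y_intensity_ratio(b_total, y_total, epsilon=1e-8):
    """b_total / (y_total + epsilon); epsilon avoids division by zero."""
    return b_total / (y_total + epsilon)
```

For non-negative intensity vectors the cosine lies in [0, 1], so the score ranges from 0 (orthogonal) to 1 (identical pattern).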