Reference: Prediction Output

When you run predictions with InstaNovo and specify an output path, a CSV file is generated. This document describes the columns in that file.

Standard Columns

Column	Description	Data Type	Notes
experiment_name	Experiment name derived from input filename	String	Based on the input file name (mgf, mzml, or mzxml)
scan_number	Scan number of the MS/MS spectrum	Integer	Unique identifier from the input file
spectrum_id	Unique spectrum identifier	String	Combination of experiment name and scan number (e.g., `yeast:17738`)
precursor_mz	Precursor m/z (mass-to-charge ratio)	Float	The observed m/z of the precursor ion
precursor_charge	Precursor charge state	Integer	Charge state of the precursor ion
prediction_id	Unique prediction identifier	String	Internal identifier for the prediction
group	Data group identifier	String	Used when running predictions on grouped data
targets	Target peptide sequence	String	Ground truth peptide sequence (only present if running in evaluation mode)
predictions	Best predicted peptide sequence	String	The final predicted peptide sequence (from InstaNovo+ when using refinement)
predictions_tokenised	Best predicted peptide sequence (tokenised)	String	The predicted sequence as comma-separated tokens
log_probs	Log probability of the best predicted sequence	Float	Natural logarithm of the sequence confidence, so values are at most 0. Higher (closer to 0) is better.
token_log_probs	Log probability of each token in the best prediction	List[Float]	Natural logarithm of the confidence for each amino acid in the sequence
delta_mass_ppm	Mass difference between precursor and predicted peptide in ppm	Float	The mass deviation in parts per million. Smaller absolute values are better.
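
Because log_probs is a natural logarithm, exponentiating it recovers a confidence value between 0 and 1, which is often easier to read. The sketch below does this with Python's standard csv module. The helper name `add_confidence`, the `confidence` key, and the sample rows (including the peptide sequences) are illustrative assumptions, not part of InstaNovo.

```python
import csv
import io
import math


def add_confidence(rows):
    """Return copies of the rows with a 'confidence' value, exp(log_probs)."""
    out = []
    for row in rows:
        new_row = dict(row)
        # log_probs is a natural log, so exp() maps it back into (0, 1].
        new_row["confidence"] = math.exp(float(row["log_probs"]))
        out.append(new_row)
    return out


# Hypothetical two-row sample standing in for a real output file.
sample_csv = io.StringIO(
    "spectrum_id,predictions,log_probs\n"
    "yeast:17738,PEPTIDEK,-0.05\n"
    "yeast:17739,ELVISLIVESK,-2.30\n"
)
scored = add_confidence(csv.DictReader(sample_csv))
for row in scored:
    print(row["spectrum_id"], row["predictions"], round(row["confidence"], 3))
```

To use this on a real file, pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample.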

These transformer-model columns, prefixed with instanovo_, are present when using InstaNovo+ (combined transformer + diffusion model).

Column	Description	Data Type	Notes
instanovo_predictions	Predicted peptide sequence from InstaNovo	String	The initial peptide sequence from the transformer
instanovo_log_probabilities	Log probability from InstaNovo	Float	Natural logarithm of the sequence confidence
instanovo_token_log_probabilities	Token log probabilities from InstaNovo	List[Float]	Natural logarithm of the confidence for each token
instanovo_predictions_beam_0-4	Predicted sequences from each beam	String	Beam search results when num_beams > 1
instanovo_log_probabilities_beam_0-4	Log probabilities from each beam	Float	Confidence scores for each beam
instanovo_token_log_probabilities_beam_0-4	Token log probabilities from each beam	List[Float]	Per-token confidence for each beam
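
Since the number of beam columns depends on the run, it can help to discover them from the CSV header rather than hard-coding names. The sketch below assumes each beam column is named by appending `_beam_<n>` to the base column name (for example `instanovo_predictions_beam_0`); the function `beam_columns` and the sample header are hypothetical.

```python
import re


def beam_columns(header, prefix="instanovo_predictions"):
    """Return the columns named '<prefix>_beam_<n>', ordered by beam index."""
    pattern = re.compile(re.escape(prefix) + r"_beam_(\d+)$")
    found = []
    for name in header:
        match = pattern.match(name)
        if match:
            # Keep the numeric index so beams sort as 0, 1, 2, ... not as text.
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]


# Hypothetical header with the beam columns deliberately out of order.
header = [
    "spectrum_id",
    "instanovo_predictions",
    "instanovo_predictions_beam_1",
    "instanovo_predictions_beam_0",
    "instanovo_log_probabilities_beam_0",
]
print(beam_columns(header))
```

Passing a different `prefix`, such as `instanovo_log_probabilities`, selects the matching score columns in the same order.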

These diffusion-model columns, prefixed with diffusion_, are present when using InstaNovo+ (combined transformer + diffusion model).

Column	Description	Data Type	Notes
diffusion_predictions	Predicted peptide sequence from InstaNovo+	String	The refined peptide sequence from the diffusion model
diffusion_log_probabilities	Log probability from InstaNovo+	Float	Natural logarithm of the sequence confidence
diffusion_token_log_probabilities	Token log probabilities from InstaNovo+	List[Float]	Natural logarithm of the confidence for each token
diffusion_unrefined_predictions	Unrefined predictions from InstaNovo+	String	Predictions before refinement
diffusion_predictions_beam_0-4	Predicted sequences from each beam	String	Beam search results when num_beams > 1
diffusion_log_probabilities_beam_0-4	Log probabilities from each beam	Float	Confidence scores for each beam

When using InstaNovo+ with refinement, the predictions column contains the best prediction from the diffusion model.
We recommend filtering the output based on the diffusion_log_probabilities and delta_mass_ppm columns to obtain a set of high-confidence predictions.
Beam search columns (beam_0 through beam_4) are only present when running with num_beams > 1.
The transformer model columns are prefixed with instanovo_ and diffusion model columns are prefixed with diffusion_.
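
The filtering recommendation above can be sketched as follows. The cutoffs used here (a log probability of at least ln 0.9 and an absolute mass error of at most 10 ppm) are illustrative assumptions, not values recommended by InstaNovo, and should be tuned for your data. The function name `filter_confident` and the sample rows are likewise hypothetical.

```python
import csv
import io
import math


def filter_confident(rows, min_log_prob=math.log(0.9), max_abs_ppm=10.0):
    """Keep rows that pass both the confidence and the mass-error cutoff."""
    kept = []
    for row in rows:
        log_prob = float(row["diffusion_log_probabilities"])
        ppm = float(row["delta_mass_ppm"])
        # Compare the absolute mass error so the check is sign-independent.
        if log_prob >= min_log_prob and abs(ppm) <= max_abs_ppm:
            kept.append(row)
    return kept


# Hypothetical sample: only the first row passes both cutoffs.
sample_csv = io.StringIO(
    "spectrum_id,predictions,diffusion_log_probabilities,delta_mass_ppm\n"
    "yeast:1,PEPTIDEK,-0.02,1.5\n"
    "yeast:2,ELVISK,-0.02,45.0\n"
    "yeast:3,SAMPLER,-3.10,0.8\n"
)
confident = filter_confident(csv.DictReader(sample_csv))
print([row["spectrum_id"] for row in confident])
```

Applying both conditions together is deliberate: a confident score with a large mass error, or a small mass error with a low score, is dropped.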