Dataset

AnnotatedPolarsSpectrumDataset(data_frame, peptides)

Bases: PolarsSpectrumDataset

A Polars-backed spectrum dataset that also stores a list of peptides aligned with the rows of the data frame.

Source code in instanovo/diffusion/dataset.py
def __init__(self, data_frame: polars.DataFrame, peptides: list[str]) -> None:
    super().__init__(data_frame)
    self.peptides = peptides

AnnotatedSpectrumBatch

Bases: NamedTuple

Represents a batch of annotated spectrum data.

Attributes:

Name Type Description
spectra FloatTensor

The tensor containing the spectra data.

spectra_padding_mask BoolTensor

A boolean tensor indicating the padding positions in the spectra tensor.

precursors FloatTensor

The tensor containing precursor information for each spectrum: precursor mass, charge and m/z.

peptides LongTensor

The tensor containing peptide sequences encoded as residue indices.

peptide_padding_mask BoolTensor

A boolean tensor indicating the padding positions in the peptides tensor.
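The two padding masks can be illustrated without any tensors. The sketch below mirrors the idea used by the collation code on this page (a padded peak has m/z equal to zero, a padded peptide position equals the pad index) using plain Python lists. The toy peaks, the `PAD_PEAK` name and the pad index value of 0 are assumptions made for illustration only.

```python
# Hypothetical toy batch: each spectrum is a list of (m/z, intensity) peaks.
spectra = [
    [(101.1, 0.5), (202.2, 1.0), (303.3, 0.2)],
    [(150.0, 0.9)],
]

# Pad every spectrum with zero peaks up to the longest one in the batch.
PAD_PEAK = (0.0, 0.0)
longest = max(len(spectrum) for spectrum in spectra)
padded_spectra = [s + [PAD_PEAK] * (longest - len(s)) for s in spectra]

# A position is padding when its m/z value is exactly zero.
spectra_padding_mask = [[peak[0] == 0.0 for peak in s] for s in padded_spectra]

# Peptides that are already encoded as indices; 0 is the assumed pad index.
PAD_INDEX = 0
peptides = [[3, 5, 0, 0], [4, 4, 2, 0]]
peptide_padding_mask = [[idx == PAD_INDEX for idx in p] for p in peptides]

print(spectra_padding_mask)
print(peptide_padding_mask)
```

In both masks `True` marks a padded position, so downstream code can ignore those entries.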

PolarsSpectrumDataset(data_frame)

Bases: Dataset

A Polars data frame index wrapper for depthcharge/casanovo datasets.

Source code in instanovo/diffusion/dataset.py
def __init__(self, data_frame: polars.DataFrame) -> None:
    self.data = data_frame

SpectrumBatch

Bases: NamedTuple

Represents a batch of spectrum data without annotations.

Attributes:

Name Type Description
spectra FloatTensor

The tensor containing the spectra data.

spectra_padding_mask BoolTensor

A boolean tensor indicating the padding positions in the spectra tensor.

precursors FloatTensor

The tensor containing precursor information for each spectrum: precursor mass, charge and m/z.
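The collation code listed on this page derives the precursor mass from the precursor m/z and charge as `(precursor_mz - PROTON_MASS_AMU) * precursor_charge`, then stacks mass, charge and m/z. The following is a minimal pure-Python sketch of that arithmetic. The numeric value given for `PROTON_MASS_AMU` and the helper name `precursor_features` are assumptions, not taken from the source.

```python
# Assumed value of the proton mass in atomic mass units; the library
# defines its own PROTON_MASS_AMU constant, which may differ slightly.
PROTON_MASS_AMU = 1.007276


def precursor_features(precursor_mz: float, precursor_charge: int) -> tuple[float, float, float]:
    """Return (mass, charge, m/z) in the order the collation code stacks them."""
    mass = (precursor_mz - PROTON_MASS_AMU) * precursor_charge
    return (mass, float(precursor_charge), precursor_mz)


# A doubly charged precursor observed at m/z 500.5.
features = precursor_features(500.5, 2)
print(features)
```

For this example the mass comes out at roughly 998.985, that is, twice the m/z after removing one proton mass per charge.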

collate_batches(residues, max_length, time_steps, annotated)

Get batch collation function for given residue set, maximum length and time steps.

The returned function combines spectra and precursor information for a batch into torch tensors. It also maps the residues in a peptide to their indices in residues, pads or truncates them all to max_length and returns this as a torch tensor.
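The encode, truncate and pad step described above can be sketched in plain Python. This is an illustration under stated assumptions: the vocabulary in `index_map`, the pad index of 0, and the helpers `tokenize`, `encode` and `encode_peptide` are hypothetical stand-ins, and the real `ResidueSet` tokenizer may handle modified residues that this one-letter split does not.

```python
# Hypothetical vocabulary; the real index map comes from the ResidueSet.
PAD_INDEX = 0
index_map = {"[PAD]": 0, "G": 1, "A": 2, "S": 3, "P": 4}


def tokenize(sequence: str) -> list[str]:
    """Split a peptide into one-letter residues (no modifications handled)."""
    return list(sequence)


def encode(tokens: list[str], pad_length: int) -> list[int]:
    """Map residues to indices and right-pad with PAD_INDEX."""
    indices = [index_map[token] for token in tokens]
    return indices + [PAD_INDEX] * (pad_length - len(indices))


def encode_peptide(sequence: str, max_length: int) -> list[int]:
    """Truncate to max_length tokens first, then pad up to max_length."""
    return encode(tokenize(sequence)[:max_length], pad_length=max_length)


short = encode_peptide("GASP", 6)      # shorter than max_length: padded
long = encode_peptide("GASPGASP", 6)   # longer than max_length: truncated
print(short, long)
```

Because truncation happens before padding, every encoded peptide has exactly `max_length` entries, which is what allows them to be stacked into one tensor.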

Parameters:

Name Type Description Default
residues ResidueSet

The residues in the vocabulary together with their masses and index map.

required
max_length int

The maximum peptide sequence length. Longer sequences are truncated and shorter ones are padded to this length.

required
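time_steps int

The number of diffusion time steps.

required
annotated bool

Whether each example includes a peptide sequence. If true, the returned function produces an AnnotatedSpectrumBatch; otherwise it produces a SpectrumBatch.

required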

Returns:

Type Description
Callable[[list[tuple[FloatTensor, float, int, str]]], SpectrumBatch | AnnotatedSpectrumBatch]

The function that combines examples into a batch given the parameters above.

Source code in instanovo/diffusion/dataset.py
def collate_batches(
    residues: ResidueSet, max_length: int, time_steps: int, annotated: bool
) -> Callable[
    [list[tuple[torch.FloatTensor, float, int, str]]], SpectrumBatch | AnnotatedSpectrumBatch
]:
    """Get batch collation function for given residue set, maximum length and time steps.

    The returned function combines spectra and precursor information for a batch into
    `torch` tensors. It also maps the residues in a peptide to their indices in
    `residues`, pads or truncates them all to `max_length` and returns this as a
    `torch` tensor.

    Args:
        residues (ResidueSet):
            The residues in the vocabulary together with their masses
            and index map.

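        max_length (int):
            The maximum peptide sequence length. Longer sequences are
            truncated and shorter ones are padded to this length.

        time_steps (int):
            The number of diffusion time steps.

        annotated (bool):
            Whether each example includes a peptide sequence.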

    Returns:
        Callable[ [list[tuple[torch.FloatTensor, float, int, str]]], SpectrumBatch | AnnotatedSpectrumBatch]:
            The function that combines examples into a batch given the parameters above.
    """

    def fn(
        batch: list[tuple[torch.Tensor, float, int, str]]
    ) -> SpectrumBatch | AnnotatedSpectrumBatch:
        if annotated:
            spectra, precursor_mz, precursor_charge, peptides = list(zip(*batch))
        else:
            spectra, precursor_mz, precursor_charge = list(zip(*batch))

        spectra = torch.nn.utils.rnn.pad_sequence(spectra, batch_first=True)
        spectra_padding_mask = spectra[:, :, 0] == 0.0

        precursor_mz = torch.tensor(precursor_mz)
        precursor_charge = torch.FloatTensor(precursor_charge)
        precursor_masses = (precursor_mz - PROTON_MASS_AMU) * precursor_charge
        precursors = torch.stack([precursor_masses, precursor_charge, precursor_mz], -1).float()
        if annotated:
            peptides = [sequence if isinstance(sequence, str) else "$" for sequence in peptides]
            peptides = [sequence if len(sequence) > 0 else "$" for sequence in peptides]
            peptides = torch.stack(
                [
                    residues.encode(residues.tokenize(sequence)[:max_length], pad_length=max_length)
                    for sequence in peptides
                ]
            )
            peptide_padding_mask = peptides == residues.pad_index
            return AnnotatedSpectrumBatch(
                spectra, spectra_padding_mask, precursors, peptides, peptide_padding_mask
            )
        else:
            return SpectrumBatch(spectra, spectra_padding_mask, precursors)

    return fn
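The first steps of the inner function, unzipping the batch into per-field tuples and replacing missing or empty peptides with the `"$"` placeholder, can be tried on their own with plain Python data. The toy examples below are hypothetical; real spectra are tensors rather than lists of tuples.

```python
# Hypothetical annotated examples: (spectrum, precursor m/z, charge, peptide).
batch = [
    ([(101.1, 0.5)], 500.5, 2, "GASP"),
    ([(150.0, 0.9), (250.0, 0.1)], 300.2, 3, None),  # missing annotation
    ([(120.0, 0.3)], 410.7, 2, ""),                  # empty annotation
]

# zip(*batch) transposes a list of examples into one tuple per field.
spectra, precursor_mz, precursor_charge, peptides = zip(*batch)

# Non-string or empty sequences fall back to the "$" placeholder,
# combining the two list comprehensions in the source into one pass.
peptides = [
    sequence if isinstance(sequence, str) and len(sequence) > 0 else "$"
    for sequence in peptides
]

print(precursor_mz, precursor_charge, peptides)
```

After this step every peptide is a non-empty string, so tokenizing and encoding never receive `None` or an empty sequence.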