Dataset

dataset

DataProcessor(metadata_columns: list[str] | set[str] | None = None)

Data processor abstract class.

This class is used to process the data before it is used in the model. It is designed to be used with the Dataset class from the HuggingFace datasets library.

It includes two main methods:

- process_row: Processes a row of data.
- collate_fn: Collates a batch of data. To be passed to the DataLoader class.

It also accepts metadata columns, which are kept after a dataset is processed and which bypass collate_fn.

Initialize the data processor.

PARAMETER DESCRIPTION
metadata_columns

The metadata columns to add to the expected columns.

TYPE: list[str] | set[str] | None DEFAULT: None
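
The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not the library's code: `SketchProcessor` is a hypothetical stand-in that mirrors only the documented constructor, the `metadata_columns` property, and the abstract `process_row`. The `LengthProcessor` subclass and the `sequence` and `scan_id` column names are likewise assumptions.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class SketchProcessor(ABC):
    """Hypothetical stand-in mirroring the documented interface (not the library class)."""

    def __init__(self, metadata_columns: list[str] | set[str] | None = None) -> None:
        # Store metadata columns as a set, matching the documented property type.
        self._metadata_columns: set[str] = set(metadata_columns or [])

    @property
    def metadata_columns(self) -> set[str]:
        return self._metadata_columns

    @abstractmethod
    def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        """Subclasses define how a single row is transformed."""


class LengthProcessor(SketchProcessor):
    """Toy subclass: adds the length of an assumed 'sequence' column."""

    def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        return {**row, "length": len(row["sequence"])}


processor = LengthProcessor(metadata_columns=["scan_id"])
processed = processor.process_row({"sequence": "PEPTIDE", "scan_id": 7})
```

In real use the subclass would inherit from DataProcessor itself, so process_dataset and collate_fn come from the base class.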

metadata_columns: set[str] property

Get the metadata columns.

These columns are kept after processing a dataset.

RETURNS DESCRIPTION
set[str]

set[str]: The metadata columns.

process_row(row: dict[str, Any]) -> dict[str, Any] abstractmethod

Process a single row of data.

PARAMETER DESCRIPTION
row

The row of data to process in dict format.

TYPE: dict[str, Any]

RETURNS DESCRIPTION
dict[str, Any]

dict[str, Any]: The processed row with resulting columns.

process_dataset(dataset: Dataset, return_format: str | None = 'torch') -> Dataset

Process a dataset by mapping the process_row method.

The resulting dataset has the columns expected by the collate_fn method.

PARAMETER DESCRIPTION
dataset

The dataset to process.

TYPE: Dataset

return_format

The format to return the dataset in. Default is "torch".

TYPE: str | None DEFAULT: 'torch'

RETURNS DESCRIPTION
Dataset

The processed dataset.

TYPE: Dataset
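
As a rough plain-Python analogue of the behaviour described for process_dataset, the sketch below maps a row function over a list of dicts and keeps only the expected columns. It does not use the HuggingFace Dataset API and does not model return_format. `process_rows`, `add_length`, and the column names are hypothetical.

```python
from __future__ import annotations

from typing import Any, Callable


def process_rows(
    rows: list[dict[str, Any]],
    process_row: Callable[[dict[str, Any]], dict[str, Any]],
    expected_columns: list[str],
) -> list[dict[str, Any]]:
    """Apply process_row to each row, then keep only the expected columns."""
    processed = []
    for row in rows:
        new_row = process_row(row)
        processed.append({col: new_row[col] for col in expected_columns if col in new_row})
    return processed


def add_length(row: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical row function: adds the length of the 'sequence' column.
    return {**row, "length": len(row["sequence"])}


rows = [
    {"sequence": "PEPTIDE", "scan_id": 1, "comment": "dropped"},
    {"sequence": "LK", "scan_id": 2, "comment": "dropped"},
]
# Assumed expected columns: two data columns plus one metadata column.
expected = ["sequence", "length", "scan_id"]
processed_rows = process_rows(rows, add_length, expected)
```

The `comment` column is not in the expected list, so it is absent from every processed row.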

collate_fn(batch: list[dict[str, Any]]) -> dict[str, Any]

Collate a batch.

Metadata columns are added after collation.

PARAMETER DESCRIPTION
batch

The batch to collate.

TYPE: list[dict[str, Any]]

RETURNS DESCRIPTION
dict[str, Any]

dict[str, Any]: The collated batch with metadata.
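
One way to picture metadata bypassing collation is sketched below. It assumes list-valued data columns collated by right-padding, which is only a toy choice. `pad_collate`, `collate_with_metadata`, and the `tokens` and `scan_id` columns are hypothetical names, not part of the documented API.

```python
from __future__ import annotations

from typing import Any


def pad_collate(rows: list[dict[str, list[int]]], pad_value: int = 0) -> dict[str, Any]:
    """Toy collation: right-pad every list-valued column to the longest row."""
    collated: dict[str, Any] = {}
    for key in rows[0]:
        width = max(len(row[key]) for row in rows)
        collated[key] = [row[key] + [pad_value] * (width - len(row[key])) for row in rows]
    return collated


def collate_with_metadata(
    batch: list[dict[str, Any]], metadata_columns: set[str]
) -> dict[str, Any]:
    """Collate the data columns, then attach metadata columns untouched."""
    # Metadata is split off first so the collation step never sees it.
    data_rows = [
        {k: v for k, v in row.items() if k not in metadata_columns} for row in batch
    ]
    collated = pad_collate(data_rows)
    # Metadata is added after collation as a plain per-row list.
    for column in metadata_columns:
        if column in batch[0]:
            collated[column] = [row[column] for row in batch]
    return collated


batch = [
    {"tokens": [1, 2, 3], "scan_id": "a"},
    {"tokens": [4], "scan_id": "b"},
]
result = collate_with_metadata(batch, {"scan_id"})
```

Here `tokens` is padded to a common width, while `scan_id` passes through as a list of the original values.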

get_expected_columns() -> list[str]

Get the expected columns to be kept in the dataset after processing.

These columns are expected by the collate_fn method and include both data and metadata columns.

RETURNS DESCRIPTION
list[str]

list[str]: The expected columns.

add_metadata_columns(columns: list[str] | set[str]) -> None

Add expected metadata columns.

PARAMETER DESCRIPTION
columns

The columns to add.

TYPE: list[str] | set[str]

remove_modifications(peptide: str, replace_isoleucine_with_leucine: bool = True) -> str staticmethod

Remove modifications and optionally replace Isoleucine with Leucine.

PARAMETER DESCRIPTION
peptide

The peptide to remove modifications from.

TYPE: str

replace_isoleucine_with_leucine

Whether to replace Isoleucine with Leucine.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
str

The peptide with modifications removed.

TYPE: str
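
A minimal sketch of this kind of cleanup is shown below. It assumes modifications are written in square brackets or parentheses after a residue, for example `M[UNIMOD:35]` or `M(ox)`. The notation the library actually handles is not stated here, and `strip_modifications` is a hypothetical name. Isoleucine and leucine have the same mass, which is the usual reason for merging them.

```python
import re

# Assumed notation: a modification is any bracketed or parenthesised annotation.
_MODIFICATION = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def strip_modifications(peptide: str, replace_isoleucine_with_leucine: bool = True) -> str:
    """Remove bracketed annotations, then optionally map I to L."""
    # Strip annotations first so letters inside them (e.g. the 'I' in UNIMOD)
    # are never touched by the residue replacement.
    stripped = _MODIFICATION.sub("", peptide)
    if replace_isoleucine_with_leucine:
        stripped = stripped.replace("I", "L")
    return stripped


with_leucine = strip_modifications("PEPTM[UNIMOD:35]IDE")
without_replacement = strip_modifications("M(ox)ILK", replace_isoleucine_with_leucine=False)
```

This sketch does not handle terminal modifications joined with a separator such as a hyphen; those would need extra rules.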