Dataset
dataset
DataProcessor(metadata_columns: list[str] | set[str] | None = None)
Data processor abstract class.
This class is used to process the data before it is used in the model.
It is designed to be used with the Dataset class from the HuggingFace datasets library.
It includes two main methods:
- process_row: Processes a row of data.
- collate_fn: Collates a batch of data. To be passed to the DataLoader class.
Additionally, it includes a way to pass metadata columns that will be kept after processing a dataset.
These metadata columns will also bypass the collate_fn.
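A minimal sketch of what a concrete subclass might look like. The import path of `DataProcessor` is not given here, so the block defines a small stand-in base class that mirrors only the documented constructor, property, and abstract method. The stand-in, the `WhitespaceTokenProcessor` subclass, and the column names `id`, `text`, `tokens`, and `length` are all hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class DataProcessorStandIn(ABC):
    """Hypothetical stand-in mirroring the documented interface, not the real class."""

    def __init__(self, metadata_columns: list[str] | set[str] | None = None) -> None:
        # Store metadata column names as a set, matching the documented property type.
        self._metadata_columns: set[str] = set(metadata_columns or [])

    @property
    def metadata_columns(self) -> set[str]:
        return self._metadata_columns

    @abstractmethod
    def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        """Process a single row of data."""


class WhitespaceTokenProcessor(DataProcessorStandIn):
    """Hypothetical subclass: splits a 'text' column into whitespace tokens."""

    def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        tokens = row["text"].split()
        return {"tokens": tokens, "length": len(tokens)}


processor = WhitespaceTokenProcessor(metadata_columns=["id"])
processed = processor.process_row({"id": 7, "text": "a b c"})
```

In this sketch the subclass only has to say how one row is transformed; the metadata column names are tracked separately so they can be carried through unchanged.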
Initialize the data processor.
| PARAMETER | DESCRIPTION |
|---|---|
| `metadata_columns` | The metadata columns to add to the expected columns. TYPE: `list[str] \| set[str] \| None` |
metadata_columns: set[str]
property
Get the metadata columns.
These columns are kept after processing a dataset.
| RETURNS | DESCRIPTION |
|---|---|
| `set[str]` | The metadata columns. |
process_row(row: dict[str, Any]) -> dict[str, Any]
abstractmethod
Process a single row of data.
| PARAMETER | DESCRIPTION |
|---|---|
| `row` | The row of data to process in dict format. TYPE: `dict[str, Any]` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | The processed row with resulting columns. |
process_dataset(dataset: Dataset, return_format: str | None = 'torch') -> Dataset
Process a dataset by mapping the process_row method.
The resulting dataset has the columns expected by the collate_fn method.
| PARAMETER | DESCRIPTION |
|---|---|
| `dataset` | The dataset to process. TYPE: `Dataset` |
| `return_format` | The format to return the dataset in. Default is "torch". TYPE: `str \| None` |

| RETURNS | DESCRIPTION |
|---|---|
| `Dataset` | The processed dataset. |
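The mapping step described above can be sketched in plain Python, assuming rows are dicts and that the output of `process_row` is merged into each row before unexpected columns are dropped. This illustrates the idea only and is not the library's implementation; the helper name `process_rows` and the sample columns are hypothetical, and a list of dicts stands in for a HuggingFace `Dataset`.

```python
from typing import Any, Callable


def process_rows(
    rows: list[dict[str, Any]],
    process_row: Callable[[dict[str, Any]], dict[str, Any]],
    expected_columns: list[str],
) -> list[dict[str, Any]]:
    """Apply process_row to every row and keep only the expected columns."""
    keep = set(expected_columns)
    processed = []
    for row in rows:
        # Merge new columns into the original row (processed values win on clashes).
        merged = {**row, **process_row(row)}
        # Drop anything the collate step is not expecting.
        processed.append({k: v for k, v in merged.items() if k in keep})
    return processed


rows = [{"id": 1, "text": "a b", "extra": 0}]
result = process_rows(
    rows,
    process_row=lambda r: {"length": len(r["text"].split())},
    expected_columns=["length", "id"],
)
```

Here `id` survives because it is listed as an expected (metadata) column, while `text` and `extra` are removed.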
collate_fn(batch: list[dict[str, Any]]) -> dict[str, Any]
Collate a batch.
Metadata columns are added after collation.
| PARAMETER | DESCRIPTION |
|---|---|
| `batch` | The batch to collate. TYPE: `list[dict[str, Any]]` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | The collated batch with metadata. |
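One way to picture metadata bypassing collation is to split each row into data and metadata, collate only the data, and then attach the metadata as plain lists. The sketch below assumes that behaviour; `collate_with_metadata` and `stack_as_lists` are hypothetical helpers, not part of the documented API, and a real data collator would typically pad and stack tensors instead of building lists.

```python
from typing import Any, Callable


def stack_as_lists(rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Toy data collator: gather each column's values into a list."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


def collate_with_metadata(
    batch: list[dict[str, Any]],
    metadata_columns: set[str],
    collate_data: Callable[[list[dict[str, Any]]], dict[str, Any]],
) -> dict[str, Any]:
    """Collate data columns, then add metadata columns untouched."""
    # Pull metadata out first so the data collator never sees it.
    metadata = {
        col: [row[col] for row in batch]
        for col in sorted(metadata_columns)
        if all(col in row for row in batch)
    }
    data_rows = [
        {k: v for k, v in row.items() if k not in metadata_columns} for row in batch
    ]
    collated = collate_data(data_rows)
    # Metadata is added after collation, as uncollated per-row values.
    collated.update(metadata)
    return collated


batch = [{"length": 2, "id": "x"}, {"length": 3, "id": "y"}]
collated = collate_with_metadata(batch, {"id"}, stack_as_lists)
```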
get_expected_columns() -> list[str]
Get the expected columns to be kept in the dataset after processing.
These columns are expected by the collate_fn method and include
both data and metadata columns.
| RETURNS | DESCRIPTION |
|---|---|
| `list[str]` | The expected columns. |
add_metadata_columns(columns: list[str] | set[str]) -> None
Add expected metadata columns.
| PARAMETER | DESCRIPTION |
|---|---|
| `columns` | The columns to add. TYPE: `list[str] \| set[str]` |
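The relationship between `add_metadata_columns` and `get_expected_columns` can be sketched as follows, assuming the expected columns are simply the data columns followed by any registered metadata columns. The `ColumnRegistry` class and its ordering rule are hypothetical; the real class may store and order columns differently.

```python
from __future__ import annotations


class ColumnRegistry:
    """Hypothetical helper tracking data columns plus a set of metadata columns."""

    def __init__(
        self,
        data_columns: list[str],
        metadata_columns: list[str] | set[str] | None = None,
    ) -> None:
        self._data_columns = list(data_columns)
        self._metadata_columns: set[str] = set(metadata_columns or [])

    def add_metadata_columns(self, columns: list[str] | set[str]) -> None:
        # A set makes repeated additions of the same column harmless.
        self._metadata_columns.update(columns)

    def get_expected_columns(self) -> list[str]:
        # Data columns first, then metadata columns in a stable (sorted) order.
        extra = sorted(self._metadata_columns - set(self._data_columns))
        return self._data_columns + extra


registry = ColumnRegistry(["tokens", "length"], metadata_columns=["id"])
registry.add_metadata_columns({"source", "id"})
expected = registry.get_expected_columns()
```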
remove_modifications(peptide: str, replace_isoleucine_with_leucine: bool = True) -> str
staticmethod
Remove modifications and optionally replace Isoleucine with Leucine.
| PARAMETER | DESCRIPTION |
|---|---|
| `peptide` | The peptide to remove modifications from. TYPE: `str` |
| `replace_isoleucine_with_leucine` | Whether to replace Isoleucine with Leucine. TYPE: `bool` |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The peptide with modifications removed. |
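As an illustration of what stripping modifications can involve, the sketch below assumes modifications are written either in square brackets or parentheses (for example `M[UNIMOD:35]` or `M(+15.99)`) or as inline mass shifts, and that residues are uppercase one-letter codes. The function `strip_modifications` is a hypothetical stand-in; the notation actually handled by `remove_modifications` is not specified here.

```python
import re

# Bracketed or parenthesised annotations, e.g. "[UNIMOD:35]" or "(+15.99)".
_ANNOTATION = re.compile(r"\[[^\]]*\]|\([^)]*\)")
# Anything that is not an uppercase residue letter (digits, signs, dashes, dots).
_NON_RESIDUE = re.compile(r"[^A-Z]")


def strip_modifications(peptide: str, replace_isoleucine_with_leucine: bool = True) -> str:
    """Return the bare residue sequence, optionally mapping I to L."""
    # Remove annotations first, since their contents may include uppercase letters.
    bare = _ANNOTATION.sub("", peptide)
    bare = _NON_RESIDUE.sub("", bare)
    if replace_isoleucine_with_leucine:
        # I and L have identical mass, so they are often treated as one residue.
        bare = bare.replace("I", "L")
    return bare


example = strip_modifications("ACM[UNIMOD:35]IK")
```

Removing the bracketed annotations before the character filter matters: otherwise the letters in a tag such as `UNIMOD` would be mistaken for residues.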