
Index

common

__all__ = ['DataProcessor', 'AccelerateDeNovoTrainer', 'AccelerateDeNovoPredictor', 'FinetuneScheduler', 'WarmupScheduler', 'CosineWarmupScheduler', 'NeptuneSummaryWriter', 'TrainingState', 'Timer'] module-attribute

DataProcessor(metadata_columns: list[str] | set[str] | None = None)

Data processor abstract class.

This class is used to process the data before it is used in the model. It is designed to be used with the Dataset class from the HuggingFace datasets library.

It includes two main methods:

- process_row: Processes a row of data.
- collate_fn: Collates a batch of data. To be passed to the DataLoader class.

Additionally, it includes a way to pass metadata columns that will be kept after processing a dataset. These metadata columns will also bypass the collate_fn.

Initialize the data processor.

PARAMETER DESCRIPTION
metadata_columns

The metadata columns to add to the expected columns.

TYPE: list[str] | set[str] | None DEFAULT: None

metadata_columns: set[str] property

Get the metadata columns.

These columns are kept after processing a dataset.

RETURNS DESCRIPTION
set[str]

set[str]: The metadata columns.

process_row(row: dict[str, Any]) -> dict[str, Any] abstractmethod

Process a single row of data.

PARAMETER DESCRIPTION
row

The row of data to process in dict format.

TYPE: dict[str, Any]

RETURNS DESCRIPTION
dict[str, Any]

dict[str, Any]: The processed row with resulting columns.

process_dataset(dataset: Dataset, return_format: str | None = 'torch') -> Dataset

Process a dataset by mapping the process_row method.

The resulting dataset has the columns expected by the collate_fn method.

PARAMETER DESCRIPTION
dataset

The dataset to process.

TYPE: Dataset

return_format

The format to return the dataset in. Default is "torch".

TYPE: str | None DEFAULT: 'torch'

RETURNS DESCRIPTION
Dataset

The processed dataset.

TYPE: Dataset

collate_fn(batch: list[dict[str, Any]]) -> dict[str, Any]

Collate a batch.

Metadata columns are added after collation.

PARAMETER DESCRIPTION
batch

The batch to collate.

TYPE: list[dict[str, Any]]

RETURNS DESCRIPTION
dict[str, Any]

dict[str, Any]: The collated batch with metadata.
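The split described for DataProcessor, where data columns are collated while metadata columns bypass collation and are re-attached afterwards, can be sketched without any dependencies. The classes below are hypothetical stand-ins, not the library's implementation: `SketchProcessor`, the `_collate_batch` hook, `PeakCountProcessor`, and the `mz` / `spectrum_id` column names are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, Optional


class SketchProcessor(ABC):
    """Hypothetical stand-in mirroring the documented DataProcessor interface."""

    def __init__(self, metadata_columns: Optional[Iterable[str]] = None) -> None:
        self._metadata_columns: set[str] = set(metadata_columns or [])

    @property
    def metadata_columns(self) -> set[str]:
        return self._metadata_columns

    @abstractmethod
    def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        """Turn one raw row into the columns expected by collate_fn."""

    @abstractmethod
    def _collate_batch(self, batch: list[dict[str, Any]]) -> dict[str, Any]:
        """Assumed hook: collate only the data (non-metadata) columns."""

    def collate_fn(self, batch: list[dict[str, Any]]) -> dict[str, Any]:
        # Strip metadata so it never reaches the data collation step.
        data = [
            {k: v for k, v in row.items() if k not in self._metadata_columns}
            for row in batch
        ]
        collated = self._collate_batch(data)
        # Re-attach metadata untouched, as one list per column.
        for col in self._metadata_columns:
            collated[col] = [row[col] for row in batch]
        return collated


class PeakCountProcessor(SketchProcessor):
    """Toy subclass: counts peaks per spectrum and keeps the spectrum ID."""

    def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        return {"n_peaks": len(row["mz"]), "spectrum_id": row["spectrum_id"]}

    def _collate_batch(self, batch: list[dict[str, Any]]) -> dict[str, Any]:
        return {"n_peaks": [row["n_peaks"] for row in batch]}


processor = PeakCountProcessor(metadata_columns=["spectrum_id"])
rows = [
    processor.process_row({"mz": [100.1, 200.2], "spectrum_id": "a"}),
    processor.process_row({"mz": [300.3], "spectrum_id": "b"}),
]
batch = processor.collate_fn(rows)
```

In real use, the `collate_fn` of a concrete processor would be passed to a DataLoader, as the class description states.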

get_expected_columns() -> list[str]

Get the expected columns to be kept in the dataset after processing.

These columns are expected by the collate_fn method and include both data and metadata columns.

RETURNS DESCRIPTION
list[str]

list[str]: The expected columns.

add_metadata_columns(columns: list[str] | set[str]) -> None

Add expected metadata columns.

PARAMETER DESCRIPTION
columns

The columns to add.

TYPE: list[str] | set[str]

remove_modifications(peptide: str, replace_isoleucine_with_leucine: bool = True) -> str staticmethod

Remove modifications and optionally replace Isoleucine with Leucine.

PARAMETER DESCRIPTION
peptide

The peptide to remove modifications from.

TYPE: str

replace_isoleucine_with_leucine

Whether to replace Isoleucine with Leucine.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
str

The peptide with modifications removed.

TYPE: str
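A minimal sketch of what removing modifications might look like. It assumes modifications are written as bracketed or parenthesised annotations after a residue (for example `M[UNIMOD:35]` or `M(+15.99)`); the notation the library actually accepts is not stated here, and `strip_modifications` is a hypothetical name.

```python
import re


def strip_modifications(peptide: str, replace_isoleucine_with_leucine: bool = True) -> str:
    # Assumption: modifications are bracketed or parenthesised groups.
    # Drop those groups first so letters inside them are not kept as residues.
    bare = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", peptide)
    # Keep only uppercase residue letters (removes stray signs, digits, dots).
    bare = re.sub(r"[^A-Z]", "", bare)
    # Isoleucine and leucine have identical mass, so they are often merged.
    if replace_isoleucine_with_leucine:
        bare = bare.replace("I", "L")
    return bare


with_unimod = strip_modifications("AM[UNIMOD:35]IK")
with_mass_shift = strip_modifications("AM(+15.99)IK", replace_isoleucine_with_leucine=False)
```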

AccelerateDeNovoPredictor(config: DictConfig)

Predictor class that uses the Accelerate library.

s3: S3FileHandler property

Get the S3 file handler.

RETURNS DESCRIPTION
S3FileHandler

The S3 file handler

TYPE: S3FileHandler

config = config instance-attribute

targets: list | None = None instance-attribute

output_path = self.config.get('output_path', None) instance-attribute

pred_df: pd.DataFrame | None = None instance-attribute

results_dict: dict | None = None instance-attribute

prediction_tokenised_col = self.config.get('prediction_tokenised_col', 'predictions_tokenised') instance-attribute

prediction_col = self.config.get('prediction_col', 'predictions') instance-attribute

log_probs_col = self.config.get('log_probs_col', 'log_probs') instance-attribute

token_log_probs_col = self.config.get('token_log_probs_col', 'token_log_probs') instance-attribute

save_encoder_outputs = config.get('save_encoder_outputs', False) instance-attribute

encoder_output_path = config.get('encoder_output_path', None) instance-attribute

encoder_output_reduction = config.get('encoder_output_reduction', 'mean') instance-attribute

accelerator = self.setup_accelerator() instance-attribute

denovo = self.config.get('denovo', False) instance-attribute

model = self.model.eval() instance-attribute

residue_set = self.model.residue_set instance-attribute

test_dataset = self.load_dataset() instance-attribute

test_dataloader = self.build_dataloader(self.test_dataset) instance-attribute

decoder = self.setup_decoder() instance-attribute

metrics = self.setup_metrics() instance-attribute

running_loss = None instance-attribute

steps_per_inference = len(self.test_dataloader) instance-attribute
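The attributes above that are read with `config.get(...)` each fall back to a default when the key is absent. The sketch below collects those keys and defaults in one place. A plain `dict` stands in for the `DictConfig`, and the `output_path` value is a made-up example.

```python
# A plain dict stands in for the DictConfig; both support .get(key, default).
config = {"output_path": "predictions.csv", "denovo": True}  # hypothetical values

resolved = {
    "output_path": config.get("output_path", None),
    "prediction_tokenised_col": config.get("prediction_tokenised_col", "predictions_tokenised"),
    "prediction_col": config.get("prediction_col", "predictions"),
    "log_probs_col": config.get("log_probs_col", "log_probs"),
    "token_log_probs_col": config.get("token_log_probs_col", "token_log_probs"),
    "save_encoder_outputs": config.get("save_encoder_outputs", False),
    "encoder_output_path": config.get("encoder_output_path", None),
    "encoder_output_reduction": config.get("encoder_output_reduction", "mean"),
    "denovo": config.get("denovo", False),
}

# Keys that were not supplied fall back to the documented defaults.
for key, value in resolved.items():
    print(f"{key} = {value!r}")
```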

load_model() -> Tuple[nn.Module, DictConfig] abstractmethod

Load the model.

setup_decoder() -> Decoder abstractmethod

Set up the decoder.

setup_data_processor() -> DataProcessor abstractmethod

Set up the data processor.

get_predictions(batch: Any) -> dict[str, Any] abstractmethod

Get the predictions for a batch.

postprocess_dataset(dataset: Dataset) -> Dataset

Postprocess the dataset.

load_dataset() -> Dataset

Load the test dataset.

RETURNS DESCRIPTION
Dataset

The test dataset

TYPE: Dataset

print_sample_batch() -> None

Print a sample batch of the test data.

setup_metrics() -> Metrics

Set up the metrics.

setup_accelerator() -> Accelerator

Set up the accelerator.

build_dataloader(test_dataset: Dataset) -> torch.utils.data.DataLoader

Set up the test dataloader.

predict() -> pd.DataFrame

Predict the test dataset.

predictions_to_df(predictions: dict[str, list]) -> pd.DataFrame

Convert the predictions to a pandas DataFrame.

PARAMETER DESCRIPTION
predictions

The predictions dictionary

TYPE: dict[str, list]

RETURNS DESCRIPTION
DataFrame

pd.DataFrame: The predictions dataframe

postprocess_predictions(pred_df: pd.DataFrame) -> pd.DataFrame

Postprocess the predictions.

Optionally, this can be used to modify the predictions, e.g. by ensembling. By default, this does nothing.

PARAMETER DESCRIPTION
pred_df

The predictions dataframe

TYPE: DataFrame

RETURNS DESCRIPTION
DataFrame

pd.DataFrame: The postprocessed predictions dataframe

calculate_metrics(pred_df: pd.DataFrame) -> dict[str, Any] | None

Calculate the metrics.

PARAMETER DESCRIPTION
pred_df

The predictions dataframe

TYPE: DataFrame

RETURNS DESCRIPTION
dict[str, Any] | None

dict[str, Any] | None: The results dictionary containing the metrics

save_predictions(pred_df: pd.DataFrame, results_dict: dict[str, list] | None = None) -> None

Save the predictions to a file.

PARAMETER DESCRIPTION
pred_df

The predictions dataframe

TYPE: DataFrame

results_dict

The results dictionary containing the metrics

TYPE: dict[str, list] | None DEFAULT: None

save_encoder_outputs_to_parquet(spectrum_ids: list[str], encoder_outputs: list[np.ndarray]) -> None

Save the encoder outputs to a file.

PARAMETER DESCRIPTION
encoder_outputs

The encoder outputs

TYPE: list[ndarray]

spectrum_ids

The spectrum ids

TYPE: list[str]

CosineWarmupScheduler(optimizer: torch.optim.Optimizer, warmup: int, max_iters: int)

Bases: _LRScheduler

Learning rate scheduler with linear warm-up followed by cosine-shaped decay.

PARAMETER DESCRIPTION
optimizer

Optimizer object.

TYPE: Optimizer

warmup

The number of warm-up iterations.

TYPE: int

max_iters

The total number of iterations.

TYPE: int

get_lr() -> list[float]

Get the learning rate at the current step.

get_lr_factor(epoch: int) -> float

Get the LR factor at the current step.
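As an illustration of "linear warm-up followed by cosine-shaped decay", the function below uses one common formulation of the factor that multiplies the base learning rate. It is an assumption that the scheduler follows exactly this formula; `cosine_warmup_factor` is a hypothetical standalone function, and `warmup` is assumed to be positive.

```python
import math


def cosine_warmup_factor(step: int, warmup: int, max_iters: int) -> float:
    # Cosine-shaped decay from 1 at step 0 down to 0 at step max_iters.
    factor = 0.5 * (1.0 + math.cos(math.pi * step / max_iters))
    # During warm-up, scale that value linearly from 0 up to its full size.
    if step <= warmup:
        factor *= step / warmup
    return factor


# Print the factor at a few steps for warmup=10, max_iters=100.
for step in (0, 5, 10, 50, 100):
    print(step, round(cosine_warmup_factor(step, warmup=10, max_iters=100), 4))
```

Under this formulation the learning rate starts at zero, peaks at the end of warm-up, and returns to zero at `max_iters`.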

FinetuneScheduler(model_state_dict: dict, config: DictConfig, steps_per_epoch: int | None = None)

Scheduler for unfreezing parameters of a model.

PARAMETER DESCRIPTION
model_state_dict

The state dictionary of the model.

TYPE: dict

config

The configuration for the scheduler.

TYPE: DictConfig

steps_per_epoch

The number of steps per epoch.

TYPE: int | None DEFAULT: None

model_state_dict = model_state_dict instance-attribute

config = config instance-attribute

steps_per_epoch = steps_per_epoch instance-attribute

is_verbose = self.config.get('verbose', False) instance-attribute

schedule = self._get_schedule() instance-attribute

next_phase: dict[str, Any] | None = self.schedule.pop(0) instance-attribute

step(global_step: int) -> None

Step the unfreezing scheduler.

PARAMETER DESCRIPTION
global_step

The global step of the model.

TYPE: int
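The schedule format consumed by FinetuneScheduler is not documented here. The sketch below only illustrates the general idea suggested by the `schedule` and `next_phase` attributes: a list of phases is consumed in order, and `step(global_step)` unfreezes parameters once a phase's threshold is reached. The class name `UnfreezeSketch` and the phase keys `global_step` and `prefixes` are hypothetical.

```python
from typing import Any, Optional


class UnfreezeSketch:
    """Hypothetical phase-based unfreezing scheduler (not the library's code)."""

    def __init__(self, trainable: dict[str, bool], phases: list[dict[str, Any]]) -> None:
        # `trainable` maps parameter names to a requires-grad style flag.
        self.trainable = trainable
        # Phases are applied in order of their assumed "global_step" key.
        self.schedule = sorted(phases, key=lambda phase: phase["global_step"])
        self.next_phase: Optional[dict[str, Any]] = (
            self.schedule.pop(0) if self.schedule else None
        )

    def step(self, global_step: int) -> None:
        # Apply every phase whose threshold has been reached.
        while self.next_phase is not None and global_step >= self.next_phase["global_step"]:
            prefixes = tuple(self.next_phase["prefixes"])
            for name in self.trainable:
                if name.startswith(prefixes):
                    self.trainable[name] = True
            self.next_phase = self.schedule.pop(0) if self.schedule else None


flags = {"encoder.weight": False, "decoder.weight": False, "head.weight": True}
scheduler = UnfreezeSketch(
    flags,
    phases=[
        {"global_step": 100, "prefixes": ["decoder."]},
        {"global_step": 200, "prefixes": ["encoder."]},
    ],
)
scheduler.step(100)  # reaches the first phase only
after_first_phase = dict(flags)
scheduler.step(250)  # reaches the second phase
```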

WarmupScheduler(optimizer: torch.optim.Optimizer, warmup: int)

Bases: _LRScheduler

Linear warmup scheduler.

warmup = warmup instance-attribute

get_lr() -> list[float]

Get the learning rate at the current step.

get_lr_factor(epoch: int) -> float

Get the LR factor at the current step.
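The relationship between `get_lr_factor` and `get_lr` in a linear warm-up scheduler can be sketched as follows: one factor is computed per step and applied to the base learning rate of every parameter group. The function names are hypothetical, the clamping at 1.0 after warm-up is an assumption, and `warmup` is assumed to be positive.

```python
def linear_warmup_factor(step: int, warmup: int) -> float:
    # Ramp linearly from 0 to 1 over `warmup` steps, then hold at 1.
    return min(1.0, step / warmup)


def scaled_lrs(base_lrs: list[float], step: int, warmup: int) -> list[float]:
    # One learning rate per parameter group, each scaled by the same factor.
    factor = linear_warmup_factor(step, warmup)
    return [base_lr * factor for base_lr in base_lrs]


halfway = scaled_lrs([1e-3, 1e-4], step=50, warmup=100)
after_warmup = scaled_lrs([1e-3, 1e-4], step=500, warmup=100)
```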

AccelerateDeNovoTrainer(config: DictConfig)

Trainer class that uses the Accelerate library.

run_id: str property

Get the run ID.

RETURNS DESCRIPTION
str

The run ID

TYPE: str

s3: S3FileHandler property

Get the S3 file handler.

RETURNS DESCRIPTION
S3FileHandler

The S3 file handler

TYPE: S3FileHandler

global_step: int property

Get the current global training step.

This represents the total number of training steps across all epochs.

RETURNS DESCRIPTION
int

The current global step number

TYPE: int

epoch: int property

Get the current training epoch.

This represents the current epoch number in the training process.

RETURNS DESCRIPTION
int

The current epoch number

TYPE: int

training_state: TrainingState property

Get the training state.

config = config instance-attribute

enable_verbose_logging = self.config.get('enable_verbose_logging', True) instance-attribute

accelerator = self.setup_accelerator() instance-attribute

residue_set = ResidueSet(residue_masses=(self.config.residues.get('residues')), residue_remapping=(self.config.dataset.get('residue_remapping', None))) instance-attribute

model = self.setup_model() instance-attribute

optimizer = self.setup_optimizer() instance-attribute

lr_scheduler = self.setup_scheduler() instance-attribute

decoder = self.setup_decoder() instance-attribute

metrics = self.setup_metrics() instance-attribute

running_loss = None instance-attribute

total_steps = self.config.get('training_steps', 2500000) instance-attribute

finetune_scheduler: FinetuneScheduler | None = FinetuneScheduler(self.model.state_dict(), self.config.get('finetune')) instance-attribute

steps_per_validation = self.config.get('validation_interval', 100000) instance-attribute

steps_per_checkpoint = self.config.get('checkpoint_interval', 100000) instance-attribute

last_validation_metric = None instance-attribute

best_checkpoint_metric = None instance-attribute

setup_model() -> nn.Module abstractmethod

Set up the model.

setup_optimizer() -> torch.optim.Optimizer abstractmethod

Set up the optimizer.

setup_decoder() -> Decoder abstractmethod

Set up the decoder.

setup_data_processors() -> tuple[DataProcessor, DataProcessor] abstractmethod

Set up the two data processors, returned as a tuple.

save_model(is_best_checkpoint: bool = False) -> None abstractmethod

Save the model.

forward(batch: Any) -> tuple[torch.Tensor, dict[str, torch.Tensor]] abstractmethod

Forward pass for the model to calculate loss.

get_predictions(batch: Any) -> tuple[list[str] | list[list[str]], list[str] | list[list[str]]] abstractmethod

Get the predictions for a batch.

convert_interval_to_steps(interval: float | int, steps_per_epoch: int) -> int staticmethod

Convert an interval to steps.

PARAMETER DESCRIPTION
interval

The interval to convert.

TYPE: float | int

steps_per_epoch

The number of steps per epoch.

TYPE: int

RETURNS DESCRIPTION
int

The number of steps.

TYPE: int
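The docstring does not say how the interval is interpreted. One plausible reading, given the `float | int` type and the `steps_per_epoch` argument, is that a float is measured in epochs while an int is already a step count. The sketch below implements that reading only; it is an assumption, not confirmed behaviour, and `interval_to_steps` is a hypothetical name.

```python
from typing import Union


def interval_to_steps(interval: Union[float, int], steps_per_epoch: int) -> int:
    # Assumption: a float is expressed in epochs (0.5 means half an epoch).
    if isinstance(interval, float):
        return int(interval * steps_per_epoch)
    # Assumption: an int is already a number of steps and is returned as-is.
    return interval


half_epoch = interval_to_steps(0.5, steps_per_epoch=1000)
fixed_steps = interval_to_steps(100000, steps_per_epoch=1000)
```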

log_if_verbose(message: str, level: str = 'info') -> None

Log a message if verbose logging is enabled.

setup_metrics() -> Metrics

Set up the metrics.

setup_accelerator() -> Accelerator

Set up the accelerator.

build_dataloaders(train_dataset: Dataset, valid_dataset: Dataset) -> tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader]

Set up the training and validation dataloaders.

setup_scheduler() -> torch.optim.lr_scheduler.LRScheduler

Set up the learning rate scheduler.

RETURNS DESCRIPTION
LRScheduler

torch.optim.lr_scheduler.LRScheduler: The learning rate scheduler

setup_neptune() -> None

Set up Neptune logging.

setup_tensorboard() -> None

Set up TensorBoard logging.

load_datasets() -> tuple[Dataset, Dataset, int, int]

Load the training and validation datasets.

RETURNS DESCRIPTION
tuple[Dataset, Dataset, int, int]

tuple[Dataset, Dataset, int, int]: The training and validation datasets, followed by two integers.

print_sample_batch() -> None

Print a sample batch of the training data.

save_accelerator_state(is_best_checkpoint: bool = False) -> None

Save the accelerator state.

check_if_best_checkpoint() -> bool

Check if the last validation metric is the best metric.

load_accelerator_state() -> None

Load the accelerator state.

load_model_state() -> None

Load the model state.

update_model_state(model_state: dict[str, torch.Tensor], model_config: DictConfig) -> dict[str, torch.Tensor]

Update the model state.

update_vocab(model_state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]

Update the vocabulary of the model.

train() -> None

Train the model.

prepare_batch(batch: Iterable[Any]) -> Any

Prepare a batch for training.

Manually move tensors to accelerator.device since we do not prepare our dataloaders with the accelerator.

PARAMETER DESCRIPTION
batch

The batch to prepare.

TYPE: Iterable[Any]

RETURNS DESCRIPTION
Any

The prepared batch

TYPE: Any
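Moving a batch to the device by hand can be sketched as below. To keep the example runnable without PyTorch, it treats anything exposing a `.to(device)` method as a tensor; `move_batch` and `FakeTensor` are hypothetical names, and the real method may handle nested structures differently.

```python
from typing import Any, Iterable


def move_batch(batch: Iterable[Any], device: Any) -> list[Any]:
    # Items with a .to(device) method (as torch tensors have) are moved;
    # other items such as strings or ints pass through unchanged.
    return [item.to(device) if hasattr(item, "to") else item for item in batch]


class FakeTensor:
    """Tiny stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, device: str = "cpu") -> None:
        self.device = device

    def to(self, device: str) -> "FakeTensor":
        # Like torch, return a new object that lives on the target device.
        return FakeTensor(device)


moved = move_batch([FakeTensor(), "spectrum-1", 3], device="cuda:0")
```

With real tensors, `device` would be `accelerator.device`, as the description above states.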

train_epoch() -> None

Train the model for one epoch.

validate_epoch(num_sanity_steps: int | None = None, calculate_metrics: bool = True) -> None

Validate for one epoch.

NeptuneSummaryWriter(log_dir: str, run: neptune.Run)

Bases: SummaryWriter

Combine SummaryWriter with NeptuneWriter.

run = run instance-attribute

add_scalar(tag: str, scalar_value: float, global_step: int | float | None = None) -> None

Record a scalar to TensorBoard and Neptune.

add_text(tag: str, text_string: str, global_step: Optional[int] = None, walltime: Optional[float] = None) -> None

Record text to TensorBoard and Neptune.

add_hparams(hparam_dict: dict, metric_dict: dict, hparam_domain_discrete: Optional[Dict[str, List[Any]]] = None, run_name: Optional[str] = None, global_step: Optional[int] = None) -> None

Add a set of hyperparameters to be compared in Neptune as well as in TensorBoard.

Timer(total_steps: int | None = None)

Timer for training and validation.

start_time = time.time() instance-attribute

total_steps = total_steps instance-attribute

current_step = 0 instance-attribute

start() -> None

Restart the timer.

step() -> None

Step the timer.

get_delta() -> float

Get the time delta since the timer was started.

get_eta(current_step: int | None = None) -> float

Get the estimated time to completion.

get_total_time() -> float

Get the total time expected to complete all steps.

get_rate(current_step: int | None = None) -> float

Get the rate of steps per second.

get_step_time(current_step: int | None = None) -> float

Get the time per step.

get_time_str() -> str

Get the time delta since the timer was started, formatted as a string.

get_eta_str(current_step: int | None = None) -> str

Get the estimated time to completion, formatted as a string.

get_total_time_str() -> str

Get the total time expected to complete all steps, formatted as a string.

get_rate_str(current_step: int | None = None) -> str

Get the rate of steps per second, formatted as a string.

get_step_time_rate_str(current_step: int | None = None) -> str

Get the time per step, formatted as a string.

get_step_time_str(current_step: int | None = None) -> str

Get the time per step, formatted as a string.
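The quantities reported by the Timer are related by simple arithmetic. The sketch below shows one way to derive them from the elapsed time and the step count. `TimerSketch` is a hypothetical class, the formulas are assumptions about how such a timer typically works, and the injectable `clock` argument exists only to make the example deterministic.

```python
import time
from typing import Callable, Optional


class TimerSketch:
    """Hypothetical timer; the formulas are assumptions, not the library's code."""

    def __init__(self, total_steps: Optional[int] = None,
                 clock: Callable[[], float] = time.time) -> None:
        self.clock = clock
        self.start_time = clock()
        self.total_steps = total_steps
        self.current_step = 0

    def step(self) -> None:
        self.current_step += 1

    def get_delta(self) -> float:
        # Seconds elapsed since the timer was started.
        return self.clock() - self.start_time

    def get_step_time(self) -> float:
        # Average seconds per step (assumes at least one step was taken).
        return self.get_delta() / self.current_step

    def get_rate(self) -> float:
        # Steps per second: the reciprocal of the step time.
        return self.current_step / self.get_delta()

    def get_eta(self) -> float:
        # Remaining steps multiplied by the average time per step.
        return (self.total_steps - self.current_step) * self.get_step_time()

    def get_total_time(self) -> float:
        # Projected duration of the whole run at the current pace.
        return self.total_steps * self.get_step_time()


now = [0.0]  # fake clock value, advanced manually below
timer = TimerSketch(total_steps=100, clock=lambda: now[0])
for _ in range(20):
    timer.step()
now[0] = 10.0  # pretend 10 seconds passed while taking 20 steps
```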

TrainingState()

Training state for tracking training progress.

This class is used by Accelerate to save and load training state during checkpointing and resuming training runs. It tracks the current epoch and global step of training.

Initialize training state with zeroed counters.

global_step: int property

Get the current global step.

epoch: int property

Get the current epoch.

state_dict() -> dict[str, Any]

Get the state dictionary for saving.

RETURNS DESCRIPTION
dict[str, Any]

dict[str, Any]: Dictionary containing the current training state.

load_state_dict(state_dict: dict[str, Any]) -> None

Load state from a dictionary.

PARAMETER DESCRIPTION
state_dict

Dictionary containing the training state to load.

TYPE: dict[str, Any]

step() -> None

Increment the global step.

step_epoch() -> None

Increment the epoch counter.

unstep_epoch() -> None

Decrement the epoch counter.
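Accelerate can checkpoint custom objects that expose `state_dict` and `load_state_dict` (registered through `Accelerator.register_for_checkpointing`), which is the contract TrainingState follows. A dependency-free sketch of such an object is shown below. `TrainingStateSketch` is a hypothetical name, and both the dictionary keys and the single-unit increments are assumptions.

```python
from typing import Any


class TrainingStateSketch:
    """Hypothetical counterpart of TrainingState with zeroed counters."""

    def __init__(self) -> None:
        self._global_step = 0
        self._epoch = 0

    @property
    def global_step(self) -> int:
        return self._global_step

    @property
    def epoch(self) -> int:
        return self._epoch

    def state_dict(self) -> dict[str, Any]:
        # Key names are assumed for illustration.
        return {"global_step": self._global_step, "epoch": self._epoch}

    def load_state_dict(self, state_dict: dict[str, Any]) -> None:
        self._global_step = state_dict["global_step"]
        self._epoch = state_dict["epoch"]

    def step(self) -> None:
        self._global_step += 1

    def step_epoch(self) -> None:
        self._epoch += 1

    def unstep_epoch(self) -> None:
        self._epoch -= 1


# Simulate a run, save its state, and restore it into a fresh object.
state = TrainingStateSketch()
for _ in range(3):
    state.step()
state.step_epoch()
saved = state.state_dict()

resumed = TrainingStateSketch()
resumed.load_state_dict(saved)
```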