common
__all__ = ['DataProcessor', 'AccelerateDeNovoTrainer', 'AccelerateDeNovoPredictor', 'FinetuneScheduler', 'WarmupScheduler', 'CosineWarmupScheduler', 'NeptuneSummaryWriter', 'TrainingState', 'Timer']
module-attribute
DataProcessor(metadata_columns: list[str] | set[str] | None = None)
Data processor abstract class.
This class is used to process the data before it is used in the model.
It is designed to be used with the Dataset class from the HuggingFace datasets library.
It includes two main methods:
- process_row: Processes a row of data.
- collate_fn: Collates a batch of data; to be passed to the DataLoader class.
Additionally, it includes a way to pass metadata columns that will be kept after processing a dataset.
These metadata columns will also bypass the collate_fn.
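A minimal sketch of a concrete subclass, assuming `DataProcessor` is imported from this module and using hypothetical column names (`mz`, `intensity`); the real processors in this package will differ:

```python
from typing import Any

import torch


class SpectrumProcessor(DataProcessor):  # hypothetical subclass for illustration
    def process_row(self, row: dict[str, Any]) -> dict[str, Any]:
        # Convert raw columns into model-ready tensors (column names assumed).
        return {
            "mz": torch.tensor(row["mz"], dtype=torch.float32),
            "intensity": torch.tensor(row["intensity"], dtype=torch.float32),
        }

    def collate_fn(self, batch: list[dict[str, Any]]) -> dict[str, Any]:
        # Pad variable-length spectra into batch tensors.
        return {
            key: torch.nn.utils.rnn.pad_sequence(
                [item[key] for item in batch], batch_first=True
            )
            for key in ("mz", "intensity")
        }
```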
Initialize the data processor.
| PARAMETER | DESCRIPTION |
|---|---|
| `metadata_columns` | The metadata columns to add to the expected columns. TYPE: `list[str] \| set[str] \| None` |
metadata_columns: set[str]
property
Get the metadata columns.
These columns are kept after processing a dataset.
| RETURNS | DESCRIPTION |
|---|---|
| `set[str]` | The metadata columns. |
process_row(row: dict[str, Any]) -> dict[str, Any]
abstractmethod
Process a single row of data.
| PARAMETER | DESCRIPTION |
|---|---|
| `row` | The row of data to process in dict format. TYPE: `dict[str, Any]` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | The processed row with resulting columns. |
process_dataset(dataset: Dataset, return_format: str | None = 'torch') -> Dataset
Process a dataset by mapping the process_row method.
The resulting dataset has the columns expected by the collate_fn method.
| PARAMETER | DESCRIPTION |
|---|---|
| `dataset` | The dataset to process. TYPE: `Dataset` |
| `return_format` | The format to return the dataset in. Default is `"torch"`. TYPE: `str \| None` |

| RETURNS | DESCRIPTION |
|---|---|
| `Dataset` | The processed dataset. |
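A usage sketch tying `process_dataset` and `collate_fn` together, assuming the hypothetical `SpectrumProcessor` above and a HuggingFace `Dataset` named `dataset`:

```python
from torch.utils.data import DataLoader

processor = SpectrumProcessor(metadata_columns=["spectrum_id"])  # metadata survives processing
processed = processor.process_dataset(dataset, return_format="torch")

# Metadata columns bypass collate_fn and are re-attached to each batch.
loader = DataLoader(processed, batch_size=32, collate_fn=processor.collate_fn)
```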
collate_fn(batch: list[dict[str, Any]]) -> dict[str, Any]
Collate a batch.
Metadata columns are added after collation.
| PARAMETER | DESCRIPTION |
|---|---|
| `batch` | The batch to collate. TYPE: `list[dict[str, Any]]` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | The collated batch with metadata. |
get_expected_columns() -> list[str]
Get the expected columns to be kept in the dataset after processing.
These columns are expected by the collate_fn method and include
both data and metadata columns.
| RETURNS | DESCRIPTION |
|---|---|
| `list[str]` | The expected columns. |
add_metadata_columns(columns: list[str] | set[str]) -> None
Add expected metadata columns.
| PARAMETER | DESCRIPTION |
|---|---|
| `columns` | The columns to add. TYPE: `list[str] \| set[str]` |
remove_modifications(peptide: str, replace_isoleucine_with_leucine: bool = True) -> str
staticmethod
Remove modifications and optionally replace Isoleucine with Leucine.
| PARAMETER | DESCRIPTION |
|---|---|
| `peptide` | The peptide to remove modifications from. TYPE: `str` |
| `replace_isoleucine_with_leucine` | Whether to replace isoleucine with leucine. TYPE: `bool` |

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The peptide with modifications removed. |
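For illustration only; the modification notation this method actually handles is not specified here, so the input below assumes ProForma-style bracketed modifications:

```python
# Hypothetical input; the expected effect is stripping the bracketed
# modification and mapping I -> L when replace_isoleucine_with_leucine is True.
plain = DataProcessor.remove_modifications(
    "EM[UNIMOD:35]AIR", replace_isoleucine_with_leucine=True
)
# e.g. plain == "EMALR"
```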
AccelerateDeNovoPredictor(config: DictConfig)
Predictor class that uses the Accelerate library.
s3: S3FileHandler
property
Get the S3 file handler.
| RETURNS | DESCRIPTION |
|---|---|
| `S3FileHandler` | The S3 file handler. |
config = config
instance-attribute
targets: list | None = None
instance-attribute
output_path = self.config.get('output_path', None)
instance-attribute
pred_df: pd.DataFrame | None = None
instance-attribute
results_dict: dict | None = None
instance-attribute
prediction_tokenised_col = self.config.get('prediction_tokenised_col', 'predictions_tokenised')
instance-attribute
prediction_col = self.config.get('prediction_col', 'predictions')
instance-attribute
log_probs_col = self.config.get('log_probs_col', 'log_probs')
instance-attribute
token_log_probs_col = self.config.get('token_log_probs_col', 'token_log_probs')
instance-attribute
save_encoder_outputs = config.get('save_encoder_outputs', False)
instance-attribute
encoder_output_path = config.get('encoder_output_path', None)
instance-attribute
encoder_output_reduction = config.get('encoder_output_reduction', 'mean')
instance-attribute
accelerator = self.setup_accelerator()
instance-attribute
denovo = self.config.get('denovo', False)
instance-attribute
model = self.model.eval()
instance-attribute
residue_set = self.model.residue_set
instance-attribute
test_dataset = self.load_dataset()
instance-attribute
test_dataloader = self.build_dataloader(self.test_dataset)
instance-attribute
decoder = self.setup_decoder()
instance-attribute
metrics = self.setup_metrics()
instance-attribute
running_loss = None
instance-attribute
steps_per_inference = len(self.test_dataloader)
instance-attribute
load_model() -> Tuple[nn.Module, DictConfig]
abstractmethod
Load the model.
setup_decoder() -> Decoder
abstractmethod
Set up the decoder.
setup_data_processor() -> DataProcessor
abstractmethod
Set up the data processor.
get_predictions(batch: Any) -> dict[str, Any]
abstractmethod
Get the predictions for a batch.
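A concrete predictor implements only the four abstract hooks above; dataloading, decoding, metrics, and saving are inherited. A usage sketch, where `MyPredictor` is a hypothetical concrete subclass and the config keys are drawn from the attributes listed above:

```python
from omegaconf import OmegaConf

# MyPredictor is a hypothetical subclass implementing load_model,
# setup_decoder, setup_data_processor and get_predictions.
config = OmegaConf.create(
    {
        "denovo": True,              # assumed to mean no ground-truth targets
        "output_path": "preds.csv",  # hypothetical local path
    }
)
predictor = MyPredictor(config)
pred_df = predictor.predict()  # pandas DataFrame of predictions
```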
postprocess_dataset(dataset: Dataset) -> Dataset
Postprocess the dataset.
load_dataset() -> Dataset
Load the test dataset.
| RETURNS | DESCRIPTION |
|---|---|
| `Dataset` | The test dataset. |
print_sample_batch() -> None
Print a sample batch of the test data.
setup_metrics() -> Metrics
Set up the metrics.
setup_accelerator() -> Accelerator
Set up the accelerator.
build_dataloader(test_dataset: Dataset) -> torch.utils.data.DataLoader
Set up the test dataloader.
predict() -> pd.DataFrame
Predict the test dataset.
predictions_to_df(predictions: dict[str, list]) -> pd.DataFrame
Convert the predictions to a pandas DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| `predictions` | The predictions dictionary. TYPE: `dict[str, list]` |

| RETURNS | DESCRIPTION |
|---|---|
| `pd.DataFrame` | The predictions dataframe. |
postprocess_predictions(pred_df: pd.DataFrame) -> pd.DataFrame
Postprocess the predictions.
Optionally, this can be used to modify the predictions, e.g. for ensembling. By default, this does nothing.
| PARAMETER | DESCRIPTION |
|---|---|
| `pred_df` | The predictions dataframe. TYPE: `pd.DataFrame` |

| RETURNS | DESCRIPTION |
|---|---|
| `pd.DataFrame` | The postprocessed predictions dataframe. |
calculate_metrics(pred_df: pd.DataFrame) -> dict[str, Any] | None
Calculate the metrics.
| PARAMETER | DESCRIPTION |
|---|---|
| `pred_df` | The predictions dataframe. TYPE: `pd.DataFrame` |

| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any] \| None` | The results dictionary containing the metrics. |
save_predictions(pred_df: pd.DataFrame, results_dict: dict[str, list] | None = None) -> None
Save the predictions to a file.
| PARAMETER | DESCRIPTION |
|---|---|
| `pred_df` | The predictions dataframe. TYPE: `pd.DataFrame` |
| `results_dict` | The results dictionary containing the metrics. TYPE: `dict[str, list] \| None` |
save_encoder_outputs_to_parquet(spectrum_ids: list[str], encoder_outputs: list[np.ndarray]) -> None
Save the encoder outputs to a file.
| PARAMETER | DESCRIPTION |
|---|---|
| `spectrum_ids` | The spectrum IDs. TYPE: `list[str]` |
| `encoder_outputs` | The encoder outputs. TYPE: `list[np.ndarray]` |
CosineWarmupScheduler(optimizer: torch.optim.Optimizer, warmup: int, max_iters: int)
Bases: _LRScheduler
Learning rate scheduler with linear warm up followed by cosine shaped decay.
| PARAMETER | DESCRIPTION |
|---|---|
| `optimizer` | Optimizer object. TYPE: `torch.optim.Optimizer` |
| `warmup` | The number of warm-up iterations. TYPE: `int` |
| `max_iters` | The total number of iterations. TYPE: `int` |
get_lr() -> list[float]
Get the learning rate at the current step.
get_lr_factor(epoch: int) -> float
Get the LR factor at the current step.
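The conventional shape of this schedule, as a sketch (the class's exact implementation may differ in edge cases such as `warmup == 0`):

```python
import math


def lr_factor(step: int, warmup: int, max_iters: int) -> float:
    # Cosine decay from 1 to 0 over max_iters, scaled by a linear ramp
    # during the first `warmup` steps.
    factor = 0.5 * (1 + math.cos(math.pi * step / max_iters))
    if step < warmup:
        factor *= step / warmup
    return factor
```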
FinetuneScheduler(model_state_dict: dict, config: DictConfig, steps_per_epoch: int | None = None)
Scheduler for unfreezing parameters of a model.
| PARAMETER | DESCRIPTION |
|---|---|
| `model_state_dict` | The state dictionary of the model. TYPE: `dict` |
| `config` | The configuration for the scheduler. TYPE: `DictConfig` |
| `steps_per_epoch` | The number of steps per epoch. TYPE: `int \| None` |
model_state_dict = model_state_dict
instance-attribute
config = config
instance-attribute
steps_per_epoch = steps_per_epoch
instance-attribute
is_verbose = self.config.get('verbose', False)
instance-attribute
schedule = self._get_schedule()
instance-attribute
next_phase: dict[str, Any] | None = self.schedule.pop(0)
instance-attribute
step(global_step: int) -> None
Step the unfreezing scheduler.
| PARAMETER | DESCRIPTION |
|---|---|
| `global_step` | The global step of the model. TYPE: `int` |
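A usage sketch; only the `verbose` key is documented above, so the shape of the `finetune` config (phase boundaries, which parameter groups to unfreeze) is an assumption of this example:

```python
import torch.nn as nn
from omegaconf import OmegaConf

model = nn.Linear(8, 8)  # stand-in model
finetune_cfg = OmegaConf.create({"verbose": True})  # real configs also define the unfreezing phases

scheduler = FinetuneScheduler(model.state_dict(), finetune_cfg, steps_per_epoch=1000)

# Called once per training step; when global_step reaches the next phase
# boundary, that phase's parameters are unfrozen and the following phase is queued.
for global_step in range(10_000):
    scheduler.step(global_step)
```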
WarmupScheduler(optimizer: torch.optim.Optimizer, warmup: int)
Bases: _LRScheduler
Linear warmup scheduler.
warmup = warmup
instance-attribute
get_lr() -> list[float]
Get the learning rate at the current step.
get_lr_factor(epoch: int) -> float
Get the LR factor at the current step.
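Its LR factor is the standard linear ramp, sketched below (implementation details may differ):

```python
def lr_factor(epoch: int, warmup: int) -> float:
    # Ramp linearly from 0 to 1 over `warmup` steps, then hold at 1.
    return min(1.0, epoch / warmup)
```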
AccelerateDeNovoTrainer(config: DictConfig)
Trainer class that uses the Accelerate library.
run_id: str
property
Get the run ID.
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The run ID. |
s3: S3FileHandler
property
Get the S3 file handler.
| RETURNS | DESCRIPTION |
|---|---|
| `S3FileHandler` | The S3 file handler. |
global_step: int
property
Get the current global training step.
This represents the total number of training steps across all epochs.
| RETURNS | DESCRIPTION |
|---|---|
| `int` | The current global step number. |
epoch: int
property
Get the current training epoch.
This represents the current epoch number in the training process.
| RETURNS | DESCRIPTION |
|---|---|
| `int` | The current epoch number. |
training_state: TrainingState
property
Get the training state.
config = config
instance-attribute
enable_verbose_logging = self.config.get('enable_verbose_logging', True)
instance-attribute
accelerator = self.setup_accelerator()
instance-attribute
residue_set = ResidueSet(residue_masses=(self.config.residues.get('residues')), residue_remapping=(self.config.dataset.get('residue_remapping', None)))
instance-attribute
model = self.setup_model()
instance-attribute
optimizer = self.setup_optimizer()
instance-attribute
lr_scheduler = self.setup_scheduler()
instance-attribute
decoder = self.setup_decoder()
instance-attribute
metrics = self.setup_metrics()
instance-attribute
running_loss = None
instance-attribute
total_steps = self.config.get('training_steps', 2500000)
instance-attribute
finetune_scheduler: FinetuneScheduler | None = FinetuneScheduler(self.model.state_dict(), self.config.get('finetune'))
instance-attribute
steps_per_validation = self.config.get('validation_interval', 100000)
instance-attribute
steps_per_checkpoint = self.config.get('checkpoint_interval', 100000)
instance-attribute
last_validation_metric = None
instance-attribute
best_checkpoint_metric = None
instance-attribute
setup_model() -> nn.Module
abstractmethod
Set up the model.
setup_optimizer() -> torch.optim.Optimizer
abstractmethod
Set up the optimizer.
setup_decoder() -> Decoder
abstractmethod
Set up the decoder.
setup_data_processors() -> tuple[DataProcessor, DataProcessor]
abstractmethod
Set up the training and validation data processors.
save_model(is_best_checkpoint: bool = False) -> None
abstractmethod
Save the model.
forward(batch: Any) -> tuple[torch.Tensor, dict[str, torch.Tensor]]
abstractmethod
Forward pass for the model to calculate loss.
get_predictions(batch: Any) -> tuple[list[str] | list[list[str]], list[str] | list[list[str]]]
abstractmethod
Get the predictions for a batch.
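A skeleton of a concrete trainer, for illustration only; it assumes `AccelerateDeNovoTrainer`, `DataProcessor`, and `Decoder` are imported from this module, and all method bodies are placeholders:

```python
from typing import Any

import torch
import torch.nn as nn


class MyTrainer(AccelerateDeNovoTrainer):  # hypothetical subclass
    def setup_model(self) -> nn.Module:
        ...  # build the model from self.config and self.residue_set

    def setup_optimizer(self) -> torch.optim.Optimizer:
        ...  # e.g. torch.optim.Adam(self.model.parameters(), lr=self.config.get("lr", 1e-4))

    def setup_decoder(self) -> Decoder:
        ...

    def setup_data_processors(self) -> tuple[DataProcessor, DataProcessor]:
        ...  # one processor for training data, one for validation data

    def save_model(self, is_best_checkpoint: bool = False) -> None:
        ...  # write a checkpoint, optionally tagged as the best so far

    def forward(self, batch: Any) -> tuple[torch.Tensor, dict[str, torch.Tensor]]:
        ...  # return (loss, auxiliary tensors to log)

    def get_predictions(self, batch: Any) -> tuple[list[str], list[str]]:
        ...  # return (predicted sequences, target sequences)


# MyTrainer(config).train()  # config: a DictConfig with the keys listed above
```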
convert_interval_to_steps(interval: float | int, steps_per_epoch: int) -> int
staticmethod
Convert an interval to steps.
| PARAMETER | DESCRIPTION |
|---|---|
| `interval` | The interval to convert. TYPE: `float \| int` |
| `steps_per_epoch` | The number of steps per epoch. TYPE: `int` |

| RETURNS | DESCRIPTION |
|---|---|
| `int` | The number of steps. |
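A worked example under one plausible reading (an assumption; the docstring does not state the interval's unit), namely that an interval expressed in epochs is multiplied by `steps_per_epoch`:

```python
# 0.5 epochs at 2,000 steps per epoch -> 1,000 steps (assumed semantics).
steps = AccelerateDeNovoTrainer.convert_interval_to_steps(0.5, steps_per_epoch=2000)
```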
log_if_verbose(message: str, level: str = 'info') -> None
Log a message if verbose logging is enabled.
setup_metrics() -> Metrics
Set up the metrics.
setup_accelerator() -> Accelerator
Set up the accelerator.
build_dataloaders(train_dataset: Dataset, valid_dataset: Dataset) -> tuple[torch.utils.data.DataLoader, torch.utils.data.DataLoader]
Set up the training and validation dataloaders.
setup_scheduler() -> torch.optim.lr_scheduler.LRScheduler
Set up the learning rate scheduler.
| RETURNS | DESCRIPTION |
|---|---|
| `torch.optim.lr_scheduler.LRScheduler` | The learning rate scheduler. |
setup_neptune() -> None
Set up Neptune logging.
setup_tensorboard() -> None
Set up TensorBoard logging.
load_datasets() -> tuple[Dataset, Dataset, int, int]
Load the training and validation datasets.
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[Dataset, Dataset, int, int]` | The training and validation datasets. |
print_sample_batch() -> None
Print a sample batch of the training data.
save_accelerator_state(is_best_checkpoint: bool = False) -> None
Save the accelerator state.
check_if_best_checkpoint() -> bool
Check if the last validation metric is the best metric.
load_accelerator_state() -> None
Load the accelerator state.
load_model_state() -> None
Load the model state.
update_model_state(model_state: dict[str, torch.Tensor], model_config: DictConfig) -> dict[str, torch.Tensor]
Update the model state.
update_vocab(model_state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]
Update the vocabulary of the model.
train() -> None
Train the model.
prepare_batch(batch: Iterable[Any]) -> Any
Prepare a batch for training.
Manually move tensors to accelerator.device since we do not prepare our dataloaders with the accelerator.
| PARAMETER | DESCRIPTION |
|---|---|
| `batch` | The batch to prepare. TYPE: `Iterable[Any]` |

| RETURNS | DESCRIPTION |
|---|---|
| `Any` | The prepared batch. |
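A sketch of the conventional implementation, assuming the batch is a dict of tensors (non-tensor metadata values pass through untouched); the actual method may differ:

```python
from typing import Any

import torch


def prepare_batch(self, batch: dict[str, Any]) -> dict[str, Any]:
    # Move tensors to the accelerator's device; leave metadata as-is.
    return {
        key: value.to(self.accelerator.device) if isinstance(value, torch.Tensor) else value
        for key, value in batch.items()
    }
```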
train_epoch() -> None
Train the model for one epoch.
validate_epoch(num_sanity_steps: int | None = None, calculate_metrics: bool = True) -> None
Validate for one epoch.
NeptuneSummaryWriter(log_dir: str, run: neptune.Run)
Bases: SummaryWriter
Combine a TensorBoard SummaryWriter with a Neptune Run.
run = run
instance-attribute
add_scalar(tag: str, scalar_value: float, global_step: int | float | None = None) -> None
Record a scalar to TensorBoard and Neptune.
add_text(tag: str, text_string: str, global_step: Optional[int] = None, walltime: Optional[float] = None) -> None
Record text to TensorBoard and Neptune.
add_hparams(hparam_dict: dict, metric_dict: dict, hparam_domain_discrete: Optional[Dict[str, List[Any]]] = None, run_name: Optional[str] = None, global_step: Optional[int] = None) -> None
Add a set of hyperparameters to be compared in Neptune, as in TensorBoard.
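A usage sketch; the Neptune project name and log directory are hypothetical:

```python
import neptune

run = neptune.init_run(project="my-workspace/my-project")
writer = NeptuneSummaryWriter(log_dir="runs/exp1", run=run)

# Each call is mirrored to both TensorBoard and Neptune.
writer.add_scalar("train/loss", 0.42, global_step=100)
writer.add_text("notes", "warmup finished", global_step=100)
```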
Timer(total_steps: int | None = None)
Timer for training and validation.
start_time = time.time()
instance-attribute
total_steps = total_steps
instance-attribute
current_step = 0
instance-attribute
start() -> None
Restart the timer.
step() -> None
Step the timer.
get_delta() -> float
Get the time delta since the timer was started.
get_eta(current_step: int | None = None) -> float
Get the estimated time to completion.
get_total_time() -> float
Get the total time expected to complete all steps.
get_rate(current_step: int | None = None) -> float
Get the rate of steps per second.
get_step_time(current_step: int | None = None) -> float
Get the time per step.
get_time_str() -> str
Get the time delta since the timer was started, formatted as a string.
get_eta_str(current_step: int | None = None) -> str
Get the estimated time to completion, formatted as a string.
get_total_time_str() -> str
Get the total time expected to complete all steps, formatted as a string.
get_rate_str(current_step: int | None = None) -> str
Get the rate of steps per second, formatted as a string.
get_step_time_rate_str(current_step: int | None = None) -> str
Get the time per step and step rate, formatted as a string.
get_step_time_str(current_step: int | None = None) -> str
Get the time per step, formatted as a string.
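A usage sketch of the timer in a training loop:

```python
import time

timer = Timer(total_steps=1_000)
timer.start()  # restart the clock
for _ in range(1_000):
    time.sleep(0.001)  # stand-in for one training step
    timer.step()

print(timer.get_rate_str())  # steps per second so far
print(timer.get_eta_str())   # estimated time to completion
```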
TrainingState()
Training state for tracking training progress.
This class is used by Accelerate to save and load training state during checkpointing and resuming training runs. It tracks the current epoch and global step of training.
Initialize training state with zeroed counters.
global_step: int
property
Get the current global step.
epoch: int
property
Get the current epoch.
state_dict() -> dict[str, Any]
Get the state dictionary for saving.
| RETURNS | DESCRIPTION |
|---|---|
| `dict[str, Any]` | Dictionary containing the current training state. |
load_state_dict(state_dict: dict[str, Any]) -> None
Load state from a dictionary.
| PARAMETER | DESCRIPTION |
|---|---|
| `state_dict` | Dictionary containing the training state to load. TYPE: `dict[str, Any]` |
step() -> None
Increment the global step.
step_epoch() -> None
Increment the epoch.
unstep_epoch() -> None
Decrement the epoch.
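Because it exposes `state_dict`/`load_state_dict`, the state can be registered with Accelerate's checkpointing. A sketch, with a hypothetical checkpoint directory:

```python
from accelerate import Accelerator

accelerator = Accelerator()
state = TrainingState()
accelerator.register_for_checkpointing(state)  # saved/restored via save_state/load_state

state.step()        # after each optimizer step
state.step_epoch()  # at the end of each epoch

accelerator.save_state("checkpoints/latest")  # hypothetical path
```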