Dataset Configs

class mlip.data.configs.ChemicalSystemsReaderConfig(*, train_dataset_paths: str | Path | list[str | Path], valid_dataset_paths: str | Path | list[str | Path] | None = None, test_dataset_paths: str | Path | list[str | Path] | None = None, train_num_to_load: Annotated[int, Gt(gt=0)] | None = None, valid_num_to_load: Annotated[int, Gt(gt=0)] | None = None, test_num_to_load: Annotated[int, Gt(gt=0)] | None = None)

Pydantic-based config for preprocessing data and loading it into `ChemicalSystem` objects.

train_dataset_paths

Path(s) to the training set(s). Cannot be empty. Converted to a list after validation.

Type:

str | pathlib.Path | list[str | pathlib.Path]

valid_dataset_paths

Path(s) to the validation set(s). Can be empty. Converted to a list after validation. Default is None.

Type:

str | pathlib.Path | list[str | pathlib.Path] | None

test_dataset_paths

Path(s) to the test set(s). Can be empty. Converted to a list after validation. Default is None.

Type:

str | pathlib.Path | list[str | pathlib.Path] | None
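
The three path fields accept either a single path or a list of paths and end up as a list after validation. The following is a minimal sketch of that kind of normalization in plain Python. It is not mlip's validator: the helper name `as_path_list` and the file names are hypothetical, and leaving `None` untouched is an assumption, since the behavior for unset optional fields is not described here.

```python
from pathlib import Path


def as_path_list(paths):
    """Hypothetical helper: wrap a single path in a list, keep lists as lists."""
    if paths is None:
        return None  # assumption: an unset optional field stays unset
    if isinstance(paths, (str, Path)):
        return [paths]
    return list(paths)


print(as_path_list("train.xyz"))                 # ['train.xyz']
print(as_path_list([Path("a.xyz"), "b.xyz"]))    # two-element list
print(as_path_list(None))                        # None
```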

train_num_to_load

Number of training set data points to load from the given dataset. Must be positive. Default is None, which loads all data points. If multiple dataset paths are given, the limit applies to each path separately, not to the total.

Type:

int | None

valid_num_to_load

Number of validation set data points to load from the given dataset. Must be positive. Default is None, which loads all data points. If multiple dataset paths are given, the limit applies to each path separately, not to the total.

Type:

int | None

test_num_to_load

Number of test set data points to load from the given dataset. Must be positive. Default is None, which loads all data points. If multiple dataset paths are given, the limit applies to each path separately, not to the total.

Type:

int | None
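
Because the `*_num_to_load` limits apply per path rather than in total, the number of loaded data points can exceed the limit when several paths are given. The sketch below illustrates the arithmetic only. The function `count_loaded` and the dataset sizes are hypothetical, and it assumes that a dataset smaller than the limit simply contributes all of its points.

```python
def count_loaded(dataset_sizes, num_to_load=None):
    """Total number of data points loaded across all dataset paths."""
    if num_to_load is None:
        # No limit: every data point of every dataset is loaded.
        return sum(dataset_sizes)
    # The limit is applied to each dataset separately.
    return sum(min(size, num_to_load) for size in dataset_sizes)


sizes = [1000, 250, 40]           # hypothetical sizes of three datasets
print(count_loaded(sizes))        # 1290
print(count_loaded(sizes, 100))   # 100 + 100 + 40 = 240
```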

class mlip.data.configs.GraphDatasetBuilderConfig(*, graph_cutoff_angstrom: Annotated[float, Gt(gt=0)] = 5.0, max_n_node: Annotated[int, Gt(gt=0)] | None = None, max_n_edge: Annotated[int, Gt(gt=0)] | None = None, batch_size: Annotated[int, Gt(gt=0)] = 16, num_batch_prefetch: Annotated[int, Gt(gt=0)] = 1, batch_prefetch_num_devices: Annotated[int, Gt(gt=0)] = 1, use_formation_energies: bool = False, avg_num_neighbors: float | None = None, avg_r_min_angstrom: float | None = None)

Pydantic-based config related to graph dataset building and preprocessing.

graph_cutoff_angstrom

Graph cutoff distance in Angstrom to apply when creating the graphs. Default is 5.0.

Type:

float

max_n_node

Multiplied by the batch size to give the maximum number of nodes allowed in a batch. Note that a batch always contains max_n_node * batch_size nodes, because unused slots are filled with dummy nodes. If set to None, a reasonable value is computed automatically. Default is None.

Type:

int | None

max_n_edge

Multiplied by the batch size to give the maximum number of edges allowed in a batch. Note that a batch always contains max_n_edge * batch_size edges, because unused slots are filled with dummy edges. If set to None, a reasonable value is computed automatically. Default is None.

Type:

int | None
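
As a worked example of the fixed batch shape described for `max_n_node` and `max_n_edge`, the sketch below computes the padded node and edge counts. The function name `padded_batch_size` and all numbers are illustrative assumptions, not values taken from mlip.

```python
def padded_batch_size(max_n_node, max_n_edge, batch_size):
    """Return the fixed (node, edge) counts of a padded batch."""
    return max_n_node * batch_size, max_n_edge * batch_size


n_nodes, n_edges = padded_batch_size(max_n_node=32, max_n_edge=288, batch_size=16)
print(n_nodes, n_edges)  # 512 4608

# If the real graphs in a batch hold 431 nodes, the rest are dummy nodes.
real_nodes = 431
print(n_nodes - real_nodes)  # 81
```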

batch_size

The number of graphs in a batch. If the maximum number of nodes or edges is reached before this number of graphs, the batch is filled up with dummy graphs. Default is 16.

Type:

int
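
To illustrate how a batch can end up with dummy graphs, here is a simplified greedy packing sketch. It is an assumption about the general idea, not mlip's batching code: the function `pack_one_batch` is hypothetical, and a real implementation may differ, for example in how it reserves slots for padding.

```python
def pack_one_batch(graph_sizes, max_n_node, max_n_edge, batch_size):
    """Greedily take graphs until a node, edge, or graph budget is hit.

    graph_sizes is a list of (n_node, n_edge) tuples.
    Returns (number of real graphs taken, number of dummy graphs).
    """
    node_budget = max_n_node * batch_size
    edge_budget = max_n_edge * batch_size
    n_node = n_edge = taken = 0
    for nodes, edges in graph_sizes:
        if taken == batch_size:
            break  # graph budget reached
        if n_node + nodes > node_budget or n_edge + edges > edge_budget:
            break  # node or edge budget would be exceeded
        n_node += nodes
        n_edge += edges
        taken += 1
    # Remaining graph slots are filled with dummy graphs.
    return taken, batch_size - taken


# Three small graphs fit, the fourth would exceed the node budget of 40.
sizes = [(10, 40), (10, 40), (10, 40), (30, 200)]
print(pack_one_batch(sizes, max_n_node=10, max_n_edge=60, batch_size=4))  # (3, 1)
```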

num_batch_prefetch

Number of batched graphs to prefetch while iterating over batches. Default is 1.

Type:

int

batch_prefetch_num_devices

Number of threads to use for prefetching. Default is 1.

Type:

int

use_formation_energies

Whether the energies in the dataset should be transformed by subtracting the average atomic energies. Default is False. If you set this to True, make sure that the model assumes "zero" atomic energies, which can be set in the model hyperparameters.

Type:

bool
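
The transformation behind `use_formation_energies` can be sketched as subtracting one average atomic energy per atom from the total energy. The code below is an illustration under that assumption. The function name `formation_energy` is hypothetical and the energy values are made up for the example, not physical reference data.

```python
from collections import Counter


def formation_energy(total_energy, atomic_numbers, avg_atomic_energies):
    """Subtract the summed average atomic energies from the total energy."""
    counts = Counter(atomic_numbers)
    reference = sum(n * avg_atomic_energies[z] for z, n in counts.items())
    return total_energy - reference


# Made-up numbers for a three-atom system (one atom with Z=8, two with Z=1).
avg_atomic_energies = {1: -0.5, 8: -75.0}
print(formation_energy(-76.25, [8, 1, 1], avg_atomic_energies))  # -0.25
```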

avg_num_neighbors

The pre-computed average number of neighbors. Default is None.

Type:

float | None

avg_r_min_angstrom

The pre-computed average minimum distance between nodes, in Angstrom. Default is None.

Type:

float | None
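
For orientation, the two pre-computed statistics can be understood roughly as follows. This is a plain-Python sketch for a single non-periodic structure, assuming that neighbors are all atoms within the cutoff and that the minimum distance is taken per node to its nearest neighbor. The exact definitions used by mlip are not stated here, and the function `neighbor_stats` is hypothetical.

```python
import math


def neighbor_stats(positions, cutoff):
    """Return (average neighbor count, average nearest-neighbor distance)."""
    n = len(positions)
    neighbor_counts = []
    min_distances = []
    for i in range(n):
        dists = [math.dist(positions[i], positions[j]) for j in range(n) if j != i]
        within = [d for d in dists if d <= cutoff]
        neighbor_counts.append(len(within))
        if within:
            # Only nodes with at least one neighbor contribute a minimum distance.
            min_distances.append(min(within))
    avg_num_neighbors = sum(neighbor_counts) / n
    avg_r_min = sum(min_distances) / len(min_distances) if min_distances else None
    return avg_num_neighbors, avg_r_min


# Four atoms on a line, coordinates in Angstrom.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(neighbor_stats(positions, cutoff=2.0))  # (1.5, 1.0)
```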