Dataset Info

class mlip.data.dataset_info.DatasetInfo(*, dataset_name: str | list[str] | None = None, num_graphs: int | None = None, atomic_energies_map: dict[int, float] | list[dict[int, float]], total_charge_set: set[int] | None = None, graph_cutoff_angstrom: float, long_range_cutoff_angstrom: float | None = None, avg_num_neighbors: float = 1.0, avg_r_min_angstrom: float | None = None, scaling_mean: float = 0.0, scaling_stdev: float = 1.0, atomic_energies_removed: bool = False)

Information computed from the dataset that is required by the models.

In multi-dataset mode (GraphDatasetBuilder in MULTI mode), only the per-dataset-identity fields (dataset_name and atomic_energies_map) accept a list form, with one entry per dataset. Every other statistic stays scalar: it is either required to match across datasets (graph_cutoff_angstrom, long_range_cutoff_angstrom), aggregated before being stored (num_graphs, avg_num_neighbors, avg_r_min_angstrom, total_charge_set), or inherited from the pretrained entry on the fine-tuning path.
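The aggregation rules above can be sketched in plain Python. This is a minimal illustration, not mlip's implementation: PerDatasetStats and aggregate are hypothetical names, and the numeric values are made up. It keeps the list-form fields per dataset, checks that the cutoff matches, sums num_graphs, takes the union of the charge sets, and weights avg_num_neighbors by num_graphs.

```python
from dataclasses import dataclass


@dataclass
class PerDatasetStats:
    """Hypothetical container for one dataset's statistics (not an mlip type)."""

    name: str
    num_graphs: int
    atomic_energies_map: dict[int, float]
    total_charge_set: set[int]
    graph_cutoff_angstrom: float
    avg_num_neighbors: float


def aggregate(stats: list[PerDatasetStats]) -> dict:
    """Combine per-dataset statistics following the rules described above."""
    # The cutoff must match across datasets, so it stays a single scalar.
    cutoffs = {s.graph_cutoff_angstrom for s in stats}
    if len(cutoffs) != 1:
        raise ValueError(f"graph_cutoff_angstrom differs across datasets: {cutoffs}")

    total_graphs = sum(s.num_graphs for s in stats)
    weighted_neighbors = sum(s.num_graphs * s.avg_num_neighbors for s in stats)
    return {
        # Per-dataset-identity fields keep one entry per dataset.
        "dataset_name": [s.name for s in stats],
        "atomic_energies_map": [s.atomic_energies_map for s in stats],
        # Aggregated fields collapse to a single value.
        "num_graphs": total_graphs,
        "total_charge_set": set().union(*(s.total_charge_set for s in stats)),
        "avg_num_neighbors": weighted_neighbors / total_graphs,
        "graph_cutoff_angstrom": cutoffs.pop(),
    }


# Made-up illustrative values for two datasets.
stats = [
    PerDatasetStats("set_a", 100, {1: -13.6, 8: -2040.0}, {0}, 5.0, 10.0),
    PerDatasetStats("set_b", 300, {1: -13.5, 6: -1030.0}, {0, 1}, 5.0, 20.0),
]
merged = aggregate(stats)
print(merged["num_graphs"], merged["avg_num_neighbors"], merged["total_charge_set"])
```

With these inputs the larger dataset dominates the weighted mean, which is why avg_num_neighbors lands closer to 20 than to 10.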

dataset_name

Name of the dataset, or names of the datasets in multi-dataset settings. Defaults to None.

Type:

str | list[str] | None

num_graphs

Total number of graphs in the dataset used to compute the statistics below (summed across datasets in multi-dataset mode). Defaults to None.

Type:

int | None

atomic_energies_map

Mapping from atomic number to average atomic energy. When using multiple datasets, this is a list of such mappings.

Type:

dict[int, float] | list[dict[int, float]]

graph_cutoff_angstrom

Graph cutoff distance in Ångström used to build the neighbor lists. Must match across datasets in multi-dataset mode.

Type:

float

total_charge_set

Set of total charge values supported by the dataset; union across datasets in multi-dataset mode. Defaults to None.

Type:

set[int] | None

long_range_cutoff_angstrom

Long-range cutoff distance in Ångström used to build the long-range neighbor lists. Defaults to None, in which case no long-range graph is built and no long-range interactions are computed.

Type:

float | None

avg_num_neighbors

Mean number of neighbors per atom, weighted by num_graphs across datasets in multi-dataset mode. Defaults to 1.0.

Type:

float

avg_r_min_angstrom

Mean of the per-structure minimum edge distances in Ångström, weighted across datasets. None when not computed.

Type:

float | None

scaling_mean

Mean used for energy rescaling. Defaults to 0.0.

Type:

float

scaling_stdev

Standard deviation used for energy rescaling. Defaults to 1.0.

Type:

float

atomic_energies_removed

Whether the atomic energies were subtracted from the dataset(s) during the building process. The training loop class needs this information to adjust the force field settings accordingly. Defaults to False.

Type:

bool
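As an illustration of what subtracting atomic energies can mean, the sketch below removes a per-atom baseline from a structure's total energy using a mapping shaped like atomic_energies_map. It assumes the removal is a simple sum over atoms of the mapped energies, which this page does not spell out. The function name and all numeric values are hypothetical.

```python
def subtract_atomic_energies(
    total_energy: float,
    atomic_numbers: list[int],
    atomic_energies_map: dict[int, float],
) -> float:
    """Return the energy left after removing the summed per-atom baseline.

    Assumption: the baseline is the sum of atomic_energies_map[Z] over all atoms.
    """
    baseline = sum(atomic_energies_map[z] for z in atomic_numbers)
    return total_energy - baseline


# Made-up values for a three-atom structure (one atom with Z=8, two with Z=1).
example_map = {1: -0.5, 8: -75.0}
residual = subtract_atomic_energies(-76.4, [8, 1, 1], example_map)
print(residual)  # roughly -0.4, up to floating-point rounding
```

A flag such as atomic_energies_removed then records whether stored energies are already residuals of this kind or still include the baseline.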

__init__(**data: Any) → None

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

property allowed_atomic_numbers: list[int]

List of sorted atomic numbers supported by the dataset.

mlip.data.dataset_info.compute_dataset_info_from_graphs(graphs: list[Graph], graph_cutoff_angstrom: float, avg_num_neighbors: float | None = None, avg_r_min_angstrom: float | None = None, long_range_cutoff_angstrom: float | None = None) → DatasetInfo

Computes the dataset info from graphs, typically training set graphs.

Parameters:
  • graphs – The graphs from which the dataset info is computed, typically the training set graphs.

  • graph_cutoff_angstrom – The graph distance cutoff in Ångström to store in the dataset info.

  • avg_num_neighbors – The optionally pre-computed average number of neighbors. If provided, we skip recomputing this.

  • avg_r_min_angstrom – The optionally pre-computed average minimum edge distance in Ångström. If provided, we skip recomputing this.

  • long_range_cutoff_angstrom – The long-range distance cutoff in Ångström to store in the dataset info. If None, long-range interactions are not computed.

Returns:

The dataset info object populated with the computed data.
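To make the two optional statistics concrete, the sketch below computes them from a simplified stand-in for graph data. It follows the definitions given for avg_num_neighbors (mean number of neighbors per atom) and avg_r_min_angstrom (mean of the per-structure minimum edge distances). The dictionaries are hypothetical and are not mlip's Graph type. The sketch also assumes each neighbor pair is stored once per direction, so that edges divided by atoms gives neighbors per atom.

```python
from statistics import mean

# Hypothetical stand-in for graphs: atom count plus directed edge lengths in Ångström.
structures = [
    {"num_atoms": 3, "edge_lengths": [0.96, 0.96, 1.51, 0.96, 0.96, 1.51]},
    {"num_atoms": 2, "edge_lengths": [0.74, 0.74]},
]


def average_num_neighbors(structures: list[dict]) -> float:
    """Mean neighbors per atom, assuming one stored edge per direction."""
    total_edges = sum(len(s["edge_lengths"]) for s in structures)
    total_atoms = sum(s["num_atoms"] for s in structures)
    return total_edges / total_atoms


def average_r_min(structures: list[dict]) -> float:
    """Mean over structures of the shortest edge length in each structure."""
    # Structures without edges have no minimum distance and are skipped.
    minima = [min(s["edge_lengths"]) for s in structures if s["edge_lengths"]]
    return mean(minima)


neighbors = average_num_neighbors(structures)
r_min = average_r_min(structures)
print(neighbors, r_min)
```

Values computed this way could be passed as the avg_num_neighbors and avg_r_min_angstrom arguments, which the function then stores instead of recomputing.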