Dataset Info

class mlip.data.dataset_info.DatasetInfo(*, atomic_energies_map: dict[int, float], cutoff_distance_angstrom: float, avg_num_neighbors: float = 1.0, avg_r_min_angstrom: float | None = None, scaling_mean: float = 0.0, scaling_stdev: float = 1.0)

Pydantic dataclass holding information computed from the dataset that is (potentially) required by the models.

atomic_energies_map

A dictionary mapping the atomic numbers to the computed average atomic energies for that element.

Type:

dict[int, float]

cutoff_distance_angstrom

The graph cutoff distance that was used in the dataset in Angstrom.

Type:

float

avg_num_neighbors

The mean number of neighbors an atom has in the dataset.

Type:

float

avg_r_min_angstrom

The mean minimum edge distance for a structure in the dataset in Angstrom.

Type:

float | None

scaling_mean

The mean used for the rescaling of the dataset values, the default being 0.0.

Type:

float

scaling_stdev

The standard deviation used for the rescaling of the dataset values, the default being 1.0.

Type:

float
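The role of scaling_mean and scaling_stdev can be illustrated with a minimal sketch. The direction of the transformation (multiply by the standard deviation, then add the mean) is an assumption made for illustration; this page does not state how the models apply these values. The function names rescale and normalize are hypothetical and not part of mlip.

```python
def rescale(value: float, scaling_mean: float = 0.0, scaling_stdev: float = 1.0) -> float:
    """Map a normalized value back to dataset units (assumed convention)."""
    return value * scaling_stdev + scaling_mean


def normalize(value: float, scaling_mean: float = 0.0, scaling_stdev: float = 1.0) -> float:
    """Inverse of `rescale`: map a dataset value to normalized units."""
    return (value - scaling_mean) / scaling_stdev


# With the documented defaults (0.0 and 1.0), both maps are the identity.
assert rescale(2.5) == 2.5
assert normalize(2.5) == 2.5
```

Under this convention, the defaults leave dataset values unchanged, which is consistent with them being the "no rescaling" choice.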

__init__(**data: Any) → None

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.
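To show which fields are required and which fall back to defaults, here is a sketch using a standard-library dataclass that mirrors the documented signature. DatasetInfoSketch is a hypothetical stand-in, not the real class: it performs no Pydantic validation and does not enforce keyword-only arguments. The energy values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DatasetInfoSketch:
    """Hypothetical stand-in mirroring the documented DatasetInfo fields."""

    # Required fields (no default in the documented signature).
    atomic_energies_map: Dict[int, float]
    cutoff_distance_angstrom: float
    # Optional fields with the documented defaults.
    avg_num_neighbors: float = 1.0
    avg_r_min_angstrom: Optional[float] = None
    scaling_mean: float = 0.0
    scaling_stdev: float = 1.0


# Only the two required fields are passed; the rest take their defaults.
info = DatasetInfoSketch(
    atomic_energies_map={1: -13.6, 8: -2040.0},  # made-up illustrative values
    cutoff_distance_angstrom=5.0,
)
```

With the real class, the same keyword arguments would additionally be validated, and invalid input would raise a ValidationError as described above.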

mlip.data.dataset_info.compute_dataset_info_from_graphs(graphs: list[GraphsTuple], cutoff_distance_angstrom: float, z_table: AtomicNumberTable, avg_num_neighbors: float | None = None, avg_r_min_angstrom: float | None = None) → DatasetInfo

Computes the dataset info from graphs, typically training set graphs.

Parameters:
  • graphs – The graphs to compute the dataset info from, typically the training set graphs.

  • cutoff_distance_angstrom – The graph distance cutoff in Angstrom to store in the dataset info.

  • z_table – The atomic numbers table needed to produce the correct atomic energies map keys.

  • avg_num_neighbors – The optionally pre-computed average number of neighbors. If provided, we skip recomputing this.

  • avg_r_min_angstrom – The optionally pre-computed average minimum radius. If provided, we skip recomputing this.

Returns:

The dataset info object populated with the computed data.
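The two neighbor statistics can be sketched in plain Python, following the attribute definitions on this page (mean number of neighbors per atom, and mean over structures of the minimum edge distance). This is not the library's implementation. It assumes a toy representation in which each structure is a (positions, edges) pair and every neighbor pair is stored as two directed edges. All names below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]
Edge = Tuple[int, int]  # (sender index, receiver index), directed
Structure = Tuple[Sequence[Position], Sequence[Edge]]


def edge_lengths(positions: Sequence[Position], edges: Sequence[Edge]) -> List[float]:
    """Euclidean length of every edge in one structure."""
    return [math.dist(positions[s], positions[r]) for s, r in edges]


def average_num_neighbors(structures: Sequence[Structure]) -> float:
    """Total number of directed edges divided by total number of atoms."""
    total_edges = sum(len(edges) for _, edges in structures)
    total_atoms = sum(len(positions) for positions, _ in structures)
    return total_edges / total_atoms


def average_r_min(structures: Sequence[Structure]) -> float:
    """Mean over structures of the shortest edge; edgeless structures are skipped."""
    minima = [min(edge_lengths(p, e)) for p, e in structures if e]
    return sum(minima) / len(minima)


# Toy data: a three-atom chain and a two-atom pair, edges in both directions.
chain: Structure = (
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0, 1), (1, 0), (1, 2), (2, 1)],
)
pair: Structure = ([(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)], [(0, 1), (1, 0)])

neighbors = average_num_neighbors([chain, pair])
r_min = average_r_min([chain, pair])
```

Values computed this way could, in principle, be passed as the pre-computed avg_num_neighbors and avg_r_min_angstrom arguments so that they are not recomputed.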