Dataset Info

class mlip.data.dataset_info.DatasetInfo(*, atomic_energies_map: dict[int, float], cutoff_distance_angstrom: float, avg_num_neighbors: float = 1.0, avg_r_min_angstrom: float | None = None, scaling_mean: float = 0.0, scaling_stdev: float = 1.0)

Pydantic dataclass holding information computed from the dataset that is (potentially) required by the models.

atomic_energies_map

A dictionary mapping each atomic number to the computed average atomic energy of that element.

Type: dict[int, float]

cutoff_distance_angstrom

The graph cutoff distance, in Angstrom, that was used for the dataset.

Type: float

avg_num_neighbors

The mean number of neighbors an atom has in the dataset.

Type: float
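As a rough illustration of this statistic, the sketch below averages per-atom neighbor counts over a set of graphs. It assumes directed edge lists in which every edge contributes one neighbor to its receiving atom. The function `mean_num_neighbors` and the `(num_atoms, receivers)` layout are hypothetical and not part of the mlip API, whose exact counting convention may differ.

```python
def mean_num_neighbors(graphs):
    """Average number of neighbors per atom over all graphs.

    Each graph is a (num_atoms, receivers) pair (hypothetical layout), where
    `receivers` holds the receiving atom index of every directed edge.
    """
    total_atoms = 0
    total_edges = 0
    for num_atoms, receivers in graphs:
        total_atoms += num_atoms
        # Every directed edge adds exactly one neighbor to its receiver.
        total_edges += len(receivers)
    if total_atoms == 0:
        raise ValueError("Cannot average over zero atoms.")
    return total_edges / total_atoms


# Two toy graphs: 3 atoms with 4 edges, and 2 atoms with 2 edges.
toy_graphs = [(3, [0, 1, 1, 2]), (2, [0, 1])]
avg_neighbors = mean_num_neighbors(toy_graphs)  # 6 edges / 5 atoms
```

Averaging over all atoms at once, rather than averaging per-graph means, weights large structures more heavily. That is a choice made for this sketch.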

avg_r_min_angstrom

The minimum edge distance of a structure, in Angstrom, averaged over the structures in the dataset.

Type: float | None
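One way to picture this quantity is sketched below: find the shortest edge in each structure, then average those minima. The sketch assumes non-periodic structures given as plain coordinate lists and treats every atom pair within the cutoff as an edge. The function `mean_min_edge_distance` is hypothetical and is not taken from mlip.

```python
import math


def mean_min_edge_distance(structures, cutoff_angstrom):
    """Mean over structures of the shortest pair distance within the cutoff.

    Each structure is a list of (x, y, z) positions in Angstrom (hypothetical
    layout, no periodic boundaries). Structures without any pair inside the
    cutoff have no edges and are skipped.
    """
    minima = []
    for positions in structures:
        # Distances of all unordered atom pairs in this structure.
        distances = [
            math.dist(a, b)
            for i, a in enumerate(positions)
            for b in positions[i + 1:]
        ]
        in_range = [d for d in distances if d <= cutoff_angstrom]
        if in_range:
            minima.append(min(in_range))
    if not minima:
        # No edges anywhere: this sketch reports "no value" as None.
        return None
    return sum(minima) / len(minima)


toy_structures = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)],  # shortest pair: 1.0
    [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)],                   # shortest pair: 2.0
]
avg_r_min = mean_min_edge_distance(toy_structures, cutoff_angstrom=5.0)
```

Returning `None` when no edges exist is a decision of this sketch and says nothing about when the library leaves the field unset.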

scaling_mean

The mean used to rescale the dataset values. Defaults to 0.0.

Type: float

scaling_stdev

The standard deviation used to rescale the dataset values. Defaults to 1.0.

Type: float
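The sketch below shows one common way a mean and a standard deviation are used to rescale values and to undo the rescaling. This page does not say which quantities mlip rescales or in which direction, so `standardize` and `unstandardize` are hypothetical helpers, not library functions. With the documented defaults of 0.0 and 1.0, both transforms leave a value unchanged.

```python
def standardize(value, scaling_mean=0.0, scaling_stdev=1.0):
    """Shift by the mean, then divide by the standard deviation."""
    return (value - scaling_mean) / scaling_stdev


def unstandardize(value, scaling_mean=0.0, scaling_stdev=1.0):
    """Inverse of `standardize`: multiply by the stdev, then add the mean."""
    return value * scaling_stdev + scaling_mean


# With the defaults, rescaling is the identity.
unchanged = standardize(3.5)

# With non-default statistics, the two helpers round-trip a value.
scaled = standardize(-4.0, scaling_mean=1.0, scaling_stdev=2.5)
restored = unstandardize(scaled, scaling_mean=1.0, scaling_stdev=2.5)
```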

__init__(**data: Any) → None

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

mlip.data.dataset_info.compute_dataset_info_from_graphs(graphs: list[GraphsTuple], cutoff_distance_angstrom: float, z_table: AtomicNumberTable, avg_num_neighbors: float | None = None, avg_r_min_angstrom: float | None = None) → DatasetInfo

Computes the dataset info from graphs, typically training set graphs.

Parameters:

graphs – The graphs to compute the dataset info from, typically the training set graphs.
cutoff_distance_angstrom – The graph distance cutoff in Angstrom to store in the dataset info.
z_table – The atomic numbers table needed to produce the correct atomic energies map keys.
avg_num_neighbors – The optionally pre-computed average number of neighbors. If provided, it is not recomputed.
avg_r_min_angstrom – The optionally pre-computed average minimum edge distance in Angstrom. If provided, it is not recomputed.

Returns:

The dataset info object populated with the computed data.
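The role of `z_table` described above can be pictured with a small sketch: values computed per element index are re-keyed by atomic number so that the resulting map matches the `dict[int, float]` type of `atomic_energies_map`. The sketch assumes the table boils down to an ordered list of atomic numbers. The function `build_atomic_energies_map`, the list `zs`, and the energy values are hypothetical placeholders, not mlip API or real data.

```python
def build_atomic_energies_map(zs, energies_by_index):
    """Key per-index energies by atomic number.

    `zs[i]` is the atomic number of the element stored at index i
    (assumed table layout); `energies_by_index[i]` is the value for it.
    """
    if len(zs) != len(energies_by_index):
        raise ValueError("Need exactly one energy per atomic number.")
    return {int(z): float(e) for z, e in zip(zs, energies_by_index)}


# Placeholder values for H, C and O, chosen only for illustration.
zs = [1, 6, 8]
energies_by_index = [-0.5, -37.8, -75.0]
energies_map = build_atomic_energies_map(zs, energies_by_index)
```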