Dataset Info¶
- class mlip.data.dataset_info.DatasetInfo(*, dataset_name: str | list[str] | None = None, num_graphs: int | None = None, atomic_energies_map: dict[int, float] | list[dict[int, float]], total_charge_set: set[int] | None = None, graph_cutoff_angstrom: float, long_range_cutoff_angstrom: float | None = None, avg_num_neighbors: float = 1.0, avg_r_min_angstrom: float | None = None, scaling_mean: float = 0.0, scaling_stdev: float = 1.0, atomic_energies_removed: bool = False)¶
Information computed from the dataset that is required by the models.
Only the per-dataset-identity fields (dataset_name and atomic_energies_map) accept a list form, one entry per dataset, in multi-dataset mode (GraphDatasetBuilder in MULTI mode). Every other statistic stays scalar: it is either required to match across datasets (graph_cutoff_angstrom, long_range_cutoff_angstrom), aggregated before being stored (num_graphs, avg_num_neighbors, avg_r_min_angstrom, total_charge_set), or inherited from the pretrained entry on the fine-tuning path.
- dataset_name¶
Name of the dataset, or names of the datasets in multi-dataset settings. Defaults to None.
- Type:
str | list[str] | None
- num_graphs¶
Total number of graphs in the dataset used to compute the statistics below (summed across datasets in multi-dataset mode). Defaults to None.
- Type:
int | None
- atomic_energies_map¶
Mapping from atomic number to average atomic energy. When using multiple datasets, this is a list of such mappings.
- Type:
dict[int, float] | list[dict[int, float]]
- graph_cutoff_angstrom¶
Graph cutoff distance in Ångström used to build the neighbor lists. Must match across datasets in multi-dataset mode.
- Type:
float
- total_charge_set¶
Set of total charge values supported by the dataset; union across datasets in multi-dataset mode. Defaults to None.
- Type:
set[int] | None
- long_range_cutoff_angstrom¶
Long-range cutoff distance in Ångström used to build the long-range neighbor lists. Defaults to None, meaning no long-range graph is built and no long-range interactions are computed.
- Type:
float | None
- avg_num_neighbors¶
Mean number of neighbors per atom, weighted by num_graphs across datasets in multi-dataset mode. Defaults to 1.0.
- Type:
float
- avg_r_min_angstrom¶
Mean of the per-structure minimum edge distances in Ångström, weighted across datasets. None when not computed.
- Type:
float | None
- scaling_mean¶
Mean used for energy rescaling. Defaults to 0.0.
- Type:
float
- scaling_stdev¶
Standard deviation used for energy rescaling. Defaults to 1.0.
- Type:
float
- atomic_energies_removed¶
Whether the atomic energies were subtracted from the dataset(s) by the building process. This information is required by the training loop class to adjust the force field settings accordingly. Defaults to False.
- Type:
bool
- __init__(**data: Any) None¶
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- property allowed_atomic_numbers: list[int]¶
List of sorted atomic numbers supported by the dataset.
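The multi-dataset rules described for the class fields (list form for dataset_name and atomic_energies_map, matching cutoffs, summed num_graphs, num_graphs-weighted avg_num_neighbors, union of total_charge_set) can be illustrated with a minimal, self-contained sketch. It does not use or reproduce the library's own code: PerDatasetStats and aggregate are hypothetical names introduced here, and the numeric values are arbitrary placeholders.

```python
from dataclasses import dataclass


@dataclass
class PerDatasetStats:
    """Hypothetical per-dataset statistics container (not part of mlip)."""

    name: str
    num_graphs: int
    atomic_energies_map: dict[int, float]
    graph_cutoff_angstrom: float
    avg_num_neighbors: float
    total_charge_set: set[int]


def aggregate(stats: list[PerDatasetStats]) -> dict:
    """Combine per-dataset statistics following the rules stated in the text."""
    # The graph cutoff must be identical across datasets.
    cutoffs = {s.graph_cutoff_angstrom for s in stats}
    if len(cutoffs) != 1:
        raise ValueError(
            f"graph_cutoff_angstrom must match across datasets, got {sorted(cutoffs)}"
        )
    total_graphs = sum(s.num_graphs for s in stats)
    return {
        # Per-dataset-identity fields keep one entry per dataset.
        "dataset_name": [s.name for s in stats],
        "atomic_energies_map": [s.atomic_energies_map for s in stats],
        # Every other statistic is stored as a single scalar or set.
        "graph_cutoff_angstrom": cutoffs.pop(),
        "num_graphs": total_graphs,
        "avg_num_neighbors": sum(
            s.avg_num_neighbors * s.num_graphs for s in stats
        ) / total_graphs,
        "total_charge_set": set().union(*(s.total_charge_set for s in stats)),
    }


# Arbitrary placeholder values, for illustration only.
stats = [
    PerDatasetStats("set_a", 100, {1: -0.5, 8: -75.0}, 5.0, 10.0, {0}),
    PerDatasetStats("set_b", 300, {1: -0.6, 6: -37.8}, 5.0, 20.0, {0, 1}),
]
combined = aggregate(stats)
print(combined["num_graphs"], combined["avg_num_neighbors"])
```

With these placeholder inputs the larger dataset dominates the weighted mean, which is the point of weighting by num_graphs rather than averaging the two per-dataset values directly.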
- mlip.data.dataset_info.compute_dataset_info_from_graphs(graphs: list[Graph], graph_cutoff_angstrom: float, avg_num_neighbors: float | None = None, avg_r_min_angstrom: float | None = None, long_range_cutoff_angstrom: float | None = None) DatasetInfo¶
Computes the dataset info from graphs, typically training set graphs.
- Parameters:
graphs – The graphs to compute the statistics from, typically the training set graphs.
graph_cutoff_angstrom – The graph distance cutoff in Ångström to store in the dataset info.
avg_num_neighbors – The optionally pre-computed average number of neighbors. If provided, we skip recomputing this.
avg_r_min_angstrom – The optionally pre-computed average of the per-structure minimum edge distances in Ångström. If provided, we skip recomputing this.
long_range_cutoff_angstrom – The long-range distance cutoff in Ångström to store in the dataset info. If None, long-range interactions are not computed.
- Returns:
The dataset info object populated with the computed data.
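To make the two graph statistics concrete, the following sketch computes a mean number of neighbors per atom and a mean of per-structure minimum edge distances from raw positions. It is one plausible reading of those definitions, not the library's implementation: it assumes non-periodic structures, a brute-force neighbor search, an inclusive cutoff comparison, and averaging of neighbor counts over all atoms in the set. The helper names structure_stats and dataset_stats are hypothetical, and the coordinates are arbitrary.

```python
import math


def structure_stats(positions, cutoff):
    """Return (num_atoms, num_directed_edges, min_edge_distance) for one structure.

    Brute-force, non-periodic search; assumed here for illustration only.
    """
    n = len(positions)
    num_edges = 0
    r_min = math.inf
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(positions[i], positions[j])
            if d <= cutoff:
                num_edges += 1  # each pair is counted once per direction
                r_min = min(r_min, d)
    return n, num_edges, r_min


def dataset_stats(structures, cutoff):
    """Return (avg_num_neighbors, avg_r_min) over a list of structures."""
    total_atoms = 0
    total_edges = 0
    r_mins = []
    for positions in structures:
        n, e, r_min = structure_stats(positions, cutoff)
        total_atoms += n
        total_edges += e
        if math.isfinite(r_min):  # skip structures with no edge inside the cutoff
            r_mins.append(r_min)
    avg_num_neighbors = total_edges / total_atoms
    avg_r_min = sum(r_mins) / len(r_mins) if r_mins else None
    return avg_num_neighbors, avg_r_min


# Two toy structures with arbitrary coordinates (in Ångström).
structures = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.5)],
]
avg_nn, avg_r_min = dataset_stats(structures, cutoff=2.1)
print(avg_nn, avg_r_min)
```

Values computed this way correspond to what the avg_num_neighbors and avg_r_min_angstrom arguments allow a caller to pre-compute and pass in, so that the function skips recomputing them.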