Dataset Configs¶
- class mlip.data.configs.GraphDatasetBuilderConfig(*, graph_cutoff_angstrom: Annotated[float, Gt(gt=0)] = 5.0, long_range_cutoff_angstrom: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Gt(gt=0)])] | None = None, max_n_node: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Gt(gt=0)])] | None = None, max_n_edge: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Gt(gt=0)])] | None = None, max_n_edge_long_range: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Gt(gt=0)])] | None = None, batch_size: Annotated[int, Gt(gt=0)] = 16, num_batch_prefetch_host: Annotated[int, Gt(gt=0)] = 1, num_batch_prefetch_device: Annotated[int, Gt(gt=0)] = 1, use_formation_energies: bool = False, avg_num_neighbors: float | None = None, avg_r_min_angstrom: float | None = None, remove_systems_without_partial_charges: bool = False, allowed_atomic_numbers: list[int] | None = None, excluded_atomic_numbers: list[int] | None = None, allowed_charges: list[int] | None = None, excluded_charges: list[int] | None = None, ensure_no_unseen_total_charges: bool = False, set_none_charges_to_zero: bool = False, homogenize: bool = False)¶
Pydantic-based config used by both SingleGraphDatasetBuilder and GraphDatasetBuilder to ensure the same graph construction and batching parameters are applied consistently across all datasets.
- graph_cutoff_angstrom¶
Graph cutoff distance in Angstrom to apply when creating the graphs. Default is 5.0.
- Type:
float
- long_range_cutoff_angstrom¶
Long range cutoff distance in Ångström used to build the long range neighbor lists. Defaults to None, meaning no long range graph will be built.
- Type:
float | None
- max_n_node¶
This value is multiplied by the batch size to determine the maximum number of nodes allowed in a batch. Note that a batch always contains max_n_node * batch_size nodes, because any unused slots are filled with dummy nodes. If set to None, a reasonable value is computed automatically. Default is None.
- Type:
int | None
- max_n_edge¶
This value is multiplied by the batch size to determine the maximum number of edges allowed in a batch. Note that a batch always contains max_n_edge * batch_size edges, because any unused slots are filled with dummy edges. If set to None, a reasonable value is computed automatically. Default is None.
- Type:
int | None
- max_n_edge_long_range¶
This value is multiplied by the batch size to determine the maximum number of long range edges allowed in a batch. Note that a batch always contains max_n_edge_long_range * batch_size long range edges, because any unused slots are filled with dummy long range edges. If set to None, a reasonable value is computed automatically. Default is None.
- Type:
int | None
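The padding arithmetic described for max_n_node, max_n_edge and max_n_edge_long_range can be sketched in plain Python as follows. This is only an illustration of the multiplication rule stated above, not the library's implementation; the helper names padded_batch_shape and n_dummy are hypothetical.

```python
def padded_batch_shape(batch_size, max_n_node, max_n_edge, max_n_edge_long_range=None):
    """Return the fixed per-batch sizes implied by the per-graph maxima.

    Hypothetical helper: mirrors the rule "value * batch_size" from the
    attribute descriptions, nothing more.
    """
    shape = {
        "n_node": max_n_node * batch_size,
        "n_edge": max_n_edge * batch_size,
    }
    if max_n_edge_long_range is not None:
        shape["n_edge_long_range"] = max_n_edge_long_range * batch_size
    return shape


def n_dummy(real_count, padded_count):
    """Number of dummy entries needed to fill a batch up to its fixed size."""
    if real_count > padded_count:
        raise ValueError("real entries exceed the padded batch size")
    return padded_count - real_count


shape = padded_batch_shape(batch_size=16, max_n_node=32, max_n_edge=288)
dummy_nodes = n_dummy(real_count=400, padded_count=shape["n_node"])
```

Because every batch has the same padded size, downstream compiled code sees a single static shape regardless of how many real nodes and edges a batch holds.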
- batch_size¶
The number of graphs in a batch. The batch is filled up with dummy graphs if either the maximum number of nodes or the maximum number of edges is reached before the number of graphs is reached. Default is 16.
- Type:
int
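To illustrate how a batch can end up with fewer than batch_size real graphs, here is a minimal greedy packing sketch. The greedy strategy and the function name pack_batches are assumptions made for illustration; the actual batching strategy of the library is not specified here and may, for example, reserve slots for padding.

```python
def pack_batches(node_counts, edge_counts, max_n_node, max_n_edge, batch_size):
    """Greedily pack graphs into batches under node, edge and graph budgets.

    Returns a list of (graph_indices, n_dummy_graphs) tuples.
    Hypothetical sketch, not the library's algorithm.
    """
    node_budget = max_n_node * batch_size
    edge_budget = max_n_edge * batch_size
    batches, current, n_nodes, n_edges = [], [], 0, 0

    for idx, (nn, ne) in enumerate(zip(node_counts, edge_counts)):
        if nn > node_budget or ne > edge_budget:
            raise ValueError(f"graph {idx} does not fit into a single batch")
        batch_is_full = (
            len(current) == batch_size
            or n_nodes + nn > node_budget
            or n_edges + ne > edge_budget
        )
        if batch_is_full:
            batches.append(current)
            current, n_nodes, n_edges = [], 0, 0
        current.append(idx)
        n_nodes += nn
        n_edges += ne

    if current:
        batches.append(current)
    # Every batch is topped up with dummy graphs to reach batch_size graphs.
    return [(batch, batch_size - len(batch)) for batch in batches]


packed = pack_batches(
    node_counts=[3, 3, 3, 3],
    edge_counts=[6, 6, 6, 6],
    max_n_node=2,
    max_n_edge=10,
    batch_size=3,
)
```

In this example the node budget (2 * 3 = 6) is exhausted after two graphs, so each batch carries one dummy graph even though batch_size is 3.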
- num_batch_prefetch_host¶
Sets the depth of the inner (host) prefetch queue. Default is 1.
- Type:
int
- num_batch_prefetch_device¶
Sets the depth of the outer (device) prefetch queue: how many already-sharded global batches are kept queued on devices. Default is 1.
- Type:
int
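The notion of a prefetch queue depth can be sketched with a bounded queue filled by a background thread. This is a generic standard-library illustration of host-side prefetching, assuming a producer/consumer design; it does not show how the library implements either queue, and the name prefetch_to_queue is hypothetical.

```python
import queue
import threading


def prefetch_to_queue(iterable, depth):
    """Yield items from `iterable` while a worker thread keeps up to `depth`
    items ready in a bounded queue (hypothetical sketch)."""
    buffer = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the stream

    def worker():
        for item in iterable:
            buffer.put(item)  # blocks once `depth` items are waiting
        buffer.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


batches = list(prefetch_to_queue(range(5), depth=1))
```

A larger depth lets the producer run further ahead of the consumer at the cost of holding more batches in memory at once.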
- use_formation_energies¶
Whether the energies in the dataset should be transformed to formation energies by subtracting the average atomic energies. Default is False.
- Type:
bool
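The transformation can be sketched as subtracting one average atomic energy per atom from the total energy. The function name formation_energy and the per-element lookup table are assumptions for illustration; how the library obtains the average atomic energies is not described here.

```python
def formation_energy(total_energy, atomic_numbers, average_atomic_energies):
    """Subtract the summed per-element average atomic energies from the
    total energy of a system (hypothetical sketch).

    average_atomic_energies maps atomic number -> average energy per atom.
    """
    reference = sum(average_atomic_energies[z] for z in atomic_numbers)
    return total_energy - reference


# Illustrative numbers only, not real reference energies.
e_form = formation_energy(
    total_energy=-10.0,
    atomic_numbers=[1, 1, 8],
    average_atomic_energies={1: -0.5, 8: -7.0},
)
```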
- avg_num_neighbors¶
The pre-computed average number of neighbors. Default is None.
- Type:
float | None
- avg_r_min_angstrom¶
The pre-computed average minimum distance between nodes. Default is None.
- Type:
float | None
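For orientation, the two statistics could be pre-computed along the following lines. This sketch makes several assumptions that are not confirmed by the descriptions above: neighbors are counted within a cutoff without periodic boundary conditions, and the average minimum distance is taken as the mean over structures of each structure's smallest interatomic distance. The function name neighbor_stats is hypothetical.

```python
import math


def neighbor_stats(structures, cutoff):
    """Compute (avg_num_neighbors, avg_r_min) for a list of structures.

    Each structure is a list of (x, y, z) positions in Angstrom.
    Assumptions: no periodic images; avg_r_min averages the per-structure
    minimum pairwise distance.
    """
    total_atoms = sum(len(positions) for positions in structures)
    if total_atoms == 0:
        raise ValueError("no atoms provided")

    total_neighbors = 0
    min_distances = []
    for positions in structures:
        structure_min = math.inf
        for i, p in enumerate(positions):
            for j, q in enumerate(positions):
                if i == j:
                    continue
                d = math.dist(p, q)
                if d < cutoff:
                    total_neighbors += 1  # counted once per directed pair
                structure_min = min(structure_min, d)
        if structure_min < math.inf:
            min_distances.append(structure_min)

    avg_num_neighbors = total_neighbors / total_atoms
    avg_r_min = sum(min_distances) / len(min_distances) if min_distances else None
    return avg_num_neighbors, avg_r_min


avg_nn, avg_rmin = neighbor_stats(
    [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]], cutoff=1.5
)
```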
- remove_systems_without_partial_charges¶
Whether to remove systems without partial charges from the dataset. Default is False.
- Type:
bool
- allowed_atomic_numbers¶
List of allowed atomic numbers to filter the dataset by during preprocessing. All systems containing elements not in the list are removed. Default is None (no filter).
- Type:
list[int] | None
- excluded_atomic_numbers¶
List of excluded atomic numbers to filter the dataset by during preprocessing. All systems containing elements in the list are removed. Default is None (no filter).
- Type:
list[int] | None
- allowed_charges¶
List of allowed total charges to filter the dataset by during preprocessing. All systems whose total charge is not in the list are removed. Default is None (no filter).
- Type:
list[int] | None
- excluded_charges¶
List of excluded total charges to filter the dataset by during preprocessing. All systems whose total charge is in the list are removed. Default is None (no filter).
- Type:
list[int] | None
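The four filters above can be read as a single predicate per system. The sketch below follows the stated rules literally; the function name keep_system is hypothetical, and how the library treats a total charge of None under the charge filters is not covered here.

```python
def keep_system(
    atomic_numbers,
    total_charge,
    *,
    allowed_atomic_numbers=None,
    excluded_atomic_numbers=None,
    allowed_charges=None,
    excluded_charges=None,
):
    """Return True if a system passes all configured filters.

    A filter set to None is skipped (no filter). Hypothetical sketch.
    """
    elements = set(atomic_numbers)
    # Remove systems with any element outside the allowed list.
    if allowed_atomic_numbers is not None and not elements <= set(allowed_atomic_numbers):
        return False
    # Remove systems with any element in the excluded list.
    if excluded_atomic_numbers is not None and elements & set(excluded_atomic_numbers):
        return False
    # Remove systems whose total charge is not allowed.
    if allowed_charges is not None and total_charge not in allowed_charges:
        return False
    # Remove systems whose total charge is excluded.
    if excluded_charges is not None and total_charge in excluded_charges:
        return False
    return True


water = [1, 1, 8]
kept = keep_system(water, 0, allowed_atomic_numbers=[1, 6, 8], allowed_charges=[0])
```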
- ensure_no_unseen_total_charges¶
Whether to ensure that no unseen total charges are present in the dataset based on the allowed charges provided by the dataset_info. Default is False.
- Type:
bool
- set_none_charges_to_zero¶
Whether to set None total charges to zero during preprocessing. Default is False.
- Type:
bool
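The two charge-handling options can be sketched as a normalization step followed by a consistency check. Both helper names are hypothetical, and the choice to raise an error on unseen charges is an assumption of this sketch; the description above does not say whether the library raises or filters.

```python
def normalize_total_charges(charges, *, set_none_charges_to_zero=False):
    """Optionally replace None total charges with 0 (hypothetical sketch)."""
    if not set_none_charges_to_zero:
        return list(charges)
    return [0 if charge is None else charge for charge in charges]


def check_no_unseen_total_charges(charges, known_charges):
    """Raise if any total charge is absent from `known_charges`.

    `known_charges` stands in for the allowed charges carried by the
    dataset info; raising is an assumption of this sketch.
    """
    unseen = sorted({c for c in charges if c is not None} - set(known_charges))
    if unseen:
        raise ValueError(f"unseen total charges: {unseen}")


charges = normalize_total_charges([0, None, 1], set_none_charges_to_zero=True)
check_no_unseen_total_charges(charges, known_charges=[0, 1])
```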
- homogenize¶
If True, the resulting GraphDataset will pad any missing Prediction-targeted optional fields (e.g. stress, forces) with NaN so graphs from heterogeneous datasets share the same pytree structure and can be batched. If False, the dataset will instead validate that the provided graphs are already batch-compatible and raise a clear error otherwise. Multi-dataset merging typically requires True because different subsets may not share the same optional-field presence. Defaults to False.
- Type:
bool
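The two behaviors of homogenize can be sketched with plain dictionaries standing in for graphs. This is an illustration only: the field tuple, the function name homogenize_graphs, and the use of a scalar NaN placeholder (rather than NaN-filled arrays of the right shape) are all assumptions of the sketch.

```python
import math

# Illustrative subset of optional fields, taken from the examples above.
OPTIONAL_FIELDS = ("forces", "stress")


def homogenize_graphs(graphs, homogenize):
    """Make optional-field presence uniform across graphs.

    With homogenize=True, missing fields are filled with NaN.
    With homogenize=False, a mismatch raises ValueError.
    """
    # Fields that at least one graph provides.
    present = {
        name for graph in graphs for name in OPTIONAL_FIELDS
        if graph.get(name) is not None
    }
    result = []
    for index, graph in enumerate(graphs):
        graph = dict(graph)  # do not mutate the caller's data
        for name in sorted(present):
            if graph.get(name) is None:
                if not homogenize:
                    raise ValueError(
                        f"graph {index} lacks '{name}'; graphs are not batch-compatible"
                    )
                graph[name] = math.nan  # placeholder for a NaN-filled array
        result.append(graph)
    return result


graphs = [
    {"energy": -1.0, "forces": [[0.0, 0.0, 0.0]]},
    {"energy": -2.0},
]
padded = homogenize_graphs(graphs, homogenize=True)
```

Using NaN rather than zero keeps padded targets distinguishable from real values, so they can be masked out later.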