Graph Dataset Builder
- class mlip.data.graph_dataset_builder.BuilderMode(*values)
Available modes for GraphDatasetBuilder.
- CUSTOM
Expects a flat readers dict keyed by arbitrary split names, e.g. {"split_a": reader, "split_b": reader}. Computes dataset info independently for each split, unless a preset dataset info is provided.
- TRAINING
Expects a flat readers dict that must contain a "train" key, e.g. {"train": reader, "val": reader, "test": reader}. Computes dataset info only from the "train" split.
- MULTI
Expects a nested readers dict keyed first by dataset name, then by split name, e.g. {"oc20": {"train": reader, "val": reader}, "omat": {"train": reader}}. Computes dataset info from each dataset’s "train" split and combines them. If a "replay" dataset exists, it reuses a preset dataset info for that dataset.
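The three reader layouts described above can be compared side by side in a small sketch. It does not import mlip: the Mode enum, its string values, and check_readers_layout are hypothetical stand-ins written for this illustration, and the library's own validation may well differ.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in that mirrors the three documented mode names."""

    CUSTOM = "custom"
    TRAINING = "training"
    MULTI = "multi"


def check_readers_layout(readers: dict, mode: Mode) -> None:
    """Raise ValueError if `readers` does not match the layout described for `mode`."""
    if mode is Mode.MULTI:
        # Nested layout: dataset name -> split name -> reader.
        for name, splits in readers.items():
            if not isinstance(splits, dict):
                raise ValueError(f"MULTI expects a dict of splits for dataset {name!r}")
    else:
        # Flat layout: split name -> reader.
        for split, reader in readers.items():
            if isinstance(reader, dict):
                raise ValueError(f"{mode.name} expects a flat dict, but {split!r} is nested")
        if mode is Mode.TRAINING and "train" not in readers:
            raise ValueError('TRAINING requires a "train" key')


reader = object()  # placeholder for a real ChemicalSystemsReader

custom_readers = {"split_a": reader, "split_b": reader}
training_readers = {"train": reader, "val": reader, "test": reader}
multi_readers = {"oc20": {"train": reader, "val": reader}, "omat": {"train": reader}}

# Each example dict from the mode descriptions passes the check for its own mode.
check_readers_layout(custom_readers, Mode.CUSTOM)
check_readers_layout(training_readers, Mode.TRAINING)
check_readers_layout(multi_readers, Mode.MULTI)
```

Passing multi_readers with Mode.TRAINING fails in this sketch because its values are dicts rather than readers, which is the practical difference between the flat and nested layouts.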
- class mlip.data.graph_dataset_builder.GraphDatasetBuilder(readers: dict[str, ChemicalSystemsReader | list[ChemicalSystemsReader]] | dict[str, dict[str, ChemicalSystemsReader | list[ChemicalSystemsReader]]], builder_config: GraphDatasetBuilderConfig, mode: BuilderMode | str = BuilderMode.TRAINING, dataset_info: DatasetInfo | None = None)
Orchestrates building multiple dataset splits using SingleGraphDatasetBuilder.
Supports three modes (see BuilderMode): CUSTOM, TRAINING, and MULTI, each differing in how dataset info is computed and how readers are organized.
- __init__(readers: dict[str, ChemicalSystemsReader | list[ChemicalSystemsReader]] | dict[str, dict[str, ChemicalSystemsReader | list[ChemicalSystemsReader]]], builder_config: GraphDatasetBuilderConfig, mode: BuilderMode | str = BuilderMode.TRAINING, dataset_info: DatasetInfo | None = None)
Constructor.
- Parameters:
readers – A flat or nested dictionary of readers keyed by split name (e.g. {"train": readers, "valid": readers}) for CUSTOM/TRAINING modes, or by dataset name then split name for MULTI mode.
builder_config – Configuration for graph construction, including batch size, cutoff distance, and batch dimension limits.
mode – The build mode. Defaults to BuilderMode.TRAINING.
dataset_info – An optional preset DatasetInfo. Required for MULTI mode when a "replay" dataset is present.
- get_datasets(prefetch: bool = False, mesh: Mesh | None = None, systems_preprocessing: list[Callable[[list[ChemicalSystem]], list[ChemicalSystem]]] | None = None, graph_postprocessing: list[Callable[[Graph], Graph]] | None = None) → dict[str, GraphDataset | PrefetchIterator]
Build all dataset splits according to the configured mode.
- Parameters:
prefetch – Whether to wrap each dataset in a PrefetchIterator. Default is False.
mesh – Device mesh for data parallelism. If None and prefetch is True, a default mesh is created.
systems_preprocessing – Optional list of functions applied sequentially to the loaded chemical systems.
graph_postprocessing – Optional list of batch-level post-processing functions passed to GraphDataset.
- Returns:
A dictionary mapping split names to datasets or prefetch iterators. Also sets self.dataset_info.
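To make "applied sequentially" concrete for systems_preprocessing, here is a minimal sketch of how a list of list-to-list callables composes. The System dataclass and its fields, the two filter functions, and apply_sequentially are all hypothetical stand-ins for illustration. The real ChemicalSystem class has its own fields, and the library applies the functions internally.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class System:
    """Hypothetical stand-in for ChemicalSystem; the real class has different fields."""

    num_atoms: int
    energy: Optional[float] = None


# Same shape as the documented callables: a list of systems in, a list of systems out.
Preprocessor = Callable[[list[System]], list[System]]


def drop_unlabelled(systems: list[System]) -> list[System]:
    """Keep only systems that carry an energy label."""
    return [s for s in systems if s.energy is not None]


def drop_large(systems: list[System], max_atoms: int = 100) -> list[System]:
    """Keep only systems with at most `max_atoms` atoms."""
    return [s for s in systems if s.num_atoms <= max_atoms]


def apply_sequentially(systems: list[System], steps: list[Preprocessor]) -> list[System]:
    """Feed the output of each step into the next one, in list order."""
    for step in steps:
        systems = step(systems)
    return systems


systems = [System(3, -1.0), System(500, -2.0), System(4, None)]
kept = apply_sequentially(systems, [drop_unlabelled, drop_large])
print(kept)  # [System(num_atoms=3, energy=-1.0)]
```

Because each function receives the whole list, a step can filter, reorder, or modify systems. In this sketch the order of the two filters does not change the result, but it does for steps that depend on each other.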
- property dataset_info: DatasetInfo | dict[str, DatasetInfo] | None
The computed or preset DatasetInfo, or None if dataset info computation was disabled.
In CUSTOM mode with a preset DatasetInfo, the post-build value equals the preset, so it is returned immediately without requiring get_datasets() to have been called. Other modes derive the value from the graphs and still require a prior build.
- Raises:
DatasetsHaveNotBeenProcessedError – If dataset info is not yet available because get_datasets() has not been called.
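The call order implied by this page (construct, build, then read dataset_info) can be sketched as a small helper. It uses only the names documented here. The helper build_training_datasets is hypothetical, and preparing the readers and the GraphDatasetBuilderConfig is left to the caller, since their construction is not covered in this section.

```python
def build_training_datasets(readers, builder_config, prefetch=False):
    """Build all splits in TRAINING mode and return (datasets, dataset_info).

    `readers` is assumed to be a flat dict containing a "train" key, and
    `builder_config` a GraphDatasetBuilderConfig; both come from the caller.
    """
    # Imported lazily so the helper can be defined without mlip installed.
    from mlip.data.graph_dataset_builder import BuilderMode, GraphDatasetBuilder

    builder = GraphDatasetBuilder(
        readers=readers,
        builder_config=builder_config,
        mode=BuilderMode.TRAINING,
    )

    # Build first: outside CUSTOM mode with a preset, dataset_info is only
    # available after get_datasets() has run.
    datasets = builder.get_datasets(prefetch=prefetch)
    dataset_info = builder.dataset_info
    return datasets, dataset_info
```

Reading builder.dataset_info before get_datasets() in this mode would raise DatasetsHaveNotBeenProcessedError, according to the property description above.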