Graph Dataset Builder

class mlip.data.graph_dataset_builder.BuilderMode(*values)

Available modes for GraphDatasetBuilder.

CUSTOM

Expects a flat readers dict keyed by arbitrary split names, e.g. {"split_a": reader, "split_b": reader}. Computes dataset info independently for each split, unless a preset dataset info is provided.

TRAINING

Expects a flat readers dict that must contain a "train" key, e.g. {"train": reader, "val": reader, "test": reader}. Computes dataset info only from the "train" split.

MULTI

Expects a nested readers dict keyed first by dataset name, then by split name, e.g. {"oc20": {"train": reader, "val": reader}, "omat": {"train": reader}}. Computes dataset info from each dataset’s "train" split and combines them. If a "replay" dataset exists, it reuses a preset dataset info for that dataset.
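The three layouts described above can be sketched as plain dictionaries. This is a minimal illustration rather than the library's validation logic: `Mode`, its member values, and `check_readers_layout` are hypothetical stand-ins, and `reader` is a placeholder object instead of a real ChemicalSystemsReader.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for BuilderMode; the member values are assumed."""

    CUSTOM = "custom"
    TRAINING = "training"
    MULTI = "multi"


def check_readers_layout(readers: dict, mode: Mode) -> None:
    """Raise ValueError if `readers` does not match the layout described for `mode`."""
    if mode is Mode.MULTI:
        # MULTI: dataset name -> {split name -> reader}
        for dataset_name, splits in readers.items():
            if not isinstance(splits, dict):
                raise ValueError(f"MULTI expects a dict of splits for {dataset_name!r}")
        return
    # CUSTOM and TRAINING: split name -> reader (flat)
    for split_name, value in readers.items():
        if isinstance(value, dict):
            raise ValueError(f"{mode.name} expects a flat dict, but {split_name!r} is nested")
    if mode is Mode.TRAINING and "train" not in readers:
        raise ValueError('TRAINING expects a "train" key')


reader = object()  # placeholder for a ChemicalSystemsReader

custom_readers = {"split_a": reader, "split_b": reader}
training_readers = {"train": reader, "val": reader, "test": reader}
multi_readers = {"oc20": {"train": reader, "val": reader}, "omat": {"train": reader}}

check_readers_layout(custom_readers, Mode.CUSTOM)
check_readers_layout(training_readers, Mode.TRAINING)
check_readers_layout(multi_readers, Mode.MULTI)
```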

class mlip.data.graph_dataset_builder.GraphDatasetBuilder(readers: dict[str, ChemicalSystemsReader | list[ChemicalSystemsReader]] | dict[str, dict[str, ChemicalSystemsReader | list[ChemicalSystemsReader]]], builder_config: GraphDatasetBuilderConfig, mode: BuilderMode | str = BuilderMode.TRAINING, dataset_info: DatasetInfo | None = None)

Orchestrates building multiple dataset splits using SingleGraphDatasetBuilder.

Supports three modes (see BuilderMode): CUSTOM, TRAINING, and MULTI, each differing in how dataset info is computed and how readers are organized.

__init__(readers: dict[str, ChemicalSystemsReader | list[ChemicalSystemsReader]] | dict[str, dict[str, ChemicalSystemsReader | list[ChemicalSystemsReader]]], builder_config: GraphDatasetBuilderConfig, mode: BuilderMode | str = BuilderMode.TRAINING, dataset_info: DatasetInfo | None = None)

Constructor.

Parameters:
  • readers – A dictionary of readers. In CUSTOM and TRAINING modes it is flat and keyed by split name (e.g. {"train": readers, "val": readers}); in MULTI mode it is nested, keyed first by dataset name and then by split name. Each value is a single ChemicalSystemsReader or a list of them.

  • builder_config – Configuration for graph construction, including batch size, cutoff distance, and batch dimension limits.

  • mode – The build mode. Defaults to BuilderMode.TRAINING.

  • dataset_info – An optional preset DatasetInfo. Required for MULTI mode when a "replay" dataset is present.
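The rule that MULTI mode needs a preset when a "replay" dataset is present could be expressed as a small pre-flight check. The sketch below is an assumption about how such a check might look, not the constructor's actual behaviour: `require_preset_for_replay` is a hypothetical helper, the mode is passed as a plain flag, and `reader` and `preset` are placeholder objects.

```python
def require_preset_for_replay(readers: dict, is_multi: bool, dataset_info=None) -> None:
    """Hypothetical check mirroring the documented rule; not the library's code."""
    if is_multi and "replay" in readers and dataset_info is None:
        raise ValueError('MULTI mode with a "replay" dataset needs a preset dataset_info')


reader = object()  # placeholder for a ChemicalSystemsReader
preset = object()  # placeholder for a DatasetInfo

readers_with_replay = {"oc20": {"train": reader}, "replay": {"train": reader}}
readers_without_replay = {"oc20": {"train": reader}, "omat": {"train": reader}}

# Passes: a preset accompanies the "replay" dataset.
require_preset_for_replay(readers_with_replay, is_multi=True, dataset_info=preset)

# Passes: no "replay" dataset, so no preset is needed.
require_preset_for_replay(readers_without_replay, is_multi=True)
```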

get_datasets(prefetch: bool = False, mesh: Mesh | None = None, systems_preprocessing: list[Callable[[list[ChemicalSystem]], list[ChemicalSystem]]] | None = None, graph_postprocessing: list[Callable[[Graph], Graph]] | None = None) → dict[str, GraphDataset | PrefetchIterator]

Build all dataset splits according to the configured mode.

Parameters:
  • prefetch – Whether to wrap each dataset in a PrefetchIterator. Default is False.

  • mesh – Device mesh for data parallelism. If None and prefetch is True, a default mesh is created.

  • systems_preprocessing – Optional list of functions applied sequentially to the loaded chemical systems.

  • graph_postprocessing – Optional list of batch-level post-processing functions passed to GraphDataset.
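To illustrate what "applied sequentially" means for systems_preprocessing, the sketch below chains two list-to-list functions so that each one receives the output of the previous one. `System`, `drop_large_systems`, `shift_energies`, and `apply_sequentially` are hypothetical names introduced for illustration; `System` merely stands in for ChemicalSystem, whose real fields are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class System:
    """Hypothetical stand-in for ChemicalSystem with two illustrative fields."""

    num_atoms: int
    energy: float


Preprocessor = Callable[[list[System]], list[System]]


def drop_large_systems(systems: list[System]) -> list[System]:
    # Keep only systems with at most 100 atoms.
    return [s for s in systems if s.num_atoms <= 100]


def shift_energies(systems: list[System]) -> list[System]:
    # Subtract a constant offset of 1.0 from every energy.
    return [System(s.num_atoms, s.energy - 1.0) for s in systems]


def apply_sequentially(
    systems: list[System], steps: Optional[list[Preprocessor]]
) -> list[System]:
    # Each step receives the output of the previous one; None means no steps.
    for step in steps or []:
        systems = step(systems)
    return systems


systems = [System(num_atoms=3, energy=5.0), System(num_atoms=500, energy=9.0)]
processed = apply_sequentially(systems, [drop_large_systems, shift_energies])
```

Because the functions run in list order, placing the filter first means later steps only touch the systems that survive it.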

Returns:

A dictionary mapping split names to datasets or prefetch iterators. As a side effect, the call also sets self.dataset_info.

property dataset_info: DatasetInfo | dict[str, DatasetInfo] | None

The computed or preset DatasetInfo, or None if dataset info computation was disabled.

In CUSTOM mode with a preset DatasetInfo, the post-build value equals the preset, so it is returned immediately without requiring get_datasets() to have been called. Other modes derive the value from the graphs and still require a prior build.

Raises:

DatasetsHaveNotBeenProcessedError – If dataset info is not yet available because get_datasets() has not been called.
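The access rules for dataset_info can be sketched with a toy class. Everything here is a hypothetical stand-in: `BuilderSketch` is not GraphDatasetBuilder, `NotBuiltError` stands in for DatasetsHaveNotBeenProcessedError, modes are plain strings, and the "computed" value is a placeholder dictionary. The case where dataset info computation is disabled (returning None) is left out.

```python
class NotBuiltError(RuntimeError):
    """Hypothetical stand-in for DatasetsHaveNotBeenProcessedError."""


class BuilderSketch:
    """Toy model of the documented dataset_info behaviour."""

    def __init__(self, mode: str, preset_info=None):
        self._mode = mode
        self._preset_info = preset_info
        self._dataset_info = None
        self._built = False

    def get_datasets(self) -> dict:
        # A real build would derive the info from graphs; this uses a placeholder.
        if self._preset_info is not None:
            self._dataset_info = self._preset_info
        else:
            self._dataset_info = {"computed": True}
        self._built = True
        return {}

    @property
    def dataset_info(self):
        # CUSTOM with a preset: the value is known before any build.
        if self._mode == "custom" and self._preset_info is not None:
            return self._preset_info
        if not self._built:
            raise NotBuiltError("call get_datasets() first")
        return self._dataset_info


preset = {"preset": True}

custom_builder = BuilderSketch("custom", preset_info=preset)
info_before_build = custom_builder.dataset_info  # available immediately

training_builder = BuilderSketch("training")
training_builder.get_datasets()
info_after_build = training_builder.dataset_info  # available after the build
```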