Single Graph Dataset Builder

class mlip.data.single_graph_dataset_builder.SingleGraphDatasetBuilder(readers: ChemicalSystemsReader | list[ChemicalSystemsReader], builder_config: GraphDatasetBuilderConfig, dataset_info: DatasetInfo | bool, shuffle: bool = False)

Builds a single GraphDataset from one or more readers.

Handles loading chemical systems, converting them to graphs, auto-filling batch dimensions, and optionally computing DatasetInfo.

__init__(readers: ChemicalSystemsReader | list[ChemicalSystemsReader], builder_config: GraphDatasetBuilderConfig, dataset_info: DatasetInfo | bool, shuffle: bool = False)

Constructor.

Parameters:
  • readers – The data reader(s) that load a dataset into ChemicalSystem dataclasses.

  • builder_config – The pydantic configuration (a GraphDatasetBuilderConfig) that controls how the graph dataset is built.

  • dataset_info – Pass True to compute dataset info from the graphs. Pass False to skip dataset info computation. Pass a DatasetInfo instance to use a pre-computed one (e.g. from a trained model).

  • shuffle – Whether to shuffle the data. By default, this is set to False.

get_dataset(prefetch: bool = False, mesh: Mesh | None = None, systems_preprocessing: list[Callable[[list[ChemicalSystem]], list[ChemicalSystem]]] | None = None, graph_postprocessing: list[Callable[[Graph], Graph]] | None = None) → GraphDataset | PrefetchIterator

Build and return the dataset.

Loads the chemical systems, converts them to graphs, builds a GraphDataset, optionally computes DatasetInfo, and wraps the dataset in a PrefetchIterator if requested.

Parameters:
  • prefetch – Whether to wrap the dataset in a PrefetchIterator. By default, this is set to False.

  • mesh – Device mesh for data parallelism. If None and prefetch is True, a default mesh is created.

  • systems_preprocessing – Optional list of functions applied sequentially to the loaded chemical systems.

  • graph_postprocessing – Optional list of batch-level post-processing functions passed to GraphDataset.

Returns:

A PrefetchIterator if prefetch is True, otherwise a GraphDataset.
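
The sequential application of systems_preprocessing functions can be illustrated with a minimal, self-contained sketch. It does not import the library: ToySystem is a hypothetical stand-in for ChemicalSystem, and apply_sequentially, drop_empty, and keep_small are illustrative names, not part of the documented API. The sketch only shows the documented contract, where each function takes a list of systems and returns a list of systems, and the output of one function feeds the next.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToySystem:
    """Hypothetical stand-in for ChemicalSystem (assumption, not the real class)."""

    atomic_numbers: list[int]


# Same shape as the documented callables: list of systems in, list of systems out.
SystemsFn = Callable[[list[ToySystem]], list[ToySystem]]


def drop_empty(systems: list[ToySystem]) -> list[ToySystem]:
    """Remove systems that contain no atoms."""
    return [s for s in systems if s.atomic_numbers]


def keep_small(systems: list[ToySystem]) -> list[ToySystem]:
    """Keep only systems with at most three atoms."""
    return [s for s in systems if len(s.atomic_numbers) <= 3]


def apply_sequentially(
    systems: list[ToySystem], steps: Optional[list[SystemsFn]]
) -> list[ToySystem]:
    """Apply each step in order; None means no preprocessing at all."""
    for step in steps or []:
        systems = step(systems)
    return systems


systems = [ToySystem([1, 1, 8]), ToySystem([]), ToySystem([6, 1, 1, 1, 1])]
result = apply_sequentially(systems, [drop_empty, keep_small])
```

In this sketch, order matters only in the sense that each function sees the output of the previous one. Functions written in this shape could be passed as the systems_preprocessing list, assuming they accept and return real ChemicalSystem objects.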

property dataset_info: DatasetInfo | None

The computed or preset DatasetInfo, or None if dataset info computation was disabled.

Returns immediately when dataset info was disabled (False) or was provided as a preset; in those cases, get_dataset() does not need to have been called.

Raises:

DatasetsHaveNotBeenProcessedError – If dataset info computation was requested (dataset_info=True) but get_dataset() has not been called yet.
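
The three cases described above (disabled, preset, and computed on demand) can be sketched with a toy class. This is an illustration of the documented behaviour only, not the library's implementation: ToyBuilder, its internal attributes, and the dictionary used in place of DatasetInfo are all hypothetical, and the exception class here is a local stand-in sharing the name of the library's error.

```python
class DatasetsHaveNotBeenProcessedError(Exception):
    """Local stand-in for the library error of the same name (assumption)."""


class ToyBuilder:
    """Hypothetical class mimicking the documented dataset_info semantics."""

    def __init__(self, dataset_info):
        # True means "compute from the graphs"; False means "skip";
        # anything else is treated as a pre-computed info object.
        self._compute_requested = dataset_info is True
        self._info = None if isinstance(dataset_info, bool) else dataset_info
        self._processed = False

    def get_dataset(self):
        graphs = [3, 5, 4]  # toy "graphs", represented by their node counts
        if self._compute_requested:
            # A plain dict stands in for a computed DatasetInfo.
            self._info = {"max_nodes": max(graphs)}
        self._processed = True
        return graphs

    @property
    def dataset_info(self):
        # Only the "compute" case depends on get_dataset() having run.
        if self._compute_requested and not self._processed:
            raise DatasetsHaveNotBeenProcessedError(
                "Call get_dataset() before accessing dataset_info."
            )
        return self._info


disabled = ToyBuilder(dataset_info=False)
preset = ToyBuilder(dataset_info={"max_nodes": 10})
computed = ToyBuilder(dataset_info=True)
```

With this toy class, disabled.dataset_info and preset.dataset_info can be read straight away, while computed.dataset_info raises until computed.get_dataset() has been called.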