Graph Dataset Builder
- class mlip.data.graph_dataset_builder.GraphDatasetBuilder(reader: ChemicalSystemsReader | CombinedReader, dataset_config: GraphDatasetBuilderConfig, dataset_info: DatasetInfo | None = None)
Main class handling the construction and preprocessing of the graph dataset.
The key idea is that a user provides a ChemicalSystemsReader subclass that loads a dataset from disk into ChemicalSystem dataclasses, and then GraphDatasetBuilder converts these further to jraph graphs and the dataset info dataclass.
- __init__(reader: ChemicalSystemsReader | CombinedReader, dataset_config: GraphDatasetBuilderConfig, dataset_info: DatasetInfo | None = None)
Constructor.
- Parameters:
reader – The data reader that loads a dataset into ChemicalSystem dataclasses.
dataset_config – The pydantic config.
dataset_info – Leave None to create initial training datasets. Otherwise, pass the .dataset_info of a trained ForceField for downstream tasks like batched inference, finetuning, or distillation.
Note
Evaluating a trained model on new data can lead to inconsistent results if:
a wrong AtomicNumberTable is used to map atomic numbers to species indices,
a different graph_cutoff_angstrom is used to generate the graphs (to a lesser extent).
Therefore, it is important to pass the dataset_info of a trained model when preparing a dataset of batched graphs for downstream tasks.
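The two failure modes in the note can be illustrated without the library. The following is a minimal sketch in plain Python. The helpers `species_indices` and `count_edges` are hypothetical, written only for this illustration, and are not part of mlip. The real AtomicNumberTable and graph construction may differ in detail.

```python
import math


def species_indices(atomic_numbers, z_table):
    """Map atomic numbers to their positions in a table of known elements."""
    lookup = {z: i for i, z in enumerate(z_table)}
    return [lookup[z] for z in atomic_numbers]


def count_edges(positions, cutoff):
    """Count undirected atom pairs closer than the cutoff (in Angstrom)."""
    n = len(positions)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(positions[i], positions[j]) < cutoff
    )


# A water molecule: O, H, H (approximate coordinates in Angstrom).
atomic_numbers = [8, 1, 1]
positions = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

# Assumed table from training (H, C, O) versus a table rebuilt from a
# new dataset that happens to contain no carbon (H, O).
train_table = [1, 6, 8]
new_table = [1, 8]

train_idx = species_indices(atomic_numbers, train_table)
new_idx = species_indices(atomic_numbers, new_table)

# A different cutoff changes which atom pairs become graph edges.
edges_short = count_edges(positions, cutoff=1.2)
edges_long = count_edges(positions, cutoff=2.0)
```

With the rebuilt table, oxygen is assigned index 1, which the trained model would interpret as carbon. The larger cutoff adds the H-H pair as an edge, so the model would see a different graph from the one implied by its training setup.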
- prepare_datasets() → None
Prepares the datasets.
This includes loading the data into ChemicalSystem objects via the chemical systems reader, and then producing the graph datasets and the dataset info object.
- get_splits(prefetch: bool = False, devices: list[Device] | None = None) → tuple[GraphDataset, GraphDataset, GraphDataset] | tuple[PrefetchIterator, PrefetchIterator, PrefetchIterator]
Returns the training, validation, and test dataset splits.
- Parameters:
prefetch – Whether to run the data prefetching and return PrefetchIterators.
devices – Devices for parallel prefetching. Must be given if prefetch=True.
- Returns:
A tuple of training, validation, and test datasets. If prefetch=False, these are of type GraphDataset, otherwise of type PrefetchIterator.
- property dataset_info: DatasetInfo
Getter for the dataset info.
Raises an exception if the dataset info is not available yet.
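The behaviour of a getter that fails until its value has been computed can be sketched with a small stand-alone class. Everything here is hypothetical: `LazyInfoHolder`, its `prepare()` method, the placeholder dictionary, and the use of `RuntimeError` are assumptions for illustration, since the exception type raised by the library is not stated here.

```python
class LazyInfoHolder:
    """Hypothetical class whose property is unavailable until prepare() runs."""

    def __init__(self):
        self._dataset_info = None

    def prepare(self):
        # Placeholder value; a real dataset info object holds more than this.
        self._dataset_info = {"prepared": True}

    @property
    def dataset_info(self):
        if self._dataset_info is None:
            # Assumed exception type, chosen only for this sketch.
            raise RuntimeError("dataset info is not available yet")
        return self._dataset_info


holder = LazyInfoHolder()

try:
    holder.dataset_info
    failed_before_prepare = False
except RuntimeError:
    failed_before_prepare = True

holder.prepare()
info = holder.dataset_info
```

Callers that need the dataset info for downstream tasks should therefore read it only after the preparation step has completed.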