HDF5 Reader

This reader expects the data to be in HDF5 format, organized as follows. Each structure must be stored as a top-level group whose name is the structure's identifier. Within a group, scalar properties (such as the energy) are stored as attributes of the group, and array properties (such as positions, elements and forces) are stored as datasets. The example below reads one structure from a compliant HDF5 file to show how the data is organized:

import h5py

# hdf5_dataset_path is the path to a compliant HDF5 file
with h5py.File(hdf5_dataset_path, "r") as h5file:
    # Get the identifiers for all structures in the dataset
    struct_names = list(h5file.keys())

    # Just loading the first one for the sake of an example
    structure = h5file[struct_names[0]]
    positions = structure["positions"][:]
    element_numbers = structure["elements"][:]
    forces = structure["forces"][:]
    # Stress is optional and may be absent if it is not needed during training
    stress = structure["stress"][:] if "stress" in structure else None

    # Energy is a scalar
    energy = structure.attrs["energy"]
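
The reverse direction, writing a file in this layout, can be sketched as follows. This is a minimal illustration, not part of the library: the helper name write_structure, the group name "structure_0", the array shapes and dtypes, and all numerical values are assumptions made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np


def write_structure(h5file, name, positions, elements, forces, energy, stress=None):
    """Hypothetical helper: store one structure as a group.

    Array properties become datasets, scalar properties become attributes.
    """
    group = h5file.create_group(name)
    group.create_dataset("positions", data=np.asarray(positions, dtype=float))
    group.create_dataset("elements", data=np.asarray(elements, dtype=int))
    group.create_dataset("forces", data=np.asarray(forces, dtype=float))
    if stress is not None:
        # Stress is only written when it is available
        group.create_dataset("stress", data=np.asarray(stress, dtype=float))
    group.attrs["energy"] = float(energy)


# Made-up two-atom structure, written to a temporary file
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as h5file:
    write_structure(
        h5file,
        "structure_0",
        positions=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
        elements=[1, 1],
        forces=[[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]],
        energy=-1.0,
    )

# Read the structure back using the same access pattern as the example above
with h5py.File(path, "r") as h5file:
    written = h5file["structure_0"]
    n_atoms = written["positions"].shape[0]
    has_stress = "stress" in written
    energy = float(written.attrs["energy"])
```

A file written this way can be read back with the reading example above, because it uses the same dataset and attribute names.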

See below for the API reference to the associated loader class.

class mlip.data.chemical_systems_readers.hdf5_reader.Hdf5Reader(config: ChemicalSystemsReaderConfig, data_download_fun: Callable[[str | PathLike, str | PathLike], None] | None = None)

Implementation of a chemical systems reader that loads data from the HDF5 format.

__init__(config: ChemicalSystemsReaderConfig, data_download_fun: Callable[[str | PathLike, str | PathLike], None] | None = None)

Constructor.

Parameters:
  • config – The configuration defining how and where to load the data from.

  • data_download_fun – A function to download data from an external remote system. If None (default), then this class assumes file paths are local. This function must take two paths as input, source and target, and download the data at source into the target location.
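
As an illustration of the expected shape of data_download_fun, the sketch below defines a function that takes a source and a target path and places the data at the target. It is only a stand-in: the name copy_as_download is hypothetical, and a local file copy is used in place of a real transfer from a remote system.

```python
import shutil
import tempfile
from os import PathLike
from pathlib import Path
from typing import Union


def copy_as_download(
    source: Union[str, PathLike], target: Union[str, PathLike]
) -> None:
    """Stand-in download function: copy the file at source to target."""
    target_path = Path(target)
    # Make sure the destination directory exists before copying
    target_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(source, target_path)


# Local demonstration with a temporary placeholder file
workdir = Path(tempfile.mkdtemp())
source_file = workdir / "remote" / "dataset.h5"
source_file.parent.mkdir()
source_file.write_bytes(b"placeholder")
target_file = workdir / "local" / "dataset.h5"
copy_as_download(source_file, target_file)
copied_ok = target_file.read_bytes() == b"placeholder"

# Such a function would then be passed to the constructor, for example:
# reader = Hdf5Reader(config, data_download_fun=copy_as_download)
```

A real implementation would replace the body with a call to whatever client the remote storage system provides, keeping the same two-argument signature.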

load(postprocess_fun: Callable[[list[ChemicalSystem], list[ChemicalSystem], list[ChemicalSystem]], tuple[list[ChemicalSystem], list[ChemicalSystem], list[ChemicalSystem]]] | None = <function filter_systems_with_unseen_atoms_and_assign_atomic_species>) → tuple[list[ChemicalSystem], list[ChemicalSystem], list[ChemicalSystem]]

Loads the dataset into its internal format.

Parameters:

postprocess_fun – Function called to postprocess the loaded dataset before it is returned. It accepts the train, validation and test systems (each a list[ChemicalSystem]), runs some postprocessing (filtering, for example) and returns the postprocessed train, validation and test systems. If postprocess_fun is None, no postprocessing is done. By default, filter_systems_with_unseen_atoms_and_assign_atomic_species() is run, which assigns atomic species to the ChemicalSystem objects and filters out systems from the validation and test sets that contain chemical elements not present in the train systems.

Returns:

A tuple of loaded training, validation and test datasets (in this order). The internal format is a list of ChemicalSystem objects.
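
A custom postprocess_fun only needs to accept three lists and return three lists. The sketch below caps the number of training systems and leaves the other two sets untouched. The name make_train_cap is hypothetical, and the function deliberately does not access any ChemicalSystem attributes, since those are not described here.

```python
from typing import Any, Callable

Systems = list[Any]  # stands in for list[ChemicalSystem]
SplitSystems = tuple[Systems, Systems, Systems]


def make_train_cap(max_train: int) -> Callable[[Systems, Systems, Systems], SplitSystems]:
    """Build a postprocessing function that keeps at most max_train training systems."""

    def postprocess(train: Systems, valid: Systems, test: Systems) -> SplitSystems:
        # Truncate the training set; validation and test sets pass through unchanged
        return train[:max_train], valid, test

    return postprocess


# Demonstration with plain strings in place of ChemicalSystem objects
cap_two = make_train_cap(2)
capped = cap_two(["a", "b", "c"], ["d"], ["e"])

# With a reader instance, this would be used as:
# train, valid, test = reader.load(postprocess_fun=make_train_cap(1000))
```

Because a custom function presumably takes the place of the default one, the default species assignment and filtering would then not run unless the custom function performs them as well.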