Data split

mlip.data.helpers.data_split.split_data_randomly(data: list[Any], proportions: DataSplitProportions, seed: int) → tuple[list[Any], list[Any], list[Any]]

Splits the data randomly.

Parameters:

data – The data, given as a list of arbitrary objects.
proportions – The dataset proportions. They must sum to one, and none may exceed one. The train proportion must also be greater than zero.
seed – The random seed for the split.

Returns:

A tuple of three lists holding the training, validation, and test sets, in that order. The latter two can be empty if the corresponding proportions are zero.
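
To make the contract above concrete, here is a minimal sketch of a seeded random split using only the standard library. It is not the mlip implementation: `random_split_sketch` and the `Proportions` dataclass are hypothetical stand-ins, and the field names of `Proportions` are assumed rather than taken from `DataSplitProportions`.

```python
import random
from dataclasses import dataclass
from typing import Any


@dataclass
class Proportions:
    """Hypothetical stand-in for DataSplitProportions (field names assumed)."""

    train: float
    validation: float
    test: float


def random_split_sketch(
    data: list[Any], proportions: Proportions, seed: int
) -> tuple[list[Any], list[Any], list[Any]]:
    fractions = (proportions.train, proportions.validation, proportions.test)
    # Enforce the documented constraints on the proportions.
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("Proportions must sum to one.")
    if any(f > 1.0 for f in fractions) or proportions.train <= 0.0:
        raise ValueError("Invalid proportions.")

    # Shuffle a copy with a dedicated, seeded generator for reproducibility.
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)

    # Cut the shuffled list into three consecutive slices.
    n_train = round(proportions.train * len(shuffled))
    n_val = round(proportions.validation * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train : n_train + n_val]
    test = shuffled[n_train + n_val :]
    return train, val, test


train, val, test = random_split_sketch(
    list(range(10)), Proportions(0.8, 0.1, 0.1), seed=42
)
```

Calling the sketch twice with the same seed gives the same three lists, which is the property the seed parameter is meant to provide.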

mlip.data.helpers.data_split.split_data_randomly_by_group(data: list[Any], proportions: DataSplitProportions, seed: int, get_group_id_fun: Callable[[Any], str], placeholder_group_id: str) → tuple[list[Any], list[Any], list[Any]]

Splits the data randomly while keeping each group intact.

Data points that belong to the same group always end up in the same split. The grouping is defined by the get_group_id_fun parameter.

Parameters:

data – The data, given as a list of arbitrary objects.
proportions – The dataset proportions. They must sum to one, and none may exceed one. The train proportion must also be greater than zero.
seed – The random seed for the split.
get_group_id_fun – A function that takes one of the objects (data points) and returns its group ID as a string.
placeholder_group_id – The group ID used for any data that does not belong to a predefined group. Data in this group is assigned to the training set.

Returns:

A tuple of three lists holding the training, validation, and test sets, in that order. The latter two can be empty if the corresponding proportions are zero. Because whole groups are assigned together, the resulting proportions may not match the requested ones exactly, but they should be close.
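
The group-aware behavior can be illustrated with a short sketch. It assumes one plausible strategy, namely shuffling the group IDs and dividing them by the requested fractions. The actual mlip strategy may differ. `group_split_sketch` is a hypothetical name, and a plain `(train, validation, test)` tuple of floats stands in for `DataSplitProportions`.

```python
import random
from collections import defaultdict
from typing import Any, Callable


def group_split_sketch(
    data: list[Any],
    fractions: tuple[float, float, float],  # stand-in for DataSplitProportions
    seed: int,
    get_group_id_fun: Callable[[Any], str],
    placeholder_group_id: str,
) -> tuple[list[Any], list[Any], list[Any]]:
    # Bucket the data points by group ID.
    groups: dict[str, list[Any]] = defaultdict(list)
    for item in data:
        groups[get_group_id_fun(item)].append(item)

    # Data in the placeholder group goes straight to the training set.
    train: list[Any] = list(groups.pop(placeholder_group_id, []))
    val: list[Any] = []
    test: list[Any] = []

    # Shuffle the remaining group IDs reproducibly (assumed strategy).
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    # Divide the groups, not the individual points, by the fractions.
    n_train = round(fractions[0] * len(group_ids))
    n_val = round(fractions[1] * len(group_ids))
    for i, group_id in enumerate(group_ids):
        if i < n_train:
            train.extend(groups[group_id])
        elif i < n_train + n_val:
            val.extend(groups[group_id])
        else:
            test.extend(groups[group_id])
    return train, val, test


records = [
    {"id": i, "group": g}
    for i, g in enumerate(["a", "a", "b", "c", "c", "d", "d", "unknown"])
]
train, val, test = group_split_sketch(
    records, (0.5, 0.25, 0.25), 0, lambda r: r["group"], "unknown"
)
```

Since groups differ in size, the number of data points per split only approximates the requested fractions, which matches the note in the return description.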

mlip.data.helpers.data_split.split_data_by_group(data: list[Any], group_ids_by_split: tuple[set[str], set[str], set[str]], get_group_id_fun: Callable[[Any], str]) → tuple[list[Any], list[Any], list[Any]]

Assigns each data point to the train, validation, or test set according to its group ID. The group ID is obtained with get_group_id_fun and looked up in group_ids_by_split.

If a data point has a group ID that is not listed in any of the three splits (train, val, test), an exception is raised.

Parameters:

data – The data, given as a list of arbitrary objects.
group_ids_by_split – A tuple of three sets of group IDs, one per split, in the order (train, val, test).
get_group_id_fun – A function that takes one of the objects (data points) and returns its group ID as a string.

Returns:

A tuple of three lists holding the training, validation, and test sets, in that order. The latter two can be empty if the corresponding set of group IDs is empty.
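
The lookup described above can be sketched in a few lines. This is an illustration, not the mlip code: `split_by_group_sketch` is a hypothetical name, and `ValueError` is an assumed exception type, since the documentation does not say which exception is raised.

```python
from typing import Any, Callable


def split_by_group_sketch(
    data: list[Any],
    group_ids_by_split: tuple[set[str], set[str], set[str]],
    get_group_id_fun: Callable[[Any], str],
) -> tuple[list[Any], list[Any], list[Any]]:
    # One output list per split, in the order (train, val, test).
    splits: tuple[list[Any], list[Any], list[Any]] = ([], [], [])
    for item in data:
        group_id = get_group_id_fun(item)
        for split, allowed_ids in zip(splits, group_ids_by_split):
            if group_id in allowed_ids:
                split.append(item)
                break
        else:
            # No split lists this group ID (exception type is an assumption).
            raise ValueError(f"Group ID {group_id!r} is not assigned to any split.")
    return splits


frames = [("mol_a", 0), ("mol_b", 1), ("mol_a", 2), ("mol_c", 3)]
train, val, test = split_by_group_sketch(
    frames, ({"mol_a"}, {"mol_b"}, {"mol_c"}), lambda frame: frame[0]
)
```

In this sketch, a group ID listed in more than one set would go to the first matching split. Whether mlip allows overlapping sets at all is not stated in the documentation.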