Benchmark

class mlipaudit.benchmark.Benchmark(force_field: ForceField | Calculator, data_input_dir: str | PathLike = './data', run_mode: RunMode | Literal['dev', 'fast', 'standard'] = RunMode.STANDARD)

An Abstract Base Class for structuring MLIP benchmark calculations.

This class uses the Template Method pattern. Each concrete benchmark must implement the run_model and analyze methods. Benchmarks are designed to first call run_model and then analyze. Intermediate calculations generated by run_model are stored in the instance variable model_output. Results generated by analyze are returned as an instance of the benchmark's BenchmarkResult class.

Subclasses should also define the class attribute name, which gives the benchmark a unique name, and, if necessary, input_data_url, which specifies where any input data should be downloaded from.
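For illustration, a minimal concrete benchmark might look roughly as follows. The benchmark name, element set, data, and metric are hypothetical, and the sketch assumes that ModelOutput and BenchmarkResult behave like pydantic models (as the signatures below suggest) and that model_output can be assigned directly; only the base classes and the run_model/analyze contract come from this API.

    from mlipaudit.benchmark import Benchmark, BenchmarkResult, ModelOutput


    class ExampleModelOutput(ModelOutput):
        # Hypothetical container for raw per-structure energies.
        energies: list[float]


    class ExampleResult(BenchmarkResult):
        # Hypothetical extra metric alongside the inherited failed/score fields.
        mean_energy: float | None = None


    class ExampleBenchmark(Benchmark):
        name = "example_benchmark"          # unique CLI / output-folder name
        category = "General"
        result_class = ExampleResult
        model_output_class = ExampleModelOutput
        required_elements = {"H", "O"}      # elements in the (hypothetical) input files

        def run_model(self) -> None:
            # A real benchmark would run inference or simulations with
            # self.force_field; placeholder energies are stored here instead.
            self.model_output = ExampleModelOutput(energies=[-1.0, -2.0, -1.5])

        def analyze(self) -> ExampleResult:
            # Post-process the raw data from run_model and compute the final
            # score, which must lie between 0 and 1 (placeholder value here).
            energies = self.model_output.energies
            mean_energy = sum(energies) / len(energies)
            return ExampleResult(score=1.0, mean_energy=mean_energy)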

name

The unique benchmark name, used to run the benchmark from the CLI and to determine the output folder name for the result file.

Type:

str

category

A string that describes the category of the benchmark, used, for example, for grouping in the UI app. Defaults to “General” if not overridden.

Type:

str

result_class

A reference to the BenchmarkResult subclass that determines the return type of self.analyze().

Type:

type[mlipaudit.benchmark.BenchmarkResult] | None

model_output_class

A reference to the ModelOutput subclass used to store the outcome of self.run_model().

Type:

type[mlipaudit.benchmark.ModelOutput] | None

required_elements

The set of element types that are present in the benchmark’s input files.

Type:

set[str] | None

skip_if_elements_missing

Whether the benchmark should be skipped entirely if the model cannot handle some of the required element types. If False, the benchmark must implement its own logic to handle missing element types. Defaults to True.

Type:

bool

reusable_output_id

An optional ID (in the form of a tuple) that references other benchmarks with identical input systems and ModelOutput signatures. If present, a user or the CLI can use this information to reuse cached model outputs from another benchmark carrying the same ID instead of rerunning simulations or inference.

Type:

tuple[str, …] | None

__init__(force_field: ForceField | Calculator, data_input_dir: str | PathLike = './data', run_mode: RunMode | Literal['dev', 'fast', 'standard'] = RunMode.STANDARD) None

Initializes the benchmark.

Parameters:
  • force_field – The force field model to be benchmarked.

  • data_input_dir – The local input data directory. Defaults to “./data”. If the subdirectory “{data_input_dir}/{benchmark_name}” exists, the benchmark expects the relevant data to be there; otherwise, the data will be downloaded from HuggingFace.

  • run_mode – Whether to run the benchmark at its standard length, a faster version, or a very fast development version. Subclasses should ensure that under RunMode.DEV their benchmark runs in a much shorter timeframe, for instance by running on a reduced number of test cases. Making RunMode.FAST behave differently from RunMode.STANDARD is optional and only recommended for very long-running benchmarks. This argument can also be passed as the string “dev”, “fast”, or “standard”.

Raises:
  • ChemicalElementsMissingError – If initialization is attempted with a force field that cannot perform inference on the required elements.

  • ValueError – If the force field type is not compatible.
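A hedged usage sketch of the constructor, continuing the hypothetical ExampleBenchmark above. Any ASE Calculator can stand in for the force field; LennardJones is purely illustrative, and the data directory and run mode shown are just the documented defaults and string form.

    from ase.calculators.lj import LennardJones

    # An illustrative ASE Calculator; initialization may raise
    # ChemicalElementsMissingError if it cannot handle the benchmark's
    # required elements.
    calc = LennardJones()

    benchmark = ExampleBenchmark(
        force_field=calc,
        data_input_dir="./data",   # data downloaded from HuggingFace if the subfolder is absent
        run_mode="dev",            # string form of RunMode.DEV; much shorter run
    )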

abstractmethod run_model() None

Generates any necessary data with self.force_field.

Subclasses must implement this method. Raw data from simulations, single-point energy calculations, or other calculations will be stored in the instance variable model_output.

abstractmethod analyze() BenchmarkResult

Performs all post-inference or post-simulation analysis.

Subclasses must implement this method. It processes the raw data generated by run_model to compute final metrics. Subclasses are also responsible for computing the final score for the benchmark.

Returns:

A class-specific instance of BenchmarkResult.
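Putting the two abstract methods together, the call sequence assumed by the Template Method description above might look like this (continuing the sketch from the constructor example):

    benchmark.run_model()           # populates benchmark.model_output
    result = benchmark.analyze()    # returns a BenchmarkResult subclass instance

    if result.failed:
        print(f"{benchmark.name}: all simulations or inferences failed")
    else:
        print(f"{benchmark.name}: score = {result.score}")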

class mlipaudit.benchmark.BenchmarkResult(*, failed: bool = False, score: Annotated[float | None, Ge(ge=0), Le(le=1)] = None)

A base model for all benchmark results.

failed

Whether all the simulations or inferences failed and no analysis could be performed. Defaults to False.

Type:

bool

score

The final score for the benchmark, between 0 and 1.

Type:

float | None

class mlipaudit.benchmark.ModelOutput

A base model for all intermediate model outputs.
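The keyword-only signature of BenchmarkResult suggests a pydantic-style model. A small sketch of constructing results directly, where the field names come from the signature and the validation behaviour for out-of-range scores is an assumption based on the Ge/Le annotations:

    from mlipaudit.benchmark import BenchmarkResult

    ok = BenchmarkResult(score=0.85)       # failed defaults to False
    bad = BenchmarkResult(failed=True)     # score stays None when no analysis was possible

    # score is annotated with Ge(ge=0) and Le(le=1), so a value outside [0, 1]
    # is presumably rejected at construction time (assumption).
    # BenchmarkResult(score=1.5)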