Benchmark¶
- class mlipaudit.benchmark.Benchmark(force_field: ForceField | Calculator, data_input_dir: str | PathLike = './data', run_mode: RunMode | Literal['dev', 'fast', 'standard'] = RunMode.STANDARD)¶
An Abstract Base Class for structuring MLIP benchmark calculations.
This class uses the Template Method pattern. Each concrete benchmark must implement the run_model and analyze methods. Benchmarks are designed to first call run_model followed by analyze. Intermediate calculations generated by run_model will be stored in the instance variable model_output; results generated by analyze are returned as a class-specific BenchmarkResult. Subclasses should also define the class attribute name, giving the benchmark a unique name, as well as input_data_url, if necessary, specifying where any input data should be downloaded from. A minimal subclass sketch follows the attribute list below.
- name¶
The unique benchmark name that should be used to run the benchmark from the CLI and that will determine the output folder name for the result file.
- Type:
str
- category¶
A string that describes the category of the benchmark, used, for example, in the UI app for grouping. Default, if not overridden, is “General”.
- Type:
str
- result_class¶
A reference to the type of BenchmarkResult that will determine the return type of self.analyze().
- Type:
type[mlipaudit.benchmark.BenchmarkResult] | None
- model_output_class¶
A reference to the type of ModelOutput class that will be used to store the outcome of the self.run_model() function.
- Type:
type[mlipaudit.benchmark.ModelOutput] | None
- required_elements¶
The set of element types that are present in the benchmark’s input files.
- Type:
set[str] | None
- skip_if_elements_missing¶
Whether the benchmark should be skipped entirely if there are some element types that the model cannot handle. If False, the benchmark must have its own custom logic to handle missing element types. Defaults to True.
- Type:
bool
- reusable_output_id¶
An optional ID (in the form of a tuple) that references other benchmarks with identical input systems and ModelOutput signatures. If present, a user or the CLI can use this information to reuse cached model outputs from another benchmark carrying the same ID instead of rerunning simulations or inference.
- Type:
tuple[str, …] | None
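For illustration, a minimal concrete subclass might look like the sketch below. All names (ToyBenchmark, ToyOutput, ToyResult, the chosen elements) are hypothetical and not part of mlipaudit; the class attributes and the model_output contract follow the documentation above, and ModelOutput and BenchmarkResult are assumed to accept pydantic-style field declarations.

```python
from mlipaudit.benchmark import Benchmark, BenchmarkResult, ModelOutput


class ToyOutput(ModelOutput):
    """Raw data produced by run_model (illustrative)."""

    energies: list[float]


class ToyResult(BenchmarkResult):
    """Benchmark-specific result with one extra metric (illustrative)."""

    mae: float | None = None


class ToyBenchmark(Benchmark):
    name = "toy_benchmark"               # unique CLI / output-folder name
    category = "General"                 # grouping shown in the UI app
    result_class = ToyResult             # return type of analyze()
    model_output_class = ToyOutput       # type stored in self.model_output
    required_elements = {"H", "C", "O"}  # elements present in the input files
    skip_if_elements_missing = True      # skip if the model lacks an element

    def run_model(self) -> None: ...     # see the sketch under run_model below

    def analyze(self) -> ToyResult: ...  # see the sketch under analyze below
```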
- __init__(force_field: ForceField | Calculator, data_input_dir: str | PathLike = './data', run_mode: RunMode | Literal['dev', 'fast', 'standard'] = RunMode.STANDARD) → None¶
Initializes the benchmark. A usage sketch follows the parameter descriptions below.
- Parameters:
force_field – The force field model to be benchmarked.
data_input_dir – The local input data directory. Defaults to “./data”. If the subdirectory “{data_input_dir}/{benchmark_name}” exists, the benchmark expects the relevant data to be located there; otherwise, it will be downloaded from HuggingFace.
run_mode – Whether to run the benchmark at its standard length, a faster version, or a very fast development version. Subclasses should ensure that when RunMode.DEV is selected, their benchmark runs in a much shorter timeframe, for instance by running on a reduced number of test cases. Implementing RunMode.FAST differently from RunMode.STANDARD is optional and only recommended for very long-running benchmarks. This argument can also be passed as a string “dev”, “fast”, or “standard”.
- Raises:
ChemicalElementsMissingError – If initialization is attempted with a force field that cannot perform inference on the required elements.
ValueError – If the force field type is not compatible.
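A hypothetical construction and run, reusing ToyBenchmark from the sketch above. The EMT calculator from ASE is only a stand-in for a real MLIP; whether it covers the benchmark’s required elements is not guaranteed, so construction may raise ChemicalElementsMissingError.

```python
from ase.calculators.emt import EMT  # stand-in for a real MLIP calculator

benchmark = ToyBenchmark(EMT(), data_input_dir="./data", run_mode="dev")
benchmark.run_model()                 # populates benchmark.model_output
result = benchmark.analyze()          # returns a ToyResult
print(result.score)
```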
- abstractmethod run_model() → None¶
Generates any necessary data with self.force_field.
Subclasses must implement this method. Raw data from simulations, single-point energy calculations, or other types of calculations will be stored in the instance variable model_output.
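A possible body for ToyBenchmark.run_model from the sketch above. The _load_input_structures helper is hypothetical, and attaching self.force_field directly as an ASE calculator is an assumption for illustration.

```python
    # Method of ToyBenchmark (sketch).
    def run_model(self) -> None:
        energies: list[float] = []
        for atoms in self._load_input_structures():  # hypothetical helper
            atoms.calc = self.force_field             # assuming an ASE Calculator
            energies.append(atoms.get_potential_energy())
        # Store the raw data so analyze() can consume it later.
        self.model_output = ToyOutput(energies=energies)
```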
- abstractmethod analyze() → BenchmarkResult¶
Performs all post-inference or simulation analysis.
Subclasses must implement this method. It processes the raw data produced by run_model to compute the final metrics. Subclasses are also responsible for computing the final score for the benchmark.
- Returns:
A class-specific instance of BenchmarkResult.
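A possible body for ToyBenchmark.analyze. The reference values and the mapping of the error onto a [0, 1] score are placeholders, not mlipaudit’s actual scoring.

```python
    # Method of ToyBenchmark (sketch).
    def analyze(self) -> ToyResult:
        if not self.model_output.energies:
            # Nothing could be computed; report a failed benchmark.
            return ToyResult(failed=True)
        reference = [-1.0] * len(self.model_output.energies)  # placeholder targets
        mae = sum(
            abs(p - r) for p, r in zip(self.model_output.energies, reference)
        ) / len(reference)
        score = max(0.0, 1.0 - mae)  # clamp the metric into [0, 1]
        return ToyResult(score=score, mae=mae)
```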
- class mlipaudit.benchmark.BenchmarkResult(*, failed: bool = False, score: Annotated[float | None, Ge(ge=0), Le(le=1)] = None)¶
A base model for all benchmark results.
- failed¶
Whether all the simulations or inferences failed and no analysis could be performed. Defaults to False.
- Type:
bool
- score¶
The final score for the benchmark between 0 and 1.
- Type:
float | None
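Assuming BenchmarkResult behaves as its keyword-only signature above suggests, subclasses can add their own metric fields while inheriting failed and score; the extra field and values below are illustrative.

```python
from mlipaudit.benchmark import BenchmarkResult


class DimerCurveResult(BenchmarkResult):
    rmse: float | None = None  # illustrative extra metric


res = DimerCurveResult(score=0.75, rmse=0.12)
assert res.failed is False                                 # default when not set
assert res.score is not None and 0.0 <= res.score <= 1.0   # score constrained to [0, 1]
```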
- class mlipaudit.benchmark.ModelOutput¶
A base model for all intermediate model outputs.