Tutorial: Adding a new benchmark

Code pattern for our benchmarks

As can be seen in our main benchmarking script src/mlipaudit/main.py, the basic pattern of running our benchmarks in code is the following:

from mlipaudit.benchmarks import TautomersBenchmark
from mlipaudit.io import write_benchmark_result_to_disk
from mlip.models import Mace
from mlip.models.model_io import load_model_from_zip

force_field = load_model_from_zip(Mace, "./mace.zip")

benchmark = TautomersBenchmark(force_field)
benchmark.run_model()
result = benchmark.analyze()

write_benchmark_result_to_disk(
    TautomersBenchmark.name, result, "./results/mace"
)

After initializing a benchmark class, in this example TautomersBenchmark, we call run_model() to execute all inference calls and simulations required by the benchmark with the MLIP force field model. The raw output of this step is stored inside the benchmark instance. Next, we run analyze() to produce the final benchmarking results. This function returns a results object, which is always an instance of a class derived from BenchmarkResult, in this example TautomersResult. The function write_benchmark_result_to_disk then writes these results to disk in JSON format.

How to implement a new benchmark

Overview

A new benchmark class can easily be implemented as a derived class of the abstract base class Benchmark. The attributes and members to override are:

  • name: A unique name for the benchmark.

  • category: A string that represents the category of the benchmark. If not overridden, “General” is used. Currently, it is used exclusively for visualization in the GUI.

  • result_class: A reference to the results class of the benchmark. More details below.

  • model_output_class: A reference to the model output class of the benchmark. More details below.

  • required_elements: A set of element symbols that are required by a model to run this benchmark.

  • skip_if_elements_missing: Boolean that defaults to True and hence does not need to be overridden. However, if you want your benchmark to still run even if a model is missing some of the required elements, override it to False. A reason for this would be that parts of the benchmark can still run in that case, with the missing elements handled on a case-by-case basis inside the benchmark’s run function.

  • run_model: This method implements running all inference calls and simulations related to the benchmark and can take a significant amount of time to execute. As part of this, the raw output of the model should be stored in a model output class that needs to be implemented and must be derived from the base class ModelOutput, which is a pydantic model (it works similarly to dataclasses but with type validation and serialization built in). An instance of this model output class is then assigned to the instance attribute self.model_output.

  • analyze: This method implements the analysis of the raw model output and returns the benchmark results. This works analogously to the model output: the results must be an instance of a class derived from BenchmarkResult (also a pydantic model).

Hence, to add a new benchmark, three classes must be implemented: the benchmark class, the model output class, and the results class.

Note that we also recommend that a new benchmark implements a very minimal version of itself that is run if self.run_mode == RunMode.DEV. For very long-running benchmarks, we also recommend implementing a version for self.run_mode == RunMode.FAST that may differ from self.run_mode == RunMode.STANDARD; for most benchmarks, however, this is not necessary. A sketch of this pattern is shown after the minimal example below.

Minimal example implementation

Here is an example of a very minimal new benchmark implementation:

import functools
from mlipaudit.benchmark import Benchmark, BenchmarkResult, ModelOutput

class NewResult(BenchmarkResult):
    errors: list[float]

class NewModelOutput(ModelOutput):
    energies: list[float]

class NewBenchmark(Benchmark):
    name = "new_benchmark"
    category = "New category"
    result_class = NewResult
    model_output_class = NewModelOutput
    required_elements = {"H", "N", "O", "C"}

    def run_model(self) -> None:
        energies = _compute_energies_blackbox(self.force_field, self._data)
        self.model_output = NewModelOutput(energies=energies)

    def analyze(self) -> NewResult:
        score, errors = _analyze_blackbox(self.model_output, self._data)
        return NewResult(score=score, errors=errors)

    @functools.cached_property
    def _data(self) -> dict:
        data_path = self.data_input_dir / self.name / "new_benchmark_data.json"
        return _load_data_blackbox(data_path)

Implementing the data loading as a cached property is only recommended if the loaded data is needed in both the run_model() and the analyze() functions.

Note that the functions _compute_energies_blackbox and _analyze_blackbox are placeholders for the actual implementations.
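
For long-running benchmarks, run_model() can additionally branch on self.run_mode, as recommended above. Below is a hedged sketch of that pattern; the import location of RunMode, the structure of the loaded data, and the subset sizes are assumptions for illustration:

from mlipaudit.benchmark import RunMode  # assumed import location of RunMode

class NewBenchmark(Benchmark):
    # ... class attributes as in the minimal example above ...

    def run_model(self) -> None:
        structures = self._data["structures"]  # assumed layout of the loaded data
        if self.run_mode == RunMode.DEV:
            # Tiny subset so the benchmark finishes quickly during development.
            structures = structures[:2]
        elif self.run_mode == RunMode.FAST:
            # Reduced workload that still produces a representative result.
            structures = structures[:20]
        # RunMode.STANDARD falls through and uses the full workload.
        energies = _compute_energies_blackbox(self.force_field, structures)
        self.model_output = NewModelOutput(energies=energies)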

Another class attribute that can optionally be specified is reusable_output_id, which is None by default. It can be used to signal that two benchmarks use the exact same run_model() method and the exact same signature for the model output class. This ID should be a tuple containing the names of those benchmarks; see the benchmarks Sampling and FoldingStability for an example. See the source code of the main benchmarking script for how it reuses the model output of one benchmark for the other without rerunning any simulation or inference.
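
As an illustration only (the benchmark names below are made up), the shared ID could be declared like this on both benchmarks:

class FirstBenchmark(Benchmark):
    name = "first_benchmark"
    # Tuple containing the names of all benchmarks that share the model output.
    reusable_output_id = ("first_benchmark", "second_benchmark")
    # ... remaining attributes and methods ...

class SecondBenchmark(Benchmark):
    name = "second_benchmark"
    reusable_output_id = ("first_benchmark", "second_benchmark")
    # ... remaining attributes and methods ...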

Furthermore, you need to add an import for your benchmark to the src/mlipaudit/benchmarks/__init__.py file such that the benchmark can be automatically picked up by the CLI tool.
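
For example, assuming the new benchmark is implemented in src/mlipaudit/benchmarks/new_benchmark.py, the added import could look like this (match the existing style of the file):

# In src/mlipaudit/benchmarks/__init__.py
from mlipaudit.benchmarks.new_benchmark import NewBenchmark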

Data

The benchmark base class downloads the input data for a benchmark from HuggingFace automatically if it does not yet exist locally. As you can see in the minimal example above, the benchmark expects the data to be in the directory self.data_input_dir / self.name. Therefore, if you place your data in this directory before initializing the benchmark, it will not try to download anything from HuggingFace. This mechanism allows the data to be provided in custom ways.
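
For example, here is a hedged sketch of providing the data manually; the local directory below stands in for whatever self.data_input_dir resolves to in your setup, and the file name follows the minimal example above:

import shutil
from pathlib import Path

# Place a local input file where the benchmark expects it, i.e. in
# <data_input_dir>/<benchmark name>/, so that no download is attempted.
data_input_dir = Path("./benchmark_data")  # assumed configured input directory
target_dir = data_input_dir / NewBenchmark.name
target_dir.mkdir(parents=True, exist_ok=True)
shutil.copy("my_local_data.json", target_dir / "new_benchmark_data.json")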

UI page

To create a new benchmark UI page, we refer to the existing implementations located in src/mlipaudit/ui. The basic idea is that a page is represented by a function like this:

from typing import Callable

def new_benchmark_page(
    data_func: Callable[[], dict[str, NewResult]],
) -> None:
    data = data_func()  # data is a dictionary of model names and results

    # add rest of UI page implementation here
    pass

The implementation must be a valid Streamlit page.
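
As a hedged sketch (the page layout is purely illustrative, and the score field follows the results class from the minimal example above), the body could for instance display one score per model using standard Streamlit calls:

from typing import Callable

import streamlit as st

def new_benchmark_page(
    data_func: Callable[[], dict[str, NewResult]],
) -> None:
    data = data_func()  # dictionary of model names and results

    # Illustrative page body: a title and a table with one score per model.
    st.title("New benchmark")
    st.table(
        {
            "model": list(data.keys()),
            "score": [result.score for result in data.values()],
        }
    )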

In order for this page to be automatically included in the UI app, you need to wrap this new benchmark page in a derived class of UIPageWrapper, like this:

class NewBenchmarkPageWrapper(UIPageWrapper):

    @classmethod
    def get_page_func(cls):
        return new_benchmark_page

    @classmethod
    def get_benchmark_class(cls):
        return NewBenchmark

Then make sure to add the import of your new benchmark page to the src/mlipaudit/ui/__init__.py file. This will result in your benchmark’s UI page being automatically picked up and displayed.
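
Analogous to the benchmark import above, the added line could look like this, assuming the page wrapper is implemented in src/mlipaudit/ui/new_benchmark_page.py:

# In src/mlipaudit/ui/__init__.py
from mlipaudit.ui.new_benchmark_page import NewBenchmarkPageWrapper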

How to run the new benchmark

Note that, because you need to modify some existing source code files of mlipaudit to include your new benchmarks, this cannot be achieved purely with the pip-installed library. Instead, we recommend cloning or forking our repository and running this local version after adding your own benchmarks with minimal code changes, as explained above.
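
Once the repository is cloned and installed locally (for example via an editable pip install), the new benchmark can be run with the same pattern shown at the top of this tutorial; the model and output paths below are illustrative:

from mlipaudit.benchmarks import NewBenchmark
from mlipaudit.io import write_benchmark_result_to_disk
from mlip.models import Mace
from mlip.models.model_io import load_model_from_zip

# Same pattern as for the built-in benchmarks; paths are illustrative.
force_field = load_model_from_zip(Mace, "./mace.zip")

benchmark = NewBenchmark(force_field)
benchmark.run_model()
result = benchmark.analyze()

write_benchmark_result_to_disk(NewBenchmark.name, result, "./results/mace")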