Tutorial: CLI tools¶
After installation and activation of the respective Python environment, the command line
tool mlipaudit should be available with two tasks:
- mlipaudit benchmark: The benchmarking CLI task. It runs the full or partial benchmark suite for one or more models. Results will be stored locally in multiple JSON files in an intuitive directory structure.
- mlipaudit gui: The UI app for visualization of the results. Running it opens a browser window and displays the web app. The implementation is based on streamlit.
Benchmarking task¶
The benchmarking CLI task is invoked by running
mlipaudit benchmark [OPTIONS]
and has the following command line options:
- -h / --help: Prints usage information for the tool to the terminal.
- -m / --models: Paths to the model zip archives or to Python files with external model definitions (as either ASE calculator or ForceField objects). If multiple are specified, the tool runs the benchmark suite for all of them sequentially. The zip archives for the models must follow the convention that the model name (one of mace, visnet, nequip as of mlip v0.1.3) must be part of the zip file name, so that the app knows which model architecture to load the model into. For example, model_mace_123_abc.zip is allowed. For more information about providing your own models as ASE calculators or mlip-compatible ForceField classes, see the Providing external models section.
- -o / --output: Path to an output directory. The tool will write the results to this directory. Inside the directory, there will be subdirectories for each model and then subdirectories for each benchmark. Each benchmark directory will hold a result.json file with the benchmark results.
- -i / --input: Optional setting for the path to an input data directory. If it does not exist, each benchmark will download its data from HuggingFace automatically. If the data has already been downloaded once, it will not be re-downloaded. The default is the local directory ./data.
- -b / --benchmarks: Optional setting to specify which benchmarks to run. Accepts a list of benchmark names (e.g., dihedral_scan, ring_planarity) or all to run every available benchmark. Default: all, i.e., if the flag is omitted, all benchmarks run. This option is mutually exclusive with -e.
- -e / --exclude: Optional setting to specify which benchmarks to exclude. Works analogously to -b and is mutually exclusive with it.
- -rm / --run-mode: Optional setting that allows running faster versions of the benchmark suite. The default option standard runs the entire suite. The option fast runs a slightly faster version: it runs fewer test cases for most benchmarks and reduces the number of steps for benchmarks requiring long molecular dynamics simulations. The option dev runs a very minimal version of each benchmark for development and testing purposes; benchmarks requiring molecular dynamics simulations are run with minimal steps.
- -v / --verbose: Optional flag to enable verbose logging from the mlip library code.
- -lt / --log-timing: Optional flag to enable logging of the run time for each benchmark.
For example, if you want to run the entire benchmark suite for two models, say
visnet_1 and mace_2, use this command:
mlipaudit benchmark -m /path/to/visnet_1.zip /path/to/mace_2.zip -o /path/to/output
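Similarly, to run only a subset of the benchmarks in a faster run mode, the -b and -rm options can be combined. For example (assuming the same space-separated list syntax as for -m, and using the benchmark names mentioned above):

mlipaudit benchmark -m /path/to/visnet_1.zip -o /path/to/output -b dihedral_scan ring_planarity -rm fast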
The output directory then contains an intuitive folder structure of models and
benchmarks with the aforementioned result.json files. Each of these files will
contain the results for multiple metrics and possibly multiple test systems in
human-readable format. The JSON schema can be understood by investigating the
corresponding BenchmarkResult class, which is referenced in the result_class attribute
of a given benchmark in the API reference. For example, ConformerSelectionResult
is the result class for the conformer selection benchmark.
Furthermore, each result will also include a score that reflects the model’s performance on the benchmark on a scale of 0 to 1. For information on what this score means for a given benchmark, we refer to the Benchmarks subsection of this documentation.
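As a minimal sketch of how these results could be post-processed, the following script walks the output directory and prints the score stored in each result.json. Note that the exact JSON keys depend on the respective BenchmarkResult class; the top-level score key used here is an assumption, so adapt it to the actual schema.

import json
from pathlib import Path

output_dir = Path("/path/to/output")  # the directory passed via -o

# Iterate over the <output>/<model>/<benchmark>/result.json files
for result_file in sorted(output_dir.glob("*/*/result.json")):
    model_name = result_file.parent.parent.name
    benchmark_name = result_file.parent.name
    with open(result_file) as file:
        result = json.load(file)
    # "score" is an assumed key; consult the BenchmarkResult class of the
    # benchmark in the API reference for the exact schema.
    print(f"{model_name} / {benchmark_name}: score = {result.get('score')}")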
UI app¶
We provide a graphical user interface to visualize the benchmark results stored in
the output directory, e.g., /path/to/output (see the example above). The app is
web-based and can be launched by running
mlipaudit gui /path/to/output
in the terminal. This should open a browser window automatically. More information
can be obtained by running mlipaudit gui -h.
The landing page of the app provides some basic information about the app and a table of all evaluated models with their overall scores.
In the left sidebar, you can then select a specific benchmark to compare the models on each one individually. If you have not run a given benchmark, the UI page for that benchmark will state that data is missing.
Providing external models¶
Instead of providing models via .zip archives holding models compatible with the
mlip library, any model can also be provided as long as it is implemented as an
ASE calculator and has an attribute allowed_atomic_numbers of type set[int].
Note that the calculator must implement at least the properties "energy" and "forces".
Alternatively, the external model can follow the ForceField API of the
mlip library (for reference, see the documentation of this class).
This is useful if you have implemented your own MLIP architecture compatible with
the mlip library, but not natively included in it. If your model is implemented in
JAX, we strongly recommend interfacing it in this way, because this allows making
use of highly efficient JAX-MD-based simulations and batched inference in the
benchmark executions. However, if your model is
implemented in PyTorch or another framework, providing it as an ASE calculator is your
best option.
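To illustrate what such an ASE calculator could look like, here is a minimal sketch of a module my_module defining MyCalculator. This is not part of MLIPAudit; the constant energy and zero forces are placeholders that a real model would replace with its own predictions.

import numpy as np
from ase.calculators.calculator import Calculator, all_changes

class MyCalculator(Calculator):
    """Illustrative ASE calculator returning placeholder energies and forces."""

    implemented_properties = ["energy", "forces"]

    def calculate(self, atoms=None, properties=["energy"], system_changes=all_changes):
        super().calculate(atoms, properties, system_changes)
        # A real MLIP model would compute these values from the atomic structure.
        self.results["energy"] = 0.0
        self.results["forces"] = np.zeros((len(self.atoms), 3))

The allowed_atomic_numbers attribute required by MLIPAudit is attached in the model file shown below, but it could equally be set inside the calculator's __init__ method.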
For example, let’s assume your model is implemented as an ASE calculator in a module
my_module as MyCalculator. In this case, you can provide the following code
as a model file my_model.py:
from my_module import MyCalculator
kwargs = {} # whatever your configuration is
mlipaudit_external_model = MyCalculator(**kwargs)
# Defining that your model can handle H, C, N, and O atoms
setattr(mlipaudit_external_model, "allowed_atomic_numbers", {1, 6, 7, 8})
Note that in this file, the calculator instance must be initialized and assigned
to a variable that is named mlipaudit_external_model.
You can now run your benchmarks like this:
mlipaudit benchmark -m /path/to/my_model.py -o /path/to/output
Note that the name assigned to the model will be my_model.
We emphasize that if the object assigned to the variable mlipaudit_external_model
is neither an ASE calculator nor a ForceField (from the mlip API),
a ValueError is raised.
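As an optional sanity check before starting a long benchmark run, you can load the model file yourself and verify that it exposes a suitable object under the expected variable name. This is only a sketch for the ASE calculator case, not part of the mlipaudit CLI:

import importlib.util
from ase.calculators.calculator import Calculator

# Load the model file as a regular Python module.
spec = importlib.util.spec_from_file_location("my_model", "/path/to/my_model.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

model = module.mlipaudit_external_model
assert isinstance(model, Calculator), "expected an ASE calculator instance"
assert isinstance(model.allowed_atomic_numbers, set), "missing allowed_atomic_numbers"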
If the provided model implementation is based on PyTorch or another deep learning framework that comes with its own CUDA dependencies, we strongly recommend not installing the CUDA-based JAX version in the same environment, to avoid dependency conflicts. When running external models, however, MLIPAudit does not require any compute-heavy JAX operations, hence relying on the CPU version of JAX is not an issue in this case.
Note
MLIPAudit is not optimized for using external models via the ASE calculator
interface. Hence, it is to be expected that benchmarks can take significantly longer
compared to using JAX-based and mlip-compatible models loaded via
.zip archives.