.. _tutorial_cli:

Tutorial: CLI tools
===================

After installing the package and activating the respective Python environment, the
command line tool `mlipaudit` should be available with two tasks:

* `mlipaudit benchmark`: The **benchmarking CLI task**. It runs the full or a partial
  benchmark suite for one or more models. Results are stored locally in multiple
  JSON files in an intuitive directory structure.
* `mlipaudit gui`: The **UI app** for visualizing the results. Running it opens a
  browser window and displays the web app. The implementation is based
  on `streamlit `_.
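
For example, a typical session might consist of running the benchmark suite for a
model first and then inspecting the results in the UI app (all paths below are
placeholders):

.. code-block:: bash

    # Run the benchmark suite for a model and write results to ./results
    mlipaudit benchmark -m /path/to/model_mace_123_abc.zip -o ./results

    # Visualize the results in the browser
    mlipaudit gui ./results

Both tasks are described in more detail in the following sections.
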
Benchmarking task
-----------------

The benchmarking CLI task is invoked by running

.. code-block:: bash

    mlipaudit benchmark [OPTIONS]

and has the following command line options:

* `-h / --help`: Prints usage information to the terminal.
* `-m / --models`: Paths to the
  `model zip archives `_
  or to Python files with external model definitions (as either
  `ASE calculator `_ or
  `ForceField `_
  objects). If multiple paths are specified, the tool runs the benchmark suite for all
  of them sequentially. The zip archives must follow the naming convention that the
  model architecture name (one of `mace`, `visnet`, or `nequip` as of *mlip v0.1.3*)
  is part of the zip file name, so that the app knows which model architecture to
  load the model into. For example, `model_mace_123_abc.zip` is allowed. For more
  information about providing your own models as ASE calculators or *mlip*-compatible
  `ForceField` classes, see the :ref:`ext_model_tutorial` section.
* `-o / --output`: Path to an output directory to which the tool writes the results.
  Inside this directory, there is one subdirectory per model, which in turn contains
  one subdirectory per benchmark. Each benchmark directory holds a `result.json` file
  with the benchmark results.
* `-i / --input`: *Optional* setting for the path to an input data directory.
  If it does not exist, each benchmark will download its data
  from `HuggingFace `_
  automatically. Data that has already been downloaded once will not be
  re-downloaded. The default is the local directory `./data`.
* `-b / --benchmarks`: *Optional* setting to specify which benchmarks to run. Accepts a
  list of benchmark names (e.g., `dihedral_scan`, `ring_planarity`) or `all` to
  run every available benchmark. The default is `all`, i.e., if the flag is omitted,
  all benchmarks run. This option is mutually exclusive with `-e` (see also the
  combined example below).
* `-e / --exclude`: *Optional* setting to specify which benchmarks to exclude. Works
  analogously to `-b` and is mutually exclusive with it.
* `-rm / --run-mode`: *Optional* setting that allows running faster versions of the
  benchmark suite. The default option `standard` runs the entire suite. The option
  `fast` runs a slightly faster version: it uses fewer test cases for most benchmarks
  and reduces the number of steps for benchmarks requiring long molecular dynamics
  simulations. The option `dev` runs a very minimal version of each benchmark for
  development and testing purposes; benchmarks requiring molecular dynamics
  simulations are run with minimal steps.
* `-v / --verbose`: *Optional* flag to enable verbose logging
  from the `mlip `_ library code.
* `-lt / --log-timing`: *Optional* flag to enable logging of the run time for each
  benchmark.

For example, if you want to run the entire benchmark suite for two models, say
`visnet_1` and `mace_2`, use this command:

.. code-block:: bash

    mlipaudit benchmark -m /path/to/visnet_1.zip /path/to/mace_2.zip -o /path/to/output
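
The optional flags can be combined freely. As a sketch, the hypothetical invocation
below runs only two selected benchmarks in the `fast` run mode, reads input data from
a custom directory, and logs per-benchmark run times (assuming that `-b` accepts
space-separated names in the same way `-m` accepts multiple paths):

.. code-block:: bash

    mlipaudit benchmark \
        -m /path/to/visnet_1.zip \
        -o /path/to/output \
        -i /path/to/data \
        -b dihedral_scan ring_planarity \
        -rm fast \
        -lt
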
The output directory then contains an intuitive folder structure of models and
benchmarks with the aforementioned `result.json` files. Each of these files
contains the results for multiple metrics and possibly multiple test systems in
a human-readable format. The JSON schema can be understood by inspecting the
corresponding :py:class:`BenchmarkResult ` class
that is referenced by
the :py:meth:`result_class ` attribute
of a given benchmark in the :ref:`api_reference`. For example,
:py:class:`ConformerSelectionResult `
is the result class for the conformer selection benchmark.
Furthermore, each result also includes a score that reflects the
model's performance on the benchmark on a scale of 0 to 1. For information on what
this score means for a given benchmark, we refer to the :ref:`benchmarks` subsection
of this documentation.
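
As a small illustration of the layout, the following sketch collects the scores from
all `result.json` files in an output directory. It only assumes the directory
structure described above (one subdirectory per model, one per benchmark) and that
each result JSON exposes its score under a top-level `score` key; consult the
respective result class for the exact schema.

.. code-block:: python

    import json
    from pathlib import Path

    output_dir = Path("/path/to/output")

    # Iterate over <output>/<model>/<benchmark>/result.json files
    for result_file in sorted(output_dir.glob("*/*/result.json")):
        model_name = result_file.parent.parent.name
        benchmark_name = result_file.parent.name
        with result_file.open() as file:
            result = json.load(file)
        # "score" is an assumed key here; inspect the file or the corresponding
        # BenchmarkResult class for the exact field names
        score = result.get("score", "n/a")
        print(f"{model_name:>20} | {benchmark_name:>25} | {score}")
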
UI app
------

We provide a graphical user interface to visualize the benchmark results located in
the output directory (`/path/to/output` in the example above). The app is web-based
and can be launched by running

.. code-block:: bash

    mlipaudit gui /path/to/output

in the terminal. This should open a browser window automatically. More information
can be obtained by running `mlipaudit gui -h`.

The landing page of the app provides some basic information about the app and a table
of all evaluated models with their overall scores.
In the left sidebar, you can then select a specific benchmark to compare the models on
it individually. If you have not run a given benchmark, the page for that benchmark
will indicate that its data is missing.

.. _ext_model_tutorial:

Providing external models
-------------------------

Instead of providing models via `.zip` archives holding models compatible with the
`mlip `_ library, any model can also be provided as long as it is
implemented as an
`ASE calculator `_ and has
an attribute `allowed_atomic_numbers` of type `set[int]`. Note that the calculator
must implement at least the properties `"energy"` and `"forces"`.
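
As a rough sketch, such a calculator could look like the example below. The class name
and the zero-valued energies and forces are placeholders for an actual model; only the
`allowed_atomic_numbers` attribute and the implemented properties correspond to the
requirements described above.

.. code-block:: python

    import numpy as np
    from ase.calculators.calculator import Calculator, all_changes


    class MyToyCalculator(Calculator):
        """Toy ASE calculator returning zero energies and forces (placeholder)."""

        implemented_properties = ["energy", "forces"]

        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            # Required by MLIPAudit: the elements the model can handle (H, C, N, O)
            self.allowed_atomic_numbers = {1, 6, 7, 8}

        def calculate(
            self, atoms=None, properties=("energy",), system_changes=all_changes
        ):
            super().calculate(atoms, properties, system_changes)
            # A real model would compute these quantities from `self.atoms`
            self.results = {
                "energy": 0.0,
                "forces": np.zeros((len(self.atoms), 3)),
            }
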
The external model can alternatively follow the `ForceField` API of the
`mlip `_ library (for reference, see
the documentation of this class
`here `_).
This is useful if you have implemented your own MLIP architecture that is compatible
with the *mlip* library but not natively included in it. If your model is implemented
in JAX, we strongly recommend interfacing it in this way, because it allows the
benchmarks to make use of highly efficient JAX-MD based simulations and batched
inference. However, if your model is implemented in PyTorch or another framework,
providing it as an ASE calculator is your best option.

For example, let's assume your model is implemented as an ASE calculator named
`MyCalculator` in a module `my_module`. In this case, you can provide the following
code as a model file `my_model.py`:

.. code-block:: python

    from my_module import MyCalculator

    kwargs = {}  # whatever your configuration is
    mlipaudit_external_model = MyCalculator(**kwargs)

    # Defining that your model can handle H, C, N, and O atoms
    setattr(mlipaudit_external_model, "allowed_atomic_numbers", {1, 6, 7, 8})

Note that in this file, the calculator instance must be initialized and assigned
to a variable named `mlipaudit_external_model`.

You can now run your benchmarks like this:

.. code-block:: bash

    mlipaudit benchmark -m /path/to/my_model.py -o /path/to/output

Note that the name assigned to the model will be `my_model`.

We emphasize that if the object assigned to the variable `mlipaudit_external_model`
is neither an ASE calculator nor a `ForceField` (from the *mlip* API),
a `ValueError` is raised.

If the provided model implementation is based on PyTorch or another deep learning
framework that comes with its own CUDA dependencies, we strongly recommend not
installing the CUDA-based JAX version in the same environment to avoid dependency
conflicts. However, when running external models, MLIPAudit does not require any
compute-heavy JAX operations, so relying on the CPU version of JAX is not an issue
in this case.

.. note::

    MLIPAudit is not optimized for using external models via the ASE calculator
    interface. Hence, benchmarks can be expected to take significantly longer
    compared to using JAX-based, `mlip`-compatible models loaded via `.zip` archives.