Model Scores

To enable consistent and fair comparison across models, we define a composite score that aggregates performance over all compatible benchmarks. Each benchmark \(b \in \mathcal{B}\) may report one or more metrics \(x_{m,b}^{(i)}\), where \(i = 1, \ldots, N_b\) indexes the \(N_b\) metrics evaluated for the model \(m\). For each metric, we compute a normalized score using a soft thresholding function based on a DFT-derived reference tolerance \(t_b^{(i)}\) (see Table 1 below):

\[\begin{split}s_{m,b}^{(i)} = \begin{cases} 1, & \text{if } x_{m,b}^{(i)} \leq t_b^{(i)} \\ \exp\left(-\alpha \cdot \frac{x_{m,b}^{(i)} - t_b^{(i)}}{t_b^{(i)}}\right), & \text{otherwise} \end{cases}\end{split}\]

where \(\alpha\) is a tunable parameter controlling the steepness of the penalty (e.g., \(\alpha = 3\)). The per-benchmark score is then computed as the average over all its metric scores:

\[s_{m,b} = \frac{1}{N_b} \sum_{i=1}^{N_b} s_{m,b}^{(i)}\]
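The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names `metric_score` and `benchmark_score` are hypothetical, the default `alpha = 3.0` simply mirrors the example value given in the text, and the metric is assumed to be an error-like quantity with a positive tolerance.

```python
import math
from typing import Sequence


def metric_score(x: float, t: float, alpha: float = 3.0) -> float:
    """Soft-threshold score for one metric value x with tolerance t.

    Returns 1.0 when x is at or below the tolerance, and decays
    exponentially with the relative excess (x - t) / t otherwise.
    """
    if t <= 0:
        raise ValueError("tolerance t must be positive")
    if x <= t:
        return 1.0
    return math.exp(-alpha * (x - t) / t)


def benchmark_score(
    values: Sequence[float],
    tolerances: Sequence[float],
    alpha: float = 3.0,
) -> float:
    """Average of the metric scores of one benchmark (hypothetical helper)."""
    if len(values) != len(tolerances) or len(values) == 0:
        raise ValueError("values and tolerances must be non-empty and equal in length")
    scores = [metric_score(x, t, alpha) for x, t in zip(values, tolerances)]
    return sum(scores) / len(scores)
```

With \(\alpha = 3\), a metric at exactly twice its tolerance scores \(e^{-3} \approx 0.05\), so the penalty falls off quickly once the tolerance is exceeded.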

Let \(\mathcal{B}_m \subseteq \mathcal{B}\) denote the subset of benchmarks for which the model \(m\) has valid data (i.e., benchmarks compatible with its element set). The final model score is the mean over all benchmarks on which the model could be evaluated:

\[S_m = \frac{1}{|\mathcal{B}_m|} \sum_{b \in \mathcal{B}_m} s_{m,b}\]
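The final aggregation is a plain mean, as the following sketch shows. The function name `model_score` and the benchmark keys in the usage example are hypothetical; the mapping is assumed to contain exactly the benchmarks in \(\mathcal{B}_m\), and the sketch does not decide which benchmarks belong to that set.

```python
from typing import Mapping


def model_score(benchmark_scores: Mapping[str, float]) -> float:
    """Mean of the per-benchmark scores s_{m,b} over the benchmarks provided.

    The caller is assumed to pass only the benchmarks in B_m.
    """
    if not benchmark_scores:
        raise ValueError("at least one benchmark score is required")
    return sum(benchmark_scores.values()) / len(benchmark_scores)


# Usage with made-up per-benchmark scores (illustrative values only).
example_scores = {"dihedral_scan": 1.0, "conformer_selection": 0.5}
print(model_score(example_scores))  # 0.75
```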

This scoring framework rewards models for meeting or exceeding DFT-level accuracy. In the current version, a benchmark is skipped entirely if the model does not support all the chemical elements needed to run every test case. This applies to all benchmarks except non-covalent interactions, where the exclusion is applied per test case. When a benchmark is not run, \(s_{m,b} = 0\) is assigned. Benchmarks with multiple metrics contribute proportionally, and the result is a single interpretable score \(S_m \in [0,1]\) that balances physical fidelity, chemical coverage, and overall model robustness.

The thresholds for the different benchmarks were chosen based on the literature. For tautomers, energy differences are very small, so we chose a stricter threshold than the 1-2 kcal/mol range, which is too coarse for classification. Thresholds for biomolecules are borrowed from the traditional molecular modeling literature.

Table 1: Score thresholds across benchmarks

Benchmark	Metric	Threshold
Reference Geometry Stability	RMSD (Å)	0.075 [1]
Non-covalent Interactions	Absolute deviation from reference interaction energy (kcal/mol)	1.0 [1]
Dihedral Scan	Mean barrier error (kcal/mol)	1.0 [2]
Conformer Selection	MAE (kcal/mol), RMSE (kcal/mol)	0.5, 1.5 [3]
Tautomers	Absolute deviation (ΔG)	0.05
Ring Planarity	Deviation from plane (Å)	0.05 [4]
Bond Length Distribution	Avg. fluctuation (Å)	0.05 [1]
Reactivity-TST	Activation Energy (kcal/mol), Enthalpy (kcal/mol)	3.0 [5], 2.0 [5]
Reactivity-NEB	Final force convergence (eV/Å)	0.05 [6]
Radial Distribution Function	RMSE (Å)	0.1 [7]
Protein Sampling Outliers	Ramachandran ratio, Rotamers ratio	0.1, 0.03
Protein Folding Stability	min(RMSD) (Å), max(TM-Score)	2.0, 0.5
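
As an illustration of how the thresholds in Table 1 enter the score, consider a hypothetical model whose Conformer Selection results are MAE = 0.4 kcal/mol and RMSE = 2.0 kcal/mol (made-up values, not measured results), evaluated with \(\alpha = 3\). The MAE is within its tolerance of 0.5, while the RMSE exceeds its tolerance of 1.5:

\[s^{(1)} = 1, \qquad s^{(2)} = \exp\left(-3 \cdot \frac{2.0 - 1.5}{1.5}\right) = e^{-1} \approx 0.368\]

\[s_{m,b} = \frac{1}{2}\left(1 + 0.368\right) \approx 0.684\]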

References