Model Scores
To enable consistent and fair comparison across models, we define a composite score that aggregates performance over all compatible benchmarks. Each benchmark \(b \in \mathcal{B}\) may report one or more metrics \(x_{m,b}^{(i)}\), where \(i = 1, \ldots, N_b\) indexes the \(N_b\) metrics evaluated for the model \(m\). For each metric, we compute a normalized score using a soft thresholding function based on a DFT-derived reference tolerance \(t_b^{(i)}\) (see Table 1 below):
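One form consistent with this description, given here as an assumption rather than as the definitive definition, awards full credit within the tolerance and decays exponentially beyond it for a lower-is-better metric. The symbol \(s_{m,b}^{(i)}\) for the normalized metric score is likewise assumed notation:

\[
s_{m,b}^{(i)} =
\begin{cases}
1, & x_{m,b}^{(i)} \le t_b^{(i)}, \\
\exp\!\left(-\alpha \, \dfrac{x_{m,b}^{(i)} - t_b^{(i)}}{t_b^{(i)}}\right), & \text{otherwise,}
\end{cases}
\]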
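where \(\alpha\) is a tunable parameter controlling the steepness of the penalty (e.g., \(\alpha = 3\)). The per-benchmark score is then computed as the average over all its metric scores:

\[
s_{m,b} = \frac{1}{N_b} \sum_{i=1}^{N_b} s_{m,b}^{(i)},
\]

where \(s_{m,b}^{(i)}\) denotes the normalized score of metric \(i\).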
Let \(\mathcal{B}_m \subseteq \mathcal{B}\) denote the subset of benchmarks for which the model \(m\) has valid data (i.e., benchmarks compatible with its element set). The final model score is the mean over all benchmarks on which the model could be evaluated:

\[
S_m = \frac{1}{|\mathcal{B}_m|} \sum_{b \in \mathcal{B}_m} s_{m,b}
\]
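The two averaging steps, together with one possible soft threshold, can be sketched in Python. The function names and the exponential-decay form used in `metric_score` are assumptions made for illustration, not the project's actual implementation, and every metric is taken to be lower-is-better.

```python
import math
from statistics import mean


def metric_score(x, t, alpha=3.0):
    """Soft-threshold score in [0, 1] for a lower-is-better metric.

    ASSUMPTION: full credit within the tolerance t, exponential decay
    with steepness alpha beyond it. The true functional form may differ.
    """
    if x <= t:
        return 1.0
    return math.exp(-alpha * (x - t) / t)


def benchmark_score(values, tolerances, alpha=3.0):
    """Average the metric scores of one benchmark (s_{m,b})."""
    return mean(metric_score(x, t, alpha) for x, t in zip(values, tolerances))


def model_score(benchmark_scores):
    """Mean over the benchmarks the model was evaluated on (S_m)."""
    return mean(benchmark_scores.values())


# Hypothetical error values, scored against the Table 1 tolerances.
scores = {
    "Ring Planarity": benchmark_score([0.04], [0.05]),  # within tolerance
    "Dihedral Scan": benchmark_score([1.5], [1.0]),     # 50% over tolerance
}
print(model_score(scores))
```

Because each benchmark is averaged over its own metrics before the outer mean is taken, a benchmark with many metrics carries the same weight in \(S_m\) as one with a single metric.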
This scoring framework rewards models for meeting or exceeding DFT-level accuracy. In the current version, a benchmark is skipped entirely if the model does not support all the chemical elements needed to run every test case. The one exception is non-covalent interactions, where compatibility is checked per test case. When a benchmark is not run, \(s_{m,b} = 0\) is assigned. Benchmarks with multiple metrics contribute proportionally, and the result is a single interpretable score \(S_m \in [0,1]\) that balances physical fidelity, chemical coverage, and overall model robustness.

The thresholds for the different benchmarks were chosen based on the literature. For tautomers, energy differences are very small, so the usual tolerance of 1-2 kcal/mol is too coarse for classification and we chose a stricter threshold instead (see Table 1). Thresholds for biomolecules are borrowed from the traditional molecular modeling literature.
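The element-compatibility rule can be illustrated with a short Python sketch. The function name, the `per_case` flag, and the example test cases are hypothetical; the sketch only mirrors the all-or-nothing rule and the per-test-case exception described above.

```python
def runnable_cases(model_elements, test_cases, per_case=False):
    """Return the names of the test cases a model can run.

    test_cases maps a case name to the set of element symbols it contains.
    per_case=False: all-or-nothing, the whole benchmark is skipped ([])
                    if any case needs an unsupported element.
    per_case=True:  only the unsupported cases are dropped (the exception
                    described for non-covalent interactions).
    """
    supported = set(model_elements)
    compatible = [
        name for name, elements in test_cases.items()
        if set(elements) <= supported
    ]
    if per_case:
        return compatible
    return compatible if len(compatible) == len(test_cases) else []


# Hypothetical benchmark with two test cases.
cases = {"water_dimer": {"H", "O"}, "chloroform": {"C", "H", "Cl"}}
model = {"H", "C", "N", "O"}  # no chlorine support

print(runnable_cases(model, cases))                 # benchmark skipped
print(runnable_cases(model, cases, per_case=True))  # only the supported case
```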
Table 1: Score thresholds across benchmarks
| Benchmark | Metric | Threshold |
|---|---|---|
| Reference Geometry Stability | RMSD (Å) | 0.075 [1] |
| Non-covalent Interactions | Absolute deviation from reference interaction energy (kcal/mol) | 1.0 [1] |
| Dihedral Scan | Mean barrier error (kcal/mol) | 1.0 [2] |
| Conformer Selection | MAE (kcal/mol), RMSE (kcal/mol) | 0.5, 1.5 [3] |
| Tautomers | Absolute deviation (ΔG) | 0.05 |
| Ring Planarity | Deviation from plane (Å) | 0.05 [4] |
| Bond Length Distribution | Avg. fluctuation (Å) | 0.05 [1] |
| Reactivity-TST | Activation Energy (kcal/mol), Enthalpy (kcal/mol) | |
| Reactivity-NEB | Final force convergence (eV/Å) | 0.05 [6] |
| Radial Distribution Function | RMSE (Å) | 0.1 [7] |
| Protein Sampling Outliers | Ramachandran ratio, Rotamers ratio | 0.1, 0.03 |
| Protein Folding Stability | min(RMSD) (Å), max(TM-Score) | 2.0, 0.5 |
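As a worked illustration with hypothetical error values, assume a soft threshold that gives full credit within the tolerance \(t\) and decays as \(\exp(-\alpha (x - t)/t)\) beyond it. For Conformer Selection with an MAE of 0.4 kcal/mol, an RMSE of 3.0 kcal/mol, and \(\alpha = 3\), the two metric scores and the resulting benchmark score would be:

\[
s^{(\mathrm{MAE})} = 1 \quad (0.4 \le 0.5), \qquad
s^{(\mathrm{RMSE})} = \exp\!\left(-3 \cdot \frac{3.0 - 1.5}{1.5}\right) = e^{-3} \approx 0.050,
\]

\[
s_{m,b} = \tfrac{1}{2}\,(1 + 0.050) \approx 0.525 .
\]

Under this assumed form, an error of twice the tolerance already reduces a metric score to about 5% when \(\alpha = 3\).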