Deployment-complete benchmarking

2026-05-25 • Machine Learning

AI summary

The authors argue that a benchmark score supports only the response it records, which does not necessarily determine how a model should be deployed. They propose deployment-complete benchmarking, which tests whether benchmark evidence determines a deployment action. Their public audits found that widely used benchmarks often lack the information needed to settle deployment decisions, including 97.9% mixed fibers in Tox21. In held-out replays, a certify-then-acquire procedure reduced false decisions, for example from 1.19% to 0.027% in Tox21. The authors suggest that benchmarks should report evidence, supported actions, ambiguity and completion cost rather than scores alone.

Keywords: benchmarking, deployment action, evidence fiber, conformal coverage, response-rank intervals, Tox21, Matbench, JARVIS, model certification, false decision rate

Authors

El Mustapha Mansouri, Keigo Arai

Abstract

Benchmarks increasingly guide deployment, procurement and scientific screening, yet a score supports only the response it records, not necessarily the deployment action. We introduce deployment-complete benchmarking, which tests whether benchmark evidence determines a deployment action. A benchmark is complete for a claim exactly when the action is constant on each evidence fiber; mixed fibers expose missing deployment information, and completion curves quantify the evidence required to resolve ambiguity. In controlled response spaces, benchmark-channel conformal coverage of 94.98% transferred poorly to an unmeasured deployment channel (10.07%), whereas response-rank intervals achieved 94.91% coverage; even zero benchmark error certified only 45.4% of candidates at the largest residual size. Public audits revealed incompleteness, including 97.9% mixed Tox21 fibers and zero median certifiable fraction in main Matbench and JARVIS audits. In held-out replays, certify-then-acquire reduced false decisions from 1.19% to 0.027% in Tox21 and from 20.3% to 0.128% in JARVIS, while changing model choice and identifying deployment-relevant probes. Deployment-ready benchmarks should report evidence, supported actions, ambiguity and completion cost rather than scores alone.
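
The completeness criterion in the abstract (the action is constant on each evidence fiber) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`evidence_fibers`, `mixed_fibers`, `is_complete`, `mixed_fraction`), the toy candidates, and the deployment rule are all hypothetical and chosen only to show the idea. Here a fiber is taken to be the set of candidates that produce identical benchmark evidence, and a fiber is mixed when its members call for different deployment actions.

```python
from collections import defaultdict


def evidence_fibers(candidates, evidence):
    """Group candidates by the benchmark evidence they produce."""
    fibers = defaultdict(list)
    for c in candidates:
        fibers[evidence(c)].append(c)
    return dict(fibers)


def mixed_fibers(candidates, evidence, action):
    """Return the fibers whose members require more than one action."""
    fibers = evidence_fibers(candidates, evidence)
    return {
        e: members
        for e, members in fibers.items()
        if len({action(c) for c in members}) > 1
    }


def is_complete(candidates, evidence, action):
    """Complete means the action is constant on every evidence fiber."""
    return not mixed_fibers(candidates, evidence, action)


def mixed_fraction(candidates, evidence, action):
    """Share of fibers that are mixed (0.0 when there are no fibers)."""
    fibers = evidence_fibers(candidates, evidence)
    if not fibers:
        return 0.0
    return len(mixed_fibers(candidates, evidence, action)) / len(fibers)


# Hypothetical toy data: "score" is what the benchmark records,
# "robust" is a deployment-relevant property it does not record.
candidates = [
    {"name": "A", "score": 0.9, "robust": True},
    {"name": "B", "score": 0.9, "robust": False},
    {"name": "C", "score": 0.5, "robust": False},
]


def action(c):
    """Assumed deployment rule, for illustration only."""
    return "deploy" if c["score"] >= 0.8 and c["robust"] else "reject"


def coarse(c):
    """Evidence consisting of the score alone."""
    return c["score"]


def fine(c):
    """Evidence extended with one additional probe."""
    return (c["score"], c["robust"])


print(is_complete(candidates, coarse, action))
print(is_complete(candidates, fine, action))
```

With score-only evidence, candidates A and B share a fiber but need different actions, so that fiber is mixed and the benchmark is incomplete for this claim. Adding the extra probe separates them, which is the kind of evidence acquisition that a completion curve would track.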
