Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

2026-06-15 • Artificial Intelligence


AI summary

The author examines public AI leaderboards and argues that their rankings rest on a selective time series shaped by reporting rules, benchmark revisions, and missing data. Analyzing repeated public archives that track AI performance over time, the paper shows that the same terminal leaderboard can be explained by different pre-terminal histories, which leaves the timing of progress uncertain. In the reported tests, the candidate selection-aware frontier model fails synthetic recovery, archive prediction, preference transfer, and uncertainty calibration, so fixed audit gates reject its stronger claims. The paper proposes an archive-and-adjudication protocol that reconstructs public evaluation histories and rejects unsupported claims about frontier progress.

AI evaluation, leaderboards, benchmarking, Bayesian inference, longitudinal data, model calibration, evaluation archives, preference modeling, uncertainty quantification, performance tracking

Authors

Yanan Long

Abstract

Public AI evaluations are often read as terminal leaderboards, yet the underlying evidence is a selective time series shaped by reporting rules, benchmark revisions, and missingness. Repeated public archives for LiveBench and Open LLM Leaderboard v2 serve as the primary longitudinal record; LMArena provides a preference stress test; and GAIA and tau-bench contribute limited agentic pilots. Together, these archives instantiate a Bayesian inference problem: under a fixed reporting convention, one constructed terminal-only example over $1{,}000$ systems is compatible with two pre-terminal histories, yielding times of $23.03$ or $75.13$ to reach within $0.05$ of the ceiling under the same terminal-tail model. In synthetic posterior comparisons, action-facing diagnostics differ across observation regimes. The candidate selection-aware frontier model fails synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration; correspondingly, fixed audit gates reject its stronger claims. An archive-and-adjudication protocol reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims.
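
The terminal-only ambiguity described above can be illustrated with a minimal sketch. It assumes a simple exponential gap-to-ceiling model, which is a stand-in chosen for illustration and not the paper's terminal-tail model; the function names and all numeric values are hypothetical and do not reproduce the reported figures of $23.03$ and $75.13$. The point is only that two histories ending at the same terminal score can imply very different times to come within $0.05$ of the ceiling.

```python
import math

TARGET_GAP = 0.05  # "within 0.05 of the ceiling", as in the abstract


def rate_from_history(gap_start, gap_end, elapsed):
    """Decay rate k implied by a history, assuming gap(t) = gap_start * exp(-k * t).

    This exponential form is an illustrative assumption, not the paper's model.
    """
    return math.log(gap_start / gap_end) / elapsed


def time_to_target(gap_now, rate, target_gap=TARGET_GAP):
    """Time until the gap to the ceiling shrinks from gap_now to target_gap."""
    if gap_now <= target_gap:
        return 0.0
    return math.log(gap_now / target_gap) / rate


# Both hypothetical histories end at the same terminal gap to the ceiling,
# so a terminal-only leaderboard cannot tell them apart.
terminal_gap = 0.20

# Hypothetical pre-terminal histories over the same elapsed window.
fast_rate = rate_from_history(gap_start=0.80, gap_end=terminal_gap, elapsed=10.0)
slow_rate = rate_from_history(gap_start=0.30, gap_end=terminal_gap, elapsed=10.0)

t_fast = time_to_target(terminal_gap, fast_rate)
t_slow = time_to_target(terminal_gap, slow_rate)

print(f"fast history: {t_fast:.2f} time units to reach the target gap")
print(f"slow history: {t_slow:.2f} time units to reach the target gap")
```

Under these assumptions the terminal snapshot fixes only the current gap, while the rate, and therefore the projected timing, depends entirely on which pre-terminal history is taken to be true.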
