Automated Benchmark Auditing for AI Agents and Large Language Models

2026-05-25 • Computation and Language

AI summary

The authors explain that many modern AI tests are complicated and have hidden problems that humans often miss. They created a tool called Auto Benchmark Audit (ABA) to automatically check these tests for issues like unclear instructions and wrong answers. Testing ABA on many AI benchmarks, they found problems in over a quarter of the tasks. They also showed that fixing these issues changes how well AI models seem to perform. The authors provide their tool and findings to help make future AI tests better.

AI benchmarks, Large Language Models (LLMs), benchmark auditing, task specification, ground truth, performance evaluation, agentic framework, NeurIPS, environment dependencies, model ranking

Authors

Junlin Wang, Federico Bianchi, Shang Zhu, Fan Nie, Yongchan Kwon, Bhuwan Dhingra, James Zou

Abstract

Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues including ambiguous task design, execution environment conflicts, and incorrect ground truths in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distort capability assessments for agents and LLMs: filtering out the tasks with issues shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.
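
To make the filtering claim concrete, here is a minimal sketch of how removing flagged tasks can change both per-model scores and the resulting ranking. It is not the ABA implementation: the function names (`pass_rate`, `ranking`), the model names, the task IDs, and every pass/fail value are hypothetical toy data invented for illustration, not results from the paper.

```python
from typing import Dict, Iterable, List


def pass_rate(results: Dict[str, bool], exclude: Iterable[str] = ()) -> float:
    """Fraction of tasks passed, ignoring any task ID listed in `exclude`."""
    excluded = set(exclude)
    kept = [passed for task, passed in results.items() if task not in excluded]
    if not kept:
        raise ValueError("no tasks left after filtering")
    return sum(kept) / len(kept)


def ranking(scores: Dict[str, float]) -> List[str]:
    """Model names ordered best-first; ties are broken alphabetically."""
    return sorted(scores, key=lambda model: (-scores[model], model))


# Hypothetical per-task outcomes (True = pass). Toy values only.
results = {
    "model_a": {"t1": True, "t2": True, "t3": True, "t4": False, "t5": False},
    "model_b": {"t1": True, "t2": True, "t3": False, "t4": True, "t5": True},
}

# Hypothetical set of task IDs that an audit flagged as defective.
flagged = {"t4", "t5"}

raw = {model: pass_rate(r) for model, r in results.items()}
filtered = {model: pass_rate(r, exclude=flagged) for model, r in results.items()}

print("raw ranking:     ", ranking(raw), raw)
print("filtered ranking:", ranking(filtered), filtered)
```

In this toy setup the two models swap places once the flagged tasks are dropped, and the mean score across models rises. That mirrors the kind of effect the abstract describes, though the magnitudes here carry no empirical meaning.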
