The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
2026-06-02 • Artificial Intelligence
AI summary
The authors studied how well existing methods detect whether benchmark test examples were included in a large language model's training data, a problem that undermines fair model evaluation. They found that common detection techniques often fail in realistic settings, either because the suspect and validation sets being compared do not come from the same distribution or because benchmarks are far smaller than training corpora. Across 335 evaluations, only 199 gave correct outcomes, showing that current tools are not yet reliable enough for practical auditing. They conclude that transparent records of training data remain essential, and they release their benchmark to encourage further research.
benchmark contamination, large language models, training data membership, distribution shift, IID assumption, dataset inference, post-hoc analysis, model evaluation, data provenance, statistical detection
Authors
Wojciech Zarzecki, Jan Dubiński, Sebastian Cygert
Abstract
Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but they have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods stay reliable in realistic auditing scenarios is unclear. We identify two under-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre-training corpora. We systematically evaluate three leading paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 27 models from multiple families (including Pythia, OLMo 2, and specialised cultural and medical LLMs) and scales (up to 27B parameters). We then extend our analysis to frontier industry models. Across 335 evaluations, only 199 (about 59%) yield correct outcomes. LLM Dataset Inference produces false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance. We open-source our benchmark for further research.
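To make the two failure modes concrete, the following is a minimal sketch of a generic one-sided two-sample z-test on per-example membership scores (for instance, loss). It is not the procedure used by LLM Dataset Inference, Post-Hoc Dataset Inference, or CoDeC. The function name `rejection_probability`, the unit score standard deviation, and every gap and sample size below are hypothetical values chosen for illustration, and the normal approximation is an assumption.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def rejection_probability(mean_gap, n_per_set, score_std=1.0, alpha=0.05):
    """Probability that a one-sided two-sample z-test flags the suspect set.

    mean_gap is how much lower the suspect set's mean score is than the
    validation set's mean score. The test cannot tell whether that gap
    comes from training-data membership or from distribution shift.
    All defaults are illustrative assumptions, not values from the paper.
    """
    standard_error = score_std * sqrt(2.0 / n_per_set)
    critical_z = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(critical_z - mean_gap / standard_error)


# Hypothetical scenarios: (description, mean gap in score units, examples per set)
scenarios = [
    ("clean suspect set, IID with validation set", 0.00, 1_000),
    ("clean suspect set, shifted distribution", 0.20, 1_000),
    ("contaminated, benchmark-sized split", 0.05, 100),
    ("contaminated, corpus-sized sample", 0.05, 100_000),
]

for description, gap, n in scenarios:
    probability = rejection_probability(gap, n)
    print(f"{description:45s} P(flagged) = {probability:.3f}")
```

Under these assumptions, a clean but shifted suspect set is flagged almost every time (a false positive), while a genuinely contaminated split of only 100 examples is rarely flagged (an underpowered test). The same small gap becomes easy to detect only at corpus-like sample sizes.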