SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks

2026-05-25 | Machine Learning

AI summary

The authors explain that comparisons such as "Model A is safer than Model B" depend heavily on how the evaluation is configured, and that benchmark papers often leave this configuration under-specified. They state a proposition linking a measurable pairwise-disagreement rate to whether the ordering of two models can be reversed by switching between evaluation configurations. They also provide a commit-stamped evaluation protocol that applies this proposition to widely cited alignment benchmarks, and they report that on every benchmark tested, configuration choice alone can flip the pairwise verdict. This highlights one reason why some model comparisons may be unreliable.

foundation models, model comparison, benchmarking, pairwise disagreement, evaluation protocol, AI safety, alignment benchmarks, ordering reversal, configuration choice
Authors
Yanhang Li, Zhichao Fan, Zexin Zhuang
Abstract
Pairwise model comparisons drawn from foundation-model benchmarks ("A is safer than B") are read as quantitative verdicts but hinge on harness choices that benchmark papers under-specify. We close one theory-benchmark loop on this primitive: a finite-envelope proposition tying a measurable pairwise-disagreement rate to whether the strict ordering admits a configuration-pair reversal, paired with a commit-stamped evaluation protocol that operationalises it on widely cited alignment benchmarks. On every benchmark we test, configuration choice alone can flip the pairwise verdict; the proposition isolates this strict-reversal failure mode.
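
To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' protocol. It assumes one plausible reading: the pairwise-disagreement rate is the fraction of unordered configuration pairs whose strict verdicts on models A and B are opposite, and a configuration-pair reversal exists when at least one such pair occurs. The function names, configuration labels, and scores are all hypothetical illustrations.

```python
from itertools import combinations


def verdict(score_a, score_b):
    """Return +1 if A strictly beats B, -1 if B strictly beats A, 0 on a tie."""
    return (score_a > score_b) - (score_a < score_b)


def disagreement_rate(scores_a, scores_b):
    """Fraction of unordered configuration pairs with opposite strict verdicts.

    Assumed definition for illustration; the paper's exact rate may differ.
    Both arguments map a configuration label to a score (higher is better).
    """
    verdicts = [verdict(scores_a[c], scores_b[c]) for c in scores_a]
    pairs = list(combinations(verdicts, 2))
    if not pairs:
        return 0.0
    flipped = sum(1 for v1, v2 in pairs if v1 * v2 == -1)
    return flipped / len(pairs)


def has_strict_reversal(scores_a, scores_b):
    """True if some configuration ranks A above B and another ranks B above A."""
    verdicts = {verdict(scores_a[c], scores_b[c]) for c in scores_a}
    return 1 in verdicts and -1 in verdicts


# Hypothetical safety scores for two models under three harness configurations.
example_a = {"greedy": 0.82, "temp_0.7": 0.74, "few_shot": 0.80}
example_b = {"greedy": 0.79, "temp_0.7": 0.77, "few_shot": 0.78}

rate = disagreement_rate(example_a, example_b)
reversal = has_strict_reversal(example_a, example_b)
print(f"disagreement rate = {rate:.3f}, strict reversal = {reversal}")
```

Under this assumed definition the rate is nonzero exactly when a strict reversal exists, which mirrors the kind of link the proposition is described as establishing. Ties are deliberately excluded from both quantities, since the abstract speaks only of the strict ordering.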