Active Testing of Large Language Models via Approximate Neyman Allocation
2026-05-11 • Artificial Intelligence
AI summary
The authors address the high cost of evaluating large language models, especially on tasks where expert annotators are needed to judge performance. They propose a method for selecting a small but informative sample of examples to evaluate generative models, which produce new content rather than simply classifying inputs. The method uses semantic entropy computed by cheaper surrogate models to stratify the evaluation pool and allocate the labeling effort efficiently. Experiments across several language and multimodal tasks show the approach reduces estimation error and saves evaluation budget compared to standard uniform sampling.
large language models, evaluation, active testing, generative tasks, semantic entropy, Neyman allocation, surrogate models, multimodal benchmarks, sampling, mean squared error
Authors
Zeli Liu, Jiancheng Zhang, Cong Liu, Yinglun Zhu
Abstract
Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.
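The abstract does not spell out implementation details, but the core idea it describes (stratify the evaluation pool by surrogate semantic entropy, then spread the labeling budget with Neyman allocation, i.e. proportional to stratum size times estimated within-stratum variability) can be sketched as follows. The function names, the quantile-based binning, and the use of a per-example surrogate variability score are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def neyman_allocation(strata_sizes, strata_stds, budget):
    """Neyman allocation: assign n_h proportional to N_h * S_h, where N_h is the
    stratum size and S_h an estimate of the within-stratum standard deviation."""
    weights = strata_sizes * strata_stds
    if weights.sum() == 0:                      # degenerate case: fall back to proportional allocation
        weights = strata_sizes.astype(float)
    raw = budget * weights / weights.sum()
    alloc = np.floor(raw).astype(int)
    # hand out the leftover budget to strata with the largest fractional parts
    remainder = budget - alloc.sum()
    alloc[np.argsort(raw - np.floor(raw))[::-1][:remainder]] += 1
    # never request more labels than a stratum contains
    return np.minimum(alloc, strata_sizes)

def stratified_estimate(semantic_entropy, surrogate_std, label_fn, budget, n_strata=5, seed=0):
    """Bucket examples by surrogate semantic entropy, Neyman-allocate the labeling
    budget across buckets, and return a stratified estimate of the mean score."""
    rng = np.random.default_rng(seed)
    # quantile bins over semantic entropy (a hypothetical stratification choice)
    edges = np.quantile(semantic_entropy, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, semantic_entropy, side="right") - 1, 0, n_strata - 1)

    sizes = np.array([(strata == h).sum() for h in range(n_strata)])
    stds = np.array([surrogate_std[strata == h].mean() if (strata == h).any() else 0.0
                     for h in range(n_strata)])
    alloc = neyman_allocation(sizes, stds, budget)

    estimate = 0.0
    for h in range(n_strata):
        idx = np.flatnonzero(strata == h)
        if len(idx) == 0 or alloc[h] == 0:
            continue
        sampled = rng.choice(idx, size=alloc[h], replace=False)
        scores = np.array([label_fn(i) for i in sampled])   # expensive expert / judge labels
        estimate += (sizes[h] / sizes.sum()) * scores.mean()
    return estimate
```

Here `label_fn` stands in for the expensive evaluation step (an expert annotator or judge model), and `surrogate_std` is whatever variability signal is extracted from the surrogates; in the paper's setting that signal is derived from semantic entropy, whereas an oracle version would use the true per-stratum score variance.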