Efficient Benchmarking Is Just Feature Selection and Multiple Regression

2026-05-25, Artificial Intelligence

Artificial Intelligence, Computation and Language, Machine Learning
AI summary

The authors looked at how to speed up the evaluation of large language models by predicting a model's full benchmark score from only a small subset of the benchmark's questions. They showed that using kernel ridge regression at the prediction stage gives more accurate predictions than existing techniques. They also chose the questions with a feature-selection method called mRMR, which picks questions that are informative about the full score without being redundant with each other. Except when very little data is available, their methods achieve lower prediction error and better ranking agreement across a range of benchmarks, and mRMR is faster and more consistent across random seeds and data splits than other selection approaches. They provide tutorial code so others can use these techniques.

Large Language Models, Benchmarking, Kernel Ridge Regression, Feature Selection, Minimum Redundancy Maximum Relevance (mRMR), Prediction Error, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Spearman Correlation, Kendall Tau
Authors
Sam Bowyer, Acyr Locatelli, Kris Cao
Abstract
Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data-poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman $ρ$ and Kendall $τ$) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval .
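To make the regression framing concrete, here is a minimal NumPy sketch of the two stages the abstract describes: greedy mRMR-style question selection, followed by kernel ridge regression from the selected questions to the full benchmark score. It is an illustration under stated assumptions, not the authors' implementation. The function names (`mrmr_select`, `krr_fit_predict`), the RBF kernel, the hyperparameter defaults, and the synthetic data are all hypothetical. The abstract describes mRMR as information-theoretic; for brevity, this sketch substitutes absolute Pearson correlation for mutual information in both the relevance and redundancy terms.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedily pick k question indices: high relevance to y, low redundancy.

    X: (n_models, n_questions) per-question scores; y: (n_models,) full scores.
    Assumption: |Pearson correlation| is used as a stand-in for mutual information.
    """
    Xc = X - X.mean(axis=0)
    Xc = Xc / (np.linalg.norm(Xc, axis=0) + 1e-12)   # unit-norm centered columns
    yc = y - y.mean()
    yc = yc / (np.linalg.norm(yc) + 1e-12)
    relevance = np.abs(Xc.T @ yc)                    # |corr(question, full score)|
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(X.shape[1])
    while len(selected) < k:
        # Accumulate |corr| between every question and the most recent pick.
        redundancy_sum += np.abs(Xc.T @ Xc[:, selected[-1]])
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf                    # never re-pick a question
        selected.append(int(np.argmax(score)))
    return selected

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit_predict(X_train, y_train, X_test, lam=1.0, gamma=None):
    """Kernel ridge regression: solve (K + lam*I) alpha = y, then predict."""
    if gamma is None:
        gamma = 1.0 / X_train.shape[1]               # assumed default bandwidth
    y_mean = y_train.mean()
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train - y_mean)
    return rbf_kernel(X_test, X_train, gamma) @ alpha + y_mean

# Synthetic example (hypothetical data): 80 models answer 200 binary questions.
rng = np.random.default_rng(0)
ability = rng.normal(size=(80, 1))
difficulty = rng.normal(size=(1, 200))
X = (rng.random((80, 200)) < 1 / (1 + np.exp(difficulty - ability))).astype(float)
y = X.mean(axis=1)                                   # full-benchmark accuracy

train, test = slice(0, 60), slice(60, 80)
subset = mrmr_select(X[train], y[train], k=20)       # select on training models only
pred = krr_fit_predict(X[train][:, subset], y[train], X[test][:, subset])
print("MAE on held-out models:", np.abs(pred - y[test]).mean())
```

In this framing, each question is a feature, each previously evaluated model is a training row, and a new model only needs to answer the selected questions. A version closer to the abstract's description could estimate relevance with scikit-learn's `mutual_info_regression` and fit the predictor with `sklearn.kernel_ridge.KernelRidge`, tuning the regularization strength and kernel bandwidth by cross-validation.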