MINCE: Shrinking LLM Evaluation Datasets via Few-Model Monte Carlo Calibration

2026-06-22 • Artificial Intelligence

AI summary

The authors present MINCE, a method for speeding up the evaluation of many LLM variants, which can take tens of hours per model on edge hardware such as NPUs. MINCE runs Monte Carlo simulation over per-item results from a small set of calibration models to find the smallest subset size that keeps accuracy drift within a bound, then fixes a randomly sampled subset at that size, with no learned prediction layer. Their experiments show that MINCE shrinks IFEVAL, MMLU, and GSM8K by 54% to 89% while keeping drift low, and that it stays robust with few calibration models. Overall, the approach makes evaluating LLMs on NPUs and GPUs faster and more efficient.

Large Language Models (LLMs), Benchmarking, Monte Carlo Simulation, Subset Selection, Calibration Models, Quantization, Fine-tuning, NPUs (Neural Processing Units), Evaluation Speedup, Accuracy Drift

Authors

Devleena Das, Rajeev Patwari, Vikram Kumar Bukka, Nithin Kumar Guggilla, Elliott Delaye, Ashish Sirasao

Abstract

Evaluating LLMs across many model variants (quantized, fine-tuned, or deployment-specific) requires running large benchmarks repeatedly, a process that can take tens of hours per model on edge hardware such as NPUs. Existing subset selection methods reduce this cost but depend on large calibration pools or learned prediction layers. We introduce MINCE (Monte Carlo Informed N-sizing for Compact Evaluation), which uses Monte Carlo simulation over per-item logs from a small set of calibration models to find the minimum subset size that bounds accuracy drift and then fixes a randomly sampled subset at that size, with no prediction layer needed. MINCE reduces IFEVAL by 54%, MMLU by 89%, and GSM8K by 70% with maximum drift ≤ 2.62 pp on BF16 models and mean drift of 0.77 to 3.59 pp on held-out NPU models, while delivering median GPU evaluation speedups of 2.7 to 8.1× and NPU evaluation speedups of 1.7 to 2.0×. The method is robust to calibration pool size and achieves lower drift than tinyBenchmarks (12× lower on MMLU, 3.3× on GSM8K) while using 57× fewer calibration models.
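
The procedure described in the abstract (simulate random subsets over per-item calibration logs, pick the smallest size that bounds drift, then fix one random subset) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the 0/1 correctness matrix layout, the 95th-percentile worst-model acceptance rule, and the linear search over sizes are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def simulate_drift(correct, n, trials, rng):
    """Return a (trials, models) array of |subset accuracy - full accuracy|.

    `correct` is a hypothetical (models, items) 0/1 matrix built from
    per-item logs of the calibration models.
    """
    num_models, num_items = correct.shape
    full_acc = correct.mean(axis=1)
    drifts = np.empty((trials, num_models))
    for t in range(trials):
        # One Monte Carlo draw: a random subset of n items, without replacement.
        idx = rng.choice(num_items, size=n, replace=False)
        drifts[t] = np.abs(correct[:, idx].mean(axis=1) - full_acc)
    return drifts

def min_subset_size(correct, tolerance_pp, quantile=0.95,
                    trials=200, step=50, seed=0):
    """Smallest size whose simulated drift stays within `tolerance_pp`.

    Acceptance rule (an assumption, not taken from the paper): the
    `quantile` of the worst-model drift across trials, in percentage
    points, must not exceed the tolerance.
    """
    rng = np.random.default_rng(seed)
    num_items = correct.shape[1]
    candidate_sizes = list(range(step, num_items, step)) + [num_items]
    for n in candidate_sizes:
        drift_pp = 100.0 * simulate_drift(correct, n, trials, rng)
        worst_per_trial = drift_pp.max(axis=1)
        if np.quantile(worst_per_trial, quantile) <= tolerance_pp:
            return n
    return num_items

def fix_subset(num_items, n, seed=0):
    """Draw the single random subset that is reused for every later evaluation."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(num_items, size=n, replace=False))

if __name__ == "__main__":
    # Synthetic stand-in for calibration logs: 4 models, 1000 items.
    data_rng = np.random.default_rng(42)
    logs = (data_rng.random((4, 1000)) < 0.7).astype(float)
    size = min_subset_size(logs, tolerance_pp=5.0)
    subset = fix_subset(logs.shape[1], size)
    print(size, subset[:5])
```

Under these assumptions, a new model variant would be scored only on the items in `subset`, and a tighter `tolerance_pp` yields a larger subset.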
