Breaking the Evaluation Paradox: Evaluating High-Entropy Search with Computationally Irreducible Constraints

2026-06-22Information Retrieval

Information Retrieval
AI summary

The authors address a problem in testing how well large language models (LLMs) can search exhaustively: it’s hard to check if their answers are complete without perfect knowledge, which humans can’t provide for complex tasks. To solve this, they created VERITAS, a system that builds special search problems with rules that make it impossible for models to guess shortcuts, so they must explore everything to find all answers. This lets researchers automatically make lots of test cases with exact answers, helping to fairly measure and improve models’ searching skills. VERITAS offers a clearer way to evaluate exhaustive search without relying on incomplete human annotations.

Large Language ModelsExhaustive SearchGround TruthBenchmarkingComputational IrreducibilitySearch Space TraversalSystematic ExplorationEvaluation FrameworkTraining Data GenerationHash Computation
Authors
Juntao Wu, Wei Wen, Xianting Huang, Shuai Pang, Ruizhi Qiao, Xing Sun, Ke Wang
Abstract
Evaluating the exhaustive search capabilities of large language models (LLMs) is plagued by a fundamental paradox: verifying completeness requires complete ground truth, yet high-entropy enumeration tasks make such ground truth impossible for humans to create. This causes benchmarks to systematically penalize models for outperforming their human annotators. Despite rapid progress in web-search and deep research agents -- which now issue hundreds of queries, traverse diverse sites, and synthesize long reports -- evaluation still largely relies on partially annotated answer sets, LLM-based judges, or single-answer questions that avoid genuinely exhaustive search scenarios. We break this paradox by shifting the evaluation paradigm from simulating a messy reality to constructing computationally pure challenges. We introduce VERITAS (Verifiable Traversal Assessment for Search), a framework built on the principle of computationally irreducible constraints. By introducing novel, non-optimizable constraints, we create verifiable, sparse-answer search tasks that are computationally equivalent to exhaustive enumeration. These constraints are easy to verify but impossible for LLMs or search engines to optimize, forcing agents to genuinely traverse the entire search space. VERITAS can automatically generate a virtually infinite number of test cases with perfect ground truth and precise difficulty control, with marginal instance cost dominated by hash computations. This provides not only a robust benchmark for evaluating systematic exploration under uncertainty but also a scalable method for generating training data to improve these crucial, yet underdeveloped, capabilities.