ATLAS: Agentic Test-time Learning-to-Allocate Scaling
2026-06-01 • Machine Learning
AI summary
The authors introduce ATLAS, a system in which a large language model not only solves problems but also decides how many solution attempts to make and how to combine them into a final answer. Instead of following fixed rules, ATLAS dynamically chooses when to try more solutions and when to stop, making better use of computing resources. On four benchmarks covering scientific question answering, code generation, and multimodal reasoning, it performs well while using far fewer API calls than fixed-workflow baselines. A multi-model extension that also lets ATLAS choose between different models improved performance further. Ablations suggest that letting ATLAS itself combine the evidence is key to its success.
test-time scaling, large language model (LLM), orchestration, reasoning, solver, prompting strategy, evidence synthesis, benchmark, agentic framework, multi-model integration
Authors
Peijia Qin, Qi Cao, Pengtao Xie
Abstract
Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search policy decides how compute is spent, leaving the model in charge of solving but not of orchestration. We introduce ATLAS, an agentic test-time scaling framework in which an LLM orchestrator owns the control loop end-to-end. Through a single action, explore, which dispatches a fresh independent solver on the original problem, the orchestrator decides whether to gather more evidence, when to stop, and how to synthesize the final answer; the action space is extensible, with each explore call optionally specifying solver, reasoning effort, or prompting strategy. We evaluate ATLAS on four benchmarks covering scientific question answering, code generation, and multimodal reasoning under a Claude Sonnet 4.6 backbone, where it reaches 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision while using far fewer API calls than fixed-workflow baselines. A multi-model extension, ATLAS-MM, that exposes solver choice as an additional action dimension further improves HLE-Verified to 60.00% and LiveCodeBench to 85.63%, with consistent gains on GPQA-Diamond and BabyVision. Ablations replacing the orchestrator's direct synthesis with a separate integrator degrade or fail to improve accuracy on three of four benchmarks, consistent with the role of stateful evidence management in producing the gains.
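To make the control loop described in the abstract concrete, here is a minimal Python sketch of an orchestrator that repeatedly chooses between an explore action and stopping with a synthesized answer. It is an illustration under stated assumptions, not the authors' implementation: the names `Explore`, `Finish`, `run_orchestration`, `toy_orchestrator`, and `make_scripted_solver` are hypothetical, the orchestrator here is a hand-written rule standing in for an LLM, the solver replays scripted answers instead of calling a model, and the majority-vote fallback on budget exhaustion is a safeguard added for the sketch rather than something the abstract describes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Explore:
    """Request one fresh, independent solver attempt on the original problem."""
    solver: Optional[str] = None    # optional solver choice (hypothetical field)
    effort: Optional[str] = None    # optional reasoning effort (hypothetical field)
    strategy: Optional[str] = None  # optional prompting strategy (hypothetical field)


@dataclass
class Finish:
    """Stop gathering evidence and commit to a synthesized final answer."""
    answer: str


Action = Union[Explore, Finish]
Orchestrator = Callable[[str, List[str]], Action]
Solver = Callable[[str, Explore], str]


def run_orchestration(problem: str, orchestrator: Orchestrator, solver: Solver,
                      max_calls: int = 8) -> Tuple[str, int]:
    """Let the orchestrator own the loop: explore, stop, and synthesize.

    Returns the final answer and the number of solver calls used.
    """
    evidence: List[str] = []
    while len(evidence) < max_calls:
        action = orchestrator(problem, evidence)
        if isinstance(action, Finish):
            return action.answer, len(evidence)
        # Each explore dispatches a new solver on the ORIGINAL problem;
        # the solver never sees earlier attempts.
        evidence.append(solver(problem, action))
    # Assumption of this sketch only: if the call budget runs out,
    # fall back to the most common answer seen so far.
    return Counter(evidence).most_common(1)[0][0], len(evidence)


def toy_orchestrator(problem: str, evidence: List[str]) -> Action:
    """Rule-based stand-in for the LLM orchestrator: stop once two attempts agree."""
    if evidence:
        answer, votes = Counter(evidence).most_common(1)[0]
        if votes >= 2:
            return Finish(answer)
    return Explore(effort="high" if evidence else "low")


def make_scripted_solver(answers: List[str]) -> Solver:
    """Stand-in solver that replays fixed answers instead of calling a model."""
    scripted = iter(answers)

    def solver(problem: str, request: Explore) -> str:
        return next(scripted)

    return solver


if __name__ == "__main__":
    solver = make_scripted_solver(["4", "5", "4", "4"])
    answer, calls = run_orchestration("What is 2 + 2?", toy_orchestrator, solver)
    print(answer, calls)
```

In this sketch the number of solver calls is not fixed in advance: the loop ends as soon as the orchestrator judges the accumulated evidence sufficient, which is the property the abstract contrasts with fixed sample budgets. Replacing `toy_orchestrator` with a call to an actual model, and using the `solver` field of `Explore` to route between models, would be the natural places to extend it.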