Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning
2026-06-02 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors address the problem that using external tools with large language models can make training unstable, either because the model relies too much on tools or uses them too little. They propose TAO-RL, which cleans up the training data by removing rollouts whose tool calls all fail and cases where every rollout is uniformly correct or uniformly incorrect, and which encourages the model to try different approaches right after a tool call by adding an entropy-based bonus to the advantage during learning. This combination helps the model learn better ways to use tools when reasoning through complex problems. Their method was tested on 7 reasoning benchmarks across 3 model scales and is reported to outperform previous approaches.
reinforcement learning, large language models, tool use, trajectory filtering, entropy-guided exploration, policy optimization, advantage function, input distribution shift, reasoning benchmarks
Authors
Hongye Cao, Nuo Yan, Haoyuan Deng, Ziwei Wang, Tianpei Yang, Jing Huo, Yuyao Zhang, Yang Gao
Abstract
Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on complex tasks. However, integrating external tools often destabilizes training: over-reliance on tools can induce input distribution shift, while overly conservative tool use limits effective exploration. To address this issue, we propose TAO-RL, a unified framework that couples tool-aware trajectory filtering with entropy-guided exploration for efficient policy optimization. Specifically, at the data level, TAO-RL filters rollout trajectories along two criteria: it discards trajectories in which all tool invocations fail to execute, and it removes cases in which the rollouts are all correct or all incorrect, since both uniform outcomes yield degenerate advantage estimates that contribute no discriminative learning signal. This joint filtering retains data that are both tool-capable and informative, establishing a high-quality training distribution. At the algorithmic level, we introduce a tool-aware entropy-guided bonus that reshapes the advantage function at post-tool-call tokens, encouraging the policy to explore more diverse reasoning paths at critical decision points. These two components are mutually reinforcing: trajectory filtering establishes a clean and informative training foundation, while entropy-guided exploration drives stronger reasoning behaviors at critical tool-interaction junctures. Extensive experiments on 7 challenging reasoning benchmarks across 3 model scales demonstrate the superiority of TAO-RL over existing methods.
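The two components described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the `Rollout` structure, the binary reward, the group-normalized baseline, the additive form of the bonus, the coefficient `alpha`, and the order in which the two filters are applied are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical container for one sampled trajectory (not from the paper)."""
    reward: float            # assumed binary: 1.0 if the final answer is correct
    tool_ok: List[bool]      # execution status of each tool call
    entropy: List[float]     # per-token policy entropy
    post_tool: List[bool]    # True where a token directly follows a tool result


def all_tools_failed(r: Rollout) -> bool:
    # Assumption: rollouts that make no tool calls at all are kept.
    return len(r.tool_ok) > 0 and not any(r.tool_ok)


def filter_group(group: List[Rollout]) -> List[Rollout]:
    """Apply both filters to the rollouts sampled for one prompt."""
    # Criterion 1: drop rollouts in which every tool invocation failed.
    kept = [r for r in group if not all_tools_failed(r)]
    # Criterion 2: drop the group if the remaining rewards are uniform,
    # because a group-relative advantage would then be zero everywhere.
    if len({r.reward for r in kept}) < 2:
        return []
    return kept


def shaped_advantages(group: List[Rollout], alpha: float = 0.1) -> List[List[float]]:
    """Per-token advantages with an entropy bonus at post-tool-call tokens."""
    rewards = [r.reward for r in group]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5
    shaped = []
    for r in group:
        # Assumed baseline: reward normalized within the group.
        base = (r.reward - mean) / (std + 1e-6)
        # Assumed bonus form: add alpha * entropy only at post-tool tokens.
        shaped.append([
            base + alpha * h if is_post else base
            for h, is_post in zip(r.entropy, r.post_tool)
        ])
    return shaped
```

In this sketch, a group whose surviving rollouts all share the same reward is dropped entirely, and only tokens flagged in `post_tool` receive the extra term, so the rest of the trajectory keeps its ordinary group-normalized advantage.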