Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning
2026-06-01 • Artificial Intelligence
AI summary
The authors study how AI agents trained with reinforcement learning overuse external tools even on problems they could solve by reasoning alone. They propose EAPO, a method that teaches agents to decide when a tool call is worthwhile and when it is unnecessary. EAPO discourages redundant tool use mainly on easier questions, and across nine benchmarks it improves both accuracy and efficiency. Compared with GRPO, it makes fewer tool calls while achieving higher average performance.
agentic reinforcement learning, tool abuse, reward shaping, policy optimization, confidence-aware token reweighting, reinforcement learning benchmarks, Qwen models, Llama models, tool-assisted reasoning
Authors
Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang
Abstract
Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69% on these three models, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.
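The abstract does not give EAPO's formulas, so the following is only an illustrative sketch of how difficulty-aware reward shaping over a rollout group containing tool-free trajectories might look. Everything specific here is an assumption rather than the authors' method: the `Trajectory` type, the `shaped_rewards` and `group_advantages` helpers, the `penalty` coefficient, and the use of the tool-free success rate as an easiness proxy. The group normalization follows the usual GRPO convention of subtracting the group mean and dividing by the group standard deviation. Confidence-aware token reweighting is left out because the abstract gives no detail on it.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Trajectory:
    """One rollout for a query (hypothetical structure, not from the paper)."""
    correct: bool    # whether the final answer matched the reference
    tool_calls: int  # number of external tool invocations (0 = tool-free)


def shaped_rewards(group, penalty=0.2):
    """Penalize tool calls in proportion to how easy the query looks.

    Assumption: easiness is estimated as the fraction of tool-free
    rollouts in the group that are correct. If no tool-free rollout
    succeeds, the query is treated as hard and tool use is not penalized.
    """
    tool_free = [t for t in group if t.tool_calls == 0]
    easiness = (
        mean(1.0 if t.correct else 0.0 for t in tool_free) if tool_free else 0.0
    )
    rewards = []
    for t in group:
        base = 1.0 if t.correct else 0.0
        rewards.append(base - penalty * easiness * t.tool_calls)
    return rewards


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages (population std, an assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Easy query: tool-free rollouts already succeed, so tool calls are penalized.
easy_group = [
    Trajectory(correct=True, tool_calls=0),
    Trajectory(correct=True, tool_calls=0),
    Trajectory(correct=True, tool_calls=2),
    Trajectory(correct=True, tool_calls=1),
]
# Hard query: the tool-free rollout fails, so the tool-using rollout keeps full reward.
hard_group = [
    Trajectory(correct=False, tool_calls=0),
    Trajectory(correct=True, tool_calls=3),
]

easy_rewards = shaped_rewards(easy_group)
hard_rewards = shaped_rewards(hard_group)
easy_adv = group_advantages(easy_rewards)
```

Under these assumptions, a uniform penalty would have cut the reward of the tool-using rollout on the hard query as well; scaling the penalty by the estimated easiness leaves it untouched, which is the selective behavior the abstract describes.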