Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

2026-06-01 Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors studied two AI agents that use tools to help with tasks such as understanding images and solving math problems. They found that giving the agents tools did not consistently help them solve more problems than the same agents without tools. Most problems solved with tools were also solved in at least one setting without tools, suggesting the tools added little new ability. The authors highlight the need to check whether tools actually improve problem-solving, not just whether they are used.

tool-augmented agents, multimodal agents, tool-use evaluation, image understanding, OCR, mathematical reasoning, agent ablation, tool-calling patterns, execution results, benchmark performance
Authors
Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Qinghao Wang, Minpeng Liao
Abstract
Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative "thinking with images" agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.
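The "tool-only solved set" statistic in the abstract can be illustrated with a minimal sketch. It assumes per-setting sets of solved problem IDs are available. The function name `tool_only_analysis`, the setting labels, and the toy data are all hypothetical and are not taken from the paper, whose exact evaluation protocol is not given here.

```python
from typing import Dict, Set


def tool_only_analysis(solved: Dict[str, Set[str]], tool_setting: str = "tool") -> dict:
    """Split tool-solved problems into those also solved without tools
    and the tool-only remainder (hypothetical helper, for illustration)."""
    tool_solved = solved[tool_setting]

    # Union of problems solved by at least one non-tool setting.
    non_tool_solved: Set[str] = set()
    for name, ids in solved.items():
        if name != tool_setting:
            non_tool_solved |= ids

    overlap = tool_solved & non_tool_solved    # also solved without tools
    tool_only = tool_solved - non_tool_solved  # solved only with tool access
    fraction = len(overlap) / len(tool_solved) if tool_solved else 0.0
    return {"overlap_fraction": fraction, "tool_only": tool_only}


# Toy data invented for illustration only; these are not results from the paper.
solved = {
    "tool": {"q1", "q2", "q3", "q4"},
    "tool_free": {"q1", "q2", "q5"},
    "pure_text": {"q2", "q3"},
}
result = tool_only_analysis(solved)
print(result["overlap_fraction"])   # 0.75
print(sorted(result["tool_only"]))  # ['q4']
```

Under this reading, an overlap fraction of 0.93 would mean that 93% of the problems solved with tool access were also solved by at least one non-tool setting, leaving 7% in the tool-only set.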