Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

2026-04-09 | Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors point out that current agents often invoke external tools even when they could answer from the visual input they already have, which slows them down and degrades their reasoning. Existing methods that penalize tool use face a dilemma: a strong penalty suppresses even necessary tool calls, while a weak one fails to curb overuse. To fix this, the authors propose HDPO, which separates improving accuracy and limiting tool use into two distinct learning objectives: the agent first learns to solve the task correctly, and only among its correct attempts is it pushed to rely less on tools. Their resulting model, Metis, calls tools far less often while performing better on reasoning tasks.

agentic multimodal models · tool invocation · meta-cognition · reinforcement learning · reward scalarization · advantage estimation · conditional optimization · task accuracy · execution efficiency · self-reliance curriculum
Authors
Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou
Abstract
The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently default to reflexive tool invocation even when queries are resolvable from the raw visual context alone. This behavior incurs severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is subsumed by the variance of the accuracy reward during advantage normalization, leaving it ineffective against tool overuse. To resolve this dilemma, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective into a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum: the agent must first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously improving reasoning accuracy.
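The abstract's key mechanism is conditional advantage estimation: accuracy advantages are computed over all sampled trajectories, while efficiency advantages are computed only within the correct ones. The sketch below illustrates that decoupling for a single group of trajectories. It is a minimal illustration, not the paper's implementation: the group-normalization form (GRPO-style), the `lam` weighting, and all function names here are assumptions.

```python
import numpy as np

def conditional_advantages(rewards_correct, tool_calls, lam=0.1):
    """Sketch of decoupled advantage estimation (assumed form, not the
    paper's exact method).

    rewards_correct: binary accuracy rewards for a group of sampled
        trajectories (1.0 = task solved correctly, 0.0 = not).
    tool_calls: number of tool invocations in each trajectory.
    """
    r = np.asarray(rewards_correct, dtype=float)
    t = np.asarray(tool_calls, dtype=float)

    # Accuracy channel: group-normalized advantage over ALL trajectories.
    acc_adv = (r - r.mean()) / (r.std() + 1e-8)

    # Efficiency channel: computed ONLY within correct trajectories,
    # favoring fewer tool calls; incorrect trajectories receive zero,
    # so the penalty never competes with learning to solve the task.
    eff_adv = np.zeros_like(t)
    mask = r > 0.5
    if mask.sum() > 1:
        tc = t[mask]
        eff_adv[mask] = -(tc - tc.mean()) / (tc.std() + 1e-8)

    # The channels stay separate until this final weighted combination,
    # so the efficiency signal is never drowned out by accuracy variance
    # during normalization.
    return acc_adv + lam * eff_adv
```

For a group where trajectories 1 and 2 are correct (with 3 and 0 tool calls) and trajectories 3 and 4 are wrong, the tool-free correct trajectory receives the largest advantage, the tool-heavy correct one slightly less, and the incorrect ones remain negative regardless of their tool use.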