ToolFG: Towards Well-Grounded Fine-Grained Image Classification

2026-06-01 | Computer Vision and Pattern Recognition

AI summary

The authors propose ToolFG, a framework that lets multimodal large language models (MLLMs) call external tools while reasoning about fine-grained image classification, the task of telling apart categories that look highly similar. The model interacts with the image through these tools and gathers verifiable visual cues before committing to a label. To teach the model this behavior, the authors distill tool-use knowledge from proprietary MLLMs using a search guided by Monte Carlo tree search, and they refine the toolset and the model's tool-use policy jointly over time. The authors report that experiments support the effectiveness of the approach.

fine-grained image classification, MLLM, tool-use knowledge distillation, Monte Carlo tree search, visual cue extraction, model-tool co-evolution, external tool integration, image reasoning
Authors
Yu Xue, Haoxuan Qu, Zhuoling Li, Yihang Lou, Yan Bai, Hossein Rahmani, Jun Liu
Abstract
Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing ToolFG, the first tool-integrated framework tailored to FGIC that is built on multimodal large language models (MLLMs). ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more reliable and well-grounded manner. To equip the model with such tool-use ability, we design a novel tool-use knowledge distillation mechanism guided by Monte Carlo tree search (MCTS), which mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a model-tool co-evolution mechanism that jointly refines the toolset and the model's tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.
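To make the tool-use loop described in the abstract more concrete, here is a minimal Python sketch of a model that alternates between calling a tool and answering. It is an illustration under stated assumptions, not the authors' implementation: the image is a toy grid of integers, `describe_region` is a hypothetical tool, `toy_policy` is a hand-written stand-in for the MLLM, and the labels `class_a` and `class_b` are invented. The MCTS-guided distillation and the co-evolution of the toolset are not modeled.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]          # toy stand-in for real image data (assumption)
Action = Tuple[str, str, dict]   # (kind, tool name or label, tool arguments)


def describe_region(image: Image, top: int, left: int,
                    height: int, width: int) -> str:
    """Hypothetical tool: crop a rectangle and report its mean intensity."""
    region = [row[left:left + width] for row in image[top:top + height]]
    pixels = [p for row in region for p in row]
    return f"mean_intensity={sum(pixels) / len(pixels):.1f}"


# Hypothetical toolset; a real system would expose richer visual tools.
TOOLS: Dict[str, Callable[..., str]] = {"describe_region": describe_region}


def toy_policy(cues: List[str]) -> Action:
    """Stand-in for the MLLM: inspect one region, then answer from the cue."""
    if not cues:
        args = {"top": 0, "left": 0, "height": 2, "width": 2}
        return ("call", "describe_region", args)
    value = float(cues[-1].split("=")[1])
    return ("answer", "class_a" if value > 100 else "class_b", {})


def classify(image: Image,
             policy: Callable[[List[str]], Action],
             tools: Dict[str, Callable[..., str]],
             max_steps: int = 4) -> Tuple[str, List[str]]:
    """Run the loop: call tools, collect cues, stop when the policy answers."""
    cues: List[str] = []
    for _ in range(max_steps):
        kind, name, args = policy(cues)
        if kind == "answer":
            return name, cues
        # Each tool output is kept as a cue that can be checked afterwards.
        cues.append(tools[name](image, **args))
    return "unknown", cues  # step budget exhausted without an answer


if __name__ == "__main__":
    bright_corner = [[200, 200, 0], [200, 200, 0], [0, 0, 0]]
    print(classify(bright_corner, toy_policy, TOOLS))
```

Returning the collected cues alongside the label is the point of the sketch: the prediction can be traced back to specific tool outputs, which is one reading of what the abstract calls well-grounded classification.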