LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

2026-06-15 · Computer Vision and Pattern Recognition

AI summary

The authors explain that multimodal large language models (MLLMs) struggle to pick out small but decisive details in images, even when high-resolution inputs keep those details visible. They call this problem 'visual context rot': the relevant evidence is present in the image, but the model fails to select and use it amid redundant visual context. To address this, the authors propose LOCUS, a training method that shows the model a local crop as a cue and rewards it for locating that crop in the full image. The cue is used only during training, so inference still takes just an image and a question. Their experiments show that this helps the model focus on task-relevant regions without losing its general understanding and reasoning ability.

Multimodal Large Language Models, Fine-grained Visual Perception, Visual Context Rot, LOCUS, Local Visual Cue Search, Intersection over Union (IoU), Visual Localization, Model Training, Visual Reasoning, Hallucination in AI
Authors
Zhou Tao, Fang Zhang, Zewen Ding, Shida Wang, Xiaokun Sun, YongXiang Hua, Haoyu Cao, Linli Xu
Abstract
Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidence may exist in the full image, yet fail to be reliably selected and used amid redundant visual context. We propose LOCUS (LOcal visual CUe Search), a training framework that teaches MLLMs to internalize local evidence search through a verifiable proxy task. During training, LOCUS provides a local crop as a visual cue and optimizes the model to recover its spatial support in the full image using an IoU-based reward. The visual cue is used only during training, leaving the standard image-question inference interface unchanged. Experiments across fine-grained perception, hallucination, general understanding, and reasoning benchmarks show that LOCUS improves localization-sensitive visual understanding while preserving broad capabilities. Attention analyses further indicate stronger focus on task-relevant evidence regions, suggesting that training-time visual cue search provides an effective route to internalized fine-grained evidence selection.
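The abstract describes the reward only as "IoU-based", so the following is a minimal sketch of how such a reward might be computed, not the authors' implementation. It assumes axis-aligned boxes in (x1, y1, x2, y2) pixel coordinates, that the reward equals the raw IoU between the predicted box and the box the cue crop was taken from, and that an unparseable prediction scores zero. The names `Box`, `iou`, and `iou_reward` are hypothetical.

```python
from typing import Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(pred: Box, target: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be inverted if disjoint).
    ix1 = max(pred[0], target[0])
    iy1 = max(pred[1], target[1])
    ix2 = min(pred[2], target[2])
    iy2 = min(pred[3], target[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(pred) + box_area(target) - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred: Optional[Box], crop_box: Box) -> float:
    """Hypothetical reward: IoU between the model's predicted box and the
    region the training-time cue crop was cut from. A missing prediction
    (for example, output that could not be parsed) is assumed to score 0."""
    if pred is None:
        return 0.0
    return iou(pred, crop_box)
```

Because the crop's location in the full image is known when the crop is made, the reward can be checked automatically, which is what makes the proxy task verifiable. A practical variant might threshold or reshape the IoU, but the abstract does not say whether LOCUS does so.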