VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

2026-06-29 • Computer Vision and Pattern Recognition

AI summary

The authors address the difficulty large vision-language models have in focusing on important details in high-resolution images and long videos. Rather than cropping regions and re-encoding them, their method, VisReflect, generates a continuous latent representation that emphasizes the visual information relevant to the question. This guides the model's attention toward the relevant regions or frames within a single forward pass. Evaluations show improved accuracy on both image and video benchmarks, and on video understanding the method matches zoom-in approaches while reducing inference time.

Large Vision Language Models, Visual Attention, Latent Space, High-resolution Images, Video Understanding, Attention Mechanism, Bounding Boxes, Inference Time, Visual Tokens, Fine-grained Perception

Authors

Xiaoqian Shen, Mohamed Elhoseiny

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable success on vision-language tasks, yet fine-grained perception over high-resolution images and long-context videos remains challenging. As the number of visual tokens increases, the visual attention sink phenomenon becomes increasingly severe, causing irrelevant tokens to absorb a disproportionate share of the attention mass. Recent approaches attempt to mitigate this issue by explicitly predicting bounding boxes or temporal spans and re-encoding the cropped visual regions. Such methods depend on unreliable numeric localization in the discrete token space and incur significant computational overhead due to additional forward passes. In this work, we propose **VisReflect**, a simple yet effective framework that improves fine-grained perception in long visual contexts through latent visual reflection. Instead of decoding intermediate predictions into discrete tokens, the model generates continuous visual reflections that represent question-relevant visual features in the latent space. These reflections selectively emphasize salient regions or frames, guiding attention towards relevant visual tokens within a single forward pass. We conduct comprehensive evaluations on challenging high-resolution image benchmarks, including BLINK, V*, and HRBench-4K/8K, as well as video understanding benchmarks such as MVBench, VideoMME, and MLVU. Our method consistently improves over strong baselines, achieving gains of 4.1% on image benchmarks and 1.8% on video benchmarks. Compared with zooming-based methods, our model achieves comparable performance while reducing inference time by roughly 44% on video understanding.
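
The claim that attention to relevant content degrades as visual tokens multiply can be illustrated with a toy softmax calculation. The sketch below is not the authors' architecture: the function names, the scores, and the `bias` term (a crude stand-in for any mechanism that emphasizes a question-relevant token) are all assumptions chosen for illustration. It only shows that a single relevant token's share of attention shrinks as more distractor tokens enter the softmax, and that raising its score recovers part of that share without a second pass over the tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevant_attention_mass(num_distractors, relevant_score=2.0,
                            distractor_score=0.0, bias=0.0):
    """Attention mass on one relevant token among `num_distractors` others.

    `bias` is a hypothetical additive boost to the relevant token's score;
    it is an illustrative assumption, not the mechanism used by VisReflect.
    """
    scores = [relevant_score + bias] + [distractor_score] * num_distractors
    return softmax(scores)[0]


if __name__ == "__main__":
    # Compare the relevant token's share with and without the boost
    # as the number of distractor tokens grows.
    for n in (10, 100, 1000, 10000):
        plain = relevant_attention_mass(n)
        boosted = relevant_attention_mass(n, bias=4.0)
        print(f"tokens={n:>6}  plain={plain:.4f}  boosted={boosted:.4f}")
```

In this toy setting the relevant token's share is exp(s) / (exp(s) + N) for N distractors with score 0, so it falls toward zero as N grows. Real attention sinks involve learned scores across many heads and layers, which this sketch does not model.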
