Faithful Grounded Visual Reasoning via Learned Proxy-Tokens

2026-06-22 | Computer Vision and Pattern Recognition

AI summary

The authors worked on making multimodal large language models (MLLMs), which answer questions about images, easier to understand and trust. They created a new way to link parts of an image to the model's reasoning using learned proxy-tokens instead of just text-based coordinates, which often don't match the actual image. Their new model, Composer, showed better accuracy in connecting reasoning to the right image parts without losing answer quality. They also made a special dataset, ComposerGCoT, to test how well these links work. This approach could help build AI systems that explain their decisions in a clearer and more reliable way.

Multimodal Large Language Models (MLLMs), Visual Question Answering (VQA), Grounded Visual Reasoning (GVR), Visual grounding, Proxy-tokens, Semantic-spatial gap, Latent space, Interpretability, ComposerGCoT dataset, Reasoning consistency
Authors
Tom Hodemon, Mohamed Chaouch, Aboubacar Tuo, Angelique Loesch
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in Visual Question Answering (VQA), yet their "black-box" nature hinders deployment in critical domains. Grounded Visual Reasoning (GVR) approaches attempt to improve interpretability by explicitly coupling textual rationales with visual grounding information, typically expressed as textual coordinates. This mechanism lacks a learnable semantic link to the visual features, often resulting in a semantic-spatial gap in which the model hallucinates coordinates that do not correspond to image evidence. In this work, we introduce Composer, an MLLM that leverages a novel visual grounding mechanism based on learned proxy-tokens to promote faithful interpretability. These discrete symbolic pointers explicitly index the image latent space, allowing the model to treat visual regions as addressable, semantically manipulable sets. To rigorously validate this grounding mechanism, we constructed ComposerGCoT, a dataset synthesized to enable holistic assessment of reasoning consistency and grounding accuracy. Experimental results indicate that Composer achieves performance parity with its coordinate-based counterpart in final answer accuracy, while improving visual grounding accuracy by 9.0 points. By demonstrating that discrete proxy-tokens capture spatial semantics more effectively than typical textual coordinates, we establish that visual grounding mechanisms with learnable semantic links represent a promising path toward trustworthy and reliable MLLMs.
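
To make the idea of proxy-tokens as discrete pointers into the image latent space more concrete, the following is a minimal illustrative sketch, not the Composer implementation. It assumes a hypothetical 4x4 grid of image patches, a hypothetical token naming scheme (`<p0>` to `<p15>`), and hypothetical helper names (`proxy_token`, `region_from_box`, `region_to_tokens`); none of these come from the paper. It only shows how a region could be represented as a set of patch indices rather than as free-form textual coordinates, so that regions can be combined with ordinary set operations.

```python
from typing import FrozenSet, List

# Assumption: the image is encoded as a 4x4 grid of patch latents.
# The real grid size and token vocabulary used by Composer are not given in the abstract.
GRID_W, GRID_H = 4, 4


def proxy_token(index: int) -> str:
    """Name of the hypothetical proxy-token pointing at patch `index`."""
    return f"<p{index}>"


def region_from_box(x0: float, y0: float, x1: float, y1: float) -> FrozenSet[int]:
    """Return the patch indices whose cells overlap a normalized box in [0, 1]."""
    cells = set()
    for row in range(GRID_H):
        for col in range(GRID_W):
            # Normalized extent of this patch cell.
            cx0, cy0 = col / GRID_W, row / GRID_H
            cx1, cy1 = (col + 1) / GRID_W, (row + 1) / GRID_H
            # Strict overlap test: cells that only touch the box edge are excluded.
            if cx0 < x1 and cx1 > x0 and cy0 < y1 and cy1 > y0:
                cells.add(row * GRID_W + col)
    return frozenset(cells)


def region_to_tokens(region: FrozenSet[int]) -> List[str]:
    """Serialize a region as a sorted sequence of hypothetical proxy-tokens."""
    return [proxy_token(i) for i in sorted(region)]


# Two overlapping regions expressed as addressable sets of patches.
top_left = region_from_box(0.0, 0.0, 0.5, 0.5)
center = region_from_box(0.25, 0.25, 0.75, 0.75)

# Set operations act directly on the patch indices.
shared = top_left & center
combined = top_left | center

print(region_to_tokens(top_left))
print(region_to_tokens(shared))
```

In this sketch every token refers to a patch that exists in the grid, so a pointer cannot fall outside the image the way a hallucinated textual coordinate can. In the paper the link between a proxy-token and the visual features is learned, which this fixed lookup does not attempt to reproduce.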