H-GRPO: Permutation-Invariant Reinforcement Learning for Grounded Visual Reasoning

2026-06-29 • Computer Vision and Pattern Recognition


AI summary

The authors created a new way to make vision-language models (which understand images and text) more accurate and easier to understand. Instead of giving one big answer, their method breaks down a big question into smaller steps, each with a clear answer and a specific part of the image as evidence. This helps the model explain its reasoning by showing which parts of the image support each step. This approach aims to reduce guessing and make the model's answers more reliable and transparent.

Vision-Language Models, hallucination, interpretability, decomposition, bounding box, visual grounding, structured reasoning, sub-questions, image understanding, explainability

Authors

Eric Peh, Debaditya Roy, Basura Fernando

Abstract

Vision-Language Models (VLMs) often achieve high benchmark performance while remaining "black boxes": they are prone to hallucination and often rely on superficial shortcuts. In this work, we propose a framework designed to enhance both performance and interpretability through De-compositional Evidence Grounding. Unlike monolithic inference approaches, our method forces the model to decompose a global query into a sequence of atomic sub-questions, each requiring an explicit sub-answer and, critically, a localized evidence bounding box. By grounding intermediate logical steps (e.g., identifying a container, analyzing liquid properties, and assessing environmental context) in specific visual regions, we construct a structured reasoning path that mirrors human-like deduction. This allows the final answer to emerge as a logical consequence of verified visual facts rather than a statistical guess.
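
To make the structure described in the abstract concrete, the following is a minimal sketch of how a decomposed reasoning path might be represented and sanity-checked. It is not the authors' implementation: the class names (`ReasoningStep`, `ReasoningPath`), the pixel-coordinate box format `(x_min, y_min, x_max, y_max)`, and the helpers `box_is_valid` and `path_is_grounded` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format (not specified in the abstract):
# (x_min, y_min, x_max, y_max) in pixel coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class ReasoningStep:
    """One atomic sub-question with its sub-answer and evidence region."""
    sub_question: str
    sub_answer: str
    evidence_box: BBox


@dataclass
class ReasoningPath:
    """A global query decomposed into grounded steps plus a final answer."""
    query: str
    steps: List[ReasoningStep]
    final_answer: str


def box_is_valid(box: BBox, width: int, height: int) -> bool:
    """Check that the box has positive area and lies inside the image."""
    x0, y0, x1, y1 = box
    return 0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height


def path_is_grounded(path: ReasoningPath, width: int, height: int) -> bool:
    """A path counts as grounded here if it has at least one step and every
    step carries a non-empty sub-answer and a valid evidence box."""
    if not path.steps:
        return False
    return all(
        step.sub_answer.strip() != ""
        and box_is_valid(step.evidence_box, width, height)
        for step in path.steps
    )
```

A hypothetical instance, loosely following the abstract's example of a container, a liquid, and the surrounding context, could be built as `ReasoningPath(query=..., steps=[ReasoningStep("What container is shown?", "a glass", (120, 80, 260, 300)), ...], final_answer=...)` and passed to `path_is_grounded` together with the image width and height. The questions, answers, and coordinates in such an instance are illustrative placeholders, not data from the paper.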
