Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

2026-06-22 • Computer Vision and Pattern Recognition

AI summary

The authors address multi-view 3D visual question answering, where a model must combine several partial views of a scene to answer questions about it. They propose DR-MV3D, which gives the model dense feedback on its intermediate reasoning steps instead of only checking whether the final answer is right. The method splits the problem into three stages: building a global map of the scene, planning which views to examine based on the question, and grounding the answer in those views. Experiments on MindCube, VSI-Bench, and BLINK (MV) show improvements over multi-image baselines, which the authors attribute to more consistent reasoning across views.

Keywords: 3D Visual Question Answering, multi-view reasoning, allocentric map, egocentric grounding, trajectory planning, dense reward, policy optimization, 3D vision foundation models, spatial reasoning

Authors

Jiho Choi, Seonho Lee, Seojeong Park, Hyunjung Shim

Abstract

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
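
As a rough illustration of how dense process rewards could feed a group-relative policy update, the sketch below combines an answer reward with a map-consistency term and an order-aware trajectory term, then normalizes rewards within a group of rollouts as GRPO-style methods do. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the grid-cell map representation, the longest-common-subsequence trajectory score, the weights, and the use of population standard deviation are all hypothetical stand-ins for the authors' actual reward definitions.

```python
from statistics import mean, pstdev


def map_consistency(pred_map, target_map):
    """Hypothetical global reward: fraction of pseudo-target objects
    whose predicted grid cell matches the target cell."""
    if not target_map:
        return 0.0
    hits = sum(1 for name, cell in target_map.items() if pred_map.get(name) == cell)
    return hits / len(target_map)


def trajectory_reward(pred_views, ref_views):
    """Hypothetical local reward: longest common subsequence between the
    predicted and reference view orderings, divided by the reference length."""
    if not ref_views:
        return 0.0
    m, n = len(pred_views), len(ref_views)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if pred_views[i] == ref_views[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return lcs[m][n] / n


def total_reward(rollout, target_map, ref_views, answer, w_map=0.5, w_traj=0.5):
    """Sparse answer reward plus weighted dense terms (weights are assumed)."""
    r_answer = 1.0 if rollout["answer"] == answer else 0.0
    r_map = map_consistency(rollout["map"], target_map)
    r_traj = trajectory_reward(rollout["views"], ref_views)
    return r_answer + w_map * r_map + w_traj * r_traj


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, two rollouts that both reach the wrong answer can still receive different advantages if one builds a more accurate map or visits views in a better order, which is the intuition behind supervising the process and not just the outcome.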
