Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
2026-03-18 • Computer Vision and Pattern Recognition • Artificial Intelligence • Computation and Language
AI summary
The authors created Loc3R-VLM, a system that helps 2D vision-language models understand 3D spaces better using just regular video. Instead of adding geometric hints, they teach the model to build a full 3D map of the scene and understand the viewpoint of the camera, like how humans perceive space. They also use camera position info from a pre-trained model to keep the 3D layout accurate. Their method improves the model's ability to answer questions about locations and 3D scenes compared to previous approaches.
Multimodal Large Language Models • Vision-Language Models • 3D Spatial Understanding • Monocular Video • Global Layout Reconstruction • Egocentric Perspective • Camera Pose Priors • 3D Question Answering • Spatial Supervision
Authors
Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys
Abstract
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts augment input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction, which builds a holistic representation of the scene structure, and explicit situation modeling, which anchors the egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
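The two joint objectives described above can be pictured as a weighted multi-task loss: one term supervising the reconstructed global layout, one term supervising the egocentric camera situation. The sketch below is illustrative only — the function names, the plain MSE form of each term, and the weights are assumptions for exposition, not the paper's actual loss formulation.

```python
# Illustrative sketch of joint spatial supervision: a weighted sum of a
# global layout reconstruction term and an egocentric situation term.
# MSE is used here purely as a placeholder for each objective.

def mse(pred, target):
    """Mean squared error between two equal-length numeric sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_spatial_loss(pred_layout, gt_layout, pred_pose, gt_pose,
                       w_layout=1.0, w_situation=1.0):
    """Combine both objectives into a single training signal."""
    layout_term = mse(pred_layout, gt_layout)      # scene structure
    situation_term = mse(pred_pose, gt_pose)       # camera viewpoint
    return w_layout * layout_term + w_situation * situation_term
```

In a real training setup the layout term would operate on predicted scene geometry (e.g. point maps) and the situation term on the camera pose, with the pose priors from the pre-trained 3D foundation model providing the metric-scale targets.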