AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

2026-05-25 · Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Robotics
AI summary

The authors present AgentGrounder, a zero-shot method for locating objects in 3D scenes from natural language descriptions without task-specific 3D training. The system first builds an object database (an Object Lookup Table) offline, then uses a tool-driven agent to retrieve only the relevant candidate objects and to render images only when extra visual evidence is needed. Compared with fixed anchor-target matching pipelines, this reduces cascading matching errors and keeps prompts free of irrelevant objects. On ScanRefer and Nr3D, AgentGrounder improves accuracy over SeeGround, notably on view-independent Nr3D queries.

3D Visual Grounding, Zero-shot Learning, Point Clouds, Vision-Language Models, Object Lookup Table, Geometric Scoring, Natural Language Processing, ScanRefer, Nr3D Dataset, Embodied AI
Authors
Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin
Abstract
3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D large vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
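
To make the two-stage design more concrete, the following is a minimal Python sketch of an Object Lookup Table with label-based retrieval, a toy geometric score, and a check for when rendering might be triggered. It is not the authors' implementation: the names `OLTEntry`, `retrieve`, `nearest_target`, and `needs_rendering`, the box representation (center plus extents), and the use of center-to-center distance as the score for a "near" relation are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class OLTEntry:
    """One row of a hypothetical Object Lookup Table."""
    instance_id: int
    label: str
    center: Tuple[float, float, float]  # assumed box center (x, y, z)
    size: Tuple[float, float, float]    # assumed box extents (dx, dy, dz)


def retrieve(olt: List[OLTEntry], label: str) -> List[OLTEntry]:
    """Return only the entries whose semantic label matches the query term."""
    return [entry for entry in olt if entry.label == label]


def nearest_target(olt: List[OLTEntry], target_label: str,
                   anchor_label: str) -> Optional[OLTEntry]:
    """Pick the target candidate closest to any anchor candidate.

    Assumption: a 'near' relation is scored by Euclidean distance
    between box centers; lower is better.
    """
    targets = retrieve(olt, target_label)
    anchors = retrieve(olt, anchor_label)
    if not targets or not anchors:
        return None

    def score(target: OLTEntry) -> float:
        return min(math.dist(target.center, a.center) for a in anchors)

    return min(targets, key=score)


def needs_rendering(query_cues: List[str]) -> bool:
    """Request rendering only if the query mentions a visual cue (assumed set)."""
    visual_cues = {"color", "material", "viewpoint"}
    return bool(visual_cues & set(query_cues))


# Toy scene for the query "the chair near the table".
olt = [
    OLTEntry(0, "chair", (0.0, 0.0, 0.5), (0.5, 0.5, 1.0)),
    OLTEntry(1, "chair", (3.0, 0.0, 0.5), (0.5, 0.5, 1.0)),
    OLTEntry(2, "table", (2.5, 0.0, 0.4), (1.2, 0.8, 0.8)),
    OLTEntry(3, "lamp", (5.0, 5.0, 1.0), (0.3, 0.3, 1.5)),
]
best = nearest_target(olt, "chair", "table")
print(best.instance_id if best else None)
```

In this sketch only the chair and table entries are ever scored; the lamp is never touched, which mirrors the idea of keeping irrelevant objects out of the agent's context. A query cue such as "color" would make `needs_rendering` return `True`, standing in for the on-demand rendering step.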