Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision

2026-03-27

Computer Vision and Pattern Recognition
AI summary

The authors created EgoPoint-Ground, a large dataset with over 15,000 real-world examples where people point and speak to indicate objects from a first-person view. It helps computers interpret hand gestures and speech together to find objects that text-only methods often miss. They benchmarked a range of existing models and introduced a new approach, SV-CoT, which links pointing and language step by step to improve accuracy. Their method outperforms existing approaches by 11.7%, helping machines interpret combined speech and gestures more reliably. The dataset and code will be shared publicly for future research.

Visual Grounding, Egocentric Vision, Deictic Gestures, Referring Expression Resolution, Multimodal Large Language Models, Hand-Pointing, Bounding Boxes, Semantic Captions, Visual Chain-of-Thought, Multimodal Interaction
Authors
Ling Li, Bowen Liu, Zinuo Zhan, Peng Jie, Jianhui Zhong, Kenglun Chang, Zhidong Deng
Abstract
Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric interactions, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.
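
To make the described annotations concrete, the sketch below shows one plausible way a single EgoPoint-Ground sample could be represented. The field names and values are illustrative assumptions based only on the abstract's description (hand-target bounding box pairs, a spoken referring expression, and a dense semantic caption); the paper's actual release schema may differ.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical record for one egocentric deictic grounding sample.
# Field names are assumptions for illustration, not the dataset's real format.
BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class DeicticSample:
    image_path: str     # egocentric frame containing the pointing gesture
    hand_bbox: BBox     # bounding box of the pointing hand
    target_bbox: BBox   # bounding box of the referred (target) object
    utterance: str      # the possibly ambiguous spoken referring expression
    caption: str        # dense semantic caption describing the target object


# Example instance with made-up values, purely to show the structure.
sample = DeicticSample(
    image_path="frames/kitchen_0042.jpg",
    hand_bbox=(512.0, 380.0, 640.0, 470.0),
    target_bbox=(700.0, 300.0, 820.0, 410.0),
    utterance="hand me that one",
    caption="the blue ceramic mug on the counter next to the kettle",
)
```

Under this reading, a grounding model consumes the frame, the hand box (or the gesture it implies), and the utterance, and must predict the target box; the dense caption supports the step-by-step gestural-plus-linguistic reasoning that the SV-CoT baseline formalizes.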