Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation
2026-06-01 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Robotics
AI summary
The authors propose Goal2Pixel, a new way for robots to navigate using vision-language models (VLMs) by pointing to pixels in images instead of predicting low-level actions. Rather than deciding every small move, their method picks a pixel in the image that the robot should move toward, which is then converted into a 3D waypoint. They also add special image regions for commands like turning or stopping, and use a keyframe memory to retain important visual history. Their approach requires fewer VLM calls than previous methods while still performing well on navigation tasks.
Vision-language models, Vision-and-language navigation, Pixel grounding, 3D waypoint, Action prediction, Keyframe memory, Semantic embeddings, RL navigation benchmarks, R2R-CE dataset, SPL (Success weighted by Path Length)
Authors
Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang
Abstract
Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a purely pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a navigable pixel visible to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left, right, and bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves performance competitive with the state of the art while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, about 6x fewer than the 46.62 calls required by direct action prediction, which reaches 32.9% SR. The same trend holds on RxR-CE.
Project Page: https://baobao0926.github.io/Goal2Pixel/
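The pixel interface described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the padding width `margin`, the rule that the bottom region takes priority in the corners, and the use of a metric depth value with pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) are all assumptions made here for illustration. The abstract only states that a predicted pixel is either back-projected into a 3D waypoint or, when it falls in the left, right, or bottom directive region, read as turn left, turn right, or stop.

```python
def decode_pixel(u, v, width, height, margin):
    """Interpret a predicted pixel (u, v) on a padded canvas.

    Assumed layout (hypothetical): the camera image of size width x height
    is padded by `margin` pixels on the left, right, and bottom, so the
    canvas is (width + 2 * margin) wide and (height + margin) tall.
    """
    if v >= height:
        # Bottom directive region. Corner pixels are treated as "stop";
        # this priority is an assumption, not taken from the paper.
        return ("stop", None)
    if u < margin:
        return ("turn_left", None)
    if u >= margin + width:
        return ("turn_right", None)
    # Inside the original image: shift back to image coordinates.
    return ("forward", (u - margin, v))


def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project an image pixel to a 3D point in the camera frame.

    Standard pinhole model; `depth` is the metric distance along the
    optical axis at that pixel (assumed to be available, e.g. from a
    depth sensor).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Example: a 640x480 image padded by 40 px on the left, right, and bottom.
action, pixel = decode_pixel(460, 240, width=640, height=480, margin=40)
if action == "forward":
    waypoint = backproject(pixel[0], pixel[1], depth=2.0,
                           fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

In this example the predicted pixel lies inside the image, so it is shifted to image coordinates (420, 240) and back-projected to a waypoint 2 m ahead and 0.4 m to the right of the optical axis. A single prediction of this kind can stand in for a sequence of short motion primitives, which is consistent with the lower number of VLM calls per episode reported above.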