GN0: Toward a Unified Paradigm for Generation, Evaluation, and Policy Learning in Visual-Language Navigation

2026-06-02 · Robotics

AI summary

The authors created a large, diverse 3D environment dataset called GN-Matrix to improve how robots navigate using vision and language. They developed a high-quality simulation platform with realistic movement and obstacles, plus a new test called GN-Bench to evaluate robot-human interactions. Their model, called BAE, learns navigation through a mix of supervised and reinforcement learning, using special bird's eye view maps for better spatial understanding. Their approach helps robots follow instructions, humans, and reach goals more effectively than previous methods. Overall, their work integrates data, simulation, and learning to advance robot navigation research.

Embodied Navigation, Vision-and-Language Navigation (VLN), 3D Gaussian Splatting (3DGS), Simulation Platform, Bird's Eye View (BEV), Reinforcement Learning (RL), DAgger Algorithm, Human-Robot Interaction, Foundation Model, Navigation Benchmark
Authors
Xinhai Li, Xiaotao Zhang, Yuehao Huang, Jiankun Dong, Tianhang Wang, Sunyao Zhou, Yunzi Wu, Chengnuo Sun, Yunfei Ge, Qizhen Weng, Chi Zhang, Chenjia Bai, Xuelong Li
Abstract
Embodied navigation connects intelligent agents with the physical world and is fundamental for general robotic intelligence. Limited availability and quality of navigation data have constrained Vision-and-Language Navigation (VLN) systems' generalization and long-horizon capabilities. To address this, we curate diverse 3D scenes and develop an automated pipeline for large-scale navigation data, resulting in the GN-Matrix dataset. Building on a 3D Gaussian Splatting (3DGS) engine, we introduce a high-fidelity simulation platform supporting interactive roaming and collision-aware navigation. We further propose GN-Bench, the first BEV-based benchmark incorporating dynamic 3DGS avatars for human-robot interaction evaluation. To leverage the simulator, we develop an RL-driven navigation foundation model, Break and Establish (BAE). After supervised learning, DAgger exposes the model to rollout-induced states, breaking narrow expert-centric distributions and enabling downstream RL exploration. This unified VLN paradigm integrates map-based and map-free tasks, including instruction following, human following, and goal navigation. GN-BAE formalizes high-fidelity 3DGS-rendered Bird's Eye View representations as compact memory, unlocking latent spatial reasoning in VLMs. Extensive evaluations on GN-Bench and VLN-CE show that GN0 outperforms state-of-the-art VLN methods. Overall, GN-Matrix offers a unified framework spanning data, simulation, and learning, advancing embodied navigation in research and industrial applications.
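The abstract's description of DAgger (supervised learning first, then exposure to rollout-induced states) can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D corridor, the tabular majority-vote "policy", and all names (`GOAL`, `SIZE`, `expert_action`, `fit`, `rollout`, `dagger`) are hypothetical stand-ins chosen only to show the data-aggregation loop, and the downstream RL stage is omitted.

```python
import random
from collections import Counter, defaultdict

GOAL, SIZE = 7, 10  # hypothetical 1-D corridor with cells 0..9


def expert_action(state):
    """Expert label: step toward the goal (+1 or -1), or 0 at the goal."""
    return (GOAL > state) - (GOAL < state)


def fit(dataset):
    """Stand-in for supervised learning: majority-vote action per seen state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}


def act(policy, state, rng):
    # States never seen in training get a random action; this models the
    # failure of a policy trained only on a narrow expert distribution.
    return policy.get(state, rng.choice([-1, 1]))


def rollout(policy, start, rng, horizon=20):
    """Run the learner's own policy and return the states it visits."""
    state, visited = start, []
    for _ in range(horizon):
        visited.append(state)
        if state == GOAL:
            break
        state = min(SIZE - 1, max(0, state + act(policy, state, rng)))
    return visited


def dagger(iterations=5, seed=0):
    rng = random.Random(seed)
    # Stage 1: expert demonstrations cover only states 3..GOAL.
    dataset = [(s, expert_action(s)) for s in range(3, GOAL + 1)]
    policy = fit(dataset)
    # Stage 2: DAgger. Roll out the learner, relabel the visited states with
    # the expert, aggregate into the dataset, and refit.
    for _ in range(iterations):
        for start in range(SIZE):
            for s in rollout(policy, start, rng):
                dataset.append((s, expert_action(s)))
        policy = fit(dataset)
    return policy
```

In this toy setting the initial policy knows only the states along one expert trajectory. After aggregation it agrees with the expert on every cell, because the learner's own rollouts surface states the demonstrations never covered.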