Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

2026-06-01, Robotics

Robotics, Computer Vision and Pattern Recognition
AI summary

The authors propose Hierarchical Semantic-Augmented Navigation (HSAN), a method that helps robots understand and navigate complex indoor spaces by combining language instructions with visual information. The approach builds a layered map that captures objects, regions, and zones; uses an optimal-transport-based planner to pick long-term goals that are both relevant to the instruction and reachable; and applies a graph-aware reinforcement learning controller to move between subgoals while avoiding obstacles. The framework targets the weaknesses of earlier methods on long tasks and complex layouts. According to the authors, tests on several benchmarks show that HSAN completes navigation tasks more often and generalizes better to unseen environments.

Vision-Language Navigation, Hierarchical Semantic Scene Graph, Optimal Transport, Topological Planner, Reinforcement Learning, Spectral Graph Theory, Multi-modal Learning, 3D Indoor Navigation, Spatial Reasoning, Robotics
Authors
Xiang Fang, Wanlong Fang, Changshuo Wang
Abstract
Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.
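
The abstract does not spell out how the planner is formulated, so the following is only an illustrative sketch of how an optimal transport plan could trade off semantic relevance against spatial accessibility. It is not the authors' method. It assumes an entropic-regularized problem solved with Sinkhorn iterations (a standard solver swapped in here for simplicity), a semantic cost matrix between instruction phrases and candidate graph nodes, and a target marginal that puts more mass on nodes with a lower travel cost. All names (`sinkhorn_plan`, `accessibility_marginal`, `select_goal`) and all parameter values are hypothetical.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=200):
    """Entropic-regularized transport plan between histograms a and b.

    cost: (n, m) array of pairwise costs; a: (n,) and b: (m,) sum to 1.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def accessibility_marginal(travel_cost, tau=1.0):
    """Assumed target marginal: nearer nodes receive more mass."""
    w = np.exp(-np.asarray(travel_cost, dtype=float) / tau)
    return w / w.sum()


def select_goal(plan, phrase_idx):
    """Pick the node that receives the most mass from one phrase."""
    return int(np.argmax(plan[phrase_idx]))


# Toy problem (made-up numbers): 2 instruction phrases, 3 candidate nodes.
semantic_cost = np.array([[0.1, 0.9, 0.8],
                          [0.9, 0.2, 0.7]])  # low cost = good match
travel_cost = np.array([1.0, 2.0, 5.0])       # path length to each node

a = np.full(2, 0.5)                           # uniform mass over phrases
b = accessibility_marginal(travel_cost)
plan = sinkhorn_plan(semantic_cost, a, b)
goal = select_goal(plan, phrase_idx=0)
```

Encoding accessibility in the target marginal rather than as an additive per-node cost is a deliberate choice in this sketch: with both marginals fixed, a cost term that depends only on the node adds a constant to the objective and would not change the plan.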