Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

2026-06-29 • Robotics

Robotics, Artificial Intelligence

AI summary

The authors found that current visual navigation planners often separate deciding where to go from figuring out how to get there, which causes problems like inefficiency and mismatched visuals. They created SWAM, a new system that, given photos of the start and goal, predicts the full path and actions in one step, making the route more accurate and spatially sensible. SWAM learns from depth information during training but only needs regular single-camera images when in use. The authors also added features to fine-tune action choices based on visuals and keep predictions stable over different distances. Their tests show SWAM works better and faster than older methods, even in new places it hasn't seen before.

world model, visual navigation, RGB-D sequences, trajectory synthesis, monocular RGB input, depth pseudo-labels, action refinement, trajectory regularization, zero-shot generalization

Authors

Hong Chen, Daqi Liu, Zehan Zhang, Haiguang Wang, Tianhao Lu, Longfei Yan, Haiyang Sun, Fangzhen Li, Hongwei Xie, Bing Wang, Guang Chen, Hangjun Ye, Yihua Tan

Abstract

Existing world model-based planners for visual navigation typically follow a verification-centric paradigm, decoupling goal intent from trajectory synthesis. This approach suffers from candidate dependence, heavy computational overhead, and inconsistencies between sampled actions and predicted visuals. To address these issues, we propose SWAM (Spatial-perceiving World Action Model), a task-centric joint observation-action generation framework. Given start and goal RGB observations, SWAM performs single-pass inference to simultaneously generate intermediate RGB-D sequences and corresponding action trajectories, promoting goal-consistent trajectory generation and improved spatial feasibility. While SWAM leverages depth pseudo-labels during training to internalize spatial priors, it requires only monocular RGB input at inference time. We further introduce a visual-guided action refinement module and a trajectory-scale regularization loss to enforce fine-grained alignment between motion and visual cues while stabilizing predictions across varying distances. Extensive experiments show that SWAM significantly outperforms state-of-the-art two-stage planners in success rate, trajectory accuracy, and inference efficiency, while demonstrating robust zero-shot generalization to unseen environments.
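
The contrast between the verification-centric paradigm and single-pass joint generation can be illustrated with a toy sketch. This is not SWAM's implementation: 2D positions stand in for RGB observations, displacements stand in for actions, and every name here (`rollout`, `verify_then_select`, `joint_generate`) is hypothetical. The linear interpolation in `joint_generate` is only a placeholder for a learned generator.

```python
import math
import random


def rollout(start, actions):
    """Toy stand-in for a world model: integrate 2D displacement actions."""
    x, y = start
    states = []
    for dx, dy in actions:
        x, y = x + dx, y + dy
        states.append((x, y))
    return states


def verify_then_select(start, goal, horizon=8, num_candidates=64, seed=0):
    """Verification-centric stand-in: sample candidates, roll out each, keep the best."""
    rng = random.Random(seed)
    best_actions, best_states, best_error = None, None, math.inf
    for _ in range(num_candidates):
        # Candidates are sampled without looking at the goal.
        actions = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(horizon)]
        # One rollout per candidate, so cost grows with num_candidates.
        states = rollout(start, actions)
        error = math.dist(states[-1], goal)
        if error < best_error:
            best_actions, best_states, best_error = actions, states, error
    return best_actions, best_states


def joint_generate(start, goal, horizon=8):
    """Joint stand-in: one pass emits states and actions together, conditioned on the goal."""
    states = [
        (start[0] + (goal[0] - start[0]) * t / horizon,
         start[1] + (goal[1] - start[1]) * t / horizon)
        for t in range(1, horizon + 1)
    ]
    previous = [start] + states[:-1]
    # Actions are derived from the same states, so the two stay consistent.
    actions = [(s[0] - p[0], s[1] - p[1]) for s, p in zip(states, previous)]
    return actions, states


start, goal = (0.0, 0.0), (3.0, 1.0)
sampled_actions, sampled_states = verify_then_select(start, goal)
joint_actions, joint_states = joint_generate(start, goal)
sampled_error = math.dist(sampled_states[-1], goal)
joint_error = math.dist(joint_states[-1], goal)
```

In this toy setting the sampled plan can only be as good as its best candidate, while the goal-conditioned pass reaches the goal by construction. The sketch says nothing about how SWAM generates RGB-D frames or refines actions.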
