Learning Robot Visual Navigation in Crowds via Intention-Aware Scene Representations
2026-06-24 • Robotics
AI summary
The authors address how robots can navigate through crowds by better understanding people’s intentions using visual information. They created iCrowdNav, a system that looks at the scene from the robot’s viewpoint and uses visual cues like human poses and the environment to guess where people are likely to move. Their method combines a special encoder for scene layout with an attention module to predict pedestrian intentions, helping the robot make smarter navigation decisions. Tests show their approach works better than previous ones, and it also performs well in real-world settings.
robot crowd navigation, deep reinforcement learning, egocentric visual observation, spatio-temporal encoding, human intention inference, attention mechanism, pose estimation, state embedding, navigation policy
Authors
Han Bao, Bingyi Xia, Hanjing Ye, Yu Zhan, Hao Cheng, Baozhi Jia, Wenjun Xu, Jiankun Wang
Abstract
Robot crowd navigation requires the ability to infer human intentions while accounting for the structural constraints of the environment. Deep reinforcement learning (DRL) is a promising approach for learning navigation policies that account for human intentions. However, most existing DRL methods rely on limited scene representations, treating pedestrians as simple 2D points and ignoring rich visual cues from both humans and the environment. To address this issue, we introduce iCrowdNav, a novel visual crowd navigation method that uses intention-aware scene representations to encode behavioral and structural context from egocentric visual observations. Our method employs two key components: a spatio-temporal encoder that extracts occupancy features of the scene, and Intent-Interact Former (I$^2$ Former), an attention-based module that encodes human poses to infer pedestrians' motion intentions. These features are integrated into a compact state embedding that supports effective DRL policy training. Extensive experiments show that our method outperforms the baselines, and real-world deployment demonstrates vision-based crowd navigation.
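The abstract does not give architectural details, so the following is only a minimal sketch of the general idea of fusing an occupancy feature with attention-pooled pose features into one state vector. It is not the authors' I$^2$ Former or spatio-temporal encoder. The function names (`attention_pool`, `build_state_embedding`), the feature dimensions, and the use of plain scaled dot-product attention with a single robot query are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(query, keys, values):
    """Scaled dot-product attention of one query over a set of items.

    Returns the weighted sum of `values` and the attention weights.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(dim)]
    return pooled, weights

def build_state_embedding(occupancy_feat, robot_query, pose_feats):
    """Concatenate a scene occupancy feature with pooled pose features.

    Hypothetical helper: `pose_feats` holds one feature vector per
    pedestrian, assumed to have the same dimension as `robot_query`.
    With no pedestrians in view, the pooled part is all zeros.
    """
    if pose_feats:
        pooled, _ = attention_pool(robot_query, pose_feats, pose_feats)
    else:
        pooled = [0.0] * len(robot_query)
    return list(occupancy_feat) + pooled

# Toy inputs (made-up numbers, not data from the paper).
occupancy_feat = [0.2, 0.8, 0.1]          # stand-in for encoder output
robot_query = [1.0, 0.0]                  # stand-in for a robot-state query
pose_feats = [[2.0, 0.0], [0.0, 2.0]]     # one vector per pedestrian
state = build_state_embedding(occupancy_feat, robot_query, pose_feats)
```

In this sketch the resulting `state` has a fixed length regardless of how many pedestrians are visible, which is the property that makes such an embedding convenient as input to a DRL policy network.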