Co-training with Ego-centric Video and Demonstration for Robot Navigation Task
2026-06-01 • Robotics
AI summary
The authors developed a way to help robots learn to navigate using videos recorded from a person's point of view while walking. Their method estimates the camera motion in these human videos and converts it into actions that a ground mobile robot can imitate. Training a model on both the human-derived data and real robot data improved its language understanding and made its action generation more robust than training on either source alone. Tests on a fruit-search navigation task suggest that human egocentric videos are an effective and scalable data source for navigation, reducing reliance on costly robot data collection.
vision-language-action models, mobile robot navigation, imitation learning, egocentric videos, camera motion estimation, action representation, robot training data, fruit-search task, robot locomotion, data augmentation
Authors
Shoya Kuno, Yumo Ouchi, Kanata Suzuki
Abstract
Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to viewpoint changes during locomotion. In this paper, we propose a framework that converts egocentric walking videos into datasets for mobile robot imitation learning. The proposed method estimates camera motion from human videos and transforms it into action representations compatible with ground mobile robots. By jointly training a VLA model on human-derived and robot-collected datasets, the model achieves improved language understanding and more robust action generation than training with either data source alone. Experiments on a fruit-search navigation task demonstrate that human egocentric videos provide an effective and scalable data source for mobile robot learning.
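To make the conversion step concrete, the following is a minimal sketch of how an estimated camera trajectory might be turned into velocity commands for a ground robot. It is not the authors' implementation: the abstract does not specify the action representation, so the sketch assumes that camera motion has already been projected onto the ground plane as (x, y, yaw) poses, that the robot is a differential-drive platform commanded by linear and angular velocity, and that lateral motion is simply discarded. The names `poses_to_actions` and `wrap_angle` are hypothetical.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def poses_to_actions(poses, dt):
    """Convert planar camera poses into (v, omega) commands.

    Assumptions (not taken from the paper):
      - poses is a list of (x, y, yaw) tuples in a fixed world frame,
        with yaw in radians, sampled every dt seconds;
      - the robot cannot move sideways, so only the displacement along
        the current heading contributes to the linear velocity.
    Returns one (v, omega) pair per consecutive pose pair.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    actions = []
    for (x0, y0, yaw0), (x1, y1, yaw1) in zip(poses, poses[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Project the world-frame displacement onto the heading at the
        # start of the step; the lateral component is dropped.
        forward = dx * math.cos(yaw0) + dy * math.sin(yaw0)
        # Use the shortest signed heading change so that crossing the
        # +/-pi boundary does not produce a spurious full turn.
        dyaw = wrap_angle(yaw1 - yaw0)
        actions.append((forward / dt, dyaw / dt))
    return actions


if __name__ == "__main__":
    trajectory = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.0, math.pi / 2)]
    for v, omega in poses_to_actions(trajectory, dt=0.5):
        print(f"v={v:.2f} m/s, omega={omega:.2f} rad/s")
```

Dropping the lateral component is one simple way to handle the sideways sway of a walking camera; smoothing the trajectory before conversion would be another option under the same assumptions.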