ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

2026-06-15 • Robotics

Robotics, Machine Learning

AI summary

The authors introduce ROVE, a reinforcement learning framework for post-training Vision-Language-Action (VLA) models on humanoid robots using human interventions. Because human corrections can be hesitant, inefficient, or wrong, the method uses Optimistic Value Estimation (OVE) to prioritize the high-value parts of these corrections rather than imitating every action. It also uses cross-embodiment videos of humans to give the critic more supervision on rare failure and recovery modes. In real-world experiments on contact-rich, fine-grained manipulation tasks, ROVE outperforms experience-learning baselines and keeps improving across successive rollout-intervention iterations.

Vision-Language-Action (VLA), humanoid robots, reinforcement learning, human-in-the-loop, Optimistic Value Estimation (OVE), robot manipulation, trajectory data, critic network, imperfect supervision, fine-grained tasks

Authors

Wei Xiao, Weiliang Tang, Yuying Ge, Hui Zhou, Yao Mu, Li Zhang, Yixiao Ge

Abstract

Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to complex whole-body kinematics and dexterous-hand control. Consequently, the collected intervention trajectories are often suboptimal, and methods that rely on human interventions as expert supervision can absorb hesitant, inefficient, or even erroneous behaviors. To address both the system and algorithmic challenges, we propose ROVE, a reinforcement learning framework for humanoid VLA post-training with imperfect human interventions. First, ROVE introduces a human-in-the-loop pipeline capable of collecting deployment and intervention data for humanoid manipulation. Second, it utilizes Optimistic Value Estimation (OVE) to prioritize high-value behaviors from mixed-quality trajectories. To further robustify value estimation, we incorporate cross-embodiment human experience videos to provide rich supervision for long-tailed failure and recovery modes. The resulting critic yields informative advantage signals, steering the VLA actor to focus on high-value behaviors rather than indiscriminately imitating all actions. On challenging real-world contact-rich and fine-grained humanoid manipulation tasks, ROVE outperforms experience-learning baselines and consistently improves across multiple rollout-intervention iterations.
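
The abstract does not say how OVE is computed, so the following is only an illustrative sketch of the general idea of an optimistic critic steering an actor. It assumes expectile regression (the asymmetric least-squares estimator used in Implicit Q-Learning) as a stand-in for the optimistic value estimate and clipped exponentiated advantages as the imitation weights. The function names, the parameters `tau`, `beta`, and `max_weight`, and the toy returns are all hypothetical and are not taken from the paper.

```python
import math

def expectile(values, tau=0.9, iters=200):
    """Return the tau-expectile of `values` by fixed-point iteration.

    With tau > 0.5, samples above the current estimate get more weight,
    so the estimate leans toward the better outcomes (an optimistic value).
    This is an assumed stand-in, not the paper's OVE definition.
    """
    v = sum(values) / len(values)  # start from the plain mean
    for _ in range(iters):
        weights = [tau if x > v else 1.0 - tau for x in values]
        v = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return v

def advantage_weights(returns, value, beta=1.0, max_weight=20.0):
    """Clipped exp(advantage / beta) weights for a weighted imitation loss."""
    return [min(math.exp((r - value) / beta), max_weight) for r in returns]

# Hypothetical returns of mixed-quality trajectories from the same state:
# three hesitant or failed attempts and one clean success.
returns = [0.0, 0.1, 0.2, 1.0]

v_mean = expectile(returns, tau=0.5)  # tau = 0.5 recovers the ordinary mean
v_opt = expectile(returns, tau=0.9)   # optimistic estimate, pulled toward 1.0
weights = advantage_weights(returns, v_opt, beta=0.5)

print(f"mean value: {v_mean:.3f}, optimistic value: {v_opt:.3f}")
print("imitation weights:", [round(w, 3) for w in weights])
```

In this toy setting the optimistic estimate sits well above the mean, so only the successful trajectory has a positive advantage and it receives the largest imitation weight, while the hesitant ones are down-weighted instead of being copied indiscriminately.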
