ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

2026-06-08 · Robotics

Robotics, Artificial Intelligence, Machine Learning
AI summary

The authors developed ReCoVLA, a system that helps a robot recover when a vision-language-action policy fails during tasks like manipulating objects. Instead of changing the original policy, they use an external model to understand what went wrong and guide a recovery step using specially designed rewards. This approach separates understanding the problem from fixing it, making recovery easier to train and apply in the real world. Their experiments show ReCoVLA improves success rates both in simulations and real physical tests.

Vision-language-action policy, failure recovery, vision-language model, reward shaping, sim-to-real transfer, robotic manipulation, zero-shot learning, residual policy training
Authors
Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang, Kei Suzuki, Matthew Brand, Toshiaki Koike-Akino
Abstract
Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states that require targeted recovery. We propose ReCoVLA, a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control, which lets the framework support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $\pi_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.
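The reward-selector idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the component names (`reach`, `grasp`, `align`, `lift`), the `RecoveryDescriptor` fields, the unit default weights, and the additive residual with a `scale` factor are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how a VLM-predicted mask could switch predefined reward terms on or off, and how a residual correction could be added to the output of a frozen base policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class RecoveryDescriptor:
    """Hypothetical container for what the VLM predicts about a failure."""
    failure_mode: str       # e.g. "object_slipped" (illustrative label)
    stage: str              # e.g. "regrasp" (illustrative label)
    mask: Dict[str, bool]   # which reward components are active


def compile_reward(
    descriptor: RecoveryDescriptor,
    components: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Sum only the reward components that the mask selects.

    `components` holds per-step values computed in simulation; the VLM never
    produces these numbers, it only chooses which ones count.
    """
    weights = weights or {}
    total = 0.0
    for name, active in descriptor.mask.items():
        if active:
            # Assumed default weight of 1.0 when none is given.
            total += weights.get(name, 1.0) * components[name]
    return total


def residual_action(
    base_action: Sequence[float],
    residual: Sequence[float],
    scale: float = 0.1,
) -> List[float]:
    """Add a scaled correction to the frozen base policy's action (assumed additive form)."""
    return [b + scale * r for b, r in zip(base_action, residual)]


# Example: the mask keeps "reach" and "grasp" and ignores the other terms.
descriptor = RecoveryDescriptor(
    failure_mode="object_slipped",
    stage="regrasp",
    mask={"reach": True, "grasp": True, "align": False, "lift": False},
)
step_components = {"reach": 0.5, "grasp": 0.25, "align": 1.0, "lift": 2.0}
reward = compile_reward(descriptor, step_components)
action = residual_action([1.0, 2.0], [1.0, -1.0], scale=0.5)
```

Under these assumptions the base policy is never updated: only the residual term is trained, against whichever masked reward the descriptor selects.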