EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

2026-05-25 • Robotics

Robotics • Artificial Intelligence

AI summary

The authors address the challenge of teaching robots new tasks reliably with Vision-Language-Action (VLA) models, which generalize well but are not yet dependable enough for real-world deployment. They propose EXPO-FT, a system for stable, sample-efficient reinforcement learning fine-tuning of pretrained VLA policies. It achieves perfect task performance (30/30 successes) across all evaluated manipulation tasks, which demand combinations of high precision and dynamic actions, within an average of 19.1 minutes of online robot data, outperforming both RL-from-scratch and prior VLA fine-tuning approaches. The authors also release an open-source codebase so that other researchers can use and build on the approach.

Vision-Language-Action (VLA) models, reinforcement learning (RL), fine-tuning, robot manipulation, pretrained models, sample efficiency, robot data, generalization, task reliability, robotics

Authors

Perry Dong, Kuo-Han Hung, Tian Gao, Dorsa Sadigh, Chelsea Finn

Abstract

The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability required for real-world deployment. Reinforcement learning (RL) fine-tuning offers a promising path to bridge this gap, but existing approaches either train from scratch without fully leveraging pretrained priors, or fine-tune VLAs without achieving the sample efficiency and success rates that practical deployment demands. We present EXPO-FT, a system for stable, sample-efficient RL finetuning of pretrained VLA policies that closes this gap. Our system solves a suite of challenging manipulation tasks, including routing string lights and inserting the plug to light it up, striking a pool ball into a pocket, and inserting a flower into a wine bottle, each requiring combinations of high precision, dynamic actions, and robustness to varied initial states. Our system achieves perfect task performance (30/30 successes) across all evaluated tasks within an average of 19.1 minutes of online robot data, outperforming both prior RL-from-scratch and VLA finetuning approaches. We release an open-source codebase with the aim of facilitating broader adoption of RL finetuning of VLA models in robotics.
