Scaling by Diversified Experience for Vision-Language-Action Models

2026-06-08 Computer Vision and Pattern Recognition

AI summary

The authors address problems in models that combine vision, language, and actions, particularly how these models struggle when mixing high-level thinking with low-level controls and become unstable during training. They introduce SyVLA, a new model trained with varied experiences to be more reliable. Their approach separates control-related details from general reasoning and uses a reinforcement learning method that focuses on similar examples to keep learning steady. Tests show SyVLA works better on real robot tasks and handles new situations more effectively than previous models.

Vision-Language-Action models, policy optimization, reinforcement learning, Intention Decoupling, distribution shift, multi-modal benchmarks, robotic tasks, out-of-distribution generalization
Authors
Leiyu Wang, Zhaofengnian Wang, Xueqi Li, Luoyi Fan, Cewu Lu, Nanyang Ye
Abstract
Vision-Language-Action (VLA) models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control and the instability of policy optimization. In this paper, we introduce SyVLA, a robust VLA model trained with diversified experiences. We propose an Intention Decoupling algorithm to isolate control-relevant features from reasoning contexts, and a similar-sample guided RL pipeline to stabilize policy updates and mitigate distribution shift. Extensive experiments on real-world robotic tasks and multi-modal benchmarks demonstrate that SyVLA achieves higher task success rates and stronger out-of-distribution generalization than existing methods, while effectively preserving core vision-language capabilities. Code and datasets are released on the project page: https://sy-vla.github.io/