Vision-Language-Action Models: Experimental Insights from a Real-World UR5 Platform

2026-06-29 • Robotics

Robotics • Computer Vision and Pattern Recognition

AI summary

The authors studied how well Vision-Language-Action models, which combine seeing, understanding language, and acting, work when moved from controlled tests to a real robot arm. They built tools to collect robot data, prepare it, fine-tune the models, and run experiments on a physical robot. Their tests showed that success in offline tests doesn’t always mean the robot will behave well in real life, because of factors such as how actions are represented and the quality of the data. They conclude that making these systems work well depends more on managing the whole process, including data and control, than on improving the models alone.

Vision-Language-Action (VLA) models • UR5e robot manipulator • data acquisition pipeline • RLDS format • fine-tuning • closed-loop behavior • action representation • coordinate frames • temporal alignment • image preprocessing

Authors

Mathilde Hochedel, Marc Lalonde

Abstract

This project investigates whether recent Vision-Language-Action (VLA) models can be transferred from controlled research benchmarks to a real-world robotic platform, specifically a UR5e manipulator, in a reproducible and operationally meaningful manner. The work integrates real-robot data acquisition, dataset engineering (compatible with the RLDS format), and the fine-tuning and deployment of OpenVLA and OpenVLA-OFT models, with systematic validation of action representations and control interfaces. The project resulted in several foundational assets: (i) a complete real-robot data acquisition pipeline, (ii) a dataset conversion workflow aligned with RLDS standards, (iii) an initial fine-tuning and inference infrastructure for VLA models, and (iv) a structured set of experimental observations grounded in real-robot trials. These elements collectively establish a reproducible framework for evaluating learning-based manipulation systems beyond simulation. Empirically, the experiments reveal a consistent gap between promising offline indicators and unstable closed-loop behavior on the physical system. This gap cannot be attributed solely to model limitations; it is strongly influenced by action semantics, coordinate frame conventions, temporal alignment between modalities, image preprocessing consistency, and dataset coverage and quality. These observations lead to a key interpretation: the successful deployment of VLA systems in real-world settings depends less on incremental improvements in model capacity and more on precise control of the entire data-model-control pipeline. The project reframes VLA-based robotics from a primarily model-centric challenge to a system-level problem; it highlights the difficulty of achieving robust task execution on the real robot and provides a clear, experimentally grounded understanding of the conditions required for reliable deployment.
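To make two of the factors named in the abstract concrete (temporal alignment between modalities and action semantics), the following is a minimal illustrative sketch, not the authors' pipeline. It assumes camera frames and robot states arrive on separate clocks as sorted timestamp lists, that positions are Cartesian end-effector coordinates in the robot base frame, and that actions are stored as position deltas between consecutive matched frames. The names `nearest_index`, `align_and_delta`, and `max_skew` are hypothetical.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the entry in sorted `timestamps` closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer in time.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_and_delta(image_times, state_times, positions, max_skew=0.05):
    """Pair each camera frame with the nearest robot state, then build delta actions.

    Assumptions (illustrative only): all times are in seconds, `positions[j]`
    is an (x, y, z) tuple in the base frame recorded at `state_times[j]`, and
    frames with no state within `max_skew` seconds are dropped.
    """
    matched = []
    for t in image_times:
        j = nearest_index(state_times, t)
        if abs(state_times[j] - t) <= max_skew:
            matched.append((t, positions[j]))

    steps = []
    # The action for a frame is the displacement to the next matched frame.
    for (t0, p0), (_, p1) in zip(matched, matched[1:]):
        delta = tuple(b - a for a, b in zip(p0, p1))
        steps.append({"image_time": t0, "delta_action": delta})
    return steps


if __name__ == "__main__":
    image_times = [0.0, 0.1, 0.2, 0.5]
    state_times = [0.01, 0.09, 0.21, 0.30]
    positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.1), (0.3, 0.0, 0.1)]
    for step in align_and_delta(image_times, state_times, positions):
        print(step)
```

In this sketch the frame at 0.5 s is discarded because no robot state lies within the skew tolerance. Silently pairing it with a stale state would be one way for a dataset to look clean offline while encoding incorrect actions.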
