Rethinking VLM Representation for VLA Initialization

2026-05-25 · Computer Vision and Pattern Recognition

AI summary

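The authors study how best to initialize Vision-Language-Action (VLA) models, which let robots perceive and act, from pretrained vision-language models (VLMs). They find that the original pretrained VLM representation is a key source of action performance. Additional embodied visual question answering (VQA) training does not help uniformly: its benefit depends on where the downstream task is bottlenecked, and gains from different capability domains do not simply add up. Updating the model with LoRA, a low-rank adaptation method that keeps the pretrained weights frozen, gives a more reliable starting point than full fine-tuning, which can reshape the pretrained representation too much. Pretraining on robot data improves results further, with staged LoRA-based training performing best. Overall, adapting VLMs for robots works best when it adds embodied and robot-trajectory signals while preserving the pretrained representation.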

Vision-Language Models (VLMs), Vision-Language-Action (VLA), embodied VQA, parameter update strategy, LoRA, robot-data pretraining, fine-tuning, robot trajectory, representation learning, policy initialization
Authors
Weifeng Lin, Siyuan Huang, Hao Li, Tingwei Chen, Ruichuan An, Xinyu Wei, Jianbo Liu, Hongsheng Li
Abstract
Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.
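
The contrast between LoRA and Full Finetune in the abstract can be made concrete with a minimal sketch of a single LoRA-adapted linear layer. It is an illustration of the standard LoRA formulation, y = Wx + (alpha / r) · B(Ax) with W frozen and B initialized to zero, and not the authors' implementation. The class name `LoRALinear`, the helper functions, the layer size, the rank, and the scaling value are all assumptions chosen for the example; the abstract does not specify them.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Toy linear layer y = W x + (alpha / r) * B (A x), with W frozen.

    Hypothetical illustration only; names and sizes are assumptions.
    """

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # pretrained weight: kept fixed, never updated
        # A gets a small random init and B starts at zero, so the
        # adapter contributes nothing before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # pretrained path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank update path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A and B would receive gradient updates under LoRA.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def full_finetune_params(W):
    """Full fine-tuning would update every entry of W."""
    return len(W) * len(W[0])


# Toy sizes (assumed for illustration).
rng = random.Random(1)
d_out, d_in, rank = 64, 64, 4
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
layer = LoRALinear(W, rank=rank)
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

same_at_init = layer.forward(x) == matvec(W, x)
print(same_at_init)
print(layer.trainable_params(), full_finetune_params(W))
```

At initialization the adapted layer reproduces the pretrained layer's output exactly, and in this toy setting LoRA trains 512 parameters where full fine-tuning would update all 4096. That is one way to read the abstract's point that LoRA reshapes the pretrained representation less than Full Finetune, though the sketch says nothing about the downstream action performance the paper measures.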