LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

2026-05-11

Artificial Intelligence · Computer Vision and Pattern Recognition · Robotics
AI summary

The authors found that current vision-language-action models use very deep processing that might be too complex for simple robotic movements requiring precise adjustments. They created LoopVLA, a model that repeatedly improves its understanding step-by-step and decides on its own if it needs more refining before choosing an action. LoopVLA learns when it has enough information without being directly told, by comparing its confidence to how well its actions actually worked. Tests show LoopVLA saves computation and runs faster while maintaining or improving task success.

Keywords
Vision-Language-Action (VLA) models · robotic manipulation · Transformer · representation refinement · self-supervised learning · early-exit strategies · policy optimization · confidence estimation · multimodal tokens · inference efficiency
Authors
Boyang Shen, Kaixiang Yang, Hao Wang, Qiuyu Yu, Qiang Xie, Qiang Li, Zhiwei Wang
Abstract
Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a candidate action and a sufficiency score that estimates whether further refinement is necessary. By sharing parameters across iterations, LoopVLA decouples refinement from absolute layer indices and grounds sufficiency estimation in the evolving representation itself. Since sufficiency has no direct supervision, we introduce a self-supervised distribution alignment objective, where intermediate confidence scores are trained to match the relative action quality across refinement steps, thereby linking sufficiency learning to policy optimization signals. Experiments on LIBERO, LIBERO-Plus, and VLA-Arena show that LoopVLA pushes the efficiency-performance frontier of VLA policies, reducing parameters by 45% and improving inference throughput by up to 1.7 times while matching or outperforming strong baselines in task success.
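The control flow described in the abstract — iteratively applying a shared block, emitting a candidate action and a sufficiency score at each step, and exiting once the score indicates the representation is sufficient — can be sketched in plain Python. Everything here (function names, the threshold, the toy heads, the softmax form of the alignment targets) is a hypothetical illustration under stated assumptions, not the authors' implementation.

```python
import math

def refine_with_sufficiency(tokens, block, action_head, sufficiency_head,
                            max_iters=8, threshold=0.9):
    """Recurrent refinement loop (sketch): apply one shared-weight block
    repeatedly; at each iteration produce a candidate action and a
    sufficiency score, and stop early once the score clears a threshold."""
    action = None
    for step in range(1, max_iters + 1):
        tokens = block(tokens)            # same block every iteration (shared parameters)
        action = action_head(tokens)      # candidate action from current representation
        score = sufficiency_head(tokens)  # estimated sufficiency in [0, 1]
        if score >= threshold:            # representation deemed sufficient: exit
            return action, step
    return action, max_iters              # budget exhausted: use last candidate

def alignment_targets(step_errors, temperature=1.0):
    """Self-supervised targets (sketch): map per-step action errors to a
    distribution so that steps whose candidate actions were better receive
    higher target sufficiency; the per-step confidence scores would be
    trained to match this distribution."""
    logits = [-e / temperature for e in step_errors]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

As a toy instantiation, a contractive scalar "block" converging toward a fixed point, with a sufficiency head that rises as the representation stabilises, exits after a few iterations instead of always running the full budget — the efficiency mechanism the abstract describes:

```python
action, steps = refine_with_sufficiency(
    0.0,                                   # initial scalar "tokens"
    lambda t: 0.5 * t + 1.0,               # converges toward 2.0
    lambda t: t,                           # trivial action head
    lambda t: 1.0 - abs(2.0 - t) / 2.0,    # closer to fixed point => more sufficient
    max_iters=10, threshold=0.95)
```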