SA-VLA: State-aware tokenizer for improving Vision-Language-Action Models' performance

2026-06-29, Robotics

Robotics, Artificial Intelligence
AI summary

The authors introduce SA-VLA, a state-aware action tokenizer that lets the robot's current state affect how discrete action tokens are decoded back into continuous movements. Previous tokenizers map each token to one fixed action; SA-VLA instead adjusts the decoded action according to the robot's proprioceptive state, which matters in manipulation, where the same token may need different controls under different joint configurations, object poses, and contact conditions. On 12 RoboTwin manipulation tasks and three zero-shot sim-to-real tasks, it achieves higher average success rates than the strongest tokenizer baseline. The approach makes decoded actions more accurate while requiring only minimal changes to the model interface.

Discrete action tokenization, Autoregressive policies, Proprioceptive state, Vector Quantization (VQ), Cross-attention, State-conditioned decoding, Large Language Models (LLM), Sim-to-real transfer, Robotic manipulation, Action discretization
Authors
Tengyue Jiang, Chunpu Xu, Jiayue Kang, Yao Mu
Abstract
Discrete action tokenization provides a compact interface for autoregressive VLA policies, but accurately recovering continuous robot actions from discrete codes remains challenging. Existing tokenizers typically map each discrete code to a fixed continuous action prototype, ignoring the robot's current proprioceptive state. This limitation is particularly pronounced in manipulation, where the same action token may require different continuous controls under different joint configurations, object poses, and contact conditions. We therefore propose SA-VLA, a state-aware action tokenizer that conditions action decoding on robot state. We study two state-injection mechanisms for VQ-based action tokenization: cross-attention between state and action features, and a lightweight state adapter that predicts action-wise modulation factors for state-conditioned action modulation and reconstruction. The adapter formulation expands the effective support of a finite codebook by allowing each discrete token to represent a family of state-dependent continuous actions, while preserving the efficiency and compatibility of discrete action modeling. Integrated into an LLM-based VLA policy, SA-VLA supports both autoregressive and parallel action-token decoding with minimal changes to the model interface. On 12 RoboTwin manipulation tasks, SA-VLA improves the average success rate from 0.29 to 0.56 over the strongest tokenizer baseline. In zero-shot sim-to-real experiments on three real-world tasks, it further improves average success from 0.15 to 0.33 over the strongest tokenizer baseline. These results demonstrate that state-conditioned action decoding is a simple and effective mechanism for reducing the compression gap in discrete VLA policies.
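The contrast between fixed-prototype decoding and state-conditioned decoding can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract says only that a lightweight adapter predicts action-wise modulation factors, so the scale-and-shift form used here, the linear adapter, the random weights, and all names and dimensions (`W_scale`, `W_shift`, `ACTION_DIM`, `STATE_DIM`, `CODEBOOK_SIZE`) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
ACTION_DIM = 7      # e.g. one command per joint
STATE_DIM = 7       # e.g. one proprioceptive reading per joint
CODEBOOK_SIZE = 16  # number of discrete action tokens

# Fixed VQ codebook: one continuous action prototype per discrete token.
codebook = rng.normal(size=(CODEBOOK_SIZE, ACTION_DIM))

# Assumed linear adapter weights; in a real system these would be learned.
W_scale = 0.1 * rng.normal(size=(STATE_DIM, ACTION_DIM))
W_shift = 0.1 * rng.normal(size=(STATE_DIM, ACTION_DIM))


def encode(action):
    """Standard VQ lookup: index of the nearest codebook prototype."""
    distances = np.linalg.norm(codebook - action, axis=1)
    return int(np.argmin(distances))


def decode_fixed(token):
    """State-agnostic decoding: each token maps to exactly one action."""
    return codebook[token]


def decode_state_aware(token, state):
    """State-conditioned decoding (assumed scale-and-shift form).

    The adapter predicts one scale and one shift per action dimension,
    so a single token covers a family of state-dependent actions.
    """
    scale = 1.0 + np.tanh(state @ W_scale)  # bounded in (0, 2)
    shift = state @ W_shift
    return scale * codebook[token] + shift


# The same token decoded under two different robot states.
token = encode(rng.normal(size=ACTION_DIM))
state_a = rng.normal(size=STATE_DIM)
state_b = rng.normal(size=STATE_DIM)

fixed_a = decode_fixed(token)
fixed_b = decode_fixed(token)
aware_a = decode_state_aware(token, state_a)
aware_b = decode_state_aware(token, state_b)
```

In this sketch the fixed decoder returns the same action for `state_a` and `state_b`, while the state-aware decoder returns two different actions from the same token and the same codebook. With an all-zero state the modulation reduces to the identity, so the state-aware decoder falls back to the plain prototype.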