Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision

2026-06-29 • Robotics

Robotics, Computer Vision and Pattern Recognition

AI summary

The authors address the difficulty of sharing learned skills across robots with different bodies and action spaces. They observe that high-level reasoning, such as recognizing objects and planning tasks, is largely the same across robots. Building on this, they present ZR-0, a model trained on reasoning steps and actions together, so that it works across many robot types without generating the reasoning steps at inference time. They pre-trained it on a dataset of roughly 60 million frames and report strong results on three simulation benchmarks and on a real xArm robot arm. The approach is aimed at improving knowledge transfer across robot embodiments.

vision-language-action (VLA) models, cross-embodiment transfer, Embodied Chain-of-Thought (ECoT), dual-stream architecture, diffusion transformer, flow matching, ProcCorpus-60M dataset, robot manipulation, cross-attention, pre-trained vision-language models

Authors

Haoyang Li, Guanlin Li, Youhe Feng, Chen Zhao, Zhuoran Wang, Yang Li, Qizhe Wei, Shifeng Bao, Haitao Shen, Yihan Zhao, Tong Yang, Jing Zhang

Abstract

Cross-embodiment transfer in vision-language-action (VLA) models remains challenging because low-level state and action spaces differ fundamentally across robot platforms. We observe that the high-level cognitive process underlying manipulation, including scene perception, object identification, task planning, and sub-task decomposition, is largely shared across embodiments. Based on this observation, we present ZR-0, a 2.6-billion-parameter end-to-end VLA model that uses dense Embodied Chain-of-Thought (ECoT) supervision to align cross-embodiment representations within the vision-language model (VLM). ZR-0 adopts a dual-stream architecture: a pre-trained VLM (System 2) generates structured ECoT reasoning during training, while a Diffusion Transformer-based action expert (System 1) produces continuous action chunks via flow matching. The two components are coupled through cross-attention, with an attention mask that restricts the action expert to input prompt features only, so that ECoT generation can be skipped entirely at inference without any performance loss. ZR-0 is pre-trained on ProcCorpus-60M, a large-scale dataset comprising approximately 60 million frames (about 1,000 hours) from over 400K trajectories, with dense ECoT annotations covering 96.8% of all frames. We evaluate ZR-0 on three simulation benchmarks spanning single-arm (LIBERO), bimanual (RoboTwin 2.0), and humanoid (RoboCasa GR-1 Tabletop) embodiments, as well as in real-world experiments on the xArm platform, demonstrating strong performance across all settings. Code and model checkpoints are available at https://github.com/RUCKBReasoning/ZR-0.
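
The masking idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the ZR-0 implementation: the function name `masked_cross_attention`, the single-head layout, the token counts, and the random features are all hypothetical. It also assumes that the prompt features themselves do not depend on the ECoT tokens that follow them (for example, because the VLM uses causal attention), which the abstract does not state explicitly. Under those assumptions, action queries that are masked to see only prompt features produce the same output whether or not ECoT tokens are present.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, key_mask):
    """Single-head scaled dot-product cross-attention with a boolean key mask.

    key_mask[j] is True if key j may be attended to. At least one key must
    be visible, otherwise the softmax is undefined.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Hidden keys get -inf so their softmax weight is exactly zero.
    scores = np.where(key_mask[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(0)
n_prompt, n_ecot, n_action, d = 6, 10, 4, 8   # hypothetical sizes

prompt_feats = rng.normal(size=(n_prompt, d))    # stand-in for VLM prompt features
ecot_feats = rng.normal(size=(n_ecot, d))        # stand-in for ECoT token features
action_queries = rng.normal(size=(n_action, d))  # stand-in for action-expert queries

# Training-like setting: ECoT tokens exist in the VLM sequence but are masked
# out for the action expert.
vlm_feats = np.concatenate([prompt_feats, ecot_feats])
mask = np.concatenate([np.ones(n_prompt, dtype=bool), np.zeros(n_ecot, dtype=bool)])
out_train = masked_cross_attention(action_queries, vlm_feats, vlm_feats, mask)

# Inference-like setting: ECoT is never generated, so only prompt features exist.
out_infer = masked_cross_attention(
    action_queries, prompt_feats, prompt_feats, np.ones(n_prompt, dtype=bool)
)

same = np.allclose(out_train, out_infer)
print(same)
```

Because the masked keys receive zero attention weight, the two outputs agree up to floating-point rounding, which is the property that lets reasoning generation be dropped at inference in this toy setting.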
