PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

2026-04-22 · Robotics
AI summary

The authors developed PokeVLA, a compact and efficient model that helps robots understand images and language so they can physically interact with objects more reliably. They trained it in two steps: first, teaching it spatial relationships, object affordances, and reasoning from a large dataset; second, connecting this knowledge to actions with dedicated training techniques. Their approach outperformed other models on a robot manipulation benchmark and worked well in real-world tests. They will also share their code and data to help others build on their work.

Vision-Language-Action models, Embodied manipulation, Pre-training, Spatial grounding, Affordance, Goal-aware semantics, Geometry alignment, Action learning, LIBERO-Plus benchmark, Robotics
Authors
Yupeng Zheng, Xiang Li, Songen Gu, Yuhang Zheng, Shuai Tian, Weize Li, Linbo Wang, Senyu Fei, Pengfei Li, Yinfeng Gao, Zebin Xing, Yilun Chen, Qichao Zhang, Haoran Li, Wenchao Ding
Abstract
Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA
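The two-stage paradigm described above can be sketched in pseudocode. All class, method, and dataset names below are illustrative assumptions made for exposition; the paper's actual implementation is not reproduced here.

```python
# Hypothetical sketch of PokeVLA's two-stage training paradigm, as
# described in the abstract. Names (PokeVLM, pretrain, train_action,
# the objective labels) are assumptions, not the authors' API.
from dataclasses import dataclass, field

@dataclass
class PokeVLM:
    """Stage 1: compact vision-language model pre-trained on a curated
    multimodal dataset covering spatial grounding, affordance, and
    embodied reasoning tasks (2.4M samples in the paper)."""
    seen_tasks: list = field(default_factory=list)

    def pretrain(self, batch: dict) -> None:
        # In practice: multimodal supervised objectives; here we only
        # record which task family the batch came from.
        self.seen_tasks.append(batch["task"])

@dataclass
class PokeVLA:
    """Stage 2: inject manipulation-relevant representations into the
    action space via multi-view goal-aware semantics learning, geometry
    alignment, and an action expert head."""
    vlm: PokeVLM
    objectives: tuple = (
        "goal_aware_semantics", "geometry_alignment", "action_expert",
    )

    def train_action(self, batch: dict) -> dict:
        # In practice: each objective contributes a loss term; here we
        # just map each objective to the observation it would consume.
        return {obj: batch["obs"] for obj in self.objectives}

# Run the two stages in order.
vlm = PokeVLM()
for task in ("spatial_grounding", "affordance", "embodied_reasoning"):
    vlm.pretrain({"task": task})

vla = PokeVLA(vlm=vlm)
losses = vla.train_action({"obs": "multi_view_images"})
```

This skeleton only captures the control flow (vision-language pre-training, then action-space injection), not the model architecture or loss functions, which the authors plan to release with their code.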