UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models

2026-06-22

Robotics
AI summary

The authors study vision-language-action models, which typically pair a fast-updating action module with a slower-updating vision-language model. These designs face a trade-off: updating the vision-language model rarely leaves the action module working from stale context, while updating it often erodes the efficiency gain. They propose UniFS, which updates groups of vision-language layers at progressively lower frequencies, so shallow layers track fast changes while deeper layers hold stable context. UniFS also re-routes how these multi-scale features interact with the action module and supervises training at multiple levels, from coarse plans to fine details. On the LIBERO benchmark, UniFS reports a 98.3% average success rate while reducing average inference latency from 36.5 ms to 17.8 ms, and experiments on a real Franka robot support its practical use. The authors also release their code.

vision-language models, dual system models, fast-slow processing, semantic drift, multi-timescale neural processing, latent vector inversion, multi-level supervision, action expert, inference latency, robotic manipulation
Authors
Lin Sun, Zhiwei Guan, Conglin Wang, Zihong Chen, Jianhai Yu, Zongsheng Li, Boyong He, Tao Sun, Jiale Cao, Lige Liu
Abstract
Mainstream Fast-Slow dual system vision-language-action models decouple a high-frequency action expert from a low-frequency vision-language model for efficiency, yet they face a fundamental frequency dilemma: large update gaps cause semantic drift from stale context, while small gaps erode the intended computational savings. Moreover, because the action expert receives only the VLM's final-layer representation at a single fixed frequency, rich intermediate features are discarded, limiting both information coupling and manipulation precision. Inspired by multi-timescale neural processing in the human brain, we introduce UniFS, a unified fast-to-slow architecture that resolves these challenges through three key designs. First, we stratify the VLM layers into groups with progressively decreasing update frequencies, enabling shallow layers to capture fast-changing dynamics while deeper layers cache stable semantic context. Second, a latent vector inversion mechanism re-routes the interaction order between multi-scale VLM features and the action expert, aligning fast-varying representations with fine-grained action decoding and slow-varying ones with coarse planning. Third, a multi-level supervision strategy enforces a coarse-to-fine learning hierarchy across temporal scales. Together, these designs enable richer cross-frequency information transfer within a single backbone, while the low-frequency pathways additionally preserve temporal context across steps. Experiments on LIBERO show that UniFS achieves state-of-the-art performance (98.3% average success rate, a 2.5% gain over the VLA-Adapter baseline) while reducing average inference latency from 36.5 ms to 17.8 ms (a 2.1× speedup). Real-robot experiments on a Franka platform further validate its practical applicability. Code is open-sourced at https://github.com/linsun449/UniFS.
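The first design, layer groups refreshed at progressively lower frequencies with cached outputs in between, can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not the authors' implementation (which is in the linked repository): the class name MultiRateStack, the update periods 1, 2, and 4, and the scalar "layer groups" are all hypothetical, since the abstract does not state the actual frequencies or layer grouping.

```python
from typing import Callable, List, Optional


class MultiRateStack:
    """Toy stack of layer groups, each refreshed at its own period.

    Hypothetical illustration only: group i is recomputed every
    periods[i] control steps and otherwise reuses its cached output.
    """

    def __init__(self, groups: List[Callable[[float], float]], periods: List[int]):
        if len(groups) != len(periods):
            raise ValueError("one update period per layer group is required")
        if any(p < 1 for p in periods):
            raise ValueError("update periods must be positive integers")
        self.groups = groups
        self.periods = periods
        self.cache: List[Optional[float]] = [None] * len(groups)
        self.refresh_counts = [0] * len(groups)

    def step(self, t: int, observation: float) -> List[float]:
        """Run one control step and return the per-group features."""
        x = observation
        for i, (group, period) in enumerate(zip(self.groups, self.periods)):
            # Recompute on the group's schedule (or on first use);
            # otherwise keep the cached, slower-varying feature.
            if self.cache[i] is None or t % period == 0:
                self.cache[i] = group(x)
                self.refresh_counts[i] += 1
            x = self.cache[i]
        return list(self.cache)


# Assumed toy groups: shallow (every step), middle (every 2), deep (every 4).
stack = MultiRateStack(
    groups=[lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0],
    periods=[1, 2, 4],
)

features: List[float] = []
for t in range(8):
    features = stack.step(t, observation=float(t))

print(stack.refresh_counts)
print(features)
```

Over these eight steps the shallow, middle, and deep groups are recomputed 8, 4, and 2 times respectively, which is where the compute saving in such a scheme would come from. All per-group features are returned rather than only the deepest one, loosely mirroring the abstract's point that intermediate representations remain available to the action expert.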