CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

2026-06-08 • Robotics

Robotics, Artificial Intelligence

AI summary

The authors introduce CT-VAM, a robot control model inspired by the brain's cerebello-thalamic circuitry that uses vision and task information to guide actions efficiently. Rather than reprocessing raw language at every control step, the model uses language only to specify the task, then executes actions locally from compact visual and proprioceptive inputs. A stream-separated attention decoder (TARS) keeps the compact task condition from being swamped by dense visual tokens. With 68M parameters, the model reaches LIBERO success rates competitive with much larger models at lower inference latency, and it runs on robots with limited computing power.

vision-language-action models, visuomotor control, cerebello-thalamic model, conditional attention decoder, proprioception, robot manipulation, inference latency, asynchronous chunk execution, cloud-edge computing, flow-consistent inpainting

Authors

Jiacheng Li, Yize Guo, Jiabin Guo, Qingchen Liu, Jiahu Qin

Abstract

Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dual-view visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual, and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust real-world deployment on resource-constrained robotic platforms.
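
The abstract's concern that dense sensory tokens can overwhelm a compact task condition can be illustrated with a toy example. The sketch below is not the TARS decoder: the function names (`joint_attention`, `stream_separated_attention`), the single-head attention, and the choice to sum the per-stream outputs are all assumptions made for illustration. It only contrasts one softmax over concatenated visual and task tokens with a separate softmax per stream.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


def joint_attention(query, visual, task):
    """Baseline: one softmax shared by visual and task tokens."""
    tokens = visual + task
    out, weights = attend(query, tokens, tokens)
    task_mass = sum(weights[len(visual):])  # attention mass left for the task
    return out, task_mass


def stream_separated_attention(query, visual, task):
    """Hypothetical separation: each stream gets its own softmax.

    Summing the two stream outputs is an assumption of this sketch,
    not a detail taken from the paper.
    """
    vis_out, _ = attend(query, visual, visual)
    task_out, task_weights = attend(query, task, task)
    out = [a + b for a, b in zip(vis_out, task_out)]
    return out, sum(task_weights)


# Toy inputs: one action query, 256 visual tokens, one task token,
# all chosen so that every token receives the same attention score.
query = [1.0, 0.0, 0.0, 0.0]
visual = [[0.0, 1.0, 0.0, 0.0] for _ in range(256)]
task = [[0.0, 0.0, 1.0, 0.0]]

_, joint_mass = joint_attention(query, visual, task)
_, separated_mass = stream_separated_attention(query, visual, task)
print("task attention mass, joint softmax:", joint_mass)
print("task attention mass, separate streams:", separated_mass)
```

With equal scores, the single task token receives 1/257 of the attention mass under the joint softmax, whereas a per-stream softmax normalizes the task stream on its own, so its contribution does not shrink as the number of visual tokens grows.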
