VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

2026-03-24 · Robotics

Robotics · Artificial Intelligence · Computer Vision and Pattern Recognition · Machine Learning
AI summary

The authors studied how robots can better understand and perform tasks that require touch, not just vision. They created a model called VTAM that combines video and tactile data to improve how robots predict and control their actions during complex, contact-rich tasks. By training the model to balance information from sight and touch, they showed it performs substantially better on demanding tasks such as picking up delicate objects. This suggests that adding tactile sensing helps robots act more accurately in the real world.

Video-Action Models · Tactile Perception · Multimodal Learning · Embodied Intelligence · Cross-modal Fusion · Force Modulation · World Modeling · Transformer Models · Manipulation Tasks · Contact-rich Scenarios
Authors
Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou
Abstract
Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world-modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latents from dominating the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust average success rate of 90 percent. In challenging scenarios such as potato-chip pick-and-place, which requires high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable path toward physically grounded embodied foundation models.
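The abstract describes a tactile regularization loss that enforces balanced cross-modal attention but does not give its exact form. The sketch below is one plausible formulation, assuming the action model's queries attend over a concatenation of visual and tactile tokens and that per-head attention weights are available; the function name `tactile_balance_loss` and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch


def tactile_balance_loss(attn_weights: torch.Tensor,
                         num_visual_tokens: int) -> torch.Tensor:
    """Penalize imbalance between attention mass on visual vs. tactile tokens.

    attn_weights: (batch, heads, queries, tokens) softmax attention from action
        queries over the concatenated [visual | tactile] token sequence.
    num_visual_tokens: number of leading tokens that are visual; the remaining
        tokens are assumed to be tactile.
    """
    # Total attention mass each query places on each modality.
    vis_mass = attn_weights[..., :num_visual_tokens].sum(dim=-1)  # (B, H, Q)
    tac_mass = attn_weights[..., num_visual_tokens:].sum(dim=-1)  # (B, H, Q)

    # Encourage comparable attention on both modalities so visual latents
    # do not dominate the fused representation.
    return (vis_mass - tac_mass).pow(2).mean()


if __name__ == "__main__":
    # Dummy shapes: batch=2, heads=8, action queries=16, 256 visual + 64 tactile tokens.
    attn = torch.rand(2, 8, 16, 320).softmax(dim=-1)
    loss = tactile_balance_loss(attn, num_visual_tokens=256)
    print(loss.item())
```

In practice such a term would be added to the main action-prediction objective with a small weighting coefficient; the exact weighting and where the attention maps are taken from (which layers or heads) would depend on the paper's architecture.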