VT-WAM: Visual-Tactile World Action Model for Contact-Rich Manipulation

2026-07-02 • Robotics


AI summary

The authors developed VT-WAM, a system that helps robots handle contact-rich tasks, where cues such as pressure and slip are hard to see in camera images. The method combines visual and tactile information to predict both what will happen next and which actions to take. It also introduces attention mechanisms that push the robot to rely on touch signals while it is actually in contact with objects. In real-world tests on six contact-rich tasks, the system outperformed previous methods. Ablations indicate that modeling how tactile deformation changes over time, and guiding attention toward touch during contact, are both important for success.

Contact-rich manipulation, Tactile deformation, Visual-tactile policy, Flow matching, Transformers, Action prediction, Attention mechanism, Slip detection, Robotic manipulation

Authors

Shuai Tian, Yupeng Zheng, Yuhang Zheng, Songen Gu, Yujie Zang, Yuxing Qin, Weize Li, Haoran Li, Wenchao Ding, Dongbin Zhao

Abstract

Contact-rich manipulation requires policies to react to local deformation, pressure, slip, and friction, yet these cues are temporally sparse and often invisible in visual observations. Existing visual-tactile policies usually feed tactile observations directly into action prediction, but rarely model tactile deformation dynamics during action generation. In this paper, we introduce VT-WAM, a Visual-Tactile World Action Model that jointly learns future visual prediction, tactile deformation prediction, and action prediction within a unified flow matching framework. In particular, VT-WAM introduces (1) Asymmetric Mixture-of-Transformers (MoT) attention to bridge a first-frame visual anchor with temporal tactile dynamics, and (2) contact-gated Action-Visual-Tactile Attention Guidance (AVTAG) to encourage action queries to rely on tactile evidence during contact phases. Across six real-world contact-rich manipulation tasks, VT-WAM achieves a 71.67% average success rate, outperforming Fast-WAM by 26.67 percentage points and OmniVTLA by 35.84 percentage points. Ablations demonstrate that modeling tactile deformation dynamics and guiding contact-phase tactile attention are both important for contact-rich tasks. Project website: https://vt-wam.github.io/.
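
To make the phrase "unified flow matching framework" concrete, the following is a minimal sketch of a joint flow matching objective over three streams (visual, tactile, action). It is not the authors' implementation: it assumes the common straight-line (rectified-flow style) probability path, a single time value shared across streams, and a simple weighted sum of per-stream losses. The names `interpolate`, `target_velocity`, `fm_loss`, `joint_fm_loss`, and the stream keys are hypothetical, and the flat Python lists stand in for whatever latent representations the model actually uses.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path from x0 to x1; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def fm_loss(predict_velocity, x1, t, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity is any callable (x_t, t) -> list of floats; in a real
    system it would be a neural network (assumption for illustration).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def joint_fm_loss(predictors, targets, weights, rng):
    """Weighted sum of per-stream flow matching losses.

    Assumption: all streams share one time t per sample. The weights are
    hypothetical hyperparameters, not values from the paper.
    """
    t = rng.random()
    return sum(
        weights[name] * fm_loss(predictors[name], targets[name], t, rng)
        for name in targets
    )
```

As a usage sketch, `targets` could map `"visual"`, `"tactile"`, and `"action"` to toy vectors, with `predictors` mapping the same keys to placeholder callables such as `lambda x_t, t: [0.0] * len(x_t)`. Training a single model on the summed loss is what would let tactile deformation prediction shape the same representation that action prediction uses, which is the coupling the abstract describes at a high level.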
