EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

2026-06-08 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence

AI summary

The authors created a new dataset called EgoTactile that pairs first-person video with full-hand pressure data to help understand how hands touch objects. They developed two methods: EgoPressureFormer, a discriminative baseline, and EgoPressureDiff, a more advanced model that uses conditional diffusion and a Physically-Informed Feature Rectification layer to better predict where the hand applies pressure, even when the video does not show the contact clearly. Their work estimates hand grasp pressure without extra hardware and transfers to different objects and real-world scenes. The authors report that their method outperforms previous ones on this benchmark.

egocentric video, grasp pressure estimation, tactile sensing, diffusion models, video diffusion backbone, feature rectification, semantic constraints, robotic manipulation, immersive virtual reality, visual-physical ambiguity

Authors

Yuan Zeng, Yujia Shi, Tiao Tan, Xingting Li, Yaqi Qin, Zongqing Lu, Wenming Yang, Jing-Hao Xue, Qingmin Liao

Abstract

Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively infers plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is available at https://egotactile.github.io/.
