Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

2026-04-10

Computer Vision and Pattern Recognition, Robotics
AI summary

The authors developed a method to better estimate the shape and position of objects when a hand is covering most of them. Instead of using images alone, they use signals from the hand's position and touch points to infer where the object surface must be, even where it is hidden. They create a 3D shape model that accounts for the hand's pose and train a neural network on it using both vision and physical contact information. Their approach produces more accurate and physically plausible object shapes at real-world scale, and they tested it both in simulation and on a real robot hand.

amodal object reconstruction, pose estimation, hand occlusion, proprioception, multi-contact touch, signed distance field (SDF), Structure-VAE, diffusion model, physics-based objectives, robotic manipulation
Authors
Gabriele Mario Caddeo, Pasquale Marra, Lorenzo Natale
Abstract
We propose a multimodal, physically grounded approach for metric-scale amodal object reconstruction and pose estimation under severe hand occlusion. Unlike prior occlusion-aware 3D generation methods that rely only on vision, we leverage physical interaction signals: proprioception provides the posed hand geometry, and multi-contact touch constrains where the object surface must lie, reducing ambiguity in occluded regions. We represent object structure as a pose-aware, camera-aligned signed distance field (SDF) and learn a compact latent space with a Structure-VAE. In this latent space, we train a conditional flow-matching diffusion model, pretraining on vision-only images and finetuning on occluded manipulation scenes while conditioning on visible RGB evidence, occluder/visibility masks, the hand latent representation, and tactile information. Crucially, we incorporate physics-based objectives and differentiable decoder-guidance during finetuning and inference to reduce hand-object interpenetration and to align the reconstructed surface with contact observations. Because our method produces a metric, physically consistent structure estimate, it integrates naturally into existing two-stage reconstruction pipelines, where a downstream module refines geometry and predicts appearance. Experiments in simulation show that adding proprioception and touch substantially improves completion under occlusion and yields physically plausible reconstructions at correct real-world scale compared to vision-only baselines; we further validate transfer by deploying the model on a real humanoid robot with an end-effector different from those used during training.
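To make the two physics-based objectives in the abstract concrete, the following is a minimal sketch of how an interpenetration penalty and a contact-alignment penalty could be computed from an SDF. It is an illustration under stated assumptions, not the authors' formulation: the function names (`sphere_sdf`, `penetration_loss`, `contact_loss`), the analytic sphere standing in for the decoded object SDF, the sign convention (negative inside the object), and the sample points are all hypothetical.

```python
import numpy as np


def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside.

    Stand-in for a decoded object SDF (assumption for illustration only).
    """
    return np.linalg.norm(points - center, axis=-1) - radius


def penetration_loss(sdf_at_hand_points):
    """Penalize hand surface points that fall inside the object (SDF < 0)."""
    return float(np.mean(np.maximum(-sdf_at_hand_points, 0.0)))


def contact_loss(sdf_at_contact_points):
    """Penalize contact points that are off the object surface (SDF != 0)."""
    return float(np.mean(np.abs(sdf_at_contact_points)))


# Hypothetical object: a sphere of radius 5 cm at the origin (metric units).
center = np.array([0.0, 0.0, 0.0])
radius = 0.05

# Hypothetical hand surface samples (from proprioception) and touch contacts.
hand_points = np.array([
    [0.06, 0.0, 0.0],   # 1 cm outside the object
    [0.0, 0.04, 0.0],   # 1 cm inside the object (interpenetration)
    [0.0, 0.0, 0.10],   # well outside the object
])
contact_points = np.array([
    [0.05, 0.0, 0.0],   # exactly on the surface
    [0.0, 0.0, 0.07],   # 2 cm off the surface
])

pen = penetration_loss(sphere_sdf(hand_points, center, radius))
con = contact_loss(sphere_sdf(contact_points, center, radius))
print(f"penetration loss: {pen:.4f} m, contact loss: {con:.4f} m")
```

In a learned setting both terms would be evaluated on the decoder's SDF output so that gradients can flow back to the latent, which is one plausible reading of "differentiable decoder-guidance"; the sketch above only shows the penalty values themselves.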