AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors

2026-03-18 • Computer Vision and Pattern Recognition

AI summary

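The authors developed AHOY, a method that reconstructs complete, animatable 3D avatars of people from ordinary single-camera videos, even when parts of the person are hidden. To cover body regions that are never observed, they fine-tune diffusion models on the person's identity and use the generated images as supervision. A two-stage design first builds a canonical model from the sparse observations and then learns how appearance changes with body pose, while separating the pose that drives the Gaussian maps from the pose used for skinning absorbs inconsistencies in the generated images. Supervising the head and body separately keeps the face faithful to the original person. They evaluated the method on real YouTube videos and on multi-view capture data with significant occlusion, and report state-of-the-art reconstruction quality. The resulting avatars can be animated with new poses and composited into other 3D scenes.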

3D reconstruction, Gaussian avatars, monocular video, occlusion handling, diffusion models, pose estimation, linear blend skinning, identity preservation, animatable avatars, multi-view inconsistency

Authors

Aymen Mir, Riza Alp Guler, Xiangjun Tang, Peter Wonka, Gerard Pons-Moll

Abstract

We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), which excludes the vast majority of real-world footage, where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; and (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at https://miraymen.github.io/ahoy/
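
For readers unfamiliar with the "LBS pose" mentioned in contribution (iii), the following is a minimal sketch of standard linear blend skinning applied to a single point, such as a Gaussian center. It is a generic illustration and not the authors' implementation: the names `apply_rigid` and `lbs_point`, the 3x4 `[R | t]` transform layout, and the two-joint example are all assumptions made here for clarity.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat34 = Sequence[Sequence[float]]  # assumed layout: 3x4 rigid transform [R | t]


def apply_rigid(transform: Mat34, point: Vec3) -> List[float]:
    """Apply a 3x4 rigid transform [R | t] to a 3D point."""
    return [
        sum(transform[r][c] * point[c] for c in range(3)) + transform[r][3]
        for r in range(3)
    ]


def lbs_point(
    point: Vec3, weights: Sequence[float], transforms: Sequence[Mat34]
) -> List[float]:
    """Standard linear blend skinning for one canonical-space point.

    The posed point is the weighted sum of the point transformed by each
    joint: x' = sum_k w_k * (R_k x + t_k). Weights are normalized to sum to 1.
    """
    if len(weights) != len(transforms):
        raise ValueError("need exactly one weight per joint transform")
    total = sum(weights)
    if total <= 0:
        raise ValueError("skinning weights must have a positive sum")

    posed = [0.0, 0.0, 0.0]
    for weight, transform in zip(weights, transforms):
        moved = apply_rigid(transform, point)
        for axis in range(3):
            posed[axis] += (weight / total) * moved[axis]
    return posed


# Hypothetical two-joint example: one joint stays put, the other shifts by +2 in y.
IDENTITY = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
SHIFT_Y = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 2.0], [0.0, 0.0, 1.0, 0.0]]

# A point influenced equally by both joints moves halfway between them.
posed_center = lbs_point([1.0, 0.0, 0.0], [0.5, 0.5], [IDENTITY, SHIFT_Y])
```

In a skinned Gaussian avatar, a step of this kind would typically be applied to every Gaussian center, with per-joint transforms derived from the body pose. How AHOY separates the pose that conditions the Gaussian maps from the pose used for skinning is not specified in the abstract and is not modeled here.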
