Face Anything: 4D Face Reconstruction from Any Image Sequence
2026-04-21 • Computer Vision and Pattern Recognition
AI summary
The authors developed a new way to create detailed 3D models of moving human faces from video. They assign each facial pixel a coordinate in a shared canonical 3D space, making it easier to track changes in expression and viewpoint over time. A single model predicts both the depth and the canonical position of every facial point, leading to more accurate and stable 3D reconstructions. Tests show the approach is both faster and more precise than previous methods, making the capture of dynamic facial geometry more reliable and efficient.
3D facial reconstruction · canonical space · depth estimation · transformer model · non-rigid deformation · multi-view geometry · dynamic tracking · feed-forward model · correspondence estimation · temporal consistency
Authors
Umut Kocasari, Simon Giebenhain, Richard Shaw, Matthias Nießner
Abstract
Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method delivers accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained on multi-view geometry data that is non-rigidly warped into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3$\times$ lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.
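To make the representation concrete, below is a minimal PyTorch sketch of the core idea, not the authors' implementation: a head that predicts a per-pixel depth value and a per-pixel 3D coordinate in a shared canonical space, plus a function showing how dense tracking then reduces to nearest-neighbor matching of canonical coordinates. The module `CanonicalPointHead`, the helper `track_points`, the convolutional head design, and the `tanh` normalization of canonical coordinates are all illustrative assumptions; the paper's transformer backbone and training losses are omitted.

```python
import torch
import torch.nn as nn


class CanonicalPointHead(nn.Module):
    """Per-pixel joint prediction of depth and canonical facial coordinates.

    Given a feature map from any image backbone (e.g. a transformer encoder),
    two small heads output (1) a depth value and (2) a 3D coordinate in a
    shared canonical face space for every pixel. Hypothetical architecture,
    for illustration only.
    """

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.depth_head = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(feat_dim, 1, 1),  # one depth value per pixel
        )
        self.canon_head = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(feat_dim, 3, 1),  # (x, y, z) in canonical space
        )

    def forward(self, feats: torch.Tensor):
        depth = self.depth_head(feats)               # (B, 1, H, W)
        canon = torch.tanh(self.canon_head(feats))   # (B, 3, H, W), normalized
        return depth, canon


def track_points(canon_a: torch.Tensor, canon_b: torch.Tensor,
                 queries: torch.Tensor) -> torch.Tensor:
    """Track query pixels from frame A to frame B via canonical space.

    Because every frame maps into the same canonical coordinates, dense
    correspondence reduces to nearest-neighbor search: a pixel in B matches
    a query in A when their canonical points (nearly) coincide.

    canon_a, canon_b: (3, H, W) canonical maps.
    queries: (N, 2) integer pixel coords (row, col) in frame A.
    Returns (N, 2) pixel coords in frame B.
    """
    _, _, W = canon_b.shape
    q_canon = canon_a[:, queries[:, 0], queries[:, 1]].T   # (N, 3)
    b_flat = canon_b.reshape(3, -1).T                      # (H*W, 3)
    dists = torch.cdist(q_canon, b_flat)                   # (N, H*W)
    idx = dists.argmin(dim=1)
    return torch.stack((idx // W, idx % W), dim=1)         # (N, 2)


# Toy usage with random features standing in for two frames of one face.
head = CanonicalPointHead()
_, canon_a = head(torch.randn(1, 256, 32, 32))
_, canon_b = head(torch.randn(1, 256, 32, 32))
queries = torch.tensor([[10, 12], [20, 5]])
matches = track_points(canon_a[0], canon_b[0], queries)
print(matches.shape)  # torch.Size([2, 2])
```

The sketch reflects why the formulation unifies the two tasks: depth plus camera intrinsics gives per-frame 3D geometry, while the canonical coordinates provide identity of points across frames, so tracking needs no separate flow or registration module.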