SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis

2026-05-11 · Computer Vision and Pattern Recognition · Artificial Intelligence
AI summary

The authors propose SDTalk, a method for creating realistic, animated 3D talking head videos from a single image. Unlike previous methods that must be trained separately for each person, SDTalk generalizes to new faces without extra training. It uses a two-step process: first, it reconstructs a complete 3D head with structured facial priors, including occluded regions; then, it models both broad and subtle facial movements to improve lip-sync and fine detail. Experiments show that SDTalk produces sharper videos and renders faster than prior approaches.

talking head synthesis · 3D Gaussian Splatting · one-shot learning · facial priors · motion field · lip synchronization · real-time rendering · computer vision
Authors
Peng Jia, Zhen Xiao, Jia Li, Xueliang Liu, Zhenzhen Hu, Lingyun Yu
Abstract
High-quality, real-time talking head synthesis remains a fundamental challenge in computer vision. Existing reconstruction- and rendering-based methods typically rely on identity-specific models, limiting cross-identity generalization. To address this issue, we propose SDTalk, a one-shot 3D Gaussian Splatting (3DGS)-based framework that generalizes to unseen identities without personalized training or fine-tuning. Our framework comprises two modules with a two-stage training strategy. In the first stage, we incorporate structured facial priors into the reconstruction module and separately predict 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. In the second stage, we introduce a dual-branch motion field to model coarse and fine facial dynamics, improving detail fidelity and lip synchronization. Experiments demonstrate that SDTalk surpasses existing methods in both visual quality and inference efficiency.
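To make the dual-branch idea in the abstract concrete, here is a minimal, purely illustrative sketch: a coarse branch applies one shared (e.g. audio-driven) deformation to all Gaussian centers, while a fine branch adds small per-Gaussian residuals. All function names and the toy residual formula are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def coarse_branch(centers, drive):
    # Coarse dynamics: broadcast one global offset (e.g. jaw/head motion)
    # to every Gaussian center. `drive` is a hypothetical 3D driving signal.
    return np.tile(drive, (centers.shape[0], 1))

def fine_branch(centers, drive):
    # Fine dynamics (toy stand-in): modulate the driving signal by each
    # center's position so nearby Gaussians move coherently but not identically.
    return 0.1 * np.tanh(centers) * drive

def deform(centers, drive):
    # Final motion field = coarse global motion + fine local residuals,
    # added to the original Gaussian centers.
    return centers + coarse_branch(centers, drive) + fine_branch(centers, drive)

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 3))       # 5 Gaussian centers in 3D
drive = np.array([0.02, -0.01, 0.0])    # hypothetical audio/expression offset
moved = deform(centers, drive)
print(moved.shape)  # (5, 3)
```

In the actual method both branches would be learned networks conditioned on the driving signal; the sketch only shows how coarse and fine offsets compose additively on the 3DGS centers.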