Retrieval-Augmented Gaussian Avatars: Improving Expression Generalization
2026-03-09 • Computer Vision and Pattern Recognition · Graphics · Machine Learning
AI summary
The authors developed RAF, a method to improve how digital head avatars mimic facial expressions without using preset templates. Their approach uses a large collection of expression examples to mix and match with a person's own recorded expressions during training, helping the avatar handle more varied facial movements. This makes the avatars better at showing expressions they haven't seen before, without needing extra labels or changes to the model itself. Tests show RAF improves the accuracy of facial expressions both when animating the original person and when transferring motions from others.
Keywords
head avatars · facial deformation · expression coverage · template-free model · nearest-neighbor retrieval · expression distribution shift · identity-expression decoupling · blendshapes · NeRSemble benchmark
Authors
Matan Levy, Gavriel Habib, Issar Tzachor, Dvir Samuel, Rami Ben-Ari, Nir Darshan, Or Litany, Dani Lischinski
Abstract
Template-free animatable head avatars can achieve high visual fidelity by learning expression-dependent facial deformation directly from a subject's capture, avoiding parametric face templates and hand-designed blendshape spaces. However, since learned deformation is supervised only by the expressions observed for a single identity, these models suffer from limited expression coverage and often struggle when driven by motions that deviate from the training distribution. We introduce RAF (Retrieval-Augmented Faces), a simple training-time augmentation designed for template-free head avatars that learn deformation from data. RAF constructs a large unlabeled expression bank and, during training, replaces a subset of the subject's expression features with nearest-neighbor expressions retrieved from this bank while still reconstructing the subject's original frames. This exposes the deformation field to a broader range of expression conditions, encouraging stronger identity-expression decoupling and improving robustness to expression distribution shift without requiring paired cross-identity data, additional annotations, or architectural changes. We further analyze how retrieval augmentation increases expression diversity and validate retrieval quality with a user study showing that retrieved neighbors are perceptually closer in expression and pose. Experiments on the NeRSemble benchmark demonstrate that RAF consistently improves expression fidelity over the baseline, in both self-driving and cross-driving scenarios.
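The abstract describes RAF's core training-time mechanism: build an unlabeled expression bank, and during training stochastically replace a frame's expression feature with a nearest-neighbor feature retrieved from the bank, while keeping the subject's original frame as the reconstruction target. Below is a minimal sketch of that swap in NumPy; all names (`build_bank`, `retrieve_neighbor`, `augment_batch`) and details such as the distance metric, the neighborhood size `k`, and the swap probability are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_bank(expression_features):
    """Stack unlabeled expression feature vectors into a bank of shape [N, D].
    (Hypothetical: the paper's feature extractor and bank construction may differ.)"""
    return np.asarray(expression_features, dtype=np.float32)

def retrieve_neighbor(bank, query, k=5):
    """Sample one of the k nearest bank entries to `query` under L2 distance.
    Sampling among top-k (rather than taking the single nearest) is an assumed detail."""
    dists = np.linalg.norm(bank - query, axis=1)
    topk = np.argsort(dists)[:k]
    return bank[rng.choice(topk)]

def augment_batch(bank, batch_expr, swap_prob=0.5):
    """RAF-style augmentation sketch: with probability `swap_prob`, replace a frame's
    expression feature with a retrieved neighbor. The photometric reconstruction
    target (the subject's original frame) is left unchanged, which is what pushes
    the deformation field toward identity-expression decoupling."""
    out = batch_expr.copy()
    for i, query in enumerate(batch_expr):
        if rng.random() < swap_prob:
            out[i] = retrieve_neighbor(bank, query)
    return out
```

In this sketch the swapped features enter only the deformation conditioning path; because no paired cross-identity data, labels, or architectural changes are involved, the augmentation can wrap around an existing training loop.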