From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

2026-06-11

Computation and Language
AI summary

The authors studied different ways to turn speech into 3D facial animations, focusing on how the sound is represented. They compared four types of speech features to see which best predicts realistic facial movements. Their results showed that features capturing phonetic information help create better facial animations. Based on these findings, they created a new method that can use discrete speech representations to generate both speech audio and matching 3D facial motion.

speech representation, 3D facial animation, self-supervised learning (SSL), neural codecs, automatic speech recognition (ASR), phonetic units, facial reconstruction, discrete representations, Audio Visual Text-to-Speech (AVTTS)
Authors
Pedro Correa, Olivier Perrotin, Samir Sadok, Paula Costa, Thomas Hueber
Abstract
The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We find that encoding phonetic classes is beneficial for accurate facial animation prediction: semantic and label-based representations both benefit from it and reach comparable facial animation quality. Building on the label-based representations, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space from which both speech and 3D facial motion are decoded.
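
As an illustration of what a probing analysis relating discrete tokens to phonetic units might look like, the sketch below computes a simple token-to-phone purity score: the fraction of frames whose phone label matches the most frequent phone for their token. This is a generic measure and not necessarily the one used by the authors; the function name `token_phone_purity` and the toy frame-level data are hypothetical and assume that token IDs and phone labels are already aligned frame by frame.

```python
from collections import Counter, defaultdict


def token_phone_purity(tokens, phones):
    """Return the fraction of frames whose phone label equals the most
    frequent phone observed for that frame's discrete token.

    Hypothetical helper for illustration; assumes `tokens` and `phones`
    are frame-aligned sequences of equal, non-zero length.
    """
    if len(tokens) != len(phones):
        raise ValueError("tokens and phones must have the same length")
    if not tokens:
        raise ValueError("at least one frame is required")

    # Count how often each phone co-occurs with each token.
    phone_counts_by_token = defaultdict(Counter)
    for token, phone in zip(tokens, phones):
        phone_counts_by_token[token][phone] += 1

    # For each token, keep the count of its dominant phone.
    matched = sum(
        counts.most_common(1)[0][1]
        for counts in phone_counts_by_token.values()
    )
    return matched / len(tokens)


# Toy, made-up frame-level data (not from the paper).
tokens = [3, 3, 3, 7, 7, 1, 1, 1]
phones = ["a", "a", "m", "m", "m", "s", "s", "s"]
purity = token_phone_purity(tokens, phones)
print(f"token-to-phone purity: {purity:.3f}")
```

A score near 1 would suggest that each token maps mostly to a single phone class, while a low score would suggest that tokens mix several phone classes.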