FacePlex: Full-Duplex Joint Speech-Facial Motion Generation for Conversational Avatars
2026-06-29 • Artificial Intelligence
Artificial Intelligence • Computer Vision and Pattern Recognition • Machine Learning
AI summary
The authors address the challenge of building real-time systems that generate speech and matching facial motion jointly rather than in separate stages. They introduce FacePlex, a streaming method that produces speech and facial motion together, step by step, during live conversation. Two components keep the face and speech in sync while streaming: Rolling Flow Matching, which commits new motion frames at each streaming step, and Rolling Cross-Attention, which lets the audio and motion streams condition each other. Experiments show that FacePlex achieves better lip-sync quality and motion fidelity than previous methods that animate a face only from already available audio.
full-duplex speech generation, facial motion synthesis, real-time streaming, flow matching, cross-attention, lip-sync, audio-driven animation, token generation, motion fidelity, online joint generation
Authors
Habin Lim, Jae-Ho Lee, Hah Min Lew, Ji-Su Kang, Gyeong-Moon Park
Abstract
Natural face-to-face conversation requires real-time speech generation together with synchronized facial motion. Existing systems only partially address this problem: speech-only full-duplex models can generate speech in real time but do not produce facial motion, while audio-driven facial motion models animate a face from already available audio rather than jointly generating speech and motion online. To bridge this gap, we first formalize full-duplex joint speech-facial motion generation, where speech tokens and facial motion tokens are produced together at every step. Building on this formulation, we propose FacePlex, a unified streaming framework with two key components. First, Rolling Flow Matching adapts flow matching to online motion generation by committing new motion frames at each streaming step. Second, Rolling Cross-Attention couples the streaming audio queue with the motion queue, allowing speech and facial motion to condition each other as generation progresses. Through extensive experiments, ablation studies, and a user study, we show that FacePlex enables full-duplex joint speech-facial motion generation under online streaming constraints, while achieving stronger lip-sync quality and motion fidelity than audio-driven facial motion baselines.
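To make the rolling idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes one plausible reading of the abstract: a fixed-length motion queue in which each frame sits at a different point along a straight flow-matching path, every streaming step advances all frames by one Euler step, and the oldest frame is committed once it reaches the end of the path. The names `RollingGenerator` and `cross_attend`, the window size, the feature dimension, and the closed-form velocity (which stands in for a learned network) are all hypothetical. Only one conditioning direction is shown (motion attending to audio), and audio and motion features share a dimension so that no learned projections are needed.

```python
import math
import random
from collections import deque


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return out


class RollingGenerator:
    """Toy rolling queue (hypothetical): one noise frame enters per step and,
    after a warm-up of `window` steps, one finished frame leaves per step."""

    def __init__(self, window=4, dim=3, seed=0):
        self.window = window
        self.dim = dim
        self.rng = random.Random(seed)
        self.audio = deque(maxlen=window)  # most recent audio features
        self.motion = deque()              # [steps_done, frame] pairs, oldest first

    def step(self, audio_feature):
        # Roll both queues forward: newest audio feature, fresh noise frame.
        self.audio.append(list(audio_feature))
        noise = [self.rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        self.motion.append([0, noise])

        # Motion frames (queries) attend over the audio queue (keys and values).
        frames = [frame for _, frame in self.motion]
        keys = list(self.audio)
        contexts = cross_attend(frames, keys, keys)

        # One Euler step per frame along a straight path x_t = (1 - t) x_0 + t x_1.
        # The attended context plays the role of the predicted endpoint x_1, so
        # the velocity at time t is (x_1 - x_t) / (1 - t). This closed form is an
        # assumption standing in for a learned velocity field.
        dt = 1.0 / self.window
        for entry, ctx in zip(self.motion, contexts):
            t = entry[0] * dt  # at most (window - 1) / window, so 1 - t > 0
            entry[1] = [x + dt * (c - x) / (1.0 - t) for x, c in zip(entry[1], ctx)]
            entry[0] += 1

        # Commit the oldest frame once it has completed all `window` steps.
        if self.motion[0][0] == self.window:
            return self.motion.popleft()[1]
        return None  # still warming up
```

In this sketch, calling `step` once per incoming audio feature returns `None` for the first `window - 1` calls and one committed motion frame per call afterwards, so the per-step cost stays constant while each frame still receives `window` refinement steps. A real system would replace the closed-form velocity with a trained model and would also let the speech stream attend to the motion queue.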