Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

2026-06-01 | Computer Vision and Pattern Recognition

AI summary

The authors present a method for generating talking portrait videos that can be streamed in real time from speech audio and a few reference images. They design a causal video VAE, a video compression model that takes the reference images as guidance so that its latents can focus on what changes between frames rather than on static appearance, which improves both compression and reconstruction quality. A generator then produces the video latents block by block, each block conditioned on the ones before it. Experiments show the system matches or outperforms larger, slower models in realism, vividness, and video quality.

video diffusion models, talking portrait generation, causal video VAE, latent compression, autoregressive model, spatial-temporal causality, Rectified Flow Transformer, blockwise generation, real-time video synthesis
Authors
Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang, Zhan Chen, Jiaolong Yang, Baining Guo
Abstract
Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed specifically for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise autoregressive manner. Our method enables real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these larger baseline models in realism, vividness, and video quality.
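The blockwise autoregressive sampling described in the abstract can be illustrated with a small sketch. This is not the authors' model: `toy_velocity` is a hypothetical stand-in for the learned Rectified Flow Transformer, the target it flows toward is an arbitrary mix of the previous block and the audio feature, and all names, shapes, and step counts are assumptions chosen for illustration. It shows only the control flow: each latent block starts from noise, is integrated along a straight-line flow with a few Euler steps, and then becomes the context for the next block, so blocks can be emitted as soon as they are ready.

```python
import numpy as np


def toy_velocity(x, t, context, audio):
    """Hypothetical stand-in for a learned velocity network.

    For a straight path x_t = (1 - t) * x_0 + t * x_1, the velocity is
    x_1 - x_0 = (x_1 - x_t) / (1 - t). Here x_1 is a toy target built
    from the previous block (context) and the current audio feature.
    """
    target = 0.5 * context + 0.5 * audio
    return (target - x) / (1.0 - t)


def generate_stream(audio_feats, block_shape, num_steps=4, seed=0):
    """Yield latent blocks one at a time, each conditioned on the last."""
    rng = np.random.default_rng(seed)
    context = np.zeros(block_shape)  # assumption: empty context for block 0
    for audio in audio_feats:
        x = rng.standard_normal(block_shape)  # start each block from noise
        for k in range(num_steps):
            t = k / num_steps  # t runs over 0, 1/n, ..., (n-1)/n
            x = x + toy_velocity(x, t, context, audio) / num_steps  # Euler step
        context = x  # the finished block conditions the next one
        yield x


if __name__ == "__main__":
    audio_feats = [np.ones((2, 3)), 3.0 * np.ones((2, 3))]
    for i, block in enumerate(generate_stream(audio_feats, (2, 3))):
        print(i, block.shape)
```

In a real system the yielded latents would be passed to the causal VAE decoder to produce frames; that decoder is omitted here. Because the loop is a generator, a consumer can start decoding the first block while later blocks are still being sampled, which is the property that makes the scheme streamable.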