InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

2026-06-22 · Computer Vision and Pattern Recognition

AI summary

The authors developed InteractiveAvatar, a system that creates live, continuous videos of avatars that look consistent over time and respond to user intentions. To keep the avatar’s appearance stable, they use a Long-Short Visual Memory method that remembers both recent and older visual details. To make the avatar act and speak according to what the user wants, they designed a Reasoning-Reaction Module that manages the avatar's state and responses. Their tests show it works well for long video streams and interactive situations.

diffusion-based models, real-time streaming, visual temporal consistency, autoregressive distillation, Long-Short Visual Memory (LSVM), user intent recognition, Reasoning-Reaction Module (RRM), state-cycling, cache-switching, interactive avatar generation
Authors
Quanyue Song, Yishan He, Yanfei Zhang, Shihao Cheng, Zhixiang He, Zhizhi Guo, Chi Zhang, Xuelong Li, Caigui Jiang
Abstract
Recent diffusion-based models have enabled realistic audio-driven avatar generation in real-time streaming. However, existing approaches struggle to maintain visual temporal consistency and fail to explicitly perceive user intent in complex interactive streaming scenarios. To address these challenges, we propose InteractiveAvatar, a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, InteractiveAvatar achieves real-time streaming generation of human avatars over arbitrarily long durations. For visual consistency, we introduce a Long-Short Visual Memory (LSVM) mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars whose speech and actions align with user intent, we propose a Reasoning-Reaction Module (RRM), which incorporates a State-Cycling strategy and a Cache-Switching mechanism. Extensive experimental results over diverse scenarios demonstrate that our method achieves state-of-the-art visual consistency in long-duration generation, while enabling complex user-avatar interaction in real time.
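The abstract does not say how LSVM compresses history, so the following is only a toy sketch of the general long-short memory idea: a bounded window of recent tokens kept verbatim for short-range coherence, plus a fixed number of summary slots that absorb older tokens for long-term consistency. The class name `LongShortMemory`, its parameters, and the count-weighted mean used as the compression step are all assumptions made for illustration, not the paper's method.

```python
from collections import deque
from typing import Deque, List, Tuple

Token = List[float]


class LongShortMemory:
    """Toy long-short memory (hypothetical; not the paper's LSVM).

    Recent tokens are stored as-is in a short window. Tokens that fall
    out of the window are folded into a fixed number of long-term
    summary slots, so the context size stays bounded for any stream
    length.
    """

    def __init__(self, short_len: int = 4, long_slots: int = 2) -> None:
        self.short_len = short_len
        self.long_slots = long_slots
        self.short: Deque[Token] = deque()
        # Each long-term slot holds (summary token, number of tokens merged).
        self.long: List[Tuple[Token, int]] = []

    def push(self, token: Token) -> None:
        """Add the newest token; evict the oldest into long-term memory."""
        self.short.append(token)
        if len(self.short) > self.short_len:
            self._absorb(self.short.popleft())

    def _absorb(self, token: Token) -> None:
        """Store an evicted token, merging the two oldest slots if full."""
        self.long.append((token, 1))
        if len(self.long) > self.long_slots:
            (a, na), (b, nb) = self.long[0], self.long[1]
            # Assumed compression step: count-weighted mean of two summaries.
            merged = [(x * na + y * nb) / (na + nb) for x, y in zip(a, b)]
            self.long[:2] = [(merged, na + nb)]

    def context(self) -> List[Token]:
        """Return long-term summaries followed by the recent window."""
        return [tok for tok, _ in self.long] + list(self.short)
```

With `short_len=3` and `long_slots=2`, pushing any number of tokens yields a context of at most five entries, which is the property that makes unbounded streaming feasible in this sketch. A real system would presumably replace the mean with a learned compressor operating on visual latents.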