CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

2026-05-12 · Computer Vision and Pattern Recognition

AI summary

The authors focus on improving how AI generates videos with multiple shots and changing scenes, rather than just continuous single shots. They propose CausalCine, which lets a model direct videos step by step, adjusting scenes on the fly without regenerating earlier parts. To do this, they train the model on native multi-shot videos and introduce Content-Aware Memory Routing (CAMR), a memory system that retains important context while staying efficient. Their experiments show that CausalCine outperforms previous autoregressive methods and can generate videos interactively in real time.

autoregressive video generation, multi-shot video, causal model, Content-Aware Memory Routing, online directing, real-time generation, shot transitions, attention mechanism, interactive generation, key-value memory
Authors
Yihao Meng, Zichen Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yue Yu, Hanlin Wang, Haobo Li, Jiapeng Zhu, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen, Huamin Qu
Abstract
Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
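The core routing idea behind CAMR, retrieving cached key-value entries by attention-based relevance rather than temporal proximity, can be sketched as follows. This is a minimal illustration under our own assumptions (dot-product relevance scoring, hard top-k selection, NumPy arrays); function names, shapes, and the selection rule are ours, not the paper's.

```python
import numpy as np

def camr_select(query, mem_keys, mem_vals, k):
    """Hypothetical sketch of content-aware memory routing:
    pick the k cached KV entries most relevant to the current query,
    regardless of how long ago they were written."""
    d = query.shape[-1]
    # Attention-style relevance score of each cached key to the query.
    scores = mem_keys @ query / np.sqrt(d)        # shape: (n_mem,)
    top = np.argsort(scores)[-k:]                 # indices of the top-k entries
    return mem_keys[top], mem_vals[top]

def attend(query, keys, vals):
    """Scaled dot-product attention over the routed subset only,
    keeping the active memory bounded at k entries."""
    logits = keys @ query / np.sqrt(query.shape[-1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ vals

rng = np.random.default_rng(0)
d = 8
mem_k = rng.standard_normal((32, d))   # cached keys from earlier shots
mem_v = rng.standard_normal((32, d))   # cached values from earlier shots
q = rng.standard_normal(d)             # query for the current frame

k_sel, v_sel = camr_select(q, mem_k, mem_v, k=4)  # bounded active memory
out = attend(q, k_sel, v_sel)
print(out.shape)  # (8,)
```

The contrast with a sliding-window cache is that the 4 retained entries need not be the 4 most recent ones, so context from an earlier shot can survive a shot boundary if it scores as relevant.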