RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

2026-05-14 • Computer Vision and Pattern Recognition

AI summary

The authors present RAVEN, a new training method for video generation models that helps the model better predict future video frames by closely matching the conditions it sees during real use. They do this by mixing clean past frames with noisy intermediate steps during training, improving how the model learns from past information. They also introduce CM-GRPO, a reinforcement learning technique that optimizes the video generation process more efficiently than past methods. Their experiments show that RAVEN and CM-GRPO together improve video quality and consistency compared to previous approaches.

autoregressive models, video diffusion, causal generation, training-inference mismatch, denoising, reinforcement learning, latent representations, Euler-Maruyama method, consistency models, policy optimization

Authors

Yanzuo Lu, Ronglai Zuo, Jiankang Deng

Abstract

Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
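
The idea of treating a consistency sampling step as a conditional Gaussian transition can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a rectified-flow style re-noising rule, x_s = (1 - s) * x0_hat + s * eps with eps ~ N(0, I), so that the transition has mean (1 - s) * x0_hat and standard deviation s, and it assumes the usual group-normalized advantage and clipped ratio from GRPO-style training. All function names (`consistency_step`, `gaussian_log_prob`, `group_relative_advantages`, `clipped_surrogate`) are hypothetical.

```python
import math
import random


def consistency_step(x0_hat, s, rng):
    """Re-noise a predicted clean sample to noise level s (assumed rule).

    Draws x_s = (1 - s) * x0_hat + s * eps with eps ~ N(0, I), so the step
    is a Gaussian transition with mean (1 - s) * x0_hat and std s.
    """
    return [(1.0 - s) * m + s * rng.gauss(0.0, 1.0) for m in x0_hat]


def gaussian_log_prob(x_s, x0_hat, s):
    """Log density of x_s under N((1 - s) * x0_hat, s^2 I); requires s > 0."""
    var = s * s
    total = 0.0
    for x, m in zip(x_s, x0_hat):
        diff = x - (1.0 - s) * m
        total += -0.5 * (diff * diff / var + math.log(2.0 * math.pi * var))
    return total


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for one transition (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    rng = random.Random(0)
    x0_hat = [0.5, -1.0, 2.0]  # toy stand-in for a predicted clean latent
    x_s = consistency_step(x0_hat, s=0.5, rng=rng)
    print(gaussian_log_prob(x_s, x0_hat, s=0.5))
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Because the transition density is available in closed form under this assumption, the log-probability ratio between the current and the sampling policy can be computed directly from the sampler's own step, which is the property the abstract contrasts with an auxiliary Euler-Maruyama process.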
