VideoLatent: Video-Language Learning via Latent Self-Forcing

2026-06-22 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors present VideoLatent, a multimodal large language model designed to understand and reason about videos more efficiently. Unlike previous models that depend on detailed step-by-step reasoning annotations and extra supervision, VideoLatent learns directly from standard video-question-answer examples using a training method called latent self-forcing. Across 14 benchmarks, VideoLatent outperforms existing standard and latent models on both general video understanding and complex video reasoning, while reducing training and inference overhead by roughly 6x and 68x, respectively, compared with Video-R1. The authors also show that the approach generalizes to different model backbones and sizes.

Chain-of-Thought (CoT) reasoning, Multimodal Large Language Models (MLLMs), Visual Latent Reasoning, Latent Injection Module, Latent Self-Forcing Training, Video Understanding, Video Reasoning, Video Question-Answering, Computational Efficiency, Model Generalizability
Authors
Zi-Yuan Hu, Zicong Tang, Shijia Huang, Yanyang Li, Michael R. Lyu, Liwei Wang
Abstract
Recent advancements in chain-of-thought (CoT) reasoning have shown promise in enhancing the video understanding and reasoning capabilities of multimodal large language models (MLLMs). However, existing CoT-based MLLMs require labor-intensive CoT annotations and incur substantial training and inference overhead. While visual latent reasoning has emerged as a more efficient alternative, existing methods primarily focus on image tasks and rely heavily on additional supervision signals for visual latent generation (e.g., CoT traces, auxiliary images, or fine-grained annotations), limiting their scalability and transferability to video tasks. To bridge this gap, we introduce VideoLatent, a novel MLLM equipped with a latent injection module tailored for video understanding and reasoning. Specifically, VideoLatent learns to perform visual latent reasoning using a new latent self-forcing training paradigm, which comprises latent alignment and latent diversity objectives and relies solely on standard video-question-answer triplets. Extensive experiments across 14 benchmarks demonstrate that our model consistently outperforms existing standard and latent MLLMs on general video understanding and complex video reasoning. Compared with Video-R1, VideoLatent achieves superior computational efficiency, reducing training and inference overhead by $\sim 6\times$ and $\sim 68\times$, respectively. Moreover, experiments demonstrate that our method generalizes well to different MLLM backbones and model scales.