Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

2026-05-29 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors present Lumos-Nexus, a new method for creating videos guided by instructions that improves both reasoning and video quality without requiring heavy computation during training. They train a small generator to understand instructions first and then, during video creation, gradually shift to a larger, high-quality generator for better visuals. They also introduce a new benchmark, VR-Bench, to test how well models can turn intended meanings into matching video content. Their experiments show that Lumos-Nexus makes videos that look better and are more consistent over time while still following instructions well.

video generation, unified models, semantic control, latent space, progressive refinement, video synthesis benchmarks, temporal coherence, visual fidelity, reasoning-guided generation, pretrained generator
Authors
Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu
Abstract
Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, which limits achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that develops strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: (1) during training, only a lightweight generator is aligned with the understanding block, so that it learns to accept reasoning-driven semantic control; (2) during inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
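The abstract does not spell out how UPFB operates, so the following is only an illustrative sketch of one way a coarse-to-fine, frequency-based hand-off between two generators sharing a latent space could look. It is not the authors' algorithm. The names `lowpass_mask`, `frequency_blend`, and `cutoff_schedule`, the radial low-pass mask, the linear schedule, and the latent shape `(frames, channels, height, width)` are all assumptions made for illustration; random arrays stand in for the per-step latents of the lightweight and high-capacity generators.

```python
import numpy as np


def lowpass_mask(height, width, cutoff):
    """Radial low-pass mask over a 2D frequency grid (hypothetical helper).

    Frequencies whose radius is strictly below `cutoff` times the largest
    radius on the grid get weight 1, all others get weight 0.
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return (radius < cutoff * radius.max()).astype(np.float64)


def frequency_blend(z_small, z_large, cutoff):
    """Take low spatial frequencies from z_small and the rest from z_large.

    Both latents are assumed to live in the same latent space and to have
    identical shape (..., height, width). With cutoff = 0 the result equals
    z_large; with cutoff > 1 it equals z_small.
    """
    mask = lowpass_mask(z_small.shape[-2], z_small.shape[-1], cutoff)
    f_small = np.fft.fft2(z_small, axes=(-2, -1))
    f_large = np.fft.fft2(z_large, axes=(-2, -1))
    blended = mask * f_small + (1.0 - mask) * f_large
    # The mask is symmetric in frequency, so the inverse transform is real
    # up to floating-point round-off.
    return np.fft.ifft2(blended, axes=(-2, -1)).real


def cutoff_schedule(step, num_steps):
    """Linear schedule (an assumption): cutoff 1.0 at the first step, 0.0 at the last."""
    return 1.0 - step / max(num_steps - 1, 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for the two generators' latents at one sampling step.
    z_small = rng.standard_normal((4, 16, 32, 32))
    z_large = rng.standard_normal((4, 16, 32, 32))
    num_steps = 5
    for step in range(num_steps):
        cutoff = cutoff_schedule(step, num_steps)
        z = frequency_blend(z_small, z_large, cutoff)
        print(step, round(cutoff, 2), z.shape)
```

In this toy schedule the lightweight generator supplies nearly the whole spectrum at the first step and nothing at the last, so coarse structure is set early while fine detail comes increasingly from the larger generator. A real sampler would recompute both latents at every step rather than reuse fixed arrays.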