Context Unrolling in Omni Models
2026-04-23 • Computer Vision and Pattern Recognition
AI summary
The authors introduce Omni, a model trained to understand and generate many kinds of data, such as text, pictures, videos, and 3D shapes, all at once. They found that Omni can reason over these different types of data together before making decisions, which helps it combine information better. This leads to stronger performance on tasks involving multiple types of media, including creating or understanding text, images, videos, and 3D content. Essentially, the authors show that training on many kinds of data lets Omni reason more accurately across them.
multimodal model · Context Unrolling · multimodal reasoning · text generation · image generation · video processing · 3D geometry · model training · heterogeneous modalities · knowledge manifold
Authors
Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Chaorui Deng, Kunchang Li, Zihan Ding, Yuwei Guo, Fuyun Wang, Fangqi Zhu, Xiaonan Nie, Shenhan Zhu, Shanchuan Lin, Hongsheng Li, Weilin Huang, Guang Shi, Haoqi Fan
Abstract
We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.
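To make the Context Unrolling idea concrete, below is a minimal, hypothetical sketch of how a model might aggregate heterogeneous modality representations into a shared token space and emit intermediate "unrolling" tokens that attend over that joint context before producing a prediction. The class name, layer choices, dimensions, and the use of a single Transformer decoder are assumptions for illustration only; the abstract does not specify Omni's architecture.

```python
import torch
import torch.nn as nn

class ContextUnrollingSketch(nn.Module):
    """Hypothetical illustration: project each modality into a shared space,
    let learned 'unrolling' tokens cross-attend over the joint multimodal
    context, then decode a prediction from the unrolled representation."""

    def __init__(self, dims, d_model=256, n_unroll=4, vocab=32000):
        super().__init__()
        # One projection per modality, e.g. {"text": 512, "image": 768}.
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        # Learned queries acting as intermediate reasoning ("unrolling") tokens.
        self.unroll_queries = nn.Parameter(torch.randn(n_unroll, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, feats):
        # feats: {modality: (batch, seq_len, dim)} -> one shared multimodal context.
        context = torch.cat([self.proj[m](x) for m, x in feats.items()], dim=1)
        b = context.size(0)
        # Unrolling step: reasoning tokens attend over all modalities at once.
        queries = self.unroll_queries.unsqueeze(0).expand(b, -1, -1)
        unrolled = self.decoder(tgt=queries, memory=context)
        # Final prediction conditioned on the pooled, unrolled representation.
        return self.head(unrolled.mean(dim=1))

# Toy usage with random features standing in for modality encoders.
feats = {"text": torch.randn(2, 16, 512), "image": torch.randn(2, 49, 768)}
model = ContextUnrollingSketch({"text": 512, "image": 768})
print(model(feats).shape)  # torch.Size([2, 32000])
```

The sketch only captures the general pattern the abstract describes, reasoning across modal representations before prediction, not the actual training recipe or modality coverage (video, 3D geometry, hidden representations) reported for Omni.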