AI summary
The authors present OmniRoam, a system that generates panoramic videos to model scenes with wide and consistent views over time. Unlike most methods that create videos from limited perspectives, their approach uses panoramic images to cover more of the scene and maintain consistency for longer video sequences. Their method works in two steps: first making a quick video overview based on a path, then enhancing it to a longer, higher-resolution video. They also introduce new panoramic video datasets and show that their system produces better quality and control compared to existing methods. Additionally, they demonstrate that their framework can support real-time video generation and 3D scene reconstruction.
Keywords
panoramic video generation, video synthesis, scene modeling, trajectory control, long-term consistency, spatial upsampling, real-time video generation, 3D reconstruction, dataset
Authors
Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold-Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, Zifan Shi, Yiwei Hu
Abstract
Modeling scenes using video generation models has garnered growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, leading to issues with scene completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and inherent long-term spatial and temporal consistency of the panoramic representation, enabling long-horizon scene wandering. Our framework begins with a preview stage, in which a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at https://github.com/yuhengliu02/OmniRoam.