Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

2026-04-10 • Computer Vision and Pattern Recognition

AI summary

The authors introduce Immersive Volumetric Videos (IVV), a new way to create 3D videos that let users move around freely with both visual and sound interactions, like in VR or AR. They developed ImViD, a special dataset with high-resolution, multi-camera videos and sound captured in real environments, to help build these experiences. Using this data, the authors designed a system that reconstructs detailed 3D scenes over time and also recreates sound fields from multiple viewpoints. Their pipeline produces stable, high-quality immersive videos with 6 degrees of freedom, meaning users can look and move naturally within the scene. This work sets a foundation for making real-world immersive volumetric video content practical and accessible.

Immersive Volumetric Video, 6-DoF Interaction, Multi-view Capture, Dynamic Light Field, Spatio-temporal Representation, Sound Field Reconstruction, Virtual Reality, Augmented Reality, High-resolution Video, Temporal Calibration

Authors

Zhengxian Yang, Shengqi Wang, Shi Pan, Hongshuai Li, Haoxiang Wang, Lin Li, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu

Abstract

Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos (IVV), a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground-background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1 to 5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.
