Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

2026-06-11 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Graphics

AI summary

The authors present Flex4DHuman, a model that converts single-camera or sparse multi-view videos of moving people into synchronized multi-view videos without needing explicit 3D inputs such as skeletons or depth maps. Instead, it uses the relative camera poses between views to guide the video generation. The generated videos can be passed to a reconstruction stage to build dynamic 3D representations called 4D Gaussian splats, making the method useful for simulation and gaming. The authors also show that the same approach extends to animals after training on mixed human and animal data. Overall, they offer a way to create 4D content from everyday monocular video recordings.

multi-view video, camera-pose conditioning, 4D Gaussian splatting, positional encoding, video diffusion model, spatio-temporal RoPE, SE(3) geometry, curriculum training, monocular video, text-to-video generation

Authors

Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang

Abstract

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.
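
As a rough illustration of the "relative camera geometry" the abstract refers to, the sketch below computes the standard SE(3) relative transform between a reference camera and a target camera. It is not taken from the paper: the abstract does not say how the relative pose is parameterized or how it enters the positional encoding, so the camera-to-world convention and the helper names `make_c2w` and `relative_pose` are assumptions made for this example only.

```python
import numpy as np


def make_c2w(yaw_rad, position):
    """Build a 4x4 camera-to-world matrix from a yaw angle about the y-axis
    and a camera position (hypothetical helper, for illustration only)."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = np.array([[c, 0.0, s],
                             [0.0, 1.0, 0.0],
                             [-s, 0.0, c]])
    pose[:3, 3] = np.asarray(position, dtype=float)
    return pose


def relative_pose(c2w_ref, c2w_tgt):
    """Return the 4x4 transform that maps points expressed in the reference
    camera frame into the target camera frame (assumes camera-to-world inputs)."""
    return np.linalg.inv(c2w_tgt) @ c2w_ref


# Two cameras looking at a subject from different sides.
ref = make_c2w(0.0, [0.0, 0.0, 2.0])
tgt = make_c2w(np.pi / 2, [2.0, 0.0, 0.0])
rel = relative_pose(ref, tgt)
```

Because only the relative transform is used, the result does not depend on where the world origin is placed, which is one common reason to condition on relative rather than absolute poses.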
