MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

2026-06-01, Computer Vision and Pattern Recognition

AI summary

The authors introduce MORPHOS, a method that generates moving 3D models from videos in several output formats: meshes, 3D Gaussians, and radiance fields. At its core is Temporal Structured Latents (T-SLAT), a single representation that encodes shape and appearance together over time, which helps the model stay consistent across frames and handle objects whose topology changes. MORPHOS generates each frame conditioned on the frames before it, and a temporal-structural augmentation reduces the errors that would otherwise build up during this step-by-step generation. The authors report state-of-the-art appearance quality and competitive geometry across multiple benchmarks, along with robust results on long sequences.

autoregressive model, 3D assets, meshes, 3D Gaussians, radiance fields, temporal consistency, topological changes, causal attention, dynamic geometry, temporal augmentation
Authors
Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim, JongMin Lee, Seungryong Kim
Abstract
We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.
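
The causal conditioning described in the abstract can be illustrated with a frame-level (block-causal) attention mask, in which every token of a frame may attend to tokens of the same frame and of earlier frames, but never to later ones. The sketch below is a generic illustration of that idea, not the authors' implementation: the function names, the fixed number of tokens per frame, and the use of plain vectors in place of T-SLAT latents are all assumptions made for the example.

```python
import math


def block_causal_mask(num_frames, tokens_per_frame):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption: every frame contributes the same number of tokens.
    """
    frame_ids = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    return [[fi >= fj for fj in frame_ids] for fi in frame_ids]


def causal_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by `mask`."""
    dim = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Keep only the keys this query is allowed to see.
        allowed = [j for j in range(len(k)) if mask[i][j]]
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(dim)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed keys.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for w, j in zip(weights, allowed)) / total
            for d in range(len(v[0]))
        ])
    return out


# Toy usage: 3 frames, 2 tokens per frame, 2-dimensional tokens.
tokens = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4], [0.2, 0.9], [0.7, 0.3], [0.6, 0.8]]
mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
outputs = causal_attention(tokens, tokens, tokens, mask)
```

Because tokens in the first frame never see later frames, their outputs are unchanged when later frames are edited, which is the property that lets an autoregressive model extend a sequence frame by frame.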