DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

2026-06-30 • Robotics

AI summary

The authors developed a new method called DVG-WM to help robots predict and visualize what will happen during tasks more quickly and in more detail. They split the process into two parts: understanding how things will move (dynamics) and creating clear video images (visual synthesis). This separation makes the predictions faster and more detailed, especially for tricky parts like contact between objects. Tests showed their approach improves video quality and speeds up planning on both simulated and real robots.

Embodied world models, Video generation, Robotic manipulation, Dynamics learning, Visual synthesis, Flow matching, Latent degradation, Temporal reasoning, Planning, High-fidelity video

Authors

Ziyu Shan, Zhenyu Wu, Xiaofeng Wang, Zheng Zhu, Ziwei Wang

Abstract

Video-based embodied world models provide an appealing substrate for robotic manipulation by predicting future states, yet current approaches remain limited by a fundamental entanglement: accurately modeling dynamics typically requires low-level temporal reasoning, while producing high-resolution frames demands expansive visual synthesis guided by high-level semantics. This entanglement leads either to inference that is too slow for iterative planning or to predictions that are too coarse to retain contact-rich details. To resolve this dilemma, we present the Disentangled Video Generation World Model (DVG-WM), an efficient framework that explicitly decomposes world modeling into dynamics learning and visual synthesis. Conditioned on an initial observation and a language instruction, our model first generates a plausible sequence of intermediate visual states to preview the physical interaction, then refines them into high-fidelity videos. Furthermore, we propose an efficient cascading mechanism in which DVG-WM uses flow matching to map the dynamics directly to video latents and introduces a latent degradation mechanism to regenerate contact-rich details. Experiments on LIBERO and real-world platforms demonstrate improved video quality with up to a 3.97× speedup, validating that disentangled video generation can serve as an efficient embodied world model for robotic manipulation.
