IOI: Decoupling Kinematics and Physics for Interactive World Models

2026-06-22 • Robotics

AI summary

The authors developed a new simulation model called IOI to help robots learn how to move and interact more realistically. IOI combines known movement rules (kinematics) with learned physical effects, improving accuracy and preventing errors common in purely data-driven models. It uses multiple camera views to guide video generation without complex calibrations and separates predictable motion from unpredictable physical interactions. Tests showed IOI simulates robot behavior well, adapts to new tasks, and helps train policies that work in real-world settings.

embodied agents, interactive world models, kinematics, physical dynamics, forward kinematics, multi-view projection, video generation, out-of-distribution generalization, policy evaluation, simulation fidelity

Authors

Chengyu Bai, Peidong Jia, Tiecheng Guo, Yukai Wang, Rui Ma, Fangyuan Zhao, Chunkai Fan, Xiaobao Wei, Jintao Chen, Hao Wang, Ying Li, Xiaozhu Ju, Jian Tang, Shanghang Zhang

Abstract

Developing generalist embodied agents requires interactive environments providing visually realistic feedback and accurate action-conditioned dynamics. Interactive world models address this by simulating such complex dynamics. However, purely data-driven methods struggle to ensure precise control alignment and physically plausible visual feedback due to a lack of explicit structural constraints. To address this, we propose IOI, a hybrid interactive world model integrating analytical kinematic priors with learned physical dynamics. Unlike data-driven approaches prone to spatiotemporal drift, IOI introduces explicit kinematic guidance, computing forward kinematics from action sequences for accurate motion trajectories. These trajectories are rendered into synchronized front, side, and top orthographic projections, eliminating the need for extrinsic camera calibration. A Multi-view Kinematic Aggregation and Injection module fuses these geometric cues and injects them into the video generator, providing geometry-consistent guidance. Conditioning video generation on these deterministic trajectories establishes a synergy between the analytical simulator and the world model. Decoupling deterministic motion into the kinematic prior frees the generator to model stochastic physical interactions. Experiments on the RoboTwin benchmark validate IOI across kinematic fidelity, out-of-distribution (OOD) generalization, and policy evaluation. IOI achieves state-of-the-art simulation performance and robust zero-shot generalization to unseen OOD tasks. Furthermore, IOI serves as a reliable policy evaluator, yielding success rates closely aligning with ground-truth physics simulators. On real-world platforms, policies trained on IOI-synthesized data match those trained on teleoperation demonstrations, solidifying its practical value for embodied policy learning.
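
The kinematic guidance described in the abstract (forward kinematics computed from an action sequence, then rendered into front, side, and top orthographic views) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-DoF arm, the link lengths `l1` and `l2`, the axis conventions for each view, and the function names `forward_kinematics`, `orthographic_views`, and `render_trajectory` are all assumptions made for illustration. Actions are assumed to be joint angles in radians.

```python
import math

def forward_kinematics(yaw, shoulder, elbow, l1=0.3, l2=0.25):
    """Return 3D points (base, elbow, end effector) for a hypothetical 3-DoF arm."""
    # Reach (r) and height (z) of each joint inside the arm's vertical plane.
    r1 = l1 * math.cos(shoulder)
    z1 = l1 * math.sin(shoulder)
    r2 = r1 + l2 * math.cos(shoulder + elbow)
    z2 = z1 + l2 * math.sin(shoulder + elbow)

    # Rotate that plane about the vertical axis by the base yaw.
    def to_world(r, z):
        return (r * math.cos(yaw), r * math.sin(yaw), z)

    return [(0.0, 0.0, 0.0), to_world(r1, z1), to_world(r2, z2)]

def orthographic_views(points):
    """Project 3D points onto three axis-aligned planes (assumed conventions)."""
    return {
        "front": [(x, z) for x, y, z in points],  # drop y
        "side": [(y, z) for x, y, z in points],   # drop x
        "top": [(x, y) for x, y, z in points],    # drop z
    }

def render_trajectory(actions):
    """Map each action (yaw, shoulder, elbow) to its three 2D projections."""
    return [orthographic_views(forward_kinematics(*a)) for a in actions]

if __name__ == "__main__":
    actions = [(0.0, 0.0, 0.0), (math.pi / 2, 0.0, 0.0)]
    for step, views in enumerate(render_trajectory(actions)):
        print(step, views["top"][-1])
```

Because each projection simply drops one world axis, the sketch needs no camera extrinsics, which mirrors the calibration-free property the abstract claims. How the projections are rasterized and fused into the video generator is not specified in the abstract and is left out here.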
