R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies
2026-06-15 • Robotics
Robotics, Computer Vision and Pattern Recognition
AI summary
The authors propose R2RDreamer, a method to help robots learn to manipulate objects better by expanding a small set of real demonstrations. Instead of relying heavily on complex 3D scene understanding or simulation, their approach edits 3D data lightly and then completes the visuals in 2D video space. This process keeps the 3D consistency of actions while producing realistic RGB videos for training. Their tests show that R2RDreamer helps robots generalize to new object positions more effectively.
Spatial generalization, Imitation learning, Data augmentation, 3D pointcloud, RGB images, Occlusion reasoning, Video completion, End-effector trajectory, Simulation-to-real gap, Vision-language-action policies
Authors
Xiuwei Xu, Haowen Sun, Angyuan Ma, Yiwei Zhang, Zhenyu Wu, Xiaofeng Wang, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu
Abstract
Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Data augmentation from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and often produce observations tailored to 3D pointcloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object pointclouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, with analyses validating the contributions of 3D editing, occlusion-aware projection, and video completion.
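The two geometric steps named in the abstract, editing object points and end-effector waypoints in a shared 3D frame and then projecting them into image space with occlusion handling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the z-axis rotation, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the simplification that the shared frame coincides with the camera frame are all assumptions made for illustration. The video completion stage is not shown.

```python
import math

def transform_points(points, yaw, translation, pivot=(0.0, 0.0, 0.0)):
    """Rotate points about the z-axis through `pivot`, then translate.

    Hypothetical helper: applying the same rigid edit to object points and
    to end-effector waypoints keeps actions and observations consistent.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    px, py, _ = pivot
    tx, ty, tz = translation
    out = []
    for x, y, z in points:
        rx, ry = x - px, y - py
        out.append((c * rx - s * ry + px + tx,
                    s * rx + c * ry + py + ty,
                    z + tz))
    return out

def project_zbuffer(labeled_points, fx, fy, cx, cy, width, height):
    """Pinhole-project camera-frame points, keeping the nearest label per pixel.

    Assumes z is depth along the optical axis. Returns {(u, v): label}, a
    sparse label mask in which closer points occlude farther ones.
    """
    depth = {}
    mask = {}
    for (x, y, z), label in labeled_points:
        if z <= 0:
            continue  # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # outside the image
        if (u, v) not in depth or z < depth[(u, v)]:
            depth[(u, v)] = z
            mask[(u, v)] = label
    return mask

# Shift an (incomplete) object cloud and its grasp waypoint by the same edit.
object_points = [(0.0, 0.0, 1.0)]
grasp_waypoints = [(0.0, 0.0, 0.9)]
shift = (0.1, 0.0, 0.0)
new_object = transform_points(object_points, 0.0, shift)
new_waypoints = transform_points(grasp_waypoints, 0.0, shift)

# A farther background point on the same camera ray is occluded by the object.
scene = [(p, "object") for p in new_object] + [((0.2, 0.0, 2.0), "background")]
control_mask = project_zbuffer(scene, 100.0, 100.0, 50.0, 50.0, 100, 100)
```

In this sketch the sparse label mask stands in for one frame of a masked control video; pixels with no projected point are the regions a completion model would need to fill.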