R3D: Revisiting 3D Policy Learning
2026-04-16 • Computer Vision and Pattern Recognition • Robotics
AI summary
The authors find that training 3D policy-learning models is difficult because the models overfit and train unstably, mainly due to the omission of 3D data augmentation and problems caused by Batch Normalization. They propose a new model combining a transformer-based 3D encoder with a diffusion decoder to address these issues and stabilize training. The model outperforms current approaches on difficult manipulation tasks, offering a more robust foundation for 3D imitation learning.
3D policy learning • data augmentation • Batch Normalization • transformer • diffusion decoder • pre-training • imitation learning • manipulation benchmarks
Authors
Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao, Ran Ji, Yiyang He, Hangxing Zhang, Zundong Ke, Jun Wang, Guofeng Zhang, Jiayuan Gu
Abstract
3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
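The abstract identifies the omission of 3D data augmentation as one primary cause of overfitting. As a generic illustration (not the authors' exact recipe), one common form of 3D augmentation is to apply a random rigid perturbation, here a yaw rotation plus a translation, consistently to both the input point cloud and the position targets the policy must predict; the function name and parameters below are illustrative assumptions:

```python
import numpy as np

def augment_pointcloud(points, targets, rng, max_angle=np.pi, max_trans=0.1):
    """Apply one random rigid perturbation (yaw rotation + translation)
    consistently to a point cloud and to 3D position targets.

    Generic sketch of 3D data augmentation for policy learning;
    illustrative only, not the paper's specific method.
    points:  (N, 3) array of scene points
    targets: (M, 3) array of, e.g., end-effector position targets
    """
    theta = rng.uniform(-max_angle, max_angle)
    c, s = np.cos(theta), np.sin(theta)
    # Rotation about the (gravity-aligned) z-axis
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    t = rng.uniform(-max_trans, max_trans, size=3)
    # The same transform is applied to inputs and labels, so the
    # demonstration remains geometrically consistent after augmentation.
    return points @ R.T + t, targets @ R.T + t
```

Because the identical rigid transform is applied to observations and labels, pairwise distances within the scene are preserved and the augmented sample remains a valid demonstration, just expressed in a perturbed world frame.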