StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
2026-05-11 • Robotics
Robotics • Computer Vision and Pattern Recognition
AI summary
The authors introduce StereoPolicy, a method for teaching robots to manipulate objects using pairs of stereo images instead of single camera views. This helps the robot better understand depth and spatial layout without needing explicit 3D reconstruction or camera calibration. The method processes each image with existing pretrained 2D vision encoders and fuses the results with a Stereo Transformer that implicitly captures spatial correspondence between the two views. Tested in simulation and on real robots, StereoPolicy outperforms common alternative input modalities such as RGB, RGB-D, and 3D point clouds, suggesting that stereo vision is a practical and scalable way to improve robotic manipulation skills.
robot imitation learning, visuomotor policies, stereo vision, stereo images, Stereo Transformer, pretrained 2D vision encoders, depth cues, geometric reasoning, diffusion-based policies, vision-language-action policies
Authors
Evans Han, Yunfan Jiang, Yingke Wang, Haoyue Xiao, Huang Huang, Jianwen Xie, Jiajun Wu, Li Fei-Fei, Ruohan Zhang
Abstract
Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.
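The abstract describes the core architectural idea: a shared pretrained 2D encoder processes each stereo image independently, and a Stereo Transformer attends over the joint token set so that cross-view correspondence (and hence disparity cues) is captured implicitly, with no explicit 3D reconstruction or calibration. A minimal PyTorch sketch of that idea follows; all module names, sizes, and the mean-pooled output are illustrative assumptions, not the authors' implementation (in particular, a simple conv patch embedding stands in for the pretrained encoder).

```python
import torch
import torch.nn as nn

class StereoFusion(nn.Module):
    """Toy sketch: shared per-view encoder + transformer fusion over both views."""

    def __init__(self, dim=64, patch=8, depth=2, heads=4):
        super().__init__()
        # stand-in for a pretrained 2D vision encoder applied to each view
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # learned embeddings marking which view (left/right) a token came from
        self.view_embed = nn.Parameter(torch.zeros(2, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )
        # "Stereo Transformer" stand-in: self-attention over the concatenated
        # token sequences lets the model match features across views
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, left, right):
        tokens = []
        for i, img in enumerate((left, right)):
            t = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)
            tokens.append(t + self.view_embed[i])                 # tag the view
        fused = self.fusion(torch.cat(tokens, dim=1))             # (B, 2N, dim)
        return fused.mean(dim=1)  # pooled feature for a downstream policy head

model = StereoFusion()
left = torch.randn(2, 3, 32, 32)   # synchronized stereo pair, batch of 2
right = torch.randn(2, 3, 32, 32)
feat = model(left, right)
print(feat.shape)  # torch.Size([2, 64])
```

In this sketch the only stereo-specific machinery is the view embedding plus joint attention; the network is free to learn correspondence rather than being given rectified geometry, which mirrors the paper's claim of working without camera calibration.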