Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning
2026-06-01 • Robotics
AI summary
The authors address problems in robotic manipulation that come from only using flat 2D camera images, which miss important 3D details. They create a new way to represent images with 3D information by combining camera data and depth, making it easier for robots to understand and act in space. They also align the robot’s inputs and movements into a shared 3D coordinate system, using a bird’s-eye-view frame that stays consistent even if cameras move around. Their system helps robots learn better from different data sources and setups, improving how well they work in the real world.
End-to-end manipulation policies, Vision-Language Models (VLMs), 3D vertex map, Camera calibration, Bird's-Eye-View (BEV) alignment, Spatial-temporal alignment, Robot embodiment, Trajectory datasets, Depth sensing, Coordinate system alignment
Authors
Huayi Zhou, Wei Gao, Dekun Lu, Ruiji Liu, Zhanqi Zhang, Ziyang Zhang, Jian Chen, Wenlve Zhou, Sheng Xu, Shumin Li, Kangyi Guo, Shichen Xu, Zixin Huang, Yongyi Su, Kui Jia
Abstract
End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) reliance on 2D RGB inputs, which ignores the intrinsically 3D nature of manipulation; and 2) a lack of 3D spatial alignment between input and output spaces, as well as across diverse robot embodiments, camera setups, and trajectory datasets. In this paper, we present a series of contributions to address these issues. First, we introduce the aligned vertex map and vertex spectrum, a pixel-wise 3D representation that lifts 2D visual inputs to 3D using camera calibration and optional depth. This novel input representation combines 3D awareness with the generalization of large 2D VLMs. Second, we propose to align the inputs and outputs of manipulation policies by expressing the per-pixel 3D information of each camera view and the robot actions in a shared coordinate frame. Building on this, we designate a canonical Bird's-Eye-View (BEV) alignment frame and construct BEV images, producing a view-invariant representation that is robust to camera pose variations. To enable training and evaluation at scale, we develop a comprehensive data processing pipeline that performs these alignments; we also introduce a novel temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. Together, these contributions mitigate spatial-temporal misalignments in inputs and outputs, improving consistency and generalization in real-world manipulation. The pretrained checkpoint, source code, and data processing pipeline are available at https://hnuzhy.github.io/projects/Dex-BEV.
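To make the idea of a pixel-wise 3D representation in a shared frame more concrete, the following is a minimal sketch of how a depth image might be lifted to a per-pixel 3D map and expressed in a common coordinate frame. It is not the authors' implementation: it assumes a standard pinhole camera model, assumes metric depth is available (the abstract notes depth is optional), and the names `vertex_map`, `to_shared_frame`, `K`, and `T_shared_cam` are hypothetical.

```python
import numpy as np


def to_shared_frame(points, T_shared_cam):
    """Map (..., 3) points from the camera frame into the shared frame.

    T_shared_cam is a hypothetical 4x4 homogeneous camera-to-shared-frame
    transform, as would come from extrinsic calibration.
    """
    R, t = T_shared_cam[:3, :3], T_shared_cam[:3, 3]
    return points @ R.T + t


def vertex_map(depth, K, T_shared_cam):
    """Back-project a depth image into an (H, W, 3) map of 3D points.

    depth: (H, W) metric depth along the camera z-axis (assumed available).
    K:     (3, 3) pinhole intrinsics.
    """
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Standard pinhole back-projection into the camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1)
    return to_shared_frame(pts_cam, T_shared_cam)


# Toy example: a 3x3 image at constant depth, camera offset 1 m along z.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
T_shared_cam = np.eye(4)
T_shared_cam[2, 3] = 1.0
vmap = vertex_map(np.full((3, 3), 2.0), K, T_shared_cam)

# An action target given in the camera frame can be mapped with the same
# transform, so observations and actions live in one coordinate frame.
action_shared = to_shared_frame(np.array([0.0, 0.0, 2.0]), T_shared_cam)
```

In this sketch, each pixel of `vmap` holds the 3D location it observes, so the map keeps the 2D image layout that a VLM backbone expects while carrying 3D content. Choosing the shared frame to be a canonical top-down frame would be one way to approach the BEV alignment the abstract describes, though the actual construction of BEV images is not specified here.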