Can Single-View Mesh Reconstruction Generalize to Robot Camera Rotation?

2026-06-22 · Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Robotics
AI summary

The authors studied how well current methods for reconstructing 3D object shapes and layouts from a single camera view handle robot camera rotations. They found that rotations cause errors in depth estimation, layout positioning, and physical plausibility (such as collision penetration), while the predicted object shapes themselves remain fairly stable. In their tests, a two-stage pipeline was more robust than one-stage feed-forward layout prediction, and a gravity-aware refinement step reduced the one-stage layout-orientation error by 47.1%. Overall, the authors conclude that current single-view 3D reconstruction models generalize poorly to real-world robot camera rotation and that explicit gravity cues improve reliability.

Single-view mesh reconstruction, Monocular depth estimation, Robot camera rotation, 3D layout prediction, Physical plausibility, ICP (Iterative Closest Point), Aria Digital Twin, Gravity cues, Feed-forward prediction, SAM3D pipeline
Authors
Yu Zhan, Guangcheng Chen, Hanjing Ye, Zhiqin Cheng, Zanjia Tong, Wenjun Xu, Hong Zhang
Abstract
Single-view mesh reconstruction predicts object meshes and spatial layouts from a single observation, making it attractive for fast robot spatial reasoning and real-to-sim digital twins. However, robot-mounted cameras naturally rotate during manipulation and navigation, while learned single-view reconstruction models often rely on view-dependent priors and may generalize poorly to out-of-distribution camera rotations. Such rotations can introduce 3D inconsistencies, incorrect layouts, and violations of physical constraints, but this failure mode remains under-evaluated. We introduce an evaluation protocol with controlled axis-wise roll, pitch, and yaw sweeps to trace errors in monocular depth estimation (MDE), canonical object meshes, camera-space layout, and physical plausibility within a representative SAM3D-style pipeline. On the Aria Digital Twin dataset and a real Franka wrist-camera sequence, camera rotations induce MDE distortion, layout drift, and collision penetration, while canonical mesh predictions remain relatively stable. A two-stage SAM3D+FoundationPose pipeline is more robust than one-stage feed-forward layout prediction, and our Gravity-Aware Refinement reduces one-stage pairwise ICP-based layout-orientation error by 47.1%. Our evaluation reveals that current single-view mesh reconstruction methods generalize poorly to robot camera rotation, and suggests that explicit gravity cues are important for reliable robotic single-view mesh reconstruction.
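
To make the idea of a controlled axis-wise sweep concrete, the following is a minimal illustrative sketch and not the authors' protocol. It builds rotations about a single camera axis over a symmetric angle range and measures the angular discrepancy between two orientations with the standard geodesic angle on SO(3). The helper names (`axis_rotation`, `axis_sweep`, `geodesic_angle_deg`), the sweep range, and the axis convention (roll about the optical axis z, pitch about x, yaw about y) are all assumptions made for illustration. The abstract does not specify how the pairwise ICP-based layout-orientation error is computed, so the geodesic angle here is only one plausible way to compare orientations.

```python
import math


def axis_rotation(axis, angle_deg):
    """Return a 3x3 rotation matrix (nested lists) about one coordinate axis."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    if axis == "x":  # assumed to correspond to pitch
        return [[1, 0, 0], [0, c, -s], [0, s, c]]
    if axis == "y":  # assumed to correspond to yaw
        return [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    if axis == "z":  # assumed to correspond to roll (optical axis)
        return [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    raise ValueError(f"unknown axis: {axis}")


def matmul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    """Transpose a 3x3 matrix (the inverse, for a rotation)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def geodesic_angle_deg(r_a, r_b):
    """Angle in degrees of the relative rotation r_a^T r_b."""
    rel = matmul(transpose(r_a), r_b)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def axis_sweep(axis, max_deg, step_deg):
    """List of (angle, rotation) pairs from -max_deg to +max_deg inclusive."""
    n_steps = int(round(2 * max_deg / step_deg))
    angles = [-max_deg + i * step_deg for i in range(n_steps + 1)]
    return [(angle, axis_rotation(axis, angle)) for angle in angles]


if __name__ == "__main__":
    reference = axis_rotation("z", 0)  # unrotated camera
    for angle, rotation in axis_sweep("z", 30, 15):
        print(angle, round(geodesic_angle_deg(reference, rotation), 3))
```

In an evaluation loop of this kind, each swept rotation would be applied to the camera, the reconstruction pipeline would be run on the resulting view, and the predicted layout orientation would be compared against the reference. Sweeping one axis at a time keeps the effects of roll, pitch, and yaw separable.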