GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

2026-06-02
Computer Vision and Pattern Recognition

AI summary

The authors present GARDEN, a method that turns multi-view RGB photos into 3D scenes ready for physics simulation. Earlier pipelines entangle objects with the background and restore interactivity by swapping reconstructed objects for retrieved CAD models, which is slow and loses scene-specific geometry. GARDEN instead uses gravity to fix the scene's orientation, recovers each rigid object as its own mesh with a 6-DoF pose, and removes duplicate object geometry from the background. The resulting scenes can be simulated directly while still looking realistic. In tests on simulated and real scenes, GARDEN improved object placement reliability, disentanglement quality, and efficiency over retrieval-based baselines.

multi-view RGB reconstruction, 3D environment, scene factorization, gravity alignment, 6-DoF placement, rigid object mesh, point classification, physics simulation, scene disentanglement, CAD asset retrieval
Authors
Jiahao Sun, Dingkun Wei, Zehong Shen, Hongyu Zhou, Yujun Shen, Liang Li
Abstract
Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. These representations are typically defined only up to an arbitrary global rotation and entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.
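
The gravity-alignment step can be illustrated with a minimal sketch. The abstract does not say how GARDEN estimates gravity or constructs its Gravity-View frame, so everything below is an assumption made for illustration: the gravity direction is taken as already known in the reconstruction's arbitrary frame, the target "down" axis is chosen as -Z, and the function names (`rotation_between`, `align_to_gravity`) are hypothetical. The rotation is built with the standard Rodrigues formula, not with the authors' method.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation_between(src, dst):
    """3x3 rotation matrix taking the direction of src onto the direction of dst."""
    a, b = normalize(src), normalize(dst)
    c = dot(a, b)  # cosine of the angle between the two directions
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if c > 1.0 - 1e-12:
        return eye  # already aligned
    if c < -1.0 + 1e-12:
        # Antiparallel: rotate by pi about any axis u perpendicular to a,
        # which gives R = 2 u u^T - I.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        u = normalize(cross(a, helper))
        return [[2.0 * u[i] * u[j] - eye[i][j] for j in range(3)]
                for i in range(3)]
    # Rodrigues formula: R = I + K + K^2 / (1 + c), with K = [a x b]_x.
    v = cross(a, b)
    K = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    K2 = matmul(K, K)
    return [[eye[i][j] + K[i][j] + K2[i][j] / (1.0 + c) for j in range(3)]
            for i in range(3)]

def align_to_gravity(points, gravity, down=(0.0, 0.0, -1.0)):
    """Rotate points so the (assumed known) gravity direction maps to `down`.

    Returns the rotation matrix and the rotated points.
    """
    R = rotation_between(gravity, down)
    return R, [[dot(row, p) for row in R] for p in points]

# Example: a reconstruction whose gravity happens to point along -Y.
R, aligned = align_to_gravity([[0.0, -2.0, 0.0]], [0.0, -9.81, 0.0])
```

After the call, the example point lies approximately 2 units along -Z. Aligning a single direction fixes only two of the three rotational degrees of freedom, so this sketch leaves the yaw about the vertical axis undetermined; how GARDEN resolves that remaining freedom is not described in the abstract.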