GeoEdit: Geometry-Aware Object Editing via Dual-Branch Denoising

2026-06-29 • Computer Vision and Pattern Recognition


AI summary

The authors address the problem of moving or rotating objects within a single photo while keeping realistic 3D effects and backgrounds intact, which current methods struggle with. They propose GeoEdit, a new approach that separates the object and scene in 3D, aligns them, and uses a special denoising step to keep the object's shape correct while allowing the background to be naturally regenerated. This method avoids common issues like ghost images and blurring by carefully controlling how noise is handled during editing. They also provide a new benchmark called GeoEditBench to test these edits with accurate measures of object movement and camera changes.

diffusion models, 3D manipulation, latent space, denoising, self-attention, structural depth map, pose-aware evaluation, image synthesis, background inpainting, video diffusion

Authors

Yi He, Jiangming Wang, Xinyu Wang, Mark Fong, Songchun Zhang, Yuxuan Xue, Hai-Tao Zheng, Yue Ma

Abstract

Precisely manipulating objects in a single photograph (translation, rotation, scaling) while obeying 3D physical constraints remains unsolved for diffusion-based editors. Current 2D methods lack spatial awareness and produce perspective violations. Forcing structural proxies into the latent space also disrupts variance homogeneity, and the resulting self-attention leakage leads to ghosting and background blur. The core difficulty is asymmetric: the relocated object must follow a rigid geometry, yet the uncovered background needs freedom to synthesize plausible content. We present GeoEdit, a training-free Lift-Manipulate-Render-Denoise pipeline that satisfies both constraints. We decouple scene and object in 3D, align them through point correspondence, and render a geometry-aligned proxy with a structural depth map. A Dual-Branch Denoising stage then refines this proxy: a video diffusion backbone preserves object identity, while 3D constraints are injected into the foreground within a narrow denoising window at matching noise variance (variance-homogeneous injection). The background denoises freely. Because the injected signal matches the native latent statistics, self-attention stays undisturbed. We also introduce GeoEditBench, a benchmark covering object translation, object rotation, and camera movement, with pose-aware evaluation metrics. Experiments confirm consistent gains in geometric accuracy, identity fidelity, and background quality. Our code is available at https://github.com/Heey731/GeoEdit.
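The idea of injecting a signal "at matching noise variance" can be illustrated with a generic masked-blending step. The sketch below is not taken from the GeoEdit code: it assumes a DDPM-style linear beta schedule, and the names `alpha_bar_schedule`, `noise_to_level`, `inject_foreground`, and the `window` bounds are hypothetical. It noises a clean proxy latent to the same level as the current latent, then overwrites only the foreground region, and only while the timestep lies inside a chosen window.

```python
import numpy as np

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_to_level(x0, alpha_bar_t, rng):
    """Forward-diffuse a clean latent x0 to the noise level set by alpha_bar_t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def inject_foreground(latent_t, proxy_latent, fg_mask, t, alpha_bars, window, rng):
    """Blend the noised proxy into the foreground when t is inside the window.

    latent_t     : current noisy latent at timestep t
    proxy_latent : clean latent of the rendered proxy (hypothetical input)
    fg_mask      : 1 inside the object region, 0 elsewhere (broadcastable)
    window       : (low, high) inclusive timestep bounds for injection
    """
    low, high = window
    if not (low <= t <= high):
        # Outside the window both regions are left to the denoiser.
        return latent_t
    # Noise the proxy to the same level as latent_t before blending.
    proxy_t = noise_to_level(proxy_latent, alpha_bars[t], rng)
    # Foreground takes the noised proxy; background keeps the current latent.
    return fg_mask * proxy_t + (1.0 - fg_mask) * latent_t
```

In a full sampler, a call like `inject_foreground(latent, proxy, mask, t, alpha_bars, (400, 600), rng)` would sit between denoiser steps. The window bounds here are placeholders, not values reported by the authors.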
