Thinking in Boxes: 3D Editing in Real Images Made Easy

2026-06-18 • Computer Vision and Pattern Recognition

AI summary

The authors developed a way to edit images using 3D boxes that specify exactly how an object should move, rotate, or change size, which makes edits more precise than text or 2D hints allow. Each box face is color-coded to show orientation, which helps preserve the object's appearance and the scene's details even under large changes. They also add a depth-aligned floor plane that serves as a shared reference frame for the scene. Trained first on synthetic scenes and then on a small set of real videos, the method works on ordinary photographs and outperforms existing techniques on large 3D edits.

3D bounding boxes, image editing, spatial transformation, depth alignment, scene geometry, Objectron dataset, image generation, rotation and translation, 3D orientation, synthetic training data

Authors

Pradhaan S Bhat, Naveen Chandra R, Rishubh Parihar, Vaibhav Vavilala, R. Venkatesh Babu, D. A. Forsyth, Anand Bhattad

Abstract

Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing, particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This "thinking in boxes" interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages, on synthetic multi-object scenes and on a small set of real-world videos from Objectron, the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.
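
To illustrate why a pair of input and output boxes makes the edit well-posed, the sketch below shows how two oriented boxes determine a single affine map (translation, rotation, and per-axis scaling). This is not the authors' implementation: the abstract describes conditioning an image generator on the boxes, not computing this matrix. The `Box3D` parametrization (center, rotation matrix, size) and the names `edit_transform`, `apply_transform`, and `rot_z` are assumptions made for illustration only.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Box3D:
    """Hypothetical oriented 3D box; this parametrization is an assumption."""
    center: np.ndarray    # (3,) box center in world coordinates
    rotation: np.ndarray  # (3, 3) rotation matrix whose columns are the box axes
    size: np.ndarray      # (3,) full extents along the box axes

    def corners(self) -> np.ndarray:
        """Return the 8 corners in world coordinates, in a fixed order."""
        signs = np.array(
            [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
            dtype=float,
        )
        local = 0.5 * signs * self.size          # corners in the box's own frame
        return local @ self.rotation.T + self.center


def edit_transform(src: Box3D, dst: Box3D) -> np.ndarray:
    """4x4 affine matrix that maps points attached to `src` onto `dst`."""
    scale = np.diag(dst.size / src.size)         # per-axis scaling in the box frame
    linear = dst.rotation @ scale @ src.rotation.T
    transform = np.eye(4)
    transform[:3, :3] = linear
    transform[:3, 3] = dst.center - linear @ src.center
    return transform


def apply_transform(transform: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Apply a 4x4 affine matrix to an (N, 3) array of points."""
    return points @ transform[:3, :3].T + transform[:3, 3]


def rot_z(theta: float) -> np.ndarray:
    """Rotation by `theta` radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Example edit: move the object, turn it 90 degrees, and double its width.
src = Box3D(np.zeros(3), np.eye(3), np.array([1.0, 2.0, 3.0]))
dst = Box3D(np.array([1.0, 0.0, 2.0]), rot_z(np.pi / 2), np.array([2.0, 2.0, 3.0]))
T = edit_transform(src, dst)
moved_corners = apply_transform(T, src.corners())  # coincides with dst.corners()
```

Under these assumptions, the two boxes leave no free parameters: every corner of the input box has exactly one destination, which is the sense in which a text prompt such as "rotate the chair a bit" is ambiguous and a box pair is not.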
