BoxCtrl: 3D-Aware Visual Prompting for Geometric Image Editing

2026-06-22 • Computer Vision and Pattern Recognition


AI summary

The authors present BoxCtrl, a new way to guide image editing by using colored 3D boxes projected onto 2D images. These colored boxes help the model understand how to precisely move, rotate, or resize parts of an image in 3D space, separate from changing appearance. They trained their model first on a large synthetic dataset and then improved it using real-world images with reinforcement learning. Their approach improves accuracy in geometric edits while keeping images realistic.

3D bounding box, visual prompting, geometric image editing, supervised fine-tuning, reinforcement learning, synthetic dataset, multimodal models, image translation, image rotation, image scaling

Authors

Feifei Wang, Shiyuan Yang, Xiaoyu Li, Jing Liao

Abstract

As instruction-based editing models and multimodal large language models advance, diverse image editing tasks have become feasible. However, achieving precise and consistent geometric image editing, such as translating, scaling, and rotating in 3D space, remains a major challenge. In this work, we introduce BoxCtrl, a 3D-aware visual prompting framework. Unlike text-only or coarse 2D-guided approaches, our method introduces informative RGB 3D bounding boxes projected onto 2D images as visual prompts. The three orthogonal faces of each box are painted with distinct RGB colors, simultaneously encoding position, size, and orientation to provide a compact, intuitive in-context visual example. The key to BoxCtrl's success lies in these well-designed bounding boxes, which decouple geometric control from appearance control. This enables the model to learn consistent correspondences between faces of the same color in the latent space, leading to a precise understanding of geometric intentions and accurate editing results. We introduce a two-stage training paradigm: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). To address paired data scarcity, we construct a large-scale synthetic dataset for SFT, equipping the model with fundamental editing capabilities. To bridge the synthetic-to-real domain gap, we incorporate an online RL stage leveraging unpaired real-world data. Guided by a reward function evaluating geometric accuracy and visual fidelity, our SFT-RL strategy significantly enhances geometric precision while maintaining photorealistic quality. Extensive experiments demonstrate that BoxCtrl achieves state-of-the-art performance across translation, rotation, scaling, and composite editing tasks.
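
To make the visual prompt described above more concrete, the following is a minimal sketch of how a 3D box could be projected onto an image plane and split into three color-tagged faces. It is an illustration under stated assumptions, not the authors' implementation: the pinhole camera model, the yaw-only rotation, the specific red/green/blue assignment, and the names `FACE_COLORS`, `box_corners`, `project`, and `face_polygons` are all hypothetical.

```python
import numpy as np

# Hypothetical color assignment: one RGB color per box axis.
FACE_COLORS = {"x": (255, 0, 0), "y": (0, 255, 0), "z": (0, 0, 255)}


def box_corners(center, size, yaw):
    """Return the 8 corners (8x3) of a box rotated by `yaw` radians about the y axis."""
    # Corner index = 4*ix + 2*iy + iz, where each i is 0 (negative side) or 1 (positive side).
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    local = 0.5 * signs * np.asarray(size, dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return local @ rot.T + np.asarray(center, dtype=float)


def project(points, focal, cx, cy):
    """Pinhole projection of Nx3 camera-space points (z > 0) to Nx2 pixel coordinates."""
    z = points[:, 2:3]
    return focal * points[:, :2] / z + np.array([cx, cy])


def face_polygons(corners_2d):
    """Return three (color, 4x2 polygon) pairs, one for the positive face of each axis."""
    # Indices are listed in perimeter order so each polygon can be filled directly.
    faces = {"x": [4, 5, 7, 6], "y": [2, 3, 7, 6], "z": [1, 3, 7, 5]}
    return [(FACE_COLORS[axis], corners_2d[idx]) for axis, idx in faces.items()]


# Example: a 2x2x2 box centered 5 units in front of the camera, no rotation.
corners = box_corners(center=(0, 0, 5), size=(2, 2, 2), yaw=0.0)
pixels = project(corners, focal=500.0, cx=320.0, cy=240.0)
polygons = face_polygons(pixels)
```

In this sketch, position, size, and orientation all affect where the three colored polygons land in the image, which is the sense in which a single box can encode all three quantities. Which faces are actually visible from a given viewpoint, and how the polygons are rasterized onto the image, are left out.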
